Oracle ASM Filter Driver: First Impressions

This is a very quick post, because I’m about to log off and take an extended summer holiday (or vacation as my crazy American friends call it… but then they call football  “soccer” too). Before I go, I wanted to document my initial findings with the new ASM Filter Driver feature introduced in this week’s 12.1.o.2 patchset.

Currently a Linux-only feature, the ASM Filter Driver (or AFD) is a replacement for ASMLib and is described by Oracle as follows:

Oracle ASM Filter Driver (Oracle ASMFD) is a kernel module that resides in the I/O path of the Oracle ASM disks. Oracle ASM uses the filter driver to validate write I/O requests to Oracle ASM disks.

The Oracle ASMFD simplifies the configuration and management of disk devices by eliminating the need to rebind disk devices used with Oracle ASM each time the system is restarted.

The Oracle ASM Filter Driver rejects any I/O requests that are invalid. This action eliminates accidental overwrites of Oracle ASM disks that would cause corruption in the disks and files within the disk group. For example, the Oracle ASM Filter Driver filters out all non-Oracle I/Os which could cause accidental overwrites.

Interesting, eh? So let’s find out how that works.


I found this a real pain as you need to have installed before the AFD is available to label your disks, yet the default OUI mode wants to create an ASM diskgroup… and you cannot do that without any labelled disks.

The only solution I could come up with was to perform a software-only install, which in itself is a pain. I’ll skip the numerous screenshots of that part though and just skip straight to the bit where I have Grid Infrastructure installed.

I’m following these instructions because I am using a single-instance Oracle Restart system rather than a true cluster.

First of all we need to do this:

[oracle@server3 ~]$ $ORACLE_HOME/bin/asmcmd dsset 'AFD:*'

[oracle@server3 ~]$ $ORACLE_HOME/bin/asmcmd dsget
[oracle@server3 ~]$ srvctl config asm
ASM home: 
Password file:
ASM listener: LISTENER
Spfile: /u01/app/oracle/admin/+ASM/pfile/spfile+ASM.ora
ASM diskgroup discovery string: AFD:*

Then we need to stop HAS and run the AFD_CONFIGURE command:

[root@server3 ~]# $ORACLE_HOME/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'server3'
CRS-2673: Attempting to stop 'ora.asm' on 'server3'
CRS-2673: Attempting to stop 'ora.evmd' on 'server3'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'server3'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'server3' succeeded
CRS-2677: Stop of 'ora.evmd' on 'server3' succeeded
CRS-2677: Stop of 'ora.asm' on 'server3' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'server3'
CRS-2677: Stop of 'ora.cssd' on 'server3' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'server3' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@server3 ~]# $ORACLE_HOME/bin/asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
ASMCMD-9524: AFD configuration failed 'ERROR: OHASD start failed'

Er… that’s not really what I had in mind. But hey, let’s carry on regardless:

[root@server3 oracleafd]# $ORACLE_HOME/bin/asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'DEFAULT' on host 'server3.local'

[root@server3 oracleafd]# $ORACLE_HOME/bin/crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Ok it seems to be working. I wonder what it’s done?


The first thing I notice is some Oracle kernel modules have been loaded:

[root@server3 ~]# lsmod | grep ora
oracleafd             208499  1
oracleacfs           3307969  0
oracleadvm            506254  0
oracleoks             505749  2 oracleacfs,oracleadvm

I also see that, just like ASMLib, a driver has been plonked into the /opt/oracle/extapi directory:

[root@server3 1]# find /opt/oracle/extapi -ls
2752765    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi
2752766    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64
2753508    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm
2756532    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl
2756562    4 drwxr-xr-x   2 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl/1
2756578  268 -rwxr-xr-x   1 oracle   dba        272513 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl/1/

And again, just like ASMLib, there is a new directory under /dev called /dev/oracleafd (whereas for ASMLib it’s called /dev/oracleasm):

[root@server3 ~]# ls -la /dev/oracleafd/
total 0
drwxrwx---  3 oracle dba      80 Jul 25 15:15 .
drwxr-xr-x 21 root   root  15820 Jul 25 15:15 ..
brwxrwx---  1 oracle dba  249, 0 Jul 25 15:15 admin
drwxrwx---  2 oracle dba      40 Jul 25 15:15 disks

The disks directory is currently empty. Maybe I should create some AFD devices and see what happens?


So let’s look at my Violin devices and see if I can label them:

root@server3 mapper]# ls -l /dev/mapper
total 0
crw-rw---- 1 root root 10, 236 Jul 11 16:52 control
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data1 -> ../dm-3
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data2 -> ../dm-4
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data3 -> ../dm-5
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data4 -> ../dm-6
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data5 -> ../dm-7
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data6 -> ../dm-8
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data7 -> ../dm-9
lrwxrwxrwx 1 root root       8 Jul 25 15:49 data8 -> ../dm-10
lrwxrwxrwx 1 root root       7 Jul 11 16:53 VolGroup-lv_home -> ../dm-2
lrwxrwxrwx 1 root root       7 Jul 11 16:53 VolGroup-lv_root -> ../dm-0
lrwxrwxrwx 1 root root       7 Jul 11 16:52 VolGroup-lv_swap -> ../dm-1

The documentation appears to be incorrect here, when it says to use the command $ORACLE_HOME/bin/afd_label. It’s actually $ORACLE_HOME/bin/asmcmd with the first parameter afd_label. I’m going to label the devices called /dev/mapper/data*:

[root@server3 mapper]# for lun in 1 2 3 4 5 6 7 8; do
> asmcmd afd_label DATA$lun /dev/mapper/data$lun
> done
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.

root@server3 mapper]# asmcmd afd_lsdsk
Connected to an idle instance.
Label                     Filtering   Path
DATA1                       ENABLED   /dev/mapper/data1
DATA2                       ENABLED   /dev/mapper/data2
DATA3                       ENABLED   /dev/mapper/data3
DATA4                       ENABLED   /dev/mapper/data4
DATA5                       ENABLED   /dev/mapper/data5
DATA6                       ENABLED   /dev/mapper/data6
DATA7                       ENABLED   /dev/mapper/data7
DATA8                       ENABLED   /dev/mapper/data8

That seemed to work ok. So what’s going on in the /dev/oracleafd/disks directory now?

[root@server3 ~]# ls -l /dev/oracleafd/disks/
total 32
-rw-r--r-- 1 root root 26 Jul 25 15:52 DATA1
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA2
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA3
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA4
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA5
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA6
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA7
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA8

There they are, just like with ASMLib. But look at the permissions, they are all owned by root with read-only privs for other users. In an ASMLib environment these devices are owned by oracle:dba, which means non-Oracle processes can write to them and corrupt them in some situations. Is this how Oracle claims the AFD protects devices?

I haven’t had time to investigate further but I assume that the database will access the devices via this mysterious block device:

[oracle@server3 oracleafd]$ ls -l /dev/oracleafd/admin
brwxrwx--- 1 oracle dba 249, 0 Jul 25 16:25 /dev/oracleafd/admin

It will be interesting to find out.


Of course, if you are logged in as root you aren’t going to be protected from any crazy behaviour:

[root@server3 ~]# cd /dev/oracleafd/disks
[root@server3 disks]# ls -l
total 496
-rw-r--r-- 1 root root 475877 Jul 25 16:40 DATA1
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA2
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA3
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA4
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA5
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA6
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA7
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA8
[root@server3 disks]# od -c -N 256 DATA8
0000000   /   d   e   v   /   m   a   p   p   e   r   /   d   a   t   a
0000020   8  \n
[root@server3 disks]# dmesg >> DATA8
[root@server3 disks]# od -c -N 256 DATA8
0000000   /   d   e   v   /   m   a   p   p   e   r   /   d   a   t   a
0000020   8   \n   z   r   d   b   t   e   2  l   I   n   i   t   i   a
0000040   l   i   z   i   n   g       c   g   r   o   u   p       s   u
0000060   b   s   y   s       c   p   u   s   e   t  \n   I   n   i   t
0000100   i   a   l   i   z   i   n   g       c   g   r   o   u   p
0000120   s   u   b   s   y   s       c   p   u  \n   L   i   n   u   x
0000140       v   e   r   s   i   o   n       3   .   8   .   1   3   -
0000160   2   6   .   2   .   3   .   e   l   6   u   e   k   .   x   8
0000200   6   _   6   4       (   m   o   c   k   b   u   i   l   d   @
0000220   c   a   -   b   u   i   l   d   4   4   .   u   s   .   o   r
0000240   a   c   l   e   .   c   o   m   )       (   g   c   c       v
0000260   e   r   s   i   o   n       4   .   4   .   7       2   0   1
0000300   2   0   3   1   3       (   R   e   d       H   a   t       4
0000320   .   4   .   7   -   3   )       (   G   C   C   )       )
0000340   #   2       S   M   P       W   e   d       A   p   r       1
0000360   6       0   2   :   5   1   :   1   0       P   D   T       2

Proof, if ever you need it, that root access is still the fastest and easiest route to total disaster…

New AWR Report Format: Oracle and 12c


This is a post about Oracle Automatic Workload Repository (AWR) Reports. If you are an Oracle professional you doubtless know what these are – and if you have to perform any sort of performance tuning as part of your day job it’s likely you spend a lot of time immersed in them. Goodness knows I do – a few weeks ago I had to analyse 2,304 of them in one (long) day. But for anyone else, they are (huge) reports containing all sorts of information about activities that happened between two points of time on an Oracle instance. If that doesn’t excite you now, please move along – there is nothing further for you here.

truthAWR Reports have been with us since the introduction of the Automatic Workload Repository back in 10g and can be considered a replacement for the venerable Statspack tool. Through each major incremental release the amount of information contained in an AWR Report has grown; for instance, the 10g reports didn’t even show the type of operating system, but 11g reports do. More information is of course a good thing, but sometimes it feels like there is so much data now it’s hard to find the truth hidden among all the distractions.

I recently commented in another post about the change in AWR report format introduced in This came as a surprise to me because I cannot previously remember report formatting changing mid-release, especially given the scale of the change. Not only that, but I’m sure I’ve seen reports from in the new format too (implying it was added via a patch set update), although I can’t find the evidence now so am forced to concede I may have imagined it. The same new format also continues into incidentally.

The New Features document doesn’t mention anything about a new report format. I can’t find anything about it on My Oracle Support (but then I can never find anything about anything I’m looking for on MOS these days). So I’m taking it upon myself to document the new format and the changes introduced – as well as point out a nasty little issue that’s caught me out a couple of times already.

Comparing Old and New Formats

From what I can tell, all of the major changes except one have taken place in the Report Summary section at the start of the AWR report. Oracle appears to have re-ordered the subsections and added a couple of new ones:

  • Wait Classes by Total Wait Time
  • IO Profile

The new Wait Classes section is interesting because there is already a section called Foreground Wait Class down in the Wait Event Statistics section of the Main Report, but the additional section appears to include background waits as well. The IO Profile section is especially useful for people like me who work with storage – and I’ve already blogged about it here.

In addition, the long-serving Top 5 Timed Foreground Events section has been renamed and extended to become Top 10 Foreground Events by Total Wait Time.

Here are the changes in tabular format:

Old Format

New Format

Cache Sizes

Load Profile

Instance Efficiency Percentages

Shared Pool Statistics

Top 5 Timed Foreground Events

Host CPU

Instance CPU

Memory Statistics



Time Model Statistics

Load Profile

Instance Efficiency Percentages

Top 10 Foreground Events by Total Wait Time

Wait Classes by Total Wait Time

Host CPU

Instance CPU

IO Profile

Memory Statistics

Cache Sizes

Shared Pool Statistics

Time Model Statistics

I also said there was one further change outside of the Report Summary section. It’s the long-standing Instance Activity Stats section, which has now been divided into two:

Old Format

New Format

Instance Activity Stats


Key Instance Activity Stats

Other Instance Activity Stats

I don’t really understand the point of that change, nor why a select few statistics are deemed to be more “key” than others. But hey, that’s the mystery of Oracle, right?

Tablespace / Filesystem IO Stats

Another, more minor change, is the addition of some cryptic-looking “1-bk” columns to the two sections Tablespace IO Stats and File IO Stats:

          Av       Av     Av      1-bk  Av 1-bk          Writes   Buffer  Av Buf
  Reads   Rds/s  Rd(ms) Blks/Rd   Rds/s  Rd(ms)  Writes   avg/s    Waits  Wt(ms)
------- ------- ------- ------- ------- ------- ------- ------- -------- -------
8.4E+05      29     0.7     1.0 6.3E+06    29.2       1     220    1,054     4.2
 95,054       3     0.8     1.0  11,893     3.3       1       0        1    60.0
    745       0     0.0     1.0   1,055     0.0       0       0       13     0.8
    715       0     0.0     1.0     715     0.0       0       0        0     0.0
      0       0     0.0     N/A       7     0.0       0       0        0     0.0

I have to confess it took me a while to figure out what they meant – in the end I had to consult the documentation for the view DBA_HIST_FILESTATXS:

Column Datatype NULL Description
SINGLEBLKRDS NUMBER Number of single block reads
SINGLEBLKRDTIM NUMBER Cumulative single block read time (in hundredths of a second)

Aha! So the AWR report is now giving us the number of single block reads (SINGLEBLKRDS) and the average read time for them (SINGLEBLKRDTIM / SINGLEBLKRDS). That’s actually pretty useful information for testing storage, since single block reads tell no lies. [If you want to know what I mean by that, visit Frits Hoogland's blog and download his white paper on multiblock reads...]

Top 10: Don’t Believe The Stats

One thing you might want to be wary about is the new Top 10 section… Here are the first two lines from mine after running a SLOB test:

Top 10 Foreground Events by Total Wait Time
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
db file sequential read        3.379077E+09 2527       1   91.4 User I/O
DB CPU                                      318.           11.5

Now, normally when I run SLOB and inspect the post-run awr.txt file I work out the average wait time for db file sequential read so I can work out the latency. Since AWR reports do not have enough decimal places for the sort of storage I use (the wait shows simply as 0 or 1), I have to divide the total wait time by the number of waits. But in the report above, the total wait time of 2,527 divided by 3,379,077,000 waits gives me an average of 0.000747 microseconds. Huh? Looking back at the numbers above it’s clear that the Total Time column has been truncated and some of the digits are missing. That’s bad news for me, as I regularly use scripts to strip this information out and parse it.

This is pretty poor in my opinion, because there is no warning and the number is just wrong. I assume this is an edge case because the number of waits contains so many digits, but for extended SLOB tests that’s not unlikely. Back in the good old Top 5 days it looked like this, which worked fine:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                              Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read         119,835,425      50,938      0   84.0 User I/O
latch: cache buffers lru chain   20,051,266       6,221      0   10.3 Other

Unfortunately, in the new and above Top 10 report, the Total Time column simply isn’t wide enough. Instead, I have to scan down to the Foreground Wait Events section to get my true data:

                                        %Time Total Wait    wait    Waits   % DB
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
db file sequential read    3.379077E+09     0  2,527,552       1     11.3   91.4

This is something worth looking out for, especially if you also use scripts to fetch data from AWR files. Of course, the HTML reports don’t suffer from this problem, which just makes it even more annoying as I can’t parse HTML reports automatically (and thus I despise them immensely). AWR Reports

One final thing to mention is the AWR report format of (which was just released at the time of writing). There aren’t many changes from but just a few extra lines have crept in, which I’ll highlight here. In the main, they are related to the new In Memory Database option.

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               0.3               4.8      0.00      0.02
              DB CPU(s):               0.1               1.2      0.00      0.00
      Background CPU(s):               0.0               0.1      0.00      0.00
      Redo size (bytes):          50,171.6         971,322.7
  Logical read (blocks):             558.6          10,814.3
          Block changes:             152.2           2,947.0
 Physical read (blocks):              15.1             292.0
Physical write (blocks):               0.2               4.7
       Read IO requests:              15.1             292.0
      Write IO requests:               0.2               3.3
           Read IO (MB):               0.1               2.3
          Write IO (MB):               0.0               0.0
           IM scan rows:               0.0               0.0
Session Logical Read IM:
             User calls:              16.1             312.0
           Parses (SQL):              34.0             658.0
      Hard parses (SQL):               4.6              88.0
     SQL Work Area (MB):               0.9              17.2
                 Logons:               0.1               1.7
         Executes (SQL):              95.4           1,846.0
              Rollbacks:               0.0               0.0
           Transactions:               0.1

Instance Efficiency Percentages (Target 100%)
            Buffer Nowait %:   97.55       Redo NoWait %:  100.00
            Buffer  Hit   %:   97.30    In-memory Sort %:  100.00
            Library Hit   %:   81.75        Soft Parse %:   86.63
         Execute to Parse %:   64.36         Latch Hit %:   96.54
Parse CPU to Parse Elapsd %:   19.45     % Non-Parse CPU:   31.02
          Flash Cache Hit %:    0.00


Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:       960M       960M  Std Block Size:         8K
           Shared Pool Size:     4,096M     4,096M      Log Buffer:   139,980K
             In-Memory Area:         0M         0M

New section: Oracle SLOB Testing

slob ghost

For some time now I have preferred Oracle SLOB as my tool for generating I/O workloads using Oracle databases. I’ve previously blogged some information on how to use SLOB for PIO testing, as well as shared some scripts for running tests and extracting results. I’ve now added a whole new landing page for SLOB and a complete guide to running sustained throughput testing.

Why would you want to run sustained throughput tests? Well, one great reason is that not all storage platforms can cope with sustained levels of write workload. Flash arrays, or any storage array which contains flash, have a tendency to suffer from garbage collection issues when sustained write workloads hit them hard enough.

Find out more by following the links below:

Understanding Flash: SLC, MLC and TLC


The last post in this series discussed the layout of NAND flash memory chips and the way in which cells can be read and written (programmed) at the page level but have to be erased at the (larger) block level. I finished by mentioning that erase operations take substantially longer than read or program operations… but just how big is the difference?

Knowing the answer to this involves first understanding the different types of flash memory available: SLC, MLC and TLC.

Electrons In A Bucket?

Whenever I’ve seen anyone attempt to explain this in the past, they have almost always resorted to drawing a picture of electrons or charge filling up a bucket. This is a terrible analogy and causes anyone with a deep understanding of physics to cringe in horror. Luckily, I don’t have a deep understanding of physics, so I’m going to go right along with the herd and get my bucket out.

A NAND flash cell, i.e. the thing that stores a value of one or zero, is actually a floating gate transistor. Programming the cell means putting electrons into the floating gate, causing it to become (negatively) charged. Erasing the cell means removing the electrons from the floating gate, draining the charge. The amount of charge contained in the floating gate can be varied from zero up to a maximum value – this is an analogue system so there is no simple FULL or EMPTY state.

Because of this, the amount of charge can be measured and thresholds assigned to indicate a binary value. What does that mean? It means that, in the case of Single Level Cell (SLC) flash anything below 50% of charge can be considered to be a bit with a value of 1, while anything above 50% can be considered a bit with a value of 0.

But if i decided to be a bit more careful in the way I fill or empty my bucket of charge (sorry), I could perhaps define more thresholds and thus hold two bits of data instead of one. I could say that below 25% is 11, from 25% to 50% is 10, from 50% to 75% is 01 and above 75% is 00. Now I can keep twice as much data in the same bucket. This is in fact Multi Level Cell (MLC). And as the picture shows, if I was really careful in the way I treated my bucket, I could even keep three bits of data in there, which is what happens in Three Level Cell (TLC):


The thing is, imagine this was a bucket of water (comparing electrons to water is probably the last straw for anyone reading this who has a degree in physics, so I bid you farewell at this point). If you were to fill up your bucket using the SLC method, you could be pretty slap-dash about it. I mean it’s pretty obvious when the bucket is more than half full or empty. But if you were using a more fine-grained method such as MLC or TLC you would need to fill / empty very carefully and take exact measurements, which means the act of filling (programming) would be a lot slower.

To really stretch this analogy to breaking point, imagine that every time you fill your bucket it gets slightly damaged, causing it to leak. In the SLC world, even a number of small leaks would not be a big deal. But in the MLC or (especially) the TLC world, those leaks mean it would quickly become impossible to keep using your bucket, because the tolerance between different bit values is so small. For similar reasons, NAND flash endurance is greatly influenced by the type of cell used. Storing more bits per cell means a lower tolerance for errors, which in turn means that higher error rates are experienced and endurance (the number of program/erase cycles that can be sustained) is lower.

Timing and Wear

Enough of the analogies, let’s look at some proper data. The chart below uses sample figures from AnandTech:


You can see that as the number of bits per cell increases, so does the time taken to perform read, program (i.e. write) and erase operations. Erases in particular are especially slow, with values measured in milliseconds instead of microseconds. Given that erases also affect larger areas of flash than reads and programs, you can start to see why the management of erase operations on flash is critical to performance.

Also apparent on the chart above is the massive difference in the number of program / erase cycles between the different flash types: for SLC we’re talking about orders of magnitude in difference. But of course SLC can only store one bit per cell, which means it’s much more expensive from a capacity perspective than MLC. TLC, meanwhile, offers the potential for great value for money, but none of the performance requirements you would need for tier one storage (although it may well have a place in the world of backups). It is for this reason that MLC is the most commonly used type of flash in enterprise storage systems. (By the way I’m so utterly disinterested in the phenomena of “eMLC” that I’m not going to cover it here, but you can read this and this if you want to know more on the subject…)

Warning: Know Your Flash

cautionOne final thing. When you buy an SSD, a PCIe flash card or, in the case of Violin Memory, an all-flash array you tend to choose between SLC and MLC. As a very rough rule of thumb you can consider MLC to be twice the capacity for half the performance of SLC, although this in fact varies depending on many factors. However there are some all flash array vendors who use both SLC and MLC in a sort of tiered approach. That’s fine -and if you are buying a flash array I’m sure you’ll take the time to understand how it works.

But here’s the thing. At least one of these vendors insists on describing the SLC layer as “NVRAM” to differentiate from the MLC layer which it simply describes as using flash SSDs. The truth is that the NVRAM is also just a bunch of flash SSDs, except they are SLC instead of MLC. I’m not in favour of using educational posts to criticise competitors, but in the interest of bring clarity to this subject I will say this: I think this is a marketing exercise which deliberately adds confusion to try and make the design sound more exciting. “Ooooh, NVRAM that sounds like something I ought to have in my flash array…” – or am I being too cynical?

Understanding Flash: Blocks, Pages and Program / Erases

In the last post on this subject I described the invention of NAND flash and the way in which erase operations affect larger areas than write operations. Let’s have a look at this in more detail and see what actually happens. First of all, we need to know our way around the different entities on a flash chip: the die, the plane, the block and the page:

NAND Flash Die Layout (image courtesy of AnandTech)

NAND Flash Die Layout (image courtesy of AnandTech)

Note: What follows is a high-level description of the generic behaviour of flash. There are thousands of different NAND chips available, each potentially with slightly different instruction sets, block/page sizes, performance characteristics etc.

  • The die is the memory chip, i.e. the black rectangle with little electrical connectors sticking out of it. If you look at an SSD, a flash card or the internals of a flash array you will see many flash dies, each of which is produced by one of the big flash manufacturers: Toshiba, Samsung, Micron, Intel, SanDisk, SK Hynix. These are the only companies with the multi-billion dollar fabrication plants necessary to make NAND flash.
  • Each die contains one or more planes (usually two). Identical, concurrent operations can take place on each plane, although with some restrictions.
  • Each plane contains a number of blocks, which are the smallest unit that can be erased. Remember that, it’s really important.
  • Each block contains a number of pages, which are the smallest unit that can be programmed (i.e. written to).

The important bit here is that program operations (i.e. writes) take place to a page, which might typically be 8-16KB in size, while erase operations take place to a block, which might be 4-8MB in size. Since a block needs to be erased before it can be programmed again (*sort of, I’m generalising to make this easier), all of the pages in a block need to be candidates for erasure before this can happen.

Program / Erase Cycles

When your flash device arrives fresh from the vendor, all of the pages are “empty”. The first thing you will want to do, I’m sure, is write some data to them – which in the world of memory chips we call a program operation. As discussed, these program operations take place at the page level. You can then read your fresh data back out again with read operations, which also take place at the page level. [Having said that, the instruction to read a page places the data from that page in a memory register, so your reading process can in fact then selectively access subsets of the page if it desires - but maybe that's going into too much detail...]

NAND-flash-blocks-pages-program-erasesWhere it gets interesting is if you want to update the data you just wrote. There is no update operation for flash, no undo or rewind mechanism for changing what is currently in place, just the erase operation. It’s a little bit like an etch-a-sketch, in that you can continue to turn the dials and make white sections of screen go black, but you cannot turn black sections of screen to white again with erasing the entire screen. Etch-a-SketchAn erase operation on a flash chip clears the data from all pages in the block, so if some of the other pages contain active data (stuff you want to keep) you either have to copy it elsewhere first or hold off from doing the erase.

In fact, that second option (don’t erase just yet) makes the most sense, because the blocks on a flash chip can only tolerate a limited number of program and erase options (known as the program erase cycle or PE cycle because for obvious reasons they follow each other in turn). If you were to erase the block every time you wanted to change the contents of a page, your flash would wear out very quickly.

So a far better alternative is to simply mark the old page (containing the unchanged data) as INVALID and then write the new, changed data to an empty page. All that is required now is a mechanism for pointing any subsequent access operations to the new page and a way of tracking invalid pages so that, at some point, they can be “recycled”.


Updating a page in NAND flash. Note that the new page location does not need to be within the same block, or even the same flash die. It is shown in the same block here purely for ease of drawing.

This “mechanism” is known as the flash translation layer and it has responsibility for these tasks as well as a number of others. We’ll come back to it in subsequent posts because it is a real differentiator between flash products. For now though, think about the way the device is filling up with data. Although we’ve delayed issuing erase operations by cleverly moving data to different pages, at some point clearly there will be no empty pages left and erases will become essential. This is where the bad news comes in: it takes many times longer to perform an erase than it does to perform a read or program. And that clearly has consequences for performance if not managed correctly.

In the next post we’ll look at the differences in time taken to perform reads, programs and erases – which first requires looking at the different types of flash available: SLC, MLC and TLC…

caution[* Technical note: Ok so actually when a NAND flash page is empty it is all binary ones, e.g. 11111111. A program operation sets any bit with the value of 1 to 0, so for example 11111111 could become 11110000. This means that later on it is still possible to perform another program operation to set 11110000 to 00110000 for example. Until all bits are zero it's technical possible to perform another program. But hey, that's getting a bit too deep into the details for our requirements here, so just pretend you never read this...]

New My Oracle Support note on Advanced Format (4k) storage


In the past I have been a little critical of Oracle’s support notes and documentation regarding the use of Advanced Format 4k storage devices. I must now take that back, as my new friends in Oracle ASM Development and Product Management very kindly offered to let me write a new support note, which they have just published on My Oracle Support. It’s only supposed to be high level, but it does confirm that the _DISK_SECTOR_SIZE_OVERRIDE parameter can be safely set in database instances when using 512e storage and attempting to create 4k online redo logs.

The new support note is:

Using 4k Redo Logs on Flash and SSD-based Storage (Doc ID 1681266.1)

Don’t forget that you can read all about the basics of using Oracle with 4k sector storage here. And if you really feel up to it, I have a 4k deep dive page here.

Understanding Flash: What Is NAND Flash?


In the early 1980s, before we ever had such wondrous things as cell phones, tablets or digital cameras, a scientist named Dr Fujio Masuoka was working for Toshiba in Japan on the limitations of EPROM and EEPROM chips. An EPROM (Erasable Programmable Read Only Memory) is a type of memory chip that, unlike RAM for example, does not lose its data when the power supply is lost – in the technical jargon it is non-volatile. It does this by storing data in “cells” comprising of floating-gate transistors. I could start talking about Fowler-Nordheim tunnelling and hot-carrier injection at this point, but I’m going to stop here in case one of us loses the will to live. (But if you are the sort of person who wants to know more though, I can highly recommend this page accompanied by some strong coffee.)

Anyway, EPROMs could have data loaded into them (known as programming), but this data could also be erased through the use of ultra-violet light so that new data could be written. This cycle of programming and erasing is known as the program erase cycle (or PE Cycle) and is important because it can only happen a limited number of times per device… but that’s a topic for another post. However, while the reprogrammable nature of EPROMS was useful in laboratories, it was not a solution for packaging into consumer electronics – after all, including an ultra-violet light source into a device would make it cumbersome and commercially non-viable.

US Patent US4531203: Semiconductor memory device and method for manufacturing the same

US Patent US4531203: Semiconductor memory device and method for manufacturing the same

A subsequent development, known as the EEPROM, could be erased through the application of an electric field, rather than through the use of light, which was clearly advantageous as this could now easily take place inside a packaged product. Unlike EPROMs, EEPROMs could also erase and program individual bytes rather than the entire chip. However, the EEPROMs came with a disadvantage too: every cell required at least two transistors instead of the single transistor required in an EPROM. In other words, they stored less data: they had lower density.

The Arrival of Flash

So EPROMs had better density while EEPROMs had the ability to electrically reprogram cells. What if a new method could be found to incorporate both benefits without their associated weaknesses? Dr Masuoka’s idea, submitted as US patent 4612212 in 1981 and granted four years later, did exactly that. It used only one transistor per cell (increasing density, i.e. the amount of data it could store) and still allowed for electrical reprogramming.

If you made it this far, here’s the important bit. The new design achieved this goal by only allowing multiple cells to be erased and programmed instead of individual cells. This not only gives the density benefits of EPROM and the electrically-reprogrammable benefits of EEPROM, it also results in faster access times: it takes less time to issue a single command for programming or erasing a large number of cells than it does to issue one per cell.

However, the number of cells that are affected by a single erase operation is different – and much larger – than the number of cells affected by a single program operation. And it is this fact that, above all else, that results in the behaviour we see from devices built on flash memory. In the next post we will look at exactly what happens when program and erase operations take place, before moving on to look at the types of flash available (SLC, MLC etc) and their behaviour.


To try and keep this post manageable I’ve chosen to completely bypass the whole topic of NOR flash and just tell you that from this moment on we are talking about NAND flash, which is what you will find in SSDs, flash cards and arrays. It’s a cop out, I know – but if you really want to understand the difference then other people can describe it better than me.

In the meantime, we all have our good friend Dr Masuoka to thank for the flash memory that allows us to carry around the phones and tablets in our pockets and the SD cards in our digital cameras. Incidentally, popular legend has it that the name “flash” came from one of Dr Masuoka’s colleagues because the process of erasing data reminded him of the flash of a camera. flash-chipPresumably it was an analogue camera because digital cameras only became popular in the 1990s after the commoditisation of a new, solid-state storage technology called …


Understanding Disk: Caching and Tiering



When I was a child, about four or five years old, my dad told me a joke. It wasn’t a very funny joke, but it stuck in my mind because of what happened next. The joke went like this:

Dad: “What’s big at the bottom, small at the top and has ears?”

Me: “I don’t know?”

Dad: “A mountain!”

Me: “Er…<puzzled>…  What about the ears?”

Dad: (Triumphantly) “Haven’t you heard of mountaineers?!”

So as I say, not very funny. But, by a twist of fate, the following week at primary school my teacher happened to say, “Now then children, does anybody know any jokes they’d like to share?”. My hand shot up so fast that I was immediately given the chance to bring the house down with my new comedy routine. “What’s big at the bottom, small at the top and has ears?” I said, with barely repressed glee. “I don’t know”, murmured the teacher and other children expectantly. “A mountain!”, I replied.

Silence. Awkwardness. Tumbleweed. Someone may have said “Duuh!” under their breath. Then the teacher looked faintly annoyed and said, “That’s not really how a joke works…” before waiving away my attempts to explain and moving on to hear someone else’s (successful and funny) joke. Why had it been such a disaster? The joke had worked perfectly on the previous occasion, so why didn’t it work this time?

There are two reasons I tell this story: firstly because, as you can probably tell, I am scarred for life by what happened. And secondly, because it highlights what happens when you assume you can predict the unpredictable.*

Gambling With Performance

So far in this mini-series on Understanding Disk we’ve covered the design of hard drives, their mechanical limitations and some of the compromises that have to be made in order to achieve acceptable performance. The topic of this post is more about bandaids; the sticking plasters that users of disk arrays have to employ to try and cover up their performance problems. Or as my boss likes to call it, lipstick on a pig.

roulette-wheelIf you currently use an enterprise disk array the chances are it has some sort of DRAM cache within the array. Blocks stored in this cache can be read at a much lower latency than those residing only on disk, because the operation avoids paying the price of seek time and rotational latency. If the cache is battery-backed, it can also be used to accelerate writes too. But DRAM caches in storage area networks are notoriously expensive in relation to their size, which is often significantly smaller than the size of the active data set. For this reason, many array vendors allow you to use SSDs as an additional layer of slower (but higher capacity) cache.

Another common approach to masking the performance of disk arrays is tiering. This is where different layers of performance are identified (e.g. SATA disk, fibre-channel disk, SSD, etc) and data moved about according to its performance needs. Tiering can be performed manually, which requires a lot of management overhead, or automatically by software – perhaps the most well-known example being EMC’s Fully Automated Storage Tiering (or “FAST”) product. Unlike caching, which creates a copy of the data somewhere temporary, tiering involves permanently relocating the persistent location of the data. This relocation has a performance penalty, particularly if data is being moved frequently. Moreover, some automatic tiering solutions can take 24 hours to respond to changes in access patterns – now that’s what I call bad latency.

The Best Predictor of Future Behaviour is Past Behaviour

The problem with automatic tiering is that, just like caching, it relies on past behaviour to predict the future. That principle works well in psychology, but isn’t always as successful in computing. It might be acceptable if your workload is consistent and predictable, but what happens when you run your end of month reporting? What happens when you want to run a large ad-hoc query? What happens when you tell a joke about mountains and expect everyone to ask “but what about the ears”? You end up looking pretty stupid, I can tell you.

las-vegasI have no problem with caching or tiering in principle. After all, every computer system uses cache in multiple places: your CPUs probably have two or three levels of cache, your server is probably stuffed with DRAM and your Oracle database most likely has a large block buffer cache. What’s more, in my day job I have a lot of fun helping customers overcome the performance of nasty old spinning disk arrays using Violin’s Maestro memory services product.

But ultimately, caching and tiering are bandaids. They reduce the probability of horrible disk latency but they cannot eliminate it. And like a gambler on a winning streak, if you become more accustomed to faster access times and put more of your data at risk, the impact when you strike out is felt much more deeply. The more you bet, the more you have to lose.

Shifting the Odds in Your Favour

I have a customer in the finance industry who doesn’t care (within reason) what latency their database sees from storage. All they care about is that their end users see the same consistent and sustained performance. It doesn’t have to be lightning fast, but it must not, ever, feel slower than “normal”. As soon as access times increase, their users’ perception of the system’s performance suffers… and the users abandon them to use rival products.

poker-gameThey considered high-end storage arrays but performance was woefully unpredictable, no matter how much cache and SSD they used. They considered Oracle Exadata but ruled it out because Exadata Flash Cache is still a cache – at some point a cache miss will mean fetching data from horrible, spinning disk. Now they use all flash arrays, because the word “all” means their data is always on flash: no gambling with performance.

Caching and tiering will always have some sort of place in the storage world. But never forget that you cannot always win – at some point (normally the worst possible time) you will need to access data from the slowest media used by your platform. Which is why I like all flash arrays: you have a 100% chance of your data being on flash. If I’m forced to gamble with performance, those are the odds I prefer…

* I know. It’s a tenuous excuse for telling this story, but on the bright side I feel a lot better for sharing it with you.

Oracle SLOB On Solaris

Guest Post

This is another guest post from my buddy Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Nate believes that all English people live in the Dickensian London of the 19th century and speak in Cockney rhyming slang. I hate to disappoint, so have a butcher’s below and feast your mince pies on his attempts to make SLOB work on Solaris without going chicken oriental. Over to you Nate, me old china plate.

slob ghost

Note: The Silly Little Oracle Benchmark, or SLOB, is a Linux-only tool designed and released for the community by Kevin Closson. There are no ports for other operating systems – and Kevin has always advised that the solution for testing on another platform is to use a Linux VM and connect via TNS. The purpose of this post is simply to show what happens when you have no other choice but to try and get SLOB working natively on Solaris…

I wrestled with SLOB 2 for a couple hours last week for a demo build we had in-house to show our capabilities to a prospective customer. I should mention I’ve had great success—and ease!—with SLOB 2 previously. But that was on Linux. This was on Solaris 10—to mimic the setup the customer has in-house. No problem, I thought; there’s some C files to compile, but then there’s just shell scripts to drive the thing. What could go wrong?

Well, it would seem Kevin Closson, the creator of SLOB and SLOB 2, did his development on an OS with a better sense of humor than Solaris. The package unzipped, and the script appeared to run successfully, but would load up the worker threads and wait several seconds before launching them—and then immediately call it “done” and bail out, having executed on the database only a couple seconds. Huh? I had my slob.conf set to execute for 300 seconds.

I had two databases created: one with 4K blocks and one with 8K blocks. I had a tablespace created for SLOB data called SLOB4K and SLOB8K, respectively. I ran SLOB4K 128, and the log file showed no errors. All good, I thought. Now run 12, and it stops as quickly as it starts. Oof.

It took Bryan Wood, a much better shell script debugger (hey, I said DEbugger) than myself, to figure out all the problems.

First, there was this interesting line of output from the command:

NOTIFY: Connecting users 1 2 3 Usage: mpstat [-aq] [-p | -P processor_set] [interval [count]]
4 5 6 7 8 9 10

Seems Solaris doesn’t like mpstat –P ALL. However it seems that on Solaris 10 the mpstat command shows all processors even without the -P option.

Next, Solaris doesn’t like Kevin’s “sleep .5” command inside; it wants whole numbers only. That raises the question in my mind why he felt the need to check for running processes every half second rather than just letting it wait a full second between checks, but fine. Modify the command in the wait_pids() function to sleep for a full second, and that part is happy.

But it still kicks out immediately and kills the OS level monitoring commands, even though there are active SQL*Plus sessions out there. It seems on Solaris the ps –p command to report status on a list of processes requires the list of process IDs to be escaped where Linux does not. IE:

-bash-3.2$ ps -p 1 2 3
usage: ps [ -aAdeflcjLPyZ ] [ -o format ] [ -t termlist ]
        [ -u userlist ] [ -U userlist ] [ -G grouplist ]
        [ -p proclist ] [ -g pgrplist ] [ -s sidlist ] [ -z zonelist ]
  'format' is one or more of:
        user ruser group rgroup uid ruid gid rgid pid ppid pgid sid taskid ctid
        pri opri pcpu pmem vsz rss osz nice class time etime stime zone zoneid
        f s c lwp nlwp psr tty addr wchan fname comm args projid project pset

But with quotes:

-bash-3.2$ ps -p "1 2 3"
   PID TTY         TIME CMD
     1 ?           0:02 init
     2 ?           0:00 pageout
     3 ?          25:03 fsflush

After some messing about, Bryan had the great idea to simply replace the command:

while ( ps -p $pids > /dev/null 2>&1 )


while ( ps -p "$pids" > /dev/null 2>&1 )

Just thought I might save someone else some time and hair pulling by sharing this info… Here are the finished file diffs:

-bash-3.2$ diff
< while ( ps -p "$pids" > /dev/null 2>&1 )
> while ( ps -p $pids > /dev/null 2>&1 )
<       sleep 1
>       sleep .5
<       ( mpstat 3  > mpstat.out ) &
>       ( mpstat -P ALL 3  > mpstat.out ) &

How To Succeed In Presales?


This article is aimed at anyone considering making the move into technical presales who currently works in a professional services, consultancy or support role, or as customers and end-users. You will notice that the title of the article has a question mark at the end – that’s because I don’t have the answer – and I have neither the confidence nor the evidence to claim that I have been a success. But I have made lots of mistakes… and apparently you can learn from those, so here I will share some of my experiences of making the leap. I’d also like to point out that after two years working in presales for Violin Memory I haven’t – yet – been fired, which I believe is a good sign.

My first day at Violin Memory was a disaster. In fact, let me be more specific, it was my first hour. Having spent my career working variously as in development, as a consultant for an Oracle partner, an Oracle customer and then in professional services for Oracle itself, I’d finally taken the leap into technical presales and this was to be my first day. Sure I’d done bits and pieces of presales work before, but this was a proper out-and-out presales role – and I knew that I was either going to sink or swim. Within the first hour I was deep underwater.

Someone doing a much better job than me

Someone doing a much better job than me

By a quirk of timing, it turned out that my first day at Violin was actually the first day of the annual sales conference held in silicon valley. So the day before I’d boarded a plane with a number of other fresh recruits (all veterans of sales organisations) and flown to San Francisco where, in the true spirit of any Englishman abroad, I’d set about using alcohol as a tool to combat jet lag. Smart move.

It’s not as if I didn’t know what a mistake this was. I’d learnt the same lesson on countless other occasions in my career – that having a hangover is painful, unprofessional and deeply unimpressive to those unfortunate enough to meet you the following morning – but maybe my attempts to integrate with my new team were a little bit over-enthusiastic. Meanwhile, my new boss (the one who had taken the gamble of employing me) had asked me to prepare a presentation which would be delivered to the team at some point on day 1.

Day 1 arrived – and at 8.00am we gathered in a conference room to see the agenda. Guess who was up first? At 8.05am I stood up, hungover, unprepared, very nervous and (give me a break, I’m looking for excuses) jet-lagged to deliver… well if it wasn’t the worst presentation of my life, it was in the top two (I have previously mentioned the other one here). I mumbled, I didn’t look for feedback from the audience, I talked too fast for the non-native English speakers in the room and my content was too technical. I’m surprised they didn’t pack me off to the airport right then. It was not a success.

And Yet…

The thing is, the subject of my talk was databases. As anyone who knows me can attest (to their misfortune), I love talking about databases. It’s a subject I am passionate about and, if you get me started, you may have trouble stopping me. This is because to me, talking about database is just as much fun as actually using them, but much easier and with less need to remember the exact syntax. So why did I choke that day – the worst possible day to make a bad impression?

unpreparedThere is an obvious moral to my tale, which is not to be so stupid. Don’t turn up unprepared, do make sure your content is at the right level and don’t drink too much the night before. Follow those rules and you’ll be confident and enthusiastic instead of nervous and monotonous. But you know this and I didn’t tell this sorry tale just to deliver such a lame piece of advice.

One of the enduring myths about working in sales is that it’s all about delivering presentations – and this sometimes puts people off, since many have little presentation experience nor an easy way of gaining it. While it’s true that being able to present and articulate ideas or solutions is an essential part of any sales role (I wouldn’t recommend presales if you actively dislike presenting), in the last two years I’ve come to the conclusion that there are more important qualities required. In my opinion the single most important skill needed to work in presales is the ability to become a trusted advisor to your (potential) customers. People buy from people they trust. If you want to help sell, you don’t need to impress people with your flair, style or smart suit (not that those things won’t help)… you need to earn their trust. And if you deliver on that trust, not only will they buy from you, they will come back again for more.

If you work (successfully) in professional services or consultancy right now, the chances are you already do this. Your customers won’t value your contribution unless they trust you. Likewise if you work for a customer or end-user, it’s quite likely that you have internal customers (e.g. the “business” that your team supports) and if you’ve gained their trust, you’re already selling something: yourself, your skills and the service you provide.

beerIt’s not for everybody, but I find working in technical presales hugely fulfilling. I get to meet lots of interesting customers, see how they run their I.T. organisations and services, talk to them about existing and future technologies and I get to experience the highs (and lows) of winning (or not winning) deals.

If you’re thinking of making the move, don’t be put off by concerns over a lack of sales experience. You may not be aware, but the chances are, you already have it. Just don’t drink too much the night before your first day…


Get every new post delivered to your Inbox.

Join 757 other followers