Oracle ASM Filter Driver: First Impressions

This is a very quick post, because I’m about to log off and take an extended summer holiday (or vacation as my crazy American friends call it… but then they call football  “soccer” too). Before I go, I wanted to document my initial findings with the new ASM Filter Driver feature introduced in this week’s 12.1.o.2 patchset.

Currently a Linux-only feature, the ASM Filter Driver (or AFD) is a replacement for ASMLib and is described by Oracle as follows:

Oracle ASM Filter Driver (Oracle ASMFD) is a kernel module that resides in the I/O path of the Oracle ASM disks. Oracle ASM uses the filter driver to validate write I/O requests to Oracle ASM disks.

The Oracle ASMFD simplifies the configuration and management of disk devices by eliminating the need to rebind disk devices used with Oracle ASM each time the system is restarted.

The Oracle ASM Filter Driver rejects any I/O requests that are invalid. This action eliminates accidental overwrites of Oracle ASM disks that would cause corruption in the disks and files within the disk group. For example, the Oracle ASM Filter Driver filters out all non-Oracle I/Os which could cause accidental overwrites.

Interesting, eh? So let’s find out how that works.


I found this a real pain as you need to have installed before the AFD is available to label your disks, yet the default OUI mode wants to create an ASM diskgroup… and you cannot do that without any labelled disks.

The only solution I could come up with was to perform a software-only install, which in itself is a pain. I’ll skip the numerous screenshots of that part though and just skip straight to the bit where I have Grid Infrastructure installed.

I’m following these instructions because I am using a single-instance Oracle Restart system rather than a true cluster.

First of all we need to do this:

[oracle@server3 ~]$ $ORACLE_HOME/bin/asmcmd dsset 'AFD:*'

[oracle@server3 ~]$ $ORACLE_HOME/bin/asmcmd dsget
[oracle@server3 ~]$ srvctl config asm
ASM home: 
Password file:
ASM listener: LISTENER
Spfile: /u01/app/oracle/admin/+ASM/pfile/spfile+ASM.ora
ASM diskgroup discovery string: AFD:*

Then we need to stop HAS and run the AFD_CONFIGURE command:

[root@server3 ~]# $ORACLE_HOME/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'server3'
CRS-2673: Attempting to stop 'ora.asm' on 'server3'
CRS-2673: Attempting to stop 'ora.evmd' on 'server3'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'server3'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'server3' succeeded
CRS-2677: Stop of 'ora.evmd' on 'server3' succeeded
CRS-2677: Stop of 'ora.asm' on 'server3' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'server3'
CRS-2677: Stop of 'ora.cssd' on 'server3' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'server3' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@server3 ~]# $ORACLE_HOME/bin/asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
ASMCMD-9524: AFD configuration failed 'ERROR: OHASD start failed'

Er… that’s not really what I had in mind. But hey, let’s carry on regardless:

[root@server3 oracleafd]# $ORACLE_HOME/bin/asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'DEFAULT' on host 'server3.local'

[root@server3 oracleafd]# $ORACLE_HOME/bin/crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Ok it seems to be working. I wonder what it’s done?


The first thing I notice is some Oracle kernel modules have been loaded:

[root@server3 ~]# lsmod | grep ora
oracleafd             208499  1
oracleacfs           3307969  0
oracleadvm            506254  0
oracleoks             505749  2 oracleacfs,oracleadvm

I also see that, just like ASMLib, a driver has been plonked into the /opt/oracle/extapi directory:

[root@server3 1]# find /opt/oracle/extapi -ls
2752765    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi
2752766    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64
2753508    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm
2756532    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl
2756562    4 drwxr-xr-x   2 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl/1
2756578  268 -rwxr-xr-x   1 oracle   dba        272513 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl/1/

And again, just like ASMLib, there is a new directory under /dev called /dev/oracleafd (whereas for ASMLib it’s called /dev/oracleasm):

[root@server3 ~]# ls -la /dev/oracleafd/
total 0
drwxrwx---  3 oracle dba      80 Jul 25 15:15 .
drwxr-xr-x 21 root   root  15820 Jul 25 15:15 ..
brwxrwx---  1 oracle dba  249, 0 Jul 25 15:15 admin
drwxrwx---  2 oracle dba      40 Jul 25 15:15 disks

The disks directory is currently empty. Maybe I should create some AFD devices and see what happens?


So let’s look at my Violin devices and see if I can label them:

root@server3 mapper]# ls -l /dev/mapper
total 0
crw-rw---- 1 root root 10, 236 Jul 11 16:52 control
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data1 -> ../dm-3
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data2 -> ../dm-4
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data3 -> ../dm-5
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data4 -> ../dm-6
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data5 -> ../dm-7
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data6 -> ../dm-8
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data7 -> ../dm-9
lrwxrwxrwx 1 root root       8 Jul 25 15:49 data8 -> ../dm-10
lrwxrwxrwx 1 root root       7 Jul 11 16:53 VolGroup-lv_home -> ../dm-2
lrwxrwxrwx 1 root root       7 Jul 11 16:53 VolGroup-lv_root -> ../dm-0
lrwxrwxrwx 1 root root       7 Jul 11 16:52 VolGroup-lv_swap -> ../dm-1

The documentation appears to be incorrect here, when it says to use the command $ORACLE_HOME/bin/afd_label. It’s actually $ORACLE_HOME/bin/asmcmd with the first parameter afd_label. I’m going to label the devices called /dev/mapper/data*:

[root@server3 mapper]# for lun in 1 2 3 4 5 6 7 8; do
> asmcmd afd_label DATA$lun /dev/mapper/data$lun
> done
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.

root@server3 mapper]# asmcmd afd_lsdsk
Connected to an idle instance.
Label                     Filtering   Path
DATA1                       ENABLED   /dev/mapper/data1
DATA2                       ENABLED   /dev/mapper/data2
DATA3                       ENABLED   /dev/mapper/data3
DATA4                       ENABLED   /dev/mapper/data4
DATA5                       ENABLED   /dev/mapper/data5
DATA6                       ENABLED   /dev/mapper/data6
DATA7                       ENABLED   /dev/mapper/data7
DATA8                       ENABLED   /dev/mapper/data8

That seemed to work ok. So what’s going on in the /dev/oracleafd/disks directory now?

[root@server3 ~]# ls -l /dev/oracleafd/disks/
total 32
-rw-r--r-- 1 root root 26 Jul 25 15:52 DATA1
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA2
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA3
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA4
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA5
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA6
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA7
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA8

There they are, just like with ASMLib. But look at the permissions, they are all owned by root with read-only privs for other users. In an ASMLib environment these devices are owned by oracle:dba, which means non-Oracle processes can write to them and corrupt them in some situations. Is this how Oracle claims the AFD protects devices?

I haven’t had time to investigate further but I assume that the database will access the devices via this mysterious block device:

[oracle@server3 oracleafd]$ ls -l /dev/oracleafd/admin
brwxrwx--- 1 oracle dba 249, 0 Jul 25 16:25 /dev/oracleafd/admin

It will be interesting to find out.


Of course, if you are logged in as root you aren’t going to be protected from any crazy behaviour:

[root@server3 ~]# cd /dev/oracleafd/disks
[root@server3 disks]# ls -l
total 496
-rw-r--r-- 1 root root 475877 Jul 25 16:40 DATA1
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA2
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA3
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA4
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA5
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA6
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA7
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA8
[root@server3 disks]# od -c -N 256 DATA8
0000000   /   d   e   v   /   m   a   p   p   e   r   /   d   a   t   a
0000020   8  \n
[root@server3 disks]# dmesg >> DATA8
[root@server3 disks]# od -c -N 256 DATA8
0000000   /   d   e   v   /   m   a   p   p   e   r   /   d   a   t   a
0000020   8   \n   z   r   d   b   t   e   2  l   I   n   i   t   i   a
0000040   l   i   z   i   n   g       c   g   r   o   u   p       s   u
0000060   b   s   y   s       c   p   u   s   e   t  \n   I   n   i   t
0000100   i   a   l   i   z   i   n   g       c   g   r   o   u   p
0000120   s   u   b   s   y   s       c   p   u  \n   L   i   n   u   x
0000140       v   e   r   s   i   o   n       3   .   8   .   1   3   -
0000160   2   6   .   2   .   3   .   e   l   6   u   e   k   .   x   8
0000200   6   _   6   4       (   m   o   c   k   b   u   i   l   d   @
0000220   c   a   -   b   u   i   l   d   4   4   .   u   s   .   o   r
0000240   a   c   l   e   .   c   o   m   )       (   g   c   c       v
0000260   e   r   s   i   o   n       4   .   4   .   7       2   0   1
0000300   2   0   3   1   3       (   R   e   d       H   a   t       4
0000320   .   4   .   7   -   3   )       (   G   C   C   )       )
0000340   #   2       S   M   P       W   e   d       A   p   r       1
0000360   6       0   2   :   5   1   :   1   0       P   D   T       2

Proof, if ever you need it, that root access is still the fastest and easiest route to total disaster…

New section: Oracle SLOB Testing

slob ghost

For some time now I have preferred Oracle SLOB as my tool for generating I/O workloads using Oracle databases. I’ve previously blogged some information on how to use SLOB for PIO testing, as well as shared some scripts for running tests and extracting results. I’ve now added a whole new landing page for SLOB and a complete guide to running sustained throughput testing.

Why would you want to run sustained throughput tests? Well, one great reason is that not all storage platforms can cope with sustained levels of write workload. Flash arrays, or any storage array which contains flash, have a tendency to suffer from garbage collection issues when sustained write workloads hit them hard enough.

Find out more by following the links below:

Understanding Flash: SLC, MLC and TLC


The last post in this series discussed the layout of NAND flash memory chips and the way in which cells can be read and written (programmed) at the page level but have to be erased at the (larger) block level. I finished by mentioning that erase operations take substantially longer than read or program operations… but just how big is the difference?

Knowing the answer to this involves first understanding the different types of flash memory available: SLC, MLC and TLC.

Electrons In A Bucket?

Whenever I’ve seen anyone attempt to explain this in the past, they have almost always resorted to drawing a picture of electrons or charge filling up a bucket. This is a terrible analogy and causes anyone with a deep understanding of physics to cringe in horror. Luckily, I don’t have a deep understanding of physics, so I’m going to go right along with the herd and get my bucket out.

A NAND flash cell, i.e. the thing that stores a value of one or zero, is actually a floating gate transistor. Programming the cell means putting electrons into the floating gate, causing it to become (negatively) charged. Erasing the cell means removing the electrons from the floating gate, draining the charge. The amount of charge contained in the floating gate can be varied from zero up to a maximum value – this is an analogue system so there is no simple FULL or EMPTY state.

Because of this, the amount of charge can be measured and thresholds assigned to indicate a binary value. What does that mean? It means that, in the case of Single Level Cell (SLC) flash anything below 50% of charge can be considered to be a bit with a value of 1, while anything above 50% can be considered a bit with a value of 0.

But if i decided to be a bit more careful in the way I fill or empty my bucket of charge (sorry), I could perhaps define more thresholds and thus hold two bits of data instead of one. I could say that below 25% is 11, from 25% to 50% is 10, from 50% to 75% is 01 and above 75% is 00. Now I can keep twice as much data in the same bucket. This is in fact Multi Level Cell (MLC). And as the picture shows, if I was really careful in the way I treated my bucket, I could even keep three bits of data in there, which is what happens in Three Level Cell (TLC):


The thing is, imagine this was a bucket of water (comparing electrons to water is probably the last straw for anyone reading this who has a degree in physics, so I bid you farewell at this point). If you were to fill up your bucket using the SLC method, you could be pretty slap-dash about it. I mean it’s pretty obvious when the bucket is more than half full or empty. But if you were using a more fine-grained method such as MLC or TLC you would need to fill / empty very carefully and take exact measurements, which means the act of filling (programming) would be a lot slower.

To really stretch this analogy to breaking point, imagine that every time you fill your bucket it gets slightly damaged, causing it to leak. In the SLC world, even a number of small leaks would not be a big deal. But in the MLC or (especially) the TLC world, those leaks mean it would quickly become impossible to keep using your bucket, because the tolerance between different bit values is so small. For similar reasons, NAND flash endurance is greatly influenced by the type of cell used. Storing more bits per cell means a lower tolerance for errors, which in turn means that higher error rates are experienced and endurance (the number of program/erase cycles that can be sustained) is lower.

Timing and Wear

Enough of the analogies, let’s look at some proper data. The chart below uses sample figures from AnandTech:


You can see that as the number of bits per cell increases, so does the time taken to perform read, program (i.e. write) and erase operations. Erases in particular are especially slow, with values measured in milliseconds instead of microseconds. Given that erases also affect larger areas of flash than reads and programs, you can start to see why the management of erase operations on flash is critical to performance.

Also apparent on the chart above is the massive difference in the number of program / erase cycles between the different flash types: for SLC we’re talking about orders of magnitude in difference. But of course SLC can only store one bit per cell, which means it’s much more expensive from a capacity perspective than MLC. TLC, meanwhile, offers the potential for great value for money, but none of the performance requirements you would need for tier one storage (although it may well have a place in the world of backups). It is for this reason that MLC is the most commonly used type of flash in enterprise storage systems. (By the way I’m so utterly disinterested in the phenomena of “eMLC” that I’m not going to cover it here, but you can read this and this if you want to know more on the subject…)

Warning: Know Your Flash

cautionOne final thing. When you buy an SSD, a PCIe flash card or, in the case of Violin Memory, an all-flash array you tend to choose between SLC and MLC. As a very rough rule of thumb you can consider MLC to be twice the capacity for half the performance of SLC, although this in fact varies depending on many factors. However there are some all flash array vendors who use both SLC and MLC in a sort of tiered approach. That’s fine -and if you are buying a flash array I’m sure you’ll take the time to understand how it works.

But here’s the thing. At least one of these vendors insists on describing the SLC layer as “NVRAM” to differentiate from the MLC layer which it simply describes as using flash SSDs. The truth is that the NVRAM is also just a bunch of flash SSDs, except they are SLC instead of MLC. I’m not in favour of using educational posts to criticise competitors, but in the interest of bring clarity to this subject I will say this: I think this is a marketing exercise which deliberately adds confusion to try and make the design sound more exciting. “Ooooh, NVRAM that sounds like something I ought to have in my flash array…” – or am I being too cynical?

Understanding Flash: Blocks, Pages and Program / Erases

In the last post on this subject I described the invention of NAND flash and the way in which erase operations affect larger areas than write operations. Let’s have a look at this in more detail and see what actually happens. First of all, we need to know our way around the different entities on a flash chip: the die, the plane, the block and the page:

NAND Flash Die Layout (image courtesy of AnandTech)

NAND Flash Die Layout (image courtesy of AnandTech)

Note: What follows is a high-level description of the generic behaviour of flash. There are thousands of different NAND chips available, each potentially with slightly different instruction sets, block/page sizes, performance characteristics etc.

  • The die is the memory chip, i.e. the black rectangle with little electrical connectors sticking out of it. If you look at an SSD, a flash card or the internals of a flash array you will see many flash dies, each of which is produced by one of the big flash manufacturers: Toshiba, Samsung, Micron, Intel, SanDisk, SK Hynix. These are the only companies with the multi-billion dollar fabrication plants necessary to make NAND flash.
  • Each die contains one or more planes (usually two). Identical, concurrent operations can take place on each plane, although with some restrictions.
  • Each plane contains a number of blocks, which are the smallest unit that can be erased. Remember that, it’s really important.
  • Each block contains a number of pages, which are the smallest unit that can be programmed (i.e. written to).

The important bit here is that program operations (i.e. writes) take place to a page, which might typically be 8-16KB in size, while erase operations take place to a block, which might be 4-8MB in size. Since a block needs to be erased before it can be programmed again (*sort of, I’m generalising to make this easier), all of the pages in a block need to be candidates for erasure before this can happen.

Program / Erase Cycles

When your flash device arrives fresh from the vendor, all of the pages are “empty”. The first thing you will want to do, I’m sure, is write some data to them – which in the world of memory chips we call a program operation. As discussed, these program operations take place at the page level. You can then read your fresh data back out again with read operations, which also take place at the page level. [Having said that, the instruction to read a page places the data from that page in a memory register, so your reading process can in fact then selectively access subsets of the page if it desires - but maybe that's going into too much detail...]

NAND-flash-blocks-pages-program-erasesWhere it gets interesting is if you want to update the data you just wrote. There is no update operation for flash, no undo or rewind mechanism for changing what is currently in place, just the erase operation. It’s a little bit like an etch-a-sketch, in that you can continue to turn the dials and make white sections of screen go black, but you cannot turn black sections of screen to white again with erasing the entire screen. Etch-a-SketchAn erase operation on a flash chip clears the data from all pages in the block, so if some of the other pages contain active data (stuff you want to keep) you either have to copy it elsewhere first or hold off from doing the erase.

In fact, that second option (don’t erase just yet) makes the most sense, because the blocks on a flash chip can only tolerate a limited number of program and erase options (known as the program erase cycle or PE cycle because for obvious reasons they follow each other in turn). If you were to erase the block every time you wanted to change the contents of a page, your flash would wear out very quickly.

So a far better alternative is to simply mark the old page (containing the unchanged data) as INVALID and then write the new, changed data to an empty page. All that is required now is a mechanism for pointing any subsequent access operations to the new page and a way of tracking invalid pages so that, at some point, they can be “recycled”.


Updating a page in NAND flash. Note that the new page location does not need to be within the same block, or even the same flash die. It is shown in the same block here purely for ease of drawing.

This “mechanism” is known as the flash translation layer and it has responsibility for these tasks as well as a number of others. We’ll come back to it in subsequent posts because it is a real differentiator between flash products. For now though, think about the way the device is filling up with data. Although we’ve delayed issuing erase operations by cleverly moving data to different pages, at some point clearly there will be no empty pages left and erases will become essential. This is where the bad news comes in: it takes many times longer to perform an erase than it does to perform a read or program. And that clearly has consequences for performance if not managed correctly.

In the next post we’ll look at the differences in time taken to perform reads, programs and erases – which first requires looking at the different types of flash available: SLC, MLC and TLC…

caution[* Technical note: Ok so actually when a NAND flash page is empty it is all binary ones, e.g. 11111111. A program operation sets any bit with the value of 1 to 0, so for example 11111111 could become 11110000. This means that later on it is still possible to perform another program operation to set 11110000 to 00110000 for example. Until all bits are zero it's technical possible to perform another program. But hey, that's getting a bit too deep into the details for our requirements here, so just pretend you never read this...]

New My Oracle Support note on Advanced Format (4k) storage


In the past I have been a little critical of Oracle’s support notes and documentation regarding the use of Advanced Format 4k storage devices. I must now take that back, as my new friends in Oracle ASM Development and Product Management very kindly offered to let me write a new support note, which they have just published on My Oracle Support. It’s only supposed to be high level, but it does confirm that the _DISK_SECTOR_SIZE_OVERRIDE parameter can be safely set in database instances when using 512e storage and attempting to create 4k online redo logs.

The new support note is:

Using 4k Redo Logs on Flash and SSD-based Storage (Doc ID 1681266.1)

Don’t forget that you can read all about the basics of using Oracle with 4k sector storage here. And if you really feel up to it, I have a 4k deep dive page here.

Understanding Flash: What Is NAND Flash?


In the early 1980s, before we ever had such wondrous things as cell phones, tablets or digital cameras, a scientist named Dr Fujio Masuoka was working for Toshiba in Japan on the limitations of EPROM and EEPROM chips. An EPROM (Erasable Programmable Read Only Memory) is a type of memory chip that, unlike RAM for example, does not lose its data when the power supply is lost – in the technical jargon it is non-volatile. It does this by storing data in “cells” comprising of floating-gate transistors. I could start talking about Fowler-Nordheim tunnelling and hot-carrier injection at this point, but I’m going to stop here in case one of us loses the will to live. (But if you are the sort of person who wants to know more though, I can highly recommend this page accompanied by some strong coffee.)

Anyway, EPROMs could have data loaded into them (known as programming), but this data could also be erased through the use of ultra-violet light so that new data could be written. This cycle of programming and erasing is known as the program erase cycle (or PE Cycle) and is important because it can only happen a limited number of times per device… but that’s a topic for another post. However, while the reprogrammable nature of EPROMS was useful in laboratories, it was not a solution for packaging into consumer electronics – after all, including an ultra-violet light source into a device would make it cumbersome and commercially non-viable.

US Patent US4531203: Semiconductor memory device and method for manufacturing the same

US Patent US4531203: Semiconductor memory device and method for manufacturing the same

A subsequent development, known as the EEPROM, could be erased through the application of an electric field, rather than through the use of light, which was clearly advantageous as this could now easily take place inside a packaged product. Unlike EPROMs, EEPROMs could also erase and program individual bytes rather than the entire chip. However, the EEPROMs came with a disadvantage too: every cell required at least two transistors instead of the single transistor required in an EPROM. In other words, they stored less data: they had lower density.

The Arrival of Flash

So EPROMs had better density while EEPROMs had the ability to electrically reprogram cells. What if a new method could be found to incorporate both benefits without their associated weaknesses? Dr Masuoka’s idea, submitted as US patent 4612212 in 1981 and granted four years later, did exactly that. It used only one transistor per cell (increasing density, i.e. the amount of data it could store) and still allowed for electrical reprogramming.

If you made it this far, here’s the important bit. The new design achieved this goal by only allowing multiple cells to be erased and programmed instead of individual cells. This not only gives the density benefits of EPROM and the electrically-reprogrammable benefits of EEPROM, it also results in faster access times: it takes less time to issue a single command for programming or erasing a large number of cells than it does to issue one per cell.

However, the number of cells that are affected by a single erase operation is different – and much larger – than the number of cells affected by a single program operation. And it is this fact that, above all else, that results in the behaviour we see from devices built on flash memory. In the next post we will look at exactly what happens when program and erase operations take place, before moving on to look at the types of flash available (SLC, MLC etc) and their behaviour.


To try and keep this post manageable I’ve chosen to completely bypass the whole topic of NOR flash and just tell you that from this moment on we are talking about NAND flash, which is what you will find in SSDs, flash cards and arrays. It’s a cop out, I know – but if you really want to understand the difference then other people can describe it better than me.

In the meantime, we all have our good friend Dr Masuoka to thank for the flash memory that allows us to carry around the phones and tablets in our pockets and the SD cards in our digital cameras. Incidentally, popular legend has it that the name “flash” came from one of Dr Masuoka’s colleagues because the process of erasing data reminded him of the flash of a camera. flash-chipPresumably it was an analogue camera because digital cameras only became popular in the 1990s after the commoditisation of a new, solid-state storage technology called …


Understanding Disk: Caching and Tiering



When I was a child, about four or five years old, my dad told me a joke. It wasn’t a very funny joke, but it stuck in my mind because of what happened next. The joke went like this:

Dad: “What’s big at the bottom, small at the top and has ears?”

Me: “I don’t know?”

Dad: “A mountain!”

Me: “Er…<puzzled>…  What about the ears?”

Dad: (Triumphantly) “Haven’t you heard of mountaineers?!”

So as I say, not very funny. But, by a twist of fate, the following week at primary school my teacher happened to say, “Now then children, does anybody know any jokes they’d like to share?”. My hand shot up so fast that I was immediately given the chance to bring the house down with my new comedy routine. “What’s big at the bottom, small at the top and has ears?” I said, with barely repressed glee. “I don’t know”, murmured the teacher and other children expectantly. “A mountain!”, I replied.

Silence. Awkwardness. Tumbleweed. Someone may have said “Duuh!” under their breath. Then the teacher looked faintly annoyed and said, “That’s not really how a joke works…” before waiving away my attempts to explain and moving on to hear someone else’s (successful and funny) joke. Why had it been such a disaster? The joke had worked perfectly on the previous occasion, so why didn’t it work this time?

There are two reasons I tell this story: firstly because, as you can probably tell, I am scarred for life by what happened. And secondly, because it highlights what happens when you assume you can predict the unpredictable.*

Gambling With Performance

So far in this mini-series on Understanding Disk we’ve covered the design of hard drives, their mechanical limitations and some of the compromises that have to be made in order to achieve acceptable performance. The topic of this post is more about bandaids; the sticking plasters that users of disk arrays have to employ to try and cover up their performance problems. Or as my boss likes to call it, lipstick on a pig.

roulette-wheelIf you currently use an enterprise disk array the chances are it has some sort of DRAM cache within the array. Blocks stored in this cache can be read at a much lower latency than those residing only on disk, because the operation avoids paying the price of seek time and rotational latency. If the cache is battery-backed, it can also be used to accelerate writes too. But DRAM caches in storage area networks are notoriously expensive in relation to their size, which is often significantly smaller than the size of the active data set. For this reason, many array vendors allow you to use SSDs as an additional layer of slower (but higher capacity) cache.

Another common approach to masking the performance of disk arrays is tiering. This is where different layers of performance are identified (e.g. SATA disk, fibre-channel disk, SSD, etc) and data moved about according to its performance needs. Tiering can be performed manually, which requires a lot of management overhead, or automatically by software – perhaps the most well-known example being EMC’s Fully Automated Storage Tiering (or “FAST”) product. Unlike caching, which creates a copy of the data somewhere temporary, tiering involves permanently relocating the persistent location of the data. This relocation has a performance penalty, particularly if data is being moved frequently. Moreover, some automatic tiering solutions can take 24 hours to respond to changes in access patterns – now that’s what I call bad latency.

The Best Predictor of Future Behaviour is Past Behaviour

The problem with automatic tiering is that, just like caching, it relies on past behaviour to predict the future. That principle works well in psychology, but isn’t always as successful in computing. It might be acceptable if your workload is consistent and predictable, but what happens when you run your end of month reporting? What happens when you want to run a large ad-hoc query? What happens when you tell a joke about mountains and expect everyone to ask “but what about the ears”? You end up looking pretty stupid, I can tell you.

las-vegasI have no problem with caching or tiering in principle. After all, every computer system uses cache in multiple places: your CPUs probably have two or three levels of cache, your server is probably stuffed with DRAM and your Oracle database most likely has a large block buffer cache. What’s more, in my day job I have a lot of fun helping customers overcome the performance of nasty old spinning disk arrays using Violin’s Maestro memory services product.

But ultimately, caching and tiering are bandaids. They reduce the probability of horrible disk latency but they cannot eliminate it. And like a gambler on a winning streak, if you become more accustomed to faster access times and put more of your data at risk, the impact when you strike out is felt much more deeply. The more you bet, the more you have to lose.

Shifting the Odds in Your Favour

I have a customer in the finance industry who doesn’t care (within reason) what latency their database sees from storage. All they care about is that their end users see the same consistent and sustained performance. It doesn’t have to be lightning fast, but it must not, ever, feel slower than “normal”. As soon as access times increase, their users’ perception of the system’s performance suffers… and the users abandon them to use rival products.

poker-gameThey considered high-end storage arrays but performance was woefully unpredictable, no matter how much cache and SSD they used. They considered Oracle Exadata but ruled it out because Exadata Flash Cache is still a cache – at some point a cache miss will mean fetching data from horrible, spinning disk. Now they use all flash arrays, because the word “all” means their data is always on flash: no gambling with performance.

Caching and tiering will always have some sort of place in the storage world. But never forget that you cannot always win – at some point (normally the worst possible time) you will need to access data from the slowest media used by your platform. Which is why I like all flash arrays: you have a 100% chance of your data being on flash. If I’m forced to gamble with performance, those are the odds I prefer…

* I know. It’s a tenuous excuse for telling this story, but on the bright side I feel a lot better for sharing it with you.

The Ultimate Guide To Oracle with Advanced Format 4k


It’s a brave thing, calling something the “Ultimate Guide To …” as it can leave you open to criticism that it’s anything but. However, this topic – of how Oracle runs on Advanced Format storage systems and which choices have which consequences – is one I’ve been learning for two years now, so this really is everything I know. And from my desperate searching of the internet, plus discussions with people who are usually much knowledgeable than me, I’ve come to the conclusion that nobody else really understands it.

In fact, you could say that it’s a topic full of confusion – and if you browsed the support notes on My Oracle Support you’d definitely come to that conclusion. Part of that confusion is unfortunately FUD, spread by storage vendors who do not (yet) support Advanced Format and therefore benefit from scaring customers away from it. Be under no illusions, with the likes of Western DigitalHGST and Seagate all signed up to Advanced Format, plus Violin Memory and EMC’s XtremIO both using it, it’s something you should embrace rather than avoid.

However, to try and lessen the opportunity for those competitors to point and say “Look how complicated it is!”, I’ve split my previous knowledge repository into two: a high-level page and an Oracle on 4k deep dive. It’s taken me years to work all this stuff out – and days to write it all down, so I sincerely hope it saves someone else a lot of time and effort…!

Advanced Format with 4k Sectors

Advanced Format: Oracle on 4k Deep Dive

New installation cookbook for SUSE Linux Enterprise Server 11 SP3

Exactly what it says on the tin, I’ve added a new installation cookbook for SUSE 11 SP3 which creates Violin on a set of 4k devices.

I’ve started setting the add_random tunable of the noop I/O scheduler because it seems to give a boost in performance during benchmarking runs. If I can find the time, I will blog about this at some point…

For more details read this document from Red Hat.

Postcards from Storageland: Two Years Flash By


The start of March means I have been working at Violin Memory for exactly two years. This also corresponds to exactly two years of the flashdba blog, so I thought I’d take stock and look at what’s happened since I embarked on my journey into the world of storage. Quite a lot, as it happens…

Flash Is No Longer The Future

The single biggest difference between now and the world of storage I entered two years ago is that flash memory is no longer an unknown. In early 2012 I used to visit customers and talk about one thing: disk. I would tell them about the mechanical limitations of disk, about random versus sequential I/O, about how disk was holding them back. Sure I would discuss flash too – I’d attempt to illustrate the enormous benefits it brings in terms of performance, cost and environmental benefits (power, cooling, real estate etc)… but it was all described in relation to disk. Flash was a future technology and I was attempting to bring it into the present.

Today, we hardly ever talk disk. Everyone knows the limitations of mechanical storage and very few customers ever compare us against disk-based solutions. These days customers are buying flash systems and the only choice they want to make is over which vendor to use.

Violin Memory 2.0

The storage industry is awash with people who use “personal” blogs as a corporate marketing mouthpiece to trumpet their products and trash the competition. I always avoid that, because I think it’s insulting to the reader’s intelligence; the point of blogging is to share knowledge and personal opinion. I also try to avoid talking about corporate topics such as roadmap, financial performance, management changes etc. But if I wrote an article looking back at two years of Violin Memory and didn’t even mention the IPO, all credibility would be gone.

So let me be honest [please note the disclaimer, these are my personal opinions), the Violin Memory journey over the last couple of years has been pretty crazy. We have such a great product – and the flash market is such a great opportunity – that the wave of negative press last year came as a surprise to me. I guess that shows some naivety on my part for forgetting that product and opportunity are only two pieces of the puzzle, but all the same what I read in the news didn’t seem to correspond to what I saw in my day job as we successfully competed for business around Europe. I had customers who had not just improved but transformed their businesses by using Violin. That had to be a good sign, right?

Now here we are in 2014 and, despite some changes, Violin continues to develop as a company under the guidance of an experienced new CEO. I’m still doing what I love, which is travelling around Europe (in the last month alone I’ve been in the UK, France, Switzerland, Turkey and Germany) meeting exciting new customers and competing against the biggest names in storage (see below). In a world where things change all the time, I’m happy to say this is one thing that remains constant.

The Competition

Now for the juicy bit. Part of the reason I was invited to join Violin was to compete against my former employer’s Exadata product – something I have been doing ever since. However, in those heady days of 2011 it also appeared that Fusion IO would be the big competitor in the flash space. Meanwhile, at that time, none of the big boys had flash array products of note. EMC, IBM, NetApp, Cisco, Dell… nothing. The only one who did was HP, who were busy reselling a product called VMA – yes that’s right, the Violin Memory Array – despite having recently paid $2.4b for 3PAR.

Then everything seemed to happen at once. EMC paid an astonishing amount of money for the “pre-product” XtremIO, which took 18 months to achieve general availability. IBM bought the struggling Texas Memory Systems. HP decided to focus on 3PAR over Violin. Cisco surprised everyone by buying Whiptail (including themselves, apparently). And NetApp finally admitted that their strategy of ignoring flash arrays may not have been such a good idea.

That’s the market, but what have I seen? I regularly compete against other flash vendors in the EMEA region – and don’t forget, I only get involved if it’s an Oracle database solution under consideration. The Oracle deals tend to be the largest by size and occupy a space which you could clearly describe as “enterprise class” – I rarely get involved in midrange or small business-sized deals.

The truth is I see the same thing pretty much every time: I compete against EMC and IBM, or I compete against Oracle Exadata. I’ve never seen Fusion IO in a deal – which is not surprising because their cards and our arrays tend to be solutions for different problems. However, I’ve also never – ever – seen Pure Storage in a competitive situation on one of my accounts, nor Nimbus, Nimble, SolidFire, Kaminario or Skyera. I’ve seen Whiptail, HDS and Huawei maybe once each; HP probably a few more times. But when it comes down to the final bake off, it’s EMC, IBM or Exadata. I claim that my experience is representative of the market, but it is real.

Who is the biggest threat? That’s an easy one to answer: it’s always the incumbent storage supplier. No matter how great a solution you have, or how low a price, it’s always easier for a customer to do nothing than to make a change. Inertia is our biggest competitor. Yet at the same time the incumbent has the biggest problem: so much to lose.

And how am I doing in these competitions? Well, that would be telling. But look at it this way – two years on and I’m still trusted to do this.

I wonder what the next two years will bring?

Update (Spring 2014): I’ve finally had my first ever competitive POC against Pure Storage at a customer in Germany. It would be inappropriate for me to say who won. :-)


Get every new post delivered to your Inbox.

Join 759 other followers