Understanding Flash: What Is NAND Flash?


In the early 1980s, before we ever had such wondrous things as cell phones, tablets or digital cameras, a scientist named Dr Fujio Masuoka was working for Toshiba in Japan on the limitations of EPROM and EEPROM chips. An EPROM (Erasable Programmable Read Only Memory) is a type of memory chip that, unlike RAM for example, does not lose its data when the power supply is lost – in the technical jargon it is non-volatile. It does this by storing data in “cells” comprising floating-gate transistors. I could start talking about Fowler-Nordheim tunnelling and hot-carrier injection at this point, but I’m going to stop here in case one of us loses the will to live. (But if you are the sort of person who wants to know more, I can highly recommend this page, accompanied by some strong coffee.)

Anyway, EPROMs could have data loaded into them (known as programming), but this data could also be erased through the use of ultra-violet light so that new data could be written. This cycle of programming and erasing is known as the program erase cycle (or PE Cycle) and is important because it can only happen a limited number of times per device… but that’s a topic for another post. However, while the reprogrammable nature of EPROMs was useful in laboratories, it was not a solution for packaging into consumer electronics – after all, including an ultra-violet light source in a device would make it cumbersome and commercially non-viable.

US Patent US4531203: Semiconductor memory device and method for manufacturing the same

A subsequent development, known as the EEPROM, could be erased through the application of an electric field, rather than through the use of light, which was clearly advantageous as this could now easily take place inside a packaged product. Unlike EPROMs, EEPROMs could also erase and program individual bytes rather than the entire chip. However, the EEPROMs came with a disadvantage too: every cell required at least two transistors instead of the single transistor required in an EPROM. In other words, they stored less data: they had lower density.

The Arrival of Flash

So EPROMs had better density while EEPROMs had the ability to electrically reprogram cells. What if a new method could be found to incorporate both benefits without their associated weaknesses? Dr Masuoka’s idea, submitted as US patent 4612212 in 1981 and granted four years later, did exactly that. It used only one transistor per cell (increasing density, i.e. the amount of data it could store) and still allowed for electrical reprogramming.

If you made it this far, here’s the important bit. The new design achieved this goal by only allowing cells to be erased and programmed in groups, rather than individually. This not only gives the density benefits of EPROM and the electrically-reprogrammable benefits of EEPROM, it also results in faster access times: it takes less time to issue a single command for programming or erasing a large number of cells than it does to issue one per cell.

However, the number of cells affected by a single erase operation is different – and much larger – than the number of cells affected by a single program operation. And it is this fact, above all else, that results in the behaviour we see from devices built on flash memory. In the next post we will look at exactly what happens when program and erase operations take place, before moving on to look at the types of flash available (SLC, MLC etc) and their behaviour.

NAND and NOR

To try and keep this post manageable I’ve chosen to completely bypass the whole topic of NOR flash and just tell you that from this moment on we are talking about NAND flash, which is what you will find in SSDs, flash cards and arrays. It’s a cop out, I know – but if you really want to understand the difference then other people can describe it better than me.

In the meantime, we all have our good friend Dr Masuoka to thank for the flash memory that allows us to carry around the phones and tablets in our pockets and the SD cards in our digital cameras. Incidentally, popular legend has it that the name “flash” came from one of Dr Masuoka’s colleagues because the process of erasing data reminded him of the flash of a camera. Presumably it was an analogue camera, because digital cameras only became popular in the 1990s after the commoditisation of a new, solid-state storage technology called …

 

Understanding Disk: Caching and Tiering

 


When I was a child, about four or five years old, my dad told me a joke. It wasn’t a very funny joke, but it stuck in my mind because of what happened next. The joke went like this:

Dad: “What’s big at the bottom, small at the top and has ears?”

Me: “I don’t know?”

Dad: “A mountain!”

Me: “Er…<puzzled>…  What about the ears?”

Dad: (Triumphantly) “Haven’t you heard of mountaineers?!”

So as I say, not very funny. But, by a twist of fate, the following week at primary school my teacher happened to say, “Now then children, does anybody know any jokes they’d like to share?”. My hand shot up so fast that I was immediately given the chance to bring the house down with my new comedy routine. “What’s big at the bottom, small at the top and has ears?” I said, with barely repressed glee. “I don’t know”, murmured the teacher and other children expectantly. “A mountain!”, I replied.

Silence. Awkwardness. Tumbleweed. Someone may have said “Duuh!” under their breath. Then the teacher looked faintly annoyed and said, “That’s not really how a joke works…” before waving away my attempts to explain and moving on to hear someone else’s (successful and funny) joke. Why had it been such a disaster? The joke had worked perfectly on the previous occasion, so why didn’t it work this time?

There are two reasons I tell this story: firstly because, as you can probably tell, I am scarred for life by what happened. And secondly, because it highlights what happens when you assume you can predict the unpredictable.*

Gambling With Performance

So far in this mini-series on Understanding Disk we’ve covered the design of hard drives, their mechanical limitations and some of the compromises that have to be made in order to achieve acceptable performance. The topic of this post is more about bandaids: the sticking plasters that users of disk arrays have to employ to try and cover up their performance problems. Or as my boss likes to call it, lipstick on a pig.

If you currently use an enterprise disk array the chances are it has some sort of DRAM cache within the array. Blocks stored in this cache can be read at a much lower latency than those residing only on disk, because the operation avoids paying the price of seek time and rotational latency. If the cache is battery-backed, it can be used to accelerate writes too. But DRAM caches in storage area networks are notoriously expensive in relation to their size, which is often significantly smaller than the size of the active data set. For this reason, many array vendors allow you to use SSDs as an additional layer of slower (but higher capacity) cache.

Another common approach to masking the performance of disk arrays is tiering. This is where different layers of performance are identified (e.g. SATA disk, fibre-channel disk, SSD, etc) and data is moved about according to its performance needs. Tiering can be performed manually, which requires a lot of management overhead, or automatically by software – perhaps the best-known example being EMC’s Fully Automated Storage Tiering (or “FAST”) product. Unlike caching, which creates a copy of the data somewhere temporary, tiering involves relocating the persistent copy of the data. This relocation has a performance penalty, particularly if data is being moved frequently. Moreover, some automatic tiering solutions can take 24 hours to respond to changes in access patterns – now that’s what I call bad latency.

The Best Predictor of Future Behaviour is Past Behaviour

The problem with automatic tiering is that, just like caching, it relies on past behaviour to predict the future. That principle works well in psychology, but isn’t always as successful in computing. It might be acceptable if your workload is consistent and predictable, but what happens when you run your end of month reporting? What happens when you want to run a large ad-hoc query? What happens when you tell a joke about mountains and expect everyone to ask “but what about the ears”? You end up looking pretty stupid, I can tell you.

I have no problem with caching or tiering in principle. After all, every computer system uses cache in multiple places: your CPUs probably have two or three levels of cache, your server is probably stuffed with DRAM and your Oracle database most likely has a large block buffer cache. What’s more, in my day job I have a lot of fun helping customers overcome the performance limitations of nasty old spinning disk arrays using Violin’s Maestro memory services product.

But ultimately, caching and tiering are bandaids. They reduce the probability of horrible disk latency but they cannot eliminate it. And like a gambler on a winning streak, if you become more accustomed to faster access times and put more of your data at risk, the impact when you strike out is felt much more deeply. The more you bet, the more you have to lose.

Shifting the Odds in Your Favour

I have a customer in the finance industry who doesn’t care (within reason) what latency their database sees from storage. All they care about is that their end users see the same consistent and sustained performance. It doesn’t have to be lightning fast, but it must not, ever, feel slower than “normal”. As soon as access times increase, their users’ perception of the system’s performance suffers… and the users abandon them to use rival products.

They considered high-end storage arrays but performance was woefully unpredictable, no matter how much cache and SSD they used. They considered Oracle Exadata but ruled it out because Exadata Flash Cache is still a cache – at some point a cache miss will mean fetching data from horrible, spinning disk. Now they use all flash arrays, because the word “all” means their data is always on flash: no gambling with performance.

Caching and tiering will always have some sort of place in the storage world. But never forget that you cannot always win – at some point (normally the worst possible time) you will need to access data from the slowest media used by your platform. Which is why I like all flash arrays: you have a 100% chance of your data being on flash. If I’m forced to gamble with performance, those are the odds I prefer…

* I know. It’s a tenuous excuse for telling this story, but on the bright side I feel a lot better for sharing it with you.

The Ultimate Guide To Oracle with Advanced Format 4k


It’s a brave thing, calling something the “Ultimate Guide To …” as it can leave you open to criticism that it’s anything but. However, this topic – of how Oracle runs on Advanced Format storage systems and which choices have which consequences – is one I’ve been learning for two years now, so this really is everything I know. And from my desperate searching of the internet, plus discussions with people who are usually much more knowledgeable than me, I’ve come to the conclusion that nobody else really understands it.

In fact, you could say that it’s a topic full of confusion – and if you browsed the support notes on My Oracle Support you’d definitely come to that conclusion. Part of that confusion is unfortunately FUD, spread by storage vendors who do not (yet) support Advanced Format and therefore benefit from scaring customers away from it. Be under no illusions: with the likes of Western Digital, HGST and Seagate all signed up to Advanced Format, plus Violin Memory and EMC’s XtremIO both using it, it’s something you should embrace rather than avoid.

However, to try and lessen the opportunity for those competitors to point and say “Look how complicated it is!”, I’ve split my previous knowledge repository into two: a high-level page and an Oracle on 4k deep dive. It’s taken me years to work all this stuff out – and days to write it all down, so I sincerely hope it saves someone else a lot of time and effort…!

Advanced Format with 4k Sectors

Advanced Format: Oracle on 4k Deep Dive

New installation cookbook for SUSE Linux Enterprise Server 11 SP3

Exactly what it says on the tin: I’ve added a new installation cookbook for SUSE 11 SP3 which sets up Violin storage using a set of 4k devices.

I’ve started setting the add_random tunable of the noop I/O scheduler because it seems to give a boost in performance during benchmarking runs. If I can find the time, I will blog about this at some point…

For more details read this document from Red Hat.
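For illustration, here is a minimal sketch of the sort of thing I mean – the device name sdb is just an example, and I am assuming the tunable is being set to 0 (i.e. disabled); on these kernels add_random lives under the device’s queue directory in sysfs:

# Sketch only: substitute the block devices backing your Violin LUNs
DEV=sdb

# make sure the noop I/O scheduler is active for this device
echo noop > /sys/block/${DEV}/queue/scheduler

# add_random controls whether this device's I/O timings feed the kernel
# entropy pool; writing 0 switches that bookkeeping off
cat /sys/block/${DEV}/queue/add_random
echo 0 > /sys/block/${DEV}/queue/add_random

Bear in mind that sysfs settings like these do not survive a reboot, so they belong in a boot-time script or udev rule.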

Postcards from Storageland: Two Years Flash By


The start of March means I have been working at Violin Memory for exactly two years. This also corresponds to exactly two years of the flashdba blog, so I thought I’d take stock and look at what’s happened since I embarked on my journey into the world of storage. Quite a lot, as it happens…

Flash Is No Longer The Future

The single biggest difference between now and the world of storage I entered two years ago is that flash memory is no longer an unknown. In early 2012 I used to visit customers and talk about one thing: disk. I would tell them about the mechanical limitations of disk, about random versus sequential I/O, about how disk was holding them back. Sure I would discuss flash too – I’d attempt to illustrate the enormous benefits it brings in terms of performance, cost and environmental impact (power, cooling, real estate etc)… but it was all described in relation to disk. Flash was a future technology and I was attempting to bring it into the present.

Today, we hardly ever talk disk. Everyone knows the limitations of mechanical storage and very few customers ever compare us against disk-based solutions. These days customers are buying flash systems and the only choice they want to make is over which vendor to use.

Violin Memory 2.0

The storage industry is awash with people who use “personal” blogs as a corporate marketing mouthpiece to trumpet their products and trash the competition. I always avoid that, because I think it’s insulting to the reader’s intelligence; the point of blogging is to share knowledge and personal opinion. I also try to avoid talking about corporate topics such as roadmap, financial performance, management changes etc. But if I wrote an article looking back at two years of Violin Memory and didn’t even mention the IPO, all credibility would be gone.

So let me be honest [please note the disclaimer, these are my personal opinions], the Violin Memory journey over the last couple of years has been pretty crazy. We have such a great product – and the flash market is such a great opportunity – that the wave of negative press last year came as a surprise to me. I guess that shows some naivety on my part for forgetting that product and opportunity are only two pieces of the puzzle, but all the same what I read in the news didn’t seem to correspond to what I saw in my day job as we successfully competed for business around Europe. I had customers who had not just improved but transformed their businesses by using Violin. That had to be a good sign, right?

Now here we are in 2014 and, despite some changes, Violin continues to develop as a company under the guidance of an experienced new CEO. I’m still doing what I love, which is travelling around Europe (in the last month alone I’ve been in the UK, France, Switzerland, Turkey and Germany) meeting exciting new customers and competing against the biggest names in storage (see below). In a world where things change all the time, I’m happy to say this is one thing that remains constant.

The Competition

Now for the juicy bit. Part of the reason I was invited to join Violin was to compete against my former employer’s Exadata product – something I have been doing ever since. However, in those heady days of 2011 it also appeared that Fusion IO would be the big competitor in the flash space. Meanwhile, at that time, none of the big boys had flash array products of note. EMC, IBM, NetApp, Cisco, Dell… nothing. The only one who did was HP, who were busy reselling a product called VMA – yes that’s right, the Violin Memory Array – despite having recently paid $2.4b for 3PAR.

Then everything seemed to happen at once. EMC paid an astonishing amount of money for the “pre-product” XtremIO, which took 18 months to achieve general availability. IBM bought the struggling Texas Memory Systems. HP decided to focus on 3PAR over Violin. Cisco surprised everyone by buying Whiptail (including themselves, apparently). And NetApp finally admitted that their strategy of ignoring flash arrays may not have been such a good idea.

That’s the market, but what have I seen? I regularly compete against other flash vendors in the EMEA region – and don’t forget, I only get involved if it’s an Oracle database solution under consideration. The Oracle deals tend to be the largest by size and occupy a space which you could clearly describe as “enterprise class” – I rarely get involved in midrange or small business-sized deals.

The truth is I see the same thing pretty much every time: I compete against EMC and IBM, or I compete against Oracle Exadata. I’ve never seen Fusion IO in a deal – which is not surprising because their cards and our arrays tend to be solutions for different problems. However, I’ve also never – ever – seen Pure Storage in a competitive situation on one of my accounts, nor Nimbus, Nimble, SolidFire, Kaminario or Skyera. I’ve seen Whiptail, HDS and Huawei maybe once each; HP probably a few more times. But when it comes down to the final bake off, it’s EMC, IBM or Exadata. I won’t claim that my experience is representative of the market, but it is real.

Who is the biggest threat? That’s an easy one to answer: it’s always the incumbent storage supplier. No matter how great a solution you have, or how low a price, it’s always easier for a customer to do nothing than to make a change. Inertia is our biggest competitor. Yet at the same time the incumbent has the biggest problem: so much to lose.

And how am I doing in these competitions? Well, that would be telling. But look at it this way – two years on and I’m still trusted to do this.

I wonder what the next two years will bring?

Update (Spring 2014): I’ve finally had my first ever competitive POC against Pure Storage at a customer in Germany. It would be inappropriate for me to say who won. 🙂

More Problems with Oracle’s Support of 4k Devices

This is going to be another one of those posts, a bit like this one, that discuss the use of Oracle’s database product with Advanced Format devices. I wish there weren’t so many of these posts, but it seems that Oracle has a lot of issues with its implementation of 4k support.

(Before reading on, if you aren’t sure what I’m talking about here then please have a read of this page…)

In the last post I built a database which used Oracle ASM (and the Linux ASMLib kernel driver) but found that if the database used an SPFILE which was located on a 4k device (within an ASM diskgroup) it didn’t work. Today, I’m going to forego ASM and use a filesystem instead (something I would never do in real life).

Building a 4k Filesystem

Let’s start with a single 4k LUN being presented from my Violin array. I’ve already configured the Linux device mapper multipathing so that it presents itself as a nicely-named device in the /dev/mapper directory:

[oracle@half-server4 ~]$ ls -l /dev/mapper/fs4ktest 
lrwxrwxrwx 1 root root 7 Feb 25 15:53 /dev/mapper/fs4ktest -> ../dm-7
[oracle@half-server4 ~]$ fdisk -l /dev/mapper/fs4ktest

Note: sector size is 4096 (not 512)

Disk /dev/mapper/fs4ktest: 215.8 GB, 215822106624 bytes
255 heads, 63 sectors/track, 3279 cylinders
Units = cylinders of 16065 * 4096 = 65802240 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes

We can see that this is indeed a 4k device, i.e. it has not only a 4096 byte physical blocksize, but a 4096 byte logical blocksize too. The fdisk command has even taken the time to print a special “Note” to ensure we see the sector size is not the usual 512 bytes. The next thing to do is format it with a filesystem so I’m going to use ext4:

[root@half-server4 ~]# mkfs.ext4 /dev/mapper/fs4ktest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=128 blocks
2097152 inodes, 8388608 blocks
419430 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
256 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Now it needs to be mounted. I’m just going to stick it on a mount point in a new top-level directory called /fstest:

[root@half-server4 ~]# mkdir -p /fstest/fs4ktest
[root@half-server4 ~]# chown -R oracle:oinstall /fstest
[root@half-server4 ~]# mount /dev/mapper/fs4ktest /fstest/fs4ktest

Finally, I am going to create an Oracle database using this filesystem. I’m not going to cut and paste all the output for that, because it’s all a bit dull… so let’s just skip to the bit where DBCA has completed and the database is open.

Oracle On A 4k Filesystem

So the good news is, it worked. The database is up and running and the datafiles are located on the 4k filesystem:

[oracle@half-server4 fstest]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Mon Mar 3 21:30:02 2014
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select name from v$datafile;

NAME
---------------------------------------------------------------------------------------
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_system_9k9wxmw6_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_sysaux_9k9ww6r6_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_undotbs1_9k9wz257_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_users_9k9wz123_.dbf

Cool. We can end this post here then, right? Well, no… because there is a bit of a problem with this database. Let’s just have a quick check of the FILESYSTEMIO_OPTIONS parameter:

SQL> show parameter filesystem

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options		     string	 none

This parameter controls the way that I/O is performed for files located on filesystems. It isn’t relevant for databases using Oracle ASM (for which the DISK_ASYNCH_IO parameter exists instead), but here it’s making a massive difference. According to the Oracle documentation, it has four possible options:

  • ASYNCH: enable asynchronous I/O on file system files, which has no timing requirement for transmission.
  • DIRECTIO: enable direct I/O on file system files, which bypasses the buffer cache.
  • SETALL: enable both asynchronous and direct I/O on file system files.
  • NONE: disable both asynchronous and direct I/O on file system files.

Normally, when I see filesystem-based databases, I find this parameter set to SETALL. This means asynchronous and direct I/O, but here it is set to NONE which means neither. And it’s the DIRECTIO that we are interested in.

One Buffer Cache Is Enough

As you are no doubt aware, Oracle databases have a buffer cache which is used to cache copies of database blocks. However, the Linux operating system also has its own buffer cache for filesystems. Most people would consider it ineffective to use two levels of cache – and if that is the case, it will obviously be the Oracle buffer cache that needs to be used. So let’s set the parameter to use direct I/O and then restart the database (as the parameter is not dynamic):

SQL> alter system set filesystemio_options='directIO' scope=spfile;

System altered.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 1.3896E+10 bytes
Fixed Size		    4663568 bytes
Variable Size		 2751465200 bytes
Database Buffers	 1.1107E+10 bytes
Redo Buffers		   33673216 bytes
ORA-00205: error in identifying control file, check alert log for more info

Oh dear. What happened?

[oracle@half-server4 ~]$ tail /u01/app/oracle/diag/rdbms/fstest/fstest/trace/alert_fstest.log 
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/fstest/fs4ktest/oracle/oradata/FSTEST/controlfile/o1_mf_9k9wzp31_.ctl'
ORA-27047: unable to read the header block of file
Linux-x86_64 Error: 22: Invalid argument
Additional information: 1
ORA-205 signalled during: ALTER DATABASE   MOUNT...

The answer, which you can find in My Oracle Support note 1133713.1, is that Oracle does not support 4k devices with direct I/O. This has been the case for a long time – I remember first discovering this nearly two years ago, on 11.2.0.2, yet there is no sign of it being fixed. According to the note, “It is not yet known in which version this support will be available.” Pah.
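You can see the same restriction at the operating system level, outside of Oracle: direct I/O requests must be sized and aligned to the device’s logical sector size, so a 512-byte O_DIRECT read from a 4k-logical device is rejected with EINVAL – the same “Linux-x86_64 Error: 22: Invalid argument” seen in the alert log above. A minimal sketch using dd against the same device, purely by way of illustration:

# a 512-byte direct read is smaller than the 4096-byte logical sector,
# so this fails with "Invalid argument" (EINVAL)
dd if=/dev/mapper/fs4ktest of=/dev/null bs=512 count=1 iflag=direct

# a 4096-byte direct read is correctly sized and aligned, so this works
dd if=/dev/mapper/fs4ktest of=/dev/null bs=4096 count=1 iflag=direct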

There’s More: Diagnostic Destination on 4k

And then there’s the diagnostic destination. How about if I choose to locate this on a 4k filesystem?

SQL> show parameter diagnostic_dest

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
diagnostic_dest 		     string	 /u01/app/oracle

SQL> alter system set diagnostic_dest='/fstest/fs4ktest/oracle' scope=spfile;

System altered.

I’ll give it a few minutes and then go and look in some of the files… guess what I see?

ORA-48101: error encountered when attempting to read a file [block] [/fstest/fs4ktest/oracle/diag/rdbms/fstest/fstest/metadata/INCIDENT.ams] [0]
ORA-27072: File I/O error
Linux-x86_64 Error: 22: Invalid argument
Additional information: 4
Additional information: 1
Additional information: -1

Look familiar? (This is nothing to do with direct I/O by the way, I disabled that again before this test.)

So let’s be honest, things aren’t going all that well here. There are still a lot of things that do not appear to work properly when using 4k devices. Luckily, my Violin array can present storage as 512 byte to avoid this sort of issue, but really I feel that Oracle needs to get cracking on its Advanced Format support. This is not just a flash memory thing: pretty much every major disk vendor is making Advanced Format devices now, from Western Digital through HGST to Seagate.

Time to get with the programme?

Oracle ASMLib: Physical and Logical Blocksize

This article is about the use of Advanced Format devices on Oracle’s ASMLib kernel library for Linux. For background, read this page on 4k sector sizes first, otherwise it might all sound like nonsense. Mind you, it might sound like nonsense anyway, I can’t guarantee anything here. By the way, a big hello to my buddy Nate who asked for this information: you rock, dude.

In more recent versions of ASMLib, Oracle introduced a new parameter into the /etc/sysconfig/oracleasm file:

[root@half-server4 mapper]# tail -5 /etc/sysconfig/oracleasm
# ORACLEASM_USE_LOGICAL_BLOCK_SIZE: 'true' means use the logical block size
# reported by the underlying disk instead of the physical. The default
# is 'false'
ORACLEASM_USE_LOGICAL_BLOCK_SIZE=false

To understand what this parameter does, consider this device which I am presenting from a Violin array:

[root@half-server4 ~]# ls -l /dev/mapper/testlun
lrwxrwxrwx 1 root root 8 Feb 27 15:33 /dev/mapper/testlun -> ../dm-19
[root@half-server4 ~]# fdisk -l /dev/mapper/testlun

Disk /dev/mapper/testlun: 34.4 GB, 34359738368 bytes
255 heads, 63 sectors/track, 4177 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes

The important bit is the “Sector size” line. This device has a 4k physical blocksize (as all Violin devices do, as well as many other modern storage systems) but has a 512 byte logical blocksize. Essentially, this LUN is pretending to be a 512 byte device.

Now that’s all well and good. Operating systems and applications that cannot support 4k block sizes (e.g. Red Hat 5 and Oracle Linux 5) will happily use this, because they believe it to be 512 byte. But later versions of ASMLib have started being too clever for their own good.
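(As an aside, if you want to check a device’s two block sizes without wading through fdisk output, blockdev can report them directly; a quick sketch against the same device:)

blockdev --getss   /dev/mapper/testlun    # logical sector size  - expect 512 here
blockdev --getpbsz /dev/mapper/testlun    # physical block size  - expect 4096 here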

Don’t Look Behind The Curtain

Let’s create an ASMLib label on this device:

[root@half-server4 ~]# oracleasm createdisk TESTLUN /dev/mapper/testlun 
Writing disk header: done
Instantiating disk: done

And now we can attempt to put an ASM diskgroup on it:

SQL> CREATE DISKGROUP TEST EXTERNAL REDUNDANCY
DISK 'ORCL:TESTLUN'
ATTRIBUTE
     'sector_size'='512',
     'compatible.asm' = '11.2',
     'compatible.rdbms' = '11.2';  
CREATE DISKGROUP TEST EXTERNAL REDUNDANCY
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15038: disk '' mismatch on 'Sector Size' with target disk group [4096]
[512]

What happened? Well, ASMLib has looked behind the smoke and mirrors and decided that this is actually a 4k device. It’s therefore presenting this to Oracle ASM as 4k, which causes the problem (because I explicitly asked for sector size to be 512 byte on this diskgroup).

One possible solution is to change the ASM_DISKSTRING from its default value of NULL (meaning ‘ORCL:*’) to ‘/dev/oracleasm/disks/*’, i.e. the location where ASMLib creates its own block devices. We can test this theory with fdisk:

[oracle@half-server4 ~]$ ls -l /dev/oracleasm/disks/TESTLUN 
brw-rw---- 1 oracle dba 252, 19 Feb 27 15:38 /dev/oracleasm/disks/TESTLUN
[oracle@half-server4 ~]$ fdisk -l /dev/oracleasm/disks/TESTLUN | grep "Sector size"
Sector size (logical/physical): 512 bytes / 4096 bytes

So that would work. But it would lose many of the claimed benefits of ASMLib such as reduced file descriptors and context switching. Also, it feels like a hack.
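For completeness, here’s a minimal sketch of that workaround – shown only to illustrate the point, not because I recommend it. I’m assuming the ASM instance uses an SPFILE; the discovery string is changed on the ASM instance itself:

# Sketch only: point ASM discovery at the ASMLib-created block devices
# instead of the default 'ORCL:*' search string
export ORACLE_SID=+ASM
sqlplus / as sysasm <<EOF
alter system set asm_diskstring='/dev/oracleasm/disks/*' scope=both;
EOF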

Setting ORACLEASM_USE_LOGICAL_BLOCK_SIZE

The answer, as you probably guessed, is to set this new parameter. It defaults, wrongly in my opinion, to using the physical block size. We can either edit the value in the file to be true in order to use the logical blocksize, or preferably use the oracleasm configure command:

[root@half-server4 ~]# oracleasm configure -b
Writing Oracle ASM library driver configuration: done
[root@half-server4 ~]# oracleasm configure | grep ORACLEASM_USE_LOGICAL_BLOCK_SIZE
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="true"

It can be set back to using the physical blocksize with the following command:

[root@half-server4 ~]# oracleasm configure -p
Writing Oracle ASM library driver configuration: done
[root@half-server4 ~]# oracleasm configure | grep ORACLEASM_USE_LOGICAL_BLOCK_SIZE
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"

Finally, a word of warning. If you are like me, then you are a bit stupid and can’t follow instructions properly. I set the value of the parameter to TRUE in upper case and then spent hours wondering why it didn’t work. The answer, to my embarrassment, is that it’s case sensitive. TRUE is not a valid value so it defaults to false. Doh.

Oracle AWR Reports: Understanding I/O Statistics


One consequence of my job is that I spend a lot of time looking at Oracle Automatic Workload Repository reports, specifically at information about I/O. I really do mean a lot of time (honestly, I’m not kidding, I have had dreams about AWR reports). One thing that comes up very frequently is the confusion relating to how the measurements of IOPS and throughput are displayed in the AWR report Load Profile section. The answer is that they aren’t. Well, not exactly… let me explain.

Physical Read and Write I/O

Right at the top of an AWR report, just after the Database and Host details, you’ll find the familiar Load Profile section. Until recently, it had changed very little through the releases of Oracle since its introduction in 10g. Here’s a sample from 11g Release 2:

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               44.1                0.4       0.07       1.56
       DB CPU(s):                1.6                0.0       0.00       0.06
       Redo size:      154,034,644.3        1,544,561.0
    Logical read:          154,436.1            1,548.6
   Block changes:           82,491.9              827.2
  Physical reads:              150.6                1.5
 Physical writes:           18,135.2              181.9
      User calls:               28.3                0.3
          Parses:              142.7                1.4
     Hard parses:                7.5                0.1
W/A MB processed:                2.1                0.0
          Logons:                0.1                0.0
        Executes:              607.7                6.1
       Rollbacks:                0.0                0.0
    Transactions:               99.7

In my role I have to look at the amount of I/O being driven by a database, so I can size a solution based on flash memory. This means knowing two specific metrics: the number of I/Os per second (IOPS) and the throughput (typically measured in MB/sec). I need to know these values for both read and write I/O so that I can understand the ratio. I also want to understand things like the amount of random versus sequential I/O, but that’s beyond the scope of this post.

The first thing to understand is that none of this information is shown above. There are values for Physical reads and Physical writes but these are actually measured in database blocks. Even if we knew the block size (which we don’t because Oracle databases can have multiple block sizes) we do not know how many I/Os were required. Ten Oracle blocks could be written in one sequential I/O or ten individual “random” I/Os, completely changing the IOPS measurement. To find any of this information we have to descend into the depths of the AWR report to find the Instance Activity Stats section.

In Oracle 12c, the format of the AWR report changed, especially the AWR Load Profile section, which was modified to show the units that each measurement uses. It also includes some new lines such as Read/Write IO Requests and Read/Write IO. Here’s a sample from a 12c database (taken during a 30 second run of SLOB):

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):              44.1               0.4      0.07      1.56
              DB CPU(s):               1.6               0.0      0.00      0.06
      Redo size (bytes):     154,034,644.3       1,544,561.0
  Logical read (blocks):         154,436.1           1,548.6
          Block changes:          82,491.9             827.2
 Physical read (blocks):             150.6               1.5
Physical write (blocks):          18,135.2             181.9
       Read IO requests:             150.3               1.5
      Write IO requests:          15,096.9             151.4
           Read IO (MB):               1.2               0.0
          Write IO (MB):             141.7               1.4
             User calls:              28.3               0.3
           Parses (SQL):             142.7               1.4
      Hard parses (SQL):               7.5               0.1
     SQL Work Area (MB):               2.1               0.0
                 Logons:               0.1               0.0
         Executes (SQL):             607.7               6.1
              Rollbacks:               0.0               0.0
           Transactions:              99.7

Now, you might be forgiven for thinking that the Read/Write IO requests and Read/Write IO (MB) values above tell me the very IOPS and throughput information I need. If this were the case, we could say that this system performed 150 physical read IOPS and 15k write IOPS, with throughput of 1.2 MB/sec reads and 141.7 MB/sec writes. Right?

But that isn’t the case – and to understand why, we need to page down five thousand times through the increasingly-verbose AWR report until we eventually find the Other Instance Activity Stats section (or just Instance Activity Stats in pre-12c reports) and see this information (edited for brevity):

Other Instance Activity Stats                  DB/Inst: ORCL/orcl  Snaps: 7-8
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                     5,123          150.3           1.5
physical read bytes                      42,049,536    1,233,739.3      12,371.2
physical read total IO requests              37,162        1,090.3          10.9
physical read total bytes            23,001,900,544  674,878,987.9   6,767,255.2
physical read total multi block              21,741          637.9           6.4
....
physical write IO requests                  514,547       15,096.9         151.4
physical write bytes                  5,063,483,392  148,563,312.9   1,489,698.0
physical write total IO requests            537,251       15,763.0         158.1
physical write total bytes           18,251,309,056  535,495,967.4   5,369,611.4
physical write total multi block             18,152          532.6           5.3

The physical read/write IO requests and bytes figures match up with those above, albeit with the throughput values using different units of bytes/sec instead of MB/sec. But the problem is, these aren’t the “total” values shown beneath them. So what are those total values showing us?

Over to the Oracle Database 12c Reference Guide:

physical read IO requests:  Number of read requests for application activity (mainly buffer cache and direct load operation) which read one or more database blocks per request. This is a subset of “physical read total IO requests” statistic.

physical read total IO requests: Number of read requests which read one or more database blocks for all instance activity including application, backup and recovery, and other utilities. The difference between this value and “physical read total multi block requests” gives the total number of single block read requests.

The values that don’t have the word total in them, i.e. the values shown in the AWR Profile section at the start of a report, are only showing what Oracle describes as “application activity“. That’s all very well, but it’s meaningless if you want to know how much your database is driving your storage. This is why the values with total in the name are the ones you should consider. And in the case of my sample report above, there is a massive discrepancy between the two: for example, the read throughput value for application activity is just 1.2 MB/sec while the total value is actually 644 MB/sec – over 500x higher! That extra non-application activity is definitely worth knowing about. (In fact, I was running a highly parallelised RMAN backup during the test just to make the point…)
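If you want to check that arithmetic for yourself, the conversion from the per-second byte counters is trivial (1 MB = 1,048,576 bytes); a quick sketch using the figures from the report above:

# bytes per second -> MB per second, using the values shown above
awk 'BEGIN { printf "%.1f\n", 674878987.9 / 1048576 }'    # physical read total bytes  = 643.6 MB/sec
awk 'BEGIN { printf "%.1f\n", 1233739.3 / 1048576 }'      # physical read bytes        = 1.2 MB/sec
awk 'BEGIN { printf "%.1f\n", 535495967.4 / 1048576 }'    # physical write total bytes = 510.7 MB/sec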

[Note: There was another section here detailing how to find and include the I/O generated by redo into the totals, but after consultation with guru and legend Tanel Poder it’s come to my attention that this is incorrect. In fact, reads and writes to redo logs are included in the physical read/write total statistics…]

Oracle 12c IO Profile Section

Luckily, Oracle 12c now has a new section which presents all the information in one table. Here’s a sample extracted from the same report as above:

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:          16,853.4         1,090.3        15,763.0
         Database Requests:          15,247.2           150.3        15,096.9
        Optimized Requests:               0.1             0.0             0.0
             Redo Requests:             517.5             1.2           516.3
                Total (MB):           1,154.3           643.6           510.7
             Database (MB):             142.9             1.2           141.7
      Optimized Total (MB):               0.0             0.0             0.0
                 Redo (MB):             295.7             0.0           295.7
         Database (blocks):          18,285.8           150.6        18,135.2
 Via Buffer Cache (blocks):          18,282.1           150.0        18,132.0
           Direct (blocks):               3.7             0.6             3.1

Suddenly life is more simple. You want to know the total IOPS and throughput? It’s all in one place. You want to calculate the ratio of reads to writes? Just compare the read and write columns. Happy days.

One word of warning though: there are other database processes driving I/O which may not be tracked in these statistics. I see no evidence for control file reads and writes being shown, although these are insignificant in magnitude. More significant would be I/O from the archiver process for databases running in archive log mode, as each redo log must be sequentially read and re-written out as an archive log. Are these included? Yet another possibility would be the Recovery Writer (RVWR) process which is responsible for writing flashback logs when database flashback logging is enabled. [Discussions with Jonathan Lewis suggest these stats are all included – and let’s face it, he wrote the book on the subject…!]  It all adds up… Oracle really needs to provide better clarity on what these statistics are measuring.

Conclusion

If you want to know how much I/O is being driven by your database, do not use the information in the Load Profile section of an AWR report. Use the I/O Profile section if available, or otherwise skip to the Instance Activity Stats section and look at the total values for physical reads and writes (and redo). Everything else is just lies, damned lies and (I/O) statistics.
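Incidentally, if you don’t want to generate an AWR report at all, the same cumulative counters are exposed in V$SYSSTAT. A minimal sketch – remember these are totals since instance startup, so you need to sample twice and diff the values to get per-second rates:

sqlplus -s / as sysdba <<EOF
set pagesize 100 linesize 120
column name format a40
select name, value
from   v\$sysstat
where  name in ('physical read total IO requests',
                'physical read total bytes',
                'physical write total IO requests',
                'physical write total bytes');
EOF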

New script for setting up multipath.conf entries

I’ve added a new script to the Useful Scripts page called setup-violin-mpath.sh which automates the process of creating entries for the /etc/multipath.conf file on Red Hat 6 / Oracle Linux 6.

As the name suggests, I wrote it with Violin devices in mind, but there should be overlap with other storage which will potentially make it useful elsewhere… see here for more details.
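To give a flavour of what the script generates, here’s a sketch of the kind of stanza that ends up inside the multipaths {} section of /etc/multipath.conf – the WWID and alias below are placeholders, not real values:

# Sketch only - one multipath {} entry per LUN, to be pasted inside the
# multipaths {} section of /etc/multipath.conf
WWID="3600000000000000000000000000000ab"    # placeholder - use the real WWID
ALIAS="violin_data01"                       # placeholder - pick a friendly name

cat <<EOF
        multipath {
                wwid    ${WWID}
                alias   ${ALIAS}
        }
EOF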

Oracle Fixes The 4k SPFILE Problem…But It’s Still Broken

As anyone familiar with the use of Oracle on Advanced Format storage devices will know to their cost, Oracle has had some difficulties implementing support of 4k devices. Officially, support for devices with a 4096 byte sector size was introduced in Oracle 11g Release 2 (see section 4.8.1.4 of the New Features Guide) but actually, if the truth be told, there were some holes.

(Before reading on, if you aren’t sure what I’m talking about here then please have a read of this page…)

I should say at this point that most 4k Advanced Format storage products have the ability to offer 512 byte emulation, which means any of the problems shown here can be avoided with very little effort (or performance overhead), but since 4096 byte devices are widely expected to take over, it would be nice if Oracle could tighten up some of the problems. After all, it’s not just flash memory devices that tend to be 4k-based: Toshiba, HGST, Seagate and Western Digital are all making hard disk drives that use Advanced Format too.

The SPFILE Problem in <11.2.0.4

Given that 4k devices are allegedly supported in Oracle 11g Release 2 you would think it would make sense that you can provision a load of 4k LUNs and then install the Oracle Grid Infrastructure and Database software on them. But no, in versions up to and including 11.2.0.3 this caused a problem with the SPFILE.

Here’s my sample system. I have 10 LUNs all with a physical and logical blocksize of 4k. I’m using Oracle’s ASMLib kernel driver to present them to ASM and the 4k logical and physical properties are preserved through into the ASMLib device too:

[root@half-server4 mapper]# fdisk -l /dev/mapper/slobdata1 | grep "Sector size"
Sector size (logical/physical): 4096 bytes / 4096 bytes
[root@half-server4 mapper]# oracleasm querydisk /dev/mapper/slobdata1
Device "/dev/mapper/slobdata1" is marked an ASM disk with the label "SLOBDATA1"
[root@half-server4 mapper]# fdisk -l /dev/oracleasm/disks/SLOBDATA1 | grep "Sector size"
Sector size (logical/physical): 4096 bytes / 4096 bytes

Next I’ve installed 11.2.0.3 Grid Infrastructure and created an ASM diskgroup on these LUNs. As you can see, Oracle has successfully spotted the devices are 4k and correspondingly set the ASM diskgroup sector size to 4096:

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N        4096   4096  1048576    737280   737207                0          737207              0             N  DATA/

Time to install the database. I’ll just fire up the 11.2.0.3 OUI and install the database software complete with a default database, choosing to locate the database files in this +DATA diskgroup. What could possibly go wrong?

[Screenshot: the 11.2.0.3 installer failing with an SPFILE-related error]

The installer gets as far as running the database configuration assistant and then crashes out with the message:

PRCR-1079 : Failed to start resource ora.orcl.db
CRS-5017: The resource action "ora.orcl.db start" encountered the following error:
ORA-01078: failure in processing system parameters
ORA-15081: failed to submit an I/O operation to a disk
ORA-27091: unable to queue I/O
ORA-17507: I/O request size 512 is not a multiple of logical block size
ORA-06512: at line 4
. For details refer to "(:CLSN00107:)" in "/home/OracleHome/product/11.2.0/grid/log/half-server4/agent/ohasd/oraagent_oracle/oraagent_oracle.log".

Why did this happen? The clue is in the ORA-17507 error message above. Over the years that this has been happening (and trust me, it’s been happening for far too long) various notes have appeared on My Oracle Support, such as 1578983.1 and 14626924.8. The cause is the following bug:

Bug 14626924  Not able to read spfile from ASM diskgroup and disk with sector size of 4096

At the time of writing, this bug is shown as fixed in 11.2.0.4 and the 12.2 forward code stream, with backports available for 11.2.0.2.0, 11.2.0.3.0, 11.2.0.3.7, 12.1.0.1.0 and 12.1.0.1.2. Alternatively, there is the simple workaround (documented in my Install Cookbooks) of placing the SPFILE in a non-4k location.
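For reference, the workaround itself is trivial; a sketch, assuming an instance called orcl and that you have a full parameter file to hand in /tmp (both of which are examples, not values from this post):

# Sketch only: write the SPFILE to the default $ORACLE_HOME/dbs location
# (local, 512-byte storage) instead of into the 4k ASM diskgroup. At startup
# spfile<SID>.ora is found before init<SID>.ora, so a stub init.ora pointing
# into ASM is simply ignored from then on.
export ORACLE_SID=orcl
sqlplus -s / as sysdba <<EOF
create spfile from pfile='/tmp/initorcl.ora';
EOF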

What the heck, I have the 11.2.0.4 binaries stored locally, so let’s fire it up and see the fixed SPFILE in action.

11.2.0.4 with 4k Devices

As with the previous example, I have a set of 4k LUNs presented via ASMLib – I won’t repeat the output from above as it’s identical. The ASM diskgroup correctly shows the sector size as 4096, so we’re ready to install the database software and let it create a database. As before the database files will be located in the diskgroup – including the SPFILE – but this time, it won’t fail because bug 14626924 is fixed in 11.2.0.4 right? Right?

[Screenshot: the 11.2.0.4 installer failing with an SPFILE-related error]

Oh dear. That’s not the error-free installation we were hoping for. Also, one of my pet hates, why are these Clusterware messages so incredibly unhelpful? Looking in the oraagent_oracle.log file we find the following nugget of information:

ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Process ID: 0
Session ID: 0 Serial number: 0

That’s not entirely useful either, so let’s try and fire up the database manually:

[oracle@half-server4 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Wed Feb 12 17:56:29 2014
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.
SQL> startup nomount
ORA-01078: failure in processing system parameters
ORA-17510: Attempt to do i/o beyond file size

Aha! The problem happens even attempting to start in NOMOUNT mode, so it seems likely this is related to reading the PFILE or SPFILE. Let’s just check to see what we have:

[oracle@half-server4 ~]$ ls -l $ORACLE_HOME/dbs/initorcl.ora
-rw-r----- 1 oracle oinstall 35 Feb 12 17:26 /u01/app/oracle/product/11.2.0/dbhome_1/dbs/initorcl.ora
[oracle@half-server4 ~]$ cat $ORACLE_HOME/dbs/initorcl.ora
SPFILE='+DATA/orcl/spfileorcl.ora'

Sure enough, we have an SPFILE located in the ASM diskgroup… and we cannot read it. Could it possibly be that even in 11.2.0.4 there are problems with SPFILEs being located in 4k devices? Searching My Oracle Support for ORA-17510 initially draws a blank. But a search of the Oracle bug database for the previous bug number (14626924) brings up some interesting new bugs:

Bug 16870214 : DB STARTUP FAILS WITH ORA-17510 IF SPFILE IS IN 4K SECTOR SIZE DISKGROUP

In the description of this bug, the following statement is made:

PROBLEM:
--------
ORA-17510: Attempt to do i/o beyond file size after applying patch 14626924 
TO READ SPFILE FROM ASM DISKGROUP AND DISK WITH SECTOR SIZE OF 4096 

DIAGNOSTIC ANALYSIS:
--------------------
1. create init file initvarial.ora in dbs directory with below
spfile='+DATA1/VARIAL/spfilevarial.ora'

2. startup pfile=/u01/app/oracle/product/11.2.0.3/db_1/dbs/initvarial.ora

SQL> startup pfile=/u01/app/oracle/product/11.2.0.3/db_1/dbs/initvarial.ora
ORA-17510: Attempt to do i/o beyond file size

This looks very similar. Unfortunately this bug is marked as a duplicate of base bug 18016679, which sadly is unpublished. All we know about it is that, at the time of writing, it isn’t fixed – the status of the duplicate is still “Waiting for the base bug fix“.

So there we have it. The infamous 4k SPFILE issue is fixed in 11.2.0.4 and replaced with something else that makes it equally unusable. For now, we’ll just have to keep those SPFILEs in 512 byte devices…

Update Feb 14th 2014

I kind of had a feeling that the above problem was in some way related to the use of ASMLib, so I thought I’d repeat the entire 11.2.0.4 install using normal block devices. Essentially this means changing the ASM discovery path from its default value of ‘ORCL:*’ to the path of the device mapper multipath devices, which in my case is ‘/dev/mapper/slob*’.

This time we don’t even get through the Grid Infrastructure installation, which fails while running the Oracle ASM Configuration Assistant (asmca), giving the following error messages:

[main] [ 2014-02-14 15:55:22.112 GMT ] [UsmcaLogger.logInfo:143]  CREATE DISKGROUP SQL: CREATE DISKGROUP DATA EXTERNAL REDUNDANCY  DISK '/dev/mapper/slobdata1', '/dev/mapper/slobdata2', '/dev/mapper/slobdata3', '/dev/mapper/slobdata4',
'/dev/mapper/slobdata5', '/dev/mapper/slobdata6', '/dev/mapper/slobdata7', '/dev/mapper/slobdata8'
ATTRIBUTE 'compatible.asm'='11.2.0.0.0','au_size'='1M'
[main] [ 2014-02-14 15:55:22.206 GMT ] [SQLEngine.done:2189]  Done called
[main] [ 2014-02-14 15:55:22.224 GMT ] [UsmcaLogger.logException:173]  SEVERE:method oracle.sysman.assistants.usmca.backend.USMDiskGroupManager:createDiskGroups
[main] [ 2014-02-14 15:55:22.224 GMT ] [UsmcaLogger.logException:174]  ORA-15018: diskgroup cannot be created
ORA-27061: waiting for async I/Os failed

It seems Oracle still has some way to go before this will work properly…