Understanding Flash: The Flash Translation Layer

A couple of posts ago in this series, I explained how a NAND flash die is comprised of planes, which contain blocks, which contain pages… which contain individual cells of data. Read operations take place at the page level, as do write operations (although we call them program operations in the flash world). But crucially, erase operations take place at the block level and so affect multiple pages.

Erases are also slow (at least relative to reads and writes) and cause wear of the flash media, gradually moving it closer to its end of life. It’s therefore the case that when you want to update an existing page of data it is faster, simpler and less damaging to simply write the updated information to an empty page. If you’re going to do that, you will probably also want to choose a new page somewhere completely different from the old one to ensure that your flash wears out evenly. And hey, don’t forget that the block containing the old page will need to be erased at some point before it can be reused.

So to make flash a friendly medium for storing our data, we need a mechanism which will:

  1. write updated information to a new empty page and then divert all subsequent read requests to its new address
  2. ensure that newly-programmed pages are evenly distributed across all of the available flash so that it wears evenly
  3. keep a list of all the old invalid pages so that at some point later on they can all be recycled ready for reuse

This mechanism is called the flash translation layer (FTL) and you will find it on all flash media if you look hard enough. The FTL has a number of responsibilities, so let’s look at them now.

Logical Block Mapping

Abstraction is everywhere in computing. The URL of this website is a logical address which maps to a physical address, i.e. the IP address. An IP address is in fact a logical address which maps to a physical address, i.e. the MAC address of a network interface. There are so many examples of abstraction in technology that sometimes I think the whole world of computing is just one massive layer of abstraction.

In storage, the idea of a logical block address (LBA) has been around since long before NAND flash and is primarily used to make addressing simpler and more flexible. Like all abstraction concepts it exists to make the (potentially complex) management of a low level system invisible to the higher levels that consume its services. For example, if a hard drive within a RAID group fails and the data it contained has to be rebuilt on a hot spare, the physical block address (PBA) of certain blocks of data will change. To avoid the need to notify all possible interested parties (e.g. applications, databases, etc.) of the new address, the extra layer of logical block addressing is used; the map from LBA to PBA is amended and nobody else needs to know.

In a flash system this same mechanism can be used for updates, so that when a page is considered invalid the logical block address can be remapped to the newly-programmed page. This provides the solution to number 1 in our list above in a way that is both simple and transparent to anyone issuing I/O requests.
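To make the remapping idea concrete, here is a minimal sketch of a logical-to-physical map. It is a toy, not how any real FTL is implemented: the class name `TinyFTL`, its fields and the page numbering are all invented for illustration, and it ignores blocks, erases and running out of free pages.

```python
class TinyFTL:
    """Toy logical-to-physical page map (hypothetical, for illustration only)."""

    def __init__(self, num_pages):
        self.l2p = {}                        # logical block address -> physical page
        self.free = list(range(num_pages))   # erased pages that are ready to program
        self.invalid = set()                 # stale pages waiting to be recycled
        self.flash = {}                      # physical page -> stored data

    def write(self, lba, data):
        new_page = self.free.pop(0)          # always program an empty page
        self.flash[new_page] = data
        old_page = self.l2p.get(lba)
        if old_page is not None:
            self.invalid.add(old_page)       # the old copy is stale but not yet erased
        self.l2p[lba] = new_page             # divert all future reads to the new page

    def read(self, lba):
        return self.flash[self.l2p[lba]]


ftl = TinyFTL(num_pages=8)
ftl.write(7, "v1")
ftl.write(7, "v2")                           # an update: same LBA, new physical page
print(ftl.read(7), ftl.l2p[7], sorted(ftl.invalid))  # v2 1 [0]
```

The caller only ever deals with LBA 7; the fact that the data moved from physical page 0 to page 1 stays hidden inside the map.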

Wear Levelling

On the face of it, wear levelling seems like a fairly simple method of handling number 2 in our list. You have a predefined number of flash blocks, each of which can be programmed and erased (known as the P/E cycle) a similar number of times before they are no longer usable. Clearly the object of any wear levelling algorithm is to smoothly distribute all P/E cycles so that the blocks all reach their limit at the same time. Without wear levelling it’s entirely possible that a subset of blocks could receive the majority of P/E cycles and thus wear out very quickly, reducing the available capacity of the system.

If all blocks in a system were regularly updated this would be no problem, because wear levelling would happen almost naturally as pages are marked invalid and then recycled. But here’s the problem: if we have some cold blocks, i.e. locations where the data never changes, then we have to take steps to manually relocate that data otherwise those blocks won’t ever wear… and that means we are actually adding write workload to the system, which ultimately means increasing the wear.

So to put it in simple terms, the more aggressive we are about wear levelling evenly, the more wear we cause. But not being aggressive enough could result in hot and cold spots as wear becomes more uneven. As always, it’s a question of finding the right balance. Or, if you prefer, finding the write balance.*
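The balance can be sketched with two small decisions. This is only an illustration under my own assumptions: the function names, the per-block erase counts and the threshold values are made up, and real wear levelling algorithms are considerably more involved.

```python
def pick_block_to_program(erase_counts, free_blocks):
    """Spread new writes: choose the least-worn block among the free ones."""
    return min(free_blocks, key=lambda block: erase_counts[block])


def needs_cold_data_move(erase_counts, threshold):
    """Only relocate cold data once the gap between the most and least worn
    blocks exceeds a threshold, because every relocation adds wear."""
    return max(erase_counts.values()) - min(erase_counts.values()) > threshold


# Hypothetical erase counts; block 1 holds cold data and has barely been erased.
erase_counts = {0: 120, 1: 15, 2: 118, 3: 119}

print(pick_block_to_program(erase_counts, free_blocks=[0, 2, 3]))  # 2
print(needs_cold_data_move(erase_counts, threshold=50))            # True
print(needs_cold_data_move(erase_counts, threshold=200))           # False
```

A low threshold is the aggressive setting (more relocation, more extra writes), while a high threshold tolerates uneven wear for longer.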

Garbage Collection

The third requirement in our list is a way of recycling pages that are no longer required (i.e. invalid) but have not yet been erased. Of course you cannot simply erase them at your leisure, because flash requires the entire containing block to be erased too. So instead it’s necessary to consider the remaining contents of the block and – if necessary – transparently move some active data elsewhere. Once the block has no remaining active data in it, it can be erased and then becomes ready to use again.

The thing is, the block in question might have 128 or 256 pages in it, many of which contain active data. You might have to go to a lot of effort in order to recycle just a small number of invalid pages – and just like with wear levelling, the act of moving data around in the background will have consequences to things like performance and endurance. Again it’s a question of finding the right balance.
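Here is a rough sketch of that trade-off, assuming a simple "greedy" policy that recycles whichever block needs the fewest pages copied. The policy, the function names and the block layout are my own assumptions for illustration; the post does not say which policy any particular product uses.

```python
def pick_victim(blocks):
    """Greedy choice: among blocks holding invalid pages, take the one with
    the fewest valid pages, since it is the cheapest to recycle."""
    candidates = [name for name, blk in blocks.items() if blk["invalid"] > 0]
    return min(candidates, key=lambda name: len(blocks[name]["valid"]))


def collect(blocks, victim, spare):
    """Copy the victim's valid pages to a spare block, then erase the victim.
    Returns how many pages had to be copied (extra background writes)."""
    moved = list(blocks[victim]["valid"])
    blocks[spare]["valid"].extend(moved)
    blocks[victim] = {"valid": [], "invalid": 0}   # erased: every page is free again
    return len(moved)


# Hypothetical blocks of four pages each.
blocks = {
    "A": {"valid": ["a1", "a2", "a3"], "invalid": 1},
    "B": {"valid": ["b1"], "invalid": 3},
    "spare": {"valid": [], "invalid": 0},
}

victim = pick_victim(blocks)
copied = collect(blocks, victim, "spare")
print(victim, copied)  # B 1
```

Recycling block B costs one page copy to reclaim three invalid pages. Recycling block A instead would have cost three copies to reclaim just one, which is the "lot of effort" case described above.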

Garbage Collection is such an important part of the management of flash that I’m going to devote a whole post to it next in the series. But for now, consider this: what happens if you fill up your trash faster than it gets taken away? You run out of space in your bins and everything starts to smell pretty bad. We definitely need to make sure that doesn’t happen here…

Write Amplification

Now you might have noticed that a lot of the processes described here result in additional operations taking place on the flash media. Wear Levelling can relocate inactive (cold) pages in order to ensure they wear evenly in comparison to active (hot) pages. Garbage Collection can result in pages being moved from a block which will subsequently be erased. In short, stuff is happening in the background as a consequence of the actions taking place on the host – which we will call the foreground in order to distinguish it. What’s more, if you are looking at things from the point of view of the foreground – like, for example, a database server accessing flash storage – you have no visibility of what’s going on in the background.

It doesn’t matter what OS monitoring tools you run on the host (iostat, for example, or dstat), you will only see the foreground I/O operations that the host knows about. This is important because the performance and endurance of your flash storage are dependent on the sum of foreground and background operations.

We call this phenomenon write amplification and we can express it as a value using the following formula:

Write Amplification = Data Written To Flash / Data Written By The Host

A higher value indicates an increased workload on the flash storage system, which is likely to mean reduced endurance and performance. For this reason, if you are testing any sort of flash system (from the mighty All Flash Array to a simple SSD) you should make every effort to observe the workload within the storage system as well as from the host. Of course with the humble SSD that’s often not possible…
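As a quick worked example of the formula, here it is as a few lines of code. The byte counts are invented purely to show the arithmetic; on a real system the flash-side figure has to come from the storage device itself, if it exposes one at all.

```python
def write_amplification(flash_bytes_written, host_bytes_written):
    """Write Amplification = Data Written To Flash / Data Written By The Host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


GIB = 2**30
host_writes = 100 * GIB        # foreground: what the host asked to write (made up)
background_writes = 50 * GIB   # wear levelling and garbage collection (made up)

print(write_amplification(host_writes + background_writes, host_writes))  # 1.5
```

A value of 1.0 would mean the flash only ever wrote what the host asked for; anything above that is background work the host cannot see.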

Where Is Your FTL?

As a final thought, earlier on I said about the FTL that “you will find it on all flash media if you look hard enough”. What did I mean? Well, the FTL has a lot of duties to perform, as we’ve seen. That requires effort, which for a computer equates to processing power and memory. In most situations (e.g. All Flash Arrays) this happens in firmware under the covers. But with some products, for example certain PCIe flash cards, it’s possible that some or all of the FTL functionality runs on the host, using host CPU cycles and DRAM.

In principle there may be benefits and drawbacks of either method (host-based FTL or array-based FTL), but if you are running a database on your server they become insignificant in relation to the cost. After all, those processors in your database servers? The ones that affect your core-based licenses for Oracle or other database software? They are the most expensive CPUs you own. If you are donating CPU cycles from these cores to manage your flash, you are effectively throwing away license money. And you probably feel you pay enough to your database vendor already, right?

* Seriously, if you didn’t think that was a great pun then you’re reading the wrong blog.

Oracle, Parallelism and Direct Path Reads… on Flash

Guest Post

This is another guest post from my buddy Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Because he’s an American, Nate believes that “football” is played using your hands and that the ball is actually egg-shaped. This is of course ridiculous, because as the entire rest of the world knows, this is football whereas the game Nate is thinking of is actually called “HandEgg”. Now that we’ve cleared that up, over to you, Nate:

Lately, I’ve been running into much confusion around Oracle’s direct path IO functionality (11g+) and, unusually, not all of that confusion is my own. There is a perplexing lack of literature and experimentation with direct path IO on the Internet today. Seriously, I’ve looked. I decided I needed to better understand this event and its timing in order to properly extend suggestions to customers. I set about trying to prove some things I thought I knew, and I managed to confirm several suspicions but also surprised myself with some unexpected results. I’d like to share these in hopes of clarifying this event for everyone in practical terms.

Direct Path IO Background

To set the stage a bit, at the highest level, Oracle created the direct path IO event to describe an IO executed by an Oracle process that reads into (or writes from) the process global area (think of this as the session’s private memory) directly from (to) storage, bypassing the Oracle buffer cache. The rationale is this: full table scans of large tables into the buffer cache consume a lot of space, pushing out likely useful buffers in favor of buffers unlikely to be needed again in the near future. Reading directly into the process global area instead of the shared global area keeps full table scans from polluting the buffer cache and diminishing its overall effectiveness. Since the direct path IO is used for full scanning large objects, it looks to the database’s DB_FILE_MULTIBLOCK_READ_COUNT (henceforth referred to as DBMRC) setting for guidance on the size of IO calls to issue.

Makes sense. But what’s been confusing me is the apparently inconsistent performance of direct path reads and writes, even against Violin’s all-flash arrays. With random and other multi-block IO events showing very low, consistent latency, direct path reads can still be all over the board. How is that? Is it truly impacting performance? How can I make it better, or should I even try? After seeing this at a number of customer installations, I decided to run some tests on a smallish lab server attached to a single Violin array.

The Setup

I have a test database with a number of tables almost exactly 125GB in size full of randomized data. Full-scanning one of these tables via “select count(*)” was plenty to exercise the direct path read repeatedly, varying parallelism and DBMRC. My goal was to see the effect of these settings on both elapsed time and perceived latency. With an 8K database block size (RHEL 6.3, Oracle 12.1.0.1, ASM), I ran the test with DBMRC set to 4, then 8, 16, and finally 128. I ran each test with no parallelism, then with “parallel 16” hinted. So what did I see?

Test Results

[Image: table of direct path read test results]

Note that elapsed times represent the time my query returned to the SQL*Plus prompt with the “set timing on” directive applied to my session and are not 100% representative of time spent on the database but are close enough for my purposes. Total direct path read (DPR) time was pulled from the respective AWR report after execution finished. I asterisked the Physical Read Requests column because some of the reports showed 0 physical reads for the test SQL, while it was clear from the total physical reads that my read operation was the only possible culprit; therefore I felt justified in attributing that total (minus a few here and there from the AWR snapshotting process) to the test SQL.

Note also that the best elapsed time was achieved with the lowest DBMRC and parallel 16. Worst time by far was also obtained with DBMRC set to 4 but without parallelism—although it accumulated the least amount of wait time on DPR. In general, throwing more cores at the problem improved performance hugely; not surprising, but noteworthy.

We know that flash does not benefit from multi-block IO as a rule: at the lowest level, every IO is effectively a random IO, and larger blocks / groups of blocks are fetched independently, assembled, and returned to the caller as a single unit. However, there is a definite overhead in issuing IO requests, waiting for the calls to return, and consuming the requested data. This is evidenced by the high elapsed time for the single-threaded run with DBMRC set to 4: the least amount of reported IO wait time still contributed to the longest overall elapsed time.

So what do these values tell us? For one thing, parallelism is your friend. One core performing a full table scan (FTS) just isn’t going to get the job done nearly as quickly as multiple cores. Also, parallelism vastly trumps DBMRC as a tool for improving performance on flash when CPU resources are available. Performance between parallel processing runs was within 2%, no matter what the DBMRC setting. This I expected, having come into the testing with the assumption that DBMRC was irrelevant when working with flash. I was surprised at the exceedingly high elapsed time with the single-threaded query using small DBMRC. I would expect that to be higher than the others, but not nearly 3X longer than the single-threaded run with DBMRC at 128.

These revelations are mildly interesting, but what I find much more curious is the difference in reported DPR latency. Certainly, a highly parallel execution can accumulate more database time than wall clock time for any event. But we can tell from the elapsed times that, when we’re not starving the database for parallelism, DBMRC is practically meaningless when applied to flash. Yet the calculation of the average latency of the event is mysterious in that 1) 16 threads operating with DBMRC of 128 experience roughly 4X the number of waits the single-threaded execution performs; 2) they do so apparently at about 13X the average latency of the single-threaded run; and 3) they rack up about 51X the amount of total DPR wait time.

What’s worse is that DPR stats are very strangely represented in the Tablespace IO stats section of the report. Here’s the snippet from test run #2:

                     Av       Av     Av      1-bk  Av 1-bk          Writes   Buffer  Av Buf
Tablespace   Reads   Rds/s  Rd(ms) Blks/Rd   Rds/s  Rd(ms)  Writes   avg/s    Waits  Wt(ms)
---------- ------- ------- ------- ------- ------- ------- ------- ------- -------- -------
DEMO       4.1E+06  49,395     0.0     4.0       2     0.0       0       0        0     0.0

We have to cut Oracle some slack on the Av Rds/s value here because it’s now averaging over the time it took me to start the test after my initial snapshot, then realize the test was done and execute another AWR snapshot to end the reporting period. Fine. But an average read time of 0.0ms?! Clearly, Oracle is recording some number of reads, but it’s not reporting timing on them at all in this section of the report. We have to look to the SQL Ordered by Physical Reads (Unoptimized) section of the report to confirm it’s actually doing a relevant number of IO requests:

-> Total Physical Read Requests:       4,092,389
-> Captured SQL account for    0.0% of Total
-> Total UnOptimized Read Requests:       4,092,389
-> Captured SQL account for    0.0% of Total
-> Total Optimized Read Requests:               1
-> Captured SQL account for    0.0% of Total
 
[some lines removed]
 
UnOptimized   Physical              UnOptimized
  Read Reqs   Read Reqs Executions Reqs per Exe   %Opt %Total    SQL Id
----------- ----------- ---------- ------------ ------ ------ -------------
          0           0          1          0.0    N/A    0.0 4kpvpt49hm3nf
Module: SQL*Plus
   PDB: DEMO
select /*+ parallel 16 */ count(*) from demo.length100_1

Oh, wait.  Oracle doesn’t credit my query with any physical read requests.  I have to look at the total just above in the report, and see that only the AWR snapshot performed any other IO on the system, and subtract that from the total.  Sigh.  At least ~4.1M reads at 32KB comes close to 125GB.
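That sanity check is easy to reproduce. The sketch below takes the request count from the AWR snippet above and assumes the 8K block size from the setup, with 4 blocks per read as shown in the Av Blks/Rd column; the variable names are mine.

```python
read_requests = 4_092_389          # Total Physical Read Requests from the AWR report
db_block_size = 8 * 1024           # 8K database block size, in bytes
blocks_per_read = 4                # Av Blks/Rd from the tablespace IO stats

io_size = db_block_size * blocks_per_read        # bytes per read request
total_bytes = read_requests * io_size

print(io_size, round(total_bytes / 2**30, 1))    # 32768 124.9
```

So roughly 4.1M reads of 32KB each works out at about 124.9GB, which lines up with a table of almost exactly 125GB.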

So what gives, Oracle?  I’ve read some Oracle notes and other blogs on the subject of DPR, and they suggest the wait event is not necessarily triggered when the IO call is initially issued, but instead when the session decides it needs all outstanding DPR IOs it has issued to complete before moving on—or it fills up all its “slots” and has to wait for those to free up.  Thus the under-reporting of the actual number of DPR waits and the artificially high wait time for each of those waits:  fewer waits, along with potentially many IO requests outstanding when the wait is triggered and timing starts.  But nowhere in all of this is there a set of numbers that I can trust to accurately describe my DPR performance.  The fact that DPR IO is completely left out of tablespace timings is seriously troubling:  we trust these stats to determine “hot” tablespaces and under-performing mount points.  This throws all kinds of doubt into the mix.

What can I say about Oracle’s DPR at this point?  While it works just fine and serves its purpose, the instrumentation appears to be lacking, even in Oracle 12.1.  After this testing, I feel even more confident telling customers to ignore the latency reported for this event—at least for now.  And I’ve confirmed my belief that, with any sort of parallelism enabled on your database, DBMRC is largely irrelevant for flash storage and only adds a mystery factor to reported latencies.  Yes, setting this to a low value will affect costing of FTS vs. index access, so you should verify that plans currently employing FTS that you want to remain that way still do.  This is easy enough with an alter session and explain plan.  With that, Oracle, the ball is in your court:  please define your terms, fix your instrumentation around DPR, or tell customers to stop worrying about DPR latencies.  Meanwhile, I’m going to advise people who are otherwise happy with their performance but want better latency numbers in their reports to set DBMRC lower and get on with their lives.

Oracle 12.1.0.2 ASM Filter Driver: Advanced Format Fail

In my previous post on the subject of the new ASM Filter Driver (AFD) feature introduced in Oracle’s 12.1.0.2 patchset, I installed the AFD to see how it fulfilled its promise that it “filters out all non-Oracle I/Os which could cause accidental overwrites“. However, because I was ten minutes away from my summer vacation at the point of finishing that post, I didn’t actually get round to writing about what happens when you try and create ASM diskgroups on the devices it presents.

Obviously I’ve spent the intervening period constantly worrying about this oversight – indeed, it was only through the judicious application of good food and drink plus some committed relaxation in the sun that I was able to pull through. However, I’m back now and it seems like time to rectify that mistake. So here goes.

Creating ASM Diskgroups with the ASM Filter Driver

It turns out I need not have worried, because it doesn’t work right now… at least, not for me. Here’s why:

First of all, I installed Oracle 12.1.0.2 Grid Infrastructure. I then labelled some block devices presented from my Violin storage array. As I’ve already pasted all the output from those two steps in the previous post, I won’t repeat myself.

The next step is therefore to create a diskgroup. Since I’ve only just come back from holiday and so I’m still half brain-dead, I’ll choose the simple route and fire up the ASM Configuration Assistant (ASMCA) so that I don’t have to look up any of that nasty SQL. Here goes:

[Screenshot: creating the diskgroup in ASMCA]

But guess what happened when I hit the OK button? It failed, bigtime. Here’s the alert log – if you don’t like huge amounts of meaningless text I suggest you skip down… a lot… (although thinking about it, my entire blog could be described as meaningless text):

SQL> CREATE DISKGROUP DATA EXTERNAL REDUNDANCY  DISK 'AFD:DATA1' SIZE 72704M ,
'AFD:DATA2' SIZE 72704M ,
'AFD:DATA3' SIZE 72704M ,
'AFD:DATA4' SIZE 72704M ,
'AFD:DATA5' SIZE 72704M ,
'AFD:DATA6' SIZE 72704M ,
'AFD:DATA7' SIZE 72704M ,
'AFD:DATA8' SIZE 72704M  ATTRIBUTE 'compatible.asm'='12.1.0.0.0','au_size'='1M' /* ASMCA */
Fri Jul 25 16:25:33 2014
WARNING: Library 'AFD Library - Generic , version 3 (KABI_V3)' does not support advanced format disks
Fri Jul 25 16:25:33 2014
NOTE: Assigning number (1,0) to disk (AFD:DATA1)
NOTE: Assigning number (1,1) to disk (AFD:DATA2)
NOTE: Assigning number (1,2) to disk (AFD:DATA3)
NOTE: Assigning number (1,3) to disk (AFD:DATA4)
NOTE: Assigning number (1,4) to disk (AFD:DATA5)
NOTE: Assigning number (1,5) to disk (AFD:DATA6)
NOTE: Assigning number (1,6) to disk (AFD:DATA7)
NOTE: Assigning number (1,7) to disk (AFD:DATA8)
NOTE: initializing header (replicated) on grp 1 disk DATA1
NOTE: initializing header (replicated) on grp 1 disk DATA2
NOTE: initializing header (replicated) on grp 1 disk DATA3
NOTE: initializing header (replicated) on grp 1 disk DATA4
NOTE: initializing header (replicated) on grp 1 disk DATA5
NOTE: initializing header (replicated) on grp 1 disk DATA6
NOTE: initializing header (replicated) on grp 1 disk DATA7
NOTE: initializing header (replicated) on grp 1 disk DATA8
NOTE: initializing header on grp 1 disk DATA1
NOTE: initializing header on grp 1 disk DATA2
NOTE: initializing header on grp 1 disk DATA3
NOTE: initializing header on grp 1 disk DATA4
NOTE: initializing header on grp 1 disk DATA5
NOTE: initializing header on grp 1 disk DATA6
NOTE: initializing header on grp 1 disk DATA7
NOTE: initializing header on grp 1 disk DATA8
NOTE: Disk 0 in group 1 is assigned fgnum=1
NOTE: Disk 1 in group 1 is assigned fgnum=2
NOTE: Disk 2 in group 1 is assigned fgnum=3
NOTE: Disk 3 in group 1 is assigned fgnum=4
NOTE: Disk 4 in group 1 is assigned fgnum=5
NOTE: Disk 5 in group 1 is assigned fgnum=6
NOTE: Disk 6 in group 1 is assigned fgnum=7
NOTE: Disk 7 in group 1 is assigned fgnum=8
NOTE: initiating PST update: grp = 1
Fri Jul 25 16:25:33 2014
GMON updating group 1 at 1 for pid 7, osid 16745
NOTE: group DATA: initial PST location: disk 0000 (PST copy 0)
NOTE: set version 1 for asmCompat 12.1.0.0.0
Fri Jul 25 16:25:33 2014
NOTE: PST update grp = 1 completed successfully
NOTE: cache registered group DATA 1/0xD9B6AE8D
NOTE: cache began mount (first) of group DATA 1/0xD9B6AE8D
NOTE: cache is mounting group DATA created on 2014/07/25 16:25:33
NOTE: cache opening disk 0 of grp 1: DATA1 label:DATA1
NOTE: cache opening disk 1 of grp 1: DATA2 label:DATA2
NOTE: cache opening disk 2 of grp 1: DATA3 label:DATA3
NOTE: cache opening disk 3 of grp 1: DATA4 label:DATA4
NOTE: cache opening disk 4 of grp 1: DATA5 label:DATA5
NOTE: cache opening disk 5 of grp 1: DATA6 label:DATA6
NOTE: cache opening disk 6 of grp 1: DATA7 label:DATA7
NOTE: cache opening disk 7 of grp 1: DATA8 label:DATA8
NOTE: cache creating group 1/0xD9B6AE8D (DATA)
NOTE: cache mounting group 1/0xD9B6AE8D (DATA) succeeded
WARNING: cache read a corrupt block: group=1(DATA) dsk=0 blk=1 disk=0 (DATA1) incarn=3493224069 au=0 blk=1 count=1
Fri Jul 25 16:25:33 2014
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc:
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
NOTE: a corrupted block from group DATA was dumped to /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc
WARNING: cache read (retry) a corrupt block: group=1(DATA) dsk=0 blk=1 disk=0 (DATA1) incarn=3493224069 au=0 blk=1 count=1
Fri Jul 25 16:25:33 2014
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc:
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
WARNING: cache read (retry) a corrupt block: group=1(DATA) dsk=0 blk=1 disk=0 (DATA1) incarn=3493224069 au=11 blk=1 count=1
Fri Jul 25 16:25:33 2014
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc:
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
NOTE: a corrupted block from group DATA was dumped to /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc
WARNING: cache read (retry) a corrupt block: group=1(DATA) dsk=0 blk=1 disk=0 (DATA1) incarn=3493224069 au=11 blk=1 count=1
Fri Jul 25 16:25:33 2014
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc:
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ERROR: cache failed to read group=1(DATA) dsk=0 blk=1 from disk(s): 0(DATA1) 0(DATA1)
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]

NOTE: cache initiating offline of disk 0 group DATA
NOTE: process _user16745_+asm (16745) initiating offline of disk 0.3493224069 (DATA1) with mask 0x7e in group 1 (DATA) with client assisting
NOTE: initiating PST update: grp 1 (DATA), dsk = 0/0xd0365e85, mask = 0x6a, op = clear
Fri Jul 25 16:25:34 2014
GMON updating disk modes for group 1 at 2 for pid 7, osid 16745
ERROR: disk 0(DATA1) in group 1(DATA) cannot be offlined because the disk group has external redundancy.
Fri Jul 25 16:25:34 2014
ERROR: too many offline disks in PST (grp 1)
Fri Jul 25 16:25:34 2014
ERROR: no read quorum in group: required 1, found 0 disks
ERROR: Could not read PST for grp 1. Force dismounting the disk group.
Fri Jul 25 16:25:34 2014
NOTE: halting all I/Os to diskgroup 1 (DATA)
Fri Jul 25 16:25:34 2014
ERROR: no read quorum in group: required 1, found 0 disks
ASM Health Checker found 1 new failures
Fri Jul 25 16:25:36 2014
ERROR: no read quorum in group: required 1, found 0 disks
Fri Jul 25 16:25:36 2014
ERROR: Could not read PST for grp 1. Force dismounting the disk group.
Fri Jul 25 16:25:36 2014
ERROR: no read quorum in group: required 1, found 0 disks
ERROR: Could not read PST for grp 1. Force dismounting the disk group.
Fri Jul 25 16:25:36 2014
ERROR: no read quorum in group: required 1, found 0 disks
ERROR: Could not read PST for grp 1. Force dismounting the disk group.
Fri Jul 25 16:25:37 2014
NOTE: AMDU dump of disk group DATA initiated at /u01/app/oracle/diag/asm/+asm/+ASM/trace
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_16745.trc  (incident=3257):
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA1" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]
Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM/incident/incdir_3257/+ASM_ora_16745_i3257.trc
Fri Jul 25 16:25:37 2014
Sweep [inc][3257]: completed
Fri Jul 25 16:25:37 2014
SQL> alter diskgroup DATA check
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM/incident/incdir_3257/+ASM_ora_16745_i3257.trc
NOTE: erasing header (replicated) on grp 1 disk DATA1
NOTE: erasing header (replicated) on grp 1 disk DATA2
NOTE: erasing header (replicated) on grp 1 disk DATA3
NOTE: erasing header (replicated) on grp 1 disk DATA4
NOTE: erasing header (replicated) on grp 1 disk DATA5
NOTE: erasing header (replicated) on grp 1 disk DATA6
NOTE: erasing header (replicated) on grp 1 disk DATA7
NOTE: erasing header (replicated) on grp 1 disk DATA8
NOTE: erasing header on grp 1 disk DATA1
NOTE: erasing header on grp 1 disk DATA2
NOTE: erasing header on grp 1 disk DATA3
NOTE: erasing header on grp 1 disk DATA4
NOTE: erasing header on grp 1 disk DATA5
NOTE: erasing header on grp 1 disk DATA6
NOTE: erasing header on grp 1 disk DATA7
NOTE: erasing header on grp 1 disk DATA8
Fri Jul 25 16:25:37 2014
NOTE: cache dismounting (clean) group 1/0xD9B6AE8D (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 16745, image: oracle@server3.local (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: LGWR not being messaged to dismount
NOTE: cache dismounted group 1/0xD9B6AE8D (DATA)
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0xd9b6ae8d
NOTE: cache deleting context for group DATA 1/0xd9b6ae8d
Fri Jul 25 16:25:37 2014
GMON dismounting group 1 at 3 for pid 7, osid 16745
Fri Jul 25 16:25:37 2014
NOTE: Disk DATA1 in mode 0x7f marked for de-assignment
NOTE: Disk DATA2 in mode 0x7f marked for de-assignment
NOTE: Disk DATA3 in mode 0x7f marked for de-assignment
NOTE: Disk DATA4 in mode 0x7f marked for de-assignment
NOTE: Disk DATA5 in mode 0x7f marked for de-assignment
NOTE: Disk DATA6 in mode 0x7f marked for de-assignment
NOTE: Disk DATA7 in mode 0x7f marked for de-assignment
NOTE: Disk DATA8 in mode 0x7f marked for de-assignment
ERROR: diskgroup DATA was not created
ORA-15018: diskgroup cannot be created
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
Fri Jul 25 16:25:37 2014
ORA-15032: not all alterations performed
ORA-15066: offlining disk "DATA1" in group "DATA" may result in a data loss
ORA-15001: diskgroup "DATA" does not exist or is not mounted
ORA-15196: invalid ASM block header [kfc.c:29297] [endian_kfbh] [2147483648] [1] [0 != 1]

Now then. First of all, thanks for making it this far – I promise not to do that again in this post. Secondly, in case you really did just hit page down *a lot* you might want to skip back up and look for the bits I’ve conveniently highlighted in red. Specifically, this bit:

WARNING: Library 'AFD Library - Generic , version 3 (KABI_V3)' does not support advanced format disks

Many modern storage platforms use Advanced Format – if you want to know what that means, read here. The idea that AFD doesn’t support advanced format is somewhat alarming – and indeed incorrect, according to interactions I have subsequently had with Oracle’s ASM Product Management people. From what I understand, the problem is tracked as bug 19297177 (currently unpublished) and is caused by AFD incorrectly checking the physical block size of the storage device (4k) instead of the logical block size (which was 512 bytes). I currently have a request open with Oracle Support for the patch, so when that arrives I will re-test and add another blog article.
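If you want to see which two values are in play on your own system, Linux exposes both block sizes under sysfs. The sketch below is my own illustration and has nothing to do with how AFD performs its check: the function names are invented, and `sdb` in the comment is a placeholder device name.

```python
from pathlib import Path


def block_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) block sizes in bytes as reported by Linux sysfs."""
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def describe(logical, physical):
    """Name the common sector-size combinations."""
    if logical == 4096:
        return "4K native"
    if logical == 512 and physical == 4096:
        return "Advanced Format with 512-byte emulation (512e)"
    if logical == 512 and physical == 512:
        return "512 native"
    return "other"


# The values reported in this post: 512-byte logical, 4k physical.
print(describe(512, 4096))  # Advanced Format with 512-byte emulation (512e)

# On a real Linux host you could run, for a hypothetical device sdb:
#   print(describe(*block_sizes("sdb")))
```

A device like mine presents 512-byte logical blocks on top of 4k physical ones, so a check against the physical size alone gives a different answer from a check against the logical size.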

Until then, I guess I might as well take another well-earned vacation?

Oracle 12.1.0.2 ASM Filter Driver: First Impressions

This is a very quick post, because I’m about to log off and take an extended summer holiday (or vacation as my crazy American friends call it… but then they call football “soccer” too). Before I go, I wanted to document my initial findings with the new ASM Filter Driver feature introduced in this week’s 12.1.0.2 patchset.

Currently a Linux-only feature, the ASM Filter Driver (or AFD) is a replacement for ASMLib and is described by Oracle as follows:

Oracle ASM Filter Driver (Oracle ASMFD) is a kernel module that resides in the I/O path of the Oracle ASM disks. Oracle ASM uses the filter driver to validate write I/O requests to Oracle ASM disks.

The Oracle ASMFD simplifies the configuration and management of disk devices by eliminating the need to rebind disk devices used with Oracle ASM each time the system is restarted.

The Oracle ASM Filter Driver rejects any I/O requests that are invalid. This action eliminates accidental overwrites of Oracle ASM disks that would cause corruption in the disks and files within the disk group. For example, the Oracle ASM Filter Driver filters out all non-Oracle I/Os which could cause accidental overwrites.

Interesting, eh? So let’s find out how that works.

Installation

I found this a real pain as you need to have 12.1.0.2 installed before the AFD is available to label your disks, yet the default OUI mode wants to create an ASM diskgroup… and you cannot do that without any labelled disks.

The only solution I could come up with was to perform a software-only install, which in itself is a pain. I’ll spare you the numerous screenshots of that part though and skip straight to the bit where I have 12.1.0.2 Grid Infrastructure installed.

I’m following these instructions because I am using a single-instance Oracle Restart system rather than a true cluster.

First of all we need to do this:

[oracle@server3 ~]$ $ORACLE_HOME/bin/asmcmd dsset 'AFD:*'

[oracle@server3 ~]$ $ORACLE_HOME/bin/asmcmd dsget
parameter:AFD:*
profile:AFD:*
[oracle@server3 ~]$ srvctl config asm
ASM home: 
Password file:
ASM listener: LISTENER
Spfile: /u01/app/oracle/admin/+ASM/pfile/spfile+ASM.ora
ASM diskgroup discovery string: AFD:*

Then we need to stop HAS and run the AFD_CONFIGURE command:

[root@server3 ~]# $ORACLE_HOME/bin/crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'server3'
CRS-2673: Attempting to stop 'ora.asm' on 'server3'
CRS-2673: Attempting to stop 'ora.evmd' on 'server3'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'server3'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'server3' succeeded
CRS-2677: Stop of 'ora.evmd' on 'server3' succeeded
CRS-2677: Stop of 'ora.asm' on 'server3' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'server3'
CRS-2677: Stop of 'ora.cssd' on 'server3' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'server3' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@server3 ~]# $ORACLE_HOME/bin/asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
ASMCMD-9524: AFD configuration failed 'ERROR: OHASD start failed'

Er… that’s not really what I had in mind. But hey, let’s carry on regardless:

[root@server3 oracleafd]# $ORACLE_HOME/bin/asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'DEFAULT' on host 'server3.local'

[root@server3 oracleafd]# $ORACLE_HOME/bin/crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Ok it seems to be working. I wonder what it’s done?

Investigation

The first thing I notice is some Oracle kernel modules have been loaded:

[root@server3 ~]# lsmod | grep ora
oracleafd             208499  1
oracleacfs           3307969  0
oracleadvm            506254  0
oracleoks             505749  2 oracleacfs,oracleadvm

I also see that, just like ASMLib, a driver has been plonked into the /opt/oracle/extapi directory:

[root@server3 1]# find /opt/oracle/extapi -ls
2752765    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi
2752766    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64
2753508    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm
2756532    4 drwxr-xr-x   3 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl
2756562    4 drwxr-xr-x   2 root     root         4096 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl/1
2756578  268 -rwxr-xr-x   1 oracle   dba        272513 Jul 25 15:15 /opt/oracle/extapi/64/asm/orcl/1/libafd12.so

And again, just like ASMLib, there is a new directory under /dev called /dev/oracleafd (whereas for ASMLib it’s called /dev/oracleasm):

[root@server3 ~]# ls -la /dev/oracleafd/
total 0
drwxrwx---  3 oracle dba      80 Jul 25 15:15 .
drwxr-xr-x 21 root   root  15820 Jul 25 15:15 ..
brwxrwx---  1 oracle dba  249, 0 Jul 25 15:15 admin
drwxrwx---  2 oracle dba      40 Jul 25 15:15 disks

The disks directory is currently empty. Maybe I should create some AFD devices and see what happens?

Labelling

So let’s look at my Violin devices and see if I can label them:

[root@server3 mapper]# ls -l /dev/mapper
total 0
crw-rw---- 1 root root 10, 236 Jul 11 16:52 control
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data1 -> ../dm-3
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data2 -> ../dm-4
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data3 -> ../dm-5
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data4 -> ../dm-6
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data5 -> ../dm-7
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data6 -> ../dm-8
lrwxrwxrwx 1 root root       7 Jul 25 15:49 data7 -> ../dm-9
lrwxrwxrwx 1 root root       8 Jul 25 15:49 data8 -> ../dm-10
lrwxrwxrwx 1 root root       7 Jul 11 16:53 VolGroup-lv_home -> ../dm-2
lrwxrwxrwx 1 root root       7 Jul 11 16:53 VolGroup-lv_root -> ../dm-0
lrwxrwxrwx 1 root root       7 Jul 11 16:52 VolGroup-lv_swap -> ../dm-1

The documentation appears to be incorrect here, when it says to use the command $ORACLE_HOME/bin/afd_label. It’s actually $ORACLE_HOME/bin/asmcmd with the first parameter afd_label. I’m going to label the devices called /dev/mapper/data*:

[root@server3 mapper]# for lun in 1 2 3 4 5 6 7 8; do
> asmcmd afd_label DATA$lun /dev/mapper/data$lun
> done
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.

[root@server3 mapper]# asmcmd afd_lsdsk
Connected to an idle instance.
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
DATA1                       ENABLED   /dev/mapper/data1
DATA2                       ENABLED   /dev/mapper/data2
DATA3                       ENABLED   /dev/mapper/data3
DATA4                       ENABLED   /dev/mapper/data4
DATA5                       ENABLED   /dev/mapper/data5
DATA6                       ENABLED   /dev/mapper/data6
DATA7                       ENABLED   /dev/mapper/data7
DATA8                       ENABLED   /dev/mapper/data8
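
If you wanted to check this listing from a script rather than by eye, the table is simple enough to parse. This is only a sketch based on the layout printed above (a row of equals signs followed by three whitespace-separated columns); I have no idea whether that layout is stable across versions, and the function name is my own invention.

```python
def parse_afd_lsdsk(output: str) -> list:
    """Parse the text printed by 'asmcmd afd_lsdsk' into a list of dicts.

    Assumes the layout shown in this post: rows follow a line of '=' characters
    and contain a label, a filtering state and a device path.
    """
    rows = []
    in_table = False
    for line in output.splitlines():
        if line.startswith("===="):
            in_table = True          # data rows start after this separator
            continue
        if not in_table:
            continue
        parts = line.split()
        if len(parts) == 3:          # skip blank or unexpected lines
            label, filtering, path = parts
            rows.append({"label": label, "filtering": filtering, "path": path})
    return rows
```

Feeding it the output above should give eight entries, each with filtering set to ENABLED.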

That seemed to work ok. So what’s going on in the /dev/oracleafd/disks directory now?

[root@server3 ~]# ls -l /dev/oracleafd/disks/
total 32
-rw-r--r-- 1 root root 26 Jul 25 15:52 DATA1
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA2
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA3
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA4
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA5
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA6
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA7
-rw-r--r-- 1 root root 26 Jul 25 15:49 DATA8

There they are, just like with ASMLib. But look at the permissions: they are all owned by root, with read-only privileges for other users. In an ASMLib environment these devices are owned by oracle:dba, which means non-Oracle processes can write to them and corrupt them in some situations. Is this how Oracle claims the AFD protects devices?

I haven’t had time to investigate further but I assume that the database will access the devices via this mysterious block device:

[oracle@server3 oracleafd]$ ls -l /dev/oracleafd/admin
brwxrwx--- 1 oracle dba 249, 0 Jul 25 16:25 /dev/oracleafd/admin

It will be interesting to find out.

Destruction

Of course, if you are logged in as root you aren’t going to be protected from any crazy behaviour:

[root@server3 ~]# cd /dev/oracleafd/disks
[root@server3 disks]# ls -l
total 496
-rw-r--r-- 1 root root 475877 Jul 25 16:40 DATA1
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA2
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA3
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA4
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA5
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA6
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA7
-rw-r--r-- 1 root root     26 Jul 25 15:49 DATA8
[root@server3 disks]# od -c -N 256 DATA8
0000000   /   d   e   v   /   m   a   p   p   e   r   /   d   a   t   a
0000020   8  \n
0000032
[root@server3 disks]# dmesg >> DATA8
[root@server3 disks]# od -c -N 256 DATA8
0000000   /   d   e   v   /   m   a   p   p   e   r   /   d   a   t   a
0000020   8   \n   z   r   d   b   t   e   2  l   I   n   i   t   i   a
0000040   l   i   z   i   n   g       c   g   r   o   u   p       s   u
0000060   b   s   y   s       c   p   u   s   e   t  \n   I   n   i   t
0000100   i   a   l   i   z   i   n   g       c   g   r   o   u   p
0000120   s   u   b   s   y   s       c   p   u  \n   L   i   n   u   x
0000140       v   e   r   s   i   o   n       3   .   8   .   1   3   -
0000160   2   6   .   2   .   3   .   e   l   6   u   e   k   .   x   8
0000200   6   _   6   4       (   m   o   c   k   b   u   i   l   d   @
0000220   c   a   -   b   u   i   l   d   4   4   .   u   s   .   o   r
0000240   a   c   l   e   .   c   o   m   )       (   g   c   c       v
0000260   e   r   s   i   o   n       4   .   4   .   7       2   0   1
0000300   2   0   3   1   3       (   R   e   d       H   a   t       4
0000320   .   4   .   7   -   3   )       (   G   C   C   )       )
0000340   #   2       S   M   P       W   e   d       A   p   r       1
0000360   6       0   2   :   5   1   :   1   0       P   D   T       2
0000400

Proof, if ever you need it, that root access is still the fastest and easiest route to total disaster…
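
Going only by the od output above, each file in /dev/oracleafd/disks appears to hold the path of the underlying device followed by a newline. On that assumption, a quick sanity check for this kind of damage might look like the sketch below. The function name and the idea of counting "trailing bytes" are mine, not anything Oracle provides.

```python
from pathlib import Path


def read_afd_labels(disks_dir: str = "/dev/oracleafd/disks") -> dict:
    """Map each label file to the device path on its first line.

    Also reports how many bytes follow the first newline. Based on the od
    output in this post, anything there is unexpected (assumption).
    """
    result = {}
    for entry in sorted(Path(disks_dir).iterdir()):
        if not entry.is_file():
            continue
        head, _, tail = entry.read_bytes().partition(b"\n")
        result[entry.name] = {
            "path": head.decode("ascii", errors="replace"),
            "trailing_bytes": len(tail),
        }
    return result
```

A label file that has had dmesg output appended to it, like DATA8 above, would show up with a large `trailing_bytes` value while the untouched ones report zero.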

New section: Oracle SLOB Testing


For some time now I have preferred Oracle SLOB as my tool for generating I/O workloads using Oracle databases. I’ve previously blogged some information on how to use SLOB for PIO testing, as well as shared some scripts for running tests and extracting results. I’ve now added a whole new landing page for SLOB and a complete guide to running sustained throughput testing.

Why would you want to run sustained throughput tests? Well, one great reason is that not all storage platforms can cope with sustained levels of write workload. Flash arrays, or any storage array which contains flash, have a tendency to suffer from garbage collection issues when sustained write workloads hit them hard enough.

Find out more by following the links below:

Understanding Flash: SLC, MLC and TLC


The last post in this series discussed the layout of NAND flash memory chips and the way in which cells can be read and written (programmed) at the page level but have to be erased at the (larger) block level. I finished by mentioning that erase operations take substantially longer than read or program operations… but just how big is the difference?

Knowing the answer to this involves first understanding the different types of flash memory available: SLC, MLC and TLC.

Electrons In A Bucket?

Whenever I’ve seen anyone attempt to explain this in the past, they have almost always resorted to drawing a picture of electrons or charge filling up a bucket. This is a terrible analogy and causes anyone with a deep understanding of physics to cringe in horror. Luckily, I don’t have a deep understanding of physics, so I’m going to go right along with the herd and get my bucket out.

A NAND flash cell, i.e. the thing that stores a value of one or zero, is actually a floating gate transistor. Programming the cell means putting electrons into the floating gate, causing it to become (negatively) charged. Erasing the cell means removing the electrons from the floating gate, draining the charge. The amount of charge contained in the floating gate can be varied from zero up to a maximum value – this is an analogue system so there is no simple FULL or EMPTY state.

Because of this, the amount of charge can be measured and thresholds assigned to indicate a binary value. What does that mean? It means that, in the case of Single Level Cell (SLC) flash, anything below 50% of charge can be considered to be a bit with a value of 1, while anything above 50% can be considered a bit with a value of 0.

But if I decided to be a bit more careful in the way I fill or empty my bucket of charge (sorry), I could perhaps define more thresholds and thus hold two bits of data instead of one. I could say that below 25% is 11, from 25% to 50% is 10, from 50% to 75% is 01 and above 75% is 00. Now I can keep twice as much data in the same bucket. This is in fact Multi Level Cell (MLC). And as the picture shows, if I was really careful in the way I treated my bucket, I could even keep three bits of data in there, which is what happens in Triple Level Cell (TLC):

slc-mlc-tlc-buckets
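
For anyone who prefers code to buckets, the threshold idea can be sketched in a few lines of Python. This follows the simplified mapping used in this post (equal-width bands, with the emptiest band reading as all ones); extending the same pattern to three bits is my assumption, and real NAND chips use their own voltage thresholds and bit encodings.

```python
def decode_cell(charge: float, bits_per_cell: int) -> str:
    """Turn a charge level (0.0 = empty, 1.0 = full) into a string of bits.

    bits_per_cell: 1 for SLC, 2 for MLC, 3 for TLC.
    Uses equal-width bands, which is a simplification for illustration.
    """
    if not 0.0 <= charge <= 1.0:
        raise ValueError("charge must be a fraction between 0 and 1")
    levels = 2 ** bits_per_cell
    # Which band does the charge fall into? (a completely full cell goes in the top band)
    band = min(int(charge * levels), levels - 1)
    # The emptiest band holds all ones, the fullest band holds all zeros.
    value = (levels - 1) - band
    return format(value, "0{}b".format(bits_per_cell))


print(decode_cell(0.3, 1))  # SLC: below 50%, so "1"
print(decode_cell(0.3, 2))  # MLC: between 25% and 50%, so "10"
print(decode_cell(0.9, 2))  # MLC: above 75%, so "00"
```

Notice that the same 30% charge reads as a different value depending on how many thresholds you have decided to draw on the bucket.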

The thing is, imagine this was a bucket of water (comparing electrons to water is probably the last straw for anyone reading this who has a degree in physics, so I bid you farewell at this point). If you were to fill up your bucket using the SLC method, you could be pretty slap-dash about it. I mean it’s pretty obvious when the bucket is more than half full or empty. But if you were using a more fine-grained method such as MLC or TLC you would need to fill / empty very carefully and take exact measurements, which means the act of filling (programming) would be a lot slower.

To really stretch this analogy to breaking point, imagine that every time you fill your bucket it gets slightly damaged, causing it to leak. In the SLC world, even a number of small leaks would not be a big deal. But in the MLC or (especially) the TLC world, those leaks mean it would quickly become impossible to keep using your bucket, because the tolerance between different bit values is so small. For similar reasons, NAND flash endurance is greatly influenced by the type of cell used. Storing more bits per cell means a lower tolerance for errors, which in turn means that higher error rates are experienced and endurance (the number of program/erase cycles that can be sustained) is lower.
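
To put some rough numbers on that shrinking tolerance, here is a small sketch that stays inside the bucket analogy. It assumes equal-width bands and a cell programmed to the exact middle of its band, neither of which is true of real flash, so treat the output as an illustration of the trend rather than as data.

```python
def band_width(bits_per_cell: int) -> float:
    """Fraction of the full charge range covered by each value."""
    return 1.0 / (2 ** bits_per_cell)


def drift_margin(bits_per_cell: int) -> float:
    """How far the charge can drift from the middle of a band before it is misread."""
    return band_width(bits_per_cell) / 2


for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    print(name, "band:", band_width(bits), "margin:", drift_margin(bits))
```

Each extra bit per cell halves the margin: 25% of the bucket for SLC, 12.5% for MLC and 6.25% for TLC.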

Timing and Wear

Enough of the analogies, let’s look at some proper data. The chart below uses sample figures from AnandTech:

slc-mlc-tlc-performance-chart

You can see that as the number of bits per cell increases, so does the time taken to perform read, program (i.e. write) and erase operations. Erases in particular are especially slow, with values measured in milliseconds instead of microseconds. Given that erases also affect larger areas of flash than reads and programs, you can start to see why the management of erase operations on flash is critical to performance.

Also apparent on the chart above is the massive difference in the number of program / erase cycles between the different flash types: for SLC we’re talking about orders of magnitude in difference. But of course SLC can only store one bit per cell, which means it’s much more expensive from a capacity perspective than MLC. TLC, meanwhile, offers the potential for great value for money, but none of the performance characteristics you would need for tier one storage (although it may well have a place in the world of backups). It is for this reason that MLC is the most commonly used type of flash in enterprise storage systems. (By the way I’m so utterly uninterested in the phenomenon of “eMLC” that I’m not going to cover it here, but you can read this and this if you want to know more on the subject…)

Warning: Know Your Flash

One final thing. When you buy an SSD, a PCIe flash card or, in the case of Violin Memory, an all-flash array you tend to choose between SLC and MLC. As a very rough rule of thumb you can consider MLC to be twice the capacity for half the performance of SLC, although this in fact varies depending on many factors. However there are some all-flash array vendors who use both SLC and MLC in a sort of tiered approach. That’s fine, and if you are buying a flash array I’m sure you’ll take the time to understand how it works.

But here’s the thing. At least one of these vendors insists on describing the SLC layer as “NVRAM” to differentiate from the MLC layer which it simply describes as using flash SSDs. The truth is that the NVRAM is also just a bunch of flash SSDs, except they are SLC instead of MLC. I’m not in favour of using educational posts to criticise competitors, but in the interest of bringing clarity to this subject I will say this: I think this is a marketing exercise which deliberately adds confusion to try and make the design sound more exciting. “Ooooh, NVRAM that sounds like something I ought to have in my flash array…” – or am I being too cynical?

Understanding Flash: Blocks, Pages and Program / Erases

In the last post on this subject I described the invention of NAND flash and the way in which erase operations affect larger areas than write operations. Let’s have a look at this in more detail and see what actually happens. First of all, we need to know our way around the different entities on a flash chip: the die, the plane, the block and the page:

NAND Flash Die Layout (image courtesy of AnandTech)


Note: What follows is a high-level description of the generic behaviour of flash. There are thousands of different NAND chips available, each potentially with slightly different instruction sets, block/page sizes, performance characteristics etc.

  • The die is the memory chip, i.e. the black rectangle with little electrical connectors sticking out of it. If you look at an SSD, a flash card or the internals of a flash array you will see many flash dies, each of which is produced by one of the big flash manufacturers: Toshiba, Samsung, Micron, Intel, SanDisk, SK Hynix. These are the only companies with the multi-billion dollar fabrication plants necessary to make NAND flash.
  • Each die contains one or more planes (usually two). Identical, concurrent operations can take place on each plane, although with some restrictions.
  • Each plane contains a number of blocks, which are the smallest unit that can be erased. Remember that, it’s really important.
  • Each block contains a number of pages, which are the smallest unit that can be programmed (i.e. written to).

The important bit here is that program operations (i.e. writes) take place to a page, which might typically be 8-16KB in size, while erase operations take place to a block, which might be 4-8MB in size. Since a block needs to be erased before it can be programmed again (*sort of, I’m generalising to make this easier), all of the pages in a block need to be candidates for erasure before this can happen.
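
To make the page / block distinction concrete, here is a toy model in Python. The class and function names are made up for illustration and the sizes are just the example figures quoted above; a real chip has its own geometry and command set.

```python
def pages_per_block(block_bytes: int, page_bytes: int) -> int:
    """How many pages fit in one erase block."""
    return block_bytes // page_bytes


class Block:
    """A toy erase block: pages are programmed one at a time, erased all at once."""

    def __init__(self, num_pages: int):
        self.pages = [None] * num_pages   # None means "empty"

    def program(self, page_index: int, data) -> None:
        # Simplification: a page can only be programmed while it is empty.
        if self.pages[page_index] is not None:
            raise RuntimeError("page already programmed; erase the block first")
        self.pages[page_index] = data

    def read(self, page_index: int):
        return self.pages[page_index]

    def erase(self) -> None:
        # There is no per-page erase: every page in the block is cleared.
        self.pages = [None] * len(self.pages)


# With 8KB pages and a 4MB block, one erase affects 512 pages.
print(pages_per_block(4 * 1024 * 1024, 8 * 1024))
```

The asymmetry is the whole point: `program()` touches one page, while `erase()` has no choice but to touch them all.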

Program / Erase Cycles

When your flash device arrives fresh from the vendor, all of the pages are “empty”. The first thing you will want to do, I’m sure, is write some data to them – which in the world of memory chips we call a program operation. As discussed, these program operations take place at the page level. You can then read your fresh data back out again with read operations, which also take place at the page level. [Having said that, the instruction to read a page places the data from that page in a memory register, so your reading process can in fact then selectively access subsets of the page if it desires - but maybe that's going into too much detail...]

Where it gets interesting is if you want to update the data you just wrote. There is no update operation for flash, no undo or rewind mechanism for changing what is currently in place, just the erase operation. It’s a little bit like an etch-a-sketch, in that you can continue to turn the dials and make white sections of screen go black, but you cannot turn black sections of screen to white again without erasing the entire screen. An erase operation on a flash chip clears the data from all pages in the block, so if some of the other pages contain active data (stuff you want to keep) you either have to copy it elsewhere first or hold off from doing the erase.

In fact, that second option (don’t erase just yet) makes the most sense, because the blocks on a flash chip can only tolerate a limited number of program and erase operations (known as the program/erase cycle or PE cycle because for obvious reasons they follow each other in turn). If you were to erase the block every time you wanted to change the contents of a page, your flash would wear out very quickly.
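
A back-of-envelope comparison shows why. The sketch below counts erases under two deliberately crude assumptions: the naive approach erases the block once for every page update, while the patient approach only needs an erase once a whole block's worth of pages has been used up. It ignores the cost of copying data you want to keep, so the second figure is a best case, and the numbers in the example are arbitrary.

```python
def naive_erases(updates: int) -> int:
    """Erase the block every time a page changes: one erase per update."""
    return updates


def deferred_erases(updates: int, pages_per_block: int) -> int:
    """Write each update to a fresh page; a block is only due for an erase
    once all of its pages have been used (best case, no copying of live data)."""
    return updates // pages_per_block


updates = 10_000          # arbitrary example workload
print(naive_erases(updates))            # 10000 erases
print(deferred_erases(updates, 512))    # 19 erases
```

Same workload, same data, but several hundred times fewer erase cycles consumed.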

So a far better alternative is to simply mark the old page (containing the out-of-date data) as INVALID and then write the new, changed data to an empty page. All that is required now is a mechanism for pointing any subsequent access operations to the new page and a way of tracking invalid pages so that, at some point, they can be “recycled”.

NAND-flash-page-update

Updating a page in NAND flash. Note that the new page location does not need to be within the same block, or even the same flash die. It is shown in the same block here purely for ease of drawing.
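
The update scheme just described can be sketched in Python as follows. This is a toy of my own and not how any real product does it: the class name and page states are invented, it picks the first empty page it finds, and it makes no attempt at wear levelling or recycling invalid pages.

```python
class TinyFTL:
    """Toy logical-to-physical page map with out-of-place updates."""

    EMPTY, VALID, INVALID = "EMPTY", "VALID", "INVALID"

    def __init__(self, num_pages: int):
        self.data = [None] * num_pages
        self.state = [self.EMPTY] * num_pages
        self.mapping = {}                 # logical page -> physical page

    def write(self, logical: int, payload) -> int:
        physical = self._find_empty()
        old = self.mapping.get(logical)
        self.data[physical] = payload
        self.state[physical] = self.VALID
        if old is not None:
            self.state[old] = self.INVALID   # old copy stays until its block is erased
        self.mapping[logical] = physical     # later reads are diverted to the new page
        return physical

    def read(self, logical: int):
        return self.data[self.mapping[logical]]

    def invalid_pages(self) -> list:
        """Pages waiting to be recycled."""
        return [i for i, s in enumerate(self.state) if s == self.INVALID]

    def _find_empty(self) -> int:
        for i, s in enumerate(self.state):
            if s == self.EMPTY:
                return i
        raise RuntimeError("no empty pages left; an erase is needed")


ftl = TinyFTL(num_pages=4)
ftl.write(0, "first version")
ftl.write(0, "second version")
print(ftl.read(0))            # second version
print(ftl.invalid_pages())    # [0]
```

Keep writing to this toy and it eventually raises the "no empty pages left" error, which is the point at which erases can no longer be put off.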

This “mechanism” is known as the flash translation layer and it has responsibility for these tasks as well as a number of others. We’ll come back to it in subsequent posts because it is a real differentiator between flash products. For now though, think about the way the device is filling up with data. Although we’ve delayed issuing erase operations by cleverly moving data to different pages, at some point clearly there will be no empty pages left and erases will become essential. This is where the bad news comes in: it takes many times longer to perform an erase than it does to perform a read or program. And that clearly has consequences for performance if not managed correctly.

In the next post we’ll look at the differences in time taken to perform reads, programs and erases – which first requires looking at the different types of flash available: SLC, MLC and TLC…

[* Technical note: Ok so actually when a NAND flash page is empty it is all binary ones, e.g. 11111111. A program operation sets any bit with the value of 1 to 0, so for example 11111111 could become 11110000. This means that later on it is still possible to perform another program operation to set 11110000 to 00110000 for example. Until all bits are zero it's technically possible to perform another program. But hey, that's getting a bit too deep into the details for our requirements here, so just pretend you never read this...]
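
For those who did read the technical note, the "ones can become zeros but never the reverse" rule behaves like a bitwise AND. Here is a minimal sketch using the byte values from the note; the function names are mine and a single byte stands in for a whole page.

```python
def program(current: int, data: int) -> int:
    """Programming can only clear bits (1 -> 0), which is a bitwise AND."""
    return current & data


def can_program(current: int, target: int) -> bool:
    """True if `target` can be reached from `current` without an erase,
    i.e. the target never needs a 1 where the cell already holds a 0."""
    return (target & ~current) == 0


page = 0b11111111                      # freshly erased
page = program(page, 0b11110000)       # first program
page = program(page, 0b00110000)       # second program, still no erase needed
print(format(page, "08b"))             # 00110000
print(can_program(page, 0b11111111))   # False: only an erase brings the ones back
```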
