A couple of posts ago in this series, I explained how a NAND flash die is comprised of planes, which contain blocks, which contain pages… which contain individual cells of data. Read operations take place at the page level, as do write operations (although we call them program operations in the flash world). But crucially, erase operations take place at the block level and so affect multiple pages.
Erases are also slow (at least relative to reads and writes) and cause wear of the flash media, gradually moving it closer to its end of life. It’s therefore the case that when you want to update an existing page of data it is faster, simpler and less damaging to simply write the updated information to an empty page. If you’re going to do that, you will probably also want to choose a new page somewhere completely different from the old one to ensure that your flash wears out evenly. And hey, don’t forget that the block containing the old page will need to be erased at some point before it can be reused.
- write updated information to a new empty page and then divert all subsequent read requests to its new address
- ensure that newly-programmed pages are evenly distributed across all of the available flash so that it wears evenly
- keep a list of all the old invalid pages so that at some point later on they can all be recycled ready for reuse
This mechanism is called the flash translation layer (FTL) and you will find it on all flash media if you look hard enough. The FTL has a number of responsibilities, so let’s look at them now.
Logical Block Mapping
Abstraction is everywhere in computing. The URL of this website is a logical address which maps to a physical address, i.e. the IP address. An IP address is in fact a logical address which maps to a physical address, i.e. the MAC address of a network interface. There are so many examples of abstraction in technology that sometimes I think the whole world of computing is just one massive layer of abstraction.
In storage, the idea of a logical block address (LBA) has been around since long before NAND flash and is primarily used to make addressing simpler and more flexible. Like all abstraction concepts it exists to make the (potentially complex) management of a low level system invisible to the higher levels that consume its services. For example, if a hard drive within a RAID group fails and the data it contained has to be rebuilt on a hot spare, the physical block address of certain blocks of data will change. To avoid the need to notify all possible interested parties (e.g applications, databases, etc) of the new address, the extra layer of logical block addressing is used; the map from LBA to PBA is amended and nobody else needs to know.
In a flash system this same mechanism can be used for updates, so that when a page is considered invalid the logical block address can be remapped to the newly-programmed page. This provides the solution to number 1 in our list above in a way that is both simple and transparent to anyone issuing I/O requests.
On the face of it, wear levelling seems like a fairly simple method of handling number 2 in our list. You have a predefined number of flash blocks, each of which can be programmed and erased (known as the P/E cycle) a similar number of times before they are no longer usable. Clearly the object of any wear levelling algorithm is to smoothly distribute all P/E cycles so that the blocks all reach their limit at the same time. Without wear levelling it’s entirely possible that a subset of blocks could receive the majority of P/E cycles and thus wear out very quickly, reducing the available capacity of the system.
If all blocks in a system were regularly updated this would be no problem, because wear levelling would happen almost naturally as pages are marked invalid and then recycled. But here’s the problem: if we have some cold blocks, i.e. locations where the data never changes, then we have to take steps to manually relocate that data otherwise those blocks won’t ever wear… and that means we are actually adding write workload to the system, which ultimately means increasing the wear.
So to put it in simple terms, the more aggressive we are about wear levelling evenly, the more wear we cause. But not being aggressive enough could result in hot and cold spots as wear becomes more uneven. As always, it’s a question of finding the right balance. Or, if you prefer, finding the write balance.*
The third requirement in our list is a way of recycling pages that are no longer required (i.e. invalid) but have not yet been erased. Of course you cannot simply erase them at your leisure, because flash requires the entire containing block to be erased too. So instead it’s necessary to consider the remaining contents of the block and – if necessary – transparently move some active data elsewhere. Once the block has no remaining active data in it, it can be erased and then becomes ready to use again.
The thing is, the block in question might have 128 or 256 pages in it, many of which contain active data. You might have to go to a lot of effort in order to recycle just a small number of invalid pages – and just like with wear levelling, the act of moving data around in the background will have consequences to things like performance and endurance. Again it’s a question of finding the right balance.
Garbage Collection is such an important part of the management of flash that I’m going to devote a whole post to it next in the series. But for now, consider this: what happens if you fill up your trash faster than it gets taken away? You run out of space in your bins and everything starts to smell pretty bad. We definitely need to make sure that doesn’t happen here…
Now you might have noticed that a lot of the processes described here result in additional operations taking place on the flash media. Wear Levelling can relocate inactive (cold) pages in order to ensure they wear evenly in comparison to active (hot) pages. Garbage Collection can result in pages being moved from a block which will subsequently be erased. In short, stuff is happening in the background as a consequence of the actions taking place on the host – which we will call the foreground in order to distinguish it. What’s more, if you are looking at things from the point of view of the foreground – like, for example, a database server accessing flash storage – you have no visibility of what’s going on in the background.
It doesn’t matter what OS monitoring tools you run on the host (iostat, for example, or dstat), you will only see the foreground I/O operations that the host knows about. This is important because the performance and endurance of your flash storage is dependent on the sum of foreground and background operations.
We call this phenomena write amplification and we can express it as a value using the following formula:
A higher value indicates an increased workload on the flash storage system, which is likely to mean reduced endurance and performance. For this reason, if you are testing any sort of flash system (from the mighty All Flash Array to a simple SSD) you should make every effort to observe the workload within the storage system as well as from the host. Of course with the humble SSD that’s often not possible…
Where Is Your FTL?
As a final thought, earlier on I said about the FTL that “you will find it on all flash media if you look hard enough“. What did I mean? Well, the FTL has a lot of duties to perform, as we’ve seen. That requires effort, which for a computer equates to processing power and memory. In most situations (e.g. All Flash Arrays) this happens in firmware under the covers. But with some products, for example certain PCIe flash cards, it’s possible that some or all of the FTL functionality runs on the host, using host CPU cycles and DRAM.
In principle there may be benefits and drawbacks of either method (host-based FTL or array-based FTL), but if you are running a database on your server they become insignificant in relation to the cost. After all, those processors in your database servers? The ones that affect your core-based licenses for Oracle or other database software? They are the most expensive CPUs you own. If you are donating CPU cycles from these cores to manage your flash, you are effectively throwing away license money. And you probably feel you pay enough to your database vendor already, right?
* Seriously, if you didn’t think that was a great pun then you’re reading the wrong blog.