In the last post in this series I discussed some of the various tasks that need to be performed by the flash translation layer – the layer of abstraction that sits between us and the raw NAND flash on which we desire to store our data. One of those tasks is the infamous garbage collection process (or “GC”) – and in these next couple of posts I’m going to look into GC a little deeper.
But first, let’s consider what would happen on a flash system if there were no GC at all.
Why We Need Garbage Collection
Let’s imagine a very simple flash system comprising of just five blocks, each with just ten pages. In the beginning, all the pages are empty, so we can say that 100% of the capacity is Free:
There’s no point in wasting all that lovely flash, so I’m going to write some data to it. Remember that a write operation in the world of flash is known as a program operation and it takes place at the page level. So here goes:
You can see a couple of things happening here. First of all, some of the capacity is now showing as Used. Secondly, you can see that our wear levelling algorithm has kicked in to spread the data over all the blocks fairly evenly. Some blocks have more used pages than others, but the wear levelling doesn’t have to be exact, just approximate. Let’s add some more data:
Now we’re at 50% Used – and again the data has been spread evenly for the purposes of wear levelling. The flash translation later is also taking care of logical to physical block mapping, so although data has been spread across multiple blocks it has the appearance of being sequential when accessed by the calling application.
Time For Some Updates
So far so good. But at some point we’re probably going to want to update some of the existing data. And as we know, on NAND flash that’s not straightforward because used pages cannot be updated without the entire block first being erased. So for now we simply write the updated data to an empty page and mark the old page as stale. The logical to physical block mapping allows us to simply redirect the logical address of an updated block to the physical address of the new page:
See how some of the blocks which were showing as Used (in red) are now showing as Stale (in yellow)? And of course the capacity graph is now showing a percentage of the total capacity as being stale. But have you also noticed that we are approaching the point where we can never free up the stale pages?
Never forget that NAND flash can only be erased at the block level. In the above example, the first (left-most) block has two stale pages and four used pages. In order to free up those stale pages I need to copy the used pages somewhere else and then erase the block. If I don’t do that very soon, I’m going to run out of free pages in the other blocks and then … I’m screwed:
This situation is a disaster, because at this point I can never free up the stale blocks, which means I’ve effectively just turned my flash system into a read only device. And yet if you look at the capacity graph, it’s only around 75% full of real data (the stuff in red). So does this mean I can never use 100% of a flash device?
The Return of Over-Provisioning
Earlier in this Storage for DBAs blog series I wrote an article about one of my least favourite practices in the world of hard disk drives: over-provisioning. I now have to own up and confess that in the world of flash we do pretty much the same thing, although for slightly different reasons (and without the massive overhead of power, cooling and real-estate costs that makes me so mad with disk).
To ensure that there is always enough room for garbage collection to continue (as well as to allow for things like bad blocks, wear levelling etc), pretty much every flash device vendor (from SSDs through to All Flash Arrays) uses the same trick:
Yep, there’s more flash in your device than you can actually see. For example, rip off the covers of a 400GB HGST Ultrastar SSD800MM (the SSD that used to be found in XtremIO arrays) and you’ll find 576GB of raw flash. There’s 44% more capacity on the device than it advertises when you access it. This extra area is not pinned by the way, it’s not a dedicated set of blocks or pages. It’s just mandatory headroom that stops situations like the one we described above ever occurring.
The Violin Way
Since I work for Violin Memory it would be remiss of me not to discuss the concept of format levels at this point. When you buy flash in the form of SSDs, or flash arrays that contain SSDs, the amount of over-provisioning is generally fixed. Almost all SSDs have a predefined over-provision area and that cannot be changed (if there is an SSD on the market that allows this, I am not aware of it).
One of the advantages we have at Violin is that we build our arrays from the ground up using NAND flash chips rather than pre-packaged SSDs – our flash is located on our unique Violin Intelligent Memory Modules (VIMMs). This means we have control down to the NAND flash level – and one of the benefits here is that we can choose how much over-provisioning is required based on workload. At Violin we call this the format level of an array and we allow customers to set it when they take delivery (it can also be changed at a later time but obviously requires a reformat of the array).
I see this as a great benefit because some workloads are very read-intensive and so a large over-provision would essentially be a waste of good flash. Conversely some workloads are so write-heavy that the hard limits found on an SSD device would not be suitable. In general I believe that choice (as long as it comes with advice) is a good thing.
In the next post we’ll look at garbage collection some more, in particular the concept of background versus foreground garbage collection, as well as cover the infamous write cliff. But for now, just remember: no matter what form of flash you are using, in order for it to work and perform properly, something somewhere has to take out the trash…