Understanding Flash: Garbage Collection Matters

garbage-collection

In the last post in this series I discussed some of the various tasks that need to be performed by the flash translation layer – the layer of abstraction that sits between us and the raw NAND flash on which we desire to store our data. One of those tasks is the infamous garbage collection process (or “GC”) – and in these next couple of posts I’m going to look into GC a little deeper.

But first, let’s consider what would happen on a flash system if there were no GC at all.

Why We Need Garbage Collection

Let’s imagine a very simple flash system comprising of just five blocks, each with just ten pages. In the beginning, all the pages are empty, so we can say that 100% of the capacity is Free:

garbage-collection-fill1

 

There’s no point in wasting all that lovely flash, so I’m going to write some data to it. Remember that a write operation in the world of flash is known as a program operation and it takes place at the page level. So here goes:

garbage-collection-fill2

 

You can see a couple of things happening here. First of all, some of the capacity is now showing as Used. Secondly, you can see that our wear levelling algorithm has kicked in to spread the data over all the blocks fairly evenly. Some blocks have more used pages than others, but the wear levelling doesn’t have to be exact, just approximate. Let’s add some more data:

garbage-collection-fill3

 

Now we’re at 50% Used – and again the data has been spread evenly for the purposes of wear levelling. The flash translation later is also taking care of logical to physical block mapping, so although data has been spread across multiple blocks it has the appearance of being sequential when accessed by the calling application.

Time For Some Updates

So far so good. But at some point we’re probably going to want to update some of the existing data. And as we know, on NAND flash that’s not straightforward because used pages cannot be updated without the entire block first being erased. So for now we simply write the updated data to an empty page and mark the old page as stale. The logical to physical block mapping allows us to simply redirect the logical address of an updated block to the physical address of the new page:

garbage-collection-fill4

See how some of the blocks which were showing as Used (in red) are now showing as Stale (in yellow)? And of course the capacity graph is now showing a percentage of the total capacity as being stale. But have you also noticed that we are approaching the point where we can never free up the stale pages?

Never forget that NAND flash can only be erased at the block level. In the above example, the first (left-most) block has two stale pages and four used pages. In order to free up those stale pages I need to copy the used pages somewhere else and then erase the block. If I don’t do that very soon, I’m going to run out of free pages in the other blocks and then … I’m screwed:

garbage-collection-fill5

This situation is a disaster, because at this point I can never free up the stale blocks, which means I’ve effectively just turned my flash system into a read only device. And yet if you look at the capacity graph, it’s only around 75% full of real data (the stuff in red). So does this mean I can never use 100% of a flash device?

The Return of Over-Provisioning

Earlier in this Storage for DBAs blog series I wrote an article about one of my least favourite practices in the world of hard disk drives: over-provisioning. I now have to own up and confess that in the world of flash we do pretty much the same thing, although for slightly different reasons (and without the massive overhead of power, cooling and real-estate costs that makes me so mad with disk).

To ensure that there is always enough room for garbage collection to continue (as well as to allow for things like bad blocks, wear levelling etc), pretty much every flash device vendor (from SSDs through to All Flash Arrays) uses the same trick:

garbage-collection-overprovision

Yep, there’s more flash in your device than you can actually see. For example, rip off the covers of a 400GB HGST Ultrastar SSD800MM (the SSD that used to be found in XtremIO arrays) and you’ll find 576GB of raw flash. There’s 44% more capacity on the device than it advertises when you access it. This extra area is not pinned by the way, it’s not a dedicated set of blocks or pages. It’s just mandatory headroom that stops situations like the one we described above ever occurring.

The Violin Way

Violin Intelligent Memory Module (VIMM)Since I work for Violin Memory it would be remiss of me not to discuss the concept of format levels at this point. When you buy flash in the form of SSDs, or flash arrays that contain SSDs, the amount of over-provisioning is generally fixed. Almost all SSDs have a predefined over-provision area and that cannot be changed (if there is an SSD on the market that allows this, I am not aware of it).

One of the advantages we have at Violin is that we build our arrays from the ground up using NAND flash chips rather than pre-packaged SSDs – our flash is located on our unique Violin Intelligent Memory Modules (VIMMs). This means we have control down to the NAND flash level – and one of the benefits here is that we can choose how much over-provisioning is required based on workload. At Violin we call this the format level of an array and we allow customers to set it when they take delivery (it can also be changed at a later time but obviously requires a reformat of the array).

I see this as a great benefit because some workloads are very read-intensive and so a large over-provision would essentially be a waste of good flash. Conversely some workloads are so write-heavy that the hard limits found on an SSD device would not be suitable. trashcanIn general I believe that choice (as long as it comes with advice) is a good thing.

In the next post we’ll look at garbage collection some more, in particular the concept of background versus foreground garbage collection, as well as cover the infamous write cliff. But for now, just remember: no matter what form of flash you are using, in order for it to work and perform properly, something somewhere has to take out the trash…

Advertisements

6 Responses to Understanding Flash: Garbage Collection Matters

  1. Bart Sjerps says:

    Nice post! Liked the pic with the garbage trucks :p

    Overthinking all this, I argue that garbage collection is a misnomer. The garbage collection does not actually collect garbage. It collects chunks of good data, glues it together in a more compact and efficient way, and then thrashes the remaining garbage completely (by clearing the flash cells) to free up the space for future writes with new valuable data…

    • flashdba says:

      Hello Bart!

      I agree. I blame those Java developers for inventing the phrase, which the flash industry seems to have adopted. In fact, my esteemed Violin colleague Steve Willson gives an excellent talk on flash management in which he shows garbage collection as essentially a massive game of Tetris where blocks need to be fitted into the available holes in order to reach the best efficiency. I love that image and may even steal it for the next post.

      Thanks for stopping by 🙂

      • Steve says:

        Would you mind linking the talk by your esteemed Violin colleague Steve Wilson? I would be interested to watch it!

        • flashdba says:

          I’m afraid that talk wasn’t recorded. Steve now works for DellEMC’s DSSD business unit and so can be found travelling around Europe spouting nonsense to anyone with a multimillion dollar budget.

          I learnt a lot from Steve during our time working together at Violin. One of the main things I learnt was that after delivering any kind of pitch or presentation, upon returning to the office and being asked by your colleagues “How did it go?” one should always employ the standard response: “I was brilliant.”

  2. Pingback: learning about all flash arrays | SSKANTH

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s