Understanding Flash: The Fall and Rise of Flash Memory

grave

This month sees the four year anniversary of some interesting events. Commonwealth countries around the world celebrated the Diamond Jubilee of Queen Elizabeth II. Whitney Houston was tragically found dead in a Beverly Hills hotel. The Caribbean was hit hard by sargassum seaweed invasion. And I made the decision to leave the comfort of Oracle databases and join the exciting new All-Flash Array industry.

Ok, I might have been stretching the use of the word “interesting” there. But for those with an interest in flash memory, February 2012 was still a very important month due to the publication of a research paper co-authored by the University of California’s Department of Computer Science and Engineering and Microsoft Research.

The paper was entitled The Bleak Future of NAND Flash Memory – and it wasn’t pleasant reading for somebody who had just abandoned a career in databases to bet everything on flash.

The Death of Flash Memory

Rest In PeaceI have never spoken to the authors of this paper so I don’t know where the “Bleak Future” title came from, but it seems reasonable to say that it was somewhat more inflammatory than the content. In the body of the paper, the authors examined the behaviour of NAND flash memory chips as the lithography shrank – and also as the number of bits per cell increased from SLC through MLC to TLC. At the time of publication the authors were examining 25nm technology but it was already obvious that this form of NAND (known as 2D planar NAND) was going to hit physical limitations beyond which it could no longer shrink. This is known in the semiconductor world as the scaling limit.

The paper concluded:

“SSDs will continue to improve by some metrics (notably density and cost per bit), but everything else about them is poised to get worse. This makes the future of SSDs cloudy”

This sentiment, along with the “bleak future” thing, caused a bit of a stir in the tech world. TheRegister, for example, ran a typically tongue-in-cheek headline: “Flash DOOMED to drive itself off a cliff – boffins“. Various industry bloggers discussed the potential of technologies like ReRAM to take over for the next decade, while HP made it’s annual claim that Memristor technology (a form of ReRAM) would soon be here to save the day. I started wondering if I should register the domain name ReRamDBA.com

The Resurrection – Now Showing in 3D

Four years later, ReRAM is still just around the corner but now in the form of Intel and Micron’s 3D XPoint technology, while HP has significantly backtracked on its Memristor programme. Flash memory, meanwhile, is still going strong thanks to the introduction of vertical or 3D NAND.

“Reports of my death have been greatly exaggerated” – Mark Twain

Of course, hindsight is a wonderful thing. It’s easy to look back now at the publication of the Bleak Future… paper and consider it flawed. To see the flaws at the time of publication would have required a bit more thought.

So that’s why this month’s hero is Allyn Malventano, Storage Editor for PC Perspective, who published an article on 21st February 2012 (the same month!) called NAND Flash Memory – A Future Not So Bleak After All in which he described the original publication as “bad science”. Allyn’s conclusion was so prescient that I’m going to quote it right here (although you should read the whole article to get the full context):

“The point I want all of you to take home here is that just as with the CPU, RAM, or any other industry involving wafers and dies, the manufacturers will adapt and overcome to the hurdles they meet. There is always another way, and when the need arises, manufacturers will figure it out.”

Bravo. Samsung is now manufacturing its third generation V-NAND chips, while the Toshiba/SanDisk and Intel/Micro partnerships are both going 3D. Samsung’s V-NAND has already moved from 24 through 32 to 48 layers, while it has been theorised that there is no natural limit on the number of layers possible.

3d-xpointOf course, there’s always the spectre of a new technology sweeping everything before it – and the big story right now is Intel/Micron’s 3D XPoint technology. Will it take over from flash in the future? Who knows.

One thing I do know is that new technologies find their rightful place when they are both technically capable and economically viable. If 3D XPoint or any other non-volatile memory product can win the day, it will leave us all better off – and hopefully without the need for alarmist research papers.

Now, if you’ll excuse me, I’m off to check on the availability of the 3D-XPointDBA.com domain name…

(You can read more about 3D NAND here)

Understanding Flash: What is 3D NAND?

grid-cube-3d

About 18 months ago I wrote a post describing the different types of NAND flash known as SLC, MLC and TLC. However, 18 months is a lifetime in the world of technology so now I need to clarify it based on the widespread adoption of a new type of NAND flash. Let me explain…

Recap: 2D Planar NAND

Until recently, most of the flash memory used for data storage was of a form known as 2D Planar NAND and could be found in three types called Single Level Cell (SLC), Multi-Level Cell (MLC) and TLC (Triple-Level Cell). I always used to use my bucket of electrons analogy to describe the difference between them:

slc-mlc-tlc-buckets

Each cell within planar NAND flash memory stores charge in a way similar to how a bucket stores water. By considering an imaginary line half-way up the bucket we can assign a binary one or zero based on whether the bucket contains more or less water than the line. Thus a full bucket, or a fully-charged NAND cell, denotes a zero while an empty bucket / cell denotes a one… assuming we are considering SLC, where each bucket stores one bit.

Moving to MLC (two bits) or TLC (three bits) is therefore a case of adding more lines, allowing us to differentiate between more states within the same bucket. The benefit is double (MLC) or quadruple (TLC) density but the drawback is that there will be a lower margin for error when measuring the amount of water/charge stored. As a consequence, the actions of reading, writing and erasing take longer while the endurance of the cell also drops drastically (leaky buckets are more of a problem as you try to be more precise about the measurements). The original article covers this all in more detail.

Shrinking Lithographies

If you remember, I also talked about the way that flash memory manufacturers are constantly shrinking the size of NAND flash cells in order to make increasingly dense packages, thus reducing the cost – but that the technology was now approaching its physical limits. In the bucket example, just imagine that the buckets are getting smaller and smaller. This is initially a good thing because smaller buckets (actually floating gate transistors) mean more buckets can fit in the same overall space, but in time the buckets become so small that they are no longer manageable – and then the technology hits a brick wall.

So Why Is It 2D?

In NAND flash memory, sets of cells are connected together in a string to form a NAND gate:

NAND-flash-structure

Image courtesy of Warren Miller at Avnet

If you consider one of the pieces of silicon substrate contained inside a flash chip as a rectangle with dimensions X and Y, each one of these strings of cells will take up some space stretching out in one of these two dimensions. Shrinking the lithography, i.e. manufacturing everything on a smaller scale, will give us the opportunity to fit more strings on the same about of substrate. But as we previously discussed, there comes a point when things are simply too small and too close together, resulting in interference and leakage.

3D NAND: Going Vertical

Image courtesy of Kristian Vättö at AnandTech

Image courtesy of Kristian Vättö at AnandTech

The cost of a semiconductor is proportional to the die size. It is therefore a good thing for the cost if more electronics can be crammed into the same tiny piece of silicon. The fundamental difference in 3D NAND, which gives rise to its name, is that the strings previously described are now arranged vertically – another words in the Z dimension. For this reason, Samsung calls the technology V-NAND.

Imagine the string of cells shown earlier, but this time stood on its end and then folded in two to make a U shape. We now have a vertical string which takes up only a fraction of the original space in the X and Y dimensions. What’s more, we can continue to build in the Z dimension as manufacturing processes allow. Samsung’s first generation of V-NAND had strings of 24 layers, while the second generation had 32. The latest 3rd generation now has 48. And as Jim Handy explains, there are few theoretical limits on the number of layers possible. (Just to be clear, these layers are all within the same “wafer” of silicon, otherwise there would be no cost benefit…)

Crucially, since the move to a Z dimension relieves the pressure on the X and Y dimensions, 3D flash is actually free to return to a slightly larger lithography, thus avoiding all of the nasty problems that 2D planar NAND was starting to hit as it approached the 10nm range.

Charge Trap Flash and 3D TLC

Aside from the vertical stacking, there is another fundamental change with 3D NAND – it no longer uses floating gate transistors (yes, that’s right, the buckets from earlier). Instead, it uses a technology called Charge Trap Flash. cheese-iconI’m not going to attempt an explanation of CTF here, but it was memorably described by Samsung as like using cheese instead of water. So, instead of the buckets from earlier, picture cheese.

This cheese has a number of benefits over floating gate transistors in terms of endurance and power consumption, but it still works in a similar way in terms of the number of bits that can be stored – in other words SLC, MLC and TLC. However, because of its better endurance rates, with 3D NAND it is now a realistic proposition to use TLC to replace 2D planar MLC (something my employer Kaminario has already embraced).

This is big news. The cost per density of 3D TLC NAND flash is revolutionary, with plenty of room for further developments as the flash fabricators add more layers. Three years ago it looked like NAND flash was a technology in terminal decline, but with 3D techniques the future is bright. We might even get to a point soon where we see the introduction of…

Quad-Level Cell (QLC) Flash

If the endurance of CTF-based 3D NAND is acceptable, it’s not hard to envisage one of the flash fabricators releasing a quadruple-level cell version of the medium. The potential benefit is an order-of-magnitude increase in density for roughly the same cost.

After all, everybody wants more cheese… right?

Understanding Flash: Summary – NAND Flash Is A Royal Pain In The …

chaos-order

So this is it – the last article in my mini-series on understanding flash. This is the bit where I draw it all together in a neat conclusion that makes you think, “Yes! That was worth reading”. No pressure eh?

So let me start with the conclusion first: as a storage medium, NAND flash is a royal pain in the ass.

Chaos

Why? Well, let’s look back at what we’ve learned in the previous 9 articles:

In short, NAND flash is a tricky medium to use for enterprise storage. A whole lot of work is required to make a collection of flash chips appear to be a unified, resilient block of storage with fast, predictable performance.

And I haven’t even told you everything. Consider, for example, the phenomenon of read disturb. When you read a page within a NAND flash chip, you cause a very minor electronic field in the locality of the cells it contains. That field will cause a small disturbance to any neighbouring cells – usually not enough to cause concern, but significant nevertheless. So what happens when you repeatedly read that page? Eventually, after X number of reads, the data stored within the nearby cells becomes questionable.

NAND-flashThe solution, therefore, is to keep track of the number of times each page is disturbed in this manner and then set a threshold (let’s say 50 disturbances) beyond which you will copy the data out to a clean page and then mark the old page as stale. Easy.

But just think about what that means for a moment. Remember when I said that write amplification was mainly impacted by write workloads? This new piece of information means that even on a 100% read workload there will be additional back-end writes taking place on the array. Just another example of why flash is a tricky medium to manage.

Order

Of course, it would be remiss of me not to mention that NAND flash brings a tremendous set of benefits along with these problems. You could say they come as a package (oh come on, that was one of my better puns).

Let’s go back to basics for a moment: if you want to take a defined quantity of work and do it in a shorter amount of time, what are your choices? Put simply, there are two options: do the same work faster, or do more of it in parallel (and of course both options can be used together for extra gain).

The basic building block of a disk array is, obviously, the hard disk drive. I’ve already explained at tedious length about the performance gap between disk and flash, so we know that we can access data faster using flash. Technologies like RAID allow multiple disks to be used in parallel to achieve performance (and resilience) gains, but given a limited amount of physical space (such as a data centre rack), how many hard drives can you actually squeeze into one system?

Now compare this to the number of NAND flash packages you could fit into the same space, all of which you could potentially utilise in parallel and at a lower latency. Doing the same work faster – and doing more of it in parallel.

Image courtesy of Google Inc.

Image courtesy of Google Inc.

And there’s more. Those clunky great big cabinets of disk use up horrendous amounts of power just to spin those little rotating platters – with much of the energy converted to heat and noise: waste. The heat results in a requirement for additional cooling, which uses even more power: more waste. And it all takes up so much physical space that data centres become overrun with storage.

In contrast, all flash arrays (AFAs) require less power, less cooling and take up less physical space: it’s not uncommon for customers to pay for the move to flash simply by avoiding the need to build a new data centre or extend an existing one. In summary, the net cost of using flash is now less than that of using disk.

When I first started writing this blog back in 2012 there was still a debate over whether flash would replace disk for enterprise storage. That debate was over some time ago: flash has already won.

Architecture Matters

So this post marks the end of my journey into explaining and understanding NAND flash. Yet there is a whole new area which needs exploring: the architecture of all flash arrays.

Enterprise storage needs be safe, reliable, predictable and fast. Yet at a package level, NAND flash is a tricky little beast that has to be constantly watched to make sure it behaves itself. There’s a dichotomy here: how do we use the latter to deliver the former? How do we take a component designed for consumer electronics and use it to build an enterprise-class AFA? In short, how we derive order from chaos?

architectureThe answer is in the architecture. At the time of writing this blog there are a number of AFA vendors on the market, each with a different approach to taming the beast. Apart from my own employer, Violin Memory, there is EMC, IBM, HDS, Pure Storage, SolidFire, Kaminario and a whole load more.

And that’s why this industry is so interesting to me. Everybody is trying to do this differently, although you can broadly categorise the solutions into three distinct ranges: hybrid arrays, SSD-based arrays and ground-up arrays. Everybody thinks their way is right – and nobody can afford to be wrong. The market for flash-based primary storage is huge and growing all the time: the winners get unparalleled success, while the losers … are simply left in disarray*

*I won’t lie – I’m so proud of that pun I’m going to award myself a couple of weeks off.

Understanding Flash: Fabrication, Shrinkage and the Next Big Thing

Semiconductor Fabrication Plant (picture courtesy of SemiWiki.com)

Semiconductor Fabrication Plant (picture courtesy of SemiWiki.com)

Before I draw this series on Understanding Flash to a close, I wanted to briefly touch on the subject of manufacturing. Don’t worry, I’ve taken heed of the kind feedback I had after my floating gate transistor blog post (“Please stop talking about electrons!“) and will instead focus on the commercial aspects, because ultimately they affect the price you will be paying for your flash-based storage. If you’re not familiar with the way things are you might be surprised…

The Supplier Landscape

Semiconductor chips, such as the NAND flash packages found in your storage, phones and tablets, are manufactured in fabrication plants known as “fabs” or flash “foundries”. To say that fabs are not cheap to build would be somewhat of an understatement – they are mind-bogglingly, ludicrously expensive.

In 2014 the global semiconductor industry posted record sales totalling $335.8 billion. That’s the entire semiconductor industry, not just the subset that produces NAND flash… and I think you’ll agree that’s quite a lot of money. But to put that figure into perspective, when Samsung decided to build an entirely new fab in late 2014, it had to commit $15 billion for a project that won’t be completed “until mid-2017”.

Clearly a fab is an eye-watering investment – and it is mainly for this reason that there are (at the time of writing) only six key companies worldwide who run flash foundries. What’s more, because of that staggering cost, four of those six are working together in pairs to share the investment burden. The four teams are:

  • Toshiba and SanDisk
  • Samsung
  • Intel and Micron
  • SK Hynix

With only four sets of fabs in operation, the market is hardly awash with an abundance of NAND flash – which of course suits the fab operators just fine as it keeps the price of flash higher. Oversupply would be unsustainable for an industry with such high costs.

Process Shrinking

Talking of costs, the fab operators are never allowed to stand still because of the relentless drive to make things smaller – Moore’s Law doesn’t just apply to processors; NAND flash is a semiconductor too. The act of taking the design of a microchip and reducing it in size is known as a process shrink – and it brings all sorts of benefits. Remember that a NAND flash cell is basically constructed from floating gate transistors? Well if those transistors are reduced in size:

  • Less current is needed per transistor, reducing the overall power consumption of the flash
  • This in turn reduces the heat output of an identical design with the same clock frequency (or alternatively the clock frequency can be increased)
  • Most importantly, more flash dies can be produced from the same silicon wafers (the raw material), resulting in either reduced costs or increased density

It may sound like smaller is better, but as always there is a flip side. Retooling a flash foundry to move to a smaller process geometry takes time and money – and the return on investment for the fab is reduced with each shrink. But we’ll come back to that in a minute.

Process Geometries

The different sizes in which flash is fabricated are known as process geometries. Traditionally, they are defined by measuring the distance (in nanometers) between the source and drain of each transistor (although these days this practice has become much less well-defined). In a strange quirk of algebra, the two digit numbers are written as <digit><letter> so that, for example, the first geometry to be in the range 29-20nm (e.g. 25nm) is called 2X, then a second in that range (e.g. 21nm) is called 2Y. Once into the teens (e.g. 19nm) we go back to 1X. [Although interestingly, Toshiba and SanDisk have a second generation NAND which they call 1Y to distinguish it from first-generation 1X even though both are 19nm]

Here’s the current technology roadmap for NAND flash courtesy of my friends at TechInsights:

Technology Roadmap for NAND Flash (Image courtesy of TechInsights)

Technology Roadmap for NAND Flash (Image courtesy of TechInsights)

So what’s the problem with flash getting smaller? Well, hopefully for the very last time, cast your mind back to the concept of floating gate transistors. On these tiny devices the floating gates effectively capture and store electrons that are passing from one side of the transistor to another, retaining charge – and that charge determines the cell’s value of 0 or 1. But at the tiny geometries that flash is approaching now, the number of electrons involved is desperately small, resulting in large margins for error. Electrons, after all, are notoriously good at hide and seek.

Floating Gate Limitations (SK Hynix Presentation - Aug 2012)

Floating Gate Limitations (SK Hynix Presentation – Aug 2012)

This means that flash cells are less reliable, that even more error correction codes are required… and that ultimately, we are reaching what is known as the scaling limit. It looks like the end is it sight for flash as we know it.

3D NAND Flash

Except it isn’t. If you look again at the Technology Roadmap picture you’ll see entries appearing alongside each foundry operator for 3D NAND. This is a significant new fabrication process that looks like it will be the direction taken by the flash industry for the foreseeable future. I would love to attempt a detailed explanation of 3D NAND design here, but a) it’s beyond me, b) I promised to listen to the feedback after the last time I went sub-atomic, and c) Jim Handy (AKA The Memory Guy) already has a fantastic set of articles all about it.

The point is, 3D NAND maps out a future for NAND flash beyond the scaling limits of 2D planar NAND – and the big fab operators have already invested. That puts 3D NAND in pole position as we increasingly look to non-volatile memory to store our data.

But there are other technologies trying to compete…

The Next Big Thing

The holy grail of memory technology is universal memory. This is a hypothetical technology which brings all the benefits, but none of the drawbacks, of the multitude of memory technologies in use today: SRAM, DRAM, NAND Flash and so on. SRAM (often used on-chip as a CPU cache) is fast but expensive, DRAM is cheaper but must be constantly refreshed (using considerably more power) while NAND flash is comparatively slower and wears out with use. If only there were some new technology that met all our requirements?

Well, first of all, it’s worth mentioning that if universal memory did come about it would have far-reaching consequences on the way we design and build computers… but all the same, there are various technologies in R&D right now which make claims to be the NVM solution of the future: PCM, MRAM, RRAM, FeRAM, Memristor and so on. But these – and any other – technologies all face the same challenge: that massive initial investment required to go from prototype to production. In other words, the cost and time associated with building a fabrication plant.

R&D Centre for SHRAM

R&D Centre for SHRAM

Say you converted your shed into a clean room and invented a wonderful new memory technology called Shed-RAM (SHRAM). SHRAM is fast, cheap, dense and only uses a tiny amount of power. You’ve still got to convince somebody to splurge billions of dollars on a foundry before it can be productised. And, as I’m sure you’ll agree, that’s a pretty big bet on an untested technology – especially if it takes years to build. What happens if, one year in, your neighbour (who recently converted his garage into a clean room) invents Garage-RAM (GaRAM) which makes your SHRAM worthless? Nobody is going to cope well with the loss of billions of dollars on a bad bet.

Conclusion

The investment hurdle required to turn a new NVM technology into a product is both a challenge and a stabiliser to the NVM industry. In theory it could stifle potential new technologies, but at the same time (at least for those of us in enterprise storage) it means there is plenty of warning when the tide turns: you are likely to see any successful new tech in your phone before it appears in your data centre.

Now, who can lend me $15 billion to build a new fab? I can’t afford anything more than 0% interest, but I can offer a lifetime’s supply of USB sticks to sweeten the deal…

Understanding Flash: Floating Gates and Wear

OLYMPUS DIGITAL CAMERA

One of the important characteristics of flash memory is wear. We know from previous articles in this series that flash packages consist of dies, which contain planes, which contain blocks, which in turn contain pages. We also know that these pages contain individual cells which store the bits of data… but to understand what wear is we need to look a little bit closer at those cells, where we will find something called a floating gate transistor.

Now don’t run away. Electronics might not be your thing, but I’m won’t be getting too deep. [After all, I once attempted degree-level education in Electronic Engineering but had to move to another course because I kept burning my fingers on the soldering irons during practical laboratory sessions…] If you really don’t want to think about transistors you can just skip to the section called Wear, or ignore this blog post entirely and spend a few minutes looking at picture of cats instead.

Field Effect Transistors

If you are reading this blog on a computer or a phone (and how else could you be reading it?) you owe a debt of gratitude to a humble device called the metal oxide semiconductor field effect transistor, or MOSFET. These tiny devices revolutionised the world and are considered by some to be the most important invention of the 20th century.

Students being stimulated to go to a bar and drink alcohol

Students being stimulated to go to a bar and drink alcohol

They therefore deserve to be explained in a serious and respectful manner, but sadly I don’t have time for that so I’m going to resort to one of my silly analogies. Again.

Imagine you have a house full of students, which we’ll call the source. Two doors down the road you have a nightclub full of free alcohol, which we’ll call the drain. However, it’s freezing cold outside and the students won’t venture out of the door, meaning there is no flow of students from source to drain. Now let’s set up a banging sound system behind the wall of the property in between. If we play music loud enough through the wall then we can excite the students into action, causing them to run down the road and enter the nightclub, which creates the flow. The loud music (which we’re going to call the gate) is the stimulus which causes this flow, essentially acting as a switch, while the volume of the music which first drives the students into action is called the threshold. The beauty of this design is that we can control the flow of students from behind the wall, therefore avoiding having to come into direct contact with them. Phew.

I really cannot tell you how bad that analogy is, but I’m afraid it’s only going to get worse later on. If you don’t know how a transistor works I implore you to watch this short video from the excellent folk at Veritasium, which describes it far better than I ever intend to.

Metal Oxide Semiconductor Field Effect Transistor (MOSFET)

Metal Oxide Semiconductor Field Effect Transistor (MOSFET)

In a MOSFET, electrons (students!) can be allowed to flow from the source terminal to the drain terminal by the application of charge to the gate terminal. This charge creates an electric field which alters the behaviour of the silicon layers (the pinkish parts of the diagram) and thus controls the flow. The important bit here is the yellow rectangle which represents an insulating layer (commonly known as the oxide layer). This means the gate is never physically connected to either the source or drain terminals – and you’ll see why this is important as we turn our attention now to something called the floating grate transistor, or FGMOS.

There are many things I haven’t explained in that clumsy analogy, but this is not an electronics blog. Of specific interest is the way that the two types of silicon (n and p) are doped in order to create free electrons and electron holes. Seriously, watch the video. If you haven’t ever learnt how a semiconductor works and my student party analogy is the nearest you get, it would be a crime. Although that’s not going to stop me from using it again in the next section…

Floating Gate Transistors

Floating Gate MOSFET (FGMOS)

Floating Gate MOSFET (FGMOS)

The diagram on the right (labelled “FGMOS”) is of a Floating Gate MOSFET, which is essentially what you will find in a flash memory cell. If you play spot the difference with the previous diagram above (the one labelled “MOSFET”) you’ll see that there are now two gates, one above the yellow oxide layer as before but the other entirely surrounded by it. This second gate is known as the floating gate because it is completely electrically isolated.

Notice also that the oxide layer beneath the floating gate is deliberately thinner than that above it in the diagram. Now, back to my analogy…

strip-doorsThe students are still there, as is the nightclub. The property in between is still there, but this time we have replaced the brick wall at the front with one of those sets of PVC strip curtains that you sometimes find covering the doors of factories, or butchers shops to keep insects out (or even lining the edge of the cold aisle in a data centre). Inside, we’ll put some comfy chairs and maybe a beer fridge. This is now our floating gate party room and the PVC strips are our thinner section of oxide layer.

Excited students going to a nightclub are stimulated so that some fall into the "Floating Gate" room

Excited students going to a nightclub are stimulated so that some fall into the “Floating Gate” room

As we now play the music loud enough to exceed the threshold, the excited students will run up the road towards the bar as before, but this time – with enough excitement – some of them will pass through the divider into our floating gate room and remain there, thus trapping electrons (sorry, I mean students) on the floating gate. Meanwhile, as the floating gate room fills up with people, the sound of the music becomes more muffled which slows down the flow of students in the road outside. To put it another way, the volume threshold at which stimulation will occur rises if there are students (sorry, I mean electrons) on the floating gate.

In a FGMOS, if a high charge is applied to the control gate in the same manner as with a MOSFET, electrons flowing from source to drain can get excited and “jump” through the oxide layer into the floating gate, increasing its retained charge. This is the program operation we have talked about so many times before: the floating gate is the “bucket of electrons” from my previous posts (a classic case of mixed metaphors). To erase the charge stored on the floating gate, a high voltage is applied across the source and drain while a negative voltage is applied to the control gate, causing the retained electrons to “jump” back off the floating gate (through the oxide layer). I’m putting the word “jump” in inverted commas there because it’s slightly more complicated and usually involves a process called Fowler–Nordheim tunnelling. I’ll explain everything I understand about that in the next paragraph.

[This paragraph intentionally left blank]

Yeah, it’s complicated, it involves quantum mechanics and it’s way over my head. I’m just taking it for granted that it works.

Read Operations

Now that we have methods for programming and erasing we just need a way of testing the value stored: a read operation. When we were talking about the MOSFET in the previous section we could control the flow of charge (which I had better start calling current) between source and drain by varying the voltage applied to the gate.

FGMOS Read Thresholds: The voltage threshold at which current begins to flow from drain to source is different depending on the charge stored on the floating gate. By testing at an intermediate reference voltage (VtREF) called the "read point" we can determine whether the floating gate contains charge (which we call ZERO) or not (which we call ONE)

FGMOS Read Thresholds: The voltage threshold at which current begins to flow from drain to source is different depending on the charge stored on the floating gate. By testing at an intermediate reference voltage (VtREF) called the “read point” we can determine whether the floating gate contains charge (which we call ZERO) or not (which we call ONE)

In the FGMOS this can be turned around so that by measuring the current we can determine the voltage on the floating gate, because electrons trapped on the floating gate cause the threshold we’ve previously mentioned to move. By applying a certain voltage (VtREF in the diagram on the right) across the source and drain and then testing the current we can determine if the voltage on the gate is above or below a specific point, called the read point.

So, if we play music at a certain threshold volume when the floating gate party room is empty, the sound travels far enough to stimulate a flow of students to the nightclub. But if the room is full (the floating gate contains charge) this specific volume will not stimulate a flow. Instead, it will require a louder threshold volume to get the students out of bed. And I think it’s time to abandon the student analogy now… let us never speak of it again.

Read Points for SLC and MLC

Read Points for SLC and MLC (image courtesy AnandTech)

If you remember, flash comes in different forms: SLC, MLC and TLC. So why are MLC reads slower than SLC, with TLC even slower still? Well, because SLC contains only one bit of data (two possible states: a zero or one), we only need to test one threshold voltage, i.e. SLC only has one read point. But for MLC, where there are two bits (and therefore four possible states), there are three read points… while for TLC there are even more.

Remember the bucket full of electrons analogy? When you test the SLC bucket to see if it’s above or below 50% full, the answer will tell you whether the stored value is a zero or a one. But for the MLC bucket, the answer to that test isn’t enough: based on the first answer you then need to perform a second test to see if the bucket is above or below 25% / 75% [delete as appropriate] full. All this additional testing takes time, which is why SLC reads are faster than MLC reads, which in turn are faster than TLC reads.

But what about the wear?

Flash Wear

broken-blindsFinally I’m getting to the point. Remember the PVC strip curtains, like the ones in the main picture at the top of this post? What do you think happens to them as all those excited students (sorry) hurtle through them? They get damaged. In a FGMOS the oxide layer which isolates the floating gate from the silicon substrate is designed to be thin enough to allow quantum tunnelling of electrons when a high enough charge is applied, but this process gradually damages the layer. Reads are not a problem because only lower voltages are used and no electron tunnelling takes place. But program and erase operations are a different story, which is why wear is measured by the number of program/erase cycles. As the layer gets more damaged, the isolation of the floating gate is increasingly affected and the probability of stored charge leaking out will increase.

For SLC this is less of an issue, because during a read we only need to measure at one read point – so there is a lot of room for error either side of the threshold. But for MLC, with three read points, we need to be much more exact. Thus the wear caused by using SLC, MLC and TLC isn’t really very different, it’s simply that the tolerances for error are much finer with the increased bit counts of MLC and TLC.

At some point, the oxide layer for a FGMOS will become sufficiently degraded so that it can no longer store charge properly on the floating gate. We won’t know about this until it happens, but at some point a read from the cell will no longer be trustworthy. And clearly that’s a problem for data storage, which is why (just like disk) flash stores error correction codes (ECC) alongside user data to ensure that incorrect information is spotted and dealt with while any underlying pages are marked as unusable – all without impacting users.

Comparison of SLC, MLC and EMLC (courtesy of EE Times)

Comparison of SLC, MLC and EMLC (courtesy of EE Times)

A final point to consider is that there is a (more expensive) form of MLC known as eMLC (the e stands for enterprise, with “normal” MLC then sometimes referred to as consumer or cMLC). The only difference between eMLC and the standard cMLC we have discussed in the past is that program and erase operations are “slowed down” for eMLC in order to cause less damage to the oxide layer. This gives slightly reduced performance but also significantly reduces wear, allowing up to 10x more P/E cycles. Opinion is divided on whether this is actually a worthwhile investment (not for me though, I think it’s a waste of money in the majority of cases).

Hot News

That’s the end of this post – and as usual I’ve committed two of my three regular blogging sins: writing too much, using silly analogies and finishing on a terrible pun. Time to complete the trio:

Some flash companies have been experimenting with methods of refreshing the lifetime of flash, with one avenue of exploration focussed on providing a short burst of heat to repair the damage to the oxide layer. It has been claimed that flash treated this way can sustain over 100 million P/E cycles with no noticeable degradation. If that is really the case – and this technology can be put into production – we might finally find that the Talking Heads were correct: in the world of flash memory we are on the road to no-wear…

Understanding Flash: Unpredictable Write Performance

fast-page-slow-page

I’ve spent a lot of time in this blog series talking about the challenges involved in using flash, such as the way that pages have to be erased before they are written and the restriction that erase operations take place on a whole block. I also described the problem of erase operations being slow in comparison to reads and writes – and the resulting processes we have to put in place to manage that problem (i.e. garbage collection) . And most recently I covered the way that garbage collection can result in unpredictable performance.

But so far we’ve always worked under the assumption that reads and writes to NAND flash have the same predictably low latency. This post is all about bursting that particular bubble…

Programming NAND Flash: A Quick Recap

You might remember from my post on the subject of SLC, MLC and TLC that I used the analogy of electrons in a bucket to explain the programming of NAND flash cells:

slc-mlc-tlc-buckets

I’d now like to change that analogy slightly, so I’m asking you to consider that you have an empty bucket and a powerful hose pipe. You can turn the hose on and off whenever you want to fill the bucket up, but you cannot empty water out of the bucket unless you completely empty it. Ok, now we’re ready.

For SLC we simply say that an empty bucket denotes a binary value of 1 and a full bucket denotes binary 0. Thus when you want to program an SLC bucket you simply let rip with your hose pipe until it’s full. No need to measure whether the water line is above or below the halfway point (the threshold), just go crazy. Blam! That was quick, wasn’t it?

For MLC however, we have three thresholds – and again we start with the bucket empty (denoting binary 11). Now, if I want to program the binary values of 01 or 10 in the above diagram I need to be careful, because if I overfill I cannot go backwards. bucketI therefore have to fill a little, test, fill some more, test and so on. It’s actually kind of tricky – and it’s one of the reasons that MLC is both slower than SLC and has a lower wear limit. But here’s the thing… if I want to program my MLC to have a value of binary 00 in the above diagram, I have no such problems because (as with SLC) I can just open the hose up on full power and hit it.

What we’ve demonstrated here is that programming a full charge value to an MLC cell is faster than programming any of the other available values. With a little more thought you can probably see that TLC has this problem to an even worse degree – imagine how accurate you need to be with that hose when you have seven thresholds to consider!

One final thought. We read and write (program) to NAND flash at the page level, which means we are accessing a large collection of cells as if they are one single unit. What are the chances that when we write a page we will want every cell to be programmed to full charge? I’d say extremely low. So even if some cells are programmed “the fast way”, just one “slow” program operation to a non-full-charge threshold will slow the whole program operation down. In other words, I can hardly ever take advantage of the faster latency experienced by full charge operations.

Fast Pages and Slow Pages

The majority of flash seen in the data centre today is MLC, which contains two bits per cell. Is there a way to program MLC in order that, at least sometimes, I can program at the faster speeds of a full-charge operation?

mlc-bucket-msb-lsbLet’s take my MLC bucket diagram from above and remap the binary values like the diagram on the left. What have I changed? Well most importantly I’ve reordered the binary values that correspond to each voltage level; empty charge still represents 11 but now full charge represents 10. Why did I do that?

The clue is the dotted line separating the most significant bit (MSB) and the least significant bit (LSB) of each value. Let’s consider two NAND flash pages, each comprising many cells. Now, instead of having both bits from each MLC cell used for a single page, I will put all of the MSB values into one page and call that the slow page. Then I’ll take all of the LSB values and put that into the other page and call that the fast page.

Why did I do this? Well, consider what happens when I want to program my fast page: in the diagram you can see that it’s possible to turn the LSB value from one to zero by programming it to either of the two higher thresholds… including the full charge threshold. In fact, if you forget about the MSB side for a second, the LSB side very similar to an SLC cell – and therefore performs like one.

The slow page, meanwhile, has to be programmed just like we discussed previously and therefore sees no benefit from this configuration. What’s more, if I want to program the fast page in this way I can’t store data in the corresponding slow page (the one with the matching MSBs) because every time I program a full charge to this cell the MSB ends up with a value of one. Also, when I want to program the slow page I have to erase the whole block first and then program both pages together (slowly!).

It’s kind of complicated… but potentially we now have the option to program certain MLC pages using a faster operation, with the trade-off that other pages will be affected as a result.

Getting To The Point

I should point out here that this is pretty low-level stuff which requires direct access to NAND flash (rather than via an SSD for example). It may also require a working relationship with the flash manufacturer. So why am I mentioning it here?

Well first of all I want to show you that NAND flash is actually a difficult and unpredictable medium on which to store data – unless you truly understand how it works and make allowances for its behaviour. NAND-flashThis is one of the reasons why so many flash products exist on the market with completely differing performance characteristics.

When you look at the datasheet for an MLC flash product and you see write / program times shown as, for example, 1.4 milliseconds it’s important to realise that this is the average of its bi-modal behaviour. Fast (LSB) pages may well have program times of 300 microseconds, while slow (MSB) pages might take up to 2.5 milliseconds.

Secondly, I want to point out that direct access to the flash (instead of via an SSD) brings certain benefits. What if, in my all flash array, I send all inbound user writes to fast pages but then, later on during garbage collection, I move data to be stored in slow pages? If I could do that, I’d effectively be hiding much of the slower performance of MLC writes from my users. And that would be a wonderful thing…

…which is why, at Violin, we’ve been doing it for years 🙂

 

Understanding Flash: The Write Cliff

cliffs

For the last couple of posts in this series I’ve been banging on about the importance of garbage collection (GC) in a flash system. I attempted to show you what happens if you don’t perform any GC at all (clue: you turn your flash device into a slow ROM), but clearly in the real world every flash array or SSD vendor has GC technology built into their flash translation layer. So why am I going to devote yet another post to it?

Predictable Performance

When you consider the performance of a system, what’s the number one requirement on your list? Is it “fast”? I would argue not. In my opinion, the first and most important requirement when considering performance is predictability. If you know how a system will perform at any time then (even if you would prefer things to happen faster) you can plan accordingly. If the same repeatable action behaves completely differently over random samples, how can you ever consider it reliable?

Cast your mind back to the post about flash blocks and pages. Remember that reads happen at the page level, as do writes (known as program operations when working with NAND flash)queues-likely but only empty pages can be programmed. To make a page empty you must perform an erase operation – and, crucially, these happen at the block level thereby affecting an entire set of pages.

As you fill up your flash device, pages containing data must be relocated in order to free up blocks so that they can be erased. At this point there are two pieces of bad news to consider:

First of all, in general each flash die can only perform one operation at a time (sometimes this is one operation per plane but that really doesn’t detract from the point). That means if you are performing an erase operation on block A, a read operation from a page in block Z on the same die has to be queued. It’s a completely different block – one of thousands on the same die – but the operation is queued nonetheless.

The second bit of news is that erase operations are slow… really slow. For MLC flash we’re talking maybe 3 milliseconds, which is an age when you compare it to the ~50 microseconds it takes to perform a read. Program operations are also slower than reads (but faster than erases) and they also have to be queued.

So based on all this information, a user who simply reads data at a predictable rate may suddenly see their latency spike up to around 60x higher than the 50us they are expecting if their read gets queued behind an erase. That doesn’t sound like fun does it?

Background vs Active / Foreground Garbage Collection

We know that garbage collection has to try and take care of erases in order to stop you running out of space as you fill up your flash. But let’s now consider this in terms of the performance problems caused by “user operations” (i.e. active reads and writes from the host) queuing up behind “background operations” (i.e. activity caused by the flash translation layer doing its job). Clearly the latter will affect the former if we are not careful. It therefore makes sense that we should try and perform all of our background operations at times when they will not cause problems to the users. As you know if you’ve read my blog before, I love an analogy… so let’s consider garbage collection a bit like the process of washing dishes at a busy restaurant.

dishesA restaurant only has a finite number of plates, glasses, cutlery etc, so once stuff is used it has to return to the kitchen and be washed ready for reuse. In a well-functioning restaurant this process takes place without disrupting the flow of cooked meals leaving the kitchen and being served to customers. In the same way, if our flash garbage collection is taking place in a manner which does not affect the active I/O operations of our users, this is known as background garbage collection (BGC). You can consider BGC “a good thing”, since it “hides” the impact of erase operations and results in more stable, predictable I/O times from the host. In other words, the kitchen runs smoothly and our customers are happy. Bravo.

On the other hand, if our dishes are not being washed fast enough in the kitchen of our restaurant, at some point there will be a shortage of clean plates etc and the customers will have to wait. Likewise, if more data is being changed than BGC can keep up with, the flash device is now running out of free space in which to program incoming writes. This means we have to switch into a different mode called active garbage collection (AGC), sometimes also described as foreground garbage collection. In AGC, user I/O inevitably ends up queuing behind background operations – and in severe cases we have to throttle user I/O requests because they cannot be serviced in time. Yes, we actually have to tell the waiters not to take any more orders until we can get our act together in the kitchen.

You might remember from my previous post that all flash vendors overprovision their flash to allow an additional working area where new writes can land while stale pages are being erased. In the same way, most restaurants probably have more plates and cutlery than they have table settings out front. It helps – and the more you overprovision, the more breathing space you have – but at some point if you don’t take care of your dirty dishes you will still run out of clean ones.

The Infamous Write Cliff

There has been a lot of talk about the write cliff by various commentators, flash vendors and bloggers over the years. I’ve read articles that say it’s no longer a problem (“in most SSD arrays”), articles that show it causing significant problems and white papers on how to avoid it.

My advice is to keep it simple: your flash device has an over-provisioned “buffer zone” which you may or may not be able to see (on Violin you can actually configure it). If you change more data on your flash device than the background garbage collection algorithm can keep up with, you will eat into this buffer zone until you hit active garbage collection. Keep pushing your device at this point and you will see the latency rise as the number of IOPS falls. It’s a simple as that.

My good buddy Maxim from Violin’s amazing engineering team demonstrated this to me in real life by deliberately limiting the ability of background garbage collection on a test system and then hitting it with lots of writes. Here’s the result:

active-vs-background-garbage-collection

This same pattern can be seen in numerous other places on the internet; for example, in the Preconditioning Curve graphs of reviews of SSDs. In fact, there are only two possible scenarios where a flash device won’t hit the write cliff (assuming you push it hard enough):

  1. Flash devices (mainly SLC) which can perform garbage collection really quickly and have lots of overprovisioned space (e.g. the Violin Memory 6616 array frequently used for setting benchmark records)
  2. Flash devices where the limited ingest capability means they can never accept enough writes to exhaust their overprovisioned space

That last one might seem contentious, but think about it: it’s a simple fact of NAND flash that erases are slower than writes (programs). This means if you are able to perform enough writes then eventually you will always exhaust any finite overprovisioned buffer space. At that point, writes must slow down to the speed of erases. As the theory of constraints says, “a chain is no stronger than its weakest link“…

Understanding Flash: Garbage Collection Matters

garbage-collection

In the last post in this series I discussed some of the various tasks that need to be performed by the flash translation layer – the layer of abstraction that sits between us and the raw NAND flash on which we desire to store our data. One of those tasks is the infamous garbage collection process (or “GC”) – and in these next couple of posts I’m going to look into GC a little deeper.

But first, let’s consider what would happen on a flash system if there were no GC at all.

Why We Need Garbage Collection

Let’s imagine a very simple flash system comprising of just five blocks, each with just ten pages. In the beginning, all the pages are empty, so we can say that 100% of the capacity is Free:

garbage-collection-fill1

 

There’s no point in wasting all that lovely flash, so I’m going to write some data to it. Remember that a write operation in the world of flash is known as a program operation and it takes place at the page level. So here goes:

garbage-collection-fill2

 

You can see a couple of things happening here. First of all, some of the capacity is now showing as Used. Secondly, you can see that our wear levelling algorithm has kicked in to spread the data over all the blocks fairly evenly. Some blocks have more used pages than others, but the wear levelling doesn’t have to be exact, just approximate. Let’s add some more data:

garbage-collection-fill3

 

Now we’re at 50% Used – and again the data has been spread evenly for the purposes of wear levelling. The flash translation later is also taking care of logical to physical block mapping, so although data has been spread across multiple blocks it has the appearance of being sequential when accessed by the calling application.

Time For Some Updates

So far so good. But at some point we’re probably going to want to update some of the existing data. And as we know, on NAND flash that’s not straightforward because used pages cannot be updated without the entire block first being erased. So for now we simply write the updated data to an empty page and mark the old page as stale. The logical to physical block mapping allows us to simply redirect the logical address of an updated block to the physical address of the new page:

garbage-collection-fill4

See how some of the blocks which were showing as Used (in red) are now showing as Stale (in yellow)? And of course the capacity graph is now showing a percentage of the total capacity as being stale. But have you also noticed that we are approaching the point where we can never free up the stale pages?

Never forget that NAND flash can only be erased at the block level. In the above example, the first (left-most) block has two stale pages and four used pages. In order to free up those stale pages I need to copy the used pages somewhere else and then erase the block. If I don’t do that very soon, I’m going to run out of free pages in the other blocks and then … I’m screwed:

garbage-collection-fill5

This situation is a disaster, because at this point I can never free up the stale blocks, which means I’ve effectively just turned my flash system into a read only device. And yet if you look at the capacity graph, it’s only around 75% full of real data (the stuff in red). So does this mean I can never use 100% of a flash device?

The Return of Over-Provisioning

Earlier in this Storage for DBAs blog series I wrote an article about one of my least favourite practices in the world of hard disk drives: over-provisioning. I now have to own up and confess that in the world of flash we do pretty much the same thing, although for slightly different reasons (and without the massive overhead of power, cooling and real-estate costs that makes me so mad with disk).

To ensure that there is always enough room for garbage collection to continue (as well as to allow for things like bad blocks, wear levelling etc), pretty much every flash device vendor (from SSDs through to All Flash Arrays) uses the same trick:

garbage-collection-overprovision

Yep, there’s more flash in your device than you can actually see. For example, rip off the covers of a 400GB HGST Ultrastar SSD800MM (the SSD that used to be found in XtremIO arrays) and you’ll find 576GB of raw flash. There’s 44% more capacity on the device than it advertises when you access it. This extra area is not pinned by the way, it’s not a dedicated set of blocks or pages. It’s just mandatory headroom that stops situations like the one we described above ever occurring.

The Violin Way

Violin Intelligent Memory Module (VIMM)Since I work for Violin Memory it would be remiss of me not to discuss the concept of format levels at this point. When you buy flash in the form of SSDs, or flash arrays that contain SSDs, the amount of over-provisioning is generally fixed. Almost all SSDs have a predefined over-provision area and that cannot be changed (if there is an SSD on the market that allows this, I am not aware of it).

One of the advantages we have at Violin is that we build our arrays from the ground up using NAND flash chips rather than pre-packaged SSDs – our flash is located on our unique Violin Intelligent Memory Modules (VIMMs). This means we have control down to the NAND flash level – and one of the benefits here is that we can choose how much over-provisioning is required based on workload. At Violin we call this the format level of an array and we allow customers to set it when they take delivery (it can also be changed at a later time but obviously requires a reformat of the array).

I see this as a great benefit because some workloads are very read-intensive and so a large over-provision would essentially be a waste of good flash. Conversely some workloads are so write-heavy that the hard limits found on an SSD device would not be suitable. trashcanIn general I believe that choice (as long as it comes with advice) is a good thing.

In the next post we’ll look at garbage collection some more, in particular the concept of background versus foreground garbage collection, as well as cover the infamous write cliff. But for now, just remember: no matter what form of flash you are using, in order for it to work and perform properly, something somewhere has to take out the trash…

Understanding Flash: The Flash Translation Layer

electronics

A couple of posts ago in this series, I explained how a NAND flash die is comprised of planes, which contain blocks, which contain pages… which contain individual cells of data. Read operations take place at the page level, as do write operations (although we call them program operations in the flash world). But crucially, erase operations take place at the block level and so affect multiple pages.

Erases are also slow (at least relative to reads and writes) and cause wear of the flash media, gradually moving it closer to its end of life. It’s therefore the case that when you want to update an existing page of data it is faster, simpler and less damaging to simply write the updated information to an empty page. If you’re going to do that, you will probably also want to choose a new page somewhere completely different from the old one to ensure that your flash wears out evenly. And hey, don’t forget that the block containing the old page will need to be erased at some point before it can be reused.

flash-translation-layer-burgerSo to make flash a friendly medium for storing our data, we need a mechanism which will:

  1. write updated information to a new empty page and then divert all subsequent read requests to its new address
  2. ensure that newly-programmed pages are evenly distributed across all of the available flash so that it wears evenly
  3. keep a list of all the old invalid pages so that at some point later on they can all be recycled ready for reuse

This mechanism is called the flash translation layer (FTL) and you will find it on all flash media if you look hard enough. The FTL has a number of responsibilities, so let’s look at them now.

Logical Block Mapping

Abstraction is everywhere in computing. The URL of this website is a logical address which maps to a physical address, i.e. the IP address. An IP address is in fact a logical address which maps to a physical address, i.e. the MAC address of a network interface. There are so many examples of abstraction in technology that sometimes I think the whole world of computing is just one massive layer of abstraction.

logical-physical-block-addressingIn storage, the idea of a logical block address (LBA) has been around since long before NAND flash and is primarily used to make addressing simpler and more flexible. Like all abstraction concepts it exists to make the (potentially complex) management of a low level system invisible to the higher levels that consume its services. For example, if a hard drive within a RAID group fails and the data it contained has to be rebuilt on a hot spare, the physical block address of certain blocks of data will change. To avoid the need to notify all possible interested parties (e.g applications, databases, etc) of the new address, the extra layer of logical block addressing is used; the map from LBA to PBA is amended and nobody else needs to know.

In a flash system this same mechanism can be used for updates, so that when a page is considered invalid the logical block address can be remapped to the newly-programmed page. This provides the solution to number 1 in our list above in a way that is both simple and transparent to anyone issuing I/O requests.

Wear Levelling

On the face of it, wear levelling seems like a fairly simple method of handling number 2 in our list. You have a predefined number of flash blocks, each of which can be programmed and erased (known as the P/E cycle) a similar number of times before they are no longer usable. Clearly the object of any wear levelling algorithm is to smoothly distribute all P/E cycles so that the blocks all reach their limit at the same time. flash-remaining-lifetimeWithout wear levelling it’s entirely possible that a subset of blocks could receive the majority of P/E cycles and thus wear out very quickly, reducing the available capacity of the system.

If all blocks in a system were regularly updated this would be no problem, because wear levelling would happen almost naturally as pages are marked invalid and then recycled. But here’s the problem: if we have some cold blocks, i.e. locations where the data never changes, then we have to take steps to manually relocate that data otherwise those blocks won’t ever wear… and that means we are actually adding write workload to the system, which ultimately means increasing the wear.

So to put it in simple terms, the more aggressive we are about wear levelling evenly, the more wear we cause. But not being aggressive enough could result in hot and cold spots as wear becomes more uneven. As always, it’s a question of finding the right balance. Or, if you prefer, finding the write balance.*

Garbage Collection

The third requirement in our list is a way of recycling pages that are no longer required (i.e. invalid) but have not yet been erased. Of course you cannot simply erase them at your leisure, because flash requires the entire containing block to be erased too. So instead it’s necessary to consider the remaining contents of the block and – if necessary – transparently move some active data elsewhere.recycling-bins Once the block has no remaining active data in it, it can be erased and then becomes ready to use again.

The thing is, the block in question might have 128 or 256 pages in it, many of which contain active data. You might have to go to a lot of effort in order to recycle just a small number of invalid pages – and just like with wear levelling, the act of moving data around in the background will have consequences to things like performance and endurance. Again it’s a question of finding the right balance.

Garbage Collection is such an important part of the management of flash that I’m going to devote a whole post to it next in the series. But for now, consider this: what happens if you fill up your trash faster than it gets taken away? You run out of space in your bins and everything starts to smell pretty bad. We definitely need to make sure that doesn’t happen here…

Write Amplification

Now you might have noticed that a lot of the processes described here result in additional operations taking place on the flash media. Wear Levelling can relocate inactive (cold) pages in order to ensure they wear evenly in comparison to active (hot) pages. Garbage Collection can result in pages being moved from a block which will subsequently be erased. In short, stuff is happening in the background as a consequence of the actions taking place on the host – which we will call the foreground in order to distinguish it. What’s more, if you are looking at things from the point of view of the foreground – like, for example, a database server accessing flash storage – you have no visibility of what’s going on in the background.

It doesn’t matter what OS monitoring tools you run on the host (iostat, for example, or dstat), you will only see the foreground I/O operations that the host knows about. This is important because the performance and endurance of your flash storage is dependent on the sum of foreground and background operations.

We call this phenomena write amplification and we can express it as a value using the following formula:

Write Amplification = Data Written To Flash / Data Written By The Host

A higher value indicates an increased workload on the flash storage system, which is likely to mean reduced endurance and performance. For this reason, if you are testing any sort of flash system (from the mighty All Flash Array to a simple SSD) you should make every effort to observe the workload within the storage system as well as from the host. Of course with the humble SSD that’s often not possible…

Where Is Your FTL?

cautionAs a final thought, earlier on I said about the FTL that “you will find it on all flash media if you look hard enough“. What did I mean? Well, the FTL has a lot of duties to perform, as we’ve seen. That requires effort, which for a computer equates to processing power and memory. In most situations (e.g. All Flash Arrays) this happens in firmware under the covers. But with some products, for example certain PCIe flash cards, it’s possible that some or all of the FTL functionality runs on the host, using host CPU cycles and DRAM.

In principle there may be benefits and drawbacks of either method (host-based FTL or array-based FTL), but if you are running a database on your server they become insignificant in relation to the cost. After all, those processors in your database servers? The ones that affect your core-based licenses for Oracle or other database software? They are the most expensive CPUs you own. If you are donating CPU cycles from these cores to manage your flash, you are effectively throwing away license money. And you probably feel you pay enough to your database vendor already, right?

* Seriously, if you didn’t think that was a great pun then you’re reading the wrong blog.

Understanding Flash: SLC, MLC and TLC

slc-mlc-tlc-fruitmachine

The last post in this series discussed the layout of NAND flash memory chips and the way in which cells can be read and written (programmed) at the page level but have to be erased at the (larger) block level. I finished by mentioning that erase operations take substantially longer than read or program operations… but just how big is the difference?

Knowing the answer to this involves first understanding the different types of flash memory available: SLC, MLC and TLC.

Electrons In A Bucket?

Whenever I’ve seen anyone attempt to explain this in the past, they have almost always resorted to drawing a picture of electrons or charge filling up a bucket. This is a terrible analogy and causes anyone with a deep understanding of physics to cringe in horror. Luckily, I don’t have a deep understanding of physics, so I’m going to go right along with the herd and get my bucket out.

A NAND flash cell, i.e. the thing that stores a value of one or zero, is actually a floating gate transistor. Programming the cell means putting electrons into the floating gate, causing it to become (negatively) charged. Erasing the cell means removing the electrons from the floating gate, draining the charge. The amount of charge contained in the floating gate can be varied from zero up to a maximum value – this is an analogue system so there is no simple FULL or EMPTY state.

Because of this, the amount of charge can be measured and thresholds assigned to indicate a binary value. What does that mean? It means that, in the case of Single Level Cell (SLC) flash anything below 50% of charge can be considered to be a bit with a value of 1, while anything above 50% can be considered a bit with a value of 0.

But if i decided to be a bit more careful in the way I fill or empty my bucket of charge (sorry), I could perhaps define more thresholds and thus hold two bits of data instead of one. I could say that below 25% is 11, from 25% to 50% is 10, from 50% to 75% is 01 and above 75% is 00. Now I can keep twice as much data in the same bucket. This is in fact Multi Level Cell (MLC). And as the picture shows, if I was really careful in the way I treated my bucket, I could even keep three bits of data in there, which is what happens in Three Level Cell (TLC):

slc-mlc-tlc-buckets

The thing is, imagine this was a bucket of water (comparing electrons to water is probably the last straw for anyone reading this who has a degree in physics, so I bid you farewell at this point). If you were to fill up your bucket using the SLC method, you could be pretty slap-dash about it. I mean it’s pretty obvious when the bucket is more than half full or empty. But if you were using a more fine-grained method such as MLC or TLC you would need to fill / empty very carefully and take exact measurements, which means the act of filling (programming) would be a lot slower.

To really stretch this analogy to breaking point, imagine that every time you fill your bucket it gets slightly damaged, causing it to leak. In the SLC world, even a number of small leaks would not be a big deal. But in the MLC or (especially) the TLC world, those leaks mean it would quickly become impossible to keep using your bucket, because the tolerance between different bit values is so small. For similar reasons, NAND flash endurance is greatly influenced by the type of cell used. Storing more bits per cell means a lower tolerance for errors, which in turn means that higher error rates are experienced and endurance (the number of program/erase cycles that can be sustained) is lower.

Timing and Wear

Enough of the analogies, let’s look at some proper data. The chart below uses sample figures from AnandTech:

slc-mlc-tlc-performance-chart

You can see that as the number of bits per cell increases, so does the time taken to perform read, program (i.e. write) and erase operations. Erases in particular are especially slow, with values measured in milliseconds instead of microseconds. Given that erases also affect larger areas of flash than reads and programs, you can start to see why the management of erase operations on flash is critical to performance.

Also apparent on the chart above is the massive difference in the number of program / erase cycles between the different flash types: for SLC we’re talking about orders of magnitude in difference. But of course SLC can only store one bit per cell, which means it’s much more expensive from a capacity perspective than MLC. TLC, meanwhile, offers the potential for great value for money, but none of the performance requirements you would need for tier one storage (although it may well have a place in the world of backups). It is for this reason that MLC is the most commonly used type of flash in enterprise storage systems. (By the way I’m so utterly disinterested in the phenomena of “eMLC” that I’m not going to cover it here, but you can read this and this if you want to know more on the subject…)

Warning: Know Your Flash

cautionOne final thing. When you buy an SSD, a PCIe flash card or, in the case of Violin Memory, an all-flash array you tend to choose between SLC and MLC. As a very rough rule of thumb you can consider MLC to be twice the capacity for half the performance of SLC, although this in fact varies depending on many factors. However there are some all flash array vendors who use both SLC and MLC in a sort of tiered approach. That’s fine -and if you are buying a flash array I’m sure you’ll take the time to understand how it works.

But here’s the thing. At least one of these vendors insists on describing the SLC layer as “NVRAM” to differentiate from the MLC layer which it simply describes as using flash SSDs. The truth is that the NVRAM is also just a bunch of flash SSDs, except they are SLC instead of MLC. I’m not in favour of using educational posts to criticise competitors, but in the interest of bring clarity to this subject I will say this: I think this is a marketing exercise which deliberately adds confusion to try and make the design sound more exciting. “Ooooh, NVRAM that sounds like something I ought to have in my flash array…” – or am I being too cynical?

Update: January 2016

This article discusses what is known as 2D Planar NAND flash. Since publication, a new type of NAND flash called 3D NAND (led by Samsung’s branded V-NAND product) has become popular on the market. You can read about that here.