Understanding Flash: SLC, MLC and TLC


The last post in this series discussed the layout of NAND flash memory chips and the way in which cells can be read and written (programmed) at the page level but have to be erased at the (larger) block level. I finished by mentioning that erase operations take substantially longer than read or program operations… but just how big is the difference?

To answer that question, we first need to understand the different types of flash memory available: SLC, MLC and TLC.

Electrons In A Bucket?

Whenever I’ve seen anyone attempt to explain this in the past, they have almost always resorted to drawing a picture of electrons or charge filling up a bucket. This is a terrible analogy and causes anyone with a deep understanding of physics to cringe in horror. Luckily, I don’t have a deep understanding of physics, so I’m going to go right along with the herd and get my bucket out.

A NAND flash cell, i.e. the thing that stores a value of one or zero, is actually a floating gate transistor. Programming the cell means putting electrons into the floating gate, causing it to become (negatively) charged. Erasing the cell means removing the electrons from the floating gate, draining the charge. The amount of charge contained in the floating gate can be varied from zero up to a maximum value – this is an analogue system so there is no simple FULL or EMPTY state.

Because of this, the amount of charge can be measured and thresholds assigned to indicate a binary value. What does that mean? It means that, in the case of Single Level Cell (SLC) flash, anything below 50% of charge can be considered a bit with a value of 1, while anything above 50% can be considered a bit with a value of 0.

But if I decided to be a bit more careful in the way I fill or empty my bucket of charge (sorry), I could perhaps define more thresholds and thus hold two bits of data instead of one. I could say that below 25% is 11, from 25% to 50% is 10, from 50% to 75% is 01 and above 75% is 00. Now I can keep twice as much data in the same bucket. This is in fact Multi Level Cell (MLC). And as the picture shows, if I was really careful in the way I treated my bucket, I could even keep three bits of data in there, which is what happens in Triple Level Cell (TLC):

slc-mlc-tlc-buckets
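To make those thresholds concrete, here is a minimal Python sketch that maps a normalised charge level to a bit pattern using evenly spaced thresholds. The function and its threshold scheme are illustrative only – real NAND senses a cell's threshold voltage against internal reference levels – but the principle is exactly the bucket idea above:

    # Illustrative only: map a normalised charge level (0.0 = empty, 1.0 = full)
    # to a bit pattern using evenly spaced thresholds. More bits per cell means
    # more levels squeezed into the same charge range.
    def read_cell(charge: float, bits_per_cell: int) -> str:
        levels = 2 ** bits_per_cell               # SLC=2, MLC=4, TLC=8 levels
        level = min(int(charge * levels), levels - 1)
        value = (levels - 1) - level              # an erased cell reads as all 1s
        return format(value, f"0{bits_per_cell}b")

    print(read_cell(0.10, 1))   # SLC: below 50%       -> '1'
    print(read_cell(0.60, 2))   # MLC: 50% to 75%      -> '01'
    print(read_cell(0.90, 3))   # TLC: topmost bucket  -> '000'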

The thing is, imagine this was a bucket of water (comparing electrons to water is probably the last straw for anyone reading this who has a degree in physics, so I bid you farewell at this point). If you were to fill up your bucket using the SLC method, you could be pretty slap-dash about it. I mean, it's pretty obvious whether the bucket is more or less than half full. But if you were using a more fine-grained method such as MLC or TLC you would need to fill / empty very carefully and take exact measurements, which means the act of filling (programming) would be a lot slower.

To really stretch this analogy to breaking point, imagine that every time you fill your bucket it gets slightly damaged, causing it to leak. In the SLC world, even a number of small leaks would not be a big deal. But in the MLC or (especially) the TLC world, those leaks mean it would quickly become impossible to keep using your bucket, because the tolerance between different bit values is so small. For similar reasons, NAND flash endurance is greatly influenced by the type of cell used. Storing more bits per cell means a lower tolerance for errors, which in turn means that higher error rates are experienced and endurance (the number of program/erase cycles that can be sustained) is lower.

Timing and Wear

Enough of the analogies, let’s look at some proper data. The chart below uses sample figures from AnandTech:

slc-mlc-tlc-performance-chart

You can see that as the number of bits per cell increases, so does the time taken to perform read, program (i.e. write) and erase operations. Erases in particular are slow, with values measured in milliseconds instead of microseconds. Given that erases also affect larger areas of flash than reads and programs, you can start to see why the management of erase operations on flash is critical to performance.
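To give a feel for the relationships without quoting any particular datasheet, here is a small sketch using illustrative, order-of-magnitude figures (the real numbers vary widely between chips and generations):

    # Illustrative, order-of-magnitude latencies in microseconds -- not vendor
    # datasheet values. Note the pattern: programs are slower than reads, and
    # erases are measured in milliseconds rather than microseconds.
    LATENCY_US = {
        #        read, program, erase
        "SLC": (   25,     300,  1_500),
        "MLC": (   50,     900,  3_000),
        "TLC": (  100,   1_500,  5_000),
    }

    for cell, (rd, prog, erase) in LATENCY_US.items():
        print(f"{cell}: program ~{prog // rd}x and erase ~{erase // rd}x slower than a read")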

Also apparent on the chart above is the massive difference in the number of program / erase cycles between the different flash types: for SLC we’re talking about orders of magnitude in difference. But of course SLC can only store one bit per cell, which means it’s much more expensive from a capacity perspective than MLC. TLC, meanwhile, offers the potential for great value for money, but none of the performance characteristics you would need for tier one storage (although it may well have a place in the world of backups). It is for this reason that MLC is the most commonly used type of flash in enterprise storage systems. (By the way, I’m so utterly uninterested in the phenomenon of “eMLC” that I’m not going to cover it here, but you can read this and this if you want to know more on the subject…)

Warning: Know Your Flash

One final thing. When you buy an SSD, a PCIe flash card or, in the case of Violin Memory, an all-flash array, you tend to choose between SLC and MLC. As a very rough rule of thumb you can consider MLC to be twice the capacity for half the performance of SLC, although this in fact varies depending on many factors. However, there are some all-flash array vendors who use both SLC and MLC in a sort of tiered approach. That’s fine – and if you are buying a flash array I’m sure you’ll take the time to understand how it works.

But here’s the thing. At least one of these vendors insists on describing the SLC layer as “NVRAM” to differentiate it from the MLC layer, which it simply describes as using flash SSDs. The truth is that the NVRAM is also just a bunch of flash SSDs, except they are SLC instead of MLC. I’m not in favour of using educational posts to criticise competitors, but in the interest of bringing clarity to this subject I will say this: I think this is a marketing exercise which deliberately adds confusion to try and make the design sound more exciting. “Ooooh, NVRAM – that sounds like something I ought to have in my flash array…” – or am I being too cynical?

Update: January 2016

This article discusses what is known as 2D Planar NAND flash. Since publication, a new type of NAND flash called 3D NAND (led by Samsung’s branded V-NAND product) has become popular on the market. You can read about that here.

Understanding Flash: Blocks, Pages and Program / Erases

In the last post on this subject I described the invention of NAND flash and the way in which erase operations affect larger areas than write operations. Let’s have a look at this in more detail and see what actually happens. First of all, we need to know our way around the different entities on a flash chip (or “package”), which are: the die, the plane, the block and the page:

NAND Flash Die Layout (image courtesy of AnandTech)

Note: What follows is a high-level description of the generic behaviour of flash. There are thousands of different NAND chips available, each potentially with slightly different instruction sets, block/page sizes, performance characteristics etc.

  • The package is the memory chip, i.e. the black rectangle with little electrical connectors sticking out of it. If you look at an SSD, a flash card or the internals of a flash array you will see many flash packages, each of which is produced by one of the big flash manufacturers: Toshiba, Samsung, Micron, Intel, SanDisk, SK Hynix. These are the only companies with the multi-billion dollar fabrication plants necessary to make NAND flash.
  • Each package contains one or more dies (for example one, two, or four). The die is the smallest unit that can independently execute commands or report status.
  • Each die contains one or more planes (usually one or two). Identical, concurrent operations can take place on each plane, although with some restrictions.
  • Each plane contains a number of blocks, which are the smallest unit that can be erased. Remember that, it’s really important.
  • Each block contains a number of pages, which are the smallest unit that can be programmed (i.e. written to).

The important bit here is that program operations (i.e. writes) take place to a page, which might typically be 8-16KB in size, while erase operations take place to a block, which might be 4-8MB in size. Since a block needs to be erased before it can be programmed again (*sort of, I’m generalising to make this easier), all of the pages in a block need to be candidates for erasure before this can happen.
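As a quick sanity check on that, here is the arithmetic as a toy calculation, using sizes picked from the ranges above (8KB pages, 4MB blocks – real chips vary by part number):

    # Toy model of the hierarchy using assumed sizes from the ranges above.
    PAGE_SIZE  = 8 * 1024          # smallest programmable unit (8KB)
    BLOCK_SIZE = 4 * 1024 * 1024   # smallest erasable unit (4MB)

    pages_per_block = BLOCK_SIZE // PAGE_SIZE
    print(f"{pages_per_block} pages per block")   # -> 512
    # So a single erase wipes 512 pages' worth of data: every one of those
    # pages must be stale or copied elsewhere before the erase can happen.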

Program / Erase Cycles

When your flash device arrives fresh from the vendor, all of the pages are “empty”. The first thing you will want to do, I’m sure, is write some data to them – which in the world of memory chips we call a program operation. As discussed, these program operations take place at the page level. You can then read your fresh data back out again with read operations, which also take place at the page level. [Having said that, the instruction to read a page places the data from that page in a memory register, so your reading process can in fact then selectively access subsets of the page if it desires – but maybe that’s going into too much detail…]

Where it gets interesting is if you want to update the data you just wrote. There is no update operation for flash, no undo or rewind mechanism for changing what is currently in place, just the erase operation. It’s a little bit like an Etch A Sketch, in that you can continue to turn the dials and make white sections of screen go black, but you cannot turn black sections of screen to white again without erasing the entire screen. An erase operation on a flash chip clears the data from all pages in the block, so if some of the other pages contain active data (stuff you want to keep) you either have to copy it elsewhere first or hold off from doing the erase.

In fact, that second option (don’t erase just yet) makes the most sense, because the blocks on a flash chip can only tolerate a limited number of program and erase operations (known as the program erase cycle or PE cycle because, for obvious reasons, they follow each other in turn). If you were to erase the block every time you wanted to change the contents of a page, your flash would wear out very quickly.

So a far better alternative is to simply mark the old page (containing the unchanged data) as INVALID and then write the new, changed data to an empty page. All that is required now is a mechanism for pointing any subsequent access operations to the new page and a way of tracking invalid pages so that, at some point, they can be “recycled”.

NAND-flash-page-update

Updating a page in NAND flash. Note that the new page location does not need to be within the same block, or even the same flash die. It is shown in the same block here purely for ease of drawing.

This “mechanism” is known as the flash translation layer and it has responsibility for these tasks as well as a number of others. We’ll come back to it in subsequent posts because it is a real differentiator between flash products. For now though, think about the way the device is filling up with data. Although we’ve delayed issuing erase operations by cleverly moving data to different pages, at some point clearly there will be no empty pages left and erases will become essential. This is where the bad news comes in: it takes many times longer to perform an erase than it does to perform a read or program. And that clearly has consequences for performance if not managed correctly.
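To make the page-remapping idea concrete, here is a deliberately minimal sketch of a hypothetical flash translation layer. Real FTLs also handle wear levelling, garbage collection, bad block management and much more, so treat this purely as an illustration of the mapping concept:

    # A toy, hypothetical FTL: updates remap a logical page to a fresh
    # physical page instead of erasing the block in place.
    class ToyFTL:
        def __init__(self, num_pages: int):
            self.mapping = {}                   # logical page -> physical page
            self.free = list(range(num_pages))  # empty physical pages
            self.invalid = set()                # stale pages awaiting erase

        def write(self, logical: int, data: bytes):
            if logical in self.mapping:
                self.invalid.add(self.mapping[logical])  # mark old page INVALID
            phys = self.free.pop(0)             # program a fresh empty page
            self.mapping[logical] = phys
            # A real FTL would program `data` to NAND here, and trigger
            # garbage collection (erases) when self.free runs low.

    ftl = ToyFTL(num_pages=512)
    ftl.write(7, b"v1")
    ftl.write(7, b"v2")                         # an update: no erase needed yet
    print(ftl.mapping[7], ftl.invalid)          # -> 1 {0}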

In the next post we’ll look at the differences in time taken to perform reads, programs and erases – which first requires looking at the different types of flash available: SLC, MLC and TLC…

[* Technical note: Ok, so actually when a NAND flash page is empty it is all binary ones, e.g. 11111111. A program operation sets any bit with the value of 1 to 0, so for example 11111111 could become 11110000. This means that later on it is still possible to perform another program operation to set 11110000 to 00110000, for example. Until all bits are zero it’s technically possible to perform another program. But hey, that’s getting a bit too deep into the details for our requirements here, so just pretend you never read this…]
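(And for anyone who did read it: the “ones can only become zeros” rule means each successive program behaves like a bitwise AND with the page’s current contents, which takes three lines of Python to demonstrate:)

    erased = 0b11111111            # an erased page is all ones
    first  = erased & 0b11110000   # program #1 -> 0b11110000
    second = first  & 0b00110000   # program #2 -> 0b00110000
    print(format(second, "08b"))   # '00110000'; turning any 0 back into a 1
                                   # would require erasing the whole block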

Understanding Flash: What Is NAND Flash?


In the early 1980s, before we ever had such wondrous things as cell phones, tablets or digital cameras, a scientist named Dr Fujio Masuoka was working for Toshiba in Japan on the limitations of EPROM and EEPROM chips. An EPROM (Erasable Programmable Read Only Memory) is a type of memory chip that, unlike RAM for example, does not lose its data when the power supply is lost – in the technical jargon it is non-volatile. It does this by storing data in “cells” comprising floating-gate transistors. I could start talking about Fowler-Nordheim tunnelling and hot-carrier injection at this point, but I’m going to stop here in case one of us loses the will to live. (If you are the sort of person who wants to know more, I can highly recommend this page accompanied by some strong coffee.)

Anyway, EPROMs could have data loaded into them (known as programming), but this data could also be erased through the use of ultra-violet light so that new data could be written. This cycle of programming and erasing is known as the program erase cycle (or PE Cycle) and is important because it can only happen a limited number of times per device… but that’s a topic for another post. However, while the reprogrammable nature of EPROMs was useful in laboratories, it was not a solution for packaging into consumer electronics – after all, including an ultra-violet light source in a device would make it cumbersome and commercially non-viable.

US Patent US4531203: Semiconductor memory device and method for manufacturing the same

A subsequent development, known as the EEPROM, could be erased through the application of an electric field, rather than through the use of light, which was clearly advantageous as this could now easily take place inside a packaged product. Unlike EPROMs, EEPROMs could also erase and program individual bytes rather than the entire chip. However, the EEPROMs came with a disadvantage too: every cell required at least two transistors instead of the single transistor required in an EPROM. In other words, they stored less data: they had lower density.

The Arrival of Flash

So EPROMs had better density while EEPROMs had the ability to electrically reprogram cells. What if a new method could be found to incorporate both benefits without their associated weaknesses? Dr Masuoka’s idea, submitted as US patent 4612212 in 1981 and granted four years later, did exactly that. It used only one transistor per cell (increasing density, i.e. the amount of data it could store) and still allowed for electrical reprogramming.

If you made it this far, here’s the important bit. The new design achieved this goal by only allowing multiple cells to be erased and programmed instead of individual cells. This not only gives the density benefits of EPROM and the electrically-reprogrammable benefits of EEPROM, it also results in faster access times: it takes less time to issue a single command for programming or erasing a large number of cells than it does to issue one per cell.

However, the number of cells that are affected by a single erase operation is different – and much larger – than the number of cells affected by a single program operation. And it is this fact, above all else, that results in the behaviour we see from devices built on flash memory. In the next post we will look at exactly what happens when program and erase operations take place, before moving on to look at the types of flash available (SLC, MLC etc) and their behaviour.

NAND and NOR

To try and keep this post manageable I’ve chosen to completely bypass the whole topic of NOR flash and just tell you that from this moment on we are talking about NAND flash, which is what you will find in SSDs, flash cards and arrays. It’s a cop out, I know – but if you really want to understand the difference then other people can describe it better than me.

In the meantime, we all have our good friend Dr Masuoka to thank for the flash memory that allows us to carry around the phones and tablets in our pockets and the SD cards in our digital cameras. Incidentally, popular legend has it that the name “flash” came from one of Dr Masuoka’s colleagues, because the process of erasing data reminded him of the flash of a camera. Presumably it was an analogue camera, because digital cameras only became popular in the 1990s after the commoditisation of a new, solid-state storage technology called …

Understanding Disk: Caching and Tiering


When I was a child, about four or five years old, my dad told me a joke. It wasn’t a very funny joke, but it stuck in my mind because of what happened next. The joke went like this:

Dad: “What’s big at the bottom, small at the top and has ears?”

Me: “I don’t know?”

Dad: “A mountain!”

Me: “Er…<puzzled>…  What about the ears?”

Dad: (Triumphantly) “Haven’t you heard of mountaineers?!”

So as I say, not very funny. But, by a twist of fate, the following week at primary school my teacher happened to say, “Now then children, does anybody know any jokes they’d like to share?”. My hand shot up so fast that I was immediately given the chance to bring the house down with my new comedy routine. “What’s big at the bottom, small at the top and has ears?” I said, with barely repressed glee. “I don’t know”, murmured the teacher and other children expectantly. “A mountain!”, I replied.

Silence. Awkwardness. Tumbleweed. Someone may have said “Duuh!” under their breath. Then the teacher looked faintly annoyed and said, “That’s not really how a joke works…” before waving away my attempts to explain and moving on to hear someone else’s (successful and funny) joke. Why had it been such a disaster? The joke had worked perfectly on the previous occasion, so why didn’t it work this time?

There are two reasons I tell this story: firstly because, as you can probably tell, I am scarred for life by what happened. And secondly, because it highlights what happens when you assume you can predict the unpredictable.*

Gambling With Performance

So far in this mini-series on Understanding Disk we’ve covered the design of hard drives, their mechanical limitations and some of the compromises that have to be made in order to achieve acceptable performance. The topic of this post is more about bandaids; the sticking plasters that users of disk arrays have to employ to try and cover up their performance problems. Or as my boss likes to call it, lipstick on a pig.

If you currently use an enterprise disk array, the chances are it has some sort of DRAM cache within the array. Blocks stored in this cache can be read at a much lower latency than those residing only on disk, because the operation avoids paying the price of seek time and rotational latency. If the cache is battery-backed, it can be used to accelerate writes too. But DRAM caches in storage area networks are notoriously expensive in relation to their size, which is often significantly smaller than the size of the active data set. For this reason, many array vendors allow you to use SSDs as an additional layer of slower (but higher capacity) cache.

Another common approach to masking the performance of disk arrays is tiering. This is where different layers of performance are identified (e.g. SATA disk, fibre-channel disk, SSD, etc) and data moved about according to its performance needs. Tiering can be performed manually, which requires a lot of management overhead, or automatically by software – perhaps the most well-known example being EMC’s Fully Automated Storage Tiering (or “FAST”) product. Unlike caching, which creates a copy of the data somewhere temporary, tiering involves permanently relocating the persistent location of the data. This relocation has a performance penalty, particularly if data is being moved frequently. Moreover, some automatic tiering solutions can take 24 hours to respond to changes in access patterns – now that’s what I call bad latency.

The Best Predictor of Future Behaviour is Past Behaviour

The problem with automatic tiering is that, just like caching, it relies on past behaviour to predict the future. That principle works well in psychology, but isn’t always as successful in computing. It might be acceptable if your workload is consistent and predictable, but what happens when you run your end of month reporting? What happens when you want to run a large ad-hoc query? What happens when you tell a joke about mountains and expect everyone to ask “but what about the ears”? You end up looking pretty stupid, I can tell you.

I have no problem with caching or tiering in principle. After all, every computer system uses cache in multiple places: your CPUs probably have two or three levels of cache, your server is probably stuffed with DRAM and your Oracle database most likely has a large block buffer cache. What’s more, in my day job I have a lot of fun helping customers overcome the performance limitations of nasty old spinning disk arrays using Violin’s Maestro memory services product.

But ultimately, caching and tiering are bandaids. They reduce the probability of horrible disk latency but they cannot eliminate it. And like a gambler on a winning streak, if you become more accustomed to faster access times and put more of your data at risk, the impact when you strike out is felt much more deeply. The more you bet, the more you have to lose.

Shifting the Odds in Your Favour

I have a customer in the finance industry who doesn’t care (within reason) what latency their database sees from storage. All they care about is that their end users see the same consistent and sustained performance. It doesn’t have to be lightning fast, but it must not, ever, feel slower than “normal”. As soon as access times increase, their users’ perception of the system’s performance suffers… and the users abandon them to use rival products.

They considered high-end storage arrays, but performance was woefully unpredictable no matter how much cache and SSD they used. They considered Oracle Exadata, but ruled it out because Exadata Flash Cache is still a cache – at some point a cache miss will mean fetching data from horrible, spinning disk. Now they use all flash arrays, because the word “all” means their data is always on flash: no gambling with performance.

Caching and tiering will always have some sort of place in the storage world. But never forget that you cannot always win – at some point (normally the worst possible time) you will need to access data from the slowest media used by your platform. Which is why I like all flash arrays: you have a 100% chance of your data being on flash. If I’m forced to gamble with performance, those are the odds I prefer…

* I know. It’s a tenuous excuse for telling this story, but on the bright side I feel a lot better for sharing it with you.

Playing The Data Reduction Lottery


Storage for DBAs: Do you want to sell your house? Or your car? Let’s go with the car – just indulge me on this one. You have a car, which you weren’t especially planning on selling, but I’m making you an offer you can’t refuse. I’m offering you one million dollars so how can you say no?

The only thing is, when we come to make the trade I turn up not with a suitcase full of cash but a single Mega Millions lottery ticket. How would you feel about that? You may well feel aggrieved that I am offering you something which cost me just $1 but my response is this: it has an effective value of well over $1m. Does that work for you?

Blurred Lines

The thing is, this happens all the time in product marketing and we just put up with it. Oracle’s new Exadata Database Machine X4-2 has 44.8TB of raw flash in a full rack configuration, yet the datasheet states it has an effective flash capacity of 448TB. Excuse me? Let’s read the small print to find out what this means: apparently this is “the size of the data files that can often be stored in Exadata and be accessed at the speed of flash memory”. No guarantees then, you just might get that, if you’re lucky. I thought datasheets were supposed to be about facts?

Meanwhile, back in storageland, a look at some of the datasheets from various flash array vendors throws up a similar practice. One vendor shows the following flash capacity figures for their array:

  • 2.75 – 11 TB raw capacity
  • 5 – 50 TB effective capacity

In my last two posts I covered deduplication and data compression as part of an overall data reduction strategy in storage. To recap, I gave my opinion that dedupe has no place with databases (although it has major benefits in workloads such as VDI) while data compression has benefits but is not necessarily best implemented at the storage level.

Here’s the thing. Your database vendor’s software has options that allow you to perform data reduction. You can also buy host-level software to do this. And of course, you can buy storage products that do this too. So which is best? It probably depends on which vendor you ask (i.e. database, host-level or storage), since each one is chasing revenue for that option – and in some storage vendor cases the data reduction is “always on”, which means you get it whether you want it or not (and whether you want to pay for it or not). But what you should know is this: your friendly flash storage vendor has the most to gain or lose when it comes to data reduction software.

Lies, Damned Lies and Capacities

When you purchase storage, you invariably buy it at a value based on price per usable capacity, most commonly using the unit of dollars per GB. This is simply a convenient way of comparing the price of competing products which may otherwise have different capacities: if a storage array costs $X and gives you Y GB of usable capacity, then the price in $/GB (dollars per gig) is therefore X/Y.

Now this practice originally developed when buying disk arrays – and there are some arguments to be made that $/GB carries less significance with flash… but everyone does it. Even if you aren’t doing it, chances are somebody in your purchasing department is. And even though it may not be the best way to compare two different products, you can bet that the vendor whose product has the lowest $/GB price will be the one looking most comfortable when it comes to decision day.

But what if there was a way to massage those figures? Each vendor wants to beat the competition, so they start to say things like, “Hey, what about if you use our storage compression features? On average our customers see a 10x reduction in data. This means the usable capacity is actually 10Y!“. Wouldn’t you know it? The price per gig (which is now X/10Y) just came down by 90%!
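Here is that massage spelled out with invented numbers, purely to show the mechanics:

    # Hypothetical figures: how "effective capacity" changes the $/GB maths.
    price         = 200_000   # $X for the array
    usable_gb     = 10_240    # Y GB of real, usable capacity
    claimed_ratio = 10        # the vendor's "average" data reduction

    print(f"real:      ${price / usable_gb:.2f}/GB")                   # $19.53/GB
    print(f"marketing: ${price / (usable_gb * claimed_ratio):.2f}/GB") # $1.95/GB
    # Same array, same price -- the only thing that changed is an assumption
    # about how well *your* data will reduce.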

The First Rule of Compression

You all know this, but I’m going to say it anyway. Different sets of data result in different levels of compression (and deduplication). It’s obvious. Yet in the sterile environment of datasheets and TCO calculations it often gets overlooked. So let me spell it out once and for all:

The first rule of compression is that the compression ratio is entirely dependent on the data being compressed.

Thus if you are buying or selling a product that uses compression, deduplication or any other data reduction technology, you cannot make any guarantees. Sure, you can talk about “average compression ratios”, but what does that mean? Is there really such a thing as the average dataset?
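You can demonstrate the first rule for yourself in a few lines of Python with the standard zlib library: the same algorithm applied to two different datasets produces wildly different ratios.

    import os
    import zlib

    repetitive = b"SELECT * FROM orders;" * 1000   # highly compressible
    random_ish = os.urandom(len(repetitive))       # essentially incompressible

    for name, data in (("repetitive", repetitive), ("random", random_ish)):
        ratio = len(data) / len(zlib.compress(data))
        print(f"{name}: {ratio:.1f}:1")
    # Typical output: the repetitive data compresses at well over 100:1, the
    # random data at roughly 1:1 (or slightly worse, once the compression
    # format's overhead is added). There is no such thing as an average dataset.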

Conclusion: Know What You Are Paying For

It’s a very simple message: when you buy a flash array (or indeed any storage array) be sure to understand the capacity values you are buying and paying for. Dollar per GB values are only relevant with usable capacities, not so-called effective or logical capacities. Also, don’t get too hung up on raw capacity values, since they won’t help you when you run out of usable space.

Definitions are important. Without them, nothing we talk about is … well, definite. So here are mine:

  • Raw capacity: the total amount of physical flash (or disk) installed in the system.
  • Usable capacity: the capacity actually available for user data once formatting, protection and spare space have taken their share.
  • Effective capacity: the usable capacity multiplied by an assumed data reduction ratio – an estimate, not a guarantee.

Storage Myths: Storage Compression Has No Downside


Storage for DBAs: My last post in this blog series was aimed at dispelling the myth that dedupe is a suitable storage technology for databases. To my surprise it became the most popular article I’ve ever published (based on reads per day). Less surprisingly though, it led to quite a backlash from some of the other flash storage vendors, who responded with comments along the lines of “well we don’t need dedupe because we also have compression”. Fair enough. So today let’s take a look at the benefits and drawbacks of storage-level compression as part of an overall data reduction strategy. And by the way, I’m not against either dedupe or storage-level compression. I just think they have drawbacks as well as benefits – something that isn’t always made clear in the marketing literature. And being in the storage industry, I know why that is…

What Is Compression?

In storageland we tend to talk about the data reduction suite of tools, which comprises deduplication, compression and thin provisioning. The latter is a way of freeing up capacity which is allocated but not used… but that’s a topic for another day.

Dedupe and compression have a lot in common: they both fundamentally involve the identification and removal of patterns, which are then replaced with keys. The simplest way to explain the difference would be to consider a book shelf filled with books. If you were going to dedupe the bookshelf you would search through all of the books, removing any duplicate titles and making a note of how many duplicates there were. Easy. Now you want to compress the books, so you need to read each book and look for duplicate patterns of words. If you find the same sentence repeated numerous times, you can replace it with a pointer to a notebook where you jot down the original. Hmmm… less easy. You can see that dedupe is much more of a quick win.

Of course there is more to compression than this. Prior to any removal of duplicate patterns data is usually transformed – and it is the method used in this transformation process that differentiates all of the various compression algorithms. I’m not going to delve into the detail in this article, but if you are interested then a great way to get an idea of what’s involved is to read the Wikipedia page on the BZIP2 file compression tool and look at all the processes involved.

Why Compress?

Data compression is essentially a trade-off, where reduced storage footprint is gained at the expense of extra CPU cycles – and therefore, as a consequence, extra time. This additional CPU and time must be spent whenever the compressed data is read, thus increasing the read latency. It also needs to be spent during a write, but – as with dedupe – this can take place during the write process (known as inline) or at some later stage (known as post-process). Inline compression will affect the write latency but will also eliminate the need for a staging area where uncompressed data awaits compression.

Traditionally, compression has been used for archive data, i.e. data that must be retained but is seldom accessed. This is a good fit for compression, since the additional cost of decompression will rarely be paid. However, the use of compression with primary data is a different story: does it make sense to repeatedly incur time and CPU penalties on data that is frequently read or written? The answer, of course, is that it’s entirely down to your business requirements. However, I do strongly believe that – as with dedupe – there should be a choice, rather than an “always on” solution where you cannot say no. One vendor I know makes this rather silly claim: “so important it’s always-on”. What a fine example of a design limitation manifesting itself as a marketing claim.

Where to Compress?

As with all applications, data tends to flow down from users through an application layer, into a database. This database sits on top of a host, which is connected to some sort of persistent storage. There are therefore a number of possible places where data can be compressed:

  • Database-level compression, such as using basic compression in Oracle (part of the core product), or the Advanced Compression option (extra license required).
  • Host-level compression, such as you might find in products like Symantec’s Veritas Storage Foundation software suite.
  • Storage-level compression, where the storage array compresses data either at a global level, or more ideally, at some configurable level (e.g. by the LUN).

Of course, compressed data doesn’t easily compress again, since all of the repetitive patterns will have been removed. In fact, running compression algorithms on compressed data is, at the very least, a waste of time and CPU – while in the worst case it could actually increase the size of the compressed data. This means it doesn’t really make sense to use multiple levels of compression, such as both database-level and storage-level. Choosing the correct level is therefore important. So which is best?
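The point about recompressing compressed data is easy to verify with a quick zlib experiment:

    import zlib

    data  = b"the same sentence repeated numerous times " * 500
    once  = zlib.compress(data)
    twice = zlib.compress(once)    # compressing the compressed output

    print(len(data), len(once), len(twice))
    # Typically the second pass saves nothing -- in fact it usually *grows*
    # the data slightly, while still burning CPU on compress and decompress.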

Benefits and Drawbacks

If you read some of the marketing literature I’ve seen recently you would soon come to the conclusion that compressing your data at the storage level is the only way to go. It certainly has some advantages, such as ease of deployment: just switch it on and sit back, all of your data is now compressed. But there are drawbacks too – and I believe it pays to make an informed decision.

Performance

The most obvious and measurable drawback is the addition of latency to I/O operations. In the case of inline compression this affects both reads and writes, while post-process compression inevitably results in more background I/O operations taking place, increasing wear and potentially impacting other workloads. Don’t take it for granted that this additional latency won’t affect you, especially at peak workload. Everyone in the flash industry knows about a certain flash vendor whose inline dedupe and compression software has to switch into post-process mode under high load, because it simply cannot cope.

Influence

This one is less obvious, but in my opinion far more important. Let’s say you compress your database at the storage-level so that as blocks are written to storage they are compressed, then decompressed again when they are read back out into the buffer cache. That’s great, you’ve saved yourself some storage capacity at the overhead of some latency. But what would have happened if you’d used database-level compression instead?

With database-level compression, the data inside the data blocks would be compressed. This means not just the data which resides on storage, but also the data in memory – inside the buffer cache. That means you need less physical memory to hold the same amount of data, because it’s compressed in memory as well as on storage. What will you do with the excess physical memory? You could increase the size of the buffer cache, holding more data in memory and possibly improving performance through a reduction in physical I/O. Or you could run more instances on the same server… in fact, database-level compression is very useful if you want to build a consolidation environment, because it allows a greater density of databases per physical host.

There’s more. Full table scans will scan a smaller number of blocks because the data is compressed. Likewise any blocks sent over the network contain compressed data, which might make a difference to standby or Data Guard traffic. When it comes to compression, the higher up in the stack you begin, the more benefits you will see.

Don’t Believe The Hype

The moral of this story is that compression, just like deduplication, is a fantastic option to have available when and if you want to use it. Both of these tools allow you to trade time and CPU resource in favour of a reduced storage footprint. Choices are a good thing.

They are not, however, guaranteed wins – and they should not be sold as such. Take the time to understand the drawbacks before saying yes. If your storage vendor – or your database vendor (“storage savings of up to 204x”!!) – is pushing compression, maybe they have a hidden agenda? In fact, they almost definitely will have.

And that will be the subject of the next post…

Storage Myths: Dedupe for Databases

rubber-ducks

Spot the duplicate duck

Storage for DBAs: Data deduplication – or “dedupe” – is a technology which falls under the umbrella of data reduction, i.e. reducing the amount of capacity required to store data. In very simple terms it involves looking for repeating patterns and replacing them with a marker: as long as the marker requires less space than the pattern it replaces, you have achieved a reduction in capacity. Deduplication can happen anywhere: on storage, in memory, over networks, even in database design – for example, the standard database star or snowflake schema. However, in this article we’re going to stick to talking about dedupe on storage, because this is where I believe there is a myth that needs debunking: databases are not a great use case for dedupe.

Deduplication Basics: Inline or Post-Process

If you are using data deduplication, either through a storage platform or via software on the host layer, you have two basic choices: you can deduplicate data at the time that it is written (known as inline dedupe) or allow it to arrive and then dedupe it at your leisure in some transparent manner (known as post-process dedupe). Inline dedupe affects the time taken to complete every write, directly affecting I/O performance. The benefit of post-process dedupe therefore appears to be that it does not affect performance – but think again: post-process dedupe first requires data to be written to storage, then read back out into the dedupe algorithm, before being written to storage again in its deduped format – thus magnifying the amount of I/O traffic and indirectly affecting I/O performance. In addition, post-process dedupe requires more available capacity to provide room for staging the inbound data prior to dedupe.

Deduplication Basics: (Block) Size Matters

In most storage systems dedupe takes place at a defined block size, whereby each block is hashed to produce a unique key before being compared with a master lookup table containing all known hash keys. If the newly-generated key already exists in the lookup table, the block is a duplicate and does not need to be stored again. The block size is therefore pretty important, because the smaller the granularity, the higher the chances of finding a duplicate:

In the picture you can see that the pattern “1234” repeats twice over a total of 16 digits. With an 8-digit block size (the lower line) this repeat is not picked up, since the second half of the 8-digit pattern does not repeat. However, by reducing the block size to 4 digits (the upper line) we can now get a match on our unique key, meaning that the “1234” pattern only needs to be stored once.

This sounds like great news, let’s just choose a really small block size, right? But no, nothing comes without a price – and in this case the price comes in the size of the hashing lookup table. This table, which contains one key for every unique block, must range in size from containing just one entry (the “ideal” scenario where all data is duplicated) to having one entry for each block (the worst case scenario where every block is unique). By making the block size smaller, we are inversely increasing the maximum size of the hashing table: half the block size means double the potential number of hash entries.
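Here is the whole block-size trade-off as a toy sketch, mirroring the “1234” example above. The hashing and the lookup table are real enough; everything else a production dedupe engine needs is omitted:

    import hashlib

    def dedupe(data: bytes, block_size: int) -> int:
        """Toy dedupe: returns how many unique blocks must be stored."""
        table = {}                                # hash key -> stored block
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            key = hashlib.sha256(block).digest()  # hash each block to a key
            if key not in table:                  # unseen pattern: store it
                table[key] = block
        return len(table)

    data = b"1234abcd1234wxyz"   # "1234" repeats; the rest is unique
    print(dedupe(data, 8))       # 8-digit blocks: 2 stored, no duplicates found
    print(dedupe(data, 4))       # 4-digit blocks: 3 stored, "1234" kept once
                                 # (but the hash table now holds more entries)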

Hash Abuse

Why do we care about having more hash entries? There are a few reasons. First there is the additional storage overhead: if your data is relatively free of duplication (or the block size does not allow duplicates to be detected) then not only will you fail to reclaim any space, but you may end up using extra space to store all of the unique keys associated with each block. This is clearly not a great outcome when using a technology designed to reduce the footprint of your data. Secondly, the more hash entries you have, the more entries you need to scan through when comparing freshly-hashed blocks during writes or locating existing blocks during reads. In other words, the more of a performance overhead you will suffer in order to read your data and (in the case of inline dedupe) write it.

If this is sounding familiar to you, it’s because the hash data is effectively a database in which storage metadata is stored and retrieved. Just like any database, the performance will be dictated by the volume of data as well as the compute resource used to manipulate it, which is why many vendors choose to store this metadata in DRAM. Keeping the data in memory brings certain performance benefits, but at the price of volatility: changes in memory will be lost if the power is interrupted, so regular checkpoints to persistent storage are required. Even then, battery backup is often required, because the loss of even one hash key means data corruption. If you are going to replace your data with markers from a lookup table, you absolutely cannot afford to lose that lookup table, or there will be no coming back.

Database Deduplication – Don’t Be Duped

Now that we know what dedupe is all about, let’s attempt to apply it to databases and see what happens. You may be considering the use of dedupe technology with a database system, or you may simply be considering the use of one of a number of recent storage products that have inline dedupe in place as an “always on” option, i.e. you cannot turn it off regardless of whether it helps or hinders. The vendor may make all sorts of claims about the possibilities of dedupe, but how much benefit will you actually see?

Let’s consider the different components of a database environment in the context of duplication:

  • Oracle datafiles contain data blocks which have block headers at the start of the block. These contain numbers which are unique for each datafile, making deduplication impossible at the database block size. In addition, the end of each block contains a tailcheck section which features a number generated using data such as the SCN, so even if the block were divided into two, the second half would offer limited opportunity for dedupe while the first half would offer none. (A short simulation of this effect follows the list below.)
  • Even if you were able to break down Oracle blocks into small enough chunks to make dedupe realistic, any duplication of data is really a massive warning about your database design: normalise your data! Also, consider features like index key compression which are part of the Enterprise Edition license.
  • Most Oracle installations have multiplexed copies of important files like online redo logs and controlfiles. These files are so important that Oracle synchronously maintains multiple copies in order to protect against data loss. If your storage system is deduplicating these copies, that is a bad thing – particularly if it’s an always-on feature that gives you no option.
  • While unallocated space (e.g. in an ASM diskgroup) might appear to offer the potential for dedupe, this is actually a problem which you should solve using another storage technology: thin provisioning.
  • You may have copies of datafiles residing on the same storage as production, which therefore allow large-scale deduplication to take place; perhaps they are used as backups or test/development environments. However, in the latter case, test/dev environments are a use case for space-efficient snapshots rather than dedupe. And if you are keeping your backups on the same storage system as your production data, well… good luck to you. There is nothing more for you here.
  • Maybe we aren’t talking about production data at all. You have a large storage array which contains multiple copies of your database for use with test/dev environments – and thus large portions of the data are duplicated. Bingo! The perfect use case for storage dedupe, right? Wrong. Database-level problems require database-level solutions, not storage-level workarounds. Get yourself some licenses for Delphix and you won’t look back.
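To back up the point about block headers, here is a quick simulation: a thousand 4KB “database blocks” sharing an identical body but each carrying a unique 16-byte header. The layout is hypothetical (loosely mimicking the unique block address and SCN fields described above), but the effect on block-level dedupe is the point:

    import hashlib
    import os

    body = os.urandom(4096 - 16)             # identical body in every block
    blocks = [i.to_bytes(16, "big") + body   # unique header per block
              for i in range(1000)]

    unique = {hashlib.sha256(b).digest() for b in blocks}
    print(len(unique))   # -> 1000: every block hashes differently, 0% dedupe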

To conclude, while dedupe is great in use cases like VDI, it offers very limited benefit in database environments while potentially making performance worse. That in itself is worrying, but what I really see as a problem is the way that certain storage vendors appear to be selling their capacity based on assumed levels of dedupe, i.e. “Sure we are only giving you X terabytes of storage for Y price, but actually you’ll get 10:1 dedupe which means the price is really ten times lower!”

Sizing should be based on facts, not assumptions. Just like in the real world, nothing comes for free in I.T. – and we’ve all learnt that the hard way at some point. Don’t be duped.

Understanding Disk: Over-Provisioning


Storage for DBAs: In a recent news article in the UK, supermarket giant Tesco said it threw away almost 30,000 tonnes of food in the first half of 2013. That’s about 33,000 tons for those of you who can’t cope with the metric system. The story caused a lot of debate about the way in which we ignore the issue of wasted food – with Tesco being both criticised for the wastage and praised for publishing the figures. But waste isn’t a problem confined to just the food industry. The chances are it’s happening in your data centre too.

Stranded Capacity

As a simple example, let’s take a theoretical database which requires just under 6TB of storage capacity. To avoid complicating things we are going to ignore concepts such as striping, mirroring, caching and RAID for a moment and just pretend you want to stick a load of disks in a server. How many super-fast 15k RPM disk drives do you need if each one is 600GB? You need about ten, more or less, right? But here’s the thing: the database creates a lot of random I/O so it has a peak requirement for around 20,000 physical IOPS (I/O operations per second). Those 600GB drives can only service 200 IOPS each. So now you need 100 disks to be able to cope with the workload. 100 multiplied by 600GB is of course 60TB, so you will end up deploying sixty terabytes of capacity in order to service a database of six terabytes in size. Welcome to over-provisioning.
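The arithmetic of that example in a few lines, using the same assumed figures:

    # Drives needed is whichever of capacity or IOPS demands more spindles.
    required_gb, required_iops = 6_000, 20_000   # ~6TB of data, 20k IOPS peak
    drive_gb, drive_iops       = 600, 200        # one 15k RPM drive

    for_capacity = required_gb // drive_gb       # 10 drives for the space
    for_iops     = required_iops // drive_iops   # 100 drives for the workload

    drives = max(for_capacity, for_iops)
    print(f"{drives} drives = {drives * drive_gb // 1000}TB deployed for 6TB of data")
    # -> 100 drives = 60TB deployed for 6TB of data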

Now here’s the real kicker. That remaining 54TB of capacity? You can’t use it. At least, you can’t use it if you want to be able to guarantee the 20,000 IOPS requirement we started out with. Any additional workload you attempt to deploy using the spare capacity will be issuing I/Os against it, resulting in more IOPS. If you were feeling lucky, you could take a gamble on trying to avoid any new workloads being present during peak requirement of the original database, but gambling is not something most people like to do in production environments. In other words, your spare capacity is stranded. Of your total disk capacity deployed, you can only ever use 10% of it.

Of course, disk arrays in the real world tend to use concepts such as wide-striping (spreading chunks of data across as many disks as possible to take advantage of all available performance) and caching (staging frequently accessed blocks in faster DRAM) but the underlying principle remains.

Short Stroking

If that previous example makes you cringe at the level of waste, prepare yourself for even worse. In my previous article I talked about the mechanical latency associated with disk, which consists of seek time (the disk head moving across the platter) and rotational latency (the rotation of the platter to the correct sector). If latency is critical (which it always, always is) then one method of reducing the latency experienced on a disk system is to limit the movement of the head, thus reducing the seek time. This is known as short stroking. If we only use the outer sectors of the platter (such as those coloured green in the diagram here), the head is guaranteed to always be closer to the next sector we require – and note that the outer sectors are preferable because they have a higher transfer rate than the inner sectors (to understand why, see the section on zones in this post). Of course, this has a direct consequence in that a large portion of the disk is now unused, sometimes up to 90%. In the case of a 600GB disk, short stroking may result in only 60GB of usable capacity, which means ten times as many disks are necessary to provide the same capacity as disks which are not short stroked.

Two Types of Capacity

When people talk about disk capacity they tend to be thinking of the storage capacity, i.e. the number of bytes of data that can be stored. However, while every storage device must have a storage capacity, it will also have a performance capacity – a limit to the amount of performance it can deliver, measured in I/Os per second and/or some derivative of bytes per second. And the thing about capacities is that bad things tend to happen when you try to exceed them.

performance-and-storage-capacity

In simplistic terms, performance and storage capacity are linked, with the ratio between them being specific to each type of storage. With disk drives, the performance capacity usually becomes the blocker before the storage capacity, particularly if the I/O is random (which means high numbers of IOPS). This means any overall solution you design must exceed the required storage capacity in order to deliver on performance. In the case of flash memory, the opposite is usually true: by supplying the required storage capacity there will be a surplus of performance capacity. Provide enough space and you shouldn’t need to worry about things like IOPS and bandwidth. (Although I’m not suggesting you should forego due diligence and just hope everything works out ok…)

Waste Watching

I opened with a reference to the story about food wastage – was it fair to compare this to wasted disk capacity in the data centre? One is a real world problem and the other a hypothetical idea taking place somewhere in cyberspace, right? Well, maybe not. Think of all those additional disks that are required to provide the performance capacity you need, resulting in excess storage capacity which is either stranded or (in the case of short stroking) not even addressable. All those spindles require power to keep them spinning – power that mostly comes from power stations burning fossil fuels. The heat that they produce means additional cooling is required, adding to the power draw. And the additional data centre floor space means more real estate, all of which costs money and consumes resources. It’s all waste.

And that’s just the stuff you can measure. What about the end users that have to wait longer for their data because of the higher latency of disk? Those users may be expensive resources in their own right, but they are also probably using computers or smart devices which consume power, accessing your database over a network that consumes more power, via application servers that consume yet more power… all wasting time waiting for their results.

Wasted time, wasted money, wasted resources. The end result of over-provisioning is not something you should under-estimate…

Understanding Disk: Mechanical Limitations


Storage for DBAs: The year 2000. Remember that? The IT industry had just spent years making money off the back of the millennium bug (which turned out not to exist). The world had spent even more money than usual on fireworks, parties and alcohol. England failed to win an international football tournament again (some things never change). It was all quite a long time ago.

It was also an important year in the world of disk drive manufacturing, because in February 2000 Seagate released the Cheetah X15 – the world’s first disk drive spinning at 15,000 rotations per minute (RPM). At the time this was quite a milestone, because it came only four years after the first 10k RPM drive (also a Seagate Cheetah) had been released. The 50% jump from 10k to 15k made it look like disks were on a path of continual acceleration… and yet now, as I write this article in 2013, 15k remains the fastest disk drive available. What happened?

Disks haven’t completely stagnated. They continue to get both larger and smaller: larger in terms of capacity, smaller in terms of form factor. But they just aren’t getting any faster…

You don’t need to take my word for this. This technology paper (from none other than Seagate themselves) contains the following table on page 2:

Document TP-525 – Seagate Global Product Marketing (May 2004)

Published in 2004, this document shows that between 1987 and 2004 CPU performance increased by a factor of two million, while disk drive performance increased by a factor of just eleven. And that was nearly a decade ago, so CPU has improved many times more since then. Yet the 15k RPM drive remains the upper limit of engineering for rotating media – and this is unlikely to change, for reasons I will explain in a minute. But first, a more basic question: why do we care?

Seek Time and Rotational Latency

As we discussed previously, a hard drive contains one or more rotating aluminium disks (known as “platters”) which are coated in a ferromagnetic material used to magnetically store bits (i.e. zeros and ones). In the case of a 15k RPM drive it is the platter that rotates 15,000 times per minute, or 250 times per second. Data is read from and written to the platter by a read/write head mounted on top of an actuator arm which moves to seek out the desired track. Once the head is above this track, the platter must rotate to the start of the required sector – and only at this point does the I/O begin.

Both of those mechanical actions take time – and therefore introduce delays into the process of performing I/O. The time taken to move the head to the correct track is known as the seek time, while the time taken to rotate the platter is called the rotational latency. Since the platter does not stop spinning during normal operation, it is not possible to perform both actions concurrently – if the platter spins to the correct sector before the head is above the right track, it must continue to spin through another revolution.

Now obviously, both of these wait times are highly dependent on where the head and platter were prior to the I/O call. In the best case, the head does not need to move at all and the platter is already at the correct sector, requiring zero wait time. But in the worst case, the head has to travel from one edge to the other and then the platter needs to rotate 359 degrees to start the read or write. Each I/O is like a throw of the dice, which is why when we discuss latency we have to talk in averages.

Modern enterprise disk drives such as this Seagate Cheetah 15K.7 SAS drive have average seek times of around 3.5ms to 4ms – although in a later article I will talk about how this can be reduced. Rotational latency is a function of the speed at which the platter rotates – hence the interest in making disks that spin at 15k RPM or faster. At 15,000 revolutions per minute, each revolution takes 4ms, meaning the set of all possible latencies ranges from 0ms to 4ms. Once you consider it like that, it’s pretty obvious that the average rotational latency of a 15k RPM drive is 2ms. You can use the same method to calculate that a 10k RPM drive has an average of 3ms, and so on.
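The same method works for any rotational speed – average rotational latency is simply the time taken for half a revolution:

    # Average rotational latency = half a revolution = (60,000ms / RPM) / 2.
    def avg_rotational_latency_ms(rpm: int) -> float:
        return (60_000 / rpm) / 2

    for rpm in (7_200, 10_000, 15_000):
        print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.1f}ms")
    # -> 7200 RPM: 4.2ms / 10000 RPM: 3.0ms / 15000 RPM: 2.0ms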

Why Not Spin Faster?

A surprisingly-common misconception about disk drives is that the head touches the platter, in the same way as a vinyl record player has a needle/stylus that touches the record (for any younger readers: this is how we used to live). But in a disk drive, the head “flies” above the platter at a distance of a few nanometers, using airflow to keep itself in place (similar to the concept of a fluid bearing).

There are three main problems with making disks that revolve faster than 15k, the first of which is energy: the power consumption required to create all of that additional kinetic energy, as well as the associated heat it creates. At a time when energy costs are constantly increasing, customers do not want to pay even more money for devices that require more power and more cooling to run.

tornado

A more fundamental problem is related to aerodynamics: by rotating the platter faster, the platter's outer edge starts to travel at very high speed, causing it to interact negatively with the surrounding air (a phenomenon known scientifically as flutter). For a 3.5 inch platter spinning at 15k RPM, the outer edge is travelling at approximately 150 miles per hour. It isn't possible to remove the air, because the head relies on it to fly – removing it would result in the head touching the platter, causing a catastrophic head crash. One potential solution is to reduce the diameter of the platter, e.g. to 2.5 inches or less (thereby reducing the speed of the edge), but this results in reduced disk capacity. Some manufacturers are replacing the air with helium in an attempt to reduce flutter and buffeting, but helium is both expensive and scarce, and the manufacturing process is undoubtedly more complex.
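That 150mph figure is easy to verify with a little arithmetic, taking the platter diameter and rotational speed quoted above at face value:

import math

diameter_inches = 3.5
rpm = 15_000

circumference_inches = math.pi * diameter_inches
inches_per_hour = circumference_inches * rpm * 60
mph = inches_per_hour / 63_360              # 63,360 inches in a mile
print(f"Edge speed: {mph:.0f} mph")         # Edge speed: 156 mph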

dollars

But the biggest – and potentially insurmountable – problem is the cost. Technical challenges can (mostly) be overcome, but customers will not pay for products that do not make economic sense. Disks are built and sold in large volumes, so any new manufacturing process requires a heavy investment – essentially a gamble taken by the manufacturer on whether customers will buy the new product or not. It's unlikely we'll see anything faster than 15k RPM manufactured in volume, because there simply isn't the demand. Not when there are so many cheaper workarounds…

Spin Doctors

As the old saying goes, necessity is the mother of invention. Give people enough time to work around a problem and you will get all sorts of interesting tricks, hacks and dodges. And in the case of disk performance, we've been suffering from the limitations of mechanical latency for decades. The next articles in this series will look at some of the more common strategies employed to avoid, or limit, the impact of mechanical latency on disk performance: over-provisioning, short-stroking, caching, tiering… But rest assured of one thing: they are all just attempts at hiding the problem. Ultimately, the only way to remove the pain of mechanical latency is to remove the mechanics. By moving to semiconductor-based storage, all of the delays introduced by moving heads and spinning platters simply disappear… in a flash.

Understanding Disk: Superpowers

disk-platter

Storage for DBAs: It’s a familiar worn-out story. A downtrodden and oppressed population are rescued from their plight by a mysterious superhero. Over time they come to rely on this new superbeing – taking him for granted even, complaining when he isn’t immediately available to save them and alleviate their pain. As the years progress, memories of the “old days” fade away while younger generations grow up with no concept of how bad things used to be. Our superhero is no longer special to us, in fact we feel he isn’t doing enough. As he grows old we grumble and complain while he desperately looks to the skies for someone younger, faster and with greater powers to relieve him of his burden: the burden of our expectation.

No it’s ok, I haven’t taken a creative writing course and taken this opportunity to practice my (lack of) skills on you. The aging superhero in my story is the humble disk drive.

The disk drive has been around, in one form or another, for well over 50 years. It’s changed, of course (the original IBM RAMAC 305 weighed over a ton) and capacity figures have changed too. But the mechanical aspects of storing data on a spinning magnetic disk – the physics of the design – remain the same.

Terminology

hard-drive-diagram

A hard disk drive consists of a set of one or more platters, which are disks of non-magnetic material such as aluminium. The platters spin at a constant speed on a common axle which we call a spindle – and by extension we often refer to the entire hard drive unit as a spindle too. The platters are coated in a thin layer of ferromagnetic material, which is where data is stored in binary form, arranged in concentric circles which we call tracks. Each track is divided into equal-sized segments called sectors, and it is these that hold the data, along with additional overheads such as error correction codes. Traditionally, a sector contained 512 bytes of user data – but modern disks conforming to the Advanced Format standard use 4096-byte sectors. Data is read and written by a read/write head located on the end of a movable actuator arm which can traverse the platter – and of course with multiple and/or double-sided platters there will be multiple heads.
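To see how these terms fit together, here is a back-of-envelope capacity calculation. Every number below is invented purely for illustration – real drive geometries vary enormously:

platters = 4
surfaces = platters * 2                 # double-sided platters, one head per surface
tracks_per_surface = 250_000
sectors_per_track = 1_500               # a simplification – see zoned bit recording below
bytes_per_sector = 4_096                # Advanced Format

capacity = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
print(f"{capacity / 10**12:.1f} TB")    # 12.3 TB with these invented numbers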

That’s where disk’s superpowers came from: the winning combination of a moving head and the concentric tracks of data. These days it almost seems like a flaw, but to appreciate the magic you need to consider the technology that disk replaced.

The Bad Old Days

tape

Before disk, we had tape – a medium still in use today for other purposes such as backups. A big spinning reel of magnetic tape can transfer data in and out pretty fast (i.e. it has a high bandwidth) when the I/O is sequential, because the blocks are stored contiguously on the tape. But any kind of random I/O requires a mechanical delay (i.e. high latency) as the tape is wound backwards or forwards to locate the starting block and place it in front of the fixed read/write head. The time taken to locate the starting block is known as the seek time – a term that has haunted storage for decades.

When disk arrived it seemed revolutionary (pardon the pun). Like tape, disk used spinning magnetic media, but unlike tape, the read/write head could now move – allowing drastically reduced seek times. Want to move from reading the last sector on the disk to the first? No problem, the actuator arm simply moves across the tracks and then the platter rotates to find the first sector. A tape, on the other hand, would need to rewind the entire reel.

So, ironically, disk represented a massive leap forward for the performance of random I/O. How the mighty have fallen… but I’m going to save any talk of performance until the next post. For now, we need to finish off describing the basic layout of disk.

On The Edge

hard-drive-structure

The picture on the right shows a traditional disk layout, where individual sectors (C) fan out across the tracks (A) like slices of a cake or pie. If you consider a set of sectors (B – all those shaded in blue) you can see that they get longer the further towards the edge they are. How long does it take to read three consecutive sectors on a track, such as those highlighted in green (D)? In the diagram those three sectors cover 90 degrees of the platter, so the time to read them would be one quarter of a revolution – and, crucially, that would be the same no matter how close to the spindle or the edge the track lies.
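A quarter of a revolution is easy to put a number on – at 15k RPM it works out at 1ms (the helper function here is my own, not a real API):

def read_time_ms(angle_degrees, rpm):
    """Time for the platter to sweep a given angle under the head."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution * (angle_degrees / 360)

print(read_time_ms(90, 15_000))    # 1.0 (ms) – a quarter revolution at 15k RPM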

In this traditional model, each sector contains the same amount of data (512 or 4096 bytes plus overheads). Since data is nothing more than bits (zeros and ones), we can say that the density of those bits is greater towards the centre of the platter and lower at the edge. In other words, we aren't really utilising all of the available platter surface as we move further towards the edge. This directly affects the capacity of the drive, since less data is being stored than the platter could physically hold. There are solutions to this, however – and the most common solution is zoned bit recording.
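A quick sketch makes the point – with a fixed number of sectors per track, the same payload is squeezed into a much shorter circumference near the centre (the radii here are invented for illustration):

import math

bits_per_track = 1_500 * 4_096 * 8      # same payload on every track
radii = {"inner": 0.75, "outer": 1.75}  # track radius in inches (invented)

for name, radius in radii.items():
    density = bits_per_track / (2 * math.pi * radius)   # bits per linear inch
    print(f"{name} track: {density:,.0f} bits per inch")
# The inner track ends up more than twice as dense as the outer one.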

In The Zone

hard-drive-zoned-bit-recording

To ensure that all of the available surface is used on each platter, many modern disk drives use a technique where the number of sectors per "slice" increases towards the edge of the disk. To simplify this design, tracks are grouped into zones, each of which has a defined number of sectors per track. The result is that outer zones squeeze more sectors onto each track than inner zones. This has the benefit of increasing the capacity of the drive, because the surface is used more efficiently and bit densities remain consistently high.

But zoned bit recording also has another interesting effect: on the outer edge, more sectors now pass under the head per revolution. To put that more simply, the outer edge has a higher transfer rate (i.e. bandwidth) than the inner edge. And since most drives tend to number their tracks starting at the outer edge and working inwards, the result is that data stored at the logical start of a drive benefits from this higher bandwidth while data stored at the end experiences the opposite effect.
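The bandwidth effect is simple to model: at a constant rotational speed, the sequential transfer rate of a zone is just its sectors per track multiplied by the sector size and the revolutions per second. The zone definitions below are invented – real drives have far more zones than this:

rpm = 15_000
revolutions_per_second = rpm / 60       # 250 at 15k RPM
bytes_per_sector = 512                  # traditional sector size

zones = {"outer": 2_000, "middle": 1_500, "inner": 1_000}   # sectors per track (invented)

for zone, sectors_per_track in zones.items():
    mb_per_sec = sectors_per_track * bytes_per_sector * revolutions_per_second / 10**6
    print(f"{zone} zone: {mb_per_sec:.0f} MB/s")
# outer zone: 256 MB/s
# middle zone: 192 MB/s
# inner zone: 128 MB/s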

This has nothing to do with latency, though – it is purely a bandwidth phenomenon. Latency is a whole different discussion – and as such, the subject of a whole different post…