Storage Myths: Storage Compression Has No Downside

Image courtesy of marcovdz


Storage for DBAs: My last post in this blog series was aimed at dispelling the myth that dedupe is a suitable storage technology for databases. To my surprise it became the most popular article I’ve ever published (based on reads per day). Less surprisingly though, it led to quite a backlash from some of the other flash storage vendors, who responded with comments along the lines of “well we don’t need dedupe because we also have compression”. Fair enough. So today let’s take a look at the benefits and drawbacks of storage-level compression as part of an overall data reduction strategy. And by the way, I’m not against either dedupe or storage-level compression. I just think they have drawbacks as well as benefits – something that isn’t always being made clear in the marketing literature. And being in the storage industry, I know why that is…

What Is Compression?

In storageland we tend to talk about the data reduction suite of tools, which comprises deduplication, compression and thin provisioning. The latter is a way of freeing up capacity which is allocated but not used… but that’s a topic for another day.

Dedupe and compression have a lot in common: they both fundamentally involve the identification and removal of patterns, which are then replaced with keys. The simplest way to explain the difference would be to consider a book shelf filled with books. If you were going to dedupe the bookshelf you would search through all of the books, removing any duplicate titles and making a note of how many duplicates there were. Easy. Now you want to compress the books, so you need to read each book and look for duplicate patterns of words. If you find the same sentence repeated numerous times, you can replace it with a pointer to a notebook where you can jot down the original. Hmmm…. less easy. You can see that dedupe is much more of a quick win.
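To make the bookshelf analogy a little more concrete, here is a minimal Python sketch (the “book” contents are invented, and zlib simply stands in for whatever algorithm a real array might use) contrasting the two approaches: dedupe hashes whole items and discards exact duplicates, while compression has to read inside every item looking for repeating patterns.

import hashlib
import zlib

books = [
    b"Call me Ishmael. " * 400,            # hypothetical book contents
    b"It was the best of times. " * 400,
    b"Call me Ishmael. " * 400,            # an exact duplicate of the first book
]

# Dedupe: hash each whole book and keep one copy per unique hash - a quick win.
unique_books = {hashlib.sha256(book).hexdigest(): book for book in books}
print(f"dedupe: {len(books)} books on the shelf, {len(unique_books)} actually stored")

# Compression: every book must be read and its repeating phrases replaced.
compressed = [zlib.compress(book) for book in books]
ratio = sum(len(b) for b in books) / sum(len(c) for c in compressed)
print(f"compression: roughly {ratio:.0f}x smaller, but each book had to be processed")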

Of course there is more to compression than this. Prior to any removal of duplicate patterns data is usually transformed – and it is the method used in this transformation process that differentiates all of the various compression algorithms. I’m not going to delve into the detail in this article, but if you are interested then a great way to get an idea of what’s involved is to read the Wikipedia page on the BZIP2 file compression tool and look at all the processes involved.

Why Compress?

Data compression is essentially a trade-off, where reduced storage footprint is gained at the expense of extra CPU cycles – and, as a consequence, extra time. This additional CPU and time must be spent whenever the compressed data is read, thus increasing the read latency. It also needs to be spent during a write, but – as with dedupe – this can take place during the write process (known as inline) or at some later stage (known as post-process). Inline compression will affect the write latency but will also eliminate the need for a staging area where uncompressed data awaits compression.
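As a rough illustration of that trade-off (the sample data and compression levels below are arbitrary, and zlib merely stands in for whatever algorithm a storage array might use), a few lines of Python show how chasing a better ratio costs CPU time on every write and, to a lesser extent, every read:

import time
import zlib

data = b"order_id=42,status=SHIPPED,region=EMEA;" * 50_000   # repetitive sample data

for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    write_ms = (time.perf_counter() - start) * 1000   # paid inline on every write

    start = time.perf_counter()
    zlib.decompress(packed)
    read_ms = (time.perf_counter() - start) * 1000    # paid on every read

    print(f"level {level}: {len(data) / len(packed):5.1f}x reduction, "
          f"compress {write_ms:.2f} ms, decompress {read_ms:.2f} ms")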

Traditionally, compression has been used for archive data, i.e. data that must be retained but is seldom accessed. This is a good fit for compression, since the additional cost of decompression will rarely be paid. However, the use of compression with primary data is a different story: does it make sense to repeatedly incur time and CPU penalties on data that is frequently read or written? The answer, of course, is that it depends entirely on your business requirements. However, I do strongly believe that – as with dedupe – there should be a choice, rather than an “always on” solution where you cannot say no. One vendor I know makes this rather silly claim: “so important it’s always-on”. What a fine example of a design limitation manifesting itself as a marketing claim.

Where to Compress?

As with all applications, data tends to flow down from users through an application layer, into a database. This database sits on top of a host, which is connected to some sort of persistent storage. There are therefore a number of possible places where data can be compressed:

  • Database-level compression, such as using basic compression in Oracle (part of the core product), or the Advanced Compression option (extra license required).
  • Host-level compression, such as you might find in products like Symantec’s Veritas Storage Foundation software suite.
  • Storage-level compression, where the storage array compresses data either at a global level, or more ideally, at some configurable level (e.g. by the LUN).

Of course, compressed data doesn’t easily compress again, since all of the repetitive patterns will have been removed. In fact, running compression algorithms on compressed data is, at the very least, a waste of time and CPU – while in the worst case it could actually increase the size of the compressed data. This means it doesn’t really make sense to use multiple levels of compression, such as both database-level and storage-level. Choosing the correct level is therefore important. So which is best?
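A trivial sketch makes the point (the sample data is invented; any general-purpose compressor behaves similarly): compress something once and a second pass typically gains nothing, and often adds a few bytes of overhead.

import zlib

original = b"SELECT * FROM orders WHERE status = 'SHIPPED'; " * 10_000

once = zlib.compress(original)
twice = zlib.compress(once)       # compressing the already-compressed output

print(f"original: {len(original):,} bytes")
print(f"compressed once: {len(once):,} bytes")
print(f"compressed twice: {len(twice):,} bytes (no better, usually slightly worse)")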

Benefits and Drawbacks

If you read some of the marketing literature I’ve seen recently you would soon come to the conclusion that compressing your data at the storage level is the only way to go. It certainly has some advantages, such as ease of deployment: just switch it on and sit back, all of your data is now compressed. But there are drawbacks too – and I believe it pays to make an informed decision.

Performance

The most obvious and measurable drawback is the addition of latency to I/O operations. In the case of inline compression this affects both reads and writes, while post-process compression inevitably results in more background I/O operations taking place, increasing wear and potentially impacting other workloads. Don’t take it for granted that this additional latency won’t affect you, especially at peak workload. Everyone in the flash industry knows about a certain flash vendor whose inline dedupe and compression software has to switch into post-process mode under high load, because it simply cannot cope.

Influence

This one is less obvious, but in my opinion far more important. Let’s say you compress your database at the storage-level so that as blocks are written to storage they are compressed, then decompressed again when they are read back out into the buffer cache. That’s great, you’ve saved yourself some storage capacity at the overhead of some latency. But what would have happened if you’d used database-level compression instead?

With database-level compression the data inside the data blocks would be compressed. This means not just the data which resides on storage, but also the data in memory – inside the buffer cache. That means you need less physical memory to hold the same amount of data, because it’s compressed in memory as well as on storage. What will you do with the excess physical memory? You could increase the size of the buffer cache, holding more data in memory and possibly improving performance through a reduction in physical I/O. Or you could run more instances on the same server… in fact, database-level compression is very useful if you want to build a consolidation environment, because it allows a greater density of databases per physical host.
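The arithmetic is simple enough to sketch (the 64GB cache and 2x ratio below are assumptions for illustration, not measurements): the same buffer cache effectively holds twice the data when the blocks themselves are compressed, which is the benefit storage-level compression can never deliver.

# Back-of-envelope only: all figures are hypothetical.
buffer_cache_gb = 64
compression_ratio = 2.0            # assumed ratio for this particular dataset

# Storage-level: blocks are decompressed before they reach the buffer cache.
cached_storage_level_gb = buffer_cache_gb
# Database-level: blocks stay compressed in memory as well as on storage.
cached_db_level_gb = buffer_cache_gb * compression_ratio

print(f"storage-level compression: ~{cached_storage_level_gb:.0f} GB of data in cache")
print(f"database-level compression: ~{cached_db_level_gb:.0f} GB of data in cache")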

There’s more. Full table scans will scan a smaller number of blocks because the data is compressed. Likewise any blocks sent over the network contain compressed data, which might make a difference to standby or Data Guard traffic. When it comes to compression, the higher up in the stack you begin, the more benefits you will see.

Don’t Believe The Hype

The moral of this story is that compression, just like deduplication, is a fantastic option to have available when and if you want to use it. Both of these tools allow you to trade time and CPU resource in favour of a reduced storage footprint. Choices are a good thing.

They are not, however, guaranteed wins – and they should not be sold as such. Take the time to understand the drawbacks before saying yes. If your storage vendor – or your database vendor (“storage savings of up to 204x“!!) – is pushing compression maybe they have a hidden agenda? In fact, they almost definitely will have.

And that will be the subject of the next post…

Storage Myths: Dedupe for Databases


Spot the duplicate duck

Storage for DBAs: Data deduplication – or “dedupe” – is a technology which falls under the umbrella of data reduction, i.e. reducing the amount of capacity required to store data. In very simple terms it involves looking for repeating patterns and replacing them with a marker: as long as the marker requires less space than the pattern it replaces, you have achieved a reduction in capacity. Deduplication can happen anywhere: on storage, in memory, over networks, even in database design – for example, the standard database star or snowflake schema. However, in this article we’re going to stick to talking about dedupe on storage, because this is where I believe there is a myth that needs debunking: databases are not a great use case for dedupe.

Deduplication Basics: Inline or Post-Process

If you are using data deduplication either through a storage platform or via software on the host layer, you have two basic choices: you can deduplicate data at the time it is written (known as inline dedupe) or allow it to arrive and then dedupe it at your leisure in some transparent manner (known as post-process dedupe). Inline dedupe affects the time taken to complete every write, directly affecting I/O performance. The benefit of post-process dedupe therefore appears to be that it does not affect performance – but think again: post-process dedupe first requires data to be written to storage, then read back out into the dedupe algorithm, before being written to storage again in its deduped format – thus magnifying the amount of I/O traffic and indirectly affecting I/O performance. In addition, post-process dedupe requires more available capacity to provide room for staging the inbound data prior to dedupe.
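A quick way to see the hidden cost of post-process dedupe is simply to count the physical I/O generated per logical write under each model, as described above (the numbers are schematic, not a benchmark):

logical_writes = 1_000_000

# Inline: each write is deduped in the data path, so one physical write
# (with added latency on that write).
inline_physical_io = logical_writes

# Post-process: write the staged data, read it back for hashing, then write
# it again in deduped form.
post_process_physical_io = logical_writes * 3

print(f"inline:       {inline_physical_io:,} physical I/Os")
print(f"post-process: {post_process_physical_io:,} physical I/Os, plus staging capacity")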

Deduplication Basics: (Block) Size Matters

In most storage systems dedupe takes place at a defined block size, whereby each block is hashed to produce a unique key before being compared with a master lookup table containing all known hash keys. If the newly-generated key already exists in the lookup table, the block is a duplicate and does not need to be stored again. The block size is therefore pretty important, because the smaller the granularity, the higher the chances of finding a duplicate:

In the picture you can see that the pattern “1234” repeats twice over a total of 16 digits. With an 8-digit block size (the lower line) this repeat is not picked up, since the second half of the 8-digit pattern does not repeat. However, by reducing the block size to 4 digits (the upper line) we can now get a match on our unique key, meaning that the “1234” pattern only needs to be stored once.

This sounds like great news, let’s just choose a really small block size, right? But no, nothing comes without a price – and in this case the price is the size of the hashing lookup table. This table, which contains one key for every unique block, can range in size from a single entry (the “ideal” scenario where all data is duplicated) to one entry for each block (the worst-case scenario where every block is unique). The block size and the maximum size of the hashing table are therefore inversely related: halve the block size and you double the potential number of hash entries.
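Here is a toy version of the picture above in Python (the 16-digit string and tiny block sizes mirror the example; real arrays obviously work on kilobyte-sized blocks): the smaller block size finds the duplicate, but it also generates more hash entries to store and search.

import hashlib

data = b"1234567812349999"   # "1234" appears twice within 16 digits

for block_size in (8, 4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    hash_table = {hashlib.sha256(block).hexdigest() for block in blocks}
    duplicates_removed = len(blocks) - len(hash_table)
    print(f"block size {block_size}: {len(blocks)} blocks, "
          f"{len(hash_table)} hash entries, {duplicates_removed} duplicate(s) found")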

Hash Abuse

Why do we care about having more hash entries? There are a few reasons. First there is the additional storage overhead: if your data is relatively free of duplication (or the block size does not allow duplicates to be detected) then not only will you fail to reclaim any space but you may end up using extra space to store all of the unique keys associated with each block. This is clearly not a great outcome when using a technology designed to reduce the footprint of your data. Secondly, the more hash entries you have, the more entries you need to scan through when comparing freshly-hashed blocks during writes or locating existing blocks during reads. In other words, the more of a performance overhead you will suffer in order to read your data and (in the case of inline dedupe) write it.

If this is sounding familiar to you, it’s because the hash data is effectively a database in which storage metadata is stored and retrieved. Just like any database, the performance will be dictated by the volume of data as well as the compute resource used to manipulate it, which is why many vendors choose to store this metadata in DRAM. Keeping the data in memory brings certain performance benefits, but at the price of volatility: changes in memory will be lost if the power is interrupted, so regular checkpoints to persistent storage are required. Even then, battery backup is often required, because the loss of even one hash key means data corruption. If you are going to replace your data with markers from a lookup table, you absolutely cannot afford to lose that lookup table, or there will be no coming back.

Database Deduplication – Don’t Be Duped

Now that we know what dedupe is all about, let’s attempt to apply it to databases and see what happens. You may be considering the use of dedupe technology with a database system, or you may simply be considering the use of one of a number of recent storage products that have inline dedupe in place as an “always on” option, i.e. you cannot turn it off regardless of whether it helps or hinders. The vendor may make all sorts of claims about the possibilities of dedupe, but how much benefit will you actually see?

Let’s consider the different components of a database environment in the context of duplication:

  • Oracle datafiles contain data blocks which have block headers at the start of the block. These contain numbers which are unique for each datafile, making deduplication impossible at the database block size. In addition, the end of each block contains a tailcheck section which features a number generated using data such as the SCN, so even if the block were divided into two the second half would offer limited opportunity for dedupe while the first half would offer none (see the sketch after this list).
  • Even if you were able to break down Oracle blocks into small enough chunks to make dedupe realistic, any duplication of data is really a massive warning about your database design: normalise your data! Also, consider features like index key compression, which is part of the Enterprise Edition license.
  • Most Oracle installations have multiplexed copies of important files like online redo logs and controlfiles. These files are so important that Oracle synchronously maintains multiple copies in order to protect against data loss. If your storage system is deduplicating these copies, this is a bad thing – particularly if it’s an always-on feature that gives you no option.
  • While unallocated space (e.g. in an ASM diskgroup) might appear to offer the potential for dedupe, this is actually a problem which you should solve using another storage technology: thin provisioning.
  • You may have copies of datafiles residing on the same storage as production, which therefore allow large-scale deduplication to take place; perhaps they are used as backups or test/development environments. However, in the latter case, test/dev environments are a use case for space-efficient snapshots rather than dedupe. And if you are keeping your backups on the same storage system as your production data, well… good luck to you. There is nothing more for you here.
  • Maybe we aren’t talking about production data at all. You have a large storage array which contains multiple copies of your database for use with test/dev environments – and thus large portions of the data are duplicated. Bingo! The perfect use case for storage dedupe, right? Wrong. Database-level problems require database-level solutions, not storage-level workarounds. Get yourself some licenses for Delphix and you won’t look back.
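To illustrate the first point in the list, here is a toy model in Python. The “block” layout below is invented for the sketch (it is not Oracle’s real block format): the payload is identical in every block, but a few unique header and tailcheck bytes are enough to ensure that whole-block hashes never match.

import hashlib
import struct

row_data = b"identical customer rows " * 300        # the same payload in every block

def make_block(block_number: int, scn: int) -> bytes:
    header = struct.pack(">IQ", block_number, scn)   # pretend unique block header
    tail = struct.pack(">I", scn & 0xFFFFFFFF)       # pretend tailcheck value
    return header + row_data + tail

blocks = [make_block(block_number=i, scn=5_000_000 + i) for i in range(100)]
unique_hashes = {hashlib.sha256(block).hexdigest() for block in blocks}

print(f"{len(blocks)} blocks containing identical rows -> {len(unique_hashes)} unique hashes")
# All 100 hashes differ, so block-sized dedupe reclaims nothing.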

To conclude, while dedupe is great in use cases like VDI, it offers very limited benefit in database environments while potentially making performance worse. That in itself is worrying, but what I really see as a problem is the way that certain storage vendors appear to be selling their capacity based on assumed levels of dedupe, i.e. “Sure we are only giving you X terabytes of storage for Y price, but actually you’ll get 10:1 dedupe which means the price is really ten times lower!”

Sizing should be based on facts, not assumptions. Just like in the real world, nothing comes for free in I.T. – and we’ve all learnt that the hard way at some point. Don’t be duped.

Storage Myths: Put Oracle Redo on SSD


Storage for DBAs: “My database is slow”… “Well then why not put your redo logs on SSDs?” Gaaaah. I still hear people having this discussion and it drives me mad. “Nobody got fired for putting Oracle redo on …<flash vendor>”. Yeah right, but does that mean it was worth the investment?

I’m bored of this line of illogical “reasoning”, so here are three reasons why you shouldn’t put your redo logs on SSD.

1. Solve The Right Problem

If a database is slow, find out why. Investigate, troubleshoot, resolve. Don’t throw hardware at it without understanding what the problem is. Redo is written by the Oracle Log Writer process – and the wait event log file parallel write covers the writing of redo records from the log buffer into the online redo log files. If you are seeing high average wait times for log file parallel write (or occasional high wait times in the Wait Event Histogram) maybe it’s time to investigate the speed of redo I/Os. Otherwise … leave it alone, or you are fixing the wrong issue.

Also, let’s not confuse the wait event log file sync with log file parallel write. Log file sync is experienced by foreground processes waiting on the log writer to complete a flush of the log buffer to storage. It’s tempting to assume high log file sync times are therefore a consequence of slow log writes, but as Kevin Closson points out in this must-read article, most log file sync waits are actually processing issues where the log writer is not getting enough CPU time.

2. SSD Write Performance Sucks

Huh? You thought I was pro-SSD right? Ok so I’m being a bit crafty, because the terms SSD and Flash are not really synonymous. SSD stands for Solid State Disk (or Device, depending on who you ask), which generally means a set of flash chips crafted into the shape of a hard disk drive and plugged into an HDD-shaped hole somewhere via the use of a Flash Memory Controller. This interface takes page-based flash memory and makes it look like block-based storage – and each SSD in an array has its own controller.

There is a fundamental difference between an all-flash array and a set of SSDs masquerading as disks: an all-flash array can manage the flash holistically while the SSD-populated array cannot. This matters because flash is awkward to work with – for example, flash pages must be erased before they are written to – a process which is both slow and cumbersome, since other pages are locked (even from reads) during an erase.

The all-flash array is able to avoid the consequences of these restrictions by managing the flash globally, so that erases do not block reads and writes. In contrast, SSDs shoved into a disk array cannot communicate with each other to indicate when they are busy performing this garbage collection process, resulting in unpredictable performance and horrible spikes in latency as I/Os queue up behind the erase process.
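A crude simulation shows why that hurts (every number below is invented purely for illustration): even when only a small fraction of writes queue behind an erase, the tail latency is dominated by those stalls while the average still looks healthy.

import random
import statistics

random.seed(1)
latencies_us = []
for i in range(10_000):
    latency = random.uniform(80, 120)     # a "normal" flash write, ~0.1 ms
    if i % 50 == 0:                       # occasionally stuck behind an erase
        latency += 2_000                  # multi-millisecond stall
    latencies_us.append(latency)

latencies_us.sort()
print(f"average latency: {statistics.mean(latencies_us):.0f} us")
print(f"99th percentile: {latencies_us[int(len(latencies_us) * 0.99)]:.0f} us")
print(f"worst case     : {latencies_us[-1]:.0f} us")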

3. Disk Is Good Enough


You didn’t expect me to say that, did you? Don’t get me wrong, disk is terrible at random I/O. Really, truly awful. But here’s the thing: the Oracle log writer performs large, sequential writes. And disk is ok with sequential I/O, particularly if you are using faster spindles like the 15k RPM drives.

Flushing the log buffer to storage involves writing some multiple of the redo log block size (512 bytes by default, but configurable to 1024 or 4096 bytes from Oracle version 11.2 onwards). If your system is busy enough that you believe you have redo performance issues, it seems likely that those writes will be larger as more redo is created per log flush. The larger the write, the more efficient it will be on disk, as the impact of the initial seek time is averaged out.
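Some back-of-envelope arithmetic illustrates the amortisation (the seek time and transfer rate below are assumed, typical-looking figures for a 15K RPM drive, not measurements): as the write gets larger, the fixed seek cost shrinks as a proportion of the total service time.

seek_ms = 3.5                 # assumed average seek + rotational delay
transfer_mb_per_s = 150.0     # assumed sequential transfer rate

for write_kb in (4, 64, 1024):
    transfer_ms = (write_kb / 1024) / transfer_mb_per_s * 1000
    total_ms = seek_ms + transfer_ms
    print(f"{write_kb:5d} KB write: {total_ms:5.2f} ms total, "
          f"seek is {seek_ms / total_ms:.0%} of the service time")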

But hey, don’t take my word for it. Trust the evidence – and it turns out there is a wealth of data out there for anyone to analyse… right here: http://www.tpc.org/

The thing about TPC-C benchmarks is that they generate redo logs like you wouldn’t believe. So if anyone needs the ultimate redo performance it’s a system like this one, which set a world record back in September 2012 (which Oracle crowed about in its usual classy way by using it to bash IBM). The great thing about TPC results is that they come with a complete full disclosure report so you can see just how the vendors did it. And in the full disclosure report for this submission, where was the redo located? On a RAID set consisting of 600GB 15K RPM disk drives (see page 21). If disk is fast enough for a world record, it’s fast enough for you.

Incidentally, the datafiles in that benchmark were located on 2x Violin Memory 6616 arrays – which also tells you something important: if you are migrating from disk to flash, the first thing you need to move is the primary data, not the redo.

The Counter-Argument: Flash is Not SSD

Now I don’t want to wrap this article up giving you the impression that you shouldn’t move your redo logs to flash memory, so I’ll leave you with some counter arguments to the above. When I build a database, I always put the redo logs on flash (not on SSD mind, but on a flash memory array). Here’s why:

1. Violin Isn’t Limited By Writes

I know, I know … that sounds like a sales pitch. I usually try to talk about flash in general, which is why I originally wrote “All-flash arrays aren’t limited by writes”, but the truth is I don’t know other all-flash arrays to the extent that I know Violin… so forgive me for sticking with what I know.

I’ll explain Violin’s methods for guaranteeing sustained ultra-low write latency some other day. For now, let’s just see the evidence:

Load Profile              Per Second    Per Transaction
~~~~~~~~~~~~         ---------------    ---------------
      DB Time(s):              197.6                2.8
       DB CPU(s):               18.8                0.3
       Redo size:    1,477,126,876.3       20,568,059.6
   Logical reads:          896,951.0           12,489.5
   Block changes:          672,039.3            9,357.7
  Physical reads:           15,529.0              216.2
 Physical writes:          166,099.8            2,312.8

That’s over 1.4GB/sec of sustained redo generation from a 5 minute snapshot (see this post for details) using just a single Violin Memory 6616 array connected over 8Gb fibre channel. The AWR snapshot was 5 minutes long but the workload had been running for an extended period prior to the capture. Don’t leave here with the illusion that redo on flash memory isn’t blindingly fast.

2. Your First Design Goal Should Be Simplicity

There is a quote often attributed to Albert Einstein which says, “Everything should be made as simple as possible, but no simpler”. This applies perfectly to system design – and is one reason why I always recommend an all-flash database design over a flash and disk hybrid. Yes it’s possible to put some datafiles here and others over there, redo logs on disk and primary data on flash, etc. But the simplest design is to put everything on high performance, low latency flash. Is it the cheapest solution? Maybe not always on list price, but it probably will be based on TCO.

Conclusion

Look, if you want to put your redo logs on flash, I’m not going to argue. I’m not saying that it’s a bad thing.

What is a bad thing, though, is the practice of taking a disk-based database and sticking some SSDs in to house the redo logs. That’s just silly. The first part of the database you should move to flash is the primary data. If it makes sense to relocate the whole database (which it almost always does, because that disk array doesn’t belong in your data centre anymore – it belongs in a museum) then go for it. Just don’t compromise on having only the redo logs on flash or SSD, because then you have essentially built yourself an anti-TPC-C benchmarking system! And what’s the opposite of a system that goes really fast…?

Storage Myths: IOPS Matter


Storage for DBAs: Having now spent over a year in the storage industry, I’ve decided it’s time to call out an industry-wide obsession that I previously wasn’t aware of: everyone in storage is obsessed with IOPS (the performance metric I/O Operations Per Second). Take a minute to perform a web search for “flash iops” and you’ll see countless headlines from vendors that have broken new IOPS records – and yes, these days my own employer is often one of them. You’d be forgiven for thinking that, in storage, IOPS was the most important thing ever.

I’m here to tell you that it isn’t. At least, not if databases are your game.

Fundamental Characteristics of Storage

In a previous article I described the three fundamental characteristics of storage: latency, IOPS and bandwidth (or throughput). I even drew a simple, boxy diagram which, despite being one of the least-inspiring pieces of artwork ever created, serves me well enough to warrant its inclusion again here. These three properties are related – when one changes, the others change. With that in mind, here’s lesson #1:

High numbers of IOPS are useless unless they are delivered at low latency.

It’s all very well saying you can supply 1 million, 2 million, 4 million IOPS but if the latency sucks it’s not going to be of much value in the real world. Flash is great for delivering higher numbers of IOPS than disk, particularly for random I/Os (as I’ve written about previously), but ultimately the delay introduced by high latency is going to make real-world workloads unusable.

And there’s another, oft-hidden, problem that many flash vendors face: unpredictable latency. This is particularly the case during write-heavy workloads where garbage collection cannot always keep up with load, resulting in the infamous “write cliff” (more technically described as bandwidth degradation – see figure 7 of this paper). Maybe we should revise that previous line to be lesson #1.5:

High numbers of IOPS are useless unless they are delivered at predictable low latency.

But what about when we deal with volumes of data? If your requirement is to process vast amounts of information, do IOPS then become more important? Not really, because this is a bandwidth challenge – you need to design and build a system to suit your bandwidth requirements. How many GB/sec do you need? What can the storage subsystem deliver and how fast can you process it? Unlike bandwidth, an IOPS measurement does not contain the critical component of block size, so information is missing. And if you have the bandwidth figures, there is little additional value in knowing the IOPS, is there? Cue lesson #2:

Bandwidth figures are more useful for describing data volumes than IOPS
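The relationship is simple enough to show in a few lines (the 4 GB/s figure is arbitrary, not taken from any datasheet): the same bandwidth translates into wildly different IOPS numbers depending on the block size, which is exactly the information a headline IOPS figure leaves out.

bandwidth_gb_per_s = 4.0    # hypothetical array throughput

for block_kb in (4, 8, 64, 1024):
    iops = bandwidth_gb_per_s * 1024 * 1024 / block_kb
    print(f"{bandwidth_gb_per_s:.0f} GB/s at {block_kb:4d} KB blocks = {iops:,.0f} IOPS")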

So what good are IOPS figures? And why does the storage industry talk about them all the time? Personally I think it’s a hang-up from the days of disk, when IOPS were such a limiting factor… and partly a marketing thing, because multi-million results sound impressive. Who knows? I’m more interested in what we should be asking about than what we shouldn’t.

So what does matter?

Latency Is King

Forget everything else. Latency is the critical factor because this is what injects delay into your system. Latency means lost time; time that could have been spent busily producing results, but is instead spent waiting for I/O resources.

Forget IOPS. The whole point of a flash array is that IOPS effectively become an unlimited resource. Sure, there is always a real limit – but it’s so high that it’s no longer necessary to worry about it.

Bandwidth still matters, particularly when you are doing something which requires volume, such as analytics or data warehousing. But bandwidth is a question of plumbing – designing a solution with enough capability to deliver throughout the stack: storage, fabric, HBAs, network, processor… build it and the data will come.

Latency is the “application stealth tax”, extending the elapsed times of individual tasks and processes until everything takes slightly longer. Add up all those delays and you have a significant problem. This is why, when you consider buying flash, you need to test the latency – and not just at the storage level, but end-to-end via the application (I’ll talk about this more in a following post).
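To put the stealth tax in numbers (all figures are hypothetical, and the sketch assumes the I/O sits on the critical path rather than being overlapped): shave fractions of a millisecond off each operation and the saving multiplies across millions of I/Os.

physical_ios = 5_000_000       # I/Os issued by a hypothetical batch job

for latency_ms in (5.0, 1.0, 0.25):
    waiting_minutes = physical_ios * latency_ms / 1000 / 60
    print(f"{latency_ms:4.2f} ms per I/O -> {waiting_minutes:6.1f} minutes spent waiting")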

“But I Don’t Need That Many IOPS…”

This is a classic misunderstanding, often the result of confusion brought about by FUD from storage vendors who cannot deliver at the higher end of the market. To repeat my previous statement, with a good flash system IOPS will effectively become an unlimited resource. This does not mean that it’s overkill for your needs. There is no point in spending more money on a solution than is necessary, but IOPS is not the indicator you should use to determine this – decisions like that should rely entirely on business requirements. I have yet to ever see a business requirement that related to IOPS (emphasis on business rather than technical).

Business requirements tend to be along the lines of needing to supply trading reports faster, or reduce the time spent by call centre operatives waiting for their CRM screens to refresh. These almost always translate back into latency requirements. After all, the key to solving any performance issue is always to follow the time and find out where it is being spent. Have you noticed that latency is the only one of our three fundamental characteristics which is expressed solely in units of time?

Don’t get distracted by IOPS… it’s all about latency.