Storage for DBAs: My last post in this blog series was aimed at dispelling the myth that dedupe is a suitable storage technology for databases. To my surprise it became the most popular article I’ve ever published (based on reads per day). Less surprisingly though, it led to quite a backlash from some of the other flash storage vendors, who responded with comments along the lines of “well we don’t need dedupe because we also have compression”. Fair enough. So today let’s take a look at the benefits and drawbacks of storage-level compression as part of an overall data reduction strategy. And by the way, I’m not against either dedupe or storage-level compression. I just think they have drawbacks as well as benefits – something that isn’t always made clear in the marketing literature. And being in the storage industry, I know why that is…
What Is Compression?
In storageland we tend to talk about the data reduction suite of tools, which comprises deduplication, compression and thin provisioning. The latter is a way of freeing up capacity which is allocated but not used… but that’s a topic for another day.
Dedupe and compression have a lot in common: they both fundamentally involve the identification and removal of patterns, which are then replaced with keys. The simplest way to explain the difference would be to consider a bookshelf filled with books. If you were going to dedupe the bookshelf you would search through all of the books, removing any duplicate titles and making a note of how many duplicates there were. Easy. Now you want to compress the books, so you need to read each book and look for duplicate patterns of words. If you find the same sentence repeated numerous times, you can replace it with a pointer to a notebook where you can jot down the original. Hmmm…. less easy. You can see that dedupe is much more of a quick win.
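To make the bookshelf analogy a little more concrete, here is a toy sketch in Python. It is not how any particular array implements data reduction: the `dedupe` and `rehydrate` helpers, the whole-book "blocks" and the sample text are all made up for illustration, and `zlib` simply stands in for whatever compression algorithm a real product might use.

```python
import hashlib
import zlib


def dedupe(blocks):
    """Store each distinct block once; return (store, refs)."""
    store = {}  # fingerprint -> block contents (the table of "keys")
    refs = []   # one fingerprint per logical block, in original order
    for block in blocks:
        key = hashlib.sha256(block).hexdigest()
        store.setdefault(key, block)  # keep only the first copy
        refs.append(key)
    return store, refs


def rehydrate(store, refs):
    """Rebuild the original sequence of blocks from the references."""
    return [store[key] for key in refs]


# Three "books" on the shelf, two of which are identical copies.
book_a = b"It was the best of times. " * 40
book_b = b"Call me Ishmael. " * 60
shelf = [book_a, book_b, book_a]

# Dedupe compares whole books and keeps one copy of each.
store, refs = dedupe(shelf)
deduped_bytes = sum(len(b) for b in store.values())

# Compression reads inside each book and replaces repeated patterns.
compressed = [zlib.compress(b) for b in shelf]
compressed_bytes = sum(len(c) for c in compressed)

print("raw bytes:        ", sum(len(b) for b in shelf))
print("after dedupe:     ", deduped_bytes, "in", len(store), "unique blocks")
print("after compression:", compressed_bytes)
```

Note that the dedupe pass only has to fingerprint each book once, while the compression pass has to read through every byte of every book looking for patterns, which is the "less easy" part.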
Of course there is more to compression than this. Prior to any removal of duplicate patterns data is usually transformed – and it is the method used in this transformation process that differentiates all of the various compression algorithms. I’m not going to delve into the detail in this article, but if you are interested then a great way to get an idea of what’s involved is to read the Wikipedia page on the BZIP2 file compression tool and look at all the processes involved.
Data compression is essentially a trade-off, where reduced storage footprint is gained at the expense of extra CPU cycles – and therefore as a consequence, extra time. This additional CPU and time must be spent whenever the compressed data is read, thus increasing the read latency. It also needs to be spent during a write, but – as with dedupe – this can take place during the write process (known as inline) or at some later stage (known as post-process). Inline compression will affect the write latency but will also eliminate the need for a staging area where uncompressed data awaits compression.
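If you want to get a feel for this trade-off yourself, a rough sketch like the one below will do. It uses Python's standard `zlib` module at three compression levels; the `measure` helper and the sample rows are hypothetical stand-ins for real table data, and the timings you see will depend entirely on your own hardware.

```python
import time
import zlib


def measure(data, level):
    """Return (compressed_size, compress_seconds, decompress_seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    decompress_s = time.perf_counter() - start

    assert unpacked == data  # compression must be lossless
    return len(packed), compress_s, decompress_s


# Hypothetical sample: repetitive, row-like text standing in for table data.
sample = b"".join(b"%08d,ORDER,SHIPPED,GB\n" % i for i in range(100_000))

for level in (1, 6, 9):
    size, c_s, d_s = measure(sample, level)
    print(f"level {level}: {len(sample) / size:5.1f}x smaller, "
          f"compress {c_s * 1000:7.1f} ms, decompress {d_s * 1000:7.1f} ms")
```

The compress time is the cost paid on every write (inline) or later in the background (post-process), while the decompress time is paid on every read of that data.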
Traditionally, compression has been used for archive data, i.e. data that must be retained but is seldom accessed. This is a good fit for compression, since the additional cost of decompression will rarely be paid. However, the use of compression with primary data is a different story: does it make sense to repeatedly incur time and CPU penalties on data that is frequently read or written? The answer, of course, is that it’s entirely down to your business requirements. However, I do strongly believe that – as with dedupe – there should be a choice, rather than an “always on” solution where you cannot say no. One vendor I know makes this rather silly claim: “so important it’s always-on”. What a fine example of a design limitation manifesting itself as a marketing claim.
Where to Compress?
In most applications, data tends to flow down from users through an application layer and into a database. This database sits on top of a host, which is connected to some sort of persistent storage. There are therefore a number of possible places where data can be compressed:
- Database-level compression, such as using basic compression in Oracle (part of the core product), or the Advanced Compression option (extra license required).
- Host-level compression, such as you might find in products like Symantec’s Veritas Storage Foundation software suite.
- Storage-level compression, where the storage array compresses data either at a global level, or more ideally, at some configurable level (e.g. by the LUN).
Of course, compressed data doesn’t easily compress again, since all of the repetitive patterns will have been removed. In fact, running compression algorithms on compressed data is, at the very least, a waste of time and CPU – while in the worst case it could actually increase the size of the compressed data. This means it doesn’t really make sense to use multiple levels of compression, such as both database-level and storage-level. Choosing the correct level is therefore important. So which is best?
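This is easy enough to check for yourself. The sketch below is only an illustration: it uses Python's `zlib` twice over some made-up, row-like data, with the first pass standing in for database-level compression and the second for storage-level compression on top. The column values and the fixed random seed are assumptions chosen purely to make the run repeatable.

```python
import random
import zlib

rng = random.Random(42)  # fixed seed so runs are repeatable
names = ["alice", "bob", "carol", "dave", "erin", "frank"]
statuses = ["NEW", "PAID", "SHIPPED", "RETURNED"]

# Hypothetical table data: an id, a name, a status and an amount per row.
raw = "".join(
    f"{i},{rng.choice(names)},{rng.choice(statuses)},{rng.randint(0, 9999)}\n"
    for i in range(20_000)
).encode("ascii")

once = zlib.compress(raw)    # first level, e.g. in the database
twice = zlib.compress(once)  # second level, e.g. in the storage array

print("raw:  ", len(raw))
print("once: ", len(once))
print("twice:", len(twice))
```

The first pass should shrink the data considerably, while the second pass has almost nothing left to work with and typically saves little or nothing, despite costing another round of CPU time.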
Benefits and Drawbacks
If you read some of the marketing literature I’ve seen recently you would soon come to the conclusion that compressing your data at the storage level is the only way to go. It certainly has some advantages, such as ease of deployment: just switch it on and sit back, all of your data is now compressed. But there are drawbacks too – and I believe it pays to make an informed decision.
The most obvious and measurable drawback is the addition of latency to I/O operations. In the case of inline compression this affects both reads and writes, while post-process compression inevitably results in more background I/O operations taking place, increasing wear and potentially impacting other workloads. Don’t take it for granted that this additional latency won’t affect you, especially at peak workload. Everyone in the flash industry knows about a certain flash vendor whose inline dedupe and compression software has to switch into post-process mode under high load, because it simply cannot cope.
The second drawback is less obvious, but in my opinion far more important. Let’s say you compress your database at the storage level so that as blocks are written to storage they are compressed, then decompressed again when they are read back out into the buffer cache. That’s great, you’ve saved yourself some storage capacity at the cost of some latency. But what would have happened if you’d used database-level compression instead?
With database-level compression the data inside the data blocks would be compressed. This means not just that data which resides on storage, but also the data in memory – inside the buffer cache. That means you need less physical memory to hold the same amount of data, because it’s compressed in memory as well as on storage. What will you do with the excess physical memory? You could increase the size of the buffer cache, holding more data in memory and possibly improving performance through a reduction in physical I/O. Or you could run more instances on the same server… in fact, database-level compression is very useful if you want to build a consolidation environment, because it allows a greater density of databases per physical host.
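Some back-of-an-envelope arithmetic shows why this matters. Every number in the sketch below is an assumption picked for illustration (a 16 GiB buffer cache, 8 KiB blocks, 100 rows per uncompressed block and a 2.5x database-level compression ratio), and the `rows_cached` helper is hypothetical rather than anything a real database exposes.

```python
def rows_cached(cache_bytes, block_bytes, rows_per_block, compression_ratio=1.0):
    """Rows that fit in a cache made up of fixed-size blocks.

    compression_ratio is the assumed factor by which database-level
    compression increases the number of rows stored in each block.
    """
    blocks = cache_bytes // block_bytes
    return int(blocks * rows_per_block * compression_ratio)


GIB = 1024 ** 3
cache = 16 * GIB  # assumed buffer cache size
block = 8 * 1024  # assumed 8 KiB database block
rows = 100        # assumed rows per uncompressed block
ratio = 2.5       # assumed database-level compression ratio

# Storage-level compression: blocks are decompressed before they are cached.
storage_level = rows_cached(cache, block, rows)

# Database-level compression: blocks stay compressed in the buffer cache.
database_level = rows_cached(cache, block, rows, ratio)

print(f"storage-level:  {storage_level:,} rows in cache")
print(f"database-level: {database_level:,} rows in cache")
```

Under those assumptions the same physical memory holds two and a half times as many rows, which is capacity you can spend on a bigger working set in cache or on more instances per host.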
There’s more. Full table scans will scan a smaller number of blocks because the data is compressed. Likewise any blocks sent over the network contain compressed data, which might make a difference to standby or Data Guard traffic. When it comes to compression, the higher up in the stack you begin, the more benefits you will see.
Don’t Believe The Hype
The moral of this story is that compression, just like deduplication, is a fantastic option to have available when and if you want to use it. Both of these tools allow you to trade time and CPU resource in favour of a reduced storage footprint. Choices are a good thing.
They are not, however, guaranteed wins – and they should not be sold as such. Take the time to understand the drawbacks before saying yes. If your storage vendor – or your database vendor (“storage savings of up to 204x”!!) – is pushing compression, maybe they have a hidden agenda? In fact, they almost definitely will have.
And that will be the subject of the next post…