Storage Myths: Storage Compression Has No Downside

Image courtesy of marcovdz

Storage for DBAs: My last post in this blog series was aimed at dispelling the myth that dedupe is a suitable storage technology for databases. To my surprise it became the most popular article I’ve ever published (based on reads per day). Less surprisingly though, it led to quite a backlash from some of the other flash storage vendors, who responded with comments along the lines of “well we don’t need dedupe because we also have compression”. Fair enough. So today let’s take a look at the benefits and drawbacks of storage-level compression as part of an overall data reduction strategy. And by the way, I’m not against either dedupe or storage-level compression. I just think they have drawbacks as well as benefits – something that isn’t always made clear in the marketing literature. And being in the storage industry, I know why that is…

What Is Compression?

In storageland we tend to talk about the data reduction suite of tools, which comprises deduplication, compression and thin provisioning. The latter is a way of freeing up capacity which is allocated but not used… but that’s a topic for another day.

Dedupe and compression have a lot in common: they both fundamentally involve the identification and removal of patterns, which are then replaced with keys. The simplest way to explain the difference would be to consider a bookshelf filled with books. If you were going to dedupe the bookshelf you would search through all of the books, removing any duplicate titles and making a note of how many duplicates there were. Easy. Now you want to compress the books, so you need to read each book and look for duplicate patterns of words. If you find the same sentence repeated numerous times, you can replace it with a pointer to a notebook where you can jot down the original. Hmmm… less easy. You can see that dedupe is much more of a quick win.
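
To make the bookshelf analogy a little more concrete, here is a minimal Python sketch of the dedupe half: each “book” is fingerprinted with a hash, duplicates are replaced with a reference to the single stored copy, and a count of how many times each one appeared is kept. It’s purely an illustration – a real array dedupes fixed-size blocks rather than whole books – but the principle is the same.

    import hashlib

    # A toy "bookshelf" where some titles appear more than once
    bookshelf = ["Moby Dick", "Dune", "Moby Dick", "Dune", "Moby Dick"]

    store = {}       # fingerprint -> the single stored copy
    references = {}  # fingerprint -> how many times it was seen

    for book in bookshelf:
        key = hashlib.sha256(book.encode()).hexdigest()
        store.setdefault(key, book)                   # keep one copy only
        references[key] = references.get(key, 0) + 1  # note the duplicates

    print(f"{len(bookshelf)} books deduped down to {len(store)} unique copies")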

Of course there is more to compression than this. Prior to any removal of duplicate patterns, data is usually transformed – and it is the method used in this transformation process that differentiates all of the various compression algorithms. I’m not going to delve into the detail in this article, but if you are interested then a great way to get an idea of what’s involved is to read the Wikipedia page on the BZIP2 file compression tool and look at all the processes involved.
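
If you would rather see a transform-based compressor in action than read about it, Python’s built-in bz2 module uses that same BZIP2 algorithm. The sketch below simply compares how it fares on highly repetitive data versus random data – the exact byte counts will vary, but the difference between the two is the point:

    import bz2
    import os

    repetitive = b"the same sentence repeated numerous times. " * 1000
    random_ish = os.urandom(len(repetitive))   # incompressible by design

    for label, data in [("repetitive", repetitive), ("random", random_ish)]:
        compressed = bz2.compress(data)
        print(f"{label:10s}: {len(data)} bytes -> {len(compressed)} bytes")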

Why Compress?

Data compression is essentially a trade-off, where reduced storage footprint is gained at the expense of extra CPU cycles – and therefore, as a consequence, extra time. This additional CPU and time must be spent whenever the compressed data is read, thus increasing the read latency. It also needs to be spent during a write, but – as with dedupe – this can take place during the write process (known as inline) or at some later stage (known as post-process). Inline compression will affect the write latency but will also eliminate the need for a staging area where uncompressed data awaits compression.
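
The trade-off is easy to demonstrate for yourself. The Python sketch below takes an assumed 8KB “block” of repetitive row data and uses the zlib library to measure the time spent compressing and decompressing it at different compression levels. The absolute numbers are meaningless outside of my laptop; the point is that every read of compressed data pays a decompression cost, and every inline write pays a compression cost:

    import time
    import zlib

    # An assumed 8KB "database block" filled with repetitive row data
    block = (b"customer_name,order_id,status,PENDING;" * 256)[:8192]

    for level in (1, 6, 9):   # fastest, default, best compression
        start = time.perf_counter()
        compressed = zlib.compress(block, level)
        compress_time = time.perf_counter() - start

        start = time.perf_counter()
        zlib.decompress(compressed)
        decompress_time = time.perf_counter() - start

        print(f"level {level}: {len(block)} -> {len(compressed)} bytes, "
              f"compress {compress_time * 1e6:.0f}us, "
              f"decompress {decompress_time * 1e6:.0f}us")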

Traditionally, compression has been used for archive data, i.e. data that must be retained but is seldom accessed. This is a good fit for compression, since the additional cost of decompression will rarely be paid. However, the use of compression with primary data is a different story: does it make sense to repeatedly incur time and CPU penalties on data that is frequently read or written? The answer, of course, is that it’s entirely down to your business requirements. However, I do strongly believe that – as with dedupe – there should be a choice, rather than an “always on” solution where you cannot say no. One vendor I know makes this rather silly claim: “so important it’s always-on”. What a fine example of a design limitation manifesting itself as a marketing claim.

Where to Compress?

As with all applications, data tends to flow down from users through an application layer, into a database. This database sits on top of a host, which is connected to some sort of persistent storage. There are therefore a number of possible places where data can be compressed:

  • Database-level compression, such as using basic compression in Oracle (part of the core product), or the Advanced Compression option (extra license required).
  • Host-level compression, such as you might find in products like Symantec’s Veritas Storage Foundation software suite.
  • Storage-level compression, where the storage array compresses data either at a global level, or more ideally, at some configurable level (e.g. by the LUN).

Of course, compressed data doesn’t easily compress again, since all of the repetitive patterns will have been removed. In fact, running compression algorithms on compressed data is, at the very least, a waste of time and CPU – while in the worst case it could actually increase the size of the compressed data. This means it doesn’t really make sense to use multiple levels of compression, such as both database-level and storage-level. Choosing the correct level is therefore important. So which is best?
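
(Before we get to that question, the double-compression point is easy to prove to yourself. In the Python sketch below, compressing some repetitive data once shrinks it dramatically; compressing the output a second time achieves nothing and actually makes it fractionally bigger, because the first pass already removed the repetition and the second pass adds its own overhead.)

    import zlib

    data = b"select * from orders where status = 'PENDING';" * 500

    once = zlib.compress(data)
    twice = zlib.compress(once)   # compressing already-compressed data

    print(f"original:          {len(data)} bytes")
    print(f"compressed once:   {len(once)} bytes")
    print(f"compressed twice:  {len(twice)} bytes")   # no smaller - usually slightly larger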

Benefits and Drawbacks

If you read some of the marketing literature I’ve seen recently you would soon come to the conclusion that compressing your data at the storage level is the only way to go. It certainly has some advantages, such as ease of deployment: just switch it on and sit back, all of your data is now compressed. But there are drawbacks too – and I believe it pays to make an informed decision.

Performance

The most obvious and measurable drawback is the addition of latency to I/O operations. In the case of inline compression this affects both reads and writes, while post-process compression inevitably results in more background I/O operations taking place, increasing wear and potentially impacting other workloads. Don’t take it for granted that this additional latency won’t affect you, especially at peak workload. Everyone in the flash industry knows about a certain flash vendor whose inline dedupe and compression software has to switch into post-process mode under high load, because it simply cannot cope.

Influence

This one is less obvious, but in my opinion far more important. Let’s say you compress your database at the storage level, so that blocks are compressed as they are written to storage and then decompressed again when they are read back out into the buffer cache. That’s great: you’ve saved yourself some storage capacity at the cost of some added latency. But what would have happened if you’d used database-level compression instead?

With database-level compression the data inside the data blocks is compressed. That applies not just to the data residing on storage, but also to the data in memory – inside the buffer cache. That means you need less physical memory to hold the same amount of data, because it’s compressed in memory as well as on storage. What will you do with the excess physical memory? You could increase the size of the buffer cache, holding more data in memory and possibly improving performance through a reduction in physical I/O. Or you could run more instances on the same server… in fact, database-level compression is very useful if you want to build a consolidation environment, because it allows a greater density of databases per physical host.
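
As a back-of-envelope illustration – and I should stress that the figures here are entirely made up for the example – a 64GB buffer cache holding blocks compressed 2:1 at the database level effectively caches twice as much logical data:

    # All figures below are assumptions chosen purely for illustration
    buffer_cache_gb   = 64     # physical memory given to the buffer cache
    compression_ratio = 2.0    # assumed 2:1 database-level compression

    logical_data_uncompressed = buffer_cache_gb                       # blocks cached as-is
    logical_data_compressed   = buffer_cache_gb * compression_ratio   # blocks cached compressed

    print(f"Logical data cached without compression: {logical_data_uncompressed:.0f} GB")
    print(f"Logical data cached with 2:1 compression: {logical_data_compressed:.0f} GB")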

There’s more. Full table scans will scan a smaller number of blocks because the data is compressed. Likewise any blocks sent over the network contain compressed data, which might make a difference to standby or Data Guard traffic. When it comes to compression, the higher up in the stack you begin, the more benefits you will see.

Don’t Believe The Hype

The moral of this story is that compression, just like deduplication, is a fantastic option to have available when and if you want to use it. Both of these tools allow you to trade time and CPU resource in favour of a reduced storage footprint. Choices are a good thing.

They are not, however, guaranteed wins – and they should not be sold as such. Take the time to understand the drawbacks before saying yes. If your storage vendor – or your database vendor (“storage savings of up to 204x”!!) – is pushing compression, maybe they have a hidden agenda? In fact, they almost definitely will have.

And that will be the subject of the next post…

8 Responses to Storage Myths: Storage Compression Has No Downside

  1. sshdba says:

    Your articles are fascinating as always. Being a DBA, it gives me more insight into storage technology than I would ever get from reading manuals or listening to the crisp suits who peddle storage with fancy terms.

  2. dkrenik says:

    Good stuff – thanks.

    There appears to be a real-world example of using storage compression that reduces latency and capacity requirements:
    https://blogs.oracle.com/si/entry/zfssa_smashes_ibm_xiv_while

    Would appreciate your perspective.

    • flashdba says:

      Let’s be clear here, the IBM XIV array is a disk-based workhorse. People have relied on them for years, but they are a previous generation of storage and have the performance characteristics to match. On the XIV the average wait for db file sequential read (i.e. random read I/Os) is 14ms. Terrible!

      So it’s hardly surprising that a ZFSSA stuffed with SSDs and, in the words of the author, “lots” of cache takes it to the cleaners. In the land of the blind, the one-eyed man is king.

      The thing is, the author describes the performance gleaned from the ZFSSA as follows: “out of about 11000 IOPS over 10000 of them are less then 1.02ms”. This is deeply unimpressive. 11k IOPS is a tiny workload, and yet around 9% of those I/Os are in excess of 1.02ms. I regularly compete in POCs using Violin storage, where my competition is the likes of IBM’s TMS FlashSystem, EMC’s XtremIO, Oracle Exadata etc. That level of performance would be laughed out of the POC – it wouldn’t even make the initial testing. I suspect anyone reading this who works for a flash vendor would say the same.

      Now I’m not saying there’s no place for this type of storage. Sometimes “good enough” will do, as long as the financials are right. But I strongly suspect, looking at that article, that most of the “Percent Improvement” seen in the “Big Ugly SQL” was in fact brought about by improved CPU performance from the T4.

      One other thing. It’s not clear in the article, but it looks very much as if the performance statistics shown were collected prior to enabling any compression. We cannot therefore make any judgements about the effect of storage compression on performance. But then you probably know a lot more about the background of this article than me, given your role…

      • dkrenik says:

        Yes, I should have been up front about my role. Full disclosure: I work for Oracle in Storage.

        I mentioned the blog from one of Oracle’s Storage SCs because I wanted to understand your perspective – no shenanigans intended. Your input re: latencies is much appreciated. To be fair, Exadata, TMS, et al are positioned for different markets than Oracle’s ZFS Storage Appliance.

        WRT lzjb compression, Oracle IT uses it extensively for its effect on boosting write performance.

        Keep up the good work. I enjoy your posts.

        • flashdba says:

          No worries. I’ll be honest with you too, I’m not a fan of the ZFSSA. I see it as pretty ancient technology and in my view Oracle has stooped rather low to try and make it more attractive. I’m talking here about the way that the hybrid columnar compression feature has been tied to Oracle storage despite being generic to the database product. It’s sort of like tying a big, juicy pork chop around the ZFSSA to make it look more tasty.

          But I genuinely mean no disrespect to you in saying this, it’s just my opinion after all. At the end of the day all that matters is the results you get, so if Oracle IT is happy then so be it. Thanks for visiting, I appreciate your comments.

  3. I love reading your and @KevinClosson’s articles on storage, IO stats and #Exadata. I have just finished my Exadata training and will be going for the certification as well. All the blogs that you guys write highlighting the traps and catches in the Exadata documentation are very helpful, as it is important to know the truth amid all the marketing jargon that gets thrown around.

    As for Oracle, I don’t know why their balance sheet has become more important than their data sheet :(.

    Thank you and wish you a happy new year.
