February 20, 2017
I’m excited to announce another guest blog written by my good friend and funny-talking American cousin Nathan Fuzi. Like me, Nate comes from a database background but joined the all-flash storage revolution back in its infancy. Which means, like me, Nate now has a little tombstone on his résumé marked Violin Memory. But even though he has since moved up to working in THE CLOUD, Nate’s experience working for an AFA vendor is invaluable. Over six years, he worked with hundreds of database customers who were deciding whether to purchase all-flash storage and – more importantly – wondering how to test their databases on those storage platforms. Now, for your benefit, he writes about one of the most crucial stages of the process: the proof of concept (POC).
Indulge me, if you will: take yourself back to a time long, long ago–perhaps nearly forgotten. Waaaay back when storage arrays were built of spinning hard drives front-ended with DRAM for caching purposes, and conventional wisdom had not yet agreed whether flash memory could serve as persistent storage media. I know: it seems like forever ago. Even the ghost of Christmas Past is like, Really? But I assure you that time happened. I lived through it, and so did my buddy flashdba and a number of others. Those were heady days, full of wonder and spectacle and … many, many proofs of concept.
And who could blame folks back then for wanting to see more than that these mysterious and spectacular “all flash” storage arrays could ingest synthetic data and spit it back at previously unseen IOPS rates, incredibly low latency numbers, and firehose-like bandwidth volumes? Because let’s face it: marketing numbers and theoretical performance are just that. Theoretical. You know, as in “your mileage may vary”. What makes a difference to people is what kind of performance the product delivers to their specific application. Folks like flashdba and myself got pretty good at guessing the latency numbers our products would deliver at the IOPS rates we observed in applications. We could then do some simple math to substitute our anticipated latency for the current value and accurately predict the improvement in execution time for a given SQL statement. But in the early days, proving our claims to a skeptical customer often meant asking them to deploy their application on our array, as the IO profile was complex and varied.
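That “simple math” can be sketched in a few lines of Python. The function and numbers below are purely illustrative assumptions (not from any real AWR report): hold the non-IO portion of a query’s elapsed time constant and re-price its IO waits at the anticipated latency.

```python
# Sketch: predict a SQL statement's elapsed time on faster storage by
# substituting the anticipated IO latency for the observed one.
# All figures here are made up for illustration.

def predicted_elapsed(elapsed_s, io_wait_s, io_requests, new_latency_s):
    """Non-IO time is assumed constant; IO wait is re-priced at the new latency."""
    non_io = elapsed_s - io_wait_s
    return non_io + io_requests * new_latency_s

# e.g. a query spending 80 of 100 seconds waiting on 100,000 single-block
# reads (~0.8 ms each); at 0.2 ms those same reads cost only ~20 s:
print(predicted_elapsed(100.0, 80.0, 100_000, 0.0002))  # → 40.0
```

It is a crude model (it ignores concurrency and caching effects), but it matches the kind of back-of-the-envelope prediction described above.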
Oh… the Pain
The PoC is still quite common and often necessary–and not just for storage products, although especially for storage products, with their increasingly wild performance claims. But it’s painful. You have to have an entire non-production setup in place or build one just for the PoC, and then you have to have enough additional ports on your Ethernet or FC switches (or whatever new-fangled connectivity the latest flashy product is sporting) that you can leave everything intact and hook up the new array, expose storage to the host, perform some tests, and then ideally migrate the application data over to run some “real world” tests.
But what could we achieve without doing a full-blown PoC? There are lots of synthetic load generation utilities out there these days, some easier to use than others, and some more flexible and fuller-featured. A short list of the popular tools: fio, VDBench, Iometer and SLOB.
What are you really testing for?
One common aspect I have seen of what are, frankly, flawed testing paradigms is that admins often attempt to spin up to the maximum IOPS the host/array combination can drive for a particular workload setting and then hold that rate for some period. This methodology demonstrates a couple of array attributes: maximum sustained performance and, if run long enough, the point at which caching and garbage collection mechanisms are overrun and a worst-case sustained performance profile presents itself. What it definitely does not demonstrate is the latency you can expect for your workload, which for most database environments is likely less than 10% of the maximum IOPS capacity of a modern all-flash array. And what about the fact that a complex animal like the Oracle Database performs both random single-block IOs and sequential multi-block IOs simultaneously and at a nanosecond’s notice, depending on the whim of the optimizer? Simplistic performance evaluation unfortunately brings the average storage or database admin no closer to understanding how the array will perform for their actual workload – and isn’t that the whole point of doing such an evaluation?
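To see why latency measured at saturation says little about latency at your workload’s actual rate, here is an illustrative sketch using a simple M/M/1 queueing model. This is an assumption for illustration only (real arrays are far more complex), but it captures the shape of the curve: response time climbs sharply as offered IOPS approach the array’s maximum.

```python
# Illustrative M/M/1 queueing model: R = S / (1 - utilization).
# The 0.2 ms service time and 1M IOPS ceiling are made-up numbers.

def response_time_ms(offered_iops, max_iops, service_ms=0.2):
    utilization = offered_iops / max_iops
    assert utilization < 1.0, "model only valid below saturation"
    return service_ms / (1.0 - utilization)

# Latency at ~10% of max load is close to the raw service time;
# near saturation it blows up:
for load in (0.1, 0.5, 0.9, 0.99):
    print(f"{load:.0%} of max: {response_time_ms(load * 1e6, 1e6):.2f} ms")
```

The takeaway: a sustained-max-IOPS test characterises the right-hand end of this curve, while your database almost certainly lives at the left-hand end.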
What’s a DBA to do?
A while back, our friends over at Pure Storage wrote a blog in which they shared some metrics they had pulled from call-home data from their customer environments. They said, for example, that Oracle environment IO activity broke down like this on average, in terms of block sizes and reads versus writes, and they helpfully provided a VDBench configuration file to drive that IO pattern:
That was really cool of them, but, on closer examination, it occurred to me that this profile really only described a blender of some number of disparate Oracle environments. The chances of it approximating any one Oracle environment were nominal, and the chances of it approximating YOUR Oracle environment were on par with monkeys with typewriters producing Shakespeare. So this driver doesn’t actually issue the IOs that your Oracle database is going to issue. To me, that seriously limits its value. Another problem I have with it is that, with its single read workload definition, it is going to show me the average latency for all read IOs as a single number. But Oracle helpfully shows me my random read time separately from my random write time – and my multi-block read time separate from those, my sequential write time for redo separate from those, etc. This granularity is what makes Oracle’s instrumentation so valuable in performance analysis. I refuse to give it up.
So what can you do? Well, Oracle is capturing all of your IO metrics for you automatically, so just take a look at your AWR report (those of you on Standard Edition can get this from Statspack reports) and build your own IO driver for VDBench. As an example, one customer – let’s call them a large international bank – was curious to see whether our products could deliver comparable or better latency than their existing storage. They shared their AWR reports with me, and I found the IO profile section for the period they really cared about. Here’s a snippet:
Statistic                        Total              per Second     per Trans
-------------------------------- ------------------ -------------- -------------
<SNIP>
physical read IO requests        55,301,220         7,682.5        1,180.1
physical read bytes              1.936982535373E+13 2.69087766E+09 4.1334639E+08
physical read partial requests   26,445             3.7            0.6
physical read requests optimized 49,680,085         6,901.6        1,060.2
physical read total IO requests  55,479,809         7,707.3        1,183.9
physical read total bytes        1.938706428365E+13 2.69327251E+09 4.1371427E+08
physical read total bytes optimi 1.773273192858E+13 2.46345082E+09 3.7841130E+08
physical read total multi block  19,552,557         2,716.3        417.3
physical reads cache             14,137,864         1,964.1        301.7
physical reads cache prefetch    11,716,783         1,627.7        250.0
physical reads direct            1,168,102,453      162,274.1      24,927.0
physical reads direct (lob)      22                 0.0            0.0
physical reads direct temporary  307,926,728        42,777.5       6,571.1
physical reads prefetch warmup   0                  0.0            0.0
physical write IO requests       37,072,831         5,150.2        791.1
physical write bytes             4,873,114,484,736  676,978,477.6  1.0399083E+08
physical write requests optimize 31,566,182         4,385.2        673.6
physical write total IO requests 37,460,357         5,204.0        799.4
physical write total bytes       4,908,503,636,480  681,894,777.9  1.0474603E+08
physical write total bytes optim 3,540,697,530,368  491,877,634.2  75,557,447.1
physical write total multi block 5,511,767          765.7          117.6
<SNIP>
redo writes                      341,363            47.4           7.3
Of course, not every multi-block read is 1MB, because that would be too easy. And good luck trying to get all the numbers to line up exactly: the fact that Oracle pulls these metrics from different places means some rough math is still required. But with a little patience and fiddling, we can get a great approximation of the number of single-block random reads, large-block sequential reads, random and multi-block writes, and redo writes, matching these values closely in both IOPS and bandwidth. When in doubt, use the higher of [IOPS listed, bandwidth listed]. Thus I could set up my VDBench workload definitions:
# single-block, 100% random reads
wd=wd_oracle_rand_read,rdpct=100,xfersize=16k,seekpct=100,iorate=3250,sd=sd*,priority=1
# multi-block, 100% sequential reads
wd=wd_oracle_seq_read,rdpct=100,xfersize=1024k,seekpct=0,iorate=2500,sd=sd*,priority=2
# single-block, 100% random writes
wd=wd_oracle_rand_write,rdpct=0,xfersize=16k,seekpct=100,iorate=5800,sd=sd*,priority=3
# multi-block, 100% sequential writes
wd=wd_oracle_seq_write,rdpct=0,xfersize=768k,seekpct=0,iorate=750,sd=sd*,priority=4
# redo write sizes vary per the LGWR mechanism, so we'll go with redo size (bytes) per second / redo writes per second
wd=wd_oracle_redo_write,rdpct=0,xfersize=64k,seekpct=0,iorate=50,sd=sd*,priority=5

rd=rd_oracle_ramp,wd=wd_oracle*,iorate=12350,interval=1,elapsed=120,forthreads=8,warmup=5
As a quick check, with the customer’s 16KB block size, this config drives just over 50 MB/s of random reads + 2500 MB/s of sequential reads, which comes really close to the 2566 MB/s total read bandwidth stated in the snippet above. It also drives about 91 MB/s random writes + 563 MB/s sequential writes + 3 MB/s redo, for a total of 657 MB/s of writes – really close to the reported 650 MB/s write bandwidth in the snippet. I could take this even further, breaking out additional workload definitions if I needed to characterize performance for other IO types or block sizes. VDBench helpfully puts out a separate HTML file for each workload definition, allowing you to see the latency metrics for each IO type and size and compare them against the values in your AWR or Statspack report. Note that you should set your forthreads value just high enough to drive the desired IOPS total; any higher and you’ll push latency up without achieving anything useful. And clearly the total IOPS target for the run definition should match the sum of the individual workload drivers.
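The bandwidth arithmetic in that quick check can be reproduced with a short script. The rates and transfer sizes below are taken straight from the workload definitions above:

```python
# Sanity-check the VDBench workload rates against the AWR bandwidth figures.
# iorate and xfersize values come from the wd= definitions above.

KB = 1024
workloads = {                       # name: (iorate, xfersize in bytes)
    "rand_read":  (3250, 16 * KB),
    "seq_read":   (2500, 1024 * KB),
    "rand_write": (5800, 16 * KB),
    "seq_write":  (750, 768 * KB),
    "redo_write": (50, 64 * KB),
}

def mb_per_sec(iorate, xfersize):
    return iorate * xfersize / (1024 * 1024)

read_mb = sum(mb_per_sec(*workloads[w]) for w in ("rand_read", "seq_read"))
write_mb = sum(mb_per_sec(*workloads[w])
               for w in ("rand_write", "seq_write", "redo_write"))

print(f"reads:  {read_mb:.1f} MB/s")   # vs ~2566 MB/s read in the AWR snippet
print(f"writes: {write_mb:.1f} MB/s")  # vs ~650 MB/s written in the AWR snippet
```

Running it shows roughly 2551 MB/s of reads and 656 MB/s of writes, confirming the config lands within a couple of percent of the AWR figures.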
PoC Avoided? Maybe.
All of what I have described here helps to answer the question: what would each latency number look like for my IO workload as it exists today? From this, you can use a little math to predict with great accuracy the execution time of any particular SQL statement at the lower latency. The next logical question is: what is going to happen to overall application performance when each query runs so much faster and completes sooner, allowing the next query to start earlier, and so on? That part is much more difficult to predict and may require a full-blown PoC to answer definitively, but at least you know the product you’re about to invest time in can deliver the latency you expect with your current IO workload profile. If you’re hoping for a 10X performance improvement for your application, you’d better see that IO wait currently accounts for a large percentage of database time and that the latency of the new array beats your current latency by enough to make that dream a reality.