The Flash Insider: To POC or Not To POC?

Proof of Concept?

Guest Post

I’m excited announce another guest blog written by my good friend and funny-talking American cousin Nathan Fuzi. Like me, Nate comes from a database background but joined the all-flash storage revolution back in its infancy. Which means, like me, Nate how has a little tombstone on his résumé marked Violin Memory. But even though he has since moved up to working in THE CLOUD, Nate’s experience working for an AFA vendor is invaluable. Over six years, he worked with hundreds of database customers who were deciding whether to purchase all-flash storage and – more importantly – wondering how to test their databases on those storage platforms. Now, for your benefit, he writes about one of the most crucial stages of the process: the proof of concept (POC).

Indulge me, if you will:  take yourself back to a time long, long ago–perhaps nearly forgotten.  Waaaay back when storage arrays were built of spinning hard drives front-ended with DRAM for caching purposes, and conventional wisdom had not yet agreed whether flash memory could serve as persistent storage media.  I know:  it seems like forever ago.  Even the ghost of Christmas Past is like, Really?  But I assure you that time happened.  I lived through it, and so did my buddy flashdba and a number of others.  Those were heady days, full of wonder and spectacle and … many, many proofs of concept.

storage-characteristicsAnd who could blame folks back then for wanting to see more than that these mysterious and spectacular “all flash” storage arrays could ingest synthetic data and spit it back at previously unseen IOPS rates, incredibly low latency numbers, and firehose-like bandwidth volumes?  Because let’s face it:  marketing numbers and theoretical performance are just that.  Theoretical.  You know, as in “your mileage may vary”.  What makes a difference to people is what kind of performance the product delivers to their specific application.  Folks like flashdba and myself got pretty good guessing at the latency numbers our products would deliver at the IOPS rates we observed in applications.  We could then do some simple math to substitute in our anticipated latency for the current value and accurately predict our improvement on execution time for a given SQL statement.  But in the early days, proving our claims to a skeptical customer often meant asking them to deploy their application on our array, as the IO profile was complex and varied.

Oh… the Pain

The PoC is still quite common and often necessary–and not just for storage products, although especially for storage products, with their increasingly wild performance claims.  But it’s painful.  You have to have an entire non-production setup in place or build one just for the PoC, and then you have to have enough additional ports on your Ethernet or FC switches (or whatever new-fangled connectivity the latest flashy product is sporting) that you can leave everything intact and hook up the new array, expose storage to the host, perform some tests, and then ideally migrate the application data over to run some “real world” tests.

But what could we achieve without doing a full-blown PoC?  There are lots of synthetic load generation utilities out there these days, some easier to use than others and some more flexible and fuller-featured.  A short list of popular tools here:

Iometer                http://www.iometer.org/

DiskSpd                https://gallery.technet.microsoft.com/DiskSpd-a-robust-storage-6cd2f223

VDBench              http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html

Fio                          http://freecode.com/projects/fio

What are you really testing for?

One common aspect I have seen of what are, frankly, flawed testing paradigms is that admins often attempt to spin up to the max IOPS the host/array combination can drive for that particular workload setting and then hold that rate for some period.  This methodology demonstrates a couple of array attributes:  maximum sustained performance and, run long enough, the point at which caching and garbage collection mechanisms are overrun and a worst case sustained performance profile presents itself.  test-blackboardWhat it definitely does not demonstrate is the latency you can expect for your workload, which for most database environments is likely less than 10% of the maximum IOPS performance capacity of the modern all-flash array.  And what about the fact that complex animals like the Oracle Database perform both random single-block IOs and sequential multi-block IOs simultaneously and at a nanosecond’s notice, depending on the whim of the optimizer?  Simplistic performance evaluation unfortunately brings the average storage or database admin no closer to understanding how the array will perform for his actual workload–and isn’t that the whole point of doing such an evaluation?

What’s a DBA to do?

A while back, our friends over at Pure Storage wrote a blog in which they shared some metrics they had pulled from call-home data from their customer environments.  They said, for example, that Oracle environment IO activity broke down like this on average, in terms of block sizes and reads versus writes, and they helpfully provided a VDBench configuration file to drive that IO pattern:

http://blog.purestorage.com/modeling-io-size-mixes-with-vdbench/

That was really cool of them, but, on closer examination, it occurred to me that this profile really only described a blender of some number of disparate Oracle environments.  The chances of it approximating any one Oracle environment were nominal, and the chances of it approximating YOUR Oracle environment went to monkeys with typewriters producing Shakespeare.  So this driver doesn’t actually issue the IOs that your Oracle database is going to issue.  To me, that seriously limits its value.  Another problem I have with it is that, with its single read workload definition, it is going to show me the average latency for all read IOs as a single number.  But Oracle helpfully shows me my random read time separate from my random write time–and my multi-block read time separate from those, and my sequential write time for redo separate from those, etc.  This granularity is what makes Oracle’s instrumentation so valuable in performance analysis.  I refuse to give it up.

Taking Charge

So what can you do?  Well, Oracle is capturing all of your IO metrics for you automatically, so just take a look at your AWR report (you guys on SE can get this from Statspack reports) for them and build your own IO driver for VDBench.  As an example, one customer–let’s call them a large international bank–was curious to see if our products could deliver comparable or better latency than their existing storage.  They shared their AWR reports with me, and I found their IO profile section for the period they really cared about.  Here’s a snippet:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
<SNIP>
physical read IO requests                55,301,220        7,682.5       1,180.1
physical read bytes              1.936982535373E+13 2.69087766E+09 4.1334639E+08
physical read partial requests               26,445            3.7           0.6
physical read requests optimized         49,680,085        6,901.6       1,060.2
physical read total IO requests          55,479,809        7,707.3       1,183.9
physical read total bytes        1.938706428365E+13 2.69327251E+09 4.1371427E+08
physical read total bytes optimi 1.773273192858E+13 2.46345082E+09 3.7841130E+08
physical read total multi block          19,552,557        2,716.3         417.3
physical reads cache                     14,137,864        1,964.1         301.7
physical reads cache prefetch            11,716,783        1,627.7         250.0
physical reads direct                 1,168,102,453      162,274.1      24,927.0
physical reads direct (lob)                      22            0.0           0.0
physical reads direct temporary         307,926,728       42,777.5       6,571.1
physical reads prefetch warmup                    0            0.0           0.0
physical write IO requests               37,072,831        5,150.2         791.1
physical write bytes              4,873,114,484,736  676,978,477.6 1.0399083E+08
physical write requests optimize         31,566,182        4,385.2         673.6
physical write total IO requests         37,460,357        5,204.0         799.4
physical write total bytes        4,908,503,636,480  681,894,777.9 1.0474603E+08
physical write total bytes optim  3,540,697,530,368  491,877,634.2  75,557,447.1
physical write total multi block          5,511,767          765.7         117.6
<SNIP>
redo writes                                 341,363           47.4           7.3

Of course, not every multi-block read is 1M because that would be too easy.  And good luck trying to get all the numbers to line up exactly.  That Oracle pulls the metrics from different places still means some rough math.  But, with a little patience and fiddling, we can get a great approximation of the number of single block random reads, large block sequential reads, random and multi-block writes, and redo writes that match up closely to these values, both in IOPS and bandwidth.  When in doubt, use the higher of [IOPS listed, Bandwidth listed].  Thus I could set up my VDBench workload definitions:

# single-block, 100% random reads
wd=wd_oracle_rand_read,rdpct=100,xfersize=16k,seekpct=100,iorate=3250,sd=sd*,priority=1

# multi-block, 100% sequential reads
wd=wd_oracle_seq_read,rdpct=100,xfersize=1024k,seekpct=0,iorate=2500,sd=sd*,priority=2

# single-block, 100% random writes
wd=wd_oracle_rand_write,rdpct=0,xfersize=16k,seekpct=100,iorate=5800,sd=sd*,priority=3

# multi-block, 100% sequential writes
wd=wd_oracle_seq_write,rdpct=0,xfersize=768k,seekpct=0,iorate=750,sd=sd*,priority=4

# redo write sizes vary per the LGWR mechanism, so we’ll go with redo size (bytes) per second / redo writes per second
wd=wd_oracle_redo_write,rdpct=0,xfersize=64k,seekpct=0,iorate=50,sd=sd*,priority=5
rd=rd_oracle_ramp,wd=wd_oracle*,iorate=12350,interval=1,elapsed=120,forthreads=8,warmup=5

As a quick check, with the customer’s 16KB block size, this config drives just over 50 MB/s random reads + 2500 MB/s sequential reads, which gets really close to the 2566 MB/s total reads stated in the snippet above.  It also drives about 91 MB/s random writes + 563 MB/s sequential writes + 3 MB/s redo for a total of 657 MB/s writes, which is really close to the reported 650 MB/s write bandwidth in the snippet.  I could take this even further to break out if I needed to characterize performance for other IO types or block sizes.  VDBench helpfully puts out a separate HTML file for each workload definition, allowing us to see the latency metrics for each IO type and size that you can then compare against the values in our AWR or Statspack report.  Note that you should set your forthreads value just high enough that you can drive the desired IOPS total; any higher and you’ll push latency up without achieving anything useful. And clearly the total IOPS target for the run definition should match the sum of the individual workload drivers.

PoC Avoided?  Maybe.

question-mark-diceAll of what I have described here helps to answer the question of What would each latency number look like for my IO workload as it exists today?  From this, you can use a little math to answer with great accuracy the execution time for any particular SQL with the lower latency.  The next logical question is What is going to happen to overall application performance when each query runs so much faster and completes sooner, allowing the next query to start earlier, etc?  That part is much more difficult to predict and may require a full-blown PoC to answer definitively, but at least you know the product you’re about to invest time in can deliver the latency you expect with your current IO workload profile.  If you’re hoping for a 10X performance improvement for your application, you’d better see that IO wait currently accounts for a large percentage of database time and that the latency of your new array beats the current latency by enough to make that dream a reality.

2 Responses to The Flash Insider: To POC or Not To POC?

  1. Lynn Sattler says:

    Chris,
    I am a USA based Oracle DBA. Very much enjoyed reading all of your blog postings that I came upon in the last couple of days! One question I did not find an answer to was the net effects of this technology wearing out. We have both oltp and a data warehouse. I believe the data warehouse does a lot of writes compared to oltp. How long will the ssd drives last in these arrays?
    Also, I am not sure I have this right, but are you saying that for the vendors to survive, they have to go to ssd drives in their arrays instead of having “ground-up designs”? You were sure high on the “ground-up designs” before Violin got into financial trouble.

    • flashdba says:

      Hi Lynn

      First of all, I wrote an article about flash wear which you can find here:

      Understanding Flash: Floating Gates and Wear

      But that article is about wear at the cell level. What you are asking about is wear at the array level, which is much more interesting. I should blog about it sometime soon.

      In most cases the answer is that you don’t need to worry, because typically your AFA vendor would replace any worn SSD if your system is under a support contract. I can’t speak for the others but that’s certainly what my employers past and present have done. They are able to do this because the vast majority of customers would never get anywhere near the point where their drives wear out of the lifetime of the arrays (typically 5 to 7 years).

      So why is that? Well, for example, a Samsung PM863 data centre-class SSD has a Drive Writes Per Day (DWPD) rating of around 1.3, which means you can write an average of 1.3x the drive capacity every day for the warranty period. Considering that these drives go up to 3.84TB in size that’s a lot of writes. And of course your storage array will have many drives in it (for example 24 per shelf in a Kaminario K2 array). Of more importance is the fact that what your database writes isn’t what will be written to the flash. Your All Flash Array will likely use a combination of technologies such as thin provisioning, compression, deduplication and zero detection to ensure that what is written down to persistent media is a fraction of the data being sent from the server. So all in all, wear isn’t really an issue unless you have a very unusual requirement. I’ve worked on some of the largest and most write-intensive databases in the world and I can only think of two incidences where this was even a factor.

      Ahh yes, the question of ground up versus commodity component. As you point out, I’ve been on both sides of this equation. I could write at length about Violin’s woes but this isn’t the place. However, I do think history has now shown that hardware-centric designs aren’t agile enough to adapt to market forces (look at Violin, Skyera and DSSD). We got badly burned by that at Violin. I’ve come to the conclusion that software-based approaches are the answer, especially when they are built on the back of commodity components built by companies like Samsung, who has the economies of scale to produce millions of SSD units per year. But hey, I’ve been wrong before so don’t take my word for it. I’m only hoping to share my experiences as one possible set of views.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.