The Flash Insider: To POC or Not To POC?

Proof of Concept?

Guest Post

I’m excited to announce another guest blog written by my good friend and funny-talking American cousin Nathan Fuzi. Like me, Nate comes from a database background but joined the all-flash storage revolution back in its infancy. Which means, like me, Nate now has a little tombstone on his résumé marked Violin Memory. But even though he has since moved up to working in THE CLOUD, Nate’s experience working for an AFA vendor is invaluable. Over six years, he worked with hundreds of database customers who were deciding whether to purchase all-flash storage and – more importantly – wondering how to test their databases on those storage platforms. Now, for your benefit, he writes about one of the most crucial stages of the process: the proof of concept (POC).

Indulge me, if you will:  take yourself back to a time long, long ago–perhaps nearly forgotten.  Waaaay back when storage arrays were built of spinning hard drives front-ended with DRAM for caching purposes, and conventional wisdom had not yet agreed whether flash memory could serve as persistent storage media.  I know:  it seems like forever ago.  Even the ghost of Christmas Past is like, Really?  But I assure you that time happened.  I lived through it, and so did my buddy flashdba and a number of others.  Those were heady days, full of wonder and spectacle and … many, many proofs of concept.

And who could blame folks back then for wanting to see more than that these mysterious and spectacular “all flash” storage arrays could ingest synthetic data and spit it back at previously unseen IOPS rates, incredibly low latency numbers, and firehose-like bandwidth volumes?  Because let’s face it:  marketing numbers and theoretical performance are just that.  Theoretical.  You know, as in “your mileage may vary”.  What makes a difference to people is what kind of performance the product delivers to their specific application.  Folks like flashdba and myself got pretty good at guessing the latency numbers our products would deliver at the IOPS rates we observed in applications.  We could then do some simple math to substitute in our anticipated latency for the current value and accurately predict our improvement on execution time for a given SQL statement.  But in the early days, proving our claims to a skeptical customer often meant asking them to deploy their application on our array, as the IO profile was complex and varied.
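The “simple math” mentioned above can be sketched roughly as follows. This is only an illustration in Python, assuming a statement's elapsed time splits cleanly into serial IO wait plus everything else; the function name and every number are hypothetical rather than taken from a real customer.

```python
def predict_elapsed(elapsed_s, io_waits, old_latency_ms, new_latency_ms):
    """Estimate a statement's elapsed time after substituting a new IO latency.

    Illustrative assumptions: IO waits are serial, and all non-IO time
    (CPU, locking, network) is unchanged by the new storage.
    """
    old_io_s = io_waits * old_latency_ms / 1000.0   # time spent waiting on IO today
    new_io_s = io_waits * new_latency_ms / 1000.0   # same IO count at the new latency
    non_io_s = elapsed_s - old_io_s                 # whatever is left is not IO
    return non_io_s + new_io_s


# Hypothetical statement: 120 s elapsed, 20,000 single-block reads at 5 ms each
before = 120.0
after = predict_elapsed(before, io_waits=20_000,
                        old_latency_ms=5.0, new_latency_ms=0.5)
print(f"before: {before:.1f} s, predicted after: {after:.1f} s")
```

In practice the IO count and current latency would come from the database's own wait-event instrumentation, and the estimate is only as good as the assumption that nothing else about the statement changes.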

Oh… the Pain

The PoC is still quite common and often necessary–and not just for storage products, although especially for storage products, with their increasingly wild performance claims.  But it’s painful.  You have to have an entire non-production setup in place or build one just for the PoC, and then you have to have enough additional ports on your Ethernet or FC switches (or whatever new-fangled connectivity the latest flashy product is sporting) that you can leave everything intact and hook up the new array, expose storage to the host, perform some tests, and then ideally migrate the application data over to run some “real world” tests.

But what could we achieve without doing a full-blown PoC?  There are lots of synthetic load generation utilities out there these days, some easier to use than others and some more flexible and fuller-featured.  A short list of popular tools here:

Iometer                http://www.iometer.org/

DiskSpd                https://gallery.technet.microsoft.com/DiskSpd-a-robust-storage-6cd2f223

VDBench              http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html

Fio                          http://freecode.com/projects/fio

What are you really testing for?

One common aspect I have seen of what are, frankly, flawed testing paradigms is that admins often attempt to spin up to the max IOPS the host/array combination can drive for that particular workload setting and then hold that rate for some period.  This methodology demonstrates a couple of array attributes:  maximum sustained performance and, run long enough, the point at which caching and garbage collection mechanisms are overrun and a worst case sustained performance profile presents itself.  What it definitely does not demonstrate is the latency you can expect for your workload, which for most database environments is likely less than 10% of the maximum IOPS performance capacity of the modern all-flash array.  And what about the fact that complex animals like the Oracle Database perform both random single-block IOs and sequential multi-block IOs simultaneously and at a nanosecond’s notice, depending on the whim of the optimizer?  Simplistic performance evaluation unfortunately brings the average storage or database admin no closer to understanding how the array will perform for his actual workload–and isn’t that the whole point of doing such an evaluation?

What’s a DBA to do?

A while back, our friends over at Pure Storage wrote a blog in which they shared some metrics they had pulled from call-home data from their customer environments.  They said, for example, that Oracle environment IO activity broke down like this on average, in terms of block sizes and reads versus writes, and they helpfully provided a VDBench configuration file to drive that IO pattern:

http://blog.purestorage.com/modeling-io-size-mixes-with-vdbench/

That was really cool of them, but, on closer examination, it occurred to me that this profile really only described a blender of some number of disparate Oracle environments.  The chances of it approximating any one Oracle environment were nominal, and the chances of it approximating YOUR Oracle environment went to monkeys with typewriters producing Shakespeare.  So this driver doesn’t actually issue the IOs that your Oracle database is going to issue.  To me, that seriously limits its value.  Another problem I have with it is that, with its single read workload definition, it is going to show me the average latency for all read IOs as a single number.  But Oracle helpfully shows me my random read time separate from my random write time–and my multi-block read time separate from those, and my sequential write time for redo separate from those, etc.  This granularity is what makes Oracle’s instrumentation so valuable in performance analysis.  I refuse to give it up.

Taking Charge

So what can you do?  Well, Oracle is capturing all of your IO metrics for you automatically, so just look them up in your AWR report (you guys on SE can get them from Statspack reports) and build your own IO driver for VDBench.  As an example, one customer–let’s call them a large international bank–was curious to see if our products could deliver comparable or better latency than their existing storage.  They shared their AWR reports with me, and I found their IO profile section for the period they really cared about.  Here’s a snippet:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
<SNIP>
physical read IO requests                55,301,220        7,682.5       1,180.1
physical read bytes              1.936982535373E+13 2.69087766E+09 4.1334639E+08
physical read partial requests               26,445            3.7           0.6
physical read requests optimized         49,680,085        6,901.6       1,060.2
physical read total IO requests          55,479,809        7,707.3       1,183.9
physical read total bytes        1.938706428365E+13 2.69327251E+09 4.1371427E+08
physical read total bytes optimi 1.773273192858E+13 2.46345082E+09 3.7841130E+08
physical read total multi block          19,552,557        2,716.3         417.3
physical reads cache                     14,137,864        1,964.1         301.7
physical reads cache prefetch            11,716,783        1,627.7         250.0
physical reads direct                 1,168,102,453      162,274.1      24,927.0
physical reads direct (lob)                      22            0.0           0.0
physical reads direct temporary         307,926,728       42,777.5       6,571.1
physical reads prefetch warmup                    0            0.0           0.0
physical write IO requests               37,072,831        5,150.2         791.1
physical write bytes              4,873,114,484,736  676,978,477.6 1.0399083E+08
physical write requests optimize         31,566,182        4,385.2         673.6
physical write total IO requests         37,460,357        5,204.0         799.4
physical write total bytes        4,908,503,636,480  681,894,777.9 1.0474603E+08
physical write total bytes optim  3,540,697,530,368  491,877,634.2  75,557,447.1
physical write total multi block          5,511,767          765.7         117.6
<SNIP>
redo writes                                 341,363           47.4           7.3

Of course, not every multi-block read is 1M because that would be too easy.  And good luck trying to get all the numbers to line up exactly.  That Oracle pulls the metrics from different places still means some rough math.  But, with a little patience and fiddling, we can get a great approximation of the number of single block random reads, large block sequential reads, random and multi-block writes, and redo writes that match up closely to these values, both in IOPS and bandwidth.  When in doubt, use the higher of [IOPS listed, Bandwidth listed].  Thus I could set up my VDBench workload definitions:

# single-block, 100% random reads
wd=wd_oracle_rand_read,rdpct=100,xfersize=16k,seekpct=100,iorate=3250,sd=sd*,priority=1

# multi-block, 100% sequential reads
wd=wd_oracle_seq_read,rdpct=100,xfersize=1024k,seekpct=0,iorate=2500,sd=sd*,priority=2

# single-block, 100% random writes
wd=wd_oracle_rand_write,rdpct=0,xfersize=16k,seekpct=100,iorate=5800,sd=sd*,priority=3

# multi-block, 100% sequential writes
wd=wd_oracle_seq_write,rdpct=0,xfersize=768k,seekpct=0,iorate=750,sd=sd*,priority=4

# redo write sizes vary per the LGWR mechanism, so we’ll go with redo size (bytes) per second / redo writes per second
wd=wd_oracle_redo_write,rdpct=0,xfersize=64k,seekpct=0,iorate=50,sd=sd*,priority=5
rd=rd_oracle_ramp,wd=wd_oracle*,iorate=12350,interval=1,elapsed=120,forthreads=8,warmup=5

As a quick check, with the customer’s 16KB block size, this config drives just over 50 MB/s random reads + 2500 MB/s sequential reads, which gets really close to the 2566 MB/s total reads stated in the snippet above.  It also drives about 91 MB/s random writes + 563 MB/s sequential writes + 3 MB/s redo for a total of 657 MB/s writes, which is really close to the reported 650 MB/s write bandwidth in the snippet.  I could break this out even further if I needed to characterize performance for other IO types or block sizes.  VDBench helpfully puts out a separate HTML file for each workload definition, allowing you to see the latency metrics for each IO type and size, which you can then compare against the values in your AWR or Statspack report.  Note that you should set your forthreads value just high enough that you can drive the desired IOPS total; any higher and you’ll push latency up without achieving anything useful.  And clearly the total IOPS target for the run definition should match the sum of the individual workload drivers.
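The quick check above is easy to script. The following Python sketch simply re-does the arithmetic: the transfer sizes and IO rates are copied from the workload definitions, and the AWR figures are the per-second byte counts from the snippet. The variable names and the choice of “physical write total bytes” as the write reference are my own assumptions.

```python
# (name, xfersize in KB, iorate in IOPS, is_read), copied from the wd= lines
WORKLOADS = [
    ("rand_read",  16,   3250, True),
    ("seq_read",   1024, 2500, True),
    ("rand_write", 16,   5800, False),
    ("seq_write",  768,  750,  False),
    ("redo_write", 64,   50,   False),
]


def mb_per_sec(xfer_kb, iops):
    """Bandwidth in MB/s for a given transfer size (KB) and IO rate."""
    return xfer_kb * iops / 1024.0


read_mb = sum(mb_per_sec(kb, io) for _, kb, io, is_read in WORKLOADS if is_read)
write_mb = sum(mb_per_sec(kb, io) for _, kb, io, is_read in WORKLOADS if not is_read)
total_iops = sum(io for _, _, io, _ in WORKLOADS)

# Per-second byte counts from the AWR snippet, converted to MB/s
awr_read_mb = 2.69087766e9 / 1024**2    # physical read bytes
awr_write_mb = 681_894_777.9 / 1024**2  # physical write total bytes

print(f"reads : {read_mb:8.1f} MB/s driven vs {awr_read_mb:8.1f} MB/s in AWR")
print(f"writes: {write_mb:8.1f} MB/s driven vs {awr_write_mb:8.1f} MB/s in AWR")
print(f"total iorate for the run definition: {total_iops}")
```

The last line is the sanity check on the run definition: the summed workload rates should equal the `iorate` given on the `rd=` line.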

PoC Avoided?  Maybe.

All of what I have described here helps to answer the question, “What would each latency number look like for my IO workload as it exists today?”  From this, you can use a little math to predict with great accuracy the execution time for any particular SQL with the lower latency.  The next logical question is, “What is going to happen to overall application performance when each query runs so much faster and completes sooner, allowing the next query to start earlier, etc?”  That part is much more difficult to predict and may require a full-blown PoC to answer definitively, but at least you know the product you’re about to invest time in can deliver the latency you expect with your current IO workload profile.  If you’re hoping for a 10X performance improvement for your application, you’d better see that IO wait currently accounts for a large percentage of database time and that the latency of your new array beats the current latency by enough to make that dream a reality.
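The closing point about a 10X improvement is essentially Amdahl's law applied to IO wait. Here is a minimal Python sketch, assuming database time splits into an IO-wait fraction that shrinks in proportion to latency and a remainder that does not change; the function name and the sample figures are hypothetical.

```python
def max_app_speedup(io_fraction, old_latency_ms, new_latency_ms):
    """Upper bound on overall speedup when only the IO-wait portion gets faster.

    io_fraction is the share of database time currently spent waiting on IO
    (0.0 to 1.0). Everything outside that share is assumed unchanged.
    """
    latency_ratio = new_latency_ms / old_latency_ms
    remaining_time = (1.0 - io_fraction) + io_fraction * latency_ratio
    return 1.0 / remaining_time


# Hypothetical case: latency drops from 5 ms to 0.5 ms
for io_fraction in (0.30, 0.60, 0.95):
    speedup = max_app_speedup(io_fraction, old_latency_ms=5.0, new_latency_ms=0.5)
    print(f"IO wait = {io_fraction:.0%} of DB time -> at most {speedup:.2f}x overall")
```

With these made-up inputs, even a database spending 95% of its time on IO wait ends up a little under 7X faster from a tenfold latency improvement, which is why the IO-wait percentage matters as much as the latency number.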

All Flash Arrays: Active/Active versus Active/Passive

I want you to imagine that you are about to run a race. You have your trainers on, your pre-race warm up is complete and you are at the start line. You look to your right… and see the guy next to you, the one with the bright orange trainers, is hopping up and down on one leg. He does have two legs – the other one is held up in the air – he’s just choosing to hop the whole race on one foot. Why?

You can’t think of a valid reason so you call across, “Hey buddy… why are you running on one leg?”

His reply blows your mind: “Because I want to be sure that, if one of my legs falls off, I can still run at the same speed”.

Welcome, my friends, to the insane world of storage marketing.

High Availability Clusters

The principles of high availability are fairly standard, whether you are discussing enterprise storage, databases or any other form of HA. The basic premise is that, to maintain service in the event of unexpected component failures, you need to have at least two of everything. In the case of storage array HA, we are usually talking about the storage controllers which are the interfaces between the outside world and the persistent media on which data resides.

Ok so let’s start at the beginning: if you only have one controller then you are running at risk, because a controller failure equals a service outage. No enterprise-class storage array would be built in this manner. So clearly you are going to want a minimum of two controllers… which happens to be the most common configuration you’ll find.

So now we have two controllers, let’s call them A and B. Each controller has CPU, memory and so on which allow it to deliver a certain level of performance – so let’s give that an arbitrary value: one controller can deliver 1P of performance. And finally, let’s remember that those controllers cost money – so let’s say that a controller capable of giving 1P of performance costs five groats.

Active/Passive Design

In a basic active/passive design, one controller (A) handles all traffic while the other (B) simply sits there waiting for its moment of glory. That moment comes when A suffers some kind of failure – and B then leaps into action, immediately replacing A by providing the same service. There might be a minor delay as the system performs a failover, but with multipathing software in place it will usually be quick enough to go unnoticed.

[Diagram: active/passive design]

So what are the downsides of active/passive? There are a few, but the most obvious one is that you are architecturally limited to seeing only 50% of your total available performance. You bought two controllers (costing you ten groats!) which means you have 2P of performance in your pocket, but you will forever be limited to a maximum of 1P of performance under this design.

Active/Active Design

In an active/active architecture, both controllers (A and B) are available to handle traffic. This means that under normal operation you now have 2P of performance – and all for the same price of ten groats. Both the overall performance and the price/performance have doubled.

[Diagram: active/active design]

What about in a failure situation? Well, if controller A fails you still have controller B functioning, which means you are now down to 1P of performance. It’s now half the performance you are used to in this architecture, but remember that 1P is still the same performance as the active/passive model. Yes, that’s right… the performance under failure is identical for both designs.
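The groats-and-P arithmetic so far can be captured in a few lines. This Python sketch uses the arbitrary units from the text (1P and five groats per controller) and assumes, purely for illustration, that active/active performance adds up linearly across controllers; the function and field names are hypothetical.

```python
CONTROLLER_PERF = 1.0   # "1P" per controller, the arbitrary unit used above
CONTROLLER_COST = 5     # groats per controller


def summarize(design, controllers=2):
    """Return cost and performance figures for a simple HA controller design."""
    cost = controllers * CONTROLLER_COST
    if design == "active/active":
        normal = controllers * CONTROLLER_PERF   # every controller serves IO
    elif design == "active/passive":
        normal = CONTROLLER_PERF                 # only one controller serves IO
    else:
        raise ValueError(f"unknown design: {design}")
    # After losing one controller, the survivors cap the available performance
    after_failure = min(normal, (controllers - 1) * CONTROLLER_PERF)
    return {
        "cost": cost,
        "normal": normal,
        "after_failure": after_failure,
        "perf_per_groat": normal / cost,
    }


for design in ("active/active", "active/passive"):
    print(design, summarize(design))
```

Both designs cost ten groats and both deliver 1P after a controller failure; the only difference is what you get for your money while everything is healthy.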

What About The Cost?

Smart people look at technical criteria and choose the one which best fits their requirements. But really smart people (like my buddy Shai Maskit) remember that commercial criteria matter too. So with that in mind, let’s go back and consider those prices a little more. For ten groats, the active/active solution delivered performance of 2P under normal operation. The active/passive solution only delivered 1P. What happens if we attempt to build an active/passive system with 2P of performance?

[Diagram: active/passive design with larger controllers]

To build an active/passive solution which delivers 2P of performance we now need to use bigger, more powerful controllers. Architecturally that’s not much of a challenge – after all, most modern storage controllers are just x86 servers and there are almost always larger models available. The problem comes with the cost. To paraphrase Shai’s blog on this same subject:

Cost of storage controller capable of 1P performance  <  Cost of storage controller capable of 2P performance

In other words, building an active/passive system requires more expensive hardware than building a comparable active/active system. It might not be double, as in my picture, but it will sure as hell be more expensive – and that cost is going to be passed on to the end user.

Does It Scale?

Another question that really smart people ask is, “How does it scale?”. So let’s think about what happens when you want to add more performance.

In an active/active design you have the option of adding more performance by adding more controllers. As long as your architecture supports the ability for all controllers to be active concurrently, adding performance is as simple as adding nodes into a cluster.

But what happens when you add a node to an active/passive solution? Nothing. You are architecturally limited to the performance of one controller. Adding more controllers just makes the price/performance even worse. This means that the only solution for adding performance to an active/passive system is to replace the controllers with more powerful versions…

The Pure Storage Architecture

Pure Storage is an All Flash Array vendor who knows how to play the marketing game better than most, so let’s have a look at their architecture. The PS All Flash Array is a dual-controller design where both controllers send and receive I/Os to the hosts. But… only one controller processes I/Os to and from the underlying persistent media (the SSDs). So what should we call this design, active/active or active/passive?

According to an IDC white paper published on PS’s website, PS controllers are sized so that each controller can deliver 100% of the published performance of the array. The paper goes on to explain that under normal operation each controller is loaded to a maximum of 50% on the host side. This way, PS promises that performance under failure will be equal to the performance under normal operations.

In other words, as an architectural decision, the sum of the performance of both controllers can never be delivered.

So which of the above designs does that sound like to you? It sounds like active/passive to me, but of course that’s not going to help PS sell its flash arrays. Unsurprisingly, on the PS website the product is described as “active/active” at every opportunity.

Yet even PS’s chief talking head, Vaughn Stewart, has to ask the question, “Is the FlashArray an Active/Active or Active/Passive Architecture?” and eventually comes to the conclusion that, “Active/Active or Active/Passive may be debatable”.

There’s no debate in my view.

Conclusion

You will obviously draw your own conclusions on everything I’ve discussed above. I don’t usually pick on other AFA vendors during these posts because I’m aiming for an educational tone rather than trying to fling FUD. But I’ll be honest, it pisses me off when vendors appear to misuse technical jargon in a way which conveniently masks their less-glamorous architectural decisions.

My advice is simple. Always take your time to really look into each claim and then frame it in your own language. It’s only then that you’ll really start to understand whether something you read about is an innovative piece of design from someone like PS… or more likely just another load of marketing BS.

* Many thanks to my colleague Rob Li for the excellent running-on-one-leg metaphor

All Flash Arrays: Controllers Are The New Bottleneck

Today’s storage array market contains a wild variation of products: block storage, file storage or object storage; direct attached, SANs or NAS systems; fibre-channel, iSCSI or Infiniband… Even the SAN section of the market is full of diversity: from legacy hard disk drive-based arrays through the transitory step of tiered disk+flash hybrid systems and on to modern All-Flash Arrays (AFAs).

If you were partial to the odd terrible pun, you might even say that it was a bewildering array of choices. [*Array* of choices? Oh come on. If you’re expecting a higher class of humour than that, I’m afraid this is not the blog for you]

Anyway, one thing that pretty much all storage arrays have in common is the basic configuration blocks of which they comprise:

  • Array controllers
  • Internal networking
  • Persistent Media (flash or disk)

Over the course of this blog series I’ve talked a lot about both flash and disk media, but now it’s time to concentrate a little more on the other stuff – specifically, the controllers. A typical Storage Area Network delivers a lot more functionality than would be expected from just connecting a bunch of disks or flash – and it’s the controllers that are responsible for most of that added functionality.

Storage Array Controllers

Think of your storage system as a private network on which is located a load of dumb disk or flash drives. I say dumb because they can do little else other than accept I/O requests: reads and writes. The controllers are therefore required to provide the intelligence needed to present those drives to the outside world and add all of the functionality associated with enterprise-class storage:

  • Resilience, automatic fault tolerance and high availability, RAID
  • Mirroring and/or replication
  • Data reduction technologies (compression, deduplication, thin provisioning)
  • Data management features (snapshots, clones, etc)
  • Management and monitoring interfaces
  • Vendor support integration such as call-home and predictive analytics

Controllers are able to add this “intelligence” because they are actually computers in their own right, acting as intermediaries between the back-end storage devices and the front-end storage fabric which connects the array to its clients. And as computers, they rely on the three classic computing resources (which I’m going to list using fancy colours so I can sneak them up on you again later in this post):

  1. Memory (DRAM)
  2. Processing (CPU)
  3. Networking

It’s the software running on the array controllers – and utilising these resources – that describes their behaviour. But with the rise of flash storage, this behaviour has had to change… drastically.

Disk Array Controllers vs Flash Array Controllers

In the days when storage arrays were crammed full of disk drives, the controllers in those arrays spent lots of time waiting on mechanical latency. This meant that the CPUs within the controllers had plenty of idle cycles where they simply had to wait for data to be stored or retrieved by the disks they were addressing. To put it another way, CPU power wasn’t such an important priority when specifying the controller hardware.

This is clearly not the case with the controllers in an all flash array, since mechanical latency is a thing of the past. The result is that controller CPUs now have no time to spare – data is constantly being handled, addressed, moved and manipulated. Suddenly, the choice of CPU has a direct effect on the array’s ability to process I/O requests.

Dedupe Kills It

But there’s more. One of the biggest shifts in behaviour seen with flash arrays is the introduction of data reduction technology – specifically deduplication. This functionality, known colloquially as dedupe, intervenes in the write process to see if an exact copy of any written block already exists somewhere else on the array. If the block does exist, the duplicate copy does not need to be written – and instead a pointer to the existing version can be stored, saving considerable space. This pointer is an example of what we call metadata – information about data.
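To make the mechanism concrete, here is a toy Python sketch of inline dedupe. It is not how any particular array implements it: the class name, the use of SHA-256 as the fingerprint and the in-memory dictionaries are all assumptions chosen for illustration, and a real product would also have to deal with hash collisions, reference counting and persistence of the metadata.

```python
import hashlib


class DedupeStore:
    """Toy block store that keeps one physical copy per unique block."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> block data (stand-in for flash media)
        self.pointers = []   # logical block number -> fingerprint (the metadata)

    def write(self, data: bytes) -> bool:
        """Store a block; return True only if it had to be physically written."""
        fingerprint = hashlib.sha256(data).hexdigest()   # CPU cost on every write
        is_new = fingerprint not in self.blocks          # metadata lookup
        if is_new:
            self.blocks[fingerprint] = data
        self.pointers.append(fingerprint)                # always record a pointer
        return is_new

    def read(self, logical_block: int) -> bytes:
        """Follow the pointer for a logical block to its single stored copy."""
        return self.blocks[self.pointers[logical_block]]


store = DedupeStore()
for block in (b"A" * 4096, b"B" * 4096, b"A" * 4096):
    store.write(block)

print(len(store.pointers), "logical blocks,", len(store.blocks), "physical blocks")
```

Even in this toy version, every write costs a hash calculation and a lookup, and the pointer table grows with every logical block stored.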

I will cover deduplication at greater length in another post, but for now there are three things to consider about the effect dedupe has on storage array controllers:

  1. Dedupe requires the creation of a fairly complex set of metadata structures – and for performance reasons much of this will need to reside in DRAM on the controllers. And as more data is stored on the array, the amount of metadata created increases – hence a growing dependency on the availability of (expensive) memory in those array controllers.
  2. The process of checking each incoming block (which involves calculating a hash value) and comparing against a table of metadata stored in DRAM is very CPU intensive. Thus array controllers which support dedupe have increasing requirements for (expensive) processing power.
  3. For storage arrays which run in an active/active configuration (i.e. with multiple redundant controllers, each of which actively sends and receives data from the persistent storage layer), much of this information will need to be passed between controllers over the array’s internal networking.

Did you spot the similarities between this list and the colourful one from earlier? If you didn’t, you must be colour blind. Flash Array controllers are much more dependent on their resources – particularly CPU and DRAM – than disk array controllers.

Summary

Flash array controllers have to do almost everything that their ancestors, the venerable disk array controllers, used to do. But they have to do it much faster and in greater volume. Not only that, but they have to do so much more… especially for the process known as data reduction. And as we’ve seen, the overhead of all these tasks causes a much greater strain on memory, processing and networking than was previously seen in the world of disk arrays – which is one of the reasons you cannot simply retrofit SSDs into a disk array architecture.

With the introduction of flash into storage, the bottleneck has moved away from the persistence layer and is now with the controllers. Over the next few articles we’ll look at what that means and consider the implications of various AFA architecture strategies on that bottleneck. After all, as is so often the case when it comes to matters of performance, you can’t always remove the bottleneck… but you can choose the one which works best for you 🙂

New Installation Cookbook: Oracle Linux 6.7 with Oracle 11.2.0.4 RAC

I’ve updated my install cookbooks page to include a new cookbook for installation of Oracle 11.2.0.4 Real Application Clusters on Oracle Linux 6.7.

This is also the first one I’ve published since I left the employment of Violin Memory to work for Kaminario, so this install uses a Kaminario K2 All Flash Array. However, it applies very well to any Oracle RAC installation which uses relatively capable storage.

Enjoy:

https://flashdba.com/install-cookbooks/oracle-linux-6-7-with-oracle-11-2-0-4-rac/

How the Next Generation of Flash Storage is Changing the Economics Of SaaS Businesses (Recorded Webinar)

This week I had the opportunity to record a webinar on a subject very close to my heart, the Software-as-a-Service industry. From 2003 to 2007 I managed the production infrastructure for a global SaaS company through the transition from startup to acquisition (partly by Salesforce.com). At the time, SaaS was a relatively new phenomenon, predating any concept of “Cloud”, but the challenges we faced then are still very relevant today.

The company was run by charismatic American entrepreneur Mark Suster, now a well known venture capitalist and blogger. Looking back, it was an incredible learning experience – but I do also remember that I spent a lot of time trying to coax more performance out of our multi-tenancy database platform, which was constantly being held back by … yes, you guessed it… disk performance.

The webinar was hosted by Kaminario (my employer) and co-presented by myself and Jeff Kaplan of ThinkStrategies. Here’s the synopsis and the link (registration required). Enjoy!

http://info.kaminario.com/how-flash-storage-is-changing-saas-businesses

Advances in all flash storage are reshaping infrastructure strategies for the modern data center. SaaS businesses are on the leading edge of adopting all flash storage as they build application delivery infrastructure that supports the performance, scalability, and agility required to deliver high quality business apps to users around the world.

Join Jeff Kaplan, Managing Director of ThinkStrategies and Chris Buckel, author of the FlashDBA Blog and Technology Evangelist from Kaminario for this webinar discussion of infrastructure strategies for modern SaaS Businesses.

Understanding Flash: The Fall and Rise of Flash Memory

This month sees the four year anniversary of some interesting events. Commonwealth countries around the world celebrated the Diamond Jubilee of Queen Elizabeth II. Whitney Houston was tragically found dead in a Beverly Hills hotel. The Caribbean was hit hard by a sargassum seaweed invasion. And I made the decision to leave the comfort of Oracle databases and join the exciting new All-Flash Array industry.

Ok, I might have been stretching the use of the word “interesting” there. But for those with an interest in flash memory, February 2012 was still a very important month due to the publication of a research paper co-authored by the University of California’s Department of Computer Science and Engineering and Microsoft Research.

The paper was entitled The Bleak Future of NAND Flash Memory – and it wasn’t pleasant reading for somebody who had just abandoned a career in databases to bet everything on flash.

The Death of Flash Memory

I have never spoken to the authors of this paper so I don’t know where the “Bleak Future” title came from, but it seems reasonable to say that it was somewhat more inflammatory than the content. In the body of the paper, the authors examined the behaviour of NAND flash memory chips as the lithography shrank – and also as the number of bits per cell increased from SLC through MLC to TLC. At the time of publication the authors were examining 25nm technology but it was already obvious that this form of NAND (known as 2D planar NAND) was going to hit physical limitations beyond which it could no longer shrink. This is known in the semiconductor world as the scaling limit.

The paper concluded:

“SSDs will continue to improve by some metrics (notably density and cost per bit), but everything else about them is poised to get worse. This makes the future of SSDs cloudy”

This sentiment, along with the “bleak future” thing, caused a bit of a stir in the tech world. The Register, for example, ran a typically tongue-in-cheek headline: “Flash DOOMED to drive itself off a cliff – boffins“. Various industry bloggers discussed the potential of technologies like ReRAM to take over for the next decade, while HP made its annual claim that Memristor technology (a form of ReRAM) would soon be here to save the day. I started wondering if I should register the domain name ReRamDBA.com

The Resurrection – Now Showing in 3D

Four years later, ReRAM is still just around the corner but now in the form of Intel and Micron’s 3D XPoint technology, while HP has significantly backtracked on its Memristor programme. Flash memory, meanwhile, is still going strong thanks to the introduction of vertical or 3D NAND.

“Reports of my death have been greatly exaggerated” – Mark Twain

Of course, hindsight is a wonderful thing. It’s easy to look back now at the publication of the Bleak Future… paper and consider it flawed. To see the flaws at the time of publication would have required a bit more thought.

So that’s why this month’s hero is Allyn Malventano, Storage Editor for PC Perspective, who published an article on 21st February 2012 (the same month!) called NAND Flash Memory – A Future Not So Bleak After All in which he described the original publication as “bad science”. Allyn’s conclusion was so prescient that I’m going to quote it right here (although you should read the whole article to get the full context):

“The point I want all of you to take home here is that just as with the CPU, RAM, or any other industry involving wafers and dies, the manufacturers will adapt and overcome to the hurdles they meet. There is always another way, and when the need arises, manufacturers will figure it out.”

Bravo. Samsung is now manufacturing its third generation V-NAND chips, while the Toshiba/SanDisk and Intel/Micron partnerships are both going 3D. Samsung’s V-NAND has already moved from 24 through 32 to 48 layers, while it has been theorised that there is no natural limit on the number of layers possible.

Of course, there’s always the spectre of a new technology sweeping everything before it – and the big story right now is Intel/Micron’s 3D XPoint technology. Will it take over from flash in the future? Who knows.

One thing I do know is that new technologies find their rightful place when they are both technically capable and economically viable. If 3D XPoint or any other non-volatile memory product can win the day, it will leave us all better off – and hopefully without the need for alarmist research papers.

Now, if you’ll excuse me, I’m off to check on the availability of the 3D-XPointDBA.com domain name…

(You can read more about 3D NAND here)

Understanding Flash: What is 3D NAND?


About 18 months ago I wrote a post describing the different types of NAND flash known as SLC, MLC and TLC. However, 18 months is a lifetime in the world of technology so now I need to clarify it based on the widespread adoption of a new type of NAND flash. Let me explain…

Recap: 2D Planar NAND

Until recently, most of the flash memory used for data storage was of a form known as 2D Planar NAND and could be found in three types called Single Level Cell (SLC), Multi-Level Cell (MLC) and TLC (Triple-Level Cell). I always used to use my bucket of electrons analogy to describe the difference between them:


Each cell within planar NAND flash memory stores charge in a way similar to how a bucket stores water. By considering an imaginary line half-way up the bucket we can assign a binary one or zero based on whether the bucket contains more or less water than the line. Thus a full bucket, or a fully-charged NAND cell, denotes a zero while an empty bucket / cell denotes a one… assuming we are considering SLC, where each bucket stores one bit.

Moving to MLC (two bits) or TLC (three bits) is therefore a case of adding more lines, allowing us to differentiate between more states within the same bucket. The benefit is double (MLC) or triple (TLC) density but the drawback is that there will be a lower margin for error when measuring the amount of water/charge stored. As a consequence, the actions of reading, writing and erasing take longer while the endurance of the cell also drops drastically (leaky buckets are more of a problem as you try to be more precise about the measurements). The original article covers this all in more detail.
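To put some numbers on the lines in the bucket, here is a minimal Python sketch. It assumes an idealised cell whose charge range is split evenly between levels, which is a simplification for illustration rather than how any real NAND part is specified, and the function names are my own.

```python
# Idealised model: a cell storing n bits must distinguish 2**n charge levels.
# Real NAND does not space its levels evenly; this is illustration only.

CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3}


def charge_levels(bits_per_cell: int) -> int:
    """Number of distinct charge states needed to encode the given bits."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell: int) -> float:
    """Share of the full charge range available to each level (assumed even split)."""
    return 1 / charge_levels(bits_per_cell)


for name, bits in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s), {charge_levels(bits)} levels, "
          f"{relative_margin(bits):.1%} of the range per level")
```

Under this simple model a TLC cell has to tell eight levels apart in the same bucket where SLC needs only two, which is why the measurements have to be so much more precise.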

Shrinking Lithographies

If you remember, I also talked about the way that flash memory manufacturers are constantly shrinking the size of NAND flash cells in order to make increasingly dense packages, thus reducing the cost – but that the technology was now approaching its physical limits. In the bucket example, just imagine that the buckets are getting smaller and smaller. This is initially a good thing because smaller buckets (actually floating gate transistors) mean more buckets can fit in the same overall space, but in time the buckets become so small that they are no longer manageable – and then the technology hits a brick wall.

So Why Is It 2D?

In NAND flash memory, sets of cells are connected together in a string to form a NAND gate:


Image courtesy of Warren Miller at Avnet

If you consider one of the pieces of silicon substrate contained inside a flash chip as a rectangle with dimensions X and Y, each one of these strings of cells will take up some space stretching out in one of these two dimensions. Shrinking the lithography, i.e. manufacturing everything on a smaller scale, will give us the opportunity to fit more strings on the same amount of substrate. But as we previously discussed, there comes a point when things are simply too small and too close together, resulting in interference and leakage.

3D NAND: Going Vertical

Image courtesy of Kristian Vättö at AnandTech


The cost of a semiconductor is proportional to the die size. It is therefore a good thing for the cost if more electronics can be crammed into the same tiny piece of silicon. The fundamental difference in 3D NAND, which gives rise to its name, is that the strings previously described are now arranged vertically – in other words, in the Z dimension. For this reason, Samsung calls the technology V-NAND.

Imagine the string of cells shown earlier, but this time stood on its end and then folded in two to make a U shape. We now have a vertical string which takes up only a fraction of the original space in the X and Y dimensions. What’s more, we can continue to build in the Z dimension as manufacturing processes allow. Samsung’s first generation of V-NAND had strings of 24 layers, while the second generation had 32. The latest 3rd generation now has 48. And as Jim Handy explains, there are few theoretical limits on the number of layers possible. (Just to be clear, these layers are all within the same “wafer” of silicon, otherwise there would be no cost benefit…)

Crucially, since the move to a Z dimension relieves the pressure on the X and Y dimensions, 3D flash is actually free to return to a slightly larger lithography, thus avoiding all of the nasty problems that 2D planar NAND was starting to hit as it approached the 10nm range.

Charge Trap Flash and 3D TLC

Aside from the vertical stacking, there is another fundamental change with 3D NAND – it no longer uses floating gate transistors (yes, that’s right, the buckets from earlier). Instead, it uses a technology called Charge Trap Flash. I’m not going to attempt an explanation of CTF here, but it was memorably described by Samsung as like using cheese instead of water. So, instead of the buckets from earlier, picture cheese.

This cheese has a number of benefits over floating gate transistors in terms of endurance and power consumption, but it still works in a similar way in terms of the number of bits that can be stored – in other words SLC, MLC and TLC. However, because of its better endurance rates, with 3D NAND it is now a realistic proposition to use TLC to replace 2D planar MLC (something my employer Kaminario has already embraced).

This is big news. The cost per bit of 3D TLC NAND flash is revolutionary, with plenty of room for further developments as the flash fabricators add more layers. Three years ago it looked like NAND flash was a technology in terminal decline, but with 3D techniques the future is bright. We might even get to a point soon where we see the introduction of…

Quad-Level Cell (QLC) Flash

If the endurance of CTF-based 3D NAND is acceptable, it’s not hard to envisage one of the flash fabricators releasing a quadruple-level cell version of the medium. The potential benefit is a one-third increase in density over TLC (four bits per cell instead of three) for roughly the same cost.
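As a rough check on what each extra bit per cell buys in raw density, here is a small Python sketch. It only counts bits per cell; it ignores endurance, error correction overhead and everything else that matters in a real product, and the helper name is my own.

```python
from fractions import Fraction


def density_gain(bits_before: int, bits_after: int) -> Fraction:
    """Fractional increase in bits stored per cell when moving between cell types."""
    return Fraction(bits_after - bits_before, bits_before)


# Bits per cell: SLC=1, MLC=2, TLC=3, QLC=4
steps = [("SLC", 1, "MLC", 2), ("MLC", 2, "TLC", 3), ("TLC", 3, "QLC", 4)]
for before, b0, after, b1 in steps:
    gain = density_gain(b0, b1)
    print(f"{before} -> {after}: +{float(gain):.0%} bits per cell")
```

Each step adds one bit, so the relative gain shrinks every time, while the number of charge levels the cell has to distinguish keeps doubling.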

After all, everybody wants more cheese… right?

All Flash Arrays: Hybrid Means Compromise


Sometimes the transition between two technologies is long and complicated. It may be that the original technology is so well established that it’s entrenched in people’s minds as simply “the way things are” – inertia, you might say. It could be that there is more than one form of the new technology to choose from, with smart customers holding back to wait and see which emerges as a stronger contender for their investment. Or it could just be that the newer form of technology doesn’t yet deliver all of the benefits of the legacy version.

The automotive industry seems like a good example here. After over a century of using internal combustion engines, we are now at the point where electric vehicles are a serious investment for manufacturers. However, fully-electric vehicles still have issues to overcome, while there is continued debate over which approach is better: batteries or hydrogen fuel cells. Needless to say, the majority of vehicles on the road today still use what you could call the legacy method of propulsion.

However, one type of vehicle which has been successful in gaining market share is the hybrid electric vehicle. This solution attempts to offer customers the best of both worlds: the lower fuel consumption and claimed environmental benefits of an electric vehicle, but with the range, performance and cost of a fuel-powered vehicle. Not everybody believes it makes sense, but enough do to make it a worthwhile venture for the manufacturers.

Now here’s the interesting thing about hybrid vehicles… the thing that motivated me to write two paragraphs about cars instead of flash arrays… Nobody believes that hybrid electric vehicles are the permanent solution. Everybody knows that hybrids are a transient solution on the way to somewhere else. Nobody at all thinks that hybrid is the end-game. But the people who buy hybrid cars also believe that this state of affairs will not change during the period in which they own the car.

Hybrid Flash Arrays (HFAs)

There are two types of flash storage architecture which could be labelled a hybrid – those where a disk array has been repopulated with flash and those which are designed specifically for the purpose of mixing flash and disk. I’ve talked about naming conventions before; it’s a tricky subject. But for the purposes of this article I am only discussing the latter: systems where the architecture has been designed so that disk and flash co-exist as different tiers of storage. Think along the lines of Nimble Storage, Tegile and Tintri.

Why do this? Well, as with hybrid electric vehicles the idea is to bridge the gap between two technologies (disk and flash) by giving customers the best of both worlds. That means the performance of flash plus its low power, cooling and physical space requirements – combined with the density of disk and its corresponding impact on price. In other words, if disk is cheap but slow while flash is fast but expensive, HFAs are aimed at filling the gap.
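The trade-off can be sketched with a simple weighted average. This is a minimal Python illustration, not a model of any vendor's tiering: the latency and $/GB figures are hypothetical numbers I have picked purely to show the shape of the curve, and the function names are my own.

```python
def blended_latency_us(flash_hit_ratio, flash_latency_us, disk_latency_us):
    """Average read latency when a fraction of I/Os is served from the flash tier."""
    return flash_hit_ratio * flash_latency_us + (1 - flash_hit_ratio) * disk_latency_us


def blended_cost_per_gb(flash_fraction, flash_cost, disk_cost):
    """Average $/GB when a fraction of raw capacity is flash and the rest is disk."""
    return flash_fraction * flash_cost + (1 - flash_fraction) * disk_cost


# Hypothetical figures, chosen only to show the shape of the trade-off.
FLASH_LATENCY_US, DISK_LATENCY_US = 500, 8000
FLASH_COST, DISK_COST = 5.0, 0.5  # $/GB

for hit_ratio in (0.5, 0.9, 0.99):
    latency = blended_latency_us(hit_ratio, FLASH_LATENCY_US, DISK_LATENCY_US)
    print(f"hit ratio {hit_ratio:.0%}: average latency {latency:.0f} us")

print(f"10% flash by capacity: ${blended_cost_per_gb(0.1, FLASH_COST, DISK_COST):.2f}/GB")
```

With these assumed numbers, even a 90% flash hit ratio leaves the disk misses contributing most of the average latency, which is why the quality of the tiering matters so much in a hybrid design.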

Hybrid (adjective)
of mixed character; composed of different elements.
bred as a hybrid from different species or varieties.

As you can see there are a lot of parallels between this trend and that of the electric vehicle. Also, most storage systems are purchased with a five-year refresh cycle in mind, which is not dissimilar to the average length of ownership of a car. But there’s a massive difference: the rate of change in the development of flash memory technology.

In recent years the density of NAND flash has increased by orders of magnitude, especially with the introduction of 3D NAND technology and the subsequent use of Triple-Level Cell (TLC). And when the density goes up, the price comes down – closing the gap between disk and flash. In fact we’re at the point now where Wikibon predicts that “flash … will become a lower cost media than disk … for almost all storage in 2016”:

Image courtesy of Wikibon "Evolution of All-Flash Array Architectures" by David Floyer (2015)


That’s great news for customers – but definitely not for HFA vendors.

Conclusion

And so we reach the root of the problem with HFAs. It’s not just that they are slower than All Flash Arrays. It’s not even that they rely on the guesswork of automatic tiering algorithms to move data between their tiers of disk and flash. It’s simply that their entire existence is predicated on the idea of being a transitory solution designed to bridge a gap which is already closing faster than they can fill it.

If you want proof of this, just look at the three HFA vendors I name-checked earlier – all of which are rushing to bring out All Flash versions of their arrays. Nimble Storage is the only one of the three to be publicly listed – and its recent results indicate a strategic rethink may be required.

When it comes to hybrid electric vehicles, it’s true that the concept of mass-owned fully-electric cars still belongs in the future. But when it comes to hybrid flash arrays, the adoption of All-Flash is already happening today. The advice to customers looking to invest in a five-to-seven year storage project is therefore pretty simple: Mind the gap.

Why Kaminario?


This summer I made the decision to leave my previous employer and join another vendor in the All Flash Array space – a company called Kaminario. A lot of people have been in touch to ask me about this, so I thought I’d answer the question here… Why Kaminario?

To answer the question, we first need to look at where the All Flash industry finds itself today…

The Path To Flash Adoption

We all know that disk-based storage has been struggling to deliver to the enterprise for many years now. And most of us are aware that flash memory is the technology most suitably placed to take over the mantle as the storage medium of choice. However, even keeping in mind the typical five-to-seven year refresh cycle for enterprise storage platforms, the journey to adopt flash in the enterprise data centre has been slower than some might have expected. Why?

There are three reasons, in my view. The first two are pretty obvious: cost and functionality. I’ll cover the third in another post – but cost and functionality have changed drastically over the four phases of flash:

Phase One: Extreme Performance

The early days of enterprise flash storage were pioneered by the likes of Fusion-io with their PCIe flash cards. These things sold for a $/GB price that would seem obscene in today’s AFA marketplace – and (at least initially) they had almost no functionality in terms of thin provisioning, replication, snapshots, data reduction technologies and so on. They weren’t even shared storage! They were just fast blobs of flash that you could stick right inside your server to get performance which, at the time, seemed insanely fast – think <250 microseconds of latency.

This meant they were only really suitable for extreme performance requirements, where the cost and complexity were justified by the resultant improvement to the application.

Phase Two: Niche Performance Applications

The next step on the path to flash adoption was the introduction of flash as shared storage (i.e. SAN). These were the first All Flash Arrays, a marketplace pioneered by Violin Memory (my former employer) and Texas Memory Systems (subsequently acquired by IBM). The fact that they were shared allowed a larger number of applications to be migrated to flash, but they were still very much used as a niche performance play due to a lack of features such as data reduction, replication etc.

Phase Three: Virtualization for Servers and Desktops

The third phase was driven by the introduction of a very important feature: data reduction. By implementing deduplication and/or compression – therefore massively reducing the effective price in $/GB – a couple of new entrants to the AFA space were able to redefine the marketplace and leave the pioneering AFA vendors floundering. These new players were Pure Storage and EMC with its XtremIO system – and they were able to create and attack an entirely new market: virtualization. Initially they went after Virtual Desktop Infrastructure projects, which have lots of duplicate data and create lots of IOPS, but in time the market for Virtual Server Infrastructure (i.e. VMware, Hyper-V, Xen etc) became a target too.
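The effect of data reduction on the effective price is simple division, sketched below in Python. The raw price and the reduction ratios are hypothetical values chosen for illustration, not figures from any vendor, and the function name is my own.

```python
def effective_cost_per_gb(raw_cost_per_gb: float, reduction_ratio: float) -> float:
    """Cost per GB of data stored, given a data reduction ratio such as 4.0 for 4:1."""
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return raw_cost_per_gb / reduction_ratio


RAW_COST = 10.0  # hypothetical $/GB of usable flash, not a real price

for ratio in (1.0, 2.0, 4.0, 8.0):
    cost = effective_cost_per_gb(RAW_COST, ratio)
    print(f"{ratio:.0f}:1 reduction -> ${cost:.2f} per effective GB")
```

This is why workloads with lots of duplicate data, such as VDI, were the natural first target: the higher the achievable ratio, the further the effective price falls.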

Phase Four: General Purpose Storage

This is where we are now – or at least, it’s where we’ve just arrived. The price of flash storage has consistently dropped as the technology has advanced, while almost all of the features and functionality originally found on enterprise-class disk arrays are now available on AFAs. We’re finally at the point now where, with some caveats, customers are either moving to or planning the wholesale replacement of their general purpose disk arrays with All Flash. Indeed, with the constant evolution of NAND flash technology it’s no longer fanciful to believe that Backup and Archive workloads could also move to flash…

We are now at the inflection point where, thanks to the combination of data reduction features and constantly-evolving NAND flash development, the cost of All Flash storage has fallen as low as enterprise disk storage while delivering all the functionality required to replace disk entirely. We call this concept the All-Flash Data Centre.

So to answer the question at the start of this post, I have joined Kaminario because I believe they are ideally placed, architecturally and commercially, to lead the adoption of this new phase of flash storage – a technology that I fundamentally believe in.

Making The All-Flash Data Centre A Reality

I mentioned earlier that NAND flash is changing and evolving all the time. It reminds me a little of smartphones – you buy the latest and greatest model only for it to become yesterday’s news almost before you’ve worked out how to use it. But the typical refresh cycle of a smartphone is one-to-two years, while for enterprise storage it’s five-to-seven. That’s a long time to risk an investment in evolving technology.

Kaminario’s K2 All Flash Array is based on its SPEAR architecture. Essentially, what Kaminario has created is a high-performance, scalable framework for taking memory and presenting it as enterprise-class storage – with all the resilience, functionality and performance you would expect. When the company was first founded this memory was just that: DRAM. But since NAND flash became economically viable, Kaminario has been using flash – and the architecture is agile enough to adopt whichever technology makes the most sense in the future.

As an example of this agility, Kaminario was the first AFA vendor to adopt 3D NAND technology and the first to adopt 3D TLC. This obviously allows a major competitive advantage when it comes to providing the most cost-efficient All Flash Array. But what really drew me to Kaminario was their decision to allow customers to integrate future hardware (such as new types of flash) into their existing arrays rather than making them migrate to a new product as is typical in the industry. By protecting customers’ investments, Kaminario is taking some of the risk out of moving to an AFA solution. It calls this programme the Perpetual Array.

In addition to this, Kaminario has a unique ability to offer both scale out and scale up architecture (scalability is something I will discuss further in my Storage for DBAs series soon) and to deliver workload agnostic performance… all technical features that deliver real business value. But those are for discussion another day.

For today the message is simple: Kaminario is making the All Flash Data Centre a reality… and I want to be here to help customers make that happen.

All Flash Arrays: SSD-based versus Ground-Up Design


In recent articles in this series I’ve been looking at the architectural choices for building All Flash Arrays (AFAs). I surmised that there are three main approaches:

  • Hybrid Flash Arrays
  • SSD-based All Flash Arrays
  • Ground-Up All Flash Arrays (which from here on I’ll refer to as Custom Flash Module arrays or CFM arrays)

I’ve already blown metaphorical raspberries at the hybrid approach, so now it’s time to cover the other two.

SSD or CFM: The Big Question

I think the most interesting question in the AFA industry right now is whether the SSD or CFM design will win. Of course, it’s easy to say “win” like that as if it’s a simple race, but this is I.T. – there’s never a simple answer. The reality is that each method offers benefits and drawbacks, so I’m going to use this blog post to simply describe them as I see them.

Before I do that, let me just remind you of what the vendor landscape looks like at this time:

SSD-based architecture: Right now you can buy SSD-based arrays from EMC (XtremIO), Pure Storage, Kaminario, SolidFire, HP 3PAR and Huawei to name a few. It’s fair to say that the SSD-based design has been the most common in the AFA space so far.

CFM-based architecture: On the other hand, you can now buy ground-up CFM-based arrays from Violin Memory, IBM (FlashSystem), HDS (VSP), Pure Storage (FlashArray//m) and EMC (DSSD). The latter has caused some excitement because of DSSD’s current air of mystery in the marketplace – in other words, the product isn’t yet generally available.

So which approach is “the best”?

The SSD-based Approach

If you were going to start an All Flash Array company and needed to bring a product to market as soon as possible, it’s quite likely you would go down the SSD route. Apart from anything else, flash management is hard work – and needs constant attention as new types of flash come to market. A flash hardware engineer friend of mine used to say that each new flash chip is like a snowflake – they all behave slightly differently. So by buying flash in the ready-made form of an SSD you bypass the requirement to put in all this work. The flash controller from the SSD vendor does it for you, leaving you to concentrate on the other stuff that’s needed in enterprise storage: resilience, availability, data services, etc.

On the other hand, it seems clear that an SSD is a package of flash pretending to behave like a disk. That often means I/Os are taking place via protocols that were designed for disk, such as Serial Attached SCSI. Also, in a unit the size of an all flash array there are likely to be many SSDs… but because each one is an isolated package of flash, they cannot work together and manage the flash holistically. In other words, if one SSD is experiencing issues due to garbage collection (for example), the others cannot take the strain.

The Ground-Up Approach

For a number of years I worked for Violin Memory, which adopted the ground-up approach at its very core. Violin’s position was that only the CFM approach could unlock the full potential benefits from NAND flash. By tightly integrating the NAND flash into its array – and by using its own controllers to manage that flash – Violin believed it could deliver the best performance in the AFA market. On the other hand, many SSD vendors build products for the consumer market where the highest levels of performance simply aren’t necessary. All that’s required is something faster than disk – it doesn’t always have to be the fastest possible solution.

It could also be argued that any CFM vendor who has a good relationship with a flash fabricator (for example, Violin was partly-owned by Toshiba) could gain a competitive advantage by working on the very latest NAND flash technologies before they are available in SSD form. What’s more, SSDs represent an additional step in the process of taking NAND flash from chip to All Flash Array, which potentially means there’s an extra party needing to make their margin. Could it be that the CFM approach is more cost effective? [Update from Jan 2017: Violin Memory has now filed for chapter 11 bankruptcy protection]

SSD Economics

The argument about economics is an interesting one. Many technical people have a tendency to focus on what they know and love: technology. I’m as guilty of this as anyone – given two solutions to a problem I tend to gravitate toward the one that has the most elegant technical design, even if it isn’t necessarily the most commercially-favourable. Taking raw flash and integrating it into a custom flash module sounds great, but what is the cost of manufacturing those CFMs?

Manufacturing is all about economies of scale. If you design something and then build thousands of them, it will obviously cost you more per unit than if you build millions of them. How many ground-up all flash vendors are building their custom flash modules by the millions? In May 2015, IBM issued this press release in which they claimed that they were the “number one all-flash storage array vendor in 2014“. How many units did they ship? 2,100.

In just the second quarter of 2015, almost 24 million SSDs were shipped to customers, with Samsung responsible for 43.8% of that total (according to US analyst firm Trendfocus, Inc). Who do you think was able to achieve the best economy of scale?
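To see why volume matters so much, here is a minimal Python sketch of the usual amortisation arithmetic: fixed costs spread across the production run, plus a per-unit cost. The shipment figures come from the text above, but the fixed and variable costs are hypothetical values I have chosen only to show the effect, and the function name is my own. Bear in mind too that an array contains many flash modules, so array counts and SSD counts are not directly comparable.

```python
def per_unit_cost(fixed_cost: float, variable_cost: float, units: int) -> float:
    """Fixed costs (design, tooling, QA) spread over the run, plus per-unit cost."""
    return fixed_cost / units + variable_cost


# Figures quoted in the text (the SSD total is "almost 24 million", so approximate).
SSDS_SHIPPED_Q2_2015 = 24_000_000
SAMSUNG_SHARE = 0.438

samsung_ssds = SSDS_SHIPPED_Q2_2015 * SAMSUNG_SHARE
print(f"Samsung SSDs in one quarter: roughly {samsung_ssds / 1e6:.1f} million")

# Hypothetical cost inputs, purely to show how volume changes the result.
FIXED, VARIABLE = 5_000_000.0, 200.0
for units in (10_000, 1_000_000, 10_000_000):
    print(f"{units:>10,} units: ${per_unit_cost(FIXED, VARIABLE, units):,.2f} each")
```

With these assumed inputs the fixed-cost share of each unit all but disappears once volumes reach the millions, whereas at ten thousand units it dominates the price.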

Design Agility

The other important question is the one about New Stuff ™. We are always being told about fantastic new storage technologies that are going to change our lives, so who is best placed to adopt them first?

Again there’s an argument to be made on both sides. If the CFM flash vendor is working hand-in-glove with a fabricator, they may have access to the latest technology coming down the line. That means they can be prepared ahead of the pack – a clear competitive advantage, right?

But how agile is the CFM design? Changing the NVM media requires designing an entirely-new flash module, with all the associated hardware engineering costs such as prototyping, testing, QA and limited initial manufacturing runs.

For an SSD all flash array vendor, however, that work is performed by the SSD vendor… again somebody like Samsung, Intel or Micron who have vast infrastructures in place to perform that sort of work all the time. After all, a finished SSD must behave exactly like a disk, regardless of what NVM technology it uses under the covers.

Conclusion

There are obviously two sides to this argument. The SSD was designed to replace a fundamental bottleneck in storage systems: the hard disk drive. Ironically, it may be the fate of the SSD to become exactly what it replaced. For flash to become mainstream it was necessary to create a “flash-behaving-as-disk” package, but the flip side of this is the way that SSDs stifle the true potential of the underlying flash. (Although perhaps NVMe technologies will offer us some salvation…)

However, unless you are a company the size of Samsung, Intel or Micron it seems unlikely that you would be able to retain the manufacturing agility and economies of scale required to produce custom flash modules at the price point of SSDs. Nor would you be likely to have the agility to adopt new NVM technologies at the moment that they become economically preferable to whatever medium you were using previously.

Whatever happens, you can be sure that each side will claim victory. With the entire primary data market to play for, this is a high stakes game. Every vendor has to invest a large amount of money to enter the field, so nobody wants to end up being consigned to the history books as the Betamax of flash…

For younger readers, Betamax was the loser in a battle with VHS over who would dominate the video tape market. You can read about it here. What do you mean, “What is a video tape?” Those things your parents used to watch movies on before the days of DVDs. What do you mean, “What is a DVD?” Jeez, I feel old.