All Flash Arrays: What Is An AFA?

All Flash Arrays - Hybrid, SSD-based or Ground-Up

For the last couple of years I’ve been writing a series of blog posts introducing the concepts of flash-memory and solid state storage to those who aren’t part of the storage industry. I’ve covered storage fundamentals, some of what I consider to be the enduring myths of storage, a section of unashamed disk-bashing and then a lengthy set of articles about NAND flash itself.

Now it’s time to talk about all flash arrays. But first, a warning.

Although I work for a flash array vendor, I have attempted to keep my posts educational and relatively unbiased. That’s pretty tricky when talking about the flash media, but it’s next to impossible when talking about arrays themselves. So from here on this is all just my opinion – you can form your own and disagree with me if you choose – there’s a comment box below. But please be up front if you work for a vendor yourself.

All Flash Array Definition(s)

It is surprisingly hard to find a common definition of the All Flash Array (or AFA), but one thing that everyone appears to agree on is the shared nature of AFAs – they are network-attached, shared storage (i.e. SAN or NAS). After that, things get tricky.

IDC, in its 2015 paper Worldwide Flash Storage Solutions in the Datacenter Taxonomy, divides network-attached flash storage into All Flash Arrays (AFAs) and Hybrid Flash Arrays (HFAs). It further divides AFAs into categories based on their use of custom flash modules (CFMs) and solid state disks (SSDs), while HFAs are divided into categories of mixed (where both disks and flash are used) and all-flash (using CFMs or SSDs but with no disk media present).

Did you make it through that last paragraph? Perhaps, like me, you find the HFA category “all-flash” confusingly named given the top-level category of “all flash arrays”? Then let’s go and see what Gartner says.

Gartner doesn’t even get as far as using the term AFA, preferring the term Solid State Array (or SSA). I once asked Gartner’s Joe Unsworth about this (I met him in the kitchen at a party – he was considerably more sober than I) and he explained that the SSA term is designed to cope with any future NAND-flash replacement technology, rather than restricting itself to flash-based arrays… which seems reasonable enough, but it does not appear to have caught on outside of Gartner.

The big catch with Gartner’s SSA definition is that, to qualify, any potential SSA product must be positioned and marketed “with specific model numbers, which cannot be used as, upgraded or converted to general-purpose or hybrid storage arrays”. In other words, if you can put a disk in it, you won’t see it on the Gartner SSA magic quadrant – a decision which has drawn criticism from industry commentators for the way it arbitrarily divides the marketplace (with a response from Gartner here).

The All Flash Array Definition at flashdba.com

So that’s IDC and Gartner covered; now I’m going to give my definition of the AFA market sector. I may not be as popular or as powerful as IDC or Gartner but hey, this is my website and I make the rules.

In my humble opinion, an AFA should be defined as follows:

An all flash array is a shared storage array in which all of the persistent storage media consists of flash memory.

Yep, if it’s got a disk in it, it’s not an AFA.

This then leads us to consider three categories of AFA:

Hybrid AFAs

The hybrid AFA is the poor man’s flash array. Its performance can best be described as “disk plus” and it is extremely likely to descend from a product which is available in all-disk, mixed (disk+SSD) or all-SSD configurations. Put simply, a hybrid AFA is a disk array in which the disks have been swapped out for SSDs. There are many of these products out there (EMC’s VNX-F and HP’s all-flash 3PAR StoreServ spring to mind) – and often the vendors are at pains to distance themselves from this definition. But the truth lies in the architecture: a hybrid AFA may contain flash in the form of SSDs, but it is fundamentally and inescapably architected for disk. I will discuss this in more detail in a future article.

SSD-based AFAs

The next category covers all-flash arrays that have been architected with flash in mind but which only use flash in the form of solid state drives (SSDs). A typical SSD-based AFA consists of two controllers (usually Intel x86-based servers) and one or more shelves of SSDs – examples would be EMC’s XtremIO, Pure Storage, Kaminario and SolidFire. Since these SSDs are usually sourced from a third party vendor – as indeed are the servers – the majority of the intellectual property of an SSD-based array concerns the software running on the controllers. In other words, for the majority of SSD-based array vendors the secret sauce is all software. What’s more, that software generally doesn’t cover the tricky management of the flash media, since that task is offloaded to the SSD vendor’s firmware. And from a purely go-to-market position (imagine you were founding a company that made one of these arrays), this approach is the fastest.

Ground-Up AFAs

The final category is the ground-up designed AFA – one that is architected and built from the ground up to use raw flash media straight from the NAND flash fabricators. There are, at the time of writing, only two vendors in the industry who offer this type of array: Violin Memory (my employer) and IBM with its FlashSystem. A ground-up array implements many of its features in hardware and also takes a holistic approach to managing the NAND flash media, because it is able to orchestrate the behaviour of the flash across the entire array (whereas SSDs are essentially isolated packages of flash). So in contrast with the SSD-based approach, the ground-up array has a much larger proportion of its intellectual property in its hardware.

Why are there only two ground-up AFAs on the market? Well, mainly because it takes a lot longer to create this sort of product: Violin is ten years old this year, while IBM acquired the RamSan product from Texas Memory Systems who had been around since 1978. In comparison, the remaining AFA companies are mostly under six years old. It also requires hardware engineering with NAND flash knowledge, usually coupled to a relationship with a NAND flash foundry (Violin, for example, has a strategic alliance with Toshiba – the inventor of NAND flash).

Which Is Best?

Ahh well that’s the question, isn’t it? Which architecture will win the day, or will something else replace them all? So that’s what we’ll be looking at next… starting with the Hybrid Array. And while I don’t want to give away too much too soon <spoiler alert>, in my book the hybrid array is an absolute stinker.

Was I Mentioned During Oracle’s Q4 2015 Results Call?

In a proud moment for me, it appears that Mark Hurd, CEO of Oracle, has mentioned my flashdba blog during the Oracle Q4 2015 results call. At least, that’s what I’m reading into this section from the transcript published by Seeking Alpha:

We grew in storage in the quarter and this is — really we are going through a shift in storage now. We released our SAN product FS1 in the quarter which saw some bookings. This is really the first quarter we got any bookings out of FS1 or GFS product, somebody’s renamed that but I haven’t recently – BS1. I wish they wouldn’t do that to me but so they renamed BS1 – so I missed the – but anyway so we had good growth in PaaS – as well.

I’m pretty sure that my blog post entitled Postcard from Oracle OpenWorld 2014: The Oracle FS1 Flash Array was the first place in which Oracle’s newly-announced FS1 Flash Storage System was ironically described as the “BS1 Flash Storage Array” due to some of the baffling marketing claims made at its announcement during Oracle OpenWorld 2014. Claims like, “The Oracle FS1 is the first mainstream, general purpose flash array”.

I haven’t heard the recording of the call, just read the transcript, but it appears to me that Mr Hurd uses the BS1 phrase to get some laughs from the analysts on the call.

So hey, thanks Mark! It’s exciting to know that finally, even in a small way, I’ve been able to make a contribution at the highest levels within Oracle. I am open to discussions about filling a new role as Oracle’s SVP of Investor Comedy Moments. And with those results, it could be an increasingly essential role…!

Understanding Flash: Summary – NAND Flash Is A Royal Pain In The …

So this is it – the last article in my mini-series on understanding flash. This is the bit where I draw it all together in a neat conclusion that makes you think, “Yes! That was worth reading”. No pressure eh?

So let me start with the conclusion first: as a storage medium, NAND flash is a royal pain in the ass.

Chaos

Why? Well, let’s look back at what we’ve learned in the previous 9 articles:

In short, NAND flash is a tricky medium to use for enterprise storage. A whole lot of work is required to make a collection of flash chips appear to be a unified, resilient block of storage with fast, predictable performance.

And I haven’t even told you everything. Consider, for example, the phenomenon of read disturb. When you read a page within a NAND flash chip, you cause a very minor electronic field in the locality of the cells it contains. That field will cause a small disturbance to any neighbouring cells – usually not enough to cause concern, but significant nevertheless. So what happens when you repeatedly read that page? Eventually, after X number of reads, the data stored within the nearby cells becomes questionable.

The solution, therefore, is to keep track of the number of times each page is disturbed in this manner and then set a threshold (let’s say 50 disturbances) beyond which you will copy the data out to a clean page and then mark the old page as stale. Easy.

But just think about what that means for a moment. Remember when I said that write amplification was mainly impacted by write workloads? This new piece of information means that even on a 100% read workload there will be additional back-end writes taking place on the array. Just another example of why flash is a tricky medium to manage.
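To make that bookkeeping concrete, here is a minimal sketch of a read-disturb counter. It is a toy model under loud assumptions: the class name, the per-page counter and the pool of clean pages are all invented for illustration, and the threshold of 50 is simply the "let's say" figure from above. Real flash controllers are far more involved.

```python
from dataclasses import dataclass, field

# Illustrative threshold only: the "let's say 50 disturbances" figure from the text.
READ_DISTURB_THRESHOLD = 50


@dataclass
class FlashBlockModel:
    """Toy model of read-disturb tracking (hypothetical, not a real controller)."""
    disturb_counts: dict = field(default_factory=dict)  # page -> disturbances so far
    stale_pages: set = field(default_factory=set)       # pages whose data was moved
    backend_writes: int = 0                              # writes caused purely by reads
    next_clean_page: int = 1000                          # hypothetical clean-page pool

    def read(self, page, neighbours):
        """Reading `page` disturbs every live page listed in `neighbours`."""
        for victim in neighbours:
            if victim in self.stale_pages:
                continue  # nothing valuable left on a stale page
            count = self.disturb_counts.get(victim, 0) + 1
            self.disturb_counts[victim] = count
            if count >= READ_DISTURB_THRESHOLD:
                self._relocate(victim)

    def _relocate(self, page):
        """Copy the data to a clean page (one back-end write), mark the old page stale."""
        self.backend_writes += 1
        self.next_clean_page += 1
        self.stale_pages.add(page)
        del self.disturb_counts[page]


# A 100% read workload: page 0 is read repeatedly, disturbing its neighbour, page 1.
model = FlashBlockModel()
for _ in range(100):
    model.read(page=0, neighbours=[1])
print(model.backend_writes, sorted(model.stale_pages))
```

Not a single host write was issued in that loop, yet the model still ends up performing a back-end write to rescue page 1.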

Order

Of course, it would be remiss of me not to mention that NAND flash brings a tremendous set of benefits along with these problems. You could say they come as a package (oh come on, that was one of my better puns).

Let’s go back to basics for a moment: if you want to take a defined quantity of work and do it in a shorter amount of time, what are your choices? Put simply, there are two options: do the same work faster, or do more of it in parallel (and of course both options can be used together for extra gain).

The basic building block of a disk array is, obviously, the hard disk drive. I’ve already explained at tedious length the performance gap between disk and flash, so we know that we can access data faster using flash. Technologies like RAID allow multiple disks to be used in parallel to achieve performance (and resilience) gains, but given a limited amount of physical space (such as a data centre rack), how many hard drives can you actually squeeze into one system?

Now compare this to the number of NAND flash packages you could fit into the same space, all of which you could potentially utilise in parallel and at a lower latency. Doing the same work faster – and doing more of it in parallel.
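The "faster, or more in parallel" argument boils down to one line of arithmetic, sketched below. The function name and every number in the example are hypothetical, picked only to show the shape of the relationship, not measured from any disk or flash system.

```python
def elapsed_seconds(operations, latency_us, parallelism):
    """Time to complete `operations` I/Os, each taking `latency_us` microseconds,
    when `parallelism` of them can be in flight at once (idealised model)."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return operations * latency_us / parallelism / 1_000_000


# Hypothetical figures, for illustration only.
work = 1_000_000                                          # I/O operations to get through
baseline = elapsed_seconds(work, latency_us=5000, parallelism=10)
faster = elapsed_seconds(work, latency_us=500, parallelism=10)    # same work, done faster
both = elapsed_seconds(work, latency_us=500, parallelism=100)     # faster and more parallel
print(baseline, faster, both)
```

Cutting the latency and widening the parallelism multiply together, which is the "extra gain" from using both options at once.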

Image courtesy of Google Inc.

And there’s more. Those clunky great big cabinets of disk use up horrendous amounts of power just to spin those little rotating platters – with much of the energy converted to heat and noise: waste. The heat results in a requirement for additional cooling, which uses even more power: more waste. And it all takes up so much physical space that data centres become overrun with storage.

In contrast, all flash arrays (AFAs) require less power, less cooling and take up less physical space: it’s not uncommon for customers to pay for the move to flash simply by avoiding the need to build a new data centre or extend an existing one. In summary, the net cost of using flash is now less than that of using disk.

When I first started writing this blog back in 2012 there was still a debate over whether flash would replace disk for enterprise storage. That debate was over some time ago: flash has already won.

Architecture Matters

So this post marks the end of my journey into explaining and understanding NAND flash. Yet there is a whole new area which needs exploring: the architecture of all flash arrays.

Enterprise storage needs to be safe, reliable, predictable and fast. Yet at a package level, NAND flash is a tricky little beast that has to be constantly watched to make sure it behaves itself. There’s a dichotomy here: how do we use the latter to deliver the former? How do we take a component designed for consumer electronics and use it to build an enterprise-class AFA? In short, how do we derive order from chaos?

The answer is in the architecture. At the time of writing this blog there are a number of AFA vendors on the market, each with a different approach to taming the beast. Apart from my own employer, Violin Memory, there are EMC, IBM, HDS, Pure Storage, SolidFire, Kaminario and a whole load more.

And that’s why this industry is so interesting to me. Everybody is trying to do this differently, although you can broadly categorise the solutions into three distinct ranges: hybrid arrays, SSD-based arrays and ground-up arrays. Everybody thinks their way is right – and nobody can afford to be wrong. The market for flash-based primary storage is huge and growing all the time: the winners get unparalleled success, while the losers … are simply left in disarray*

*I won’t lie – I’m so proud of that pun I’m going to award myself a couple of weeks off.

The Great Hypervisor Bake-off: VMware ESX vs Oracle VM

This is a very simple post to show the results of some recent testing that Tom and I ran using Oracle SLOB on Violin to determine the impact of using virtualization. But before we get to that, I am duty bound to write a paragraph of text featuring lots of long sentences peppered with industry buzz words. Forgive me, it’s just the way I’m wired.

It is increasingly common these days to find database environments running in virtual machines – even large, business critical ones. The driver is the trend to commoditize I.T. services and build consolidated, private-cloud style solutions in order to control operational expense and increase agility (not to mention reduce exposure to Oracle licenses). But, as I’ve said in previous posts, the catalyst has been the unblocking of I/O as legacy disk systems are replaced by flash memory. In the past, virtual environments caused a kind of I/O blender effect whereby I/O calls become increasingly randomized – and this sucked for the performance of disk drives. Flash memory arrays on the other hand can deliver random I/O all day long because… well, if you don’t know the reasons by now can I just recommend starting at the beginning. The outcome is that many large and medium-sized organisations are now building database-as-a-service platforms with Oracle databases (other database products are available) running in virtual machines. It’s happening right now.

Phew. Anyway, that last paragraph was just a wordy way of telling you that I’m often seeing Oracle running in virtual machines on top of hypervisors. But how much of a performance impact do those hypervisors have? Step this way to find out.

The Contenders

When it comes to running Oracle on a hypervisor using Intel x86 hardware (for that is what I have available), I only know of three real contenders:

Hyper-V has been an option for a couple of years now, but I’ll be honest – I have neither the time nor the inclination to test it today. It’s not that I don’t rate it as a product, it’s just that I’ve never used it before and don’t have enough time to learn something new right now. Maybe someday I’ll come back and add it to the mix.

In the meantime, it’s the big showdown: VMware versus Oracle VM. Not that Oracle VM is really in the same league as VMware in terms of market share… but you know, I’m trying to make this sound exciting.

The Test

This is going to be an Oracle SLOB sustained throughput test. In other words, I’m going to build an Oracle database and then shovel a massive amount of I/O through it (you can read all about SLOB here and here). SLOB will be configured to run with 25% of statements being UPDATEs (the remainder are SELECTs) and will run for 8 hours straight. What we want to see is a) which hypervisor configuration allows the greatest I/O bandwidth, and b) which hypervisor configuration exhibits the most predictable performance.

This is the configuration. First the hardware:

Violin Memory 6616 flash Memory Array

  • 1x Dell PowerEdge R720 server
  • 2x Intel Xeon CPU E5-2690 v2 10-core @ 3.00GHz [so that’s 2 sockets, 20 cores, 40 threads for this server]
  • 128GB DRAM
  • 1x Violin Memory 6616 (SLC) flash memory array [the one that did this]
  • 8Gb fibre-channel

And the software:

  • Hypervisor: VMware ESXi 5.5.1
  • Hypervisor: Oracle VM for x86 3.3.1
  • VM: Oracle Linux 6 Update 5 (with the Unbreakable Enterprise v3 Kernel 3.6.18)
  • Oracle Grid Infrastructure 11.2.0.4 (for Automatic Storage Management)
  • Oracle Database Enterprise Edition 11.2.0.4

Each VM is configured with 20 vCPUs and is using Linux Device Mapper Multipath and Oracle ASMLib. ASM is configured to use one single +DATA diskgroup comprising 8 ASM disks (LUNs from Violin) with external redundancy. The database parameters and SLOB settings are all listed on the SLOB sustained throughput test page.

Results: Bare Metal (Baseline)

First let’s see what happens when we don’t use a hypervisor at all and just run OL6.5 on bare metal:

Oracle SLOB- 8 Hour Sustained Throughput Test with no hypervisor (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         232,431.0       194,452.3        37,978.7
         Database Requests:         228,909.4       194,447.9        34,461.5
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           3,515.1             0.3         3,514.8
                Total (MB):           1,839.6         1,519.2           320.4

Ok so we’re looking at 1519 MB/sec of read throughput and 320 MB/sec of write throughput. Crucially, the lines are nice and consistent – with very little deviation from the mean. By dividing the amount of time spent waiting on db file sequential read (i.e. random physical reads) by the number of waits, we can calculate that the average latency for random reads was 438 microseconds.
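The latency calculation is nothing more than total wait time divided by the number of waits. A minimal sketch follows; the function name is my own and the inputs are made-up round numbers chosen to land on a figure of the same size as the one above. They are not the values from the actual report.

```python
def average_wait_us(total_wait_seconds, wait_count):
    """Average latency in microseconds for a wait event such as
    'db file sequential read': total time waited divided by number of waits."""
    if wait_count <= 0:
        raise ValueError("wait_count must be positive")
    return total_wait_seconds / wait_count * 1_000_000


# Illustrative inputs only (not taken from the real test report):
# 438 seconds of waiting spread over one million waits.
print(round(average_wait_us(total_wait_seconds=438, wait_count=1_000_000)))
```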

Now we know what to expect, let’s look at the result from the hypervisor tests.

Results: VMware vSphere

VMware is configured to use Raw Device Mapping (RDM) which essentially gives the benefits of raw devices… read here for more details on that. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with VMware ESXi 5.5.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         173,141.7       145,066.8        28,075.0
         Database Requests:         170,615.3       145,064.0        25,551.4
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,522.8             0.1         2,522.7
                Total (MB):           1,370.0         1,133.4           236.7

Average read throughput for this test was 1133 MB/sec and write throughput averaged at 237 MB/sec. Average read latency was 596 microseconds. That’s an increase of 36%.

In comparison to the bare metal test, we see that total bandwidth dropped by around 25%. That might seem like a lot but remember, we are absolutely hammering this system. A real database is unlikely to ever create this level of sustained I/O. In my role at Violin I’ve been privileged to work on some of the busiest databases in Europe – nothing is ever this crazy (although a few do come close).

Results: Oracle VM

Oracle VM is based on the Xen hypervisor and therefore uses Xen virtual disks to present block devices. For this test I downloaded the Oracle Linux 6 Update 5 template from Oracle’s eDelivery site. You can see more about the way this VM was configured here. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with Oracle VM 3.3.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         160,563.8       134,592.9        25,970.9
         Database Requests:         158,538.1       134,587.3        23,950.8
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,017.2             0.2         2,016.9
                Total (MB):           1,273.4         1,051.6           221.9

This time we see average read bandwidth of 1052MB/sec and average write bandwidth of 222MB/sec, with the average read latency at 607 microseconds, which is 39% higher than the baseline test.

Meanwhile, total bandwidth dropped by 31%. That’s slightly worse than VMware, but what’s really interesting is the deviation. Look at how ragged the lines are on the OVM test! There is a much higher degree of variance exhibited here than on the VMware test.
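For anyone who wants to check the percentages quoted for both hypervisors, the sketch below recomputes them from the "Total (MB)" rows of the three IO profiles and the average read latencies stated in the text. The dictionary layout and function name are just my own scaffolding.

```python
# Figures copied from the IO profiles and latency numbers quoted in this post.
BASELINE = {"total_mb_s": 1839.6, "read_latency_us": 438}
RESULTS = {
    "VMware ESXi 5.5.1": {"total_mb_s": 1370.0, "read_latency_us": 596},
    "Oracle VM 3.3.1": {"total_mb_s": 1273.4, "read_latency_us": 607},
}


def percent_change(baseline, measured):
    """Signed percentage change of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline * 100.0


for name, result in RESULTS.items():
    bandwidth = percent_change(BASELINE["total_mb_s"], result["total_mb_s"])
    latency = percent_change(BASELINE["read_latency_us"], result["read_latency_us"])
    print(f"{name}: bandwidth {bandwidth:+.1f}%, read latency {latency:+.1f}%")
```

The results line up with the rounded figures in the text: a bandwidth drop of roughly 25% and 31%, and a read latency increase of roughly 36% and 39%.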

Conclusion

This is only one test so I’m not claiming it’s conclusive. VMware does appear to deliver slightly better performance than OVM in my tests, but it’s not a huge difference. However, I am very much concerned by the variance of the OVM test in comparison to VMware. Look, for example, at the wait event histograms for db file sequential read:

Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was <.05%; value of null is truly 0
-> % of Waits: column heading of <=1s is truly <1024ms, >1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                             % of Waits
                                          -----------------------------------------------
                                    Total
Hypervisor  Event                   Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
----------- ----------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
Bare Metal: db file sequential read 5557.  98.7   1.3    .0    .0    .0    .0
VMware ESX: db file sequential read 4164.  92.2   6.7   1.1    .0    .0    .0
Oracle VM : db file sequential read 3834.  95.6   4.1    .1    .1    .0    .0    .0    .0

The OVM tests show occasional results in the two highest buckets, meaning once or twice there were waits in excess of 1 second! However, to be fair, OVM also had more sub-millisecond waits than VMware.

Anyway, for now – and for this setup at least – I’m sticking with VMware. You should of course test your own workloads before choosing which hypervisor works for you…

Thanks as always to Kevin for bringing Oracle SLOB to the community.

ASM Rebalance Too Slow? 3 Tips To Improve Rebalance Times

I’ve run into a few customers recently who have had problems with their ASM rebalance operations running too slowly. Surprisingly, there were some simple concepts being overlooked – and once these were understood, the rebalance times were dramatically improved. For that reason, I’m documenting the solutions here… I hope that somebody, somewhere benefits…

1. Don’t Overbalance

Every time you run an ALTER DISKGROUP REBALANCE operation you initiate a large amount of I/O workload as Oracle ASM works to evenly stripe data across all available ASM disks (i.e. LUNs). The most common cause of rebalance operations running slowly that I see (and I’m constantly surprised how much I see this) is to overbalance, i.e. cause ASM to perform more I/O than is necessary.

It almost always goes like this. The customer wants to migrate some data from one set of ASM disks to another, so they first add the new disks:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
rebalance power 11 wait;

Then they drop the old disks like this:

alter diskgroup data
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

Well guess what? That causes double the amount of I/O that is actually necessary to migrate, because Oracle evenly stripes across all disks and then has to rebalance a second time once the original disks are dropped.

This is how it should be done – in one single operation:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

A customer of mine tried this earlier this week and reported back that their ASM rebalance time had reduced by a factor of five!
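If you script your migrations, it is easy to make the single-operation form the default. Here is a small sketch that only builds the statement text, following the syntax of the example above. The function name and its parameters are hypothetical, and it does not connect to ASM or validate disk names.

```python
def build_migration_statement(diskgroup, new_disks, old_disks, power=11, wait=True):
    """Return one ALTER DISKGROUP statement that adds the new disks and drops
    the old ones together, so ASM only has to rebalance once."""
    if not new_disks or not old_disks:
        raise ValueError("need at least one disk to add and one disk to drop")
    add_list = ",".join(f"'{disk}'" for disk in new_disks)
    drop_list = ",".join(f"'{disk}'" for disk in old_disks)
    wait_clause = "wait" if wait else "nowait"
    return (
        f"alter diskgroup {diskgroup}\n"
        f"add disk {add_list}\n"
        f"drop disk {drop_list}\n"
        f"rebalance power {power} {wait_clause};"
    )


# Same disk names as the example in this post.
new = [f"ORCL:NEWDATA{i}" for i in range(1, 9)]
old = [f"DATA{i}" for i in range(1, 9)]
print(build_migration_statement("data", new, old))
```

The point of the design is that there is no way to call it that produces two separate rebalance clauses.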

By the way, the WAIT keyword means the cursor doesn’t return until the command is finished. To have the command essentially run in the background you can simply change this to NOWAIT. Also, you could run the ADD and DROP commands separately if you used a POWER LIMIT of zero for the first command, as this would pause the rebalance and then the second command would kick it off.

2. Power Limit Goes Up To 1024

Simple one this, but easily forgotten. From the early days of ASM, the maximum power limit for rebalance operations was 11. See here if you don’t know why.

From 11.2.0.2, if the COMPATIBLE.ASM disk group attribute is set to 11.2.0.2 or higher the limit is now 1024. That means 11 really isn’t going to cut it anymore. If you are asking for full power, make sure you know what number that is.
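As a quick illustration of that rule, the sketch below maps a COMPATIBLE.ASM value to the maximum power limit. The function names are my own, and it assumes the ASM software itself is already at 11.2.0.2 or later; it only looks at the attribute string you pass in.

```python
def parse_version(text):
    """Turn a dotted version string such as '11.2.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def max_rebalance_power(compatible_asm):
    """Maximum rebalance power for a disk group, per the rule described above:
    1024 when COMPATIBLE.ASM is 11.2.0.2 or higher, otherwise 11."""
    version = parse_version(compatible_asm)
    # Pad short values so that '11.2' compares as 11.2.0.0.
    padded = version + (0,) * (4 - len(version))
    return 1024 if padded >= (11, 2, 0, 2) else 11


for value in ("10.1", "11.2.0.1", "11.2.0.2", "12.1"):
    print(value, max_rebalance_power(value))
```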

3. Avoid The Compact Phase (for Flash Storage Systems)

An ASM rebalance operation comprises three phases, where the third one is the compact phase. This attempts to move data as close as possible to the outer tracks of the disks ASM is using.

Did you spot the issue there? Disks. This I/O-heavy phase is completely pointless on a flash system, where I/O is served evenly from any logical address within a LUN.

You can therefore avoid that potentially-massive I/O hit by disabling the compact phase, using the underscore parameter _DISABLE_REBALANCE_COMPACT=TRUE. Remember that you need to get Oracle Support’s permission before setting underscore parameters! Point your SR in the direction of the following My Oracle Support note:

What is ASM rebalance compact Phase and how it can be disabled (Doc ID 1902001.1)

Unfortunately it appears the parameter was deprecated in 12c, so from now on you have to set the ASM diskgroup attribute “_rebalance_compact” to FALSE (note the opposite value to that set at the instance level!), for example:

ALTER DISKGROUP <diskgroup_name> SET ATTRIBUTE '_rebalance_compact'='FALSE';

If you want to know more about this topic (for example, what the first two rebalance phases are), or indeed anything about ASM in general, I highly recommend the legendary ASM blogger that is Bane Radulovic a.k.a. ASM Support Guy.

Conclusion

An ASM rebalance potentially creates a lot of I/O, which means you may need to wait for a long time before it finishes. For that reason, make sure you understand what you are doing and make every effort to perform only as much I/O as you actually need. Don’t forget you can use the EXPLAIN WORK command to gauge in advance how much work is required.

Happy rebalancing!

Implementing Linux native multipathing or DM-MPIO together with EMC PowerPath

Guest Post

I’m delighted to say that this is another guest post from my good friend Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Because he is American, Nate thinks that scones are called “biscuits”, that chips are called “fries” and that there is nothing – *nothing* – that cannot be improved with the simple addition of bacon. Clearly, something is fundamentally wrong with him – and yet he is like a brother to me. Like the strange, American step-brother I only see a few times a year and whom I cannot understand without the use of a translator. But he’s family all the same. So over to you Nate… and remember: Mom loved me more.

Remember when your parent answering your whiny “but whyyyyyyy???” with “Because I said so” was something you just had to accept? It meant there was no more explanation coming, and it was time for you to move on. Over the years, that answer broke down, and you grew confident you were owed more. And parents agreed: the more you demonstrated your ability to reason, the more reason you got to help you over the denial. It’s a sign of respect that we pay each other in adult life. And it can feel like disrespect if the reason offered feels weak or like it is intended to discourage further inquiry.

I was recently faced with solving what seemed a straightforward problem: take an existing Linux server running EMC’s multipathing software, PowerPath or “PP” as I will refer to it here, to access LUNs presented from that company’s SAN product, the VNX array, and attach and run Violin storage alongside the VNX. PP didn’t then support Violin arrays (still doesn’t at the time of this writing), so what was the client to do when they wanted to try out Violin’s AFA for their database environments? Just run PP and native Linux multipathing, called DM-MPIO, side by side, letting PP manage the VNX LUNs and DM-MPIO manage I/Os bound for Violin, right?

PowerPath versus DM-MPIO

Wrong. Won’t work, I read. PowerPath does something at the HBA layer, one seemingly helpful web poster explained, that will corrupt either the VNX data or the Violin data. Well… maybe it will work, suggested another poster, but EMC might not support customers running in such a configuration. Others suggested ominously that PP and DM-MPIO don’t work well together… leaving it to the reader’s imagination what might result. I’m no master Googler, but I couldn’t find where anyone had put aside the rumors and vaguely threatening suggestions and actually tried it. Well, I did it, and I want to write about it so others know it can be done and how to do it because, well, those explanations I read didn’t stand up to question and felt like they were meant to scare me into not trying it. Of course I had to try it!

Now, let’s be clear about what I am and am not saying: I am saying I have done this and it works. It’s in production at a customer site, running for months without issue. I am not saying that I have spoken to your EMC support rep and that you’ve been green-lighted to do this in your production environment. I’m not an EMC customer, and I don’t have a buddy in EMC support. So let’s consider this for informational purposes only for the time being.

First off, as several folks rightly pointed out, DM-MPIO could easily manage LUNs from both SAN products. Drop PP, configure DM-MPIO, and done. Well, that just sounds too simple. But it’s true: DM-MPIO has come a long way over the last few years and offers a pretty good set of features for free. PP costs money but is not without added value, as it does have additional configurability for reserve paths that become active in the event of a failure scenario, as well as IO distribution models beyond those offered by DM-MPIO, for example. My customer wanted to keep running PP, so this option was off the table for me.

Next up is the fun fact that PP advises you upon installation that you should “Blacklist all devices in /etc/multipath.conf and stop multipathd service”. The installer doesn’t say what will happen if you don’t do this, only that it is “*** IMPORTANT ***”. Check. Easy enough to ignore if this is the first thing you do. But if DM-MPIO is already running on the system and you try to start PP, it tells you this (verified in 5.7 and 6.0 only):

[root@host] # /etc/init.d/PowerPath start
Starting PowerPath:                                                      [FAILED]
Aborting PowerPath start since DM-Multipath is active.
Refer to PowerPath for Linux Installation and Administration Guide for more information

That’s a bummer. You actually have to stop multipathd and flush its paths before PP will start up. OK, I can do that. And, to be sure, you do NOT want both products attempting to manage IOs for the same device at the same time. That really is a bad thing. As we’ll see shortly, we might even want to segregate traffic across different FC ports, although this is strictly for optimization, not because you can’t mix traffic. But, as soon as we’ve installed the device-mapper-multipath-* packages, let’s honor this restriction right away by blacklisting the EMC devices in /etc/multipath.conf like this:

blacklist {
       devnode "^(control|vg|ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"    # standard stuff
       devnode "^hd[a-z][0-9]*"                                          # this line too
       device {
              vendor "DGC"
              product "*"
       }
       device {
              vendor "EMC"
              product "*"
       }
}

Note that the VNX line grew up in the Clariion company later acquired by EMC and presents a vendor string of “DGC”. Don’t ask me why. [Because the Clariion was a product from Data General Corporation? — flashdba :-)] It is my understanding that VMAX arrays do present “EMC” as their vendor string. Having done this, we want to explicitly except Violin devices from getting blacklisted:

blacklist_exceptions {
       device {
              vendor "VIOLIN"
              product "*"
       }
}

This isn’t completely necessary, but it does make clear our intentions: don’t manage VNX/EMC devices but do manage Violin devices. Having both entries in the file means that adding some third storage product to the FC SAN won’t cause it to get picked up by DM-MPIO without us consciously making it so. Belt and suspenders, they used to say.

Verify your multipath configuration without actually running it. Do this by adding the “-d” flag to your multipath command:

[root@host] # multipath -v3 -d

The “-v3” flag gives us a verbose parsing of the configuration file so we can see each device, whether DM-MPIO is going to manage it, and why. Make changes ad nauseam, and once you like what you see, run the command without “-d” to create your multipath devices.
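The verbose output is long, so it can help to filter it down to just the blacklist decisions. The following is a minimal sketch, not part of the original procedure: the function name is made up for illustration, and it assumes your multipath-tools version logs the words “blacklisted” and “whitelisted” in its verbose output (the exact wording varies between versions, so adjust the pattern to what you actually see).

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of verbose multipath output
# that record a blacklist or whitelist decision.
# Assumption: the log lines contain "blacklisted" or "whitelisted".
show_blacklist_decisions() {
    grep -i -E 'blacklisted|whitelisted'
}

# Typical use, as root (dry run, so no maps are created):
#   multipath -v3 -d 2>&1 | show_blacklist_decisions
```

If the EMC/DGC paths show up as blacklisted and the Violin paths as whitelisted, the two stanzas are doing what was intended.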

Cool. But remember when PP refused to start up earlier, saying DM-MPIO was found running? Guess what: PP’s inexplicable method of editing your /etc/rc.d/rc.sysinit script to insert its startup lines means it doesn’t attempt to start up until after DM-MPIO gets started on reboot. (Take a look for yourself; it’s there. It also makes you manually start PP if you apply a kernel update that resets the contents of rc.sysinit, at which point it reinserts the startup lines. Sweet.) How to get around this? I’m sure there are lots of solutions. I created a script to flush existing multipaths and start up PP in /etc/init.d and linked it as /etc/rc.d/rc3.d/S86PowerPath. This makes it so PP gets called just prior to DM-MPIO, and each is happy. The later call in /etc/rc.d/rc.sysinit is then redundant but causes no harm. I suppose you could almost as easily edit the rc.sysinit script to remove the check–just remember to make the same edit if/when you update PP.
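A wrapper along those lines might look like the sketch below. This is an illustration under stated assumptions, not the exact script from the customer site: the function names and the DRY_RUN switch are hypothetical, and it assumes the stock PowerPath init script lives at /etc/init.d/PowerPath as shown earlier. “multipath -F” flushes all unused multipath device maps.

```shell
#!/bin/sh
# Hypothetical wrapper script: flush DM-MPIO maps, then start PowerPath.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

start_powerpath_first() {
    # Flush any multipath maps that are not in use, so that PowerPath
    # does not find DM-Multipath active when it starts.
    run multipath -F
    # Hand over to the stock PowerPath init script.
    run /etc/init.d/PowerPath start
}

# Only act when invoked the way init invokes a start script.
case "${1:-}" in
    start) start_powerpath_first ;;
esac
```

Saved under /etc/init.d with a name of your choosing and symlinked as /etc/rc.d/rc3.d/S86PowerPath, it would be called with “start” during boot. Running it by hand with DRY_RUN=1 first shows what it would do without touching anything.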

Now, what was that I said a bit ago about segregating traffic on different HBA ports? This is not required; no magic is happening on the HBA with either product. Each one will discover the devices it is concerned with via its own callout routine and handle that device how you configure it to. But let’s imagine you have 4 FC ports on your host and choose to allow PP and DM-MPIO to each manage devices across all those ports. Neither will be aware what the other is doing in terms of trying to optimize IO distribution across all paths available, and you could well end up shooting yourself in the foot with sub-optimal end results. Segregating traffic also allows you to set different HBA queue depths or optimization settings as recommended by each storage vendor, and we all want to comply with best practices, right?!
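If you do segregate, it is worth checking that the paths really land where you expect. Each FC port appears to Linux as its own SCSI host, so one rough check is to list every sd device with its vendor string and the host it sits behind. The sketch below is an assumption-laden illustration: the function name is made up, it relies on the usual sysfs layout (/sys/block/sdX/device resolving to a path that contains hostN), and it needs a readlink that supports “-f”.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <vendor> <scsi host>" for each sd device.
# The optional argument is the sysfs root (defaults to /sys).
list_paths_by_hba() {
    sysroot=${1:-/sys}
    for dev in "$sysroot"/block/sd*; do
        [ -r "$dev/device/vendor" ] || continue
        # Vendor strings are space-padded; strip the trailing padding.
        vendor=$(sed 's/ *$//' "$dev/device/vendor")
        # The resolved device path contains the SCSI host, e.g. .../host3/...
        host=$(readlink -f "$dev/device" | grep -o '/host[0-9][0-9]*/' | head -n 1 | tr -d '/')
        printf '%s %s %s\n' "${dev##*/}" "$vendor" "${host:-unknown}"
    done
}

# Typical use:
#   list_paths_by_hba | sort -k3,3 -k2,2
```

With segregated traffic, the DGC (or EMC) lines and the VIOLIN lines should never share a host number.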

Conclusion

None of this is meant to disparage EMC. Well, OK: the part about having the PowerPath startup script insert lines into /etc/rc.d/rc.sysinit is meant to disparage. I think that’s archaic and clunky. I have to believe there’s a more elegant way to do that today. I do hope I save some other soul the frustration I went through determining if this could be done and then how. If anyone has implemented more elegant solutions, I’d love to read about them.

Postcards from Storageland: Three Years At Violin

A few weeks ago, in what seems to be a truly modern phenomenon, I became aware that it was my third anniversary of joining Violin after I noticed a number of people congratulating me on LinkedIn. In many ways it feels like I’ve already been here for a lifetime, but it was only twelve months ago I was trying to think of a suitable flash-based pun for the title of an article just like this one. This year I opted out of the “Three years in a flash” headline; it seemed a bit too lame. Those NAND-based puns were only ever a flash in the pan.*

So what’s happened in the three years since I joined Violin? Well, quite a lot. When I signed up in early 2012 Violin was pioneering the flash array industry – and when I say pioneering I mean that, unlike in today’s crowded AFA market, it was a pretty lonely place. The only other all-flash array vendor with a presence was Texas Memory Systems (TMS), but they had seemingly gone into hibernation in the markets I had exposure to (as it turned out they were looking for a buyer, which they found in the form of IBM).

I was one of the first employees in EMEA, part of a business which was rapidly expanding due to a global reseller agreement with HP for our 3000 series array. The main enemy was the status quo – monolithic disk arrays from EMC, IBM, HP, HDS etc, perhaps with a smattering of SSDs to try and alleviate the terrible performance of random I/O. With the 3000 on HP’s price list and no real competition to worry about it seemed like the world was there for the taking. Time to pay off the mortgage.

Were we overconfident? Guilty of hubris, perhaps? We must have alienated a few people in the industry because I know not everyone felt sympathy for what happened next.

Pride Cometh Before A Fall

With hindsight, the $2.35 billion that HP paid for 3PAR meant it was unlikely to continue using Violin as a strategic product. HP may have a history of write downs, but it simply couldn’t justify OEMing the new 6000 series array with 3PAR still on the books so… it didn’t.

Meanwhile, EMC purchased a company that hadn’t yet shipped a product, IBM did its deal with TMS, Cisco bizarrely purchased Whiptail (which now appears to be suspended as a product) and a number of SSD-based flash array startups (e.g. Pure Storage) appeared on the market.

All of which meant that, when Violin went to IPO, things didn’t exactly go to plan. In fact, it eventually resulted in a change of management and the introduction of a new CEO and management team who have systematically transformed the company over the last year. But at the time, it felt like a roller coaster.

So why am I reminiscing about the bad times? Partly because I don’t want to gloss over the past, but also because I genuinely think that Violin has had to do a lot of growing up in the last year or so – and that’s a good thing. When I look at other flash vendors throwing FUD at each other, getting into legal disputes over employees or burning bridges with their channel partners to try and get their pre-IPO books to look more attractive… I can’t help a wry smile. Youth, eh? Some people still have harsh lessons to learn.

From Niche to Platform

This year, on the third anniversary of my joining Violin, we announced an important new product – the 7000 series Flash Storage Platform. Until the FSP, Violin had generally competed in the niche performance-optimized market – what some people call Tier 0 – where the single most important attribute is… well, performance (think database workloads). We’ve been pretty successful there, mainly because the 6000 series was (and still is) unbelievably fast, but also partly because much of the competition competes lower down in the capacity-optimized market (where price per GB is key – think VDI workloads). But we also attracted a surprising amount of criticism for the lack of certain Data Services features, such as deduplication (a feature that I’ve never coveted for database workloads).

But with the Flash Storage Platform, Violin – and flash in general – is moving into a new, larger and much more demanding market: Tier 1 primary storage. This is the big playground where all the major disk array vendors are desperately trying to stem the losses from their legacy SAN products. It’s also a market which is nearly 15 times larger than the one we used to operate in. And most importantly, it’s the one where you need to be able to deliver on all three requirements of the Primary Storage Trinity:

  • Performance (high IOPS and low latency)
  • Data Services (lots of features, fully integrated)
  • Capacity Optimization (low $/GB price)

By complete coincidence, this product launch also coincides with the end of the Understanding Flash section of my blog series on Storage for DBAs (when I started the flashdba blog it was aimed at database administrators, but over time the intended audience has expanded to anyone with an interest in flash storage).

With that in mind, in the next set of posts I’ll be turning my attention to the concepts and architecture of All Flash Arrays. What defines an AFA? What needs to be considered when designing one? And why doesn’t it make sense to stuff a load of SSDs into an existing disk array in the hope that it will deliver the performance of All Flash?

This is a really exciting time to be working in the storage industry – there’s lots to do and a massive opportunity to embrace. Because of this, the blog posts haven’t been coming as quickly as I’d hoped. But I still have much I want to talk about… so don’t worry, the next one will be back in a flash.**

* I really will stop making flash-based puns now

** Apart from this one
