Storage Myths: IOPS Matter


Storage for DBAs: Having now spent over a year in the storage industry, I’ve decided it’s time to call out an industry-wide obsession that I previously wasn’t aware of: everyone in storage is obsessed with IOPS (the performance metric I/O Operations Per Second). Take a minute to perform a web search for “flash iops” and you’ll see countless headlines from vendors that have broken new IOPS records – and yes, these days my own employer is often one of them. You’d be forgiven for thinking that, in storage, IOPS was the most important thing ever.

I’m here to tell you that it isn’t. At least, not if databases are your game.

Fundamental Characteristics of Storage

In a previous article I described the three fundamental characteristics of storage: latency, IOPS and bandwidth (or throughput). I even drew a simple, boxy diagram which, despite being one of the least-inspiring pieces of artwork ever created, serves me well enough to warrant its inclusion again here. These three properties are related – when one changes, the others change. With that in mind, here’s lesson #1:

High numbers of IOPS are useless unless they are delivered at low latency.

It’s all very well saying you can supply 1 million, 2 million, 4 million IOPS but if the latency sucks it’s not going to be of much value in the real world. Flash is great for delivering higher numbers of IOPS than disk, particularly for random I/Os (as I’ve written about previously), but ultimately the delay introduced by high latency is going to make real-world workloads unusable.

And there’s another, oft-hidden, problem that many flash vendors face: unpredictable latency. This is particularly the case during write-heavy workloads where garbage collection cannot always keep up with load, resulting in the infamous “write cliff” (more technically described as bandwidth degradation – see figure 7 of this paper). Maybe we should revise that previous line to be lesson #1.5:

High numbers of IOPS are useless unless they are delivered at predictable low latency.

But what about when we deal with volumes of data? If your requirement is to process vast amounts of information, do IOPS then become more important? Not really, because this is a bandwidth challenge – you need to design and build a system to suit your bandwidth requirements. How many GB/sec do you need? What can the storage subsystem deliver and how fast can you process it? Unlike bandwidth, an IOPS measurement does not contain the critical component of block size, so information is missing. And if you have the bandwidth figures, there is little additional value in knowing the IOPS, is there? Cue lesson #2:

Bandwidth figures are more useful for describing data volumes than IOPS.
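
To put some (purely illustrative) numbers behind that, here’s a quick sketch in Python showing how the same IOPS figure translates into very different data volumes depending on block size – which is exactly the information an IOPS headline leaves out:

```python
# Sketch: the same IOPS figure implies very different bandwidth depending
# on block size (all numbers are illustrative, not vendor measurements).

def bandwidth_mb_per_sec(iops: int, block_size_kb: int) -> float:
    """Bandwidth (MB/s) implied by a given IOPS figure at a given block size."""
    return iops * block_size_kb / 1024

for block_size_kb in (4, 8, 64, 1024):
    mb_s = bandwidth_mb_per_sec(100_000, block_size_kb)
    print(f"100,000 IOPS at {block_size_kb} KB blocks = {mb_s:,.0f} MB/s")

# 100,000 IOPS at 4 KB blocks = 391 MB/s
# 100,000 IOPS at 8 KB blocks = 781 MB/s
# 100,000 IOPS at 64 KB blocks = 6,250 MB/s
# 100,000 IOPS at 1024 KB blocks = 100,000 MB/s
```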

So what good are IOPS figures? And why does the storage industry talk about them all the time? Personally I think it’s a hang-up from the days of disk, when IOPS were such a limiting factor… and partly a marketing thing, because multi-million results sound impressive. Who knows? I’m more interested in what we should be asking about than what we shouldn’t.

So what does matter?

Latency Is King

Forget everything else. Latency is the critical factor because this is what injects delay into your system. Latency means lost time; time that could have been spent busily producing results, but is instead spent waiting for I/O resources.

Forget IOPS. The whole point of a flash array is that IOPS effectively become an unlimited resource. Sure, there is always a real limit – but it’s so high that it’s no longer necessary to worry about it.

Bandwidth still matters, particularly when you are doing something which requires volume, such as analytics or data warehousing. But bandwidth is a question of plumbing – designing a solution with enough capability to deliver throughout the stack: storage, fabric, HBAs, network, processor… build it and the data will come.

Latency is the “application stealth tax”, extending the elapsed times of individual tasks and processes until everything takes slightly longer. Add up all those delays and you have a significant problem. This is why, when you consider buying flash, you need to test the latency – and not just at the storage level, but end-to-end via the application (I’ll talk about this more in a following post).
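
To make that “stealth tax” concrete, here’s a back-of-envelope sketch (hypothetical numbers, not a benchmark) of how per-I/O latency accumulates into elapsed time for a job that issues its I/Os one at a time:

```python
# Back-of-envelope sketch: latency as a "stealth tax" (illustrative numbers).
# A job issuing synchronous I/Os one after another spends roughly
# (I/O count x average latency) of its elapsed time just waiting.

io_count = 1_000_000  # single-threaded, one-at-a-time I/Os issued by the job

for storage, latency_ms in (("disk at ~8 ms", 8.0), ("flash at ~0.5 ms", 0.5)):
    wait_minutes = io_count * latency_ms / 1000 / 60
    print(f"{storage}: ~{wait_minutes:.0f} minutes spent waiting on I/O")

# disk at ~8 ms: ~133 minutes spent waiting on I/O
# flash at ~0.5 ms: ~8 minutes spent waiting on I/O
```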

“But I Don’t Need That Many IOPS…”

This is a classic misunderstanding, often the result of confusion brought about by FUD from storage vendors who cannot deliver at the higher end of the market. To repeat my previous statement, with a good flash system IOPS will effectively become an unlimited resource. This does not mean that it’s overkill for your needs. There is no point in spending more money on a solution than is necessary, but IOPS is not the indicator you should use to determine this – decisions like that should rely entirely on business requirements. I have yet to see a business requirement that related to IOPS (emphasis on business rather than technical).

Business requirements tend to be along the lines of needing to supply trading reports faster, or reduce the time spent by call centre operatives waiting for their CRM screens to refresh. These almost always translate back into latency requirements. After all, the key to solving any performance issue is always to follow the time and find out where it is being spent. Have you noticed that latency is the only one of our three fundamental characteristics which is expressed solely in units of time?

Don’t get distracted by IOPS… it’s all about latency.

28 Responses to Storage Myths: IOPS Matter

  1. I was looking for your thoughts on putting a whole database on flash. I am under the impression that flash is good for random writes but not for sequential writes, and that was one of the reasons we try to avoid putting redo logs on flash.

    But I was a little surprised when I came to know that the ODA uses flash for redo only.

    Any comments please?

    • flashdba says:

      Hi Jagjeet

      “I am under the impression that flash is good for random writes but not for sequential writes”

      That depends on your definition of “good”. In absolute terms flash is very good at sequential I/O (with some caveats which I’ll come to), but in relative terms, when compared to the performance of disk (which is what most people compare against, since that is what they are used to), the performance is fairly similar.

      Disk is terrible at random I/O because of the mechanical latency experienced as the actuator head moves to the correct track and the disk rotates to the correct sector. However, once those conditions are met, a fast disk such as a 15k RPM drive can perform large I/Os very quickly (hence why disks have great bandwidth figures but terrible IOPS figures – it is the IOPS that incur this “mechanical latency tax”).

      Flash systems can handle sequential I/O, but in the case of prolonged sequential writes there is a danger of hitting the infamous “write cliff”. I have some articles pending which will explain this in more detail, as well as Violin’s patented solution to avoid the issue, but for now I’ll restrict myself to saying that individual SSD drives are not going to be able to cope with sustained sequential writes and will, at some point, experience a large increase in response times as garbage collection is unable to keep up with demand for erased blocks.

      Redo logs are written as large, sequential writes – which is ideal for disks. For this reason, almost all TPC-C benchmark systems use flash for primary data and disk for redo logs. I’m not really sure why the ODA uses flash for redo, but then I have to confess that I have never really understood the concept of the ODA. I get that it’s nice to tick a single box on an order sheet and have a “pre-configured” system arrive, but I also believe that no solution architect would ever actually build that particular system in the real world. The ODA’s top-end Intel Xeon E5-2690 processors are so completely unbalanced against all those 10k RPM SAS drives that it reminds me of a Ferrari towing a caravan.

  2. obtechora says:

    Hi Flash,
    I have two questions.

    1) I do not get you completely about IOPS vs latency.
    You say what matters is not IOPS but latency. However, as far as I understood, the lower the latency is, the higher the number of IOPS that can be achieved. So could you please clarify your statement?

    2) I remember that in a previous post you stated that a sequential I/O did not cost much more than a random I/O. If this is true, how come Oracle computes in its system statistics the time it takes to perform a random I/O and the time it takes to perform a sequential one? Was this statement related only to SSDs, or did it also apply to hard drives?

    Thanks again for your help.

    • flashdba says:

      Hey

      1) Latency and IOPS are related, but not in an absolute and strict, mathematical sense. To try and understand what that means, think about two car manufacturing plants. The Renault factory in Valladolid, Spain produces around 100,000 cars per year. The Jaguar Land Rover plant in Halewood, Liverpool (UK) is designed to have a similar capacity of 100,000 cars per year at peak. However, the Renault cars are mostly mid-range vehicles while the Jaguar Land Rover cars are at the premium end of the market, which means individually they take longer to construct. As a result, both factories can have the same output rate and yet the time for an individual car to be manufactured (i.e. the waiting list for customers) is significantly different.

      Now, convert the cars per year metric to IOPS and the waiting list time to latency. The customers are your end users. Do they care about the IOPS? Do they care what number of cars the factory can produce in a year? Indirectly, yes – because a higher manufacturing volume would suggest they get their shiny new car delivered faster. But the reality is that it’s only the waiting list time which has a *direct effect* on their situation. This is the time they spend waiting in the system – as is the case with latency when discussing end users. So the key to understanding latency and IOPS is to see that while both metrics have relevance, only one is directly relevant.

      2) I’m not sure of the individual post you are referring to. But in the world of disk, the time to service a set of blocks sequentially is far lower than the time to service an identically-sized set of blocks randomly, due to the need to incur a seek time penalty for each random I/O. In flash, since there is no seek time nor any concept of contiguous blocks this is not the case. This can be portrayed in two ways depending on the motives of the person doing the describing: a) the benefit of performing random I/O on flash is extremely high, or b) the benefit of performing sequential I/O on flash is limited.

      Oracle’s system statistics will calculate random and sequential service times because they were designed to deal with disk systems. However, this doesn’t make them inaccurate or invalid for flash…

  3. obtechora says:

    Flash, while I understand that just because the storage itself is able to deliver a massive amount of IOPS does not mean one will not experience latency, it seems to me that not having enough IOPS will simply kill response time if your application is asking for a huge amount of I/Os. Killing response time because latency will be high. Now if you say that with flash the number of IOPS is ~”unlimited”, then in which circumstances would you have poor latency at the physical storage layer?

    Just do not forget also that not all companies have flash storage yet and are relying on hard drives, where being able to achieve better application throughput and being able to scale means being able to achieve more IOPS. This also reminds me that usually, the higher the throughput you’re able to achieve (in terms of transactions/s), the lower the average response time (still from a transaction perspective) will be (well, this is from queuing theory and can be verified).

    There are business requirements which are highly dependent on latency (from a transaction perspective) but there are also business requirements which are highly dependent on throughput (from a transaction perspective), even if it’s at the expense of the average response time.

    While a large number of IOPS may not guarantee good latency, having enough IOPS to increase transaction throughput is many times mandatory. Yes I know, with flash storage IOPS are unlimited, but again, not all companies can afford to buy terabytes of SSD and still need lots of IOPS.

    • flashdba says:

      Ok I take on board what you say but perhaps you have missed part of my message in the above post – or perhaps I didn’t make it clear enough.

      At one point I considered calling this post something like “It’s All About Latency” or some variant, but realised that would be a mistake. It’s not ALL about latency. It’s about latency and bandwidth / throughput. (Ok maybe the “Latency is King” subheading eroded that message a little!)

      Really the point of the above post is to take the three fundamental characteristics of storage that I defined in a previous post, i.e. Latency, IOPS and Bandwidth, then point out that of those three we only need two in the world of flash. IOPS doesn’t really add anything in a flash conversation, since IOPS are *effectively* unlimited. That’s perhaps dangerous use of the word “unlimited” but the point I’m trying to get across is that you shouldn’t size a flash system by the IOPS requirement, you should size it by the bandwidth requirement.

      As a simple rule of thumb, in transactional workloads it’s the latency that matters most. In data warehousing, DSS, analytics, BI etc workloads it’s bandwidth / throughput. As you mix workloads they both become important, but rarely do you need to consider IOPS. Yet if you look at the press releases coming from flash vendors, IOPS is the thing they all talk about the most!

      You also mention that not everyone has flash. That’s true (much to my disappointment!) but I’m afraid that until I decide to rename this website diskdba.com I’m going to be sticking with the flash-based stuff 🙂

      Thanks for commenting my friend, you always keep me on my toes.

  4. obtechora says:

    Thanks a lot flash! I did not have time to read all the details of your answer (will do when out of the office) and will read it carefully, and will probably be back soon with additional questions.
    I’m really lucky to have someone like you sharing and explaining that much.
    Thanks again my friend !

  5. Hi, IOPS is really a point of concern in storage decisions.
    But as you hint here, the latency that your clients will see when accessing the storage counts more than the IOPS that you can deliver. Latency is lost time.
    Depending on the system that you use, it is hard to define (or find) the limits of the storage. I wrote something along these lines on my blog (sorry, posted only in Portuguese).
    Thanks for sharing it.

  6. Thanks for your reply, I still need some more understanding about IOPS/bandwidth/latency… is there any book you would recommend to me?

    Thanks for your reply !

  7. Prashanth says:

    Hi Flashdba,

    Firstly want to thank you for the wonderful posts you write about your flash experience.
    I have some experience working on flash storage and have a query about your example comparing cars to IOPS vs latency. In your case, both factories can produce 100,000 cars even though the production time for each car varies, but the end result is that Jaguar has a bigger factory to produce the same number of cars. How does this make a difference?

    When we say throughput is 1GB/sec, this also takes into account the latency of each sub-piece of I/O (considering a physical flash sector of 512 bytes). The more latency here, the lower the throughput. As far as I know, flash vendors don’t mention block size when they quote throughput, which implies any I/O size is expected to produce the same throughput.

    • flashdba says:

      Hi Prashanth

      At a high level it’s relatively simple. Let’s say we can achieve X number of IOPS at Y latency: X => Y. If the latency is doubled (i.e. each I/O takes twice as long) then you would expect the number of IOPS to halve: X/2 => 2Y. But never forget that this I/O is being driven by processes somewhere on a host submitting I/O requests, usually in parallel.

      If we want to achieve the original X number of IOPS at this new latency it is still possible, we just have to double the number of concurrent processes performing this I/O. This is analogous to doubling the number of production lines within the car factory of our example.

      Of course, for this to be successful you have to have a storage array that can cope with this level of concurrent I/O as well as a host (or hosts) that are able to drive it…
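
      As a rough sketch of that relationship (with illustrative numbers only), this is essentially Little’s Law: achievable IOPS ≈ outstanding I/Os ÷ average latency.

      ```python
      # Rough sketch of the IOPS / latency / concurrency relationship
      # (Little's Law; all numbers are illustrative only).

      def iops(outstanding_ios: int, latency_ms: float) -> float:
          """Approximate IOPS achievable at a given concurrency and per-I/O latency."""
          return outstanding_ios / (latency_ms / 1000)

      print(iops(8, 0.5))   # 16000.0 IOPS at 8 outstanding I/Os and 0.5 ms latency
      print(iops(8, 1.0))   # 8000.0  -> latency doubles, IOPS halve
      print(iops(16, 1.0))  # 16000.0 -> double the concurrency to restore the rate
      ```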

      • flashdba says:

        I want to extend that previous reply a little, because throughput and latency are not necessarily related.

        Consider two copper pipes of identical diameter. Inside the pipes you place identical marbles with 10mm diameters – we’ll assume that they fit perfectly and can roll from one end to the other. The pipes are horizontal so gravity isn’t having any effect on the marbles.

        Now, the first pipe is 1m long so when you fill it with marbles you get exactly 100 of them inside. You then push more marbles in at a rate of 1 per second. What is the throughput of this system? Simple, it’s 1 marble per second – since that is what drops out of the other end. What is the latency though? The time taken for a marble to enter the system and exit at the other end is 100 seconds.

        The second pipe is half the length – 50cm. This means you only get 50 of them inside. As before you push in one more marble every second, giving a throughput of 1 marble per second just like before. But now the latency is only 50 seconds, half of that in the other tube.

        Ok so it’s a hugely simplified analogy which doesn’t take into account things like variable block size, but I just want to make the point that latency can actually vary independently of throughput.

  8. Andrea says:

    Thank you for this article, it has been very useful.

    I was looking at getting some SSDs into a RAID server for work and was initially looking at the Samsung 840 PRO models, but recently I have been tempted by the Intel DC 3500 Series.

    However I could not understand why they were rated so highly when they have such low IOPS.
    The Intel DC3500 has 7,500 write IOPS while the 840 Pro has 90,000 IOPS.

    If I understand your article correctly, the DC3500 is a no-brainer for best performance regardless of the lower IOPS rating?

    • flashdba says:

      Well it depends on what you want to do with these SSDs. Both models you mention are SSDs, rather than flash arrays or PCIe cards, which means performance is limited by the need to communicate via controller technology designed for disk.

      The IOPS limit isn’t important, as long as it’s above the requirement you have. That sounds like total common sense – and it is – but sometimes it gets overlooked. If you need 10,000 IOPS then who cares what each drive can deliver, both are well in excess of that requirement.

      The latency, on the other hand, will directly affect the performance of your system. If the latency of the Intel drive is generally lower than that of the Samsung drive then it would certainly seem the stronger choice. Of course, published latency figures may not be experienced in real life, particularly if your workload is mixed or (even worse) write heavy.

      Most SSDs can only sustain a limited amount of write workload before running into garbage collection issues (as discarded blocks have to be erased before they can be used again). Once garbage collection interferes with the foreground reading or writing processes, performance will become substantially worse – probably worse than the equivalent disk system. This is one of the reasons why I always advise against putting database redo logs on SSD…

      • dupeguy says:

        I think another good analogy in place of the vehicle factory and the marbles, bringing it back closer to tech, is to think of processor microarchitecture. Most techies have probably studied the basics of superscalar architecture. A processor may have a short pipeline that gets each instruction done in 1 nanosecond, or a long pipeline that allows parallelism and gets 4 instructions done in 4 nanoseconds, but where each individual instruction can finish in no less than 4 nanoseconds. Both have the same throughput (1 instruction per nanosecond) but in the second case the latency of each instruction is 4 nanoseconds. If your system design required 4 users all to get 1 ns latency (so you thought your throughput requirement was 1 instruction per nanosecond), you may need to buy 4 of the short-pipeline processors rather than one of the longer ones.

        BW and IOPS both tell you how much gets done per second. There are a lot of design decisions that can be made differently from SSD to SSD: how many flash chips of what capability and parallelism… different I/O handling for wear leveling, or for random vs sequential performance, etc… different points at which the user is acknowledged (also different amounts and levels of built-in cache)… when or if garbage collection is done… These all end up being tradeoffs similar to pipelining.

        The BW or IOPS numbers could be acceptable but your design may still fail for many reasons; here are a couple of examples.
        1. You might only achieve these rates with many users or applications working in different (or vice versa) areas of the flash.
        2. You might suffer latency impacts if, for instance, the flash is accepting many I/Os but waiting to process them at an optimal queue depth for the flash architecture. Time is not wasted on inefficiencies, but latency (especially its standard deviation) will likely be worse and potentially impact applications.

        The difficulty is that OEMs try to make devices that appeal to a wide spread of customers. It’s easy to pick a throughput number to target; it’s not easy to predict what latency requirements will be. The Samsung 840 Pro vs. Intel is probably a good example. The Samsung is a consumer-level device, which requires support for single-user general workloads… and an Intel SSD like the one mentioned (which uses marketing language like consistent performance and wear leveling) is probably targeted towards business RAID environments where multi-user, transactional applications are more common. LSI cards didn’t even work with Samsung 840 Pros for quite a while after launch! After all, a single consumer-level user will probably buy one 840 Pro rather than 4 SSDs to RAID (to go back to the CPU comparison).

        • flashdba says:

          Your analogy is much better than mine. Vehicle factories, marbles… what was I thinking?

          My argument, which may span multiple posts rather than being confined to this one in particular, is that back in the days of disk there were three *fundamental* characteristics of storage: latency, IOPS and bandwidth / throughput. Now, in the modern world of persistent memory-based storage such as NAND flash arrays, IOPS is no longer fundamental.

          Having said that, SSDs remain a half-way house between the two – and never more so than in the case of consumer SSDs.

          Thanks for commenting. I can see from your IP address who you work for, so you’re clearly not a newcomer to these technologies. It’s good to have knowledgeable readers keeping me honest 😉

  9. Flashwannabe says:

    Just to get the conversation rolling – re the article and the comments above (which are great and highly beneficial by the way!)
    When designing a SAN storage system, typically in the SME space we have system Capex costs and ongoing management Opex costs to worry about – with management and customers wanting ever-cheaper solutions! – Cloud (saying no more on this).
    In an ideal world each application (VMware, SQL, VDI, File or Exchange etc) would run on its own SAN, so avoiding contention, but the costs would be way too high to realistically achieve this – or in the flash/SSD world (as we have a very high number of IOPS available?) are we saying application contention is now a non-issue?
    The issue I see with most flash/SSD vendors is that most SANs (within a single head unit or even management domain) top out at 40TB, yet typically the standard SME requires 80-100TB useable or more (ignoring any dedupe or compression options, “IF” ☺ they are available). Hence the benefit of hybrid or tiering-based solutions.
    Flash vendors have to start thinking mainstream replacement, not high cost point/tactical solutions.

    • flashdba says:

      That is an interesting point, although not directly related to the dedupe post as it’s more about the suitability of current flash systems for true enterprise-scale solutions. I dare say there are a number of vendors who would want to take issue with your concern about the 40TB limit… some more successfully than others. In fact I would expand that point, because for flash to be a truly enterprise-class platform there are more questions that need to be answered.

      All the flash vendors right now are talking about data management features like dedupe, compression, thin provisioning, snapshots etc. I would argue that none of those are *essential* in enterprise class storage, but what is absolutely essential are features like non-disruptive upgrades (to *everything* rather than just storage heads or gateways… think, for example, of SSD firmware – can this be upgraded non-disruptively?)

      My advice, for what it’s worth, is to draw up a list of mandatory requirements, plus nice-to-haves, before talking to any vendor and being deluged in marketing bluster.

  10. Flashwannabe says:

    Bandwidth definition: The term bandwidth has a number of technical meanings but since the popularization of the Internet, it has generally referred to the volume of information per unit of time that a transmission medium (like an Internet connection) can handle.
    Here’s a quick analogy: Data is to available bandwidth as water is to the size of the pipe. In other words, as the bandwidth increases, so does the amount of data that can flow through in a given amount of time, just like as the size of the pipe increases, so does the amount of water that can flow through during a period of time.
    An Internet connection with a larger bandwidth can move a set amount of data (say, a video file) much faster than an Internet connection with a lower bandwidth.

    I would say that with 8Gb FC switches/HBAs etc (and now 16Gb) – has that helped? NO! Did we ever see a 4Gb FC network saturated? NO! (well, I haven’t) – yet disk-based SANs with hundreds of 15K drives had latency issues. When we moved from 2Gb to 4Gb and then 4Gb to 8Gb it made no difference… SAN latency is key!

    As a discussion point – how should one size a greenfield flash/SSD SAN from a latency point of view? What are the latency requirements of a typical Windows server? Then multiply by 100 VMs? For SQL we can say no higher than 10ms, but how should one scale / size when we have the capacity requirements, but not the performance requirements?

    • flashdba says:

      The water-in-a-pipe analogy is often used when talking about bandwidth / throughput. It works well enough in some situations, but doesn’t adequately deal with latency, because the analogy is focussed on the volume per unit of time rather than the velocity (i.e. distance over time) at which the water travels. Increase the diameter of a pipe and it stands to reason you can deliver increased volume, but that doesn’t directly imply the velocity will change. The reason the analogy breaks down is because water can remain in a pipe at rest, whereas data packets on a network cannot. Thus when you turn on a tap (or faucet for my American friends) water is immediately available, but when you request an I/O you have to wait for data to complete the entire journey from sender to receiver.

      In my view, there are (or rather were) three fundamental characteristics of storage: latency, IOPS and bandwidth / throughput. You can read more about that here:

      The Fundamental Characteristics of Storage

      Also, it’s impossible to talk about SAN latency without thinking about the difference between random and sequential I/O:

      Understanding I/O: Random vs Sequential

  11. NM says:

    Thanks for the post flashdba – quite informative. I’ve got a question with respect to latency, not with respect to IOPS but more on the bandwidth side of things. I’m sure latency would also be a factor when it comes to the amount of bandwidth delivered by the system – among other things. However I was wondering, for a data warehouse where large volumes of data have to be processed – is it normal to see average wait times of I/O operations being higher compared to those on an OLTP-based system? Where I’m coming from is – with respect to Oracle, generally on OLTP systems that perform well, you’ll see I/O response times of 2-4ms, or generally well under 10ms (for sequential and random reads). Now when it comes to data warehousing would we expect to see the same sort of I/O response times or would we expect them to degrade as more data is processed? Just wondering what your thoughts are on this?

    • flashdba says:

      It depends on the block size. If the block size is relatively large then it’s more likely you will see larger response times – and high bandwidth tends to indicate larger block sizes.

      Of course, it also depends on whether you are using disk or flash. I wouldn’t be happy with 2-4ms response times on flash, but on disk I’d be pretty relieved.
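
      As a rough sketch of why that is (assuming, purely for illustration, a fixed per-I/O latency plus a transfer time set by device bandwidth), the transfer component starts to dominate as the block size grows:

      ```python
      # Rough sketch: estimated single-I/O response time as block size grows.
      # Assumes a fixed per-I/O latency plus transfer time at a given bandwidth;
      # all numbers are illustrative, not measured.

      def response_time_ms(block_kb: int, base_latency_ms: float, bandwidth_mb_s: float) -> float:
          transfer_ms = (block_kb / 1024) / bandwidth_mb_s * 1000
          return base_latency_ms + transfer_ms

      for block_kb in (8, 128, 1024):
          t = response_time_ms(block_kb, base_latency_ms=0.3, bandwidth_mb_s=2000)
          print(f"{block_kb} KB read ~ {t:.2f} ms")

      # 8 KB read ~ 0.30 ms
      # 128 KB read ~ 0.36 ms
      # 1024 KB read ~ 0.80 ms
      ```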

  12. Sreenath Gupta says:

    Hello my friend, I think I reached you at last. I am in very big trouble with my new Dell R820 server with Hyper-V 2012 installed on it. I am facing high latency issues with all the VMs as well as with the host machine. For a few hours the server works well, and after that I face the latency issue again; at the time of latency I restart the server and the problem disappears, but it comes back again after some time. I have raised tickets with Dell and Microsoft for the same and both of them could not solve my issue. We are using the server for virtualization and the hardware configuration is 64 GB RAM / Xeon quad-core processor, 2 sockets, 32 threads total / 1 x 3 SAS 6Gb 7k RPM drives. Kindly suggest whether this is due to an HDD issue.
