The Most Expensive CPUs You Own


Storage for DBAs: Take a look in your data centre at all those humming boxes and flashing lights. Ignore the storage and networking gear for now and just concentrate on the servers. You probably have many different models, with different types and numbers of CPUs and DRAM inside. My question is, which CPUs are the most expensive? Almost without exception, the answer will be the CPUs inside your database servers…

In the last couple of posts I talked about the real cost of enterprise database software in general and Oracle RAC in particular. The point I was making was that database software, which is traditionally licensed by the CPU core, is expensive in comparison to the cost of the hardware on which it runs. But since the hardware fundamentally affects the performance – and therefore value for money – of the software, it’s important to make the right choices when building a database system. And yes, predictably, I believe that this means using flash memory instead of disk – but don’t worry, that’s not the main message behind this post.

Lawn Mower Tax

Think of any consumer item which comes in multiple sizes and price brackets. I don’t know, let’s say a lawn mower. To simplify, let’s assume you can buy three different types of mower: small ($250), medium ($500) and large ($1000). The small one is cheaper but less powerful, so it takes longer to cut your grass, while the large one is the most expensive but requires the shortest amount of time. Which would you pick?

There’s no right answer because it depends on your requirements. But let’s introduce an unexpected complication into the mix: lawn mower tax. The government, in their wisdom, imposes a $50,000 tax on the purchase of any new lawn mower regardless of size. You still need a mower so you are forced to pay the tax, but is your choice influenced? The chances are you would buy the larger model, because a) the percentage difference in overall price is much less, and b) it avoids the risk of needing to upgrade in the future and having to pay the tax again. The $51,000 large mower represents better value for money than the two smaller models.

CPU Tax

You can think of database software in the same way. There are countless types of CPU available on the market right now: Intel, AMD, ARM, IBM Power, Oracle / Fujitsu SPARC, etc. Each vendor has many models and architectures, clock speeds and power ratings, yet they all share one important property: core count. And that core count is subject to the massive “CPU tax” that is the database software license. I’m sticking to the Oracle Database in this post but the same applies to Microsoft SQL Server (where licenses are core-based from SQL2012 onwards), Sybase and so on.


Take a standard two-socket sixteen-core Intel Xeon-based server as an example: there are a multitude of CPU models fitting that description. Even if we restrict ourselves to the Sandy Bridge-EP range, Wikipedia shows there are 11 different models fitting the description of “8 cores per socket”. Yet not all CPUs are equal. Wouldn’t it make sense, given the massive cost associated with core-based licensing, to ensure you are using the processor which gives you the best performance, i.e. value for money, per license?

Performance Per Licensable Core

The problem of determining which CPUs provide the best value for money was one I struggled with for a while. Looking at benchmarks like SPECint and the datasheets from Intel and co, it’s hard not to be overwhelmed by data – and if I’m honest I probably don’t have the systems-level knowledge to interpret it accurately. Ironically, the solution came from someone who does have that knowledge, but showed me that it isn’t required because there’s a much simpler way. More importantly, benchmarks like SPECint don’t take into account what we want these CPUs to do, which is to run the Oracle Database.

Kevin Closson‘s elegant and annoyingly simple solution was to use TPC benchmarks – specifically results from the transactional TPC-C benchmark running on Oracle databases, which are freely available here. All we need to do then is simply download the spreadsheet, filter out the non-Oracle workloads and then divide the value of tpmC (the number of orders that can be fully processed per minute) by the number of CPU cores to get the performance per core.

Since this is an Oracle-specific calculation we also then need to multiply this by Oracle’s Processor Core Factor (see link on this page) to get the ultimate figure we need to know: the performance per license. Here’s my working copy of the spreadsheet, but I make no claims to its accuracy and will not keep this screenshot up-to-date. You should recalculate every time you want to make a judgement on which servers to use – it’s a very simple exercise.
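Here’s a minimal sketch of that calculation in Python. The function and figures are illustrative assumptions – plug in the tpmC values, core counts and core factors from the current TPC-C results spreadsheet and Oracle’s core factor table rather than my made-up example.

# Performance per licensable core: tpmC divided by (total cores x core factor).
def perf_per_license(tpmC: float, total_cores: int, core_factor: float) -> float:
    licensable_cores = total_cores * core_factor
    return tpmC / licensable_cores

# Hypothetical example: 1,000,000 tpmC from a two-socket, sixteen-core x86 server
# with Oracle's 0.5 core factor for Intel Xeon.
print(perf_per_license(tpmC=1_000_000, total_cores=16, core_factor=0.5))  # 125000.0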

Performance per licensable core (based on published TPC-C benchmark results using Oracle)

The red column is the performance per licensable core, marked “Perf / license”. Hopefully it’s obvious that this is just a re-work of Kevin’s ideas, many of which he posted in this blog article, which I highly recommend reading. As such I can claim no credit, except for any mistakes.

The Flash Angle

Of course, this wouldn’t be a flashdba article without some mention of flash memory. As discussed above there are many different types and models of CPU, but there is one great leveller: CPUs are all equally good at doing nothing. If your processors are waiting on I/O then they are not working – and that has a direct negative effect on the value you are realising from them.

In the above chart, the last benchmark result (with the best value for performance per licensable core) is this one performed by Cisco. Now, I honestly didn’t engineer this article to work out this way, but it so happens that Cisco used a pair of Violin Memory 6616 flash memory arrays to achieve this workload. (I’d almost* be happier if this had been a competitor’s flash array, because I don’t want this to look like an advert for my employer and therefore detract from my point…)

The point I’m aiming to make here is that it’s worth using the best-performing processors in order to see value for money from your database licenses. But to enable that, the processors need to be released from the chains of high-latency storage – and that, quite simply, means using flash.

* almost, but not quite


The Real Cost of Oracle RAC


Storage for DBAs: In my previous article (in this mini-series on database economics) I explained how to calculate the cost of a mid-range Oracle database system. My motive was a concern that many people working either directly or indirectly with database software are uninformed about just how expensive it is – particularly in comparison to the cost of hardware. And in this article I want to cover the great granddaddy of Oracle license costs: Oracle Real Application Clusters (RAC).

I also want to show you a little-known trick that can allow you to build a two-node fully active/active RAC cluster for a fraction of the price you would normally expect to pay.

But first, let’s talk about RAC…

Oracle RAC: High Availability For the Masses

There was a time, long ago, when big servers were very expensive. Many people ran Oracle on RISC-based UNIX systems, which had limited scalability in terms of the number of CPU cores and the maximum amount of physical memory. Oracle recognised this scalability issue and built a software solution for it, initially called Oracle Parallel Server (OPS). If you never used OPS in anger you should ask some of the grizzled, battle-scarred veterans who did how they fared against it, but at least in theory it allowed customers to scale out when scaling up wasn’t really possible.

However, things change – and nowhere more so than in IT. The days of big iron RISC systems seem long ago and nowadays (comparatively) cheap multicore x86 hardware is the norm. Scaling up to 80 cores in a server is not unusual, so the need for a software scalability solution is less pressing than it was. However, Oracle knows a thing or two about staying at the top, so OPS became Real Application Clusters and the scalability marketing message got overtaken by a new claim: high availability. Yes, Oracle RAC allows you to run one database across multiple nodes so if you look at it the right way that’s increasing system availability.

Of course, if you look at it another way (as I do), increasing the number of nodes actually increases the risk of any single node failing. Plus, adding a whole raft of cluster functionality such as cache coherence, cluster filesystems and cluster ready services is just adding complexity, which is the enemy of availability. Yet everyone in the RAC game lives with the same shared deception: that losing a whole node does not count as a service outage. Sure, you get a whole load of users that get kicked off. Ok, so you have to bounce a whole set of application servers. But hey, technically it wasn’t a full outage so the SLAs weren’t affected. Er… ok… I think I’ve made my thoughts clear on this before.

Oracle RAC: The Expensive Way

There are two reasons why RAC can be expensive, or to put it another way two dimensions. The price goes up as the license cost increases, but it also goes up in multiples as the architecture scales out to multiple nodes.

In general, RAC is an option for Oracle Enterprise Edition – in fact, looking at the prices on the Oracle Store as I write this, it’s the joint-most-expensive option (along with Oracle OLAP) priced at $23k per core (list)… If you consider that the Enterprise Edition license is $47.5k per core then that’s nearly half as much again. Don’t forget that Oracle’s core multiplication factor table determines that we need to multiply these costs by 0.5 for Intel Xeon processors, which is what I’m using in this example (see the first article in this series if you don’t know what this means).

Oracle RAC pricing assumptions

Let’s state some assumptions for this imaginary Oracle RAC cluster we are building. It will have 4 nodes (16 cores per node) and 20TB of usable disk storage. We’ll also assume that in buying the licenses we got a 60% discount. We’re looking at the three-year price and, as always, the maintenance costs us 22% of the net license cost. I’m including the Oracle Diagnostics Pack ($5k per core) in the license cost too – surely nobody can cope without it these days?
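To make the arithmetic behind the breakdown below explicit, here’s a rough sketch in Python using the assumptions just stated. The hardware and storage figures are deliberately left out, and none of this is Oracle pricing advice – just list-price arithmetic.

# License arithmetic for the 4-node example: 4 nodes x 16 cores x 0.5 core factor.
nodes, cores_per_node, core_factor = 4, 16, 0.5
licenses = nodes * cores_per_node * core_factor            # 32 licenses

list_per_license = 47_500 + 23_000 + 5_000                 # EE + RAC + Diagnostics Pack
discount = 0.60
net_license = licenses * list_per_license * (1 - discount)

support_3yr = net_license * 0.22 * 3                       # 22% of net license cost per year
print(f"Net licenses:       {net_license:,.0f}")           # 966,400
print(f"3-year support:     {support_3yr:,.0f}")           # 637,824
print(f"Licenses + support: {net_license + support_3yr:,.0f}")  # 1,604,224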

Oracle RAC price breakdown

The total cost over three years, just for hardware, software and support (i.e. discounting TCO-type calculations like power, cooling, etc) is now up at £1.8m. That’s a relatively large amount of money! But what I find really interesting is the proportion that goes to the database vendor compared to the proportion that is spent on hardware:

Four-node Oracle RAC price breakdown by component

The storage (which I naturally have an interest in) is just 8% of the total cost, while the database vendor’s products and support services comprise 89% of the total cost. This is where database consolidation starts to make sense (more databases on the same hardware means better value for money from the core-based licenses). It’s also where flash memory storage makes sense, because it allows a far better return on this massive investment: firstly by unleashing applications to run at the speed of memory, and secondly by unlocking (expensive) CPUs which are otherwise stuck waiting on I/O from slow disk storage systems.

Oracle RAC: The Inexpensive Way

But wait, I promised you an alternative to the costly system above. What is it? The answer can be found buried deep within Oracle’s Software Investment Guide (page 11 of the current published version) where we find the following information: from Oracle 10g onwards, Oracle Standard Edition includes the Real Application Clusters Option provided customers use Oracle Clusterware and ASM. Since Standard Edition is limited to a maximum of 4 CPU sockets (not cores!!) this effectively means a two-node system using two-socket servers.

That’s still an amazing revelation – it’s basically RAC (with certain caveats) for free! With the right choice of high-end CPU, a two-socket server can deliver massive performance. Let’s have a look at the cost of a two-node RAC system running on Standard Edition using the same assumptions from above [massive thanks to Doug (see comments below) for pointing out my mistake – now corrected – that Standard Edition is licensed by the socket not the core and that the core multiplication factor therefore does not apply]:

Standard Edition RAC price breakdown

Now many people will think, “Hang on I can’t cope without Enterprise Edition” … but for this level of saving, isn’t it worth giving that some closer analysis? The real bonus here is that, in only paying for licenses by the socket, you can achieve a massive benefit by using the fastest processors with the largest number of cores – without paying any penalty.

Two-node Standard Edition RAC price breakdown by component

The price of Standard Edition RAC is 87% lower than the price of our previous configuration. (If you were to compare a like-for-like scenario where 2 node Enterprise Edition RAC moved to 2 node Standard Edition RAC the saving would instead be £702.5k or 76%)
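To see why socket-based licensing rewards high-core-count processors, here’s a small sketch comparing the license bill for a single two-socket server under the two models as the cores per socket increase. The Standard Edition per-socket figure is an assumption for illustration only – check Oracle’s current price list before relying on it.

EE_PER_CORE = 47_500      # Enterprise Edition list price per core license
CORE_FACTOR = 0.5         # Intel Xeon core factor
SE_PER_SOCKET = 17_500    # assumed Standard Edition list price per socket

sockets = 2
for cores_per_socket in (4, 8, 12):
    ee = sockets * cores_per_socket * CORE_FACTOR * EE_PER_CORE
    se = sockets * SE_PER_SOCKET
    print(f"{cores_per_socket:>2} cores/socket:  EE ${ee:>9,.0f}   SE ${se:>7,.0f}")

# The Enterprise Edition bill scales with core count; the Standard Edition bill
# stays flat no matter how many cores each socket contains.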

Conclusion

Everything here is just speculation, based on the information available from Oracle at the time of writing. You should not construe my remarks as guarantees or facts, but instead do your own research and talk to your local database vendor’s representatives.

The point of writing this article is that technical people don’t always have a handle on price, because in some organisations they don’t always need to. But when the technical design has such a dramatic effect on the price, I think we all ought to be looking at the bigger picture and taking the time to work out the implications of our choices.

Software, as they say in Redwood Shores, doesn’t come cheap…

The Real Cost of Enterprise Database Software

Storage for DBAs: The strange thing about enterprise databases is that the people who design, manage and support them are often disassociated from the people who pay the bills. In fact, that’s not unusual in enterprise IT, particularly in larger organisations where purchasing departments are often at opposite ends of the org chart to operations and engineering staff.

I know this doesn’t apply to everyone but I spent many years working in development, operations and consultancy roles without ever having to think about the cost of an Oracle license. It just wasn’t part of my remit. I knew software was expensive, so I occasionally felt guilty when I absolutely insisted that we needed the Enterprise Edition licenses instead of Standard Edition (did we really, or was I just thinking of my CV?) but ultimately my job was to justify the purchase rather than explain the cost.

On the off chance that there are people like me out there who are still a little bit in the dark about pricing, I’m going to use this post to describe the basic price breakdown of a database environment. I also have a semi-hidden agenda for this, which is to demonstrate the surprisingly small proportion of the total cost that comprises the storage system. If you happen to be designing a database environment and you (or your management) think the cost of high-end storage is prohibitive, just keep in mind how little it affects the overall three-year cost in comparison to the benefits it brings.

Pricing a Mid-Range Oracle Database

Let’s take a simple mid-range database environment as our starting point. None of your expensive Oracle RAC licenses, just Enterprise Edition and one or two options running on a two-socket server.

At the moment, on the Oracle Store, a perpetual license for Enterprise Edition is retailing at $47,500 per processor. We’ll deal with the whole per processor thing in a minute. Keep in mind that this is the list price as well. Discounts are never guaranteed, but since this is a purely hypothetical system I’m going to apply a hypothetical 60% discount to the end product later on.

Mid-range database pricing assumptions

I said one or two options, so I’m going to pick the Partitioning option for this example – but you could easily choose Advanced Compression, Active Data Guard, Spatial or Real Application Testing as they are all currently priced at $11,500 per processor (with the license term being perpetual – if you don’t know the difference between this and named user then I recommend reading this). For the second option I’ll pick one of the cheaper packs… none of us can function without the wait interface anymore, so let’s buy the Tuning Pack for $5,000 per processor.

The Processor Core Factor

I guess we’d better discuss this whole processor thing now. Oracle uses per core licensing which means each CPU core needs a license, as opposed to per socket which requires one license per physical chip in the server. This is normal practice these days since not all sockets are equal – different chips can have anything from one to ten or more cores in them, making socket-based licensing a challenge for software vendors. Sybase is licensed by the core, as is Microsoft SQL Server from SQL 2012. However, not all cores are equal either… meaning that different types of architecture have to be priced according to their ability.

The solution, in Oracle’s case, is the Oracle Processor Core Factor, which determines a multiplier to be applied to each processor type in order to calculate the number of licenses required. (At the time of writing the latest table is here but always check for an updated version.) So if you have a server with two sockets containing Intel Xeon E5-2690 processors (each of which has eight cores, giving a total of sixteen) you would multiply this by Oracle’s core factor of 0.5 meaning you need a total of 16 x 0.5 = 8 licenses. That’s eight licenses for Enterprise Edition, eight licenses for Partitioning and eight licenses for the Tuning Pack.
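A tiny helper makes the worked example concrete – Oracle rounds fractional results up to a whole number of licenses, so the calculation uses a ceiling. The prices are the list figures quoted above, before any discount.

import math

# Licenses required = total physical cores x core factor, rounded up.
def licenses_required(sockets: int, cores_per_socket: int, core_factor: float) -> int:
    return math.ceil(sockets * cores_per_socket * core_factor)

n = licenses_required(sockets=2, cores_per_socket=8, core_factor=0.5)
print(n)                                # 8 licenses each for EE, Partitioning and the Tuning Pack
print(n * (47_500 + 11_500 + 5_000))    # 512000 - list price before any discount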

Mid-range database price breakdown

What else do we need? Well there’s the server cost, obviously. A mid-range Xeon-based system isn’t going to be much more than $16,000. Let’s also add the Oracle Linux operating system (one throat to choke!) for which Premier Support is currently listing at $6,897 for three years per system. We’ll need Oracle’s support and maintenance of all these products too – traditionally Oracle sells support at 22% of the net license cost (i.e. what you paid rather than the list price), per year. As with everything in this post, the price / percentage isn’t guaranteed (speak to Oracle if you want a quote) but it’s good enough for this rough sketch.

Finally, we need some storage. Since I’m actually describing from memory an existing environment I’ve worked on in the past, I’m going to use a legacy mid-range disk array priced at $7 per GB – and I want 10TB of usable storage. It’s got some SSD in it and some DRAM cache but obviously it’s still leagues apart from an enterprise flash array.

Price Breakdown

That’s everything. I’m not going to bother with a proper TCO analysis, so these are just the costs of hardware, software and support. If you’ve read this far your peripheral vision will already have taken in the graph below, so I can’t ask you to take a guess… but think about your preconceptions. Of the total price, how much did you think the storage was going to be? And how much of the total did you think would go to the database vendor?

Mid-range database price breakdown by component

The storage is just 17% of the total, while the database vendor gets a whopping 80%. That’s four-fifths… and they don’t even have to deal with the logistics of shipping and installing a hardware product!

Still, the total price is “only” $430k, so it’s not in the millions of dollars, plus you might be able to negotiate a better discount. But ask yourself this: what would happen if you added Oracle Real Application Clusters (currently listing at $23,000 per processor) to the mix? You’d need to add a whole set of additional nodes too. The price just went through the roof. What about if you used a big 80-core NUMA server… thereby increasing the license cost by a factor of five (16 cores to 80)? Kerching!

Performance and Cost are Interdependent

There are two points I want to make here. One is that the cost of storage is often relatively small in terms of the total cost. If a large amount of money is being spent on licensing the environment it makes sense to ensure that the storage enables better performance, i.e. results in a better return on investment.

The second point is more subtle – but even more important. Look at the price calculations above and think about how important the number of CPU cores is. It makes a massive difference to the overall cost, right? So if that’s the case, how important do you think it is that you use the best CPUs? If CPU type A gives significantly better performance than CPU type B, it’s imperative that you use the former because the (license-related) cost of adding more CPU is prohibitive.

Yet many environments are held back by CPUs that are stuck waiting on I/O. This is bad news for end users and applications, bad news for batch jobs and backups. But most of all, this is terrible news for data centre economics, because those CPUs are much, much more expensive than the price you pay to put them in the server.

There is more to come on this subject…

Storage Myths: Put Oracle Redo on SSD


Storage for DBAs: “My database is slow”… “Well then why not put your redo logs on SSDs?” Gaaaah. I still hear people having this discussion and it drives me mad. “Nobody got fired for putting Oracle redo on …<flash vendor>”. Yeah right, but does that mean it was worth the investment?

I’m bored of this line of illogical “reasoning”, so here are three reasons why you shouldn’t put your redo logs on SSD.

1. Solve The Right Problem

If a database is slow, find out why. Investigate, troubleshoot, resolve. Don’t throw hardware at it without understanding what the problem is. Redo is written by the Oracle Log Writer process – and the wait event log file parallel write covers the writing of redo records from the log buffer into the online redo log files. If you are seeing high average wait times for log file parallel write (or occasional high wait times in the Wait Event Histogram) maybe it’s time to investigate the speed of redo I/Os. Otherwise … leave it alone, or you are fixing the wrong issue.

Also, let’s not confuse the wait event log file sync with log file parallel write. Log file sync is experienced by foreground processes waiting on the log writer to complete a flush of the log buffer to storage. It’s tempting to assume high log file sync times are therefore a consequence of slow log writes, but as Kevin Closson points out in this must-read article, most log file sync waits are actually processing issues where the log writer is not getting enough CPU time.
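As a rule of thumb, the comparison between those two wait events can be sketched out like this. The thresholds are illustrative assumptions rather than official guidance – treat it as a starting point for investigation, not a diagnosis.

# Compare the two average wait times as reported by AWR (values in milliseconds).
def redo_wait_triage(log_file_sync_ms: float, log_file_parallel_write_ms: float) -> str:
    if log_file_parallel_write_ms > 2.0:
        return "Writes to the online redo logs are slow - worth looking at redo I/O."
    if log_file_sync_ms > 3 * max(log_file_parallel_write_ms, 0.1):
        return ("Foregrounds wait far longer than the write itself takes - "
                "look at CPU starvation and log writer scheduling, not storage.")
    return "Redo I/O looks healthy - the bottleneck is probably elsewhere."

print(redo_wait_triage(log_file_sync_ms=12.0, log_file_parallel_write_ms=0.8))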

2. SSD Write Performance Sucks

Huh? You thought I was pro-SSD right? Ok so I’m being a bit crafty, because the terms SSD and Flash are not really synonymous. SSD stands for Solid State Disk (or Device depending on who you ask), which generally means a set of flash chips crafted into the shape of a hard disk drive and plugged into an HDD-shaped hole somewhere via the use of a Flash Memory Controller. This interface takes page-based flash memory and makes it look like block-based storage – and each SSD in an array has its own controller.

There is a fundamental difference between an all-flash array and a set of SSDs masquerading as disks: an all-flash array can manage the flash holistically while the SSD-populated array cannot. This matters because flash is awkward to work with – for example, flash pages must be erased before they are written to – a process which is both slow and cumbersome, since other pages are locked (even from reads) during an erase.

The all-flash array is able to avoid the consequences of these restrictions by managing the flash globally, so that erases do not block reads and writes. In contrast, SSDs shoved into a disk array cannot communicate with each other to indicate when they are busy performing this garbage collection process, resulting in unpredictable performance and horrible spikes in latency as I/Os queue up behind the erase process.

3. Disk Is Good Enough


You didn’t expect me to say that, did you? Don’t get me wrong, disk is terrible at random I/O. Really, truly awful. But here’s the thing: the Oracle log writer performs large, sequential writes. And disk is ok with sequential I/O, particularly if you are using faster spindles like the 15k RPM drives.

Flushing the log buffer to storage involves writing some multiple of the redo log block size (512 bytes by default but configurable to 1024 or 4096 bytes from Oracle version 11.2). If your system is busy enough that you believe you have redo performance issues, it seems likely that those writes will be larger as more redo is created per log flush. The larger the write, the more efficient it will be on disk as the impact of the initial seek time is averaged out.
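A quick way to gauge how large your log writer’s writes actually are is to divide the “redo size” rate by the “redo writes” rate, both of which appear in an AWR report’s instance activity statistics. The figures below are made-up examples.

redo_bytes_per_sec = 45_000_000     # "redo size" per second from AWR (example value)
redo_writes_per_sec = 900           # "redo writes" per second from AWR (example value)

avg_write_kb = redo_bytes_per_sec / redo_writes_per_sec / 1024
print(f"Average redo write is roughly {avg_write_kb:.0f} KB")   # ~49 KB - a large, sequential write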

But hey, don’t take my word for it. Trust the evidence – and it turns out there is a wealth of data out there for anyone to analyse… right here: http://www.tpc.org/

The thing about TPC-C benchmarks is that they generate redo like you wouldn’t believe. So if anyone needs the ultimate redo performance it’s a system like this one, which set a world record back in September 2012 (which Oracle crowed about in its usual classy way by using it to bash IBM). The great thing about TPC results is that they come with a complete full disclosure report so you can see just how the vendors did it. And in the full disclosure report for this submission, where was the redo located? On a RAID set consisting of 600GB 15K RPM disk drives (see page 21). If disk is fast enough for a world record, it’s fast enough for you.

Incidentally, the datafiles in that benchmark were located on 2x Violin Memory 6616 arrays – which also tells you something important: if you are migrating from disk to flash, the first thing you need to move is the primary data, not the redo.

The Counter-Argument: Flash is Not SSD

Now I don’t want to wrap this article up giving you the impression that you shouldn’t move your redo logs to flash memory, so I’ll leave you with some counter arguments to the above. When I build a database, I always put the redo logs on flash (not on SSD mind, but on a flash memory array). Here’s why:

1. Violin Isn’t Limited By Writes

I know, I know … that sounds like a sales pitch. I usually try to talk about flash in general, which is why I originally wrote “All-flash arrays aren’t limited by writes”, but the truth is I don’t know other all-flash arrays to the extent that I know Violin… so forgive me for sticking with what I know.

I’ll explain Violin’s methods for guaranteeing sustained ultra-low write latency some other day. For now, let’s just see the evidence:

Load Profile              Per Second    Per Transaction
~~~~~~~~~~~~         ---------------    ---------------
      DB Time(s):              197.6                2.8
       DB CPU(s):               18.8                0.3
       Redo size:    1,477,126,876.3       20,568,059.6
   Logical reads:          896,951.0           12,489.5
   Block changes:          672,039.3            9,357.7
  Physical reads:           15,529.0              216.2
 Physical writes:          166,099.8            2,312.8

That’s over 1.4GB/sec of sustained redo generation from a 5 minute snapshot (see this post for details) using just a single Violin Memory 6616 array connected over 8Gb fibre channel. The AWR snapshot was 5 minutes long but the workload had been running for an extended period prior to the capture. Don’t leave here with the illusion that redo on flash memory isn’t blindingly fast.

2. Your First Design Goal Should Be Simplicity

There is a quote often attributed to Albert Einstein which says, “Everything should be made as simple as possible, but no simpler“. This applies perfectly to system design – and is one reason why I always recommend an all-flash database design over a flash and disk hybrid. Yes it’s possible to put some datafiles here and others over there, redo logs on disk and primary data on flash, etc. But the simplest design is to put everything on high performance, low latency flash. Is it the cheapest solution? Maybe not always on list price, but it probably will be based on TCO.

Conclusion

Look, if you want to put your redo logs on flash, I’m not going to argue. I’m not saying that it’s a bad thing.

What is a bad thing though is the practice of taking a disk-based database and sticking some SSDs in to house the redo logs. That’s just silly. The first part of the database you should move to flash is the primary data. If it makes sense to relocate the whole database (which it almost always does, because that disk array doesn’t belong in your data centre anymore – it belongs in a museum) then go for it. Just don’t compromise on having only the redo logs on flash or SSD, because then you have essentially built yourself an anti-TPC-C benchmarking system! And what’s the opposite of a system that goes really fast…?

Storage Myths: IOPS Matter


Storage for DBAs: Having now spent over a year in the storage industry, I’ve decided it’s time to call out an industry-wide obsession that I previously wasn’t aware of: everyone in storage is obsessed with IOPS (the performance metric I/O Operations Per Second). Take a minute to perform a web search for “flash iops” and you’ll see countless headlines from vendors that have broken new IOPS records – and yes, these days my own employer is often one of them. You’d be forgiven for thinking that, in storage, IOPS was the most important thing ever.

I’m here to tell you that it isn’t. At least, not if databases are your game.

Fundamental Characteristics of Storage

In a previous article I described the three fundamental characteristics of storage: latency, IOPS and bandwidth (or throughput). I even drew a simple, boxy diagram which, despite being one of the least-inspiring pieces of artwork ever created, serves me well enough to warrant its inclusion again here. These three properties are related – when one changes, the others change. With that in mind, here’s lesson #1:

High numbers of IOPS are useless unless they are delivered at low latency.

It’s all very well saying you can supply 1 million, 2 million, 4 million IOPS but if the latency sucks it’s not going to be of much value in the real world. Flash is great for delivering higher numbers of IOPS than disk, particularly for random I/Os (as I’ve written about previously), but ultimately the delay introduced by high latency is going to make real-world workloads unusable.

And there’s another, oft-hidden, problem that many flash vendors face: unpredictable latency. This is particularly the case during write-heavy workloads where garbage collection cannot always keep up with load, resulting in the infamous “write cliff” (more technically described as bandwidth degradation – see figure 7 of this paper). Maybe we should revise that previous line to be lesson #1.5:

High numbers of IOPS are useless unless they are delivered at predictable low latency.

But what about when we deal with volumes of data? If your requirement is to process vast amounts of information, do IOPS then become more important? Not really, because this is a bandwidth challenge – you need to design and build a system to suit your bandwidth requirements. How many GB/sec do you need? What can the storage subsystem deliver and how fast can you process it? Unlike bandwidth, an IOPS measurement does not contain the critical component of block size, so information is missing. And if you have the bandwidth figures, there is little additional value in knowing the IOPS, is there? Cue lesson #2:

Bandwidth figures are more useful for describing data volumes than IOPS
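To put lesson #2 in numbers: the same IOPS figure translates into wildly different data volumes depending on the block size, which is exactly why an IOPS number on its own tells you so little. The block sizes below are arbitrary examples.

iops = 100_000
for block_size_kb in (0.5, 8, 64, 1024):
    mb_per_sec = iops * block_size_kb / 1024
    print(f"{iops:,} IOPS at {block_size_kb:>6} KB per I/O = {mb_per_sec:>9,.1f} MB/sec")

# 100,000 IOPS is ~49 MB/sec at 512-byte I/Os but nearly 100 GB/sec at 1 MB I/Os.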

So what good are IOPS figures? And why does the storage industry talk about them all the time? Personally I think it’s a hang-up from the days of disk, when IOPS were such a limiting factor… and partly a marketing thing, because multi-million results sound impressive. Who knows? I’m more interested in what we should be asking about than what we shouldn’t.

So what does matter?

Latency Is King

Forget everything else. Latency is the critical factor because this is what injects delay into your system. Latency means lost time; time that could have been spent busily producing results, but is instead spent waiting for I/O resources.

Forget IOPS. The whole point of a flash array is that IOPS effectively become an unlimited resource. Sure, there is always a real limit – but it’s so high that it’s no longer necessary to worry about it.

Bandwidth still matters, particularly when you are doing something which requires volume, such as analytics or data warehousing. But bandwidth is a question of plumbing – designing a solution with enough capability to deliver throughout the stack: storage, fabric, HBAs, network, processor… build it and the data will come.

Latency is the “application stealth tax”, extending the elapsed times of individual tasks and processes until everything takes slightly longer. Add up all those delays and you have a significant problem. This is why, when you consider buying flash, you need to test the latency – and not just at the storage level, but end-to-end via the application (I’ll talk about this more in a following post).

“But I Don’t Need That Many IOPS…”

This is a classic misunderstanding, often the result of confusion brought about by FUD from storage vendors who cannot deliver at the higher end of the market. To repeat my previous statement, with a good flash system IOPS will effectively become an unlimited resource. This does not mean that it’s overkill for your needs. There is no point in spending more money on a solution than is necessary, but IOPS is not the indicator you should use to determine this – decisions like that should rely entirely on business requirements. I have yet to see a business requirement that related to IOPS (emphasis on business rather than technical).

Business requirements tend to be along the lines of needing to supply trading reports faster, or reduce the time spent by call centre operatives waiting for their CRM screens to refresh. These almost always translate back into latency requirements. After all, the key to solving any performance issue is always to follow the time and find out where it is being spent. Have you noticed that latency is the only one of our three fundamental characteristics which is expressed solely in units of time?

Don’t get distracted by IOPS… it’s all about latency.

The Most Important Thing You Need To Know About Flash


Storage for DBAs: There are many things you need to know about flash: its performance, its behaviour, its durability etc. But there’s one single piece of information which tells you more than anything else, because it gives you an insight into the future of not just flash memory, but the primary data storage industry. Let me explain, but first let me contrast against something we are all more familiar with: disk drives.

Disk Drive Market History

Almost all disk drives are made by three manufacturers: Western Digital, Seagate and Toshiba. There used to be a lot more than that, but all the others either went out of business or got acquired. These are tough times for disk drive manufacturers, with sales expected to take a double-digit dive in 2013. It was not always this way though; for decades the thirst for HDDs was unquenchable, with large volumes of them being required in desktop PCs (remember them?) as well as enterprise disk arrays. Fast forward to the present day and desktop PCs have been ambushed by tablets and SSDs, while flash is now similarly disrupting the data centre.

For a minute though, let’s remember the golden era of disk. If we rewind around eight years ago, the disk industry was thriving. To quote a TrendFocus report published in Businesswire (emphasis added by me):

The industry’s 25% unit growth in 2005 was based on solid fundamentals in core markets like PCs and servers… Booming notebook PC sales caused a surge in 2.5″ HDD shipments to 77 million, a 45% growth. Enterprise HDD shipments grew 11% to over 26 million units. HDD industry revenue was $28 billion, an increase of 18% from 2004.

An 18% annual increase… that’s impressive! As the quote states, this growth was built on the “solid fundamentals” of PCs and Servers – nobody foresaw the end of the PC disk market, because a new and exciting range of “Notebook PCs” seemed ready to drive demand even higher. And on top of that, two new market segments were growing rapidly: Consumer Electronics and Mobile. The graph on the left was created in 2005 and showed HDD market predictions going forward until 2010. The yellow and light blue colours indicate CE and Mobile respectively – you can see that there was a lot of optimism for the future – while the reddish colour indicates the huge returns from desktop PCs (while the Enterprise segment is merely a small purple brick at the bottom of each column). But somewhere at the bottom of a 2005 Q4 report published later that year were the first indications of a changing tide:

Shipments accounted for slightly less than the 20 percent forecast, and the drop-off is attributed to Apple’s decision to move away from 1-inch-based hard drives for its iPod mini business.

What did Apple move to? Take a guess. Very slowly, but very surely, the potential CE and Mobile markets evaporated. At the same time, the Desktop PC business died a lingering death, leaving enterprise storage as the mainstay for hard drives.

There’s one thing which remained constant through this period though: the enterprise HDD market. Enterprise Storage vendors like EMC and NetApp bought up massive volumes of disk drives to put in enterprise class arrays for their customers – and these massive volumes meant two critical outcomes:

  1. The larger Enterprise Storage vendors bought at enormous discounts
  2. HDD vendors selling to Enterprise Storage focussed their R&D efforts on improving the characteristics that these customers desired

The characteristics required for enterprise storage were density and performance. It’s a simple case of supply and demand. As with all business, the demand shapes the supply.

NAND Flash Market Forces

One thing that flash has in common with disk is the relatively small number of manufacturers. Note that I’m not talking about companies like Violin here, I’m talking about the flash chip manufacturers who own the fabs. In 2012 the NAND flash market consisted of Samsung (38%), Toshiba (28%) – the inventor of flash, Micron (14%) and Hynix (12%). But who was the largest buyer worldwide? Was it EMC? Netapp? IBM, HP or Dell?

2013 NAND flash market by application

In 2011, Apple became the largest worldwide consumer of NAND flash. And as I’m sure you can guess, the reason for this was the iPhone. Today, according to IC Insights, the majority of NAND flash (59%) is used in smartphones, tablets and portable devices, with another 17% used in USB keys and cameras. If you look at the pie chart on the right, that little red portion marked SSD (just 13%) comprises all the flash used in both enterprise storage and consumer solid state drives (e.g. the ones you might get in an ultrabook).

And that trend is only going one way. By the end of 2013 it is forecast that there will be nearly 1.5 billion smartphones in the world – one smartphone for every five people. Meanwhile, tablets are not only the fastest-growing segment but also one of the fastest-growing consumer devices of all time.

What does this mean? It means that NAND flash development is driven by the consumer market, by smartphones and portable devices. In enterprise storage, when we talk about flash we always talk about performance and endurance – but the consumer market isn’t interested in either of these. The consumer market is interested in density, i.e. how much data you can fit on a chip, as well as power consumption and cost. If a NAND flash manufacturer could produce larger flash chips at the cost of 20% slower performance, for example, this would be considered a great result. There’s a fundamental difference in requirements between the consumer and enterprise markets: only the enterprise cares about performance.

The Balance Of Power

In the heyday of disk, the enterprise storage industry had serious influence on what came out of the factories. But with NAND flash, the power of the enterprise storage industry to influence the direction of development is clearly far weaker. Sure there are relationships between flash storage vendors and the manufacturers – in fact, one of the strongest is between Violin and Toshiba – but market forces dictate that NAND flash development will be mostly influenced by the consumer market: the phone in your pocket and the tablet on your desk. (Don’t be confused by claims of “enterprise-class” NAND flash either – the key is to follow where the billions of dollars of R&D money are going, i.e. the consumer market. Enterprise-class flash is merely the least-consumer-like consumer flash…)

What does this mean for enterprise storage? It’s simple – it means that each enterprise vendor will have to take “consumer” flash and come up with innovative ways to make it perform like an enterprise product. Each flash vendor needs to do this to deliver the performance you need. Anybody can take a bunch of flash cards or SSDs and put them in a box, but that’s not innovation. The flash vendors who survive the great flash market consolidation will be the ones with intellectual property and patents around making consumer NAND flash perform for the enterprise.

Understand this and you will know the most important thing to ask a potential flash vendor about their product is not “How fast is it?”, or “How long does it last?”, but “Where’s your innovation?”. After all, if your vendor isn’t adding anything to the equation, you might as well be doing it yourself…

Does My Database Need Flash?


Storage for DBAs: Here’s a question I get asked a lot: “Does my database need flash?”. In fact it’s the most common question customers have, followed by the alternative version, “Does my database need SSD?”. In fact, often customers already have some SSDs in their disk arrays but still see poor performance, so really I ought to wind it back a level and call this article, “Does my database need low latency storage?”. This would in fact be a much better headline from a technical perspective, but until I change the name of this site to LowLatencyDBA I’m sticking with the current title.

Flash is no longer a cutting edge new technology, it’s a mainstream product sold by almost every storage vendor. This means that you or your organisation will probably already have some flash sales person beating down your door to flog you some sort of flash product, whether it’s an all-flash array, a hybrid flash/disk system or a set of PCIe flash cards. While these products are diverse in nature, they all share two main characteristics: low latency and large numbers of IOPS. But how do you know whether you really need them?

In a later post I’ll be running through the questions which I think need to be asked in order to whittle down the massive list of flash vendors to the select few capable of servicing your needs. This, of course, will be difficult to achieve without being biased towards my own employer – but that’s a problem for another day. For now, here’s the first (and potentially most important) step: working out whether you actually need low latency flash storage in the first place.

Who Needs Flash?

For the world of databases, there are three main reasons why you might want to switch to low latency flash:

Acceleration – perhaps the most obvious reason is to go faster. There are many reasons why people desire better performance, but they generally boil down to one of two scenarios: Not Good Enough Now and Not Good Enough For The Future. In the former, bad performance is holding back an application, denying potential revenue or incurring penalties in some way (either SLA-based financial penalties or simply the loss of customers due to poor service levels). In the latter, existing infrastructure is incapable of allowing increased agility, i.e. the ability to do more (offering new services for example, or adding more concurrent users).

Consolidation – always on the mind of CIOs and CTOs is the benefit of consolidating database and server estates. Consolidation brings agility and risk benefits as well as the new and important benefit of cost savings. By consolidating (and standardising) multiple databases onto a smaller pool of servers, organisations save money on hardware, on maintenance and administration, and on the holy grail of all cost savings: software license fees. If you think that sounds like an exaggeration, take a look at this article on Wikibon which demonstrates that Oracle license costs account for 82% of the total cost of a traditional database deployment. Consolidation allows for reduced CPU cores, which means a reduction in the number of licenses, but it also increases I/O as workloads are “stacked” on the same infrastructure. The Wikibon article argues that by moving to flash storage and consolidating, the total cost drops significantly – by around 26% in fact.

Virtualisation – an increasingly prevalent option in the database world. The use of server virtualisation technologies is allowing organisations to move to cloud architectures, where environments are automatically provisioned, managed and migrated across hardware. Virtualisation brings massive agility benefits but also carries a risk because, just like with consolidation, I/O workloads accumulate on the same infrastructure. Unlike consolidation though, virtualisation adds an extra layer of latency, making the I/O even more of a potential bottleneck. Flash systems now make this option practical, as hypervisor vendors begin to realise the potential of flash memory.

There is actually a fourth reason, which is Infrastructure Optimisation. If you have data centres stuffed with disk arrays there is every chance that they can be replaced by a small number of flash arrays, thus reducing power, cooling and real estate requirements and saving large amounts of money. But as this article is primarily targeted at databases I thought I’d leave that one out for now. Consider it the icing on the cake… but don’t forget it, because sometimes it turns out that there’s a lot of icing.

So now we know the reasons why, let’s have a look at which sorts of systems are suitable for flash and which aren’t, starting with the Performance requirement…

Databases Love Flash If…

  • They create lots of I/O! I know, it sounds obvious, but more than once I’ve seen customers with CPU-bound applications that generate hardly any I/O. Flash is a fantastic technology, but it’s not magic.
  • There is lots of random I/O. Now don’t take that the wrong way – sequential I/O is good too. But if you currently have a random I/O workload running on a disk system you will see the most dramatic benefit after switching that to flash. Here’s why.
  • High amounts of parallelism. The simple fact is that a single process cannot drive anywhere near the amount of I/O that a good flash system can support. If you think of flash as being like a highway, not only is it fast, it’s also wide. Use all the lanes.
  • Large IOWAIT times. If you are using an operating system that has a concept of IOWAIT (Linux and most versions of UNIX do, Windows doesn’t) then this can be a great indicator that processes are stuck waiting on I/O. It’s not perfect though, because IOWAIT is actually an idle wait (within the operating system, this is nothing to do with Oracle wait events) so if the system is really busy it may not be present.

Those are all great indicators, but the next two should be considered the golden rules:

  • I/O wait times are high. Essentially we are looking for high latency from the existing storage system. Flash memory systems should deliver I/O with sub-millisecond latency, so if you see an average latency of 8ms on random reads (db file sequential read), for example, you know there is potential for reducing latency to an eighth of its previous average value.
  • I/O forms a significant percentage of Database Time. If I/O is only responsible for 5% of database time, no amount of lightning-fast flash is going to give you a big performance boost… your problems are elsewhere. On the other hand, if I/O comprises a large portion of database time, you have lots of room for improvement – the short sketch after this list shows the arithmetic. (I plan to post a guide to reading AWR Reports pretty soon)
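Here’s that back-of-the-envelope arithmetic as a sketch. Only the I/O slice of database time shrinks when latency improves, so the fraction of DB time spent on I/O caps the overall gain – the numbers below are purely illustrative.

# Amdahl-style estimate: (1 - io_fraction) stays the same, while the I/O slice
# scales with the ratio of new latency to old latency.
def estimated_speedup(io_fraction: float, old_latency_ms: float, new_latency_ms: float) -> float:
    ratio = new_latency_ms / old_latency_ms
    return 1.0 / ((1.0 - io_fraction) + io_fraction * ratio)

print(estimated_speedup(0.60, old_latency_ms=8.0, new_latency_ms=1.0))  # ~2.1x overall
print(estimated_speedup(0.05, old_latency_ms=8.0, new_latency_ms=1.0))  # ~1.05x - barely noticeable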

If any of this is ticking boxes for you, it’s time to consider what flash could do for the performance of your database. On the other hand…

Performance Won’t Improve If…

  • There isn’t any I/O. Any flash vendor in the industry would be happy to sell you their products in this situation – and let’s face it you’ll get great latency! – but be realistic. If you don’t generate I/O, what’s the point? Unless of course you aren’t after performance. If consolidation, virtualisation or infrastructure optimisation is your aim, there could be a benefit. Also, consider the size of your memory components – if your database produces no physical I/O, could you consider reducing the size of the buffer cache? One of the big benefits of flash to consolidation is the ability to reduce SGA sizes and thus fit more databases onto the same DRAM-restricted server.
  • Single threaded workloads. Sure your application will run slightly faster, but will that speed-up be enough to justify the change of infrastructure? I’m not ruling this out – I have customers with single-threaded ETL jobs that bought flash because it was easier (and cheaper) than rewriting legacy code, but the impact of low-latency storage may well be reduced.
  • Application serialisation points. A session waiting on a lock will not wait any faster! Basically, if your application regularly ties itself in a knot with locks and contention issues, putting it on flash may well just increase the speed at which you hit those problems. Sometimes people use flash to overcome bad programming, but it’s by no means guaranteed to work.
  • CPU-bound systems. CPU starvation is a CPU problem, not an I/O problem. If anything, moving to low-latency storage will reduce the amount of time CPUs spent waiting on I/O and thus increase the amount of time they spend working, i.e. in a busy state. If your CPU is close to the limit and you remove the ballast that is a disk system, you might find that you hit the limit very quickly.

If you are unfortunate enough to be struggling with a badly-performing application that fits into one of these areas, flash probably isn’t the magic bullet you’re looking for.

Consolidation and Virtualisation

This is a different area where it’s no longer valid to only look at individual databases and their workloads. The key factor for both of these areas is density i.e. the number of databases or virtual machines that can fit on a single physical server. The main challenges here are memory usage and I/O generation: database SGAs tend to be large, but flash allows for the possibility of reducing the buffer cache; while I/O generation is a problem in the disk world because consolidated workloads tend to create more random I/O. Of course, with flash that’s not really a problem. I’ve written a number of articles on consolidation and virtualisation in the past – I’m sure I’ll be writing more about them in the future too.

Summary

I work for a flash vendor – we want you to buy our products. We have competitors who want you to buy their products instead. If everyone in the industry is telling you to buy flash, how do you know if it’s relevant to you? Here’s my advice: make them speak your language and then check their claims against what you can see yourself.

Take some time to understand your workload. Look at the amount of I/O generated and the latency experienced; look at how random the workload is and the ratio of reads to writes (I’ll post a guide for this soon). Ask your (potential) flash vendor how much benefit you will see from your existing storage and then get them to explain why. If you’re a database person, make them speak in your language – don’t accept someone talking in the language of storage. Likewise if you’re an application person make them explain the benefits from an application perspective. You’re the customer, after all.

If your flash vendor can’t communicate with you in your language to explain the benefit you will see, there’s only one course of action: Get rid of them in a flash.

Footnote

Incidentally, if you live outside the UK and you’re wondering about the picture at the top of this article, check out this. If you live inside the UK you will know it’s a Cillit Bang reference… unless you live in a cave and shun the outside world – in which case, how are you reading this?

Understanding I/O: Random vs Sequential

Storage for DBAs: Ever been to one of those sushi restaurants where the food comes round in dishes on a conveyor belt? As each dish travels around the loop you eye it up and, as long as you can make your mind up in time, grab it. However, if you are as indecisive as me, there’s a chance it will be out of range before you come to your senses – in which case you have to wait for it to complete a further full revolution before getting another chance. And that’s assuming someone else doesn’t get to it first.

Let’s assume that it takes a dish exactly 4 minutes to complete a whole lap of the conveyor belt. And just for simplicity’s sake let’s also assume that no two dishes on the belt are identical. As a hungry diner you look in the little menu and see a particular dish which you decide you want. It’s somewhere on the belt, so how long will it take to arrive?

Probability dictates that it could be anywhere on the belt. It could be passing by right now, requiring no wait time – or it could have just passed out of reach, thus requiring 4 minutes of wait time to go all the way round again. As you follow this random method (choose from the menu then look at the belt) it makes sense that the average wait time will tend towards half way between the min and max wait times, i.e. 2 minutes in this case. So every time you pick a dish you wait an average of 2 minutes: if you have eight dishes the odds say that you will spend (8 x 2) = 16 minutes waiting for your food. Welcome to the disk data diet, I hope you weren’t too hungry?

Now let’s consider an alternative option, where you order eight dishes from the chef and he or she places all of them sequentially (i.e. next to each other) somewhere on the conveyor belt. That location is random, so again you might have to wait anywhere between 0 and 4 minutes (an average of 2 minutes) for the first dish to pass… but the next seven will follow one after the other with no wait time. So now, in this scenario, you only had to wait 2 minutes for all eight dishes. Much better.
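If you want to check the arithmetic, here’s a quick simulation of the conveyor-belt analogy – eight random picks versus eight dishes placed sequentially, with a 4-minute lap time. The numbers are just the analogy’s, nothing more.

import random

LAP_MINUTES, DISHES, TRIALS = 4.0, 8, 100_000

# Random: each dish is somewhere on the belt, so each pick waits 0-4 minutes.
random_total = sum(sum(random.uniform(0, LAP_MINUTES) for _ in range(DISHES))
                   for _ in range(TRIALS)) / TRIALS

# Sequential: one wait for the first dish, the other seven follow immediately.
sequential_total = sum(random.uniform(0, LAP_MINUTES) for _ in range(TRIALS)) / TRIALS

print(f"Random picks:      ~{random_total:.1f} minutes waiting")      # ~16 minutes
print(f"Sequential dishes: ~{sequential_total:.1f} minutes waiting")  # ~2 minutes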

I’m sure you will have seen through my analogy right from the start. The conveyor belt is a hard disk and the sushi dishes are blocks which are being eaten / read. I haven’t yet worked out how to factor a bottle of Asahi Super Dry into this story, but I’ll have one all the same thanks.

Random versus Sequential I/O

I have another article planned for later in this series which describes the inescapable mechanics of disk. For now though, I’ll outline the basics: every time you need to access a block on a disk drive, the disk actuator arm has to move the head to the correct track (the seek time), then the disk platter has to rotate to locate the correct sector (the rotational latency). This mechanical action takes time, just like the sushi travelling around the conveyor belt.

Obviously the amount of time depends on where the head was previously located and how fortunate you are with the location of the sector on the platter: if it’s directly under the head you do not need to wait, but if it just passed the head you have to wait for a complete revolution. Even on the fastest 15k RPM disk that takes 4 milliseconds (15,000 rotations per minute = 250 rotations per second, which means one rotation is 1/250th of a second or 4ms). Admittedly that’s faster than the sushi in my earlier analogy, but the chances are you will need to read or write a far larger number of blocks than I can eat sushi dishes (and trust me, on a good day I can pack a fair few away).

What about the next block? Well, if that next block is somewhere else on the disk, you will need to incur the same penalties of seek time and rotational latency. We call this type of operation a random I/O. But if the next block happened to be located directly after the previous one on the same track, the disk head would encounter it immediately afterwards, incurring no wait time (i.e. no latency). This, of course, is a sequential I/O.

Size Matters

In my last post I described the Fundamental Characteristics of Storage: Latency, IOPS and Bandwidth (or Throughput). As a reminder, IOPS stands for I/Os Per Second and indicates the number of distinct Input/Output operations (i.e. reads or writes) that can take place within one second. You might use an IOPS figure to describe the amount of I/O created by a database, or you might use it when defining the maximum performance of a storage system. One is a real-world value and the other a theoretical maximum, but they both use the term IOPS.

When describing volumes of data, things are slightly different. Bandwidth is usually used to describe the maximum theoretical limit of data transfer, while throughput is used to describe a real-world measurement. You might say that the bandwidth is the maximum possible throughput. Bandwidth and throughput figures are usually given in units of size over units of time, e.g. Mb/sec or GB/sec. It pays to look carefully at whether the unit is using bits (b) or bytes (B), otherwise you are likely to end up looking a bit silly (sadly, I speak from experience).

In the previous post we stated that IOPS and throughput were related by the following relationship:

Throughput   =   IOPS   x   I/O size

It’s time to start thinking about that I/O size now. If we read or write a single random block in one second then the number of IOPS is 1 and the I/O size is also 1 (I’m using a unit of “blocks” to keep things simple). The Throughput can therefore be calculated as (1 x 1) = 1 block / second.

Alternatively, if we wanted to read or write eight contiguous blocks from disk as a sequential operation then this again would only result in the number of IOPS being 1, but this time the I/O size is 8. The throughput is therefore calculated as (1 x 8) = 8 blocks / second.

Hopefully you can see from this example the great benefit of sequential I/O on disk systems: it allows increased throughput. Every time you increase the I/O size you get a corresponding increase in throughput, while the IOPS figure remains resolutely fixed. But what happens if you increase the number of IOPS?
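
For completeness, here is the same arithmetic as a trivial Python sketch (the unit is still "blocks", exactly as in the example above):

    def throughput(iops, io_size_blocks):
        # Throughput = IOPS x I/O size
        return iops * io_size_blocks

    print(throughput(1, 1))   # one random single-block I/O  -> 1 block/sec
    print(throughput(1, 8))   # one sequential 8-block I/O   -> 8 blocks/sec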

Latency Kills Disk Performance

In the example above I described a single-threaded process reading or writing a single random block on a disk. That I/O results in a certain amount of latency, as described earlier on (the seek time and rotational latency). We know that a full rotation of a 15k RPM disk takes 4ms, so on average the platter needs half a rotation (2ms); add a few more milliseconds for the disk head seek time and call the average I/O latency 5ms. How many (single-threaded) random IOPS can we perform if each operation incurs an average of 5ms wait? The answer is 1 second / 5 ms = 200 IOPS. Our process is hitting a physical limit of 200 IOPS on this disk.

What do you do if you need more IOPS? With a disk system you only really have one choice: add more disks. If each spindle can drive 200 IOPS and you require 80,000 IOPS then you need (80,000 / 200) = 400 spindles. Better clear some space in that data centre, eh?
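
Here is that back-of-an-envelope maths in Python. Note that the 3ms of seek time is simply an assumption chosen to land on the 5ms average latency used above, not a measured figure:

    ROTATION_MS    = 60_000 / 15_000          # 15k RPM -> 4ms per revolution
    AVG_LATENCY_MS = ROTATION_MS / 2 + 3      # half a spin plus ~3ms of seek time (assumed)

    disk_iops     = 1_000 / AVG_LATENCY_MS    # 200 IOPS per spindle
    required_iops = 80_000

    print(disk_iops, required_iops / disk_iops)   # 200.0 400.0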

On the other hand, if you can perform the I/O sequentially you may be able to reduce the IOPS requirement and increase the throughput, allowing the disk system to deliver more data. I know of Oracle customers who spend large amounts of time and resources carving up and re-ordering their data in order to allow queries to perform sequential I/O. They figure that the penalty incurred from all of this preparation is worth it in the long run, as subsequent queries perform better. That's no surprise when the alternative is to add an extra wing to the data centre to house another bunch of disk arrays, plus more power and cooling to run them. This sort of "no pain, no gain" mentality used to be commonplace because there really weren't any other options. Until now.

Flash Offers Another Way

The idea of sequential I/O doesn't exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous. Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so don't really apply. And since the latency of flash is sub-millisecond, it should be clear that, even for a single-threaded process, a much larger number of IOPS is achievable. When we start considering concurrent operations things get even more interesting… but that topic is for another day.

Back to the sushi analogy: there is no longer a conveyor belt – the chefs are standing right in front of you. When you order a dish, it is placed in front of you immediately. Order a number of dishes and you might want to enlist the help of a few friends to eat in parallel, because the food will start arriving faster than you can eat it on your own. This is the world of flash memory, where hunger for data can be satisfied and appetites can be fulfilled. Time to break that disk diet, eh?

Looking back at the disk model, all that sitting around waiting for the sushi conveyor belt just takes too long. Sure you can add more conveyor belts or try to get all of your sushi dishes arranged in a line, but at the end of the day the underlying problem remains: it’s disk. And now that there’s an alternative, disk just seems a bit too fishy to me…

The Fundamental Characteristics of Storage

Storage for DBAs: As a rule of thumb, pretty much any storage system can be characterised by three fundamental properties:

Latency is a measurement of delay in a system; so in the case of storage it is the time taken to respond to an I/O request. It’s a term which is frequently misused – more on this later – but when found in the context of a storage system’s data sheet it often means the average latency of a single I/O. Latency figures for disk are usually measured in milliseconds; for flash a more common unit of measurement would be microseconds.

IOPS (which stands for I/Os Per Second) represents the number of individual I/O operations taking place in a second. IOPS figures can be very useful, but only when you know a little bit about the nature of the I/O, such as its size and randomness. If you look at the data sheet for a storage product you will usually see a Max IOPS figure somewhere, with a footnote indicating the I/O size and nature.

Bandwidth (also known as throughput) is a measure of data volume over time – in other words, the amount of data that can be pushed or pulled through a system per second. Throughput figures are therefore usually given in units of MB/sec or GB/sec.

As the picture suggests, these properties are all related. It’s worth understanding how and why, because you will invariably need all three in the real world. It’s no good buying a storage system which can deliver massive numbers of IOPS, for example, if the latency will be terrible as a result.

The throughput is simply a product of the number of IOPS and the I/O size:

Throughput   =   IOPS   x   I/O size

So 2,048 IOPS with an 8k blocksize is (2,048 x 8k) = 16,384 kbytes/sec which is a throughput of 16MB/sec.
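
Or, as a quick Python sketch (note the division by 1024 because we are working in kilobytes and megabytes, not bits):

    def throughput_mb_per_sec(iops, io_size_kb):
        # Throughput = IOPS x I/O size (mind the units: bytes here, not bits!)
        return iops * io_size_kb / 1024

    print(throughput_mb_per_sec(2048, 8))    # 16.0 MB/sec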

The latency is also related, although not in such a strict mathematical sense. Simply put, the latency of a storage system will rise as it gets busier. We can measure how busy the system is by looking at either the IOPS or Throughput figures, but throughput unnecessarily introduces the variable of block size so let’s stick with IOPS. We can therefore say that the latency is proportional to the IOPS:

Latency   ∝   IOPS

I like the mathematical symbol in that last line because it makes me feel like I’m writing something intelligent, but to be honest it’s not really accurate. The proportional (∝) symbol suggests a direct relationship, but actually the latency of a system usually increases exponentially as it nears saturation point.
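
To illustrate the shape of that curve without borrowing anyone's real benchmark data, here is a toy model in Python. It assumes a single queue in front of a disk that needs 5ms per I/O – classic single-server queueing behaviour (M/M/1-style), not anything vendor-specific:

    SERVICE_MS = 5.0                   # assume a disk that needs 5ms per I/O
    MAX_IOPS   = 1_000 / SERVICE_MS    # so it saturates at 200 IOPS

    for iops in (20, 100, 160, 180, 190, 198):
        utilisation = iops / MAX_IOPS
        response_ms = SERVICE_MS / (1 - utilisation)   # response time grows as the queue builds
        print(f"{iops:>4} IOPS -> {response_ms:7.1f} ms")

Notice how the response time roughly doubles between 160 and 180 IOPS and then explodes as the load approaches the 200 IOPS ceiling.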

SPC Benchmark for HP 3PAR (17 Oct 2011)

We can see this if we plot a graph of latency versus IOPS – a common way of visualising performance characteristics in the storage world. The graph on the right shows the SPC benchmark results for an HP 3PAR disk system (submitted in 2011). See how the response time seems to hit a wall of maximum IOPS? Beyond this point, latency increases rapidly without the number of IOPS increasing. Even though there are only six data points on the graph it’s pretty easy to visualise where the limit of performance for this particular system is.

I said earlier that the term Latency is frequently misused – and just to prove it I misused it myself in the last paragraph. The SPC performance graph is actually plotting response time and not latency. These two terms, along with variations of the phrase I/O wait time, are often used interchangeably when they perhaps should not be.

According to Wikipedia, "Latency is a measure of time delay experienced in a system". If your database needs, for example, to read a block from disk then that action requires a certain amount of time. The time taken for the action to complete is the response time. If your user session is subsequently waiting for that I/O before it can continue (a blocking wait) then it experiences I/O wait time, which Oracle will chalk up to one of the regular wait events such as db file sequential read.

The latency is the amount of time taken until the device is ready to start reading the block, i.e. not including the time taken to complete the read. In the disk world this includes things like the seek time (moving the actuator arm to the correct track) and the rotational latency (spinning the platter to the correct sector), both of which are mechanical processes (and therefore slow).
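
To put some illustrative (not measured) numbers on that distinction, assume a 15k RPM drive with 3ms of seek time, 2ms of average rotational latency and a media transfer rate of around 200MB/sec reading a single 8KB block:

    SEEK_MS       = 3.0                          # moving the actuator arm (assumed)
    ROTATIONAL_MS = 2.0                          # half a spin at 15k RPM
    TRANSFER_MS   = 8 / (200 * 1024) * 1_000     # an 8KB block at an assumed ~200MB/sec media rate

    latency_ms       = SEEK_MS + ROTATIONAL_MS   # time before the read can even start
    response_time_ms = latency_ms + TRANSFER_MS  # what the database session actually waits for

    print(latency_ms, round(TRANSFER_MS, 3), round(response_time_ms, 3))   # 5.0 0.039 5.039

The transfer itself is a rounding error; it's the mechanical latency that dominates the response time.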

When I first began working for a storage vendor I found the intricacies of the terminology confusing – I suppose it's no different to people entering the database world for the first time. I began to realise that there is often a language barrier in I.T. as people with different technical specialties use different vocabularies to describe the same underlying phenomena. For example, a storage person might say that the array is experiencing "high latency" while the database admin says that there is "high User I/O wait time". The OS admin might look at the server statistics and comment on the "high levels of IOWAIT", yet the poor user trying to use the application is only able to describe it as "slow".

At the end of the day, it’s the application and its users that matter most, since without them there would be no need for the infrastructure. So with that in mind, let’s finish off this post by attempting to translate the terms above into the language of applications.

Translating Storage Into Application

Earlier we defined the three fundamental characteristics of storage. Now let’s attempt to translate them into the language of applications:


Latency is about application acceleration. If you are looking to improve user experience, if you want screens on your ERP system to refresh quicker, if you want release notes to come out of the warehouse printer faster… latency is critical. It is extremely important for highly transactional (OLTP) applications which require fast response times. Examples include call centre systems, CRM, trading, e-Business etc where real-time data is critical and the high latency of spinning disk has a direct negative impact on revenue.

IOPS is for application scalability. IOPS are required for scaling applications and increasing the workload, which most commonly means one of three things: in the OLTP space, increasing the number of concurrent users; in the data warehouse space, increasing the parallelism of batch processes; or in the consolidation / virtualisation space, increasing the number of database instances located on a single physical platform (i.e. the density). This last example is becoming ever more important as more and more enterprises consolidate their database estates to save on operational and licensing costs.

Bandwidth / Throughput is effectively the amount of data you can push or pull through your system. Obviously that makes it a critical requirement for batch jobs or data warehouse-type workloads where massive amounts of data need to be processed in order to aggregate and report, or to identify trends. Increased bandwidth allows batch processes to complete in less time and Extract Transform Load (ETL) jobs to run faster. And every DBA that ever lived has, at some point, had to deal with a batch process that was taking longer and longer until it started to overrun the window in which it was designed to fit…

Finally, a warning. As with any language there are subtleties and nuances which get lost in translation. The above “translation” is just a rough guide… the real message is to remember that I/O is driven by applications. Data sheets tell you the maximum performance of a product in ideal conditions, but the reality is that your applications are unique to your organisation so only you will know what they need. If you can understand what your I/O patterns look like using the three terms above, you are halfway to knowing what the best storage solution is for you…

Performance: It’s All About Balance…

Storage For DBAs: Everyone wants their stuff to go faster. Whether it's your laptop, tablet, phone, database or application… performance is one of the most desirable characteristics of any system. If your system isn't fast enough, you start dreaming of more. Maybe you try and tune what you already have, or maybe you upgrade to something better: you buy a phone with a faster processor, or stick an SSD in your laptop… or uninstall Windows 🙂

When it comes to databases, I often find people considering the same set of options for boosting performance (usually in this order): half-heartedly tuning the database, adding more DRAM, *properly* tuning the database, adding or upgrading CPUs, then finally tuning the application. It amazes me how much time, money and effort is often spent trying to avoid getting the application developers to write their code properly, but that’s a subject for another blog.

The point of this blog is the following statement: to achieve the best performance on any system it is important that all of its resources are balanced.

Let’s think about the basic resources that comprise a computer system such as a database server:


  • CPU – the processor, i.e. the thing that actually does the work. Every process pretty much exists to take some input, get on CPU, perform some calculations and produce some output. It’s no exaggeration to call this the heart of the system.
  • Network – communications with the outside world, whether it be the users, the application servers or other databases.
  • Memory – Dynamic Random Access Memory (DRAM) provides a store for data.
  • Storage – for example disk or flash; provides a store for data.

You’ll notice I’ve been a bit disingenuous by describing Memory and Storage the same way, but I want to make a point: both Memory and Storage are there to store data. Why have two different resources for what is essentially the same purpose?

The answer, which you obviously already know, is that DRAM is volatile (i.e. continuous power is required to maintain the stored information, otherwise it is lost) while Storage is persistent (i.e. the stored information remains in place until it is actively changed or removed).

When you think about it like that, the Storage resource has a big advantage over the Memory resource, because the data you are storing is safe from unexpected power loss. So why do we have the DRAM? What does it bring to the party? And why do I keep asking you questions you already know the answer to?

Ok I’ll get to the point, which is this: DRAM is used to drive up CPU utilisation.

The Long Walk

The CPU interacts with the Memory and Storage resources by sending or requesting data. Each request takes a certain amount of time – and that time can vary depending on factors such as the amount of data and whether the resource is busy. But let's ignore all that for now and just consider the minimum possible time taken to send or receive that data: the latency. CPUs have clock cycles, which you can think of as a metronome keeping the beat to which everything else must dance. That's a gross simplification which may make some people wince (read here if you want to know why), but I'm going to stick with it for the sake of clarity.

Let’s consider a 2GHz processor – by no means the fastest available clock speed out there today. The 2GHz indicates that the clock cycle is oscillating 2 billion times per second. That means one oscillation every half a nanosecond, which is such a tiny amount of time that we can’t really comprehend it, so instead I’m going to translate it into the act of walking, where each single pace is a clock cycle. With each step taken, an instruction can be executed, so:

One CPU Cycle = Walking 1 Pace

The current generation of DRAM is DDR3, which has latencies of around 10 nanoseconds. So now, while walking along, if you want to access data in DRAM you need to incur a penalty of 20 paces during which you potentially cannot do anything else.

Accessing DRAM = Walking 20 Paces

Now let’s consider storage – and in particular, our old friend the disk drive. I frequently see horrible latency problems with disk arrays (I guess it goes with the job) but I’ll be kind here and choose a latency of 5 milliseconds, which on a relatively busy system wouldn’t be too bad. 5 milliseconds is of course 5 million nanoseconds, which in our analogy is 10 million steps. According to the American College of Sports Medicine there are an average of 2,000 steps in one mile. So now, walking along and making an I/O request to disk incurs a penalty of 10,000,000 steps or 5,000 miles. Or, to put it another way:

Accessing Disk = Walking from London to San Francisco

Take a minute to consider the impact. Previously you were able to execute an instruction every step, but now you need to walk a fifth of the way around the planet before you can continue working. That’s going to impact your ability to get stuff done.

Maybe you think 5 milliseconds is high for disk latency (or maybe you think anyone walking from London to San Francisco might face some ocean-based issues) but you can see that the numbers easily translate: every millisecond of latency is equivalent to walking one thousand miles.
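
If you want to play with the conversion yourself, it is a one-liner. The constants are simply the ones from the analogy: half a nanosecond per pace for a 2GHz processor, and 2,000 paces to the mile:

    NS_PER_PACE    = 0.5      # one clock cycle of our 2GHz processor
    PACES_PER_MILE = 2_000

    def latency_in_miles(latency_ns):
        return latency_ns / NS_PER_PACE / PACES_PER_MILE

    print(latency_in_miles(10))           # DRAM, ~10ns -> 0.01 miles (20 paces)
    print(latency_in_miles(5_000_000))    # disk,   5ms -> 5000.0 miles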

Don’t forget what that means back in the real world: it translates to your processor sitting there not doing anything because it’s waiting on I/O. Increasing the speed of that processor only increases the amount of work it’s unable to do during that wait time. If you didn’t have DRAM as a “temporary” store for data, how would you ever manage to do any work? No wonder In-Memory technologies are so popular these days.

Moore’s Law Isn’t Helping

It's often stated or implied that Moore's Law is bringing us faster processors every couple of years, when in fact the original statement was about doubling the number of transistors on an integrated circuit. But the underlying point remains that processor performance is increasing all the time. Looking at the four resources we outlined above, you could say that DRAM technologies are progressing in a similar way, while network protocols are getting faster (10Gb Ethernet is commonplace, Infiniband is increasingly prevalent and 40Gb or 100Gb Ethernet is not far away).

On the other hand, disk performance has been stationary for years. According to this manual from Seagate the performance of CPUs increased 2,000,000x between 1987 and 2004 yet the performance of hard disk drives only increased 11x. That’s hardly surprising – how many years ago did the 15k RPM disk drive come out? We’re still waiting for something faster but the manufacturers have hit the limits of physics. The idea of helium-filled drives has been floated (sorry, couldn’t resist) and indeed they could be on the shelves soon, but if you ask me the whole concept is so up-in-the-air (sorry, I really can’t help it) that I have serious doubts whether it will actually take off (ok I promise that’s the last one).

The consequence of Moore’s Law is that the imbalance between disk storage and the other resources such as CPU is getting worse all the time. If you have performance issues caused by this imbalance – and then move to a newer, faster server with more processing power… the imbalance will only get worse.

The Silicon Data Centre

Disk, as a consequence of its mechanical nature, cannot keep up with silicon as the number of transistors on a processor doubles every two years. Well, as the saying goes: if you can't beat them, join them. So why not put your persistent data store on silicon?

This is the basis of the argument for moving to flash memory: it's silicon-based. The actual technology most vendors are using is NAND flash, but that's not massively important and technologies will come and go. The important point is to get storage onto the graph of Moore's Law. Going back to the walking analogy above, an I/O to flash memory takes in the region of 200 microseconds, i.e. 200 thousand nanoseconds. That's well over an order of magnitude faster than disk, but it still represents walking 400,000 paces or 200 miles. Unlike disk, though, the performance is getting better. And by moving storage to silicon we also pick up many other benefits such as reduced power, space and cooling requirements. Most importantly of all, we restore some balance to your server infrastructure.
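
Plugging that figure into the walking-analogy sketch from earlier gives the same answer:

    # reusing the latency_in_miles() helper from the walking analogy sketch above
    print(latency_in_miles(200_000))      # flash, ~200µs -> 200.0 miles (400,000 paces)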

Think about it. You have to admit that, as an argument, it’s pretty well balanced.

Footnote: Yes I know that by representing CPU clock cycles as instructions I am contributing to the Megahertz Myth. Sorry about that. Also, I strongly advise reading this article in the NoCOUG journal which makes some great points about DRAM and CPU utilisation. My favourite quote is, “Idle processors do not speed up database processing!” which is so obvious and yet so often overlooked.