Does My Database Need Flash?

Storage for DBAs: Here’s a question I get asked a lot: “Does my database need flash?”. In fact it’s the most common question customers have, followed by the alternative version, “Does my database need SSD?”. In fact, often customers already have some SSDs in their disk arrays but still see poor performance, so really I ought to wind it back a level and call this article, “Does my database need low latency storage?”. This would in fact be a much better headline from a technical perspective, but until I change the name of this site to LowLatencyDBA I’m sticking with the current title.

Flash is no longer a cutting edge new technology, it’s a mainstream product sold by almost every storage vendor. This means that you or your organisation will probably already have some flash sales person beating down your door to flog you some sort of flash product, whether it’s an all-flash array, a hybrid flash/disk system or a set of PCIe flash cards. While these products are diverse in nature, they all share two main characteristics: low latency and large numbers of IOPS. But how do you know whether you really need them?

In a later post I’ll be running through the questions which I think need to be asked in order to whittle down the massive list of flash vendors to the select few capable of servicing your needs. This, of course, will be difficult to achieve without being biased towards my own employer – but that’s a problem for another day. For now, here’s the first (and potentially most important) step: working out whether you actually need low latency flash storage in the first place.

Who Needs Flash?

For the world of databases, there are three main reasons why you might want to switch to low latency flash:

Acceleration – perhaps the most obvious reason is to go faster. There are many reasons why people desire better performance, but they generally boil down to one of two scenarios: Not Good Enough Now and Not Good Enough For The Future. In the former, bad performance is holding back an application, denying potential revenue or incurring penalties in some way (either SLA-based financial penalties or simply the loss of customers due to poor service levels). In the latter, existing infrastructure is incapable of allowing increased agility, i.e. the ability to do more (offering new services for example, or adding more concurrent users).

Consolidation – always on the mind of CIOs and CTOs is the benefit of consolidating database and server estates. Consolidation brings agility and risk benefits as well as the new and important benefit of cost savings. By consolidating (and standardising) multiple databases onto a smaller pool of servers, organisations save money on hardware, on maintenance and administration, and on the holy grail of all cost savings: software license fees. If you think that sounds like an exaggeration, take a look at this article on Wikibon which demonstrates that Oracle license costs account for 82% of the total cost of a traditional database deployment. Consolidation allows for reduced CPU cores, which means a reduction in the number of licenses, but it also increases I/O as workloads are “stacked” on the same infrastructure. The Wikibon article argues that by moving to flash storage and consolidating, the total cost drops significantly – by around 26% in fact.

Virtualisation – an increasingly prevalent option in the database world. The use of server virtualisation technologies is allowing organisations to move to cloud architectures, where environments are automatically provisioned, managed and migrated across hardware. Virtualisation brings massive agility benefits but also carries a risk because, just like with consolidation, I/O workloads accumulate on the same infrastructure. Unlike consolidation though, virtualisation adds an extra layer of latency, making the I/O even more of a potential bottleneck. Flash systems now make this option practical, as hypervisor vendors begin to realise the potential of flash memory.

There is actually a fourth reason, which is Infrastructure Optimisation. If you have data centres stuffed with disk arrays there is every chance that they can be replaced by a small number of flash arrays, thus reducing power, cooling and real estate requirements and saving large amounts of money. But as this article is primarily targeted at databases I thought I’d leave that one out for now. Consider it the icing on the cake… but don’t forget it, because sometimes it turns out that there’s a lot of icing.

So now we know the reasons why, let’s have a look at which sorts of systems are suitable for flash and which aren’t, starting with the Performance requirement…

Databases Love Flash If…

They create lots of I/O! I know, it sounds obvious, but more than once I’ve seen customers with CPU-bound applications that generate hardly any I/O. Flash is a fantastic technology, but its not magic.
There is lots of random I/O. Now don’t take that the wrong way – sequential I/O is good too. But if you currently have a random I/O workload running on a disk system you will see the most dramatic benefit after switching that to flash. Here’s why.
High amounts of parallelism. The simple fact is that a single process cannot drive anywhere near the amount of I/O that a good flash system can support. If you think of flash as being like a highway, not only is it fast, it’s also wide. Use all the lanes.
Large IOWAIT times. If you are using an operating system that has a concept of IOWAIT (Linux and most versions of UNIX do, Windows doesn’t) then this can be a great indicator that processes are stuck waiting on I/O. It’s not perfect though, because IOWAIT is actually an idle wait (within the operating system, this is nothing to do with Oracle wait events) so if the system is really busy it may not be present.

Those are all great indicators, but the next two should be considered the golden rules:

I/O wait times are high. Essentially we are looking for high latency from the existing storage system. Flash memory systems should deliver I/O with sub-millisecond latency, so if you see an average latency of 8ms on random reads (db file sequential read), for example, you know there is potential for reducing latency to an eighth of its previous average value.
I/O forms a significant percentage of Database Time. If I/O is only responsible for 5% of database time, no amount of lightening-fast flash is going to give you a big performance boost… your problems are elsewhere. On the other hand, if I/O is comprising a large portion of database time, you have lots of room for improvement. (I plan to post a guide to reading AWR Reports pretty soon)

If any of this is ticking boxes for you, it’s time to consider what flash could do for the performance of your database. On the other hand…

Performance Won’t Improve If…

There isn’t any I/O. Any flash vendor in the industry would be happy to sell you their products in this situation – and let’s face it you’ll get great latency! – but be realistic. If you don’t generate I/O, what’s the point? Unless of course you aren’t after performance. If consolidation, virtualisation or infrastructure optimisation is your aim, there could be a benefit. Also, consider the size of your memory components – if your database produces no physical I/O, could you consider reducing the size of the buffer cache? One of the big benefits of flash to consolidation is the ability to reduce SGA sizes and thus fit more databases onto the same DRAM-restricted server.
Single threaded workloads. Sure your application will run slightly faster, but will that speed-up be enough to justify the change of infrastructure? I’m not ruling this out – I have customers with single-threaded ETL jobs that bought flash because it was easier (and cheaper) than rewriting legacy code, but the impact of low-latency storage may well be reduced.
Application serialisation points. A session waiting on a lock will not wait any faster! Basically, if your application regularly ties itself in a knot with locks and contention issues, putting it on flash may well just increase the speed at which you hit those problems. Sometimes people use flash to overcome bad programming, but it’s by no means guaranteed to work.
CPU-bound systems. CPU starvation is a CPU problem, not an I/O problem. If anything, moving to low-latency storage will reduce the amount of time CPUs spent waiting on I/O and thus increase the amount of time they spend working, i.e. in a busy state. If your CPU is close to the limit and you remove the ballast that is a disk system, you might find that you hit the limit very quickly.

If you are unfortunate enough to be struggling with a badly-performing application that fits into one of these areas, flash probably isn’t the magic bullet you’re looking for.

Consolidation and Virtualisation

This is a different area where it’s no longer valid to only look at individual databases and their workloads. The key factor for both of these areas is density i.e. the number of databases or virtual machines that can fit on a single physical server. The main challenges here are memory usage and I/O generation: databases SGAs tend to be large, but flash allows for the possibility of reducing the buffer cache; while I/O generation is a problem in the disk world because consolidated workloads tend to create more random I/O. Of course, with flash that’s not really a problem. I’ve written a number of articles on consolidation and virtualisation in the past – I’m sure I’ll be writing more about them in the future too.

Summary

I work for a flash vendor – we want you to buy our products. We have competitors who want you to buy their products instead. If everyone in the industry is telling you to buy flash, how do you know if it’s relevant to you? Here’s my advice: make them speak your language and then check their claims against what you can see yourself.

Take some time to understand your workload. Look at the amount of I/O generated and the latency experienced; look at how random the workload is and the ratio of reads to writes (I’ll post a guide for this soon). Ask your (potential) flash vendor how much benefit you will see from your existing storage and then get them to explain why. If you’re a database person, make them speak in your language – don’t accept someone talking in the language of storage. Likewise if you’re an application person make them explain the benefits from an application perspective. You’re the customer, after all.

If your flash vendor can’t communicate with you in your language to explain the benefit you will see, there’s only one course of action: Get rid of them in a flash.

Footnote

Incidentally, if you live outside the UK and you’re wondering about the picture at the top of this article, check out this. If you live inside the UK you will know it’s a Cillit Bang reference… unless you live in a cave and shun the outside world – in which case, how are you reading this?

Understanding I/O: Random vs Sequential

Storage for DBAs: Ever been to one of those sushi restaurants where the food comes round in dishes on a conveyor belt? As each dish travels around the loop you eye it up and, as long as you can make your mind up in time, grab it. However, if you are as indecisive as me, there’s a chance it will be out of range before you come to your senses – in which case you have to wait for it to complete a further full revolution before getting another chance. And that’s assuming someone else doesn’t get to it first.

Let’s assume that it takes a dish exactly 4 minutes to complete a whole lap of the conveyor belt. And just for simplicity’s sake let’s also assume that no two dishes on the belt are identical. As a hungry diner you look in the little menu and see a particular dish which you decide you want. It’s somewhere on the belt, so how long will it take to arrive?

Probability dictates that it could be anywhere on the belt. It could be passing by right now, requiring no wait time – or it could have just passed out of reach, thus requiring 4 minutes of wait time to go all the way round again. As you follow this random method (choose from the menu then look at the belt) it makes sense that the average wait time will tend towards half way between the min and max wait times, i.e. 2 minutes in this case. So every time you pick a dish you wait an average of 2 minutes: if you have eight dishes the odds say that you will spend (8 x 2) = 16 minutes waiting for your food. Welcome to the disk data diet, I hope you weren’t too hungry?

Now let’s consider an alternative option, where you order eight dishes from the chef and he or she places all of them sequentially (i.e. next to each other) somewhere on the conveyor belt. That location is random, so again you might have to wait anywhere between 0 and 4 minutes (an average of 2 minutes) for the first dish to pass… but the next seven will follow one after the other with no wait time. So now, in this scenario, you only had to wait 2 minutes for all eight dishes. Much better.

I’m sure you will have seen through my analogy right from the start. The conveyor belt is a hard disk and the sushi dishes are blocks which are being eaten / read. I haven’t yet worked out how to factor a bottle Asahi Super Dry into this story, but I’ll have one all the same thanks.

Random versus Sequential I/O

I have another article planned for later in this series which describes the inescapable mechanics of disk. For now though, I’ll outline the basics: every time you need to access a block on a disk drive, the disk actuator arm has to move the head to the correct track (the seek time), then the disk platter has to rotate to locate the correct sector (the rotational latency). This mechanical action takes time, just like the sushi travelling around the conveyor belt.

Obviously the amount of time depends on where the head was previously located and how fortunate you are with the location of the sector on the platter: if it’s directly under the head you do not need to wait, but if it just passed the head you have to wait for a complete revolution. Even on the fastest 15k RPM disk that takes 4 milliseconds (15,000 rotations per minute = 250 rotations per second, which means one rotation is 1/250th of a second or 4ms). Admittedly that’s faster than the sushi in my earlier analogy, but the chances are you will need to read or write a far larger number of blocks than I can eat sushi dishes (and trust me, on a good day I can pack a fair few away).

What about the next block? Well, if that next block is somewhere else on the disk, you will need to incur the same penalties of seek time and rotational latency. We call this type of operation a random I/O. But if the next block happened to be located directly after the previous one on the same track, the disk head would encounter it immediately afterwards, incurring no wait time (i.e. no latency). This, of course, is a sequential I/O.

Size Matters

In my last post I described the Fundamental Characteristics of Storage: Latency, IOPS and Bandwidth (or Throughput). As a reminder, IOPS stands for I/Os Per Second and indicates the number of distinct Input/Output operations (i.e. reads or writes) that can take place within one second. You might use an IOPS figure to describe the amount of I/O created by a database, or you might use it when defining the maximum performance of a storage system. One is a real-world value and the other a theoretical maximum, but they both use the term IOPS.

When describing volumes of data, things are slightly different. Bandwidth is usually used to describe the maximum theoretical limit of data transfer, while throughput is used to describe a real-world measurement. You might say that the bandwidth is the maximum possible throughput. Bandwidth and throughput figures are usually given in units of size over units of time, e.g. Mb/sec or GB/sec. It pays to look carefully at whether the unit is using bits (b) or bytes (B), otherwise you are likely to end up looking a bit silly (sadly, I speak from experience).

In the previous post we stated that IOPS and throughput were related by the following relationship:

Throughput = IOPS x I/O size

It’s time to start thinking about that I/O size now. If we read or write a single random block in one second then the number of IOPS is 1 and the I/O size is also 1 (I’m using a unit of “blocks” to keep things simple). The Throughput can therefore be calculated as (1 x 1) = 1 block / second.

Alternatively, if we wanted to read or write eight contiguous blocks from disk as a sequential operation then this again would only result in the number of IOPS being 1, but this time the I/O size is 8. The throughput is therefore calculated as (1 x 8) = 8 blocks / second.

Hopefully you can see from this example the great benefit of sequential I/O on disk systems: it allows increased throughput. Every time you increase the I/O size you get a corresponding increase in throughput, while the IOPS figure remains resolutely fixed. But what happens if you increase the number of IOPS?

Latency Kills Disk Performance

In the example above I described a single-threaded process reading or writing a single random block on a disk. That I/O results in a certain amount of latency, as described earlier on (the seek time and rotational latency). We know that the average rotational latency of a 15k RPM disk is 4ms, so let’s add another millisecond for the disk head seek time and call the average I/O latency 5ms. How many (single-threaded) random IOPS can we perform if each operation incurs an average of 5ms wait? The answer is 1 second / 5 ms = 200 IOPS. Our process is hitting a physical limit of 200 IOPS on this disk.

What do you do if you need more IOPS? With a disk system you only really have one choice: add more disks. If each spindle can drive 200 IOPS and you require 80,000 IOPS then you need (80,000 / 200) = 400 spindles. Better clear some space in that data centre, eh?

On the other hand, if you can perform the I/O sequentially you may be able to reduce the IOPS requirement and increase the throughput, allowing the disk system to deliver more data. I know of Oracle customers who spend large amounts of time and resources carving up and re-ordering their data in order to allow queries to perform sequential I/O. They figure that the penalty incurred from all of this preparation is worth it in the long run, as subsequent queries perform better. That’s no surprise when the alternative was to add an extra wing to the data centre to house another bunch of disk arrays, plus more power and cooling to run them. This sort of “no pain, no gain” mentality used to be commonplace because there really weren’t any other options. Until now.

Flash Offers Another Way

The idea of sequential I/O doesn’t exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous. Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so don’t really apply. And since the latency of flash is sub-millisecond, it should be possible to see that, even for a single-threaded process, a much larger number of IOPS is possible. When we start considering concurrent operations things get even more interesting… but that topic is for another day.

Back to the sushi analogy, there is no longer a conveyor belt – the chefs are standing right in front of you. When you order a dish, it is placed in front of you immediately. Order a number of dishes and you might want to enlist the help of a few friends to eat in parallel, because the food will start arriving faster than you can eat it on your own. This is the world of flash memory, where hunger for data can be satisfied and appetites can be fulfilled. Time to break that disk diet, eh?

Looking back at the disk model, all that sitting around waiting for the sushi conveyor belt just takes too long. Sure you can add more conveyor belts or try to get all of your sushi dishes arranged in a line, but at the end of the day the underlying problem remains: it’s disk. And now that there’s an alternative, disk just seems a bit too fishy to me…

Strange ASM Behaviour with 4k Devices

question-mark-dice

This is only a short post to document something I’ve seen and reproduced but still don’t understand. Storage devices generally have a physical sector size of 512 bytes or, more recently, 4k. This is a subject which causes much confusion (partly because some vendors seek to portray whichever sector size they use as “better”). You can read more about the subject here.

On my Oracle Linux 6.3 server running the Unbreakable Enterprise Kernel I have two block devices presented from Violin. These devices have physical block sizes of 4k but logical block sizes of 512 bytes:

[root@server1 ~]# uname -r
2.6.39-200.24.1.el6uek.x86_64

[root@server1 ~]# fdisk -l /dev/mapper/violin_data1 | grep "Sector size"
Sector size (logical/physical): 512 bytes / 4096 bytes

[root@server1 ~]# fdisk -l /dev/mapper/violin_data2 | grep "Sector size"
Sector size (logical/physical): 512 bytes / 4096 bytes

Using Oracle Grid Infrastructure 11.2.0.3, I’m now going to create an ASM diskgroup on just one of the devices, using the new clause SECTOR_SIZE=4096:

SQL> CREATE DISKGROUP DATA EXTERNAL REDUNDANCY
  2  DISK '/dev/mapper/violin_data1'
  3  ATTRIBUTE
  4       'sector_size'='4096',
  5       'compatible.asm' = '11.2',
  6       'compatible.rdbms' = '11.2';

Diskgroup created.

SQL> select GROUP_NUMBER, DISK_NUMBER, STATE, NAME, PATH, SECTOR_SIZE from v$asm_disk;

GROUP_NUMBER DISK_NUMBER STATE    NAME            PATH                      SECTOR_SIZE
------------ ----------- -------- --------------- ------------------------- -----------
           0           1 NORMAL                   /dev/mapper/violin_data2          512
           1           0 NORMAL   DATA_0000       /dev/mapper/violin_data1          512

As you can see, the disk which joined the diskgroup shows a sector size of 512 (which is correct, because ASM is reading the logical block size of the device). The other disk is also showing a 512 byte sector size. So now let’s add it to the diskgroup:

SQL> alter diskgroup data add disk '/dev/mapper/violin_data2';

Diskgroup altered.

SQL> select GROUP_NUMBER, DISK_NUMBER, STATE, NAME, PATH, SECTOR_SIZE from v$asm_disk;

GROUP_NUMBER DISK_NUMBER STATE    NAME            PATH                      SECTOR_SIZE
------------ ----------- -------- --------------- ------------------------- -----------
           1           1 NORMAL   DATA_0001       /dev/mapper/violin_data2         4096
           1           0 NORMAL   DATA_0000       /dev/mapper/violin_data1          512

Huh? The newly added disk has suddenly become a 4k sector device. Why? If I add both devices during the initial CREATE DISKGROUP statement this does not happen, it only seems to happen when I ADD the disk to an existing diskgroup.

Why?

The Fundamental Characteristics of Storage

Storage for DBAs: As a rule of thumb, pretty much any storage system can be characterised by three fundamental properties:

Latency is a measurement of delay in a system; so in the case of storage it is the time taken to respond to an I/O request. It’s a term which is frequently misused – more on this later – but when found in the context of a storage system’s data sheet it often means the average latency of a single I/O. Latency figures for disk are usually measured in milliseconds; for flash a more common unit of measurement would be microseconds.

IOPS (which stands for I/Os Per Second) represents the number of individual I/O operations taking place in a second. IOPS figures can be very useful, but only when you know a little bit about the nature of the I/O such as its size and randomicity. If you look at the data sheet for a storage product you will usually see a Max IOPS figure somewhere, with a footnote indicating the I/O size and nature.

Bandwidth (also variously known as throughput) is a measure of data volume over time – in other words, the amount of data that can be pushed or pulled through a system per second. Throughput figures are therefore usually given in units of MB/sec or GB/sec.

As the picture suggests, these properties are all related. It’s worth understanding how and why, because you will invariably need all three in the real world. It’s no good buying a storage system which can deliver massive numbers of IOPS, for example, if the latency will be terrible as a result.

The throughput is simply a product of the number of IOPS and the I/O size:

Throughput = IOPS x I/O size

So 2,048 IOPS with an 8k blocksize is (2,048 x 8k) = 16,384 kbytes/sec which is a throughput of 16MB/sec.

The latency is also related, although not in such a strict mathematical sense. Simply put, the latency of a storage system will rise as it gets busier. We can measure how busy the system is by looking at either the IOPS or Throughput figures, but throughput unnecessarily introduces the variable of block size so let’s stick with IOPS. We can therefore say that the latency is proportional to the IOPS:

Latency ∝ IOPS

I like the mathematical symbol in that last line because it makes me feel like I’m writing something intelligent, but to be honest it’s not really accurate. The proportional (∝) symbol suggests a direct relationship, but actually the latency of a system usually increases exponentially as it nears saturation point.

We can see this if we plot a graph of latency versus IOPS – a common way of visualising performance characteristics in the storage world. The graph on the right shows the SPC benchmark results for an HP 3PAR disk system (submitted in 2011). See how the response time seems to hit a wall of maximum IOPS? Beyond this point, latency increases rapidly without the number of IOPS increasing. Even though there are only six data points on the graph it’s pretty easy to visualise where the limit of performance for this particular system is.

I said earlier that the term Latency is frequently misused – and just to prove it I misused it myself in the last paragraph. The SPC performance graph is actually plotting response time and not latency. These two terms, along with variations of the phrase I/O wait time, are often used interchangeably when they perhaps should not be.

According to Wikipedia, “Latency is a measure of time delay experienced in a system“. If your database needs, for example, to read a block from disk then that action requires a certain amount of time. The time taken for the action to complete is the response time. If your user session is subsequently waiting for that I/O before it can continue (a blocking wait) then it experiences I/O wait time which Oracle will chalk up to one of the regular wait events such as db file sequential read.

The latency is the amount of time taken until the device is ready to start reading the block, i.e not including the time taken to complete the read. In the disk world this includes things like the seek time (moving the actuator arm to the correct track) and the rotational latency (spinning the platter to the correct sector), both of which are mechanical processes (and therefore slow).

When I first began working for a storage vendor I found the intricacies of the terminology confusing – I suppose it’s no different to people entering the database world for the first time. I began to realise that there is often a language barrier in I.T. as people with different technical specialties use different vocabularies to describe the same underlying phenomena. For example, a storage person might say that the array is experiencing “high latency” while the database admin says that there is “high User I/O wait time“. The OS admin might look at the server statistics and comment on the “high levels of IOWAIT“, yet the poor user trying to use the application is only able to describe it as “slow“.

At the end of the day, it’s the application and its users that matter most, since without them there would be no need for the infrastructure. So with that in mind, let’s finish off this post by attempting to translate the terms above into the language of applications.

Translating Storage Into Application

Earlier we defined the three fundamental characteristics of storage. Now let’s attempt to translate them into the language of applications:

Latency is about application acceleration. If you are looking to improve user experience, if you want screens on your ERP system to refresh quicker, if you want release notes to come out of the warehouse printer faster… latency is critical. It is extremely important for highly transactional (OLTP) applications which require fast response times. Examples include call centre systems, CRM, trading, e-Business etc where real-time data is critical and the high latency of spinning disk has a direct negative impact on revenue.

IOPS is for application scalability. IOPS are required for scaling applications and increasing the workload, which most commonly means one of three things: in the OLTP space, increasing the number of concurrent users; in the data warehouse space increasing the parallelism of batch processes, or in the consolidation / virtualisation space increasing the number of database instances located on a single physical platform (i.e. the density). This last example is becoming ever more important as more and more enterprises consolidate their database estates to save on operational and licensing costs.

Bandwidth / Throughput is effectively the amount of data you can push or pull through your system. Obviously that makes it a critical requirement for batch jobs or datawarehouse-type workloads where massive amounts of data need to be processed in order to aggregate and report, or identify trends. Increased bandwidth allows for batch processes to complete in reduced amounts of time or for Extract Transform Load (ETL) jobs to run faster. And every DBA that ever lived at some point had to deal with a batch process that was taking longer and longer until it started to overrun the window in which it was designed to fit…

Finally, a warning. As with any language there are subtleties and nuances which get lost in translation. The above “translation” is just a rough guide… the real message is to remember that I/O is driven by applications. Data sheets tell you the maximum performance of a product in ideal conditions, but the reality is that your applications are unique to your organisation so only you will know what they need. If you can understand what your I/O patterns look like using the three terms above, you are halfway to knowing what the best storage solution is for you…

Performance: It’s All About Balance…

Storage For DBAs: Everyone wants their stuff to go faster. Whether it’s your laptop, tablet, phone, database or application… performance is one of the most desirable characteristics of any system. If your system isn’t fast enough, you start dreaming of more. Maybe you try and tune what you already have, or maybe you upgrade to something better: you buy a phone with a faster processor, or stick an SSD in your laptop… or uninstall Windows 🙂

When it comes to databases, I often find people considering the same set of options for boosting performance (usually in this order): half-heartedly tuning the database, adding more DRAM, *properly* tuning the database, adding or upgrading CPUs, then finally tuning the application. It amazes me how much time, money and effort is often spent trying to avoid getting the application developers to write their code properly, but that’s a subject for another blog.

The point of this blog is the following statement: to achieve the best performance on any system it is important that all of its resources are balanced.

Let’s think about the basic resources that comprise a computer system such as a database server:

CPU – the processor, i.e. the thing that actually does the work. Every process pretty much exists to take some input, get on CPU, perform some calculations and produce some output. It’s no exaggeration to call this the heart of the system.
Network – communications with the outside world, whether it be the users, the application servers or other databases.
Memory – Dynamic Random Access Memory (DRAM) provides a store for data.
Storage – for example disk or flash; provides a store for data.

You’ll notice I’ve been a bit disingenuous by describing Memory and Storage the same way, but I want to make a point: both Memory and Storage are there to store data. Why have two different resources for what is essentially the same purpose?

The answer, which you obviously already know, is that DRAM is volatile (i.e. continuous power is required to maintain the stored information, otherwise it is lost) while Storage is persistent (i.e. the stored information remains in place until it is actively changed or removed).

When you think about it like that, the Storage resource has a big advantage over the Memory resource, because the data you are storing is safe from unexpected power loss. So why do we have the DRAM? What does it bring to the party? And why do I keep asking you questions you already know the answer to?

Ok I’ll get to the point, which is this: DRAM is used to drive up CPU utilisation.

The Long Walk

The CPU is interacting with the Memory and Storage resources by sending or requesting data. Each request takes a certain amount of time – and that time can vary depending on factors such as the amount of data and whether the resource is busy. But let’s ignore all that for now and just consider the minimum possible time taken to send or receive that data: the latency. CPUs have clock cycles, which you can consider a metronome keeping the beat to which everything else must dance. That’s a gross simplification which may make some people wince (read here if you want to know why), but I’m going to stick with it for the sake of clarity.

Let’s consider a 2GHz processor – by no means the fastest available clock speed out there today. The 2GHz indicates that the clock cycle is oscillating 2 billion times per second. That means one oscillation every half a nanosecond, which is such a tiny amount of time that we can’t really comprehend it, so instead I’m going to translate it into the act of walking, where each single pace is a clock cycle. With each step taken, an instruction can be executed, so:

One CPU Cycle = Walking 1 Pace

The current generation of DRAM is DDR3 DRAM, which has latencies of around 10 nanoseconds. So now, while walking along, if you want to access data in DRAM you need to incur a penalty of 20 paces where you potentially cannot do anything else.

Accessing DRAM = Walking 20 Paces

Now let’s consider storage – and in particular, our old friend the disk drive. I frequently see horrible latency problems with disk arrays (I guess it goes with the job) but I’ll be kind here and choose a latency of 5 milliseconds, which on a relatively busy system wouldn’t be too bad. 5 milliseconds is of course 5 million nanoseconds, which in our analogy is 10 million steps. According to the American College of Sports Medicine there are an average of 2,000 steps in one mile. So now, walking along and making an I/O request to disk incurs a penalty of 10,000,000 steps or 5,000 miles. Or, to put it another way:

Accessing Disk = Walking from London to San Francisco

Take a minute to consider the impact. Previously you were able to execute an instruction every step, but now you need to walk a fifth of the way around the planet before you can continue working. That’s going to impact your ability to get stuff done.

Maybe you think 5 milliseconds is high for disk latency (or maybe you think anyone walking from London to San Francisco might face some ocean-based issues) but you can see that the numbers easily translate: every millisecond of latency is equivalent to walking one thousand miles.

Don’t forget what that means back in the real world: it translates to your processor sitting there not doing anything because it’s waiting on I/O. Increasing the speed of that processor only increases the amount of work it’s unable to do during that wait time. If you didn’t have DRAM as a “temporary” store for data, how would you ever manage to do any work? No wonder In-Memory technologies are so popular these days.

Moore’s Law Isn’t Helping

It’s often stated or inferred that Moore’s Law is bringing us faster processors every couple of years, when in fact the original statement was on doubling the number of transistors on an integrated circuit. But the underlying point remains that processor performance is increasing all the time. Looking at the four resources we outlined above, you could say that in a similar way DRAM technologies are progressing while network protocols are getting faster (10Gb Ethernet is commonplace, Infiniband is increasingly prevalent and 40Gb or 100Gb Ethernet is not far away).

On the other hand, disk performance has been stationary for years. According to this manual from Seagate the performance of CPUs increased 2,000,000x between 1987 and 2004 yet the performance of hard disk drives only increased 11x. That’s hardly surprising – how many years ago did the 15k RPM disk drive come out? We’re still waiting for something faster but the manufacturers have hit the limits of physics. The idea of helium-filled drives has been floated (sorry, couldn’t resist) and indeed they could be on the shelves soon, but if you ask me the whole concept is so up-in-the-air (sorry, I really can’t help it) that I have serious doubts whether it will actually take off (ok I promise that’s the last one).

The consequence of Moore’s Law is that the imbalance between disk storage and the other resources such as CPU is getting worse all the time. If you have performance issues caused by this imbalance – and then move to a newer, faster server with more processing power… the imbalance will only get worse.

The Silicon Data Centre

Disk, as a consequence of its mechanical nature, cannot keep up with silicon as the number of transistors on a processor doubles every two years. Well as the saying goes, if you can’t beat them, join them. So why not put your persistent data store on silicon?

This is the basis of the argument for moving to flash memory: it’s silicon-based. The actual technology most vendors are using is NAND flash but that’s not massively important and technologies will come and go. The important point is to get storage onto the graph of Moore’s Law. Going back to the walking analogy above, an I/O to flash memory takes in the region of 200 microseconds, i.e. 200 thousand nanoseconds. This is a number of orders of magnitude faster than disk but still represents walking 400,000 paces or 200 miles. But unlike disk, the performance is getting better. And by moving storage to silicon we also pick up many other benefits such as reduced power consumption, space and cooling requirements. And most importantly we restore some balance to your server infrastructure.

Think about it. You have to admit that, as an argument, it’s pretty well balanced.

Footnote: Yes I know that by representing CPU clock cycles as instructions I am contributing to the Megahertz Myth. Sorry about that. Also, I strongly advise reading this article in the NoCOUG journal which makes some great points about DRAM and CPU utilisation. My favourite quote is, “Idle processors do not speed up database processing!” which is so obvious and yet so often overlooked.

New Blog Series: Storage For DBAs

When I joined Violin I suddenly realised that I was going to be in a minority… a DBA in a world of storage people. DBAs don’t always think about storage – in fact I think it’s fair to say that many DBAs would prefer to keep storage at arm’s length. It’s just disk, right?

Of course all that changed back in 2007 when Oracle released 10g along with the new feature Automatic Storage Management. I confess that when I first heard about ASM I thought “Naah, I won’t be using that”. I couldn’t have been more wrong… before I knew it Oracle had me travelling all round Britain delivering the “ASM Roadshow” to DBA teams who listened with the all the interest of a dog that’s been shown a card trick*. [On one particular low point I found myself trying so hard to explain RAID to a group of stoic, unblinking Welshmen that I suffered some sort of mental breakdown and had to go and stand in the car park for a minute while my co-presenter leapt to my assistance. This was not a good day.]

These days DBAs are becoming more and more involved in storage, which means they have to spend more time talking to their Storage Administrator cousins (unless they have Exadata of course, in which the DBAs are the storage administrators).

During my time working with Exadata at Oracle I began to learn more and more about disk concepts such as Mean Time Between Failure and Annualised Failure Rates… but when I joined Violin I was plunged into a murky new world of terminology such as IOPS, bandwidth, latency and various other terms that probably should have been more familiar to me but weren’t.

I quickly came to the conclusion that people in the database world and people in the storage industry speak different languages. So now that I’ve been working at Violin for a year I thought it was time to try and bridge the divide and offer some translations of what all these crazy storage people are really talking about.

I’ll start with the basics and then get progressively more detailed until either a) I no longer know what I’m talking about, or b) nobody is reading anymore. So that should see me through until about the middle of next week then…

* I confess I stole this line from Bill Hicks. But it describes the scene perfectly and is funnier than anything I could think of…

Using SLOB to Test Physical I/O

For some time now I’ve been using the Silly Little Oracle Benchmark (SLOB) tool to drive physical I/O against the various high performance flash memory arrays I have in my lab (one of the benefits of working for Violin Memory is a lab stuffed with flash arrays!)

I wrote a number of little scripts and tools to automate this process and always intended to publish some of them to the community. However, time flies and I never seem to get around to finishing them off or making them presentable. I’ve come to the conclusion now that this will probably never happen, so I’ve published them in all their nasty, badly-written and uncommented glory. I can only apologise in advance.

You can find them here.

AWR Generator

tools

As part of my role at Violin I spend a lot of time profiling customer’s databases to see how their performance varies over time. The easiest way to do this (since I often don’t have remote access) is to ask for lots of AWR reports. One single report covering a large span of time is useless, because all peaks and troughs are averaged out into a meaningless hum of noise, so I always ask for one report per snapshot period (usually an hour) covering many hours or days. And I always ask for text instead of HTML because then I can process them automatically.

That’s all well and good, but generating a hundred AWR reports is a laborious and mind-numbingly dull task. So to make things easier I’ve written a SQL script to do it. I know there are many other scripts out there to do this, but none of them met the criteria I needed – mainly that they were SQL not shell (for portability) and that they didn’t create temporary objects (such as directories).

If it is of use to anyone then I offer it up here:

https://flashdba.com/database/useful-scripts/awr-generator/

Likewise if you manage to break it, please let me know! Thanks to Paul for confirming that it works on RAC and Windows systems (you know you love testing my SQL…)

Engineered Systems – An Alternative View

Have you seen the press recently? Or passed through an airport and seen the massive billboards advertising IT companies? I have – and I’ve learnt something from them: Engineered Systems are the best thing ever. I also know this because I read it on the Oracle website… and on the IBM website, although IBM likes to call them different names like “Workload Optimized Systems”. HP has its Converged Infrastructure, which is what Engineered Systems look like if you don’t make software. And even Microsoft, that notoriously hardware-free zone where software exists in a utopia unconstrained by nuts and bolts, has a SQL Server Appliance solution which it built with HP.

[I’m going to argue about this for a while, because that’s what I do. There is a summary section further down if you are pressed for time]

So clearly Engineered Systems are the future. Why? Well let’s have a look at the benefits:

Pre-Integration

It doesn’t make sense to buy all of the components of a solution and then integrate them yourself, stumbling across all sorts of issues and compatibility problems, when you can buy the complete solution from a single vendor. Integrating the solution yourself is the best of breed approach, something which seems to have fallout out of favour with marketing people in the IT industry. The Engineered Systems solution is pre-integrated, i.e. it’s already been assembled, tested and validated. It works. Other customers are using it. There is safety in the herd.

Optimization

In Oracle Marketing’s parlance, “Hardware and software, engineered to work together“. If the same vendor makes everything in the stack then there are more opportunities to optimize the design, the code, the integration… assumptions no longer need to be made, so the best possible performance can be squeezed out of the complete package.

Faster Deployment

Well… it’s already been built, right? See the Pre-Integration section above and think about all that time saved: you just need to wheel it in, connect up the power and turn it on. Simples.

Of course this isn’t completely the case if you also have to change the way your entire support organisation works in order to support the incoming technology, perhaps by retraining whole groups of operations staff and creating an entirely new specialised role to manage your new purchase. In fact, you could argue that the initial adoption of a technology like Exadata is so disruptive that it is much more complicated and resource-draining than building those best of breed solutions your teams have been integrating for decades. But once you’ve retrained all your staff, changed all your procedures, amended your security guidelines (so the DataBase Machine Administrator has access to all areas) and fended off the poachers (DBMAs get paid more than DBAs) you are undoubtedly in the perfect position to start benefiting from that faster deployment. Well done you.

And then there’s the migration from your existing platform, where (to continue with Exadata as an example) you have to upgrade your database to 11.2, migrate to Linux, convert to ASM, potentially change the endianness of your data and perhaps strip out some application hints in order to take advantage of features like Smart Scan. That work will probably take many times longer than the time saved by the pre-integration…

Single-Vendor Benefits

The great thing about having one vendor is that it simplifies the procurement process and makes support easier too – the infamous “One Throat To Choke” cliché.

Marketing Overdrive

If you believe the hype, the engineered system is the future of I.T. and anyone foolish enough to ignore this “new” concept is going to be left behind. So many of the vendors are pushing hard on that message, but of course there is one particular company with an ultra-aggressive marketing department who stands out above the rest: the one that bet the farm on the idea. Let’s have a look at an example of their marketing material:

Video hosted by YouTube under Standard Terms of Service. Content owner: Oracle Corporation

Now this is all very well, but I have an issue with Engineered Systems in general and this video in particular. Oracle says that if you want a car you do not go and buy all the different parts from multiple, disparate vendors and then set about putting them together yourself. Leaving aside the fact that some brave / crazy people do just that, let’s take a second to consider this. It’s certainly true that most people do not buy their cars in part form and then integrate them, but there is an important difference between cars and the components of Oracle’s Engineered Systems range: variety.

If we pick a typical motor vehicle manufacturer such as Ford or BMW, how many ranges of vehicle do they sell? Compact, family, sports, SUV, luxury, van, truck… then in each range there are many models, each model comes in many variants with a huge list of options that can be added or taken away. Why is there such a massive variety in the car industry? Because choice and flexibility are key – people have different requirements and will choose the product most suitable to their needs.

Looking at Oracle’s engineered systems range, there are six appliances – of which three are designed to run databases: the Exadata Database Machine, the SuperCluster and the ODA. So let’s consider Exadata: it comes in two variants, the X3-2 and the X3-8. The storage for both is identical: a full rack contains 14x Exadata storage servers each with a standard configuration of CPUs, memory, flash cards and hard disk drives. You can choose between high performance or high capacity disk drives but everything else is static (and the choice of disk type affects the whole rack, not just the individual server). What else can you change? Not a lot really – you can upgrade the DRAM in the database servers and choose between Linux or Solaris, but other than that the only option is the size of the rack.

The Exadata X3-2 comes in four possible rack sizes: eighth, quarter, half and full; the X3-8 comes only as a full rack. These rack sizes take into account both the database servers and the storage servers, meaning the balance of storage to compute power is fixed. This is a critical point to understand, because this ratio of compute to storage will vary for each different real-world database. Not only that, but it will vary through time as data volumes grow and usage patterns change. In fact, it might even vary through temporal changes such as holiday periods, weekends or simply just the end of the day when users log off and batch jobs kick in.

Flexibility

And there’s the problem with the appliance-based solution. By definition it cannot be as flexible as the bespoke alternative. Sure I don’t want to construct my own car, but I don’t need to because there are so many options and varieties on the market. If the only pre-integrated cars available were the compact, the van and the truck I might be more tempted to test out my car-building skills. To continue using Exadata as the example, it is possible to increase storage capacity independent of the database node compute capacity by purchasing a storage expansion rack, but this is not simply storage; it’s another set of servers each containing two CPU sockets, DRAM, flash cards, an operating system and software, hard disks… and of course a requirement to purchase more Exadata licenses. You cannot properly describe this as flexibility if, as you increase the capacity of one resource, you lose control of many other resources. In the car example, what if every time I wanted to add some horsepower to the engine I was also forced to add another row of seats? It would be ridiculous.

Summary: Two Sides To Every Coin

Engineered Systems are a design choice. Like all choices they have pros and cons. There are alternatives – and those alternatives also have pros and cons. For me, the Engineered System is one end of a sliding scale where hardware and software are tightly integrated. This brings benefits in terms of deployment time and performance optimization, but at the expense of flexibility and with potential vendor-lockin. The opposite end of that same scale is the Software Defined Data Centre (SDDC), where hardware and software are completely independent: hardware is nothing more than a flexible resource which can be added or removed, controlled and managed, aggregated and pooled… The properties and characteristics of the hardware matter, but the vendor does not. In this concept, data centres will simply contain elastic resources such as compute, storage and networking – which is really just an extension of the cloud paradigm that everyone has been banging on about for some time now.

It’s going to be interesting to see how the engineered system concept evolves: whether it will adapt to embrace ideas such as the SDDC or whether your large, monolithic engineered system will simply become another tombstone in the corner of your data centre. It’s hard to say, but whatever you do I recommend a healthy dose of scepticism when you read the marketing brochure…

New Installation Cookbook

Short post to mention that I’ve added another installation cookbook to the set published here. This one falls into the Advanced Cookbook section and covers installation on Oracle Linux 6.3 with Oracle 11.2.0.3 single-instance and 4k ASM, paying special attention to the configuration of UDEV and the multipathing software.

The blog posts haven’t been coming thick and fast recently as I have been concentrating on Violin’s (excellent) end of year but I hope to resume soon. I have one more piece to publish concerning subjects like Exadata and VMware, then a new blog series on “Storage for DBAs” to mark the combined anniversaries of my joining Violin and starting this blog.

In the meantime I’d like to recommend this short but very interesting blog series on Exadata Hybrid Columnar Compression over at ofirm.wordpress.com – part one starts here…