Thoughts on In Memory Databases (Part 1)

Everyone is talking about In Memory at the moment. On blogs, in tweets, in the press, in the Oracle marketing department, in books by SAP employees, even my Violin colleagues… it’s everywhere. What can I possibly add that will be of any value?

Well, how about owning up to something: I find myself in a bit of a quandary on this subject. On the one hand it’s a new buzzword, which means that a) it’s got everyone’s attention, and b) many people with their own agenda will seek to use it to their advantage… but on the other hand, given the nature of my employment (I work for Violin Memory, purveyors of flash memory systems), it seems like something we ought to be talking about.

As anyone who works in the IT industry knows (and perhaps it’s the same in other industries), we love a buzzword. Cloud, Analytics, Big Data, In Memory, Transformation… all of these phrases have been used at one time or another to try and wring cash out of customers who may or may not need the services and products they imagine the phrase represents. Even back at the end of the last millennium consultants worldwide were making huge amounts of money out of exploiting the phrase “Y2K”, some with more honourable intentions than others. I remember my old school received a letter from a “Y2K conformance specialist” informing them that this person could visit and inspect their football pitches to ensure they were “Y2K compliant”… (true story!)

So if buzzwords are prone to misuse, maybe the first thing we need to do is explore what “In Memory” really means? In fact, rewind a step – what do we mean when we say “Memory”?

What Is Memory?

It’s a basic question, but a good definition is surprisingly hard to pin down. Clearly this is an IT blog so (despite the deceiving picture above) I am only interested in talking about computer memory rather than the stuff in my head which stops working after I drink tequila. The definition of this term in the Free Online Dictionary of Computing is:

memory: These days, usually used synonymously with Random Access Memory or Read-Only Memory, but in the general sense it can be any device that can hold data in machine-readable format.

So that’s any device that can hold data in machine-readable format. So far so ambiguous. And of course that is the perfect situation for any would-be freeloader to exploit, since the less well-defined a definition is, the more room there is to manoeuvre any product into position as a candidate for that description.

Here’s what most people think of when they talk about computer memory… DRAM:

Dynamic Random Access Memory (DRAM)

This is Dynamic Random Access Memory – and it’s most likely what’s in your laptop, your desktop and your servers. You know all about this stuff – it’s fast, it’s volatile (i.e. the data stored on it is lost when the power goes off) and it’s comparatively expensive compared to, say… disk, of which many orders of magnitude more capacity is available at the same price point.

But now there is a new type of “memory” on the market, NAND flash memory. Actually it’s been around for over 25 years (read this great article for more details) but it is only now that we are seeing it being adopted en masse in data centres, as well as being prevalent in consumer devices – the chances are your phone contains NAND flash, as does your tablet (if you have one) and maybe your computer, if you are fortunate enough to have an SSD in it.

Toshiba NAND Flash

Flash memory, unlike DRAM, is persistent. That means when the power goes, the data remains. Flash access speeds are measured in microseconds – let’s say around 100 microseconds for a single random access. That’s significantly faster than disk, which is measured in (multiple) milliseconds – but still slower than DRAM, for which you would expect an access in around 100 nanoseconds. Flash is available in many forms, from USB devices and SSDs which fit into normal hard drive bays, through PCIe cards which connect directly to the system bus, and on to enterprise-class storage arrays such as those made by my employer, Violin Memory – the 6000 series array, for example.

Is flash a type of memory? It certainly fits the dictionary description above. But if you run something on flash, can you describe that something as now running “in memory”? You could argue the point either way I suppose.

Since we don’t seem to be doing well with defining what memory is, let’s change tack and talk about what it definitely isn’t. And that’s simple, because it definitely isn’t disk.

Disk

Whether it’s part of the formal definition or not, almost anyone would assume that memory is fast and non-mechanical, i.e. it has no moving parts. It is all semiconductors and silicon, not motors and magnets. A hard disk drive, with its rotating platters and moving actuator arm, is about the most un-memory-like way you can find to store your data, short of putting it on a big reel of tape. And, consistent with our experience of memory versus non-memory devices, it’s slow. In fact, every disk array vendor in the industry stuffs their enterprise disk arrays full of DRAM caches to make up for the slow performance of disk. So memory is something they use to mask the speed of their non-memory-based storage. Hang on then, if you have a small enough dataset so that the majority of your disk reads are coming from your disk array cache, does that mean you are running “in memory” too? No of course not, but the ambiguity is there to be exploited.

Primary Storage versus Secondary Storage

Since we are struggling with a formal definition of memory, perhaps another way to look at it is in terms of primary storage and secondary storage. The main difference here is that primary storage is directly addressable by the CPU, whereas secondary storage is addressed through input/output channels. Is that a good way of distinguishing memory from non-memory? It certainly works with DRAM, which ends up in the primary storage category, as well as disk, which ends up in the secondary storage category. But with flash it is a less successful differentiator.

The first problem is that, as previously mentioned, flash is available in multiple forms. PCIe flash cards are directly addressable by the CPU, whilst SSDs slot into hard drive bays and are accessed using storage protocols. In fact, just looking at the Violin Memory 6000 series array around which my day job revolves, connectivity options include PCIe direct attached, fibre-channel and Infiniband, meaning it could easily fit into either of the above categories.

What’s more, if you think of primary storage as somehow being faster than secondary storage, the Infiniband connectivity option of the Violin array is only about 50-100 microseconds slower than the PCIe version, yet brings a wealth of additional benefits such as high availability. It’s hard to think of a reason why you would choose the direct-attached version over the Infiniband one.

Volatile versus Persistent

Maybe this is a better method of differentiating? Perhaps we can say that memory is that which is volatile, i.e. data stored on it will be lost when power is no longer available. The alternative is persistent storage, where data exists regardless of the power state. Does that make sense?

Not really. Think about your traditional computer, whether it’s a desktop or server. You have four high-level resources: CPU to do the work, network to communicate with the outside world, disk to store your data (the persistence layer)… and memory. Why do you have memory in the form of DRAM? Why commit extra effort to managing a volatile store of data, much of which is probably duplicated on the persistence layer?

DRAM exists to drive up CPU utilisation. Processor speeds have famously doubled every couple of years or so. Network speeds have also increased drastically since the days of the 56k modem I used to struggle with in the 1990s. Disk hasn’t – nowhere near in fact. Sure, capacity has increased – and speeds have slowly struggled upwards until they reached the limit of the 15k RPM drive, but in comparison to CPU improvements disk has been absolutely stagnant. So your computer is stuffed full of DRAM because, if it weren’t, the processors would spend all their time waiting for I/O instead of doing any work. By keeping as much data in volatile DRAM as possible, the speed of access is increased by around five orders of magnitude, resulting in CPUs which can spend more time working and less time waiting.

In the world of flash memory things are slightly different. DRAM is still necessary to maintain CPU utilisation, because flash is around two-and-a-half to three orders of magnitude slower than DRAM. But does it make sense to assume that “memory” is therefore only applicable to volatile data storage? What if a hypothetical persistent flash medium arrived with DRAM access speeds? Would we refuse to say that something running on this magic new media was running “In Memory”?

I don’t have an answer, only an opinion. My opinion is that memory is solid-state semiconductor-based storage and can be volatile or persistent. DRAM is a type of memory, but not the only type. Flash is a type of memory, while disk clearly is not.

So with that in mind, in the next part of this blog series I’m going to look at In Memory Database technologies and describe what I see as the three different architectures of IMDB that are currently available. As a taster, one of them is SAP HANA, one of them involves Violin Memory and the third one is the new Oracle Exadata X3 “Database In-Memory Machine”. And as a conclusion I will have to make a decision about the quandary I mentioned at the start of this article: should we at Violin claim a piece of the “In Memory” pie?

<Part Two of this blog series is located here>

Using Oracle Preinstall RPM with Red Hat 6

Recently I’ve been building Red Hat 6 systems and struggling to use the Oracle Preinstall RPM because it has a dependency on the Oracle Unbreakable Enterprise Kernel.

I’ve posted an article on this subject and my methods for getting around it:

Using Oracle Preinstall RPM with Red Hat 6

This system is not registered with ULN / RHN

One of the features of WordPress is the ability to see search terms which are taking viewers to your blog. One of the all-time highest searches bringing traffic to my site is “This system is not registered with ULN”… and sure enough if I search for that phrase on Google my site is one of the top links, taking people to one of my Violin Memory array Installation Cookbooks.

So I guess it’s only fair that I give these passers-by some sort of advice on what to do if you see this message…

1. Don’t Panic

Chances are you have built a new Linux system using Red Hat Enterprise Linux or its twin sister Oracle Linux. You are probably now trying to use yum to install software packages, but every time you do so you see something similar to this:

# yum install oracle-validated -y
Loaded plugins: rhnplugin, security
This system is not registered with ULN.
ULN support will be disabled.
ol5_u7_base | 1.1 kB 00:00
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package oracle-validated.x86_64 0:1.1.0-14.el5 set to be updated
...

This is the Oracle Linux variant. If it was Red Hat you would see:

This system is not registered with RHN
RHN support will be disabled.

The first thing to understand is that you can quite happily build and run a system in this state – in fact I’m willing to bet there are many systems out there exhibiting this message every time somebody calls yum.

The message simply means that you have not registered this build of your system with Oracle or Red Hat. Both companies have a paid support offering which allows you to register a system and do things like get software updates. Red Hat’s is called the Red Hat Network (RHN) and Oracle’s, with their marketing department’s usual sense of humour, is called the Oracle Unbreakable Linux Network (ULN). [I’ve been critical of some Oracle products in the past but I have to say that I love Oracle Linux, even if I do find the name “Unbreakable” a bit daft…]

You don’t have to register your system with the vendor’s support network in order to be able to use it. I’m not making any statements about your support contract, if you have one – I’m just saying that it will work quite normally without it.

2. Register Your System

If you have Oracle or Red Hat support then you might as well register your system so that it can take advantage of their yum channels.

In systems prior to RHEL6 / OL6 you used a utility called up2date to register:

# up2date --register

Or if you want to use the text-mode version:

# up2date-nox --register

You can find a good tutorial explaining the process here. In RHEL6 / OL6 the process changed, so now you call a different utility depending on your distribution.

For Red Hat you need to use the rhn_register command (actually this also became available in RHEL5). You will need your Red Hat Network login and password.

For Oracle Linux you need to use the uln_register command. You will need your Oracle ULN login, password and Customer Service Identifier (CSI).
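Both commands are interactive, so a minimal example is simply one of the following (pick whichever matches your distribution and follow the prompts):

# rhn_register
# uln_register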

Once your registration is complete the message “This system is not registered” should leave you alone.

3. Don’t Have Support?

Of course, not everybody has a support contract with Red Hat or Oracle. Some people have one but can’t find the details. Others can’t be bothered to set it up. If any of these applies to you then there is another alternative, which is the Oracle Public Yum Server. [Fair play to Wim and the team for making this available, because it’s been making my life easier for years now…]

Oracle’s public yum server is a freely available source of Linux OS downloads. Simply point your browser here and follow the instructions: http://public-yum.oracle.com/

In essence, you use wget to download the Oracle repo file which relates to your system and then (optionally) edit it to choose the yum channel you want to subscribe to (otherwise it will use the latest publicly-available stuff). The versions of the Linux software on the public yum repositories are (I believe) not as up-to-date as those you would get if you subscribed to a support contract, but they are still very new.

And the best bit is you can also use it if you have Red Hat installed; it isn’t restricted to Oracle Linux users. Having said that, make sure you don’t do anything which invalidates your support contracts. By pointing a RHEL system at the Oracle public yum server and running an update you are effectively converting your system to become Oracle Linux.

Here’s an example of how to set it up for OL6:

# cd /etc/yum.repos.d
# wget http://public-yum.oracle.com/public-yum-ol6.repo
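If you want to subscribe to a specific update channel rather than the latest one, edit the downloaded repo file and flip the enabled flags – a rough sketch only, since the exact section names vary between versions of the file, so check your own copy:

# vi public-yum-ol6.repo

[ol6_latest]
...
enabled=0

[ol6_u3_base]
...
enabled=1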

This gets yum working and so allows for the Oracle Validated / Oracle Preinstall RPM to be installed in order to set up the database…
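For example, on Oracle Linux 6 the preinstall package is (if I remember the name correctly) oracle-rdbms-server-11gR2-preinstall, so something like this should do the trick:

# yum install oracle-rdbms-server-11gR2-preinstall -y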

Hopefully that will satisfy some of my wayward blog visitors!

Oracle Achieves Record TPC-C Benchmark

It appears that one or two of my previous posts may have inadvertently annoyed some people at Oracle, so I would like to try and make amends today by posting something extremely positive about the company which, lest we forget, made the “world’s first commercial relational database”, backed Linux before it was a commercially viable option and – let’s face it – allowed me to have a career in IT rather than spend my life washing windows.

The positive story I’d like to share is this one:

Oracle Achieves Record TPC-C Benchmark Result on 2 Processor System

You can click through to see the details of the press release, but here is one of the main highlights describing the platform which was used to achieve this great result:

Oracle Database 11g Standard Edition One and Oracle Linux with the Unbreakable Enterprise Kernel Release 2, running on a Cisco UCS™ C240 M3 Rack Server with two Intel® Xeon® E5-2690 2.9 GHz processors achieved 1.6 Million transactions per minute (tpmC) with a price/performance of $0.47/tpmC.

Of course it seems only fair to point out that the server hardware used here was from Cisco rather than Oracle. That’s not a criticism of Oracle in any way, merely a footnote to show that the glory needs to be shared a little bit. It doesn’t mention the storage either, but I’ll come back to that.

Also, it should be noted that the Oracle press release makes some comparisons with an IBM DB2 benchmark – a comparison which some at IBM feel is somewhat disingenuous.

Whatever your thoughts on that, if you are like me you probably tend to overlook the marketing and press releases anyway and skip to the technical details, because they tend to be a lot more black and white. So allow me to point you to the executive summary and the full disclosure report both of which are available on the TPC.org website.

One thing you will notice if you dig down into the details is that the storage was actually two of these:

Violin Memory 6616 flash Memory Array

Yep, that’s right – I’m very pleased to say that in order to achieve new record levels of performance for the OLTP-based TPC-C benchmark, Oracle (and Cisco of course) used two Violin Memory flash Memory Arrays. Now that’s the kind of positive news we love to share…

Exadata X3 – Sound The Trumpets

It’s crazy time in the world of Oracle, because Oracle OpenWorld 2012 is only a week away. Which means that between now and then the world of Oracle blogging and tweeting will gradually reach fever pitch speculating on the various announcements that will be issued, products that will be launched and outrageous claims that will be made. The hype machine that is the Oracle Marketing department will be in overdrive, whilst partners and competitors clamour to get a piece of the action too. Such is life.

There was supposed to be one disappointment this year, i.e. that the much-longed-for new version of Oracle Database (12c) would not be released… we knew this because Larry told us back in June that it wouldn’t be out until December or January. Mind you, he also told us that it wouldn’t be ported to Itanium, yet it appears that promise cannot be kept. And now it seems another of those claims back in June was incorrect, because yesterday we learnt (from Larry) that Oracle Database 12c would be released at OOW after all. How are we supposed to keep up with what’s accurate?

Also for OOW 2012 we have the prospect of new versions of the Oracle Exadata Database Machine, the Exadata X3, to replace the existing X2 models which have now been in service for two years. The new models (the X3-2 and the X3-8) don’t represent a huge change, more of an evolutionary step to keep up with current technology: Oracle partners have been told that the Westmere-based Xeon processors have been swapped for Sandy Bridge versions (see comments below), the amount of RAM has increased, the flash cards are switching from the ancient F20 models to the F40 models which have better performance characteristics as well as higher capacity (and my, don’t they look just like the LSI Nytro WarpDrive WLP4-200?)

One thing that doesn’t appear to be changing though is the disks in the storage servers, which remain the 12x 600GB high performance or 3TB high capacity spindles used in the X2-2 and X2-8. I’ve heard a lot of people suggest that Oracle might switch to using only SSDs in the storage servers, but I generally discount this idea because I am not sure it makes sense in the Exadata design. The Exadata Smart Flash Cache (i.e. the F20 / F40 cards) is there to try and handle the random I/O requests, as is the database buffer cache of course. The disks in an Exadata storage server are there to handle sequential I/O – and since all 12 of them can saturate the I/O controller there is no need to go increasing the available bandwidth with SSD… particularly if Oracle hasn’t got the technology to do SSD right (maybe they have, maybe they haven’t – I wouldn’t know… but working for a flash vendor I am aware that flash is a complicated technology and you need plenty of IP to manage it properly. My, those F40 cards really do look familiar…)

Exadata on Violin? No.

Of course what could have been really interesting is the idea of using the Violin Memory flash Memory Array as a storage server. Very much like an Exadata storage cell, the 6000 series array has intelligence in the form of its Memory Gateways, which are effectively a type of blade server built into the array. There are two in each 6000 series array and they have x86 processors, DRAM and network connectivity as you would expect. On a standard Violin Memory system you would find them running our own operating system with our vShare software, as well as the option to run Symantec Storage Foundation, but we have also used them to run other, extremely cool stuff:

Violin Memory Windows Cluster In A Box

Violin Memory OEMs VMware Virtualization Technology

Violin Memory DOES NOT run Exadata Storage Server

Ok that last one was a trap… Exadata storage software is a closed technology that can only be run on Oracle’s Exadata Database Machine. But ’twas not always thus…

Open and Closed

The original plan for Exadata storage software was that it would have an open hardware stack, rather than the proprietary Oracle-only approach that we see today. We know this from various sources including none other than the CEO of Oracle himself. It would have been possible to build Exadata systems on multiple platforms and architectures  – there was a port of iDB for HPUX under development, for example (evidence of this can be seen on page 101 of HP’s HPUX Release Notes). Given that Oracle’s success as a database company was founded on that openness and willingness to port onto multiple platforms, or to put it another way the freedom of choice, it came as a shock to many when the Sun acquisition put an end to this approach.

Now it seems that Oracle is going the other way. The Database Smart Flash Cache feature is only available on Solaris or Oracle Linux platforms. Hybrid Columnar Compression, an apparently generic feature, was only supported on Oracle Exadata systems when it was first released. Since then the list of supported storage for HCC has grown to encompass Oracle ZFS Storage Appliances and Oracle Pillar Axiom Storage Systems. Notice something these systems all have in common? The clue is in the name.

So what can we learn from this? Is Oracle using its advantage as the largest database vendor to make its less-successful hardware products more attractive? Will customers continue to see more goodies withheld unless they purchase the complete Oracle stack? Have a look at this and see what you think:

Oracle Storage – The Smarter Choice

This is a marketing feature in which Oracle explains the “Top Five Reasons Oracle Storage is a Smarter Choice than EMC”. But hold on, what’s reason number five?

So Oracle storage is “smarter” than EMC because Oracle doesn’t let you use an apparently-generic software feature on EMC? That’s an interesting view. Maybe there’s more. What about reason number four?

Oracle storage is “smarter” than EMC because Exadata software – you remember, that software which was originally going to be available on multiple systems and architectures – only runs on Oracle storage. Well duh.

Life Goes on

So here we are in the modern world. Exadata is a closed platform solution. It’s still well-designed and very good at doing the thing it was designed for (data warehousing). It’s still sold by Oracle as the strategic platform for all workloads. Oracle still claims that Exadata is a solution for OLTP and Consolidation workloads, yet we don’t see TPC-C benchmarks for it (and that criticism has become boring now anyway). Next week we will hear all about the Exadata write-back cache and how it means that Exadata X3 is now the best machine for OLTP, even though that claim was already being made about the V2 back in 2009.

I am sure the announcements at OOW will come thick and fast, with many a 200x improvement seen here or a 4000% reduction claimed there. But amid all the hype and hyperbole, why not take a minute to think about how different it all could have been?

Database Virtualisation: The End of Oracle RAC?

A long time ago (2003) in a galaxy far, far away (Denmark), a man wrote a white paper. However, this wasn’t an ordinary man – it was Mogens Nørgaard, OakTable founder, CEO of Miracle A/S and previously the head of RDBMS Support and then Premium Services at Oracle Support in Denmark. It’s fair to say that Mogens is one of the legends of the Oracle community and the truth is that if you haven’t heard of him you might have stumbled upon this blog by accident. Good luck.

The white paper was (somewhat provocatively) entitled, “You Probably Don’t Need RAC” and you can still find a copy of it here courtesy of my friends at iD Concept. If you haven’t read it, or you have but it was a long time ago, please read it again. It’s incredibly relevant – in fact I’m going to argue that it’s more relevant now than ever before. But before I do, I’m going to reprint the conclusions in their entirety:

  • If you have a system that needs to be up and running a few seconds after a crash, you probably need RAC.
  • If you cannot buy a big enough system to deliver the CPU power and/or memory you crave, you probably need RAC.
  • If you need to cover your behind politically in your organisation, you can choose to buy clusters, Oracle, RAC and what have you, and then you can safely say: “We’ve bought the most expensive equipment known to man. It cannot possibly be our fault if something goes wrong or the system goes down”.
  • Otherwise, you probably don’t need RAC. Alternatives will usually be cheaper, easier to manage and quite sufficient.

Oracle RAC: What Is The Point?

To find out what the Real Application Clusters product is for, let’s have a look at the Oracle Database 2 Day + Real Application Clusters Guide and see what it says:

Oracle Real Application Clusters (Oracle RAC) enables an Oracle database to run across a cluster of servers, providing fault tolerance, performance, and scalability with no application changes necessary. Oracle RAC provides high availability for applications by removing the single point of failure with a single server.

So from this we see that RAC is a technology designed to provide two major benefits: high availability and scalability. The HA features are derived from being able to run on multiple physical machines, therefore providing the ability to tolerate the failure of a complete server. The scalability features are based around the concept of horizontal scaling, adding (relatively) cheap commodity servers to a pool rather than having to buy an (allegedly) more expensive single server. We also see that there are “no application changes necessary”. I have serious doubts about that last statement, as it appears to contradict evidence from countless independent Oracle experts.

That’s the technology – but one thing that cannot ever be excluded from the conversation is price. Technical people (I’m including myself here) tend to get sidetracked by technical details (I’m including myself there too), but every technology has to justify its price or it is of no economic use. At the time of writing, the Oracle Enterprise Edition license is showing up in the Oracle Shop as US$47,500 per processor. The cost of a RAC license is showing as US$23,000 per processor. That’s a lot of money, both in real terms and also as a percentage of the main Enterprise Edition license – almost 50% as much again. To justify that price tag, RAC needs to deliver something which is a) essential, and b) cannot be obtained through any other less-expensive means.
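To put some purely illustrative numbers on that (ignoring core factors, discounts and support costs), imagine a four-node cluster in which each node requires four processor licenses:

Enterprise Edition: 16 processors x $47,500 = $760,000
RAC option:         16 processors x $23,000 = $368,000
Total:                                        $1,128,000

Over a million dollars in list-price licensing, of which roughly a third is the RAC option alone.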

High Availability

The theory behind RAC is that it provides higher availability by protecting against the failure of a server. Since the servers are nodes in a cluster, the cluster remains up as long as the number of failed nodes is less than the total number of nodes in that cluster.

It’s a great theory. However, there is a downside – and that downside is complexity. RAC systems are much more complex than single-instance systems, a fact which is obvious but still worth mentioning. In my previous role as a database product expert for Oracle Corporation I got to visit multiple Oracle customers and see a large number of Oracle installations, many of which were RAC. The RAC systems were always the most complicated to manage, to patch, to upgrade and to migrate. At no time do I ever remember visiting a customer who had implemented the various Transparent Application Failover (TAF) policies and Fast Application Notification (FAN) mechanisms necessary to provide continuous service to users of a RAC system when a node fails. The simple fact is that most customers have to restart their middle-tier processes when a node fails, and as a result all of the users of that node are kicked off. However, because the cluster remained available they are able to call this a “partial outage” instead of taking the SLA hit of a “complete outage”.

This is just semantics. If your users experience a situation where their work is lost and they have to log back in to start again, that’s an outage. That’s the very antithesis of high availability to me. If the added complexity of RAC means that these service interruptions happen more frequently, then I question whether RAC is really the best solution for high availability. I’m not suggesting that there is anything wrong with the Oracle product (take note Oracle lawyers), simply that if you are not designing and implementing your applications and infrastructure to use TAF and FAN then I do not see how your availability really benefits.
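For reference, client-side TAF is typically configured in the tnsnames.ora along these lines – just a sketch with hypothetical host and service names, and it only solves part of the problem (in-flight transactions are still rolled back and the application has to handle that gracefully):

MYSERVICE_TAF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = yes)
      (ADDRESS = (PROTOCOL = TCP)(HOST = racnode1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = racnode2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = myservice)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5))
    )
  )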

Complexity is the enemy of high availability – and RAC, no matter how you look at it, adds complexity over a single-instance implementation of Oracle.

Scalability

The claim here is that RAC allows for platforms to scale horizontally, by adding nodes to a cluster as additional resources are required. According to the documentation quote above this is possible “with no application changes”. I assume this only applies to the case where nodes are added to an existing multi-node cluster, because going from single-instance to RAC very definitely requires application changes – or at least careful consideration of application code. People far more eloquent (and concise) than I have documented this before, but consider anything in the application schema which is a serialization point: sequences, inserts into tables using a sequential number as the primary key, that sort of thing. You cannot expect an application to perform if you just throw it at RAC.
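As a small illustration of what I mean, the standard advice for sequences on RAC is to give them a large cache and leave them unordered, so that each instance can hand out values without constant cross-instance coordination – a sketch only, with arbitrary numbers and an invented sequence name:

-- single instance: the default cache of 20 is usually fine
CREATE SEQUENCE order_id_seq CACHE 20;

-- RAC: a much larger cache (and NOORDER, which is the default anyway)
-- reduces cross-instance contention, at the cost of larger gaps and
-- values arriving out of order across nodes
CREATE SEQUENCE order_id_seq CACHE 1000 NOORDER;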

To understand the scalability point of RAC, it’s important to take a step back and see what RAC actually does conceptually. The answer is all about abstraction. RAC takes the one-to-one database-to-instance relationship and changes it to a one-to-many, so that multiple instances serve one database. This allows for the newly-abstracted instance layer to be expanded (or contracted) without affecting the database layer.

This is exactly the same idea as virtualisation of course. In virtualisation you take the one-to-one physical-server-to-operating-system relationship and abstract it so that you can have many virtual OS’s to each physical server. In fact in most virtualisation products you can take this even further and have many physical servers supporting those virtual machines, but the point is the same – by adding that extra layer of abstraction the resources which used to be tied together now become dynamic.

This is where the concept of RAC fails for me. For a start, modern servers are extremely powerful – and comparatively cheap. You don’t need to buy a mainframe-style supercomputer in order to run a business-critical application, not when 80 core x86 servers are available and chip performance is rocketing at the speed of Moore’s Law.

Database Virtualisation Is The Answer

Virtualisation technology, whether from VMware, Microsoft or one of the other players in that market, allows for a much better expansion model than RAC in my opinion. The reason for this is summed up perfectly by Dr. Bert Scalzo (NoCOUG journal page 23) when he says, “Hardware is simply a dynamic resource“. By abstracting hardware through a virtualisation layer, the number and type of physical servers can now be changed without having to change the applications running on top in virtual machines.

Equally, by using virtualisation, higher service levels can be achieved due to the reduced complexity of the database (no RAC) and the ability to move virtual machines across physical domains with limited or no interruption. VMware’s vMotion feature, for example, allows for the online migration of Oracle databases with minimal impact to applications. Flash technologies such as the flash memory arrays from Violin Memory allow for the I/O issues around virtualisation to be mitigated or removed entirely. Software exists for managing and monitoring virtualised Oracle environments, whilst leading players in the technology space tell the world about their successes in adopting this model.

What’s more, virtualisation allows for incredible benefits in terms of agility. New Oracle environments can be built simply by cloning existing ones, and multiple copies and clones can be taken for use in dev / test / UAT environments with minimal administrative overhead. Self-service options can be automated to give users the ability to get what they want, when they want it. The term “private cloud” stops being marketing hype and starts being an achievable goal.

And finally there’s the cost. VMware licenses are not cheap either, but hardware savings start to become apparent when virtualising. With RAC, you would probably avoid consolidating multiple applications onto the same nodes – an ill-timed node eviction would take out all of your systems and leave you with a real headache. With the added protection of the VM layer that risk is mitigated, so databases can be consolidated and physical hardware shared. Think about what that does to your hardware costs, operational expenditure and database licensing costs.

Conclusion

Ok so the title of this post was deliberately straying into the realms of sensationalism. I know that RAC is not dead – people will be running RAC systems for years to come. But for new implementations, particularly for private-cloud, IT-as-a-service style consolidation environments, is it really a justifiable cost? What does it actually deliver that cannot be achieved using other products – products that actually provide additional benefits too?

Personally, I have my doubts – I think it’s in danger of becoming a technology without a use case. And considering the cost and complexity it brings…

Exadata Roadmap – More Speculation

Oracle Sun Flash Accelerator F40 card

It’s silly season. In the run up to Oracle Open World there are always rumours and whispers about what products will be announced – and this year is no different. I know this because I’m one of the people partaking in the spread of baseless and unfounded speculation.

Clearly the thing that most people are talking about is the almost certain release of a new Exadata generation called the X3. There appear to be both the X3-2 and X3-8 generations coming, as well as an interesting “Exadata X3-2 Eighth Rack” (that’s eighth as in 1/8 not as in 8th I presume). You don’t need me to tell you any of this, because Andy Colvin from those excellent guys at Enkitec has written a great article all about it right here.

And as if that wasn’t enough, Kevin Closson, the ex-Performance Architect of Exadata, has added his own speculative article in which he walks the fine line of legal requirement placed upon anyone who used to work in the Oracle Development organisation (because Oracle’s expensively assembled legal team often finds time to stretch its muscles about these things: to quote Kevin, “I’m only commenting about the rumors I’ve read and I will neither confirm nor deny even whether I *can* confirm or deny.” But did you notice how he didn’t confirm whether he could confirm that he could confirm it?)

Anyway, with Andy and Kevin on the case, there is little point in me trying to add anything there. So let’s look at some of the other rumours.

Sun Flash Accelerator F40 Cards

It appears that the X3 will finally ditch the unloved Sun F20 flash cards that have been present since the introduction of flash when the V2 model came out in 2009. Flash technology has advanced rapidly over recent years – and the F20 cards were hardly at the forefront of the technology even in 2009.

The F20 cards contained four flash modules known as DOMs, each with 24GB of SLC flash and 64MB of DRAM. In order to ensure that writes made it to flash in the event of power loss, they also had a dirty great big super-capacitor strapped to the back. I’m no fan of supercaps in general – they tend to have reliability issues and go bang in the night. I’m not saying that Oracle’s cards had this issue though (because I also have to consider that expensively-assembled legal team). However it’s interesting to note this quote in the F20 user guide: “Because high temperatures can have negative impact on life expectancy, it is best to locate the Sun Flash Accelerator F20 PCIe Card in PCIe slots that offer maximum airflow”.

The new F40 cards have now switched to using MLC flash and again contain four DOMs. This time they are 100GB in size, giving a total of 400GB usable (512GB raw). There is no mention of DRAM, but of course it must be there. The manual also offers no insight into whether there are any supercaps (unlike the F20 manual which had a lovely section on “Super Capacitors versus Batteries”) but I can see some fat little nodules on the picture up above which tell me that capacitance is still essential. The result of these changes (probably mainly the switch to MLC) is that the published mean time between failure has dropped by 50% from 2m hours to 1m hours. That’s taken 114 years off of the lifetime of the cards!

The power draw appears to have risen, because the F20 used around 16.5W during normal operation, whereas the F40 is described as using 25W max and 11.5W even when idle. On the other hand maybe they just picked a value in the middle and called that “normal”.

What will be interesting is to see how Oracle handles the flash write cliff. Flash media is very fast for reads; in the case of the F40 the latency is 251 microseconds (not impressive against the 90 microseconds on a Violin system, but still better than disk). Flash is even faster for writes, with the F40 having a 95 microsecond latency (25 microseconds on Violin 🙂 ). The area to watch out for though is erasing. On flash you can only write to an empty block, so once the block is used it has to be erased again before you issue another write to it. Violin has all sorts of patented technology to ensure this doesn’t affect performance (but as I’ve already plugged Violin twice I’ll shut up about it). Oracle doesn’t – at least, nothing that any of the flash vendors would be worried about.

[Disclosure: In the comments section below, Alex asked a question about the block size which made me realise that the F40 datasheet numbers are showing latency figures for 8k, whereas I am quoting Violin latency figures for 4k blocks. Even so, it’s still obvious that there are some big differences there.]

That’s never really been a problem for Oracle before, because the Exadata flash was used as a write-through cache, where the write performance of the flash cards was not an issue. This time, with the new “flash for all writes” capabilities of the flash cache, write performance is going to matter – particularly for sustained writes, such as ETL jobs, batch loads, data imports etc. Unless Oracle has some way to avoid it, once the capacity of the cards is used and all of the flash cells have been written to, there will be a big drop-off in performance whilst the garbage collection takes place in the background to try and erase free cells. It will be interesting to see how the X3 behaves during this type of load.

Database Virtualisation

This is the other hot topic for me, since I am an avid believer that we are seeing a major trend in the industry towards the virtualisation of production Oracle databases. Oracle, it has to be said, has not had a massive amount of success with its Oracle VM product. I actually quite like it, but I appear to be in a minority. It’s not got anything like the market penetration of Hyper-V, whilst VMware is in a different league altogether.

History tells us that when Oracle has a product with which it wants to drive (or rather, enforce) more adoption, it uses “interesting” strategies. The addition of OVM to Exadata is, for me, almost certain. In this way, Oracle gets to push its own virtualisation product as something that a) is “engineered to work with Exadata”, b) is a “one throat to choke” support solution, and c) is the *only* choice you can have.

Expect to see lots of announcements around this, with particular hype over the features such as online migration and integration with OEM, as well as lots of talk about how the Infiniband network makes it all a million times faster than some unspecified alternative.

Update 10 September 2012

It’s come to my attention that the Sun F40 cards look incredibly similar to the LSI Nytro WarpDrive WLP4-200 flash cards. Just take a look at the pictures. I don’t know this for a fact, but the similarity is plain to see. Surely Oracle must be OEMing these?

A note for Oracle’s legal team: please note that this is all wild speculation and that I am in no way using any knowledge gained whilst an employee of Oracle. In fact the main thing I learned whilst an employee was that people on the outside who aren’t supposed to know get to have a lot more fun speculating than the people on the inside who are supposed to know but don’t.

Querying DBA_HIST_SNAPSHOT and DBA_HIST_SYSSTAT

Why is it so hard in Oracle to get a decent answer to the question of how many seconds elapsed between two timestamps?

I’m looking at DBA_HIST_SNAPSHOT and wondering how many seconds each snapshot spans, because later on I want to use this to generate metrics like Redo Size per Second, etc.

SQL> desc dba_hist_snapshot
 Name                                Null?    Type
 ----------------------------------- -------- ------------------------
 SNAP_ID                             NOT NULL NUMBER
 DBID                                NOT NULL NUMBER
 INSTANCE_NUMBER                     NOT NULL NUMBER
 STARTUP_TIME                        NOT NULL TIMESTAMP(3)
 BEGIN_INTERVAL_TIME                 NOT NULL TIMESTAMP(3)
 END_INTERVAL_TIME                   NOT NULL TIMESTAMP(3)

So surely I can just subtract the begin time from the end time, right?

SQL> select SNAP_ID, END_INTERVAL_TIME - BEGIN_INTERVAL_TIME as elapsed
  2  from DBA_HIST_SNAPSHOT
  3  where SNAP_ID < 6
  4 order by 1;
   SNAP_ID ELAPSED
---------- ------------------------------
         1 +000000000 00:00:44.281
         2 +000000000 00:04:32.488
         3 +000000000 00:51:39.969
         4 +000000000 01:00:01.675
         5 +000000000 00:59:01.697

Gaaaah…. It’s given me one of those stupid interval datatypes! I’ve never been a fan of these. I just want to know the value in seconds.

Luckily I can cast a timestamp (the datatype in DBA_HIST_SNAPSHOT) as a good old fashioned DATE. We love dates: you can treat them as numbers and add, subtract etc. The values represent days, so you just need to multiply by 24 x 60 x 60 = 86400 to get seconds:

SQL> select SNAP_ID, END_INTERVAL_TIME - BEGIN_INTERVAL_TIME as elapsed,
  2  (cast(END_INTERVAL_TIME as date) - cast(BEGIN_INTERVAL_TIME as date))
  3    *86400 as elapsed2
  4  from dba_hist_snapshot
  5  where snap_id < 6
  6  order by 1;
   SNAP_ID ELAPSED                          ELAPSED2
---------- ------------------------------ ----------
         1 +000000000 00:00:44.281                44
         2 +000000000 00:04:32.488               272
         3 +000000000 00:51:39.969              3100
         4 +000000000 01:00:01.675              3602
         5 +000000000 00:59:01.697              3542

That’s much better. But I notice that, in snapshot 1 for example, the elapsed time was 44.281 seconds and in my CAST version it’s only 44 seconds. In casting to the DATE datatype the fractional seconds have been lost. Maybe that isn’t an issue, but surely there’s a way to keep that extra accuracy?

Here’s the answer I came up with – using EXTRACT:

SQL> select SNAP_ID, END_INTERVAL_TIME - BEGIN_INTERVAL_TIME as elapsed,
  2  (cast(END_INTERVAL_TIME as date) - cast(BEGIN_INTERVAL_TIME as date))
  3    *86400 as elapsed2,
  4  (extract(day from END_INTERVAL_TIME)-extract(day from BEGIN_INTERVAL_TIME))*86400 +
  5  (extract(hour from END_INTERVAL_TIME)-extract(hour from BEGIN_INTERVAL_TIME))*3600 +
  6  (extract(minute from END_INTERVAL_TIME)-extract(minute from BEGIN_INTERVAL_TIME))*60 +
  7  (extract(second from END_INTERVAL_TIME)-extract(second from BEGIN_INTERVAL_TIME)) as elapsed3
  8  from dba_hist_snapshot
  9  where snap_id < 6
 10  order by 1;
   SNAP_ID ELAPSED                          ELAPSED2   ELAPSED3
---------- ------------------------------ ---------- ----------
         1 +000000000 00:00:44.281                44     44.281
         2 +000000000 00:04:32.488               272    272.488
         3 +000000000 00:51:39.969              3100   3099.969
         4 +000000000 01:00:01.675              3602   3601.675
         5 +000000000 00:59:01.697              3542   3541.697

Not particularly simple, but at least accurate. I’m happy to be told if there’s an easier way…
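One possible simplification – offered as a sketch rather than a tested recommendation – is to extract the components from the interval itself, since subtracting two timestamps returns an INTERVAL DAY TO SECOND and EXTRACT works directly on that. It also avoids any oddness if the two timestamps ever fall either side of a month boundary, where subtracting the day-of-month values could go negative:

select snap_id,
       extract(day    from (end_interval_time - begin_interval_time))*86400 +
       extract(hour   from (end_interval_time - begin_interval_time))*3600 +
       extract(minute from (end_interval_time - begin_interval_time))*60 +
       extract(second from (end_interval_time - begin_interval_time)) as elapsed
from   dba_hist_snapshot
where  snap_id < 6
order  by 1;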

Why?

Why am I doing this? Because I am trying to look at the performance of some benchmarks I am working on. The load generation tool creates regular AWR snapshots so I want to look at the peak IO rates for each snapshot to save myself generating a million AWR reports.

I am specifically interested in the statistics redo size, physical reads and physical writes from DBA_HIST_SYSSTAT. The tests aim to push each of these metrics (independently though, not all at the same time… yet!)

With that in mind, and with thanks to Ludovico Caldara for some code on which to build, here is the SQL that I am using to view the performance in each snapshot. First the output though, truncated to save some space on the screen:

SNAPSHOTID SNAPSHOTTIME    REDO_MBSEC REDO_GRAPH           READ_MBSEC READ_GRAPH           WRITE_MBSEC WRITE_GRAPH
---------- --------------- ---------- -------------------- ---------- -------------------- ----------- --------------------
       239 30-AUG 13:37:14     384.83 ******                        0 *                            .04 ******
       240 30-AUG 13:38:07     284.27 ****                          0 *                            .03 ****
       241 30-AUG 13:42:14     296.62 ****                          0                              .03 ****
       242 30-AUG 13:43:00    1242.08 ********************          0 *                            .12 ********************
       243 30-AUG 13:47:14     258.75 ****                          0                              .02 ****
       244 30-AUG 13:48:28     866.83 **************                0 *                            .08 *************
       245 30-AUG 13:52:14     456.24 *******                       0                              .04 *******
       246 30-AUG 13:54:43     773.61 ************                  0                              .07 ************
       247 30-AUG 13:57:38     624.23 **********                    0                              .06 *********
       248 30-AUG 14:00:22     613.98 *********                     0                              .05 *********

I’m currently running redo generation tests so I’m only interested in the redo size metric and the calculation of redo per second, i.e. column 3. I can use the graph at column 4 to instantly see which snapshot I need to look at: 242. That’s the one where redo was being generated at over 1.2GB/sec – not bad for a two-socket machine attached to a single Violin 6616 array.

Now for the SQL… I warn you now, it’s a bit dense!

-- for educational use only - use at your own risk!
-- display physical IO statistics from DBA_HIST_SYSSTAT
-- specifically redo size, physical reads and physical writes

set lines 140 pages 45
accept num_days prompt 'Enter the number of days to report on [default is 0.5]: '
set verify off

SELECT redo_hist.snap_id AS SnapshotID
,      TO_CHAR(redo_hist.snaptime, 'DD-MON HH24:MI:SS') as SnapshotTime
,      ROUND(redo_hist.statval/elapsed_time/1048576,2) AS Redo_MBsec
,      SUBSTR(RPAD('*', 20 * ROUND ((redo_hist.statval/elapsed_time) / MAX (redo_hist.statval/elapsed_time) OVER (), 2), '*'), 1, 20) AS Redo_Graph
,      ROUND(physical_read_hist.statval/elapsed_time/1048576,2) AS Read_MBsec
,      SUBSTR(RPAD('*', 20 * ROUND ((physical_read_hist.statval/elapsed_time) / MAX (physical_read_hist.statval/elapsed_time) OVER (), 2), '*'), 1, 20) AS Read_Graph
,      ROUND(physical_write_hist.statval/elapsed_time/1048576,2) AS Write_MBsec
,      SUBSTR(RPAD('*', 20 * ROUND ((physical_write_hist.statval/elapsed_time) / MAX (physical_write_hist.statval/elapsed_time) OVER (), 2), '*'), 1, 20) AS Write_Graph
FROM (SELECT s.snap_id
            ,g.value AS stattot
            ,s.end_interval_time AS snaptime
            ,NVL(DECODE(GREATEST(VALUE, NVL(lag (VALUE) OVER (PARTITION BY s.dbid, s.instance_number, g.stat_name
                 ORDER BY s.snap_id), 0)), VALUE, VALUE - LAG (VALUE) OVER (PARTITION BY s.dbid, s.instance_number, g.stat_name
                     ORDER BY s.snap_id), VALUE), 0) AS statval
            ,(EXTRACT(day FROM s.end_interval_time)-EXTRACT(day FROM s.begin_interval_time))*86400 +
             (EXTRACT(hour FROM s.end_interval_time)-EXTRACT(hour FROM s.begin_interval_time))*3600 +
             (EXTRACT(minute FROM s.end_interval_time)-EXTRACT(minute FROM s.begin_interval_time))*60 +
             (EXTRACT(second FROM s.end_interval_time)-EXTRACT(second FROM s.begin_interval_time)) as elapsed_time
        FROM dba_hist_snapshot s,
             dba_hist_sysstat g,
             v$instance i
       WHERE s.snap_id = g.snap_id
         AND s.begin_interval_time >= sysdate-NVL('&num_days', 0.5)
         AND s.instance_number = i.instance_number
         AND s.instance_number = g.instance_number
         AND g.stat_name = 'redo size') redo_hist,
     (SELECT s.snap_id
            ,g.value AS stattot
            ,NVL(DECODE(GREATEST(VALUE, NVL(lag (VALUE) OVER (PARTITION BY s.dbid, s.instance_number, g.stat_name
                 ORDER BY s.snap_id), 0)), VALUE, VALUE - LAG (VALUE) OVER (PARTITION BY s.dbid, s.instance_number, g.stat_name
                     ORDER BY s.snap_id), VALUE), 0) AS statval
        FROM dba_hist_snapshot s,
             dba_hist_sysstat g,
             v$instance i
       WHERE s.snap_id = g.snap_id
         AND s.begin_interval_time >= sysdate-NVL('&num_days', 0.5)
         AND s.instance_number = i.instance_number
         AND s.instance_number = g.instance_number
         AND g.stat_name = 'physical read total bytes') physical_read_hist,
     (SELECT s.snap_id
            ,g.value AS stattot
            ,NVL(DECODE(GREATEST(VALUE, NVL(lag (VALUE) OVER (PARTITION BY s.dbid, s.instance_number, g.stat_name
                 ORDER BY s.snap_id), 0)), VALUE, VALUE - LAG (VALUE) OVER (PARTITION BY s.dbid, s.instance_number, g.stat_name
                     ORDER BY s.snap_id), VALUE), 0) AS statval
        FROM dba_hist_snapshot s,
             dba_hist_sysstat g,
             v$instance i
       WHERE s.snap_id = g.snap_id
         AND s.begin_interval_time >= sysdate-NVL('&num_days', 0.5)
         AND s.instance_number = i.instance_number
         AND s.instance_number = g.instance_number
         AND g.stat_name = 'physical write total bytes') physical_write_hist
WHERE redo_hist.snap_id = physical_read_hist.snap_id
  AND redo_hist.snap_id = physical_write_hist.snap_id
ORDER BY 1;

Exadata Roadmap Preview

Last week, Andrew Mendelsohn gave a talk at the Enkitec Extreme Exadata Expo (“E4”) run in Texas by those excellent guys at Enkitec. Andrew is the SVP of Oracle’s Database Server Technologies group, so it’s fair to say he has his finger on the pulse of the Oracle roadmap for Exadata.

Big thanks to Frits Hoogland for tweeting a picture of the roadmap slide. As you can see there are some interesting things on there… I’m told that Andrew described these features as “coming within the next 12 months”. Of course, that could mean they arrive at the next Oracle Open World in a month’s time, or they could be 365 days away. I suspect some are coming sooner than others, but as usual it is all wild speculation. Never mind though, if there’s one thing I’m quite good at it’s wild(ly inaccurate)  speculation.

The first one to consider is the in-memory optimized compression. Why is this important? Well, for Exadata, one reason is that no compression functionality can be offloaded to the storage cells, with their 168 cores (in a full rack). Instead it has to take place on the far-less processor-heavy compute nodes (only 96 cores on a full rack X2-2). Of course, it may be that the cells are busy and the compute nodes are idle, in which case this is a happy coincidence and there would be plenty of resource available for compression (although actually if the cells are really busy they may be performing “passthrough“, where work is offloaded back to the compute nodes!). But the fact remains that since the Exadata design is asymmetrical, you are still limited to only using the CPUs in the compute nodes. If you want to know what that means, you really need to be watching these videos by Kevin Closson. It seems like everyone wants to do everything in memory these days, but then I guess that’s not surprising when the alternative is doing it on disk.

The second important feature is the “flash for all writes” write-back flash cache, enabling the database writer to use some of the 5.3TB of flash available in a full rack. Of course, this is effectively a cache, albeit a persistent one. The writes still have to be de-staged back to disk at some point. Andrew is claiming a 10x improvement here on the slide, but it will be interesting to see how that plays out – particularly if those writes are sustained and the area allocated on the flash cards starts to run out. Kevin posted some views about this on his site, although being Kevin he likes to stick to the facts rather than throw about the armfuls of wildly inaccurate speculation that you’ll find here.

Finally, the feature that caught my eye the most was “Virtualization of database servers”. Regular readers will know my absolute faith in the meeting of databases with virtualization technology, so for me this appears to be yet another clear sign (if you look for them hard enough you can always find them 🙂 ). I wonder if this means the introduction of Oracle VM onto the compute nodes. The x86 hardware is there, the Infiniband network is there, so this could pave the way for OVM on Exadata with all of the resultant Live Migration technology… it’s a thought.

Let’s face it, Oracle is getting spanked in the virtualisation arena by VMware, so they need to do something big to get people to notice OVM. With the release of VMware’s vFabric Data Director 2.0 it’s now time to fight or give up. And we all know Oracle likes a fight.

For my money OVM is actually a great product, but then so is VMware. And for all Larry’s words on virtualization being the best security model, it’s a technology that has been noticeably lacking on what is, after all, Oracle’s strategic platform for all database workloads…

Comments welcome… and feel free to call me out on what is clearly an obvious lack of insider knowledge.

Database Virtualization Part 2 – Flash Makes The Difference

In part one of this article I talked about Database Virtualisation and how I believe that it is the next trend in our industry. Databases – particularly Oracle databases – have held out against the rise of virtualisation for a long time, but as virtualisation products have matured and the drive to consumerise and consolidate IT services has increased, the idea of running production databases inside virtual machines has started to make real business sense. And to complete the perfect storm of conditions that make this not just a viable solution but a seriously attractive one, flash memory now enters the picture.

Why has it taken so long for virtualisation to be adopted with production databases? Oracle’s support policy is a factor, of course, along with their license policy (discussed later). But the primary reason I’ll wager is risk. And the risk is all around performance – how can you be sure that the addition of a hypervisor will not affect system performance? In particular, how can you ensure that performance remains predictable? It’s primarily a latency thing: you do not want to be adding extra code paths to the application calls where speed is of the essence. You cannot afford to be adding nanoseconds to your CPU calls and milliseconds to your I/O operations, because it’s all wait time – and it all adds up.

This is compounded because one of the most obvious goals of virtualisation is to run multiple virtual databases on top of the same physical infrastructure. In the virtualisation world (whether looking at databases or not), each virtualised guest has its own workload pattern, which includes the pattern of I/O it performs. However, as you overlay each guest onto the same physical host, something interesting happens: the independent streams interleave at the shared storage layer and the combined I/O pattern tends towards randomness.
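To illustrate, here is a minimal sketch in Python – the block ranges and the simple round-robin dispatch order are made up for the example, not taken from a real I/O trace. Three guests each read their own virtual disk perfectly sequentially, yet the merged stream that arrives at the shared array contains almost no contiguous requests:

# Minimal sketch: sequential streams from several guests interleave into a
# near-random pattern at the shared storage layer. Block ranges and the
# round-robin dispatch order are illustrative assumptions, not real trace data.

guests = {
    "vm1": range(0, 8),          # vm1 reads blocks 0..7 of its virtual disk
    "vm2": range(10000, 10008),  # vm2 reads blocks 10000..10007
    "vm3": range(50000, 50008),  # vm3 reads blocks 50000..50007
}

# Interleave the three sequential streams as the hypervisor might dispatch them
merged = [blk for trio in zip(*guests.values()) for blk in trio]
print(merged[:9])   # [0, 10000, 50000, 1, 10001, 50001, 2, 10002, 50002]

# Count how many consecutive requests are actually contiguous on the array
contiguous = sum(1 for a, b in zip(merged, merged[1:]) if b == a + 1)
print(f"contiguous pairs: {contiguous} of {len(merged) - 1}")   # 0 of 23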

Latency Matters

Latency is measured in units of time: nanoseconds for CPU cycles, microseconds for flash memory arrays, milliseconds for disk arrays, seconds for networks – but always units of time. And it’s lost time: time spent waiting instead of doing the thing we want to do. We care about latency because the operations for which latency is measured (e.g. reads and writes) happen frequently, perhaps thousands of times per second. Although those units of time may appear quite small, when you multiply them by their frequency you discover that they turn out to be a significant portion of the total available time. And time is what we care about most – it’s the reason we upgrade computer equipment to faster models, why we drive too fast, and why we complain bitterly about the UK’s slow progress in adopting LTE (or is that just me?).
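To put some rough numbers on that – the 2,000 reads per second is a hypothetical workload figure, and the latencies are the ballpark values used throughout this post:

# Rough arithmetic: time spent waiting on I/O per second of wall-clock time.
# The 2,000 reads/sec workload is a hypothetical example; the latencies are
# the ballpark figures quoted in this post.

reads_per_second = 2000

for name, latency_s in [("disk  (10 ms)", 10e-3),
                        ("flash (100 us)", 100e-6),
                        ("DRAM  (100 ns)", 100e-9)]:
    wait = reads_per_second * latency_s
    print(f"{name}: {wait:.4f} seconds of accumulated wait per second")

# Disk: 20 seconds of accumulated wait per elapsed second (spread across
# concurrent sessions); flash: 0.2 seconds; DRAM: 0.0002 seconds.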

Disk arrays have horrible latency figures. If a CPU cycle takes only a nanosecond and accessing DRAM takes 100ns, waiting 10ms (so that’s 10,000,000ns) for a single block to be read from disk is like waiting a lifetime. Disk manufacturers can do little about this because somewhere a little metal arm has to move over a little spinning disk (seek time) and wait for it to rotate to the right place (rotational latency) before you can have your data. They have done their best to make that disk move as fast as possible (which is why it uses so much power and creates so much heat), but there are laws of physics which cannot be broken. Of course, one thing that disk does have in its favour is that once the disk head is in the correct place to read or write data to or from the platter, it can access the following block really quickly. This is sequential I/O and it’s something that disks do much better than random I/O, for the obvious reason that every subsequent block read in a sequential I/O avoids the seek time and rotational latency, thereby reducing the total average read or write time.
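As a back-of-the-envelope sketch – the 3.5ms average seek time is an assumed, typical figure for a 15k RPM enterprise drive, while the rotational numbers follow directly from the spindle speed – you can derive the random IOPS ceiling of a single disk from just those two mechanical delays:

# Back-of-the-envelope model of a single 15k RPM disk's random I/O ceiling.
# The 3.5 ms average seek time is an assumed, typical figure; the rotational
# maths follows directly from the spindle speed.

rpm = 15000
avg_seek_ms = 3.5                         # assumed average seek time
full_rotation_ms = 60000 / rpm            # 4 ms per revolution
avg_rotational_ms = full_rotation_ms / 2  # on average, wait half a turn = 2 ms

service_time_ms = avg_seek_ms + avg_rotational_ms   # ~5.5 ms per random I/O
random_iops = 1000 / service_time_ms
print(f"~{random_iops:.0f} random IOPS per disk")    # ~182

# A purely sequential stream skips the seek and most of the rotational delay,
# which is why the same disk looks so much healthier on sequential workloads.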

But hang on, what did we say before about virtualisation? The more virtual databases you fit onto the physical infrastructure (i.e. the density), the more random the I/O becomes. So as you increase the density, you get increasingly bad performance. Yet increasing the density is exactly what you want to do in order to achieve the cost savings associated with virtualising your databases… it’s one of the primary drivers of the whole exercise. Doesn’t that mean that disks are completely the wrong technology for virtualisation?

Luckily we have our new friend flash technology to help us, with its ultra-low latency. Flash doesn’t care whether I/O is random or sequential because it has no seek time or rotational latency – why would it, when there are no moving parts? A Violin Memory flash array can read a 4k block in under 100 microseconds. Even if you add a fibre-channel layer that still won’t take you much over 300 microseconds – and if you care that much about latency then Infiniband is here to help, bringing the figure back down towards 100 microseconds again. Only flash memory has the ultra-low latency necessary for database virtualisation.

IOPS – The Upper Limit of Storage

One thing you do not want to happen when you virtualise your databases onto a consolidated physical platform is to find the ceiling of your I/O capabilities. Every storage system has an upper limit on the number of I/O operations it can perform per second (known as IOPS), and when that ceiling is reached (known as saturation) things can get painful.

Why is this relevant to database virtualisation? Because when you virtualise, you overlay virtual images of databases onto a single physical host system. It’s like taking a load of pictures of your databases and superimposing them on top of each other. Your underlying infrastructure has to be able to deliver the sum of all of that demand, or everything on it will suffer.

Worse still, the latency you experience from an underutilised storage system will not be the same latency you experience when pushing it to its peak capacity. As the number of IOPS increases, so does the latency of each operation. Disk systems saturate far quicker than flash systems because of the cost of all that seek time and rotational latency discussed earlier. However, disk array vendors know a few tricks to try and avoid this – the most obvious being overprovisioning (using far more physical disks / spindles than are required for the usable capacity) and short stroking (only using the outer edge of each disk’s platter in order to reduce seek time and increase throughput – the outer edge of the platter has a larger circumference and holds more sectors per track, meaning more data passes under the head on each rotation). They are great tricks to increase the number of IOPS a disk array can deliver… great, that is, if you are the vendor, because it means you get to sell more disks. For the customer though, this means a bigger disk array using more power, requiring more cooling, taking up more valuable data centre space and – here’s the punchline – costing more while wasting huge amounts of raw capacity.

This is why flash memory makes the ideal solution for virtualisation. For a start, the maximum IOPS figures for disk versus flash are in different neighbourhoods: a single 15k RPM SAS disk can deliver around 175 to 210 IOPS. Admittedly you would expect to see more than one disk in an array, but let’s face it, there would have to be a lot of those disks to get up to the 1,000,000 IOPS that a Violin Memory 6616 memory array can deliver (around 5,000 disks, assuming a figure of 200 for the HDD). The Violin array is only 3U high and uses a fraction of the power that you would need for the equivalent monster of a disk array.
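The arithmetic behind that disk count is simple enough to sketch out – and it also shows the capacity you end up buying as a side effect. The 600GB drive size and the 20TB usable requirement below are purely hypothetical illustration figures, not from any real configuration:

# Sketch of the disk-count arithmetic above, plus the raw capacity you end up
# buying as a side effect. The 600 GB drive size and 20 TB usable requirement
# are hypothetical illustration figures, not from any real configuration.

target_iops = 1_000_000       # the Violin 6616 figure quoted above
iops_per_disk = 200           # mid-point of the 175-210 range for a 15k disk
disks_needed = target_iops / iops_per_disk
print(f"disks needed for IOPS alone: {disks_needed:.0f}")   # 5000

drive_size_tb = 0.6           # hypothetical 600 GB drives
usable_needed_tb = 20         # hypothetical capacity requirement
raw_capacity_tb = disks_needed * drive_size_tb
print(f"raw capacity bought: {raw_capacity_tb:.0f} TB "
      f"for a {usable_needed_tb} TB usable requirement")    # 3000 TB vs 20 TB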

Surely that makes flash the default choice, but there’s an additional consideration – predictable latency. At high levels of IOPS flash performs predictably: latency rises in a broadly linear fashion. But with a disk array latency rises steeply as it approaches saturation, resulting in a “hockey-stick” style graph. A recent disk array vendor’s SPC-1 benchmark provides a good example of this (and remember, it set a world-record SPC result, so this is a top-of-the-range system).
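Since I can’t reproduce the benchmark graph here, a simple textbook queueing approximation gives a feel for the shape of that curve. This is an illustrative model with assumed figures (the 100,000 IOPS ceiling and 5ms low-load service time are made up for the example), not the SPC-1 data itself:

# Illustrative hockey-stick curve: response time vs offered load using the
# textbook M/M/1 approximation R = S / (1 - U). This is a stand-in for the
# benchmark graph, not the SPC-1 data; the 100,000 IOPS ceiling and 5 ms
# service time are assumed figures for a large disk array.

max_iops = 100_000        # assumed saturation point of the disk array
service_ms = 5.0          # assumed per-I/O service time at low load

for pct in (10, 50, 80, 90, 95, 99):
    utilisation = pct / 100
    response_ms = service_ms / (1.0 - utilisation)
    print(f"{pct}% of max IOPS: ~{response_ms:.0f} ms per I/O")

# 10% -> ~6 ms, 50% -> 10 ms, 90% -> 50 ms, 99% -> 500 ms: flat at first,
# then the curve turns sharply upwards as the array approaches saturation.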

[I’ll post more on this subject in a separate series, as I want to share some more in-depth information on it, but I am kind of stuck at the moment waiting for more powerful lab gear… the servers I have had up until now just aren’t powerful enough to make my Violin arrays break into a sweat…]

So flash memory gives you the IOPS capabilities you need for virtualisation – with the additional advantage of protecting you against unpredictable latency when running at high utilisation.

Oracle Licensing

The other major topic to talk about with database virtualisation is Oracle licensing. As everyone who has ever bought one will testify, Oracle licenses are very expensive. Since Oracle licenses by the CPU core and then applies a multiplication factor based on the CPU architecture (e.g. 0.5 for most x86 processors), you can quickly rack up a massive license bill (plus ongoing support) for some of the larger multi-core processors available on the market today. By virtualising, can you tie VMs containing Oracle databases to just a specific set of CPUs (Oracle calls this server partitioning), and thereby reduce cost?

The complicated answer is that it depends on the hypervisor. The simple answer is almost always no. In the world of Oracle there are two methods of server partitioning: soft and hard. Oracle’s list of approved hard partitioning technologies includes Solaris 10 Containers, IBM LPARs and Fujitsu PPARs – these are the ones where you license only a subset of your processors. Everything that’s not on the approved hard partitioning list requires every processor core to be licensed. And guess what’s on the list of soft partitioning products? VMware. You can read VMware’s own take on that here. The case of Oracle VM is more complex. In general OVM is considered soft partitioning, so a full complement of licenses is required, but there are methods for configuring hard partitioning (both for OVM on SPARC and OVM on x86) so that this license saving can be achieved.
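As a rough sketch of the arithmetic involved: the 4-socket, 16-core host and the 8-core VM below are hypothetical, and the 0.5 core factor is the typical x86 figure mentioned above – always check Oracle’s actual core factor table and partitioning policy before relying on numbers like these:

# Rough sketch of the Oracle processor-license arithmetic discussed above.
# The host size and the 8-core VM are hypothetical; the 0.5 core factor is
# the typical x86 figure mentioned in the text. Always check the actual
# Oracle core factor table and partitioning policy before relying on this.

core_factor = 0.5                   # most x86 processors
host_cores = 4 * 16                 # hypothetical 4-socket, 16-core host = 64 cores
vm_cores = 8                        # hypothetical VM pinned to 8 cores

soft_partitioning_licenses = host_cores * core_factor   # license the whole host
hard_partitioning_licenses = vm_cores * core_factor     # license only the pinned cores

print(f"soft partitioning (e.g. VMware, default OVM): {soft_partitioning_licenses:.0f} licenses")
print(f"hard partitioning (approved configurations):  {hard_partitioning_licenses:.0f} licenses")
# 32 vs 4 processor licenses for the same database workload.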

Flash memory has an angle here as well, though. As I have discussed in my previous database consolidation posts, flash memory allows for greater utilisation of your CPUs (because of the reduction in IOWAIT time), which means you can do more with the same resources. So by using flash you can either resist the need for more CPUs (and therefore more Oracle licenses) or actually reduce them.

Virtualisation Means Consolidation

There are other challenges around virtualising databases. Many of them are the same as the challenges faced when consolidating databases: namely, how to achieve a better density of databases per unit of physical infrastructure (thereby realising more cost savings). One of the most important of these is memory (as in DRAM), which can often be the limiting factor when squeezing multiple virtualised databases into a confined physical space.

I’m not going to recycle the whole consolidation subject again here, since I (hopefully) covered all of these points in my series of articles on database consolidation. In this sense, you could consider database virtualisation a subset of database consolidation; effectively one of the methods for delivering it, although database virtualisation offers more than a simple consolidation platform.

I could probably write a whole load more on that subject, but as this blog entry is already long enough I’m going to just hand it over to my friends at Delphix instead.