Database Virtualisation: The End of Oracle RAC?
September 10, 2012
A long time ago (2003) in a galaxy far, far away (Denmark), a man wrote a white paper. However, this wasn’t an ordinary man – it was Mogens Nørgaard, OakTable founder, CEO of Miracle A/S and previously the head of RDBMS Support and then Premium Services at Oracle Support in Denmark. It’s fair to say that Mogens is one of the legends of the Oracle community and the truth is that if you haven’t heard of him you might have stumbled upon this blog by accident. Good luck.
The white paper was (somewhat provocatively) entitled, “You Probably Don’t Need RAC” and you can still find a copy of it here courtesy of my friends at iD Concept. If you haven’t read it, or you have but it was a long time ago, please read it again. It’s incredibly relevant – in fact I’m going to argue that it’s more relevant now than ever before. But before I do, I’m going to reprint the conclusions in their entirety:
- If you have a system that needs to be up and running a few seconds after a crash, you probably need RAC.
- If you cannot buy a big enough system to deliver the CPU power and or memory you crave, you probably need RAC.
- If you need to cover your behind politically in your organisation, you can choose to buy clusters, Oracle, RAC and what have you, and then you can safely say: “We’ve bought the most expensive equipment known to man. It cannot possibly be our fault if something goes wrong or the system goes down”.
- Otherwise, you probably don’t need RAC. Alternatives will usually be cheaper, easier to manage and quite sufficient.
Oracle RAC: What Is The Point?
To find out what the Real Application Clusters product is for, let’s have a look at the Oracle Database 2 Day + Real Application Clusters Guide and see what it says:
Oracle Real Application Clusters (Oracle RAC) enables an Oracle database to run across a cluster of servers, providing fault tolerance, performance, and scalability with no application changes necessary. Oracle RAC provides high availability for applications by removing the single point of failure with a single server.
So from this we see that RAC is a technology designed to provide two major benefits: high availability and scalability. The HA features are derived from being able to run on multiple physical machines, therefore providing the ability to tolerate the failure of a complete server. The scalability features are based around the concept of horizontal scaling, adding (relatively) cheap commodity servers to a pool rather than having to buy an (allegedly) more expensive single server. We also see that there are “no application changes necessary”. I have serious doubts about that last statement, as it appears to contradict evidence from countless independent Oracle experts.
That’s the technology – but one thing that cannot ever be excluded from the conversation is price. Technical people (I’m including myself here) tend to get sidetracked by technical details (I’m including myself there too), but every technology has to justify its price or it is of no economic use. At the time of writing, the Oracle Enterprise Edition license is showing up in the Oracle Shop as US$47,500 per processor. The cost of a RAC license is showing as US$23,000 per processor. That’s a lot of money, both in real terms and also as a percentage of the main Enterprise Edition license – almost 50% as much again. To justify that price tag, RAC needs to deliver something which is a) essential, and b) cannot be obtained through any other less-expensive means.
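To put rough numbers on that, here is a minimal sketch of the arithmetic using the two list prices quoted above. The processor count is a hypothetical figure of my own, and the sketch deliberately ignores discounts, support fees and Oracle's per-core licensing rules, so treat it as an illustration rather than a quote.

```python
# List prices quoted in the post (US$ per processor, at the time of writing).
EE_PRICE_PER_PROCESSOR = 47_500
RAC_PRICE_PER_PROCESSOR = 23_000


def license_cost(processors: int, with_rac: bool) -> int:
    """Return the list-price licence cost for a given processor count."""
    per_processor = EE_PRICE_PER_PROCESSOR
    if with_rac:
        per_processor += RAC_PRICE_PER_PROCESSOR
    return processors * per_processor


def rac_uplift() -> float:
    """RAC option price as a fraction of the Enterprise Edition price."""
    return RAC_PRICE_PER_PROCESSOR / EE_PRICE_PER_PROCESSOR


if __name__ == "__main__":
    processors = 8  # hypothetical licensed processor count, not from the post
    print(f"EE only:  ${license_cost(processors, with_rac=False):,}")
    print(f"EE + RAC: ${license_cost(processors, with_rac=True):,}")
    print(f"RAC adds {rac_uplift():.1%} on top of EE")
```

The uplift works out at a little over 48%, which is where the "almost 50% as much again" figure comes from.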
The theory behind RAC is that it provides higher availability by protecting against the failure of a server. Since the servers are nodes in a cluster, the cluster remains up as long as the number of failed nodes is less than the total number of nodes in that cluster.
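As a rough sketch of that theory, the rule can be written down in a couple of lines. The probability function is my own simplification: it assumes node failures are completely independent and that nothing shared (storage, interconnect, clusterware) ever fails, which is exactly the idealised picture rather than anything measured.

```python
def cluster_is_up(total_nodes: int, failed_nodes: int) -> bool:
    """The rule from the text: the cluster survives while any node survives."""
    return failed_nodes < total_nodes


def theoretical_availability(node_availability: float, total_nodes: int) -> float:
    """Chance that at least one node is up, ASSUMING independent node failures."""
    return 1 - (1 - node_availability) ** total_nodes


if __name__ == "__main__":
    print(cluster_is_up(total_nodes=3, failed_nodes=2))
    print(cluster_is_up(total_nodes=3, failed_nodes=3))
    # Hypothetical 99% per-node availability, chosen only for illustration.
    for nodes in (1, 2, 3):
        print(nodes, f"{theoretical_availability(0.99, nodes):.6f}")
```

On paper, every extra node multiplies away more of the downtime. Note that the model only says the cluster is up; it says nothing about what happens to the sessions that were connected to the node that died.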
It’s a great theory. However, there is a downside – and that downside is complexity. RAC systems are much more complex than single-instance systems, a fact which is obvious but still worth mentioning. In my previous role as a database product expert for Oracle Corporation I got to visit multiple Oracle customers and see a large number of Oracle installations, many of which were RAC. The RAC systems were always the most complicated to manage, to patch, to upgrade and to migrate. At no time do I ever remember visiting a customer who had implemented the various Transparent Application Failover (TAF) policies and Fast Application Notification (FAN) mechanisms necessary to provide continuous service to users of a RAC system where a node fails. The simple fact is that most users have to restart their middle tier processes when a node fails and as a result all of the users of that node are kicked off. However, because the cluster remained available they are able to call this a “partial outage” instead of taking the SLA hit of a “complete outage”.
This is just semantics. If your users experience a situation where their work is lost and they have to log back in to start again, that’s an outage. That’s the very antithesis of high availability to me. If the added complexity of RAC means that these service interruptions happen more frequently, then I question whether RAC is really the best solution for high availability. I’m not suggesting that there is anything wrong with the Oracle product (take note Oracle lawyers), simply that if you are not designing and implementing your applications and infrastructure to use TAF and FAN then I do not see how your availability really benefits.
Complexity is the enemy of high availability – and RAC, no matter how you look at it, adds complexity over a single-instance implementation of Oracle.
The scalability claim is that RAC allows platforms to scale horizontally, by adding nodes to a cluster as additional resources are required. According to the documentation quote above this is possible “with no application changes”. I assume this only applies to the case where nodes are added to an existing multi-node cluster, because going from single-instance to RAC very definitely requires application changes, or at least careful consideration of application code. People far more eloquent (and concise) than I have documented this before, but consider anything in the application schema which is a serialization point: sequences, inserts into tables using a sequential number as the primary key, that sort of thing. On a single instance these are cheap; on RAC every instance has to coordinate with the others to use them. You cannot expect an application to perform if you just throw it at RAC.
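To make the sequence example concrete, here is a toy model of my own. It is not how Oracle implements sequences internally, and the class and function names are invented for illustration. It simply counts how often instances have to go back to a shared, cluster-wide counter, depending on how many values each instance is allowed to cache locally.

```python
class GlobalSequence:
    """A single cluster-wide counter; every range grab is a coordinated step."""

    def __init__(self) -> None:
        self.next_value = 1
        self.coordinated_calls = 0

    def allocate(self, count: int) -> range:
        """Hand out the next `count` values and record the coordination."""
        self.coordinated_calls += 1
        start = self.next_value
        self.next_value += count
        return range(start, start + count)


class Instance:
    """One database instance drawing values from the shared sequence."""

    def __init__(self, sequence: GlobalSequence, cache_size: int) -> None:
        self.sequence = sequence
        self.cache_size = cache_size
        self.cached = iter(())  # starts with an empty local cache

    def nextval(self) -> int:
        value = next(self.cached, None)
        if value is None:  # local cache exhausted: go back to the cluster
            self.cached = iter(self.sequence.allocate(self.cache_size))
            value = next(self.cached)
        return value


def coordinated_calls(instances: int, inserts_per_instance: int, cache_size: int) -> int:
    """Count cluster-wide coordination steps for a round-robin insert workload."""
    sequence = GlobalSequence()
    nodes = [Instance(sequence, cache_size) for _ in range(instances)]
    for _ in range(inserts_per_instance):
        for node in nodes:
            node.nextval()
    return sequence.coordinated_calls


if __name__ == "__main__":
    print(coordinated_calls(instances=2, inserts_per_instance=1000, cache_size=1))
    print(coordinated_calls(instances=2, inserts_per_instance=1000, cache_size=100))
```

With a cache of one, every insert on every instance is a cluster-wide step; with a cache of 100 the same workload needs a hundredth as many. The price is that values handed out by different instances are no longer in strict time order, which is precisely the kind of thing the application has to be able to tolerate, and why "no application changes" deserves some scepticism.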
To understand the scalability point of RAC, it’s important to take a step back and see what RAC actually does conceptually. The answer is all about abstraction. RAC takes the one-to-one database-to-instance relationship and changes it to a one-to-many, so that multiple instances serve one database. This allows for the newly-abstracted instance layer to be expanded (or contracted) without affecting the database layer.
This is exactly the same idea as virtualisation of course. In virtualisation you take the one-to-one physical-server-to-operating-system relationship and abstract it so that you can have many virtual OSes to each physical server. In fact in most virtualisation products you can take this even further and have many physical servers supporting those virtual machines, but the point is the same: by adding that extra layer of abstraction the resources which used to be tied together now become dynamic.
This is where the concept of RAC fails for me. For a start, modern servers are extremely powerful, and comparatively cheap. You don’t need to buy a mainframe-style supercomputer in order to run a business-critical application, not when 80-core x86 servers are available and chip performance is rocketing at the speed of Moore’s Law.
Database Virtualisation Is The Answer
Virtualisation technology, whether from VMware, Microsoft or one of the other players in that market, allows for a much better expansion model than RAC in my opinion. The reason for this is summed up perfectly by Dr. Bert Scalzo (NoCOUG journal page 23) when he says, “Hardware is simply a dynamic resource”. By abstracting hardware through a virtualisation layer, the number and type of physical servers can now be changed without having to change the applications running on top in virtual machines.
Equally, by using virtualisation, higher service levels can be achieved due to the reduced complexity of the database (no RAC) and the ability to move virtual machines across physical domains with limited or no interruption. VMware’s vMotion feature, for example, allows for the online migration of Oracle databases with minimal impact to applications. Flash technologies such as the flash memory arrays from Violin Memory allow for the I/O issues around virtualisation to be mitigated or removed entirely. Software exists for managing and monitoring virtualised Oracle environments, whilst leading players in the technology space tell the world about their successes in adopting this model.
What’s more, virtualisation allows for incredible benefits in terms of agility. New Oracle environments can be built simply by cloning existing ones, and multiple copies and clones can be taken for use in dev / test / UAT environments with minimal administrative overhead. Self-service options can be automated to give users the ability to get what they want, when they want it. The term “private cloud” stops being marketing hype and starts being an achievable goal.
And finally there’s the cost. VMware licenses are not cheap either, but hardware savings start to become apparent when virtualising. With RAC, you would probably avoid consolidating multiple applications onto the same nodes – an ill-timed node eviction would take out all of your systems and leave you with a real headache. With the added protection of the VM layer that risk is mitigated, so databases can be consolidated and physical hardware shared. Think about what that does to your hardware costs, operational expenditure and database licensing costs.
OK, so the title of this post was deliberately straying into the realms of sensationalism. I know that RAC is not dead: people will be running RAC systems for years to come. But for new implementations, particularly for private-cloud, IT-as-a-service style consolidation environments, is it really a justifiable cost? What does it actually deliver that cannot be achieved using other products, products that actually provide additional benefits too?
Personally, I have my doubts – I think it’s in danger of becoming a technology without a use case. And considering the cost and complexity it brings…