Database Virtualisation: The End of Oracle RAC?

A long time ago (2003) in a galaxy far, far away (Denmark), a man wrote a white paper. However, this wasn’t an ordinary man – it was Mogens Nørgaard, OakTable founder, CEO of Miracle A/S and previously the head of RDBMS Support and then Premium Services at Oracle Support in Denmark. It’s fair to say that Mogens is one of the legends of the Oracle community and the truth is that if you haven’t heard of him you might have stumbled upon this blog by accident. Good luck.

The white paper was (somewhat provocatively) entitled, “You Probably Don’t Need RAC” and you can still find a copy of it here courtesy of my friends at iD Concept. If you haven’t read it, or you have but it was a long time ago, please read it again. It’s incredibly relevant – in fact I’m going to argue that it’s more relevant now than ever before. But before I do, I’m going to reprint the conclusions in their entirety:

  • If you have a system that needs to be up and running a few seconds after a crash, you probably need RAC.
  • If you cannot buy a big enough system to deliver the CPU power and/or memory you crave, you probably need RAC.
  • If you need to cover your behind politically in your organisation, you can choose to buy clusters, Oracle, RAC and what have you, and then you can safely say: “We’ve bought the most expensive equipment known to man. It cannot possibly be our fault if something goes wrong or the system goes down”.
  • Otherwise, you probably don’t need RAC. Alternatives will usually be cheaper, easier to manage and quite sufficient.

Oracle RAC: What Is The Point?

To find out what the Real Application Clusters product is for, let’s have a look at the Oracle Database 2 Day + Real Application Clusters Guide and see what it says:

Oracle Real Application Clusters (Oracle RAC) enables an Oracle database to run across a cluster of servers, providing fault tolerance, performance, and scalability with no application changes necessary. Oracle RAC provides high availability for applications by removing the single point of failure with a single server.

So from this we see that RAC is a technology designed to provide two major benefits: high availability and scalability. The HA features are derived from being able to run on multiple physical machines, therefore providing the ability to tolerate the failure of a complete server. The scalability features are based around the concept of horizontal scaling, adding (relatively) cheap commodity servers to a pool rather than having to buy an (allegedly) more expensive single server. We also see that there are “no application changes necessary”. I have serious doubts about that last statement, as it appears to contradict evidence from countless independent Oracle experts.

That’s the technology – but one thing that cannot ever be excluded from the conversation is price. Technical people (I’m including myself here) tend to get sidetracked by technical details (I’m including myself there too), but every technology has to justify its price or it is of no economic use. At the time of writing, the Oracle Enterprise Edition license is showing up in the Oracle Shop as US$47,500 per processor. The cost of a RAC license is showing as US$23,000 per processor. That’s a lot of money, both in real terms and also as a percentage of the main Enterprise Edition license – almost 50% as much again. To justify that price tag, RAC needs to deliver something which is a) essential, and b) cannot be obtained through any other less-expensive means.
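
As a purely illustrative calculation (list prices, before any discount, assuming Oracle’s published core factor of 0.5 for common x86 processors):

    4 nodes x 16 cores x 0.5 core factor = 32 processor licenses
    Enterprise Edition: 32 x US$47,500   = US$1,520,000
    RAC option:         32 x US$23,000   = US$736,000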

High Availability

The theory behind RAC is that it provides higher availability by protecting against the failure of a server. Since the servers are nodes in a cluster, the database remains available as long as at least one node in that cluster survives – the failure of any single server no longer means a complete outage.

It’s a great theory. However, there is a downside – and that downside is complexity. RAC systems are much more complex than single-instance systems, a fact which is obvious but still worth mentioning. In my previous role as a database product expert for Oracle Corporation I got to visit multiple Oracle customers and see a large number of Oracle installations, many of which were RAC. The RAC systems were always the most complicated to manage, to patch, to upgrade and to migrate. At no time do I ever remember visiting a customer who had implemented the various Transparent Application Failover (TAF) policies and Fast Application Notification (FAN) mechanisms necessary to provide continuous service to users of a RAC system where a node fails. The simple fact is that most users have to restart their middle tier processes when a node fails and as a result all of the users of that node are kicked off. However, because the cluster remained available they are able to call this a “partial outage” instead of taking the SLA hit of a “complete outage”.

This is just semantics. If your users experience a situation where their work is lost and they have to log back in to start again, that’s an outage. That’s the very antithesis of high availability to me. If the added complexity of RAC means that these service interruptions happen more frequently, then I question whether RAC is really the best solution for high availability. I’m not suggesting that there is anything wrong with the Oracle product (take note Oracle lawyers), simply that if you are not designing and implementing your applications and infrastructure to use TAF and FAN then I do not see how your availability really benefits.
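
For reference, client-side TAF is configured in the connect descriptor of the service. A minimal sketch might look like this (hostnames and service names are illustrative):

    # TYPE = SELECT lets in-flight queries resume on a surviving node;
    # METHOD = BASIC creates the backup connection only at failover time.
    SALES =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
        (CONNECT_DATA =
          (SERVICE_NAME = sales_svc)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5))
        )
      )

Even then, TAF only replays queries – in-flight transactions are rolled back – which is why the middle tier also needs FAN-aware connection pools before node failures become invisible to users.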

Complexity is the enemy of high availability – and RAC, no matter how you look at it, adds complexity over a single-instance implementation of Oracle.

Scalability

The claim here is that RAC allows for platforms to scale horizontally, by adding nodes to a cluster as additional resources are required. According to the documentation quote above this is possible “with no application changes”. I assume this only applies to the case where nodes are added to an existing multi-node cluster, because going from single-instance to RAC very definitely requires application changes – or at least careful consideration of application code. People far more eloquent (and concise) than I have documented this before, but consider anything in the application schema which is a serialization point: sequences, inserts into tables using a sequential number as the primary key, that sort of thing. You cannot expect an application to perform if you just throw it at RAC.
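
To give one concrete (and hypothetical) example of such a serialization point: a hot sequence with the default small cache forces cross-instance coordination on every fetch, so a common RAC remediation is a larger cache with NOORDER – at the cost of gaps and out-of-order key values:

    -- Hypothetical example: relieve cross-instance contention on a hot sequence
    ALTER SEQUENCE orders_seq CACHE 1000 NOORDER;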

To understand the scalability point of RAC, it’s important to take a step back and see what RAC actually does conceptually. The answer is all about abstraction. RAC takes the one-to-one database-to-instance relationship and changes it to a one-to-many, so that multiple instances serve one database. This allows for the newly-abstracted instance layer to be expanded (or contracted) without affecting the database layer.
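
You can see this abstraction directly in the data dictionary: on RAC the GV$ views return one row per instance serving the same database (a quick sketch, runnable from any instance):

    -- One database, many instances: one row per running instance
    SELECT inst_id, instance_name, host_name, status FROM gv$instance;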

This is exactly the same idea as virtualisation of course. In virtualisation you take the one-to-one physical-server-to-operating-system relationship and abstract it so that you can have many virtual operating systems on each physical server. In fact, in most virtualisation products you can take this even further and have many physical servers supporting those virtual machines, but the point is the same – by adding that extra layer of abstraction, the resources which used to be tied together now become dynamic.

This is where the concept of RAC fails for me. Modern servers are extremely powerful – and comparatively cheap. You don’t need to buy a mainframe-style supercomputer in order to run a business-critical application, not when 80-core x86 servers are available and chip performance is rocketing along at the speed of Moore’s Law.

Database Virtualisation Is The Answer

Virtualisation technology, whether from VMware, Microsoft or one of the other players in that market, allows for a much better expansion model than RAC in my opinion. The reason for this is summed up perfectly by Dr. Bert Scalzo (NoCOUG journal, page 23) when he says, “Hardware is simply a dynamic resource”. By abstracting hardware through a virtualisation layer, the number and type of physical servers can now be changed without having to change the applications running on top in virtual machines.

Equally, by using virtualisation, higher service levels can be achieved due to the reduced complexity of the database (no RAC) and the ability to move virtual machines across physical domains with limited or no interruption. VMware’s vMotion feature, for example, allows for the online migration of Oracle databases with minimal impact to applications. Flash technologies such as the flash memory arrays from Violin Memory allow for the I/O issues around virtualisation to be mitigated or removed entirely. Software exists for managing and monitoring virtualised Oracle environments, whilst leading players in the technology space tell the world about their successes in adopting this model.

What’s more, virtualisation allows for incredible benefits in terms of agility. New Oracle environments can be built simply by cloning existing ones, and multiple copies and clones can be taken for use in dev / test / UAT environments with minimal administrative overhead. Self-service options can be automated to give users the ability to get what they want, when they want it. The term “private cloud” stops being marketing hype and starts being an achievable goal.

And finally there’s the cost. VMware licenses are not cheap either, but hardware savings start to become apparent when virtualising. With RAC, you would probably avoid consolidating multiple applications onto the same nodes – an ill-timed node eviction would take out all of your systems and leave you with a real headache. With the added protection of the VM layer that risk is mitigated, so databases can be consolidated and physical hardware shared. Think about what that does to your hardware costs, operational expenditure and database licensing costs.

Conclusion

OK, so the title of this post was deliberately straying into the realms of sensationalism. I know that RAC is not dead – people will be running RAC systems for years to come. But for new implementations, particularly for private-cloud, IT-as-a-service style consolidation environments, is it really a justifiable cost? What does it actually deliver that cannot be achieved using other products – products that actually provide additional benefits too?

Personally, I have my doubts – I think it’s in danger of becoming a technology without a use case. And considering the cost and complexity it brings…

54 Responses to Database Virtualisation: The End of Oracle RAC?

  1. kevinclosson says:

    Good post.

    I have a comment regarding the following quote:

    “This is where the concept of RAC fails for me. Firstly, modern servers are extremely powerful – and comparatively cheap. You don’t need to buy a mainframe-style supercomputer in order to run a business-critical application, not when 80 core x86 servers are available and chip performance is rocketing at the speed of Moore’s Law.”

    Remember to always right-size an application. The more cores a server has the slower Oracle runs on a per-core basis and, indeed, we license by the core. Oracle (as per audited TPC-C results) services roughly 30% fewer TpmC at 80 cores than at 12. I suspect Violin folks know these TPC-C results pretty well.

    In a perfect (the opposite of insane) world we could build mixed-CPU clusters and move databases (with virtualization) so as to match each database to the servers with the fewest sockets that satisfy the requirements of that particular database.

    For more words on the matter of large systems versus quick systems I’ll offer a link to my recent interview in the NoCOUG journal: http://kevinclosson.files.wordpress.com/2012/08/nocoug_journal_201208.pdf

    • flashdba says:

      Kevin – thank you as always for visiting.

      You are right of course, in fact it’s been on my mind to write a separate blog post about this ever since you alluded to it in that excellent NoCOUG journal. However, the more I think about that article, the more I find myself wading into the technical weeds of things that I think I know but cannot be 100% sure about.

      So in the end I’ve decided to petition you to write more about it instead. 🙂

    • Totally agree Kevin, but wouldn’t you also agree that all that CPU power is not actually needed for databases (write better code!)? There should be more than enough power in today’s systems to run a database anyway… In a perfect world (the opposite of insane), Moore’s Law wouldn’t have to keep pace with the demand for CPU power created by the twelve-monthly “this is an even better method – reinvent it all” product life cycle…

      • kevinclosson says:

        CPUs will never be “fast enough.” I do know they are faster than people think though. I also know that people waste them through under-utilization. Don’t know what else to say.

        The point I was making doesn’t need factoring against any of the variables though. I’m simply stating the fact that smaller systems have bandwidth and latency characteristics that offer vastly improved throughput/core and when we license by the core that matters.

  2. Nice post … love the title, I bumped into one of the legal guys at TVP and he loved it too 🙂

    It’s funny you posted this as I was only thinking about this recently and wondering how RAC could stand up against Virtualisation as it becomes more widely adopted. A lot of people you speak to still have the fear when it comes to Virtual, but I see that rapidly disappearing and you are quite correct that RAC will struggle to stand up against it. However, I would question the complexity thing – most DBAs are pretty comfortable with RAC now (given its widespread use) – most are not that familiar with virtual tech… other than vbox for practice environments (I know how much you love vbox). I guess most orgs would bring in specialists to manage a proposed virtual environment especially if it is for live – which in turn adds more cost and perceived complexity.

    I don’t see RAC disappearing any time soon (mostly because of how widespread it is) but I do strongly agree with the overall message, probably more so the original message – “You probably don’t need RAC” – my last employer was the 2nd largest energy supplier in the UK and we managed quite happily without it for all our live systems, OLTP and DW.

    Keep up the good work – I’ve passed on your new number to our legal guys…

    • flashdba says:

      Thanks Paul. As always your northern sense of humour is appreciated – not sure who by though? 🙂

      You say you question the complexity line – but seriously, how long does it take you to install a single instance database from a bare Linux install? Now how long does it take to install a RAC system? I’m not just talking about the latest patchset here, I’m talking about applying all the PSUs etc using OPatch. Will OPatch work with the -auto option or will it decide to randomly sprinkle your Oracle Home with half-compiled files and then die? The more nodes you have, the lower your chances of making it out the other side!!

      You and I used to work with the same customer. Which system did that customer have that was almost entirely stable? The biggest system in their whole enterprise… it was single instance. And the others, the ones that fell over night and day, causing us to spend hours on conference calls… all RAC. I owe about 20% of last year’s earnings to overtime caused by RAC (maybe I should thank it!)

      But the main point is that DBAs don’t make this decision. They certainly influence, but this is a decision made by the CIO, the CTO, or the person who has to pay the license fee. Everything in this article is related to the cost of RAC versus what it delivers. I just don’t see that it’s providing value for money going forward…

      I’m glad to see you out in blog-land. When is the new Paul Till @ Pythian blog starting up?

  3. kevinclosson says:

    Oops… wrong URL, please substitute: http://kevinclosson.files.wordpress.com/2012/08/nocoug_journal_201208.pdf

    As for the small system vs large system piece I will be blogging it and the message is all about virtualization and right-sizing. But please allow the following link for your readers. Open the pic and see the numbers: https://twitter.com/kevinclosson/status/245213517416890368

  4. Henrik Krobath says:

    Having worked for Mogens – and heard the mantra “You probably don’t need RAC” – I find this quite interesting reading as I am currently working in the Telco business where we are working on building a private cloud infrastructure to be able to provision Oracle databases as a service to our projects.
    In this process we have discussed vmware vs. RAC and actually ended up discarding vmware as an option due to backup and recovery issues. VMware is unable to back up to our SAN-attached Virtual Tape Library because you cannot attach the VTL devices directly to the vmware guests, as this would interfere with vMotion and the like.
    For us to go with vmware we’d end up increasing complexity of our backup infrastructure and strategy in general.
    Also, another issue for us was I/O. Currently we do not use SSD/flash storage in any of our database systems so we do not have the benefit of such technologies unless we make a substantial investment. That leaves us with the I/O we can get from our SAN via the physical host HBAs. I do agree that we’d be able to put in the number of HBAs we need to get sufficient bandwidth, but in turn that requires buying bigger servers with enough slots for NICs and HBAs.
    I think my point is that the issue is not entirely black and white – not when it comes to price, nor when it comes down to complexity.
    Also, we’ve been discussing the level of HA provided by vmware. In my opinion the HA provided by vmware can at the very best be compared to an active-passive cluster – and not even completely. Given a hardware failure in your vmware physical host, your virtual machines will crash. VMware will pick up on this event and reboot the virtual machines elsewhere in the vmware farm. That means you’ll be waiting not only for the database to start up and do media recovery but also for the guest OS to boot as well.
    We found the possibility to do vMotion to free up physical hosts to exchange hardware and do maintenance to be the major benefit from vmware.

    Best Regards

    Henrik Krobath

    • Thanks Henrik for sharing your thoughts. I am currently working with some customers who are trying to figure out their “database services” datacenter VMware virtualization implementation as well. Always good to know the pros and cons.

    • flashdba says:

      Hi Henrik, thanks for posting – I appreciate your input into this debate. You make some good points which I want to respond to.
      – it’s interesting that you discounted VMware. Of course, my article is about virtualisation technologies in general, of which VMware is only one – although the most prevalent one. Many others are in use in Enterprises I have spoken to, for example Hyper-V has a large presence and is commonly used for virtualising SQL Server. In fact, the virtualisation of SQL is much more mature and has been showing us Oracle people the way.
      – I don’t see that switching to flash is a substantial investment. Unlike the database appliance route, flash infrastructure can be a steady replacement rather than an all-out splurge of cash. Do you really not have enough HBAs on your servers?
      – I agree with you that at best VMware or virtualisation technologies can be compared to active-passive clusters. But my main point, one which I will expand on in the next post, is that this offers sufficient protection for most requirements and is substantially cheaper than the active-active option, as well as considerably less complex.

      But my favourite point that you made was, “the issue is not entirely black and white”. I absolutely agree – and I encourage anyone designing or architecting a platform to give consideration to all of the options rather than believe the marketing and sales hype. Buy the solution that gives you what you need without overpaying for technology which may not benefit you in the way you expect. I don’t blame the people in sales and marketing, it’s their job to get you to buy more stuff. But when a large company has an incumbent product which starts to lose its purpose, you cannot expect them to hold their hands up and tell you there may be better alternatives. Whether it’s Oracle with RAC or the big storage companies with their spinning disk arrays, they need those revenue streams and so will continue to wring every last dollar from you. It’s down to the buyer to be wise and work out what the real value is.

      • Oracle’s licensing policies – especially around VMware/Hyper-V and anything that’s not hard partitioned – make virtualizing Oracle a much more costly proposition than RAC in most cases. Agreed, this is an artificial imposition at the moment, but am I correct in saying that most x86 VM solutions also do not take kindly to the on-the-fly appearance of additional CPUs in the guests?

        • flashdba says:

          Oracle’s licensing policy means that in many cases you have to license every physical CPU regardless of how many you are using in the VM (one notable exception to this is the hard partitioning option in Oracle VM which has not been extended to other non-Oracle hypervisors).
          However, that doesn’t stop customers from saving on license costs by consolidating and virtualizing Oracle databases. If you have ten database servers for discrete databases and applications then consolidating them will allow for the possibility of reduced license costs. Just don’t run non-Oracle stuff on your new consolidated, virtualized system because you’ll require more CPU cores and therefore more licenses.
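
          As a hypothetical illustration (x86 core factor of 0.5, licensing every physical core):

            10 discrete servers x 8 cores x 0.5   = 40 processor licenses
            3 consolidated hosts x 16 cores x 0.5 = 24 processor licenses
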
          On the fly addition of CPUs… hmm that’s not something I’ve experimented with to be honest.

  5. Allan Hirt says:

    As a current Microsoft Cluster MVP and SQL Server guy who has been very closely associated with clustering for over 12 years, it’s interesting to read your take on how the other half lives. Can’t tell you how many times I’ve heard over the years “So when is MS going to implement RAC for SQL Server?” – they still haven’t done it and I have no insider information on whether they will or won’t. SQL has always been scale-up, and I agree – with today’s multi-core servers, where it’s not hard to buy 32, 64, or even 128+ GB of RAM, it’s amazing how far we’ve come in computing power.

    RAC to me, like SQL Server failover clustering (which is built on top of a Windows Server failover cluster), has a single point of failure: disk. And in RAC’s case, performance as well – at least from an outsider POV. For me, it’s not about the distributed lock manager. But if you have multiple processing entities, your I/O needs to scale accordingly. That’s just the way it is. A crappy disk subsystem – SQL or Oracle – will just underperform, and it’s more glaring the more load you throw at it.

    The SQL Server community embraced virtualization a long time ago and it’s a pretty mature space with lots known about both Hyper-V and VMware.

    You also touch on something I talk to my customers about all the time – the app. For the app to work in these environments, let alone scale, it needs to be coded right. The platform isn’t magic. Garbage in, garbage out.

    I strive to put in manageable, easy to maintain solutions for my customers. Just because I’m a clustering guy doesn’t mean I should or need to implement it. They need to live with it; I don’t own it long term.

    Anyway, good read. Thanks!

    • flashdba says:

      Hey Allan – thanks, I always value hearing from the Microsoft side of the fence. Sometimes Oracle people tend to be unfairly dismissive of SQL (partly fear of the unknown I suspect), but I believe that there is a lot we can learn from looking at the alternatives. Wherever Oracle made one choice and SQL made another, it’s worth looking to see what the consequences were. The way that SQL has embraced virtualization is a great example of two products choosing different forks in the road; many enterprises are happily running vast estates of SQL virtualized on Hyper-V or VMware and yet we are only just reaching the tipping point for doing the same with Oracle.
      You said, “I strive to put in manageable, easy to maintain solutions for my customers”. I couldn’t agree more – complexity is the enemy of reliability!

      • Allan Hirt says:

        There’s a lot of FUD on both sides of the fence (not the least of which is that SQL Server doesn’t scale, which is so untrue it’s not funny). I’ve used a bunch of RDBMSes over the years including Oracle (including the old NLM version for Netware which definitely dates me). Virtualization brings a different set of issues to the table with things like the proper way to do I/O, overcommitting resources, etc. It’s not unlike database consolidation where you can over-consolidate into one or a handful of instances, only to find out you now need to back out some stuff. Virtualization doesn’t kill physical deployments, but with today’s large systems, virtualization (H-V, VMware, Xen, whatever) makes sense in an enterprise for many types of deployments.

        The biggest problem a lot of DBAs face – and I bet it is the same in the Oracle world – is understanding what they need in terms of performance (memory, CPU, I/Ops) to be able to say whether virtualization is the right choice and/or make requests of the respective server teams. Those who don’t won’t generally do well in physical or virtual deployments. If you’re myopic and don’t understand any HW or OS, managing your RDBMS is going to be an uphill battle.

  6. Gaja says:

    Dear FlashDBA,

    Check the following LinkedIn RAC SIG discussion, where Kevin and I were engaged in battle for many days 🙂 Needless to say, it was quite revealing and fun 🙂
    http://www.linkedin.com/groupItem?view=&gid=3156190&type=member&item=104224468&commentID=74659051#commentID_74659051

  7. No worries 🙂 I should have been more specific but at any rate, yes, it was an interesting debate. Connor’s comment is bang on target. Like Connor, I am a big proponent of simplicity and know that there are alternative options to achieve the infamous five nines of system uptime without introducing the complexities that RAC poses (along with its inherent overhead). I still hold to my stance that RAC should be considered only to solve specific scalability problems, and with regard to HA it is only relevant in special cases within the same data center. In today’s world of geographically distributed data centers, HA is far better served by Active Data Guard and some very simple yet intelligent network traffic management using F5 switches.

    • flashdba says:

      Gaja, it’s refreshing to hear that someone of your stature and experience in the Oracle world feels that way. I often think that the perfect HA solution is something unachievable – a sort of theoretical system that our real-life designs can only tend towards. There are two reasons we can never completely achieve it. One is cost. But even for someone with potentially infinite funding, the second reason is insurmountable: complexity.

      One of the ways of providing HA capabilities is to add layers of abstraction, as I mentioned in my article above. You continue to break up the stack into more layers and abstract them… and then provide redundancy for them. For example, by abstracting the operating system and then virtualising it we can protect it from the failure of an underlying server. By abstracting the instance using RAC and running multiple copies on different servers or VMs we can protect from failure of the operating system. We can do the same with application servers. But every additional layer of HA adds complexity, which is the absolute worst scenario for any HA system. The complexity is proportional to the number of HA features used.

      Gaja, I think perhaps it’s time we coined a new phrase based on your previous work… How about Compulsive HA Disorder? 🙂

      • Indeed…I think Compulsive HA Disorder (CHD) can be the new term to portray what we are seeing out there 😉

        Jokes aside, I know the LinkedIn thread was very long, but in there I have suggested an alternative to RAC-based HA: two single-instance databases on two different machines, potentially even in different data centers which, if within reasonable distance (30-40 km), can be connected with dark fiber. You then set up Active Data Guard to transport logs in SYNC mode, thereby ensuring that the “secondary” instance never needs to catch up. The secondary instance can be used for all reporting needs. All transaction traffic still goes to the primary. As noted earlier in this thread, F5 can do the magic of managing read vs. write traffic from the application. Needless to say, Data Guard Broker sits in the middle between the 2 data centers, as do some of the key F5 components.
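
        In Data Guard Broker terms, that SYNC transport setup amounts to something like this (a sketch – database names are illustrative):

          DGMGRL> EDIT DATABASE 'primary_db' SET PROPERTY LogXptMode = 'SYNC';
          DGMGRL> EDIT DATABASE 'standby_db' SET PROPERTY LogXptMode = 'SYNC';
          DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;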

        For DR, log transport is set to ASYNC for the obvious reason of geographical dispersion. From a cloud computing perspective, I believe that virtualization will assist us in the area of “elasticity” of the computing resource, based on demand. The idea you propose is indeed something that we did consider when we were evaluating alternatives, but we ran into some issues with implementing it ACROSS data centers. Thus, we took the ADG and F5 route. The quasi-cloud I am referring to here was private, and was constrained in many ways. Would love to sit down and truly understand the details of your proposal 🙂

  8. Dontcheff says:

    Very well said and very true indeed: “But the main point is that DBAs don’t make this decision. They certainly influence, but this is a decision made by the CIO, the CTO, or the person who has to pay the license fee.”

    From all I have seen, RAC is everywhere. Everywhere I go it is RAC, RAC or Exadata (which is also RAC).

    So, if DBAs were the decision makers, would we have less or more RAC?

    • flashdba says:

      Hi Julian

      That’s a great question! I know a lot of DBAs, although I can’t claim that the selection I know is representative. Still, I think there are probably a few camps. Firstly there are the guys, often grizzled veterans, who might say that RAC is a lot of extra effort for patching, both in terms of planning and implementation… as well as additional complexity and risk of human error. Then there are the younger DBAs looking to gain more experience, who may say that RAC is a great thing. Secretly that might not necessarily be true, but they want the experience. I know that these people exist because a long time ago I am afraid I was one of them. I remember about 8 years ago when I was working as an Oracle DBA and persuaded my company that RAC was a great option for them. I presented them with the arguments – and at the time, I believed them to be true. But looking back I realise that I was biased in favour of RAC because I wanted to gain experience of it and be a more “complete” DBA. Also, in that role I wasn’t involved in the financials so I didn’t consider whether the supposed benefits were actually cost-efficient… Unfortunately, with hindsight, I do not believe they were – and I realise that I would have been a better DBA had I understood that there was more to the job than just the technical aspects of using Oracle.

      We live and learn eh?

      Thanks for stopping by Julian, I’ve read your blog in the past and wholeheartedly agree with your post about Oracle Certification. For anyone else reading this I highly recommend it:

      Oracle Certification: the confidence to press enter

  9. Great article! Kudos to you. As an Oracle technologist (started with Oracle 6) (Enterprise Architect for Veracity Consulting now, and in my past with IBM Global Services), I have struggled with implementing RAC. I struggled not just with the setup/installation component but, more importantly (in the last 5+ years especially), I had failed to properly bring across to clients my point that RAC should not be implemented, and that instead a virtual solution would have met or exceeded 90% of the client’s technical goals. This article tied it all together for me. Lastly, THANKS for bringing up “Violin Memory” to address high disk I/O needs (for example highly transactional active database systems that will produce terabytes of archive/transaction logs in a single day) in a VM-Oracle system.

    • flashdba says:

      Roger

      Thanks! When I wrote this article I never expected so many people to contact me, either publicly or privately, and express their shared frustration. It’s been a revelation that there are so many experienced professionals thinking the same thing.

      As for your comment regarding Violin Memory, it would be cool to have a customer like Veracity, so don’t hesitate to get in touch and see if our flash memory arrays could benefit your business and those of your customers.

      By the way, you mentioned Oracle 6. Do you remember the old “monitor” command in SQLDBA? That takes me back a bit… things seemed a lot simpler then. If I remember correctly, Oracle 6 supported referential integrity but didn’t actually enforce it – even then Oracle Marketing had a certain way with words. I found this great article by Martin Widlake on the subject of Oracle Nostalgia – happy memories!

      Oracle Nostalgia

  10. Pingback: My Application Platform Quotes

  11. Pingback: Curb Your Enthusiasm: Database Virtualization « Julian Dontcheff's Database Blog

  12. mdinh says:

    Is technology just a fad? Often I have seen implementations chosen without considering the cost or complexity of the solutions. Other times, it’s penny wise and pound foolish.

  13. Jakub Wartak says:

    “At no time do I ever remember visiting a customer who had implemented the various Transparent Application Failover (TAF) policies and Fast Application Notification (FAN) mechanisms necessary to provide continuous service to users of a RAC system where a node fails. The simple fact is that most users have to restart their middle tier processes when a node fails and as a result all of the users of that node are kicked off.”

    … and the reason is a lack of technical management in those companies that would enforce HA tests. The same would happen for a cold failover solution… but RAC has the advantage that it uses all nodes concurrently, so any technical problem can be detected more easily and not during the worst failover conditions (for me this really matters – how can one prove that failover would work without regularly testing it? and we are back to the same discussion)… so actually the customers are “guilty” themselves because they simply are not capable of maximizing ROI. This applies even to Oracle Corp. itself, like releasing GI PatchSet 11.2.0.3.0 without working VIP/SCAN failovers due to GARP bugs on Linux x86_64 – it just illustrates the point about tests/QA even inside Oracle…

    The additional point is the confusion that there are three methods of failover: TAF, FAN and FCF (based on FAN but still separate). RAC is the only high availability technology that ***can*** protect from DB server failure in under 40-50s for 1/N sessions, and this is still not fault tolerance (you wrote “to provide continuous service to users” – I guess it will not achieve that even in the next 4-5 years). People often confuse Fault Tolerance and High Availability… I’m still waiting for a proper Fault Tolerant solution that can give five or six nines availability on x86_64 commodity servers, but I guess it would have to do some kind of live memory mirroring or be integrated with special JDBC drivers multiplexing to N nodes concurrently… Yeah, and TAF can fail a session over without rollback… but SELECTs only – perhaps that’s good enough for DWH/BI 🙂

    Personally I believe that with RAC it’s all about RTO and costs. If someone doesn’t care about the difference between 1 minute of downtime (for e.g. 50% or 33% of Oracle sessions) and a total outage of 5 minutes then he definitely doesn’t need RAC in the OLTP world. I still believe that RAC is the way to go, but only with Data Guard and the ability to do Transient Logical Standby upgrades and DG and RAC rolling upgrades. If one can justify such an environment (with 3-4 DBAs who can manage it), then he really needs it 🙂 There is no such thing as impossible in engineering – it’s all about understanding real needs, costs and regular testing/QA… one can only congratulate the RAC Sales guys on being able to sell it to anyone ;]

    Keep good blogging 🙂

    • flashdba says:

      Hey Jakub

      RAC, Data Guard and Transient Logical Standby… woah! Imagine how complicated that environment would be – you are right that it would need at least 4 DBAs. I think that most HA technologies are designed to protect against hardware or software failures, e.g. failed components, bugs etc. But the more products you implement to protect against those, the more complex the environment becomes and so the more prone you are to human errors… as well as more bugs… So somewhere in the middle there must be a “sweet spot” where you have the best protection you can get without losing the simplicity required to keep your environment manageable.

      That’s great, but there is the additional factor of cost. And with Oracle, the cost of licensable features is often high! So once you take that into account all those extra features become even less appealing.

      The bottom line for me is that even after all this time, the title of Mogens’ white paper still rings true. You probably don’t need RAC!

      • Jakub Wartak says:

        Well, every problem can be mitigated – it’s just a matter of cost. HA is about people, processes and technology, IMHO in that particular order, but I find this is hardly understood inside companies. Everyone just throws the last bit (technology) at the problem. HA is never cheap and is never a fire-and-forget missile. HA is not a project, it’s rather an ongoing effort. RAC is definitely in too many places, for sure, but it is a specialized (a “little” pricey) technology. RAC is IMHO not for e.g. ERP-like systems, internal systems, etc. – it is for 24/7/365 shops running under 3/4/5 nines SLAs where the business is paying for that. RAC doesn’t seem to make a lot of sense for 2-node clusters either.

        Human errors can be mitigated by a serious change control process and technical change review processes and procedures… Bugs can be eliminated by having multiple testing environments, routine tests (preferably under application load), stabilization efforts, technical risk assessments, etc.

        I agree that RAC doesn’t belong in the virtualization, IaaS, Standard Edition world (look at what Oracle Sales is promoting… a system with RAC but SE, so no DBMS_REDEFINITION/online operations – sigh! – or, what’s worse, without even the ability to use AWR/ASH to see what’s happening inside… a Highly Available Black Box? Isn’t HA about understanding, prediction and elimination of the worst??? Or is it just me?). I’m actually pretty happy with RAC in terms of manageability when the game is worth the effort (there are IMHO *rare* cases when it is!), and I’m afraid of more of these aggressive & successful crusades by Oracle Sales, e.g. popularizing Active-Active GoldenGate/Streams deployments For.Every.Application.Out.There.Without.Application.Changes (FEOTWAC? ;))… because your Management wonders why you are not able to utilize the 2nd data center…. http://bit.ly/SYvorH <- it already started, do you dare to try to migrate SAP to Active-Active over 200km? 😉

        -J.

  14. Ken says:

    RAC does not do what it says it does. It actually creates more points of failure. If you want high availability, buy 2 servers with their own disks. Go look at the Dell website: for $40k you can buy a 16-core server with 128 GB of RAM and an 18-SSD RAID 10 array (1.8 TB of storage)…. If you are concerned about uptime, buy a second server and put it up in stand-by mode… You are all set, and you don’t have to deal with the added cost of a SAN or a SAN administrator or the extra cost of a RAC-trained DBA. With RAC you need a SAN; the SAN is a point of failure and I have seen them fail.

    • Ken says:

      Also, if you are concerned about scalability, in 2 years $80k will buy you 2 new servers that are even more powerful with even faster disks… still cheaper than dealing with RAC and SANs… Just think: if they had RAC 25 years ago and people just kept adding nodes, you could have 286, 386, 486 and Pentium nodes in the same cluster… RAC is just a solution looking for a problem; it has been out for about 10 years now, and machines are significantly more powerful.

      • flashdba says:

        Ken

        OK, I agree with some of what you said, but I’m not sure about your dislike of SANs. Adding a SAN to the design adds complexity and it must therefore justify itself with features and benefits – but most SANs do exactly that in comparison to a direct-attached or local disk solution.

        • Ken says:

          I’ve been a DBA in 2 SAN environments and in both the SAN performance was horrible. In one environment the SAN and servers were in 2 different locations; a construction worker severed the fiber between the SAN and the servers, causing a major outage… In both environments the SAN administrators were an entirely different side of the house from the DBA team and an additional web of bureaucracy to deal with… We are not just talking about hardware and software here but also all of the people involved with it… In both cases the SAN created more problems than it solved… if at all possible, KISS (Keep It Simple Stupid) ;-0

        • Ken says:

          Also, in both SAN environments the SAN admins were responsible for DB recovery… The backup software backed up datafiles at block level… This created lots of extra overhead and a major performance hit because the backup software was constantly backing up temp tablespaces, undo tablespaces and online redo logs… Confused Information Officers don’t listen to DBAs or sysadmins; they listen to Oracle, Dell and EMC sales staff…

  15. Seref Arikan says:

    Greetings,
    Very interesting points, but as a non-DB guy I have failed to see the answer to a critical question in the post: how does DB virtualization handle distributed transactions?
    Under the scalability heading you mention the abstraction of multiple nodes as one, but the key thing RAC does is allow transactional access across multiple nodes, with formidable scalability.
    So far I have not been able to find an alternative that delivers this, either among other DB options or for Oracle with a different architecture.
    How would you handle transactional operations in the virtualization scenario? Very informative discussions in the comments btw, but I’m still looking for the answer to my specific question 🙂

    All the best
    Seref

    • flashdba says:

      I’m not entirely sure I understand the question. The phrase “distributed transactions” means, at least in the Oracle world, transactions that take place between different databases. But in discussing RAC you appear to be asking about transactions which access multiple instances of the same database, i.e. different nodes of the same cluster.

      From a transaction point of view there is absolutely no difference between one large single instance or multiple smaller RAC instances. Actually, beneath the covers, there is a big difference as Cache Fusion and various other technologies introduce additional points of contention – but from the application’s perspective this is invisible.

      So let me ask you this. If you can have one large virtualized instance instead of numerous smaller physical RAC instances, where is the benefit of RAC?

      • Seref Arikan says:

        Thanks for the response. Let me try to clarify my question a bit. Maybe I have not used the terminology correctly; I should not have said distributed transactions. What I meant was a single transaction from an application point of view – an application that may or may not be running on a set of servers, that is, some middleware probably using distributed transactions behind the scenes.

        A RAC cluster (hope that is the right word) provides a server interface that lets you add nodes to improve performance. However, regardless of what the underlying setup is (shared nothing, shared everything, shared cache etc), the RAC cluster looks like one server to applications.

        So if I start a transaction, commit/update/delete some data, Oracle guarantees that I can do this on a cluster with N nodes, and the cluster will handle the data replication/sharding etc.
        I don’t have to think about distributed transactions just because I have N nodes in the cluster. To my app, it is all the same (a DB server endpoint with a network address and a port)

        I have been looking into alternatives for RAC, but most solutions offer eventual consistency or similar relaxed options in exchange for scaling out. I am not interested in scaling to hundreds or thousands of servers, but if I could scale out to 10 servers, say using postgresql, and I could use that cluster as a single db server, with immediate data consistency and increased performance (compared to 1 server), that would be my holy grail

        I am curious about the virtualization approach: how does it create a single large virtualized instance? Any pointers you’d recommend for introduction?

        Again, thanks for your response, this is the first ever piece of discussion I’ve found that offers virtualization as an alternative to RAC. Exciting!

        Regards
        Seref

        • flashdba says:

          Hi – don’t worry about the terminology, I see what you mean now. So you are looking for a scalability solution, since in your own words “if I could scale out to 10 servers … and I could use that cluster as a single db server … that would be my holy grail”

          RAC and its predecessor OPS were invented as a scalability solution because, at the time (we’re talking back in the last century here), most people ran Oracle on big RISC boxes running flavours of UNIX. You couldn’t browse your server vendor’s website and order a 10-socket 80-core x86-based NUMA box like you can today, so scaling was problematic and expensive. Now those large x86 servers are commodity items.

          The alternative to your N-node RAC solution is simple: a single node which is N times more powerful. If you install Oracle on a large multi-core server you will see a similar level of scaling to installing Oracle on multiple smaller nodes and putting them in a cluster. But crucially, you won’t have the overhead of interconnect traffic and the various technologies Oracle uses to maintain the transactional consistency you mention. Also, very importantly, you won’t have the massive cost overhead of RAC licenses.

          Another thing you mentioned was that “A RAC cluster … provides a server interface that lets you add nodes to improve performance”. Admittedly Oracle running on a single large server does not. That’s where virtualization comes in: by virtualizing you have the ability to change (on the fly) the resources available to the database guest. You also have new opportunities for migration.

          If designed correctly, virtualized and consolidated Oracle systems can provide *more* operational advantages than RAC at a *lower* cost. Sometimes considerably lower.

          • Seref Arikan says:

            Ah, I think I see what you mean now. You’re suggesting that virtualization allows easy scaling up, eliminating the complexity of scaling out. That is a very interesting way of thinking, one that I have not considered before.

            In a way, it is clustering hardware resources, so moving the idea of cluster from the db layer to hardware layer through virtualization, if I’m getting this right.

            Very nice food for thought. Thanks a lot for the patience and responses!

            Regards
            Seref

            • flashdba says:

              Yes exactly. “Virtualization allows easy scaling up, eliminating the complexity of scaling out ” <– You just explained it better than me. If I'd thought of using those words earlier it would have been a lot clearer 🙂

  16. VirtualT says:

    “The simple fact is that most users have to restart their middle tier processes when a node fails and as a result all of the users of that node are kicked off.”

    I am not an Oracle DBA but a Virtualization Architect with many years of experience with clustering, and the above statement struck a chord and I needed to get more information. I actually asked one of our Oracle DBAs whether or not we have issues with our middle tier on node failures/evictions, whether from a storage issue or hardware failure of one of the nodes, and the answer I got was that we do not experience that outage.

    Am I wrong in thinking that there may be some middle-tier apps that in fact fail with any small interruption in service? Are app developers getting smarter and allowing for some level of interruption while the DB recovers on another node?

    I am in the classic battle of whether to ORAC or not to ORAC with the DBMS team, and because of standardization and the foothold ORAC has within our environment it is very difficult to move the DBAs off what they are comfortable with. We are currently virtualizing ORAC on VMware and we have been successful, but at the price of complexity and cost. We are also putting a lot of pressure on the Storage Team to meet I/O requirements. My challenge is finding a better solution for Oracle HA for app-resiliency needs and doing it in a manner that can sway hardened Oracle DBAs who are afraid of doing anything but what Oracle tells them to do.

    We utilize Data Guard for cross-DC use cases, such as DR. Data Guard also has its associated costs. I have looked at Symantec’s solution as well, and they again have their own complexity that comes along with their solution, albeit at a lower cost – but where else is there really to go?

    • flashdba says:

      To enable failover on the middle tier of a classic three-tier Oracle-based application you need to use two technologies: Transparent Application Failover (TAF) and Fast Application Notification (FAN):

      TAF: http://docs.oracle.com/cd/E11882_01/java.112/e16548/ocitaf.htm
      FAN: http://docs.oracle.com/cd/E11882_01/java.112/e16548/apxracfan.htm

      At least, that’s Oracle’s recommended way of doing it. Now, I can only speak about the experiences I’ve had and the customers I’ve met, which may or may not be a representative sample. Given my employment history I’m inclined to believe that it is representative, at least of organisations in the UK and mainland Europe – but I could be kidding myself.

      I’ve never seen an organisation implement an application with working, automated, fault-free failover that allows customers to experience only a minor pause before they are magically migrated from one failed node of a RAC cluster to another working node. In almost every case I’ve seen, the loss of a RAC node requires a bounce of the application server on the middle tier.

      What normally happens is that the app server has a connection pool of JDBC connections to a service on the database, possibly load balanced across multiple nodes. When an instance fails, some of these connections will become unusable. With TAF, it’s possible to have “shadow” processes running on another node so that the connection switches over to use these instead. I’ve seen this configured on many occasions – sometimes even successfully – but still when a node is evicted from the cluster for some reason the middle-tier has to be bounced to restore full service.
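
      For what it’s worth, the FAN side of this is wired up in the connection pool itself. Here is a minimal sketch using Oracle’s Universal Connection Pool – hostnames, ports, service and account names are illustrative, and it assumes ucp.jar and ons.jar are on the classpath:

        import oracle.ucp.jdbc.PoolDataSource;
        import oracle.ucp.jdbc.PoolDataSourceFactory;

        public class FcfPoolSketch {
            public static void main(String[] args) throws java.sql.SQLException {
                PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
                pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
                pds.setURL("jdbc:oracle:thin:@//rac-scan.example.com:1521/sales_svc");
                pds.setUser("app_user");
                pds.setPassword("app_password");
                // Subscribe to FAN events so dead connections are purged when a node fails
                pds.setFastConnectionFailoverEnabled(true);
                pds.setONSConfiguration("nodes=rac1.example.com:6200,rac2.example.com:6200");
                try (java.sql.Connection conn = pds.getConnection()) {
                    System.out.println("Connected via an FCF-enabled pool");
                }
            }
        }

      Even with this in place, in-flight transactions on a failed node are still rolled back – the pool just stops handing out dead connections, which is what spares you the middle-tier bounce.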

      I’m sure there must be people who have got this working, it’s just that I believe most people take the hit of a “partial outage” instead.

      Your Oracle DBAs may well be fearful of virtualizing Oracle on VMware. In my opinion, Oracle has done a good job of scaring people away from this option for years… but now we have Oracle supporting and certifying their products on Hyper-V as well as Oracle VM, so why shy away from VMware? It’s the market-leading virtualisation platform after all. DBAs need to embrace this or become extinct.

      • Jon says:

        “I’ve never seen an organisation implement an application with working, automated, fault-free failover that allows customers to experience only a minor pause before they are magically migrated from one failed node of a RAC cluster to another working node. In almost every case I’ve seen, the loss of a RAC node requires a bounce of the application server on the middle tier.”

        Well I have. A learning CMS system using Oracle RAC 11g configured with TAF on the app servers, which use the JDBC/OCI driver. Multiple application servers in separate data centres and multiple RAC nodes in each datacenter. The datacenters are a few km apart. We’ve lost one of the data centres due to power failures and network issues on quite a few occasions and ran successfully off the other datacenter without anyone (i.e. the end-users) noticing, as the service continued to run without downtime or outage – which, as this is a 24/7/365 service, is required. TAF and RAC worked as advertised and have provided a much higher level of uptime/availability. Add in the ability to do Oracle rolling patches and downtime is reduced even more.

        • flashdba says:

          I published this post in September 2012 – over three and a half years ago. I knew one day somebody would find it and tell me that they had achieved the impossible, even though many said I was crazy. I never gave up hope that one day, against all the odds, I would meet the miracle worker who made that jumble of technology work as described in the manuals. And now… the day has come.

          Jon, I can say only this. I salute you.

  17. Jeroen says:

    Oracle is Unveiling the Latest Engineered System for Enterprise Virtualization

    Live Webcast: Virtualization and Cloud Made Simple and Easy with Oracle’s Latest Engineered System
    https://event.on24.com/eventRegistration/EventLobbyServlet?target=registration.jsp&eventid=653148&sessionid=1&key=D1651F9F945A682AE602335A5FA79C6A

    • flashdba says:

      Yes I saw this. Interesting – I have been expecting something like this to come along for a long time now… In fact for so long that I had decided maybe it wasn’t coming after all…

      But we saw a hint that it was on its way some time back:

      Exadata Roadmap Preview

  18. Pingback: Six alternatives to replace licensed Oracle database options | Dirty Cache
