The Great Hypervisor Bake-off: VMware ESX vs Oracle VM

This is a very simple post to show the results of some recent testing that Tom and I ran using Oracle SLOB on Violin to determine the impact of using virtualization. But before we get to that, I am duty bound to write a paragraph of text featuring lots of long sentences peppered with industry buzz words. Forgive me, it’s just the way I’m wired.

It is increasingly common these days to find database environments running in virtual machines – even large, business critical ones. The driver is the trend to commoditize I.T. services and build consolidated, private-cloud style solutions in order to control operational expense and increase agility (not to mention reduce exposure to Oracle licenses). But, as I’ve said in previous posts, the catalyst has been the unblocking of I/O as legacy disk systems are replaced by flash memory. In the past, virtual environments caused a kind of I/O blender effect whereby I/O calls become increasingly randomized – and this sucked for the performance of disk drives. Flash memory arrays on the other hand can deliver random I/O all day long because… well, if you don’t know the reasons by now can I just recommend starting at the beginning. The outcome is that many large and medium-sized organisations are now building database-as-a-service platforms with Oracle databases (other database products are available) running in virtual machines. It’s happening right now.

Phew. Anyway, that last paragraph was just a wordy way of telling you that I’m often seeing Oracle running in virtual machines on top of hypervisors. But how much of a performance impact do those hypervisors have? Step this way to find out.

The Contenders

When it comes to running Oracle on a hypervisor using Intel x86 hardware (for that is what I have available), I only know of three real contenders:

Hyper-V has been an option for a couple of years now, but I’ll be honest – I have neither the time nor the inclination to test it today. It’s not that I don’t rate it as a product, it’s just that I’ve never used it before and don’t have enough time to learn something new right now. Maybe someday I’ll come back and add it to the mix.

In the meantime, it’s the big showdown: VMware versus Oracle VM. Not that Oracle VM is really in the same league as VMware in terms of market share… but you know, I’m trying to make this sound exciting.

The Test

This is going to be an Oracle SLOB sustained throughput test. In other words, I’m going to build an Oracle database and then shovel a massive amount of I/O through it (you can read all about SLOB here and here). SLOB will be configured to run with 25% of statements being UPDATEs (the remainder are SELECTs) and will run for 8 hours straight. What we want to see is a) which hypervisor configuration allows the greatest I/O bandwidth, and b) which hypervisor configuration exhibits the most predictable performance.

This is the configuration. First the hardware:

Violin Memory 6616 flash Memory Array

  • 1x Dell PowerEdge R720 server
  • 2x Intel Xeon CPU E5-2690 v2 10-core @ 3.00GHz [so that’s 2 sockets, 20 cores, 40 threads for this server]
  • 128GB DRAM
  • 1x Violin Memory 6616 (SLC) flash memory array [the one that did this]
  • 8Gb Fibre Channel

And the software:

  • Hypervisor: VMware ESXi 5.5.1
  • Hypervisor: Oracle VM for x86 3.3.1
  • VM: Oracle Linux 6 Update 5 (with the Unbreakable Enterprise v3 Kernel 3.8.13)
  • Oracle Grid Infrastructure 11.2.0.4 (for Automatic Storage Management)
  • Oracle Database Enterprise Edition 11.2.0.4

Each VM is configured with 20 vCPUs and is using Linux Device Mapper Multipath and Oracle ASMLib. ASM is configured to use a single +DATA diskgroup comprising 8 ASM disks (LUNs from Violin) with external redundancy. The database parameters and SLOB settings are all listed on the SLOB sustained throughput test page.

Results: Bare Metal (Baseline)

First let’s see what happens when we don’t use a hypervisor at all and just run OL6.5 on bare metal:

Oracle SLOB- 8 Hour Sustained Throughput Test with no hypervisor (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         232,431.0       194,452.3        37,978.7
         Database Requests:         228,909.4       194,447.9        34,461.5
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           3,515.1             0.3         3,514.8
                Total (MB):           1,839.6         1,519.2           320.4

Ok so we’re looking at 1519 MB/sec of read throughput and 320 MB/sec of write throughput. Crucially, the lines are nice and consistent – with very little deviation from the mean. By dividing the amount of time spent waiting on db file sequential read (i.e. random physical reads) by the number of waits, we can calculate that the average latency for random reads was 438 microseconds.
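For anyone who wants to repeat that latency arithmetic on their own reports, here is a minimal Python sketch. The function name and the sample inputs are made up purely for illustration (they are not values from my report); the only thing taken from the test is the method, which is total time waited on the event divided by the number of waits.

```python
def average_wait_us(time_waited_seconds: float, total_waits: int) -> float:
    """Average wait in microseconds: total time waited / number of waits."""
    if total_waits <= 0:
        raise ValueError("total_waits must be positive")
    return time_waited_seconds / total_waits * 1_000_000


# Hypothetical inputs for illustration only, not figures from the report:
# 2,000,000 waits that together took 876 seconds.
print(f"{average_wait_us(876.0, 2_000_000):.0f} us")
```

Feed it the time waited and wait count for db file sequential read from your own report and you get a number directly comparable with the ones quoted in this post.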

Now we know what to expect, let’s look at the result from the hypervisor tests.

Results: VMware vSphere

VMware is configured to use Raw Device Mapping (RDM) which essentially gives the benefits of raw devices… read here for more details on that. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with VMware ESXi 5.5.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         173,141.7       145,066.8        28,075.0
         Database Requests:         170,615.3       145,064.0        25,551.4
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,522.8             0.1         2,522.7
                Total (MB):           1,370.0         1,133.4           236.7

Average read throughput for this test was 1133 MB/sec and write throughput averaged at 237 MB/sec. Average read latency was 596 microseconds. That’s an increase of 36%.

In comparison to the bare metal test, we see that total bandwidth dropped by around 25%. That might seem like a lot but remember, we are absolutely hammering this system. A real database is unlikely to ever create this level of sustained I/O. In my role at Violin I’ve been privileged to work on some of the busiest databases in Europe – nothing is ever this crazy (although a few do come close).

Results: Oracle VM

Oracle VM is based on the Xen hypervisor and therefore uses Xen virtual disks to present block devices. For this test I downloaded the Oracle Linux 6 Update 5 template from Oracle’s eDelivery site. You can see more about the way this VM was configured here. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with Oracle VM 3.3.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         160,563.8       134,592.9        25,970.9
         Database Requests:         158,538.1       134,587.3        23,950.8
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,017.2             0.2         2,016.9
                Total (MB):           1,273.4         1,051.6           221.9

This time we see average read bandwidth of 1052MB/sec and average write bandwidth of 222MB/sec, with the average read latency at 607 microseconds, which is 39% higher than the baseline test.

Meanwhile, total bandwidth dropped by 31%. That’s slightly worse than VMware, but what’s really interesting is the deviation. Look at how ragged the lines are on the OVM test! There is a much higher degree of variance exhibited here than on the VMware test.
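If you want to check my percentages, they can be reproduced from the numbers already shown. The sketch below is just my arithmetic written out in Python: the bandwidth figures are the Total (MB) Read+Write values from the three IO Profile tables, the latencies are the averages quoted in the text, and the dictionary layout and function name are arbitrary choices.

```python
# Figures copied from the three IO Profile tables and the quoted latencies.
results = {
    "Bare metal": {"total_mb_s": 1839.6, "read_latency_us": 438},
    "VMware ESXi 5.5.1": {"total_mb_s": 1370.0, "read_latency_us": 596},
    "Oracle VM 3.3.1": {"total_mb_s": 1273.4, "read_latency_us": 607},
}


def pct_change(new: float, baseline: float) -> float:
    """Signed percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


baseline = results["Bare metal"]
for name, r in results.items():
    if r is baseline:
        continue  # nothing to compare the baseline against
    bw = pct_change(r["total_mb_s"], baseline["total_mb_s"])
    lat = pct_change(r["read_latency_us"], baseline["read_latency_us"])
    print(f"{name}: bandwidth {bw:+.1f}%, read latency {lat:+.1f}%")
```

To one decimal place this gives -25.5% bandwidth and +36.1% latency for VMware, and -30.8% bandwidth and +38.6% latency for Oracle VM, which is where the rounded figures in the text come from.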

Conclusion

This is only one test so I’m not claiming it’s conclusive. VMware does appear to deliver slightly better performance than OVM in my tests, but it’s not a huge difference. However, I am very much concerned by the variance of the OVM test in comparison to VMware. Look, for example, at the wait event histograms for db file sequential read:

Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was <.05%; value of null is truly 0
-> % of Waits: column heading of <=1s is truly <1024ms, >1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                             % of Waits
                                          -----------------------------------------------
                                    Total
Hypervisor  Event                   Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
----------- ----------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
Bare Metal: db file sequential read 5557M  98.7   1.3    .0    .0    .0    .0
VMware ESX: db file sequential read 4164M  92.2   6.7   1.1    .0    .0    .0
Oracle VM : db file sequential read 3834M  95.6   4.1    .1    .1    .0    .0    .0    .0

The OVM tests show occasional results in the two highest buckets, meaning once or twice there were waits in excess of 1 second! However, to be fair, OVM also had more sub-millisecond waits than VMware.
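As a rough sanity check on the Total Waits column, you can multiply the read rate from each IO Profile by the length of the run. The Python sketch below does that, under two assumptions of mine: that the Total Waits figures are expressed in millions, and that almost all physical reads in this workload show up as db file sequential read. Treat it as a ballpark cross-check, not an exact reconciliation.

```python
RUN_SECONDS = 8 * 60 * 60  # the 8-hour test window

# (reads per second from the IO Profile, Total Waits from the histogram).
# Assumption: the Total Waits column is expressed in millions.
tests = {
    "Bare Metal": (194_452.3, 5557),
    "VMware ESX": (145_066.8, 4164),
    "Oracle VM": (134_592.9, 3834),
}


def estimated_waits_millions(reads_per_second: float) -> float:
    """Estimated read waits over the whole run, in millions."""
    return reads_per_second * RUN_SECONDS / 1_000_000


for name, (rate, reported) in tests.items():
    estimate = estimated_waits_millions(rate)
    gap = (estimate - reported) / reported * 100.0
    print(f"{name}: estimated {estimate:,.0f}M vs reported {reported}M ({gap:+.1f}%)")
```

All three estimates land within about 1% of the histogram totals, so the wait counts and the throughput figures tell the same story.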

Anyway, for now – and for this setup at least – I’m sticking with VMware. You should of course test your own workloads before choosing which hypervisor works for you…

Thanks as always to Kevin for bringing Oracle SLOB to the community.

18 Responses to The Great Hypervisor Bake-off: VMware ESX vs Oracle VM

  1. Stuart Archer says:

    Which is all well and good, until the license man comes a knocking. Oracle are *still* well behind the curve in terms of support (in both the license term, and actual support – I’m told by them that the DB is _not_ supported if the underlying “hardware” is ESX, regardless of OS). Of course, this is pretty much in their own benefit since the *only* supported and financially viable VM option for Oracle at this time is OVM. But it’s good to see some real numbers as to the %loss of pushing on a VM.

    • flashdba says:

      The support issue is a right old can of worms isn’t it? VMware says this:

      http://www.vmware.com/files/pdf/techpaper/vmw-understanding-oracle-certification-supportlicensing-environments.pdf

      While Oracle has My Oracle Support note 249212.1, which is conveniently reprinted in the appendix of the VMware document.

      I think technically (or should that be “legally”?) it is incorrect to say that Oracle “is not supported” on VMware ESX. However, I know of numerous customers who are either confused or frightened by the support issue – and they are frequently the smaller customers (because, after all, big customers have more weight to throw around). Such is the life of the Oracle customer.

    • Andy Mac. says:

      If you were told you were not supported, then unfortunately you were given a bum steer. You are absolutely supported, but the combination hasn’t been certified by Oracle. This is no different to the way Oracle treat other 3rd party products. The simple reality is it’s not possible for Oracle (or any other vendor) to certify every 3rd party product whether it be a virtualisation layer, a load balancer or a network switch. At the end of the day, a process needs to be in place that provides a boundary of ownership and responsibility in triaging issues that arise. Should Oracle own and resolve an issue that never manifests itself on x86 physical commodity h/w platforms or OVM or on VMWare but (say) is only reported by customers running on a specific version of MS Hyper-V?! Many many Oracle customers run Oracle on VMWare and get a level of support indistinguishable from customers on certified platforms.

      • robinsc says:

        The issue is that the only objective way to assure that the problem is not caused by the virtualization is to reproduce the issue without virtualization, and Oracle can and may ask you to do exactly that. If it is a complex system with high availability etc, the cost of having to reproduce the issue on a bare metal system is prohibitive, let alone the need to keep sufficient bare metal systems available to do so and the licensing issues related to the same. Hence, even though it might not be officially unsupported, practically you may not be able to get support unless you have leverage over Oracle (which few companies have).

  2. Looks interesting but is only testing the I/O a valid test? I know you explained the limits of the test but it would be interesting to push the network and memory as well.

    • flashdba says:

      I agree Dom, there is plenty of room to expand the test. But as you know I have a special interest in the I/O, considering I work for a storage vendor. More than anything, I want to know how I/O from the database is impacted by virtualization.

      At the end of the day, every system is different – which is why I ask readers to test their own systems rather than rely on my results.

  3. Andy Mac. says:

    Really interesting test from a pure I/O perspective: I am really surprised by the variance observed from the perspective of ‘all other things being equal’.

    • Andy Mac. says:

      PS. Does this mean OVM was the one that came out with the ‘soggy bottom’ in your bake-off?

  4. Did the template you used have the xenblk paravirtualized drivers or the hardware virtualized drivers? I sometimes see Oracle templates which are not configured to use paravirtualization. It would be nice if you could repeat the test with a hardware virtualized OVM image vs a paravirtualized OVM image to see whether it makes any difference.

    • flashdba says:

      The template used xenblk paravirtualized drivers. There are many things I could retest if I had the time and inclination, but alas I have other priorities.

  5. Andy Colvin says:

    For the OVM test, did you use virtual disks created in the storage repository or physical LUNs mapped directly to the VM? I believe that OVM will also support raw disks in the same way that you configured the VMware system. That way, you can get away from virtual disk devices running on top of an OCFS2 filesystem. It would be interesting to see if there’s an uplift from physical devices on the OVM system.

    • flashdba says:

      Hi Andy! The OVM test was using physical LUNs mapped directly to the VM, which were then presented as xenblk paravirtualized devices. In my opinion that’s the closest configuration to the VMware RDM configuration I used in the other test.

  6. If you dedicate an entire box to the db, and hence licensing the entire box for Oracle (independent of the VM) then VMware sounds a reasonable choice.

    But as boxes get more and more powerful, as the core counts keep growing and growing, (I’m hypothesizing that) many customers need to carve up their boxes into “core-subsets” to manage the license cost…

    It’s been my experience that license cost generally trumps performance in most places 🙂 which makes Oracle VM the logical (aka only :-)) choice.

    • flashdba says:

      Hey Connor. You make a valid point, as always. But I think that only makes sense if you are planning on hosting non-Oracle workloads on the same virtualization platform as Oracle. You certainly don’t want to license all of the physical cores only to use half of them running something like Exchange, MSSQL or Sharepoint.

      Oracle VM is the only hypervisor with which Oracle allows hard partitioning for licenses (how convenient) but most of the customers I speak to use VMware and simply segregate Oracle workloads onto their own platform. Thus you size the platform according to your needs and license the whole thing. You probably pay the same amount of license fees you would have done in a non-virtualized environment but you get the benefits of increased agility that come with virtualization.

      Also, as I alluded to earlier, most customers have many other workloads to virtualize. Chances are they already do this with VMware. It’s a rare customer that wants to adopt two different hypervisor technologies in their estate – and to this day I’ve never met anyone that was prepared to run ALL of their virtualized workloads on Oracle VM. In fact, I’ve never seen anything other than Oracle products running on Oracle VM. Have you?

  7. re: “It’s a rare customer that wants to adopt two different hypervisor technologies”

    That’s interesting, aka, not my experience 🙂

    Most clients I’ve seen use both, namely Oracle VM for “all things Oracle”, eg, they’ll carve up a box to be half-db (oracle), half-app tier (weblogic etc) using Oracle VM, and use VMWare for non-Oracle (file/print/mail/etc etc).

    In terms of “licensing the whole thing” I agree, but (sadly) I’ve lost track of the number of places I’ve seen where they license (say) 50% of the cores on a box for “project X”, then “project Y” will go buy *another* server, and they’ll license 50% of the cores on that…and so on … and so on. Even in the virtualised world, the paranoia of “I want my *own* server” lives on 😦

    • flashdba says:

      Maybe things are done differently in the northern and southern hemispheres? It’s sort of like the virtualised equivalent of water going down the plug hole in different directions 🙂

  8. Mike White says:

    Really good work. Goes a long way towards lifting the “fog”. These results are certainly not what the Oracle script says [OVM paravirtualisation = better I/O]. It would be very interesting to see a comparison of Hyper-V, now that you have a baseline. In Australia, we are seeing a number of OpenStack projects that are really large, based on KVM. As OpenStack gains traction, I expect KVM will pass the install base of Hyper-V and OVM.

  9. james says:

    good work! it will be interesting to see the test repeated on Citrix XenServer which is a different distribution of Xen Hypervisor. The latest release incorporates a lot of under-the-hood changes for improved I/O performance including latest Xen 4.4 !
