The Great Hypervisor Bake-off: VMware ESX vs Oracle VM
May 7, 2015
This is a very simple post to show the results of some recent testing that Tom and I ran using Oracle SLOB on Violin to determine the impact of using virtualization. But before we get to that, I am duty bound to write a paragraph of text featuring lots of long sentences peppered with industry buzz words. Forgive me, it’s just the way I’m wired.
It is increasingly common these days to find database environments running in virtual machines – even large, business critical ones. The driver is the trend to commoditize I.T. services and build consolidated, private-cloud style solutions in order to control operational expense and increase agility (not to mention reduce exposure to Oracle licenses). But, as I’ve said in previous posts, the catalyst has been the unblocking of I/O as legacy disk systems are replaced by flash memory. In the past, virtual environments caused a kind of I/O blender effect whereby I/O calls become increasingly randomized – and this sucked for the performance of disk drives. Flash memory arrays on the other hand can deliver random I/O all day long because… well, if you don’t know the reasons by now can I just recommend starting at the beginning. The outcome is that many large and medium-sized organisations are now building database-as-a-service platforms with Oracle databases (other database products are available) running in virtual machines. It’s happening right now.
Phew. Anyway, that last paragraph was just a wordy way of telling you that I’m often seeing Oracle running in virtual machines on top of hypervisors. But how much of a performance impact do those hypervisors have? Step this way to find out.
Hyper-V has been an option for a couple of years now, but I’ll be honest – I have neither the time nor the inclination to test it today. It’s not that I don’t rate it as a product, it’s just that I’ve never used it before and don’t have enough time to learn something new right now. Maybe someday I’ll come back and add it to the mix.
In the meantime, it’s the big showdown: VMware versus Oracle VM. Not that Oracle VM is really in the same league as VMware in terms of market share… but you know, I’m trying to make this sound exciting.
This is going to be an Oracle SLOB sustained throughput test. In other words, I’m going to build an Oracle database and then shovel a massive amount of I/O through it (you can read all about SLOB here and here). SLOB will be configured to run with 25% of statements being UPDATEs (the remainder are SELECTs) and will run for 8 hours straight. What we want to see is a) which hypervisor configuration allows the greatest I/O bandwidth, and b) which hypervisor configuration exhibits the most predictable performance.
This is the configuration. First the hardware:
- 1x Dell PowerEdge R720 server
- 2x Intel Xeon CPU E5-2690 v2 10-core @ 3.00GHz [so that’s 2 sockets, 20 cores, 40 threads for this server]
- 128GB DRAM
- 1x Violin Memory 6616 (SLC) flash memory array [the one that did this]
- 8Gb Fibre Channel
And the software:
- Hypervisor: VMware ESXi 5.5.1
- Hypervisor: Oracle VM for x86 3.3.1
- VM: Oracle Linux 6 Update 5 (with the Unbreakable Enterprise v3 Kernel 3.6.18)
- Oracle Grid Infrastructure 126.96.36.199 (for Automatic Storage Management)
- Oracle Database Enterprise Edition 188.8.131.52
Each VM is configured with 20 vCPUs and is using Linux Device Mapper Multipath and Oracle ASMLib. ASM is configured to use a single +DATA diskgroup comprising 8 ASM disks (LUNs from Violin) with external redundancy. The database parameters and SLOB settings are all listed on the SLOB sustained throughput test page.
Results: Bare Metal (Baseline)
First let’s see what happens when we don’t use a hypervisor at all and just run OL6.5 on bare metal:
IO Profile           Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~           ----------------- --------------- ---------------
Total Requests:              232,431.0       194,452.3        37,978.7
Database Requests:           228,909.4       194,447.9        34,461.5
Optimized Requests:                0.0             0.0             0.0
Redo Requests:                 3,515.1             0.3         3,514.8
Total (MB):                    1,839.6         1,519.2           320.4
Ok so we’re looking at 1519 MB/sec of read throughput and 320 MB/sec of write throughput. Crucially, the lines are nice and consistent – with very little deviation from the mean. By dividing the amount of time spent waiting on db file sequential read (i.e. random physical reads) by the number of waits, we can calculate that the average latency for random reads was 438 microseconds.
Now that we know what to expect, let’s look at the results from the hypervisor tests.
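For anyone who wants to repeat that latency sum on their own AWR report, here is a minimal sketch in Python. The function name and the example inputs are my own illustration: the raw wait time and wait count behind the 438 microsecond figure are not reproduced in this post, so the numbers below are hypothetical placeholders, not the values from this test.

```python
def average_latency_us(time_waited_s: float, total_waits: int) -> float:
    """Average latency in microseconds: total time waited divided by number of waits."""
    if total_waits <= 0:
        raise ValueError("total_waits must be a positive integer")
    return time_waited_s / total_waits * 1_000_000


# Hypothetical inputs for illustration only (NOT taken from the AWR report above):
# 500 seconds of cumulative wait time spread over one million waits.
example_time_waited_s = 500.0
example_total_waits = 1_000_000

latency = average_latency_us(example_time_waited_s, example_total_waits)
print(f"average db file sequential read latency: {latency:.0f} microseconds")
```

Feed it the total wait time (in seconds) and total waits for db file sequential read from the foreground wait events section of your own report.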
Results: VMware vSphere
IO Profile           Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~           ----------------- --------------- ---------------
Total Requests:              173,141.7       145,066.8        28,075.0
Database Requests:           170,615.3       145,064.0        25,551.4
Optimized Requests:                0.0             0.0             0.0
Redo Requests:                 2,522.8             0.1         2,522.7
Total (MB):                    1,370.0         1,133.4           236.7
Average read throughput for this test was 1133 MB/sec and write throughput averaged 237 MB/sec. Average read latency was 596 microseconds. That’s an increase of 36% over the bare metal baseline.
In comparison to the bare metal test, we see that total bandwidth dropped by around 25%. That might seem like a lot but remember, we are absolutely hammering this system. A real database is unlikely to ever create this level of sustained I/O. In my role at Violin I’ve been privileged to work on some of the busiest databases in Europe – nothing is ever this crazy (although a few do come close).
Results: Oracle VM
Oracle VM is based on the Xen hypervisor and therefore uses Xen virtual disks to present block devices. For this test I downloaded the Oracle Linux 6 Update 5 template from Oracle’s eDelivery site. You can see more about the way this VM was configured here. Here are the test results:
IO Profile           Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~           ----------------- --------------- ---------------
Total Requests:              160,563.8       134,592.9        25,970.9
Database Requests:           158,538.1       134,587.3        23,950.8
Optimized Requests:                0.0             0.0             0.0
Redo Requests:                 2,017.2             0.2         2,016.9
Total (MB):                    1,273.4         1,051.6           221.9
This time we see average read bandwidth of 1052MB/sec and average write bandwidth of 222MB/sec, with the average read latency at 607 microseconds, which is 39% higher than the baseline test.
Meanwhile, total bandwidth dropped by 31%. That’s slightly worse than VMware, but what’s really interesting is the deviation. Look at how ragged the lines are on the OVM test! There is a much higher degree of variance exhibited here than on the VMware test.
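If you want to check my arithmetic, the percentages quoted for both hypervisors can be reproduced from the Total (MB) figures and average read latencies given in this post. This is just a sketch: the dictionary layout and function names are my own, and only the numbers come from the test results.

```python
# Total (MB) per second and average random read latency, as quoted in this post.
RESULTS = {
    "Bare Metal": {"total_mb_s": 1839.6, "read_latency_us": 438},
    "VMware ESXi": {"total_mb_s": 1370.0, "read_latency_us": 596},
    "Oracle VM": {"total_mb_s": 1273.4, "read_latency_us": 607},
}


def percent_change(baseline: float, value: float) -> float:
    """Signed percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def compare_to_baseline(results: dict, baseline_name: str = "Bare Metal") -> dict:
    """Bandwidth and latency change for every non-baseline configuration."""
    base = results[baseline_name]
    comparison = {}
    for name, row in results.items():
        if name == baseline_name:
            continue
        comparison[name] = {
            "bandwidth_change_pct": percent_change(base["total_mb_s"], row["total_mb_s"]),
            "latency_change_pct": percent_change(base["read_latency_us"], row["read_latency_us"]),
        }
    return comparison


for name, delta in compare_to_baseline(RESULTS).items():
    print(
        f"{name:<12} bandwidth {delta['bandwidth_change_pct']:+.1f}%  "
        f"read latency {delta['latency_change_pct']:+.1f}%"
    )
```

A negative bandwidth change is a drop in throughput, and a positive latency change means slower reads.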
This is only one test so I’m not claiming it’s conclusive. VMware does appear to deliver slightly better performance than OVM in my tests, but it’s not a huge difference. However, I am very much concerned by the variance of the OVM test in comparison to VMware. Look, for example, at the wait event histograms for db file sequential read:
Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was <.05%; value of null is truly 0
-> % of Waits: column heading of <=1s is truly <1024ms, >1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                            % of Waits
                                          -----------------------------------------------
                                    Total
Hypervisor  Event                   Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
----------- ----------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
Bare Metal: db file sequential read 5557.  98.7   1.3    .0    .0    .0    .0
VMware ESX: db file sequential read 4164.  92.2   6.7   1.1    .0    .0    .0
Oracle VM : db file sequential read 3834.  95.6   4.1    .1    .1    .0    .0    .0    .0
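The OVM test shows a small number of waits in the two highest buckets (each under 0.05% of the total), meaning that occasionally a read took longer than 32 milliseconds and, once or twice, longer than 1 second! However, to be fair, OVM also had a higher proportion of sub-millisecond waits than VMware (95.6% versus 92.2%).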
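To put numbers on that comparison, here is a rough sketch that summarises the histogram rows. The percentages are copied from the histogram in this post; the data structure and function names are my own. One assumption to flag: following the AWR legend, I treat a blank cell as truly zero (None) and a ".0" cell as non-zero but below 0.05% (0.0). The percentages are rounded in the report, so the sums are approximate.

```python
from typing import Dict, List, Optional

BUCKETS = ["<1ms", "<2ms", "<4ms", "<8ms", "<16ms", "<32ms", "<=1s", ">1s"]

# None = blank cell (truly zero waits); 0.0 = ".0" (some waits, but under 0.05%).
HISTOGRAM: Dict[str, List[Optional[float]]] = {
    "Bare Metal": [98.7, 1.3, 0.0, 0.0, 0.0, 0.0, None, None],
    "VMware ESX": [92.2, 6.7, 1.1, 0.0, 0.0, 0.0, None, None],
    "Oracle VM": [95.6, 4.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
}


def pct_at_or_above_1ms(row: List[Optional[float]]) -> float:
    """Approximate share of waits that took 1 ms or longer (all buckets after the first)."""
    return sum(value for value in row[1:] if value is not None)


def slowest_bucket_seen(row: List[Optional[float]]) -> str:
    """Label of the highest bucket in which at least one wait was recorded."""
    for label, value in zip(reversed(BUCKETS), reversed(row)):
        if value is not None:
            return label
    raise ValueError("row has no populated buckets")


for name, row in HISTOGRAM.items():
    print(
        f"{name:<11} waits >= 1ms: {pct_at_or_above_1ms(row):4.1f}%   "
        f"slowest bucket seen: {slowest_bucket_seen(row)}"
    )
```

The two measures pull in opposite directions: VMware has the larger share of waits at or above 1 ms, while only OVM registers anything beyond the <32ms bucket.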
Anyway, for now – and for this setup at least – I’m sticking with VMware. You should of course test your own workloads before choosing which hypervisor works for you…
Thanks as always to Kevin for bringing Oracle SLOB to the community.