Oracle ASM and Thin Provisioning – How To Reclaim Space

It came to my attention last November that I had crossed the one year anniversary since my last post on flashdba.com. I was so surprised that I immediately decided to write a new post, which took another three months. There are reasons why I’m no longer posting technical blogs about databases and flash, but I’ll cover them in a later post. No, not that late – I hope.

In the meantime, I thought I’d write a note on this subject because I’ve lost count of the number of times I’ve been asked questions on the topic of Oracle ASM and Thin Provisioning. Normally, I’m asked by customers or prospects who think there is an issue with their storage system… whereas, in fact, the problem is entirely storage-agnostic.

But first, some background.

Thin Provisioning

Thin Provisioning (TP) is used to describe the overcommitment of storage capacity. Your host may think it’s been allocated 10TB of capacity and is currently using 2TB, but the storage platform has only really allocated the 2TB used and the remaining 8TB may not even exist. Why would you want this? Because in a multi-host environment (where hosts could be virtual or physical), the amount of allocated-but-unused capacity could be significant. Without TP, serious amounts of capacity would need to be provisioned which may never be used, but with TP all the hosts can be “fooled” into thinking they have been allocated what they want while the actual utilised capacity is only the sum of what each host has used.

Where things can get a bit complicated with TP is that many layers in your stack may be thin provisioning storage to the layers above them. Most storage arrays are capable of TP (or indeed mandate its use), but hypervisors often have thin provisioning options too. Meanwhile, some applications which create data store structures have options which can help or hinder the use of TP. For example, VMware has the ability to create virtual disks which are thin, thick (lazy zeroed) or thick (eager zeroed). As a result, it isn’t always obvious to the underlying storage whether a particular set of allocated blocks are really in use or not. Won’t somebody think of the poor storage array?

Trim and Unmap

Consider the situation where a large file is created and then deleted in a filesystem on a typical operating system. Commonly, the deletion process doesn’t really delete anything other than the metadata telling the filesystem where the file resided. Thus the underlying file data remains until such time as something else comes along and overwrites it. This is beneficial because it is faster and requires less work than trying to overwrite the file with (for example) zeros. But if the filesystem resides on a storage array which uses TP, how will the storage array know that the space allocated to the file is now free? It can’t – unless the filesystem has a way of telling it.

For this purpose there exists a set of OS calls known as trim commands – and for the SCSI protocol (used by most block storage devices such as SANs) the command is known as UNMAP. Issuing one of these commands allows the calling layer (the filesystem, or perhaps a volume manager) to notify the storage platform that a specific set of blocks are no longer in use and can be “unmapped”, freeing space. As a side note, large calls to UNMAP can often have temporary but unexpected consequences on storage performance, as large amounts of metadata may need to be updated.

Oracle ASM: Unmap is for Wimps

Let’s get straight to the point here: Oracle’s Automatic Storage Manager doesn’t natively use UNMAP commands. Quelle surprise. But there are still ways to free up space back to thin provisioned arrays. Two in fact: let’s call them the bad way and the good way. First though, let’s set up the scenario:

Test Scenario

Consider the situation where an Oracle ASM diskgroup is created on a 10TB volume group presented from a thin provisioning All-Flash storage array. The DBA then creates a large “bigfile” tablespace in the diskgroup, with a 5TB datafile (the rest of the database resides elsewhere). Anyone who has sat waiting for the CREATE TABLESPACE command for any period of time will be aware that, during the datafile creation process, Oracle likes to fill the whole file with empty blocks. From Oracle’s perspective, this has the advantage of ensuring that the entire datafile capacity has been marked as used by the storage array. In other words, it’s not “fake” thin provisioned space which may or may not be available, but real available capacity which now belongs to Oracle. (You may also recall that Oracle no longer takes this approach with tempfiles, instead using the faster “sparse” allocation method.)

At this point, what will the volumes on the storage array will be showing? We know that 10TB has been allocated, of which 5TB has been used. So shouldn’t that leave 5TB free? Probably not, because almost every All-Flash storage array uses data reduction technologies such as compression, deduplication and zero-detect. Since each block in the tablespace contains a unique block number, deduplication isn’t going to add any value (which is why arrays like the Kaminario allow dedupe to be disabled on a per-volume basis), but compression is going to have great fun with all the emptiness inside each Oracle block so the storage array will probably show significantly less than 5TB used.

Next, our enterprising DBA watches a Connor McDonald video about DBMS_RANDOM and gets a little overexcited, then fills the entire tablespace with random data to the point that the storage array can hardly achieve any compression. The outcome? Allocated = 10TB, Used = 5TB, Free = 5TB.

Finally, after watching a video of Larry Ellison explaining that the Oracle Autonomous Database needs “no human intervention” and thus fearing for his job, the DBA deletes the tablespace and goes home. Back to 10TB free? No.
The tablespace deletion command does a number of things, including notifying Oracle ASM that the file’s allocation units are no longer in use and removing the datafile from the database’s controlfile. But at no point does anybody bother to tell the storage array that the used space is now free, so the array’s capacity statistics remain: Allocated = 10TB, Used = 5TB, Free = 5TB.

ASRU: The Bad Way

ASRU is Oracle’s ASM Reclamation Utility, a PERL script developed in conjunction with 3PAR (a storage array now owned by HPE) and designed to free up space from scenarios such as the one above. It is, in my personal opinion, a terrible botched solution which was created to serve a purpose which no longer exists – although, interestingly, many storage vendors still seem to recommend it by default (for example, Pure Storage still describe it as the only solution for reclaiming unused space with Oracle ASM).
ASRU doesn’t issue UNMAP commands. Instead, it takes advantage of the fact that most modern storage platforms (including 3PAR, Pure Storage and Kaminario) treat blocks full of zeros as free space (a feature known as zero detect). Thus what ASRU does – when manually run by a DBA (presumably during a change window in the middle of the night while rubbing a lucky rabbit’s foot and praying to the gods of all major religions) – is compact the remaining data in any diskgroup toward the start of the volume and then write zeros above the high watermark where this compacted data ends.
In our example above, this should return the capacity statistics to approximately: Allocated = 10TB, Used = 0TB, Free = 10TB. However, because zero detect is often considered to be a type of data reduction, some arrays then show horribly-skewed data reduction ratios as a result of ASRU.
Don’t get me wrong, many people have successfully used ASRU – and in some situations it may be your only choice. But there is another way…

ASM Filter Driver: The Good Way

Since Oracle Database version 12.1.0.2, the option has been available to install a piece of software called ASMFD, the ASM Filter Driver. ASMFD is a kernel module which resides in the I/O path of Oracle ASM disks – and is the natural successor to the Linux-only ASMLib kernel driver. Unlike ASMLib, or indeed native ASM, the ASMFD module contains support for SCSI UNMAP commands, which really is the missing piece of the jigsaw. Providing you use ASMFD, the deletion of files from within ASM will result in the storage array being notified as allocation units are freed up, resulting in the correct recalculation of Free and Used Capacity statistics – and without the unnecessary hack of writing zeros all over the place. It really is a no brainer.
Unless, of course, you’ve already installed your database and ASM and are now looking for some way to return freed capacity. In which case, installing ASMFD on an existing system may seem even more challenging than running ASRU. But you know what they say: it’s better to do it right first time than to be constantly forced into bodging it with PERL scripts.

TL;DR

If you want Oracle ASM to correctly free space back to your thin provisioned storage array, you need to choose between the correct method of using ASM Filter Driver or the botched method of running the ASRU reclamation tool, which comes in the form of a PERL script. Either way, it’s nothing to do with the storage platform, so don’t blame the storage guy…

The Flash Insider: To POC or Not To POC?

Proof of Concept?

Guest Post

I’m excited announce another guest blog written by my good friend and funny-talking American cousin Nathan Fuzi. Like me, Nate comes from a database background but joined the all-flash storage revolution back in its infancy. Which means, like me, Nate how has a little tombstone on his résumé marked Violin Memory. But even though he has since moved up to working in THE CLOUD, Nate’s experience working for an AFA vendor is invaluable. Over six years, he worked with hundreds of database customers who were deciding whether to purchase all-flash storage and – more importantly – wondering how to test their databases on those storage platforms. Now, for your benefit, he writes about one of the most crucial stages of the process: the proof of concept (POC).

Indulge me, if you will:  take yourself back to a time long, long ago–perhaps nearly forgotten.  Waaaay back when storage arrays were built of spinning hard drives front-ended with DRAM for caching purposes, and conventional wisdom had not yet agreed whether flash memory could serve as persistent storage media.  I know:  it seems like forever ago.  Even the ghost of Christmas Past is like, Really?  But I assure you that time happened.  I lived through it, and so did my buddy flashdba and a number of others.  Those were heady days, full of wonder and spectacle and … many, many proofs of concept.

storage-characteristicsAnd who could blame folks back then for wanting to see more than that these mysterious and spectacular “all flash” storage arrays could ingest synthetic data and spit it back at previously unseen IOPS rates, incredibly low latency numbers, and firehose-like bandwidth volumes?  Because let’s face it:  marketing numbers and theoretical performance are just that.  Theoretical.  You know, as in “your mileage may vary”.  What makes a difference to people is what kind of performance the product delivers to their specific application.  Folks like flashdba and myself got pretty good guessing at the latency numbers our products would deliver at the IOPS rates we observed in applications.  We could then do some simple math to substitute in our anticipated latency for the current value and accurately predict our improvement on execution time for a given SQL statement.  But in the early days, proving our claims to a skeptical customer often meant asking them to deploy their application on our array, as the IO profile was complex and varied.

Oh… the Pain

The PoC is still quite common and often necessary–and not just for storage products, although especially for storage products, with their increasingly wild performance claims.  But it’s painful.  You have to have an entire non-production setup in place or build one just for the PoC, and then you have to have enough additional ports on your Ethernet or FC switches (or whatever new-fangled connectivity the latest flashy product is sporting) that you can leave everything intact and hook up the new array, expose storage to the host, perform some tests, and then ideally migrate the application data over to run some “real world” tests.

But what could we achieve without doing a full-blown PoC?  There are lots of synthetic load generation utilities out there these days, some easier to use than others and some more flexible and fuller-featured.  A short list of popular tools here:

Iometer                http://www.iometer.org/

DiskSpd                https://gallery.technet.microsoft.com/DiskSpd-a-robust-storage-6cd2f223

VDBench              http://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html

Fio                          http://freecode.com/projects/fio

What are you really testing for?

One common aspect I have seen of what are, frankly, flawed testing paradigms is that admins often attempt to spin up to the max IOPS the host/array combination can drive for that particular workload setting and then hold that rate for some period.  This methodology demonstrates a couple of array attributes:  maximum sustained performance and, run long enough, the point at which caching and garbage collection mechanisms are overrun and a worst case sustained performance profile presents itself.  test-blackboardWhat it definitely does not demonstrate is the latency you can expect for your workload, which for most database environments is likely less than 10% of the maximum IOPS performance capacity of the modern all-flash array.  And what about the fact that complex animals like the Oracle Database perform both random single-block IOs and sequential multi-block IOs simultaneously and at a nanosecond’s notice, depending on the whim of the optimizer?  Simplistic performance evaluation unfortunately brings the average storage or database admin no closer to understanding how the array will perform for his actual workload–and isn’t that the whole point of doing such an evaluation?

What’s a DBA to do?

A while back, our friends over at Pure Storage wrote a blog in which they shared some metrics they had pulled from call-home data from their customer environments.  They said, for example, that Oracle environment IO activity broke down like this on average, in terms of block sizes and reads versus writes, and they helpfully provided a VDBench configuration file to drive that IO pattern:

http://blog.purestorage.com/modeling-io-size-mixes-with-vdbench/

That was really cool of them, but, on closer examination, it occurred to me that this profile really only described a blender of some number of disparate Oracle environments.  The chances of it approximating any one Oracle environment were nominal, and the chances of it approximating YOUR Oracle environment went to monkeys with typewriters producing Shakespeare.  So this driver doesn’t actually issue the IOs that your Oracle database is going to issue.  To me, that seriously limits its value.  Another problem I have with it is that, with its single read workload definition, it is going to show me the average latency for all read IOs as a single number.  But Oracle helpfully shows me my random read time separate from my random write time–and my multi-block read time separate from those, and my sequential write time for redo separate from those, etc.  This granularity is what makes Oracle’s instrumentation so valuable in performance analysis.  I refuse to give it up.

Taking Charge

So what can you do?  Well, Oracle is capturing all of your IO metrics for you automatically, so just take a look at your AWR report (you guys on SE can get this from Statspack reports) for them and build your own IO driver for VDBench.  As an example, one customer–let’s call them a large international bank–was curious to see if our products could deliver comparable or better latency than their existing storage.  They shared their AWR reports with me, and I found their IO profile section for the period they really cared about.  Here’s a snippet:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
<SNIP>
physical read IO requests                55,301,220        7,682.5       1,180.1
physical read bytes              1.936982535373E+13 2.69087766E+09 4.1334639E+08
physical read partial requests               26,445            3.7           0.6
physical read requests optimized         49,680,085        6,901.6       1,060.2
physical read total IO requests          55,479,809        7,707.3       1,183.9
physical read total bytes        1.938706428365E+13 2.69327251E+09 4.1371427E+08
physical read total bytes optimi 1.773273192858E+13 2.46345082E+09 3.7841130E+08
physical read total multi block          19,552,557        2,716.3         417.3
physical reads cache                     14,137,864        1,964.1         301.7
physical reads cache prefetch            11,716,783        1,627.7         250.0
physical reads direct                 1,168,102,453      162,274.1      24,927.0
physical reads direct (lob)                      22            0.0           0.0
physical reads direct temporary         307,926,728       42,777.5       6,571.1
physical reads prefetch warmup                    0            0.0           0.0
physical write IO requests               37,072,831        5,150.2         791.1
physical write bytes              4,873,114,484,736  676,978,477.6 1.0399083E+08
physical write requests optimize         31,566,182        4,385.2         673.6
physical write total IO requests         37,460,357        5,204.0         799.4
physical write total bytes        4,908,503,636,480  681,894,777.9 1.0474603E+08
physical write total bytes optim  3,540,697,530,368  491,877,634.2  75,557,447.1
physical write total multi block          5,511,767          765.7         117.6
<SNIP>
redo writes                                 341,363           47.4           7.3

Of course, not every multi-block read is 1M because that would be too easy.  And good luck trying to get all the numbers to line up exactly.  That Oracle pulls the metrics from different places still means some rough math.  But, with a little patience and fiddling, we can get a great approximation of the number of single block random reads, large block sequential reads, random and multi-block writes, and redo writes that match up closely to these values, both in IOPS and bandwidth.  When in doubt, use the higher of [IOPS listed, Bandwidth listed].  Thus I could set up my VDBench workload definitions:

# single-block, 100% random reads
wd=wd_oracle_rand_read,rdpct=100,xfersize=16k,seekpct=100,iorate=3250,sd=sd*,priority=1

# multi-block, 100% sequential reads
wd=wd_oracle_seq_read,rdpct=100,xfersize=1024k,seekpct=0,iorate=2500,sd=sd*,priority=2

# single-block, 100% random writes
wd=wd_oracle_rand_write,rdpct=0,xfersize=16k,seekpct=100,iorate=5800,sd=sd*,priority=3

# multi-block, 100% sequential writes
wd=wd_oracle_seq_write,rdpct=0,xfersize=768k,seekpct=0,iorate=750,sd=sd*,priority=4

# redo write sizes vary per the LGWR mechanism, so we’ll go with redo size (bytes) per second / redo writes per second
wd=wd_oracle_redo_write,rdpct=0,xfersize=64k,seekpct=0,iorate=50,sd=sd*,priority=5
rd=rd_oracle_ramp,wd=wd_oracle*,iorate=12350,interval=1,elapsed=120,forthreads=8,warmup=5

As a quick check, with the customer’s 16KB block size, this config drives just over 50 MB/s random reads + 2500 MB/s sequential reads, which gets really close to the 2566 MB/s total reads stated in the snippet above.  It also drives about 91 MB/s random writes + 563 MB/s sequential writes + 3 MB/s redo for a total of 657 MB/s writes, which is really close to the reported 650 MB/s write bandwidth in the snippet.  I could take this even further to break out if I needed to characterize performance for other IO types or block sizes.  VDBench helpfully puts out a separate HTML file for each workload definition, allowing us to see the latency metrics for each IO type and size that you can then compare against the values in our AWR or Statspack report.  Note that you should set your forthreads value just high enough that you can drive the desired IOPS total; any higher and you’ll push latency up without achieving anything useful. And clearly the total IOPS target for the run definition should match the sum of the individual workload drivers.

PoC Avoided?  Maybe.

question-mark-diceAll of what I have described here helps to answer the question of What would each latency number look like for my IO workload as it exists today?  From this, you can use a little math to answer with great accuracy the execution time for any particular SQL with the lower latency.  The next logical question is What is going to happen to overall application performance when each query runs so much faster and completes sooner, allowing the next query to start earlier, etc?  That part is much more difficult to predict and may require a full-blown PoC to answer definitively, but at least you know the product you’re about to invest time in can deliver the latency you expect with your current IO workload profile.  If you’re hoping for a 10X performance improvement for your application, you’d better see that IO wait currently accounts for a large percentage of database time and that the latency of your new array beats the current latency by enough to make that dream a reality.

New Installation Cookbook: Oracle Linux 6.7 with Oracle 11.2.0.4 RAC

cookbookI’ve updated my install cookbooks page to include a new cookbook for installation of Oracle 11.2.0.4 Real Application Clusters on Oracle Linux 6.7.

This is also the first one I’ve published since I left the employment of Violin Memory to work for Kaminario, so this install uses a Kaminario K2 All Flash Array. However, it applies very well to any Oracle RAC installation which uses relatively capable storage.

Enjoy:

https://flashdba.com/install-cookbooks/oracle-linux-6-7-with-oracle-11-2-0-4-rac/

Oracle’s ASM Filter Driver Revisited

filter

Almost exactly a year ago I published a post covering my first impressions of the ASM Filter Driver (ASMFD) released in Oracle 12.1.0.2, followed swiftly by a second post showing that it didn’t work with 4k native devices.

When I wrote that first post I was about to start my summer holidays, so I’m afraid to admit that I was a little sloppy and made some false assumptions toward the end – assumptions which were quickly overturned by eagle-eyed readers in the comments section. So I need to revisit that at some point in this post.

But first, some background.

Some Background

If you don’t know what ASMFD is, let me just quote from the 12.1 documentation:

Oracle ASM Filter Driver (Oracle ASMFD) is a kernel module that resides in the I/O path of the Oracle ASM disks. Oracle ASM uses the filter driver to validate write I/O requests to Oracle ASM disks.

The Oracle ASMFD simplifies the configuration and management of disk devices by eliminating the need to rebind disk devices used with Oracle ASM each time the system is restarted.

The Oracle ASM Filter Driver rejects any I/O requests that are invalid. This action eliminates accidental overwrites of Oracle ASM disks that would cause corruption in the disks and files within the disk group. For example, the Oracle ASM Filter Driver filters out all non-Oracle I/Os which could cause accidental overwrites.

This is interesting, because ASMFD is considered a replacement for Oracle ASMLib, yet the documentation for ASMFD doesn’t make all of the same claims that Oracle makes for ASMLib. Both ASMFD and ASMLib claim to simplify the configuration and management of disk devices, but ASMLib’s documentation also claims that it “greatly reduces kernel resource usage“. Doesn’t ASMFD have this effect too? What is definitely a new feature for ASMFD is the ability to reject invalid (i.e. non-Oracle) I/O operations to ASMFD devices – and that’s what I got wrong last time.

However, before we can revisit that, I need to install ASMFD on a brand new system.

Installing ASMFD

Last time I tried this I made the mistake of installing 12.1.0.2.0 with no patch set updates. Thanks to a reader called terry, I now know that the PSU is a very good idea, so this time I’m using 12.1.0.2.3. First let’s do some preparation.

Preparing To Install

I’m using an Oracle Linux 6 Update 5 system running the Oracle Unbreakable Enterprise Kernel v3:

[root@server4 ~]# cat /etc/oracle-release
Oracle Linux Server release 6.5
[root@server4 ~]# uname -r
3.8.13-26.2.3.el6uek.x86_64

As usual I have taken all of the necessary pre-installation steps to make the Oracle Universal Installer happy. I have disabled selinux and iptables, plus I’ve configured device mapper multipathing. I have two sets of 8 LUNs from my Violin storage: 8 using 512e emulation mode (512 byte logical block size but 4k physical block size) and 8 using 4kN native mode (4k logical and physical block size). If you have any doubts about what that means, read here.

[root@server4 ~]# ls -l /dev/mapper
total 0
crw-rw---- 1 root root 10, 236 Jul 20 16:52 control
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpatha -> ../dm-0
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap1 -> ../dm-1
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap2 -> ../dm-2
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap3 -> ../dm-3
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata1 -> ../dm-6
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata2 -> ../dm-7
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata3 -> ../dm-8
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata4 -> ../dm-9
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata5 -> ../dm-10
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata6 -> ../dm-11
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata7 -> ../dm-12
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata8 -> ../dm-13
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data1 -> ../dm-14
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data2 -> ../dm-15
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data3 -> ../dm-16
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data4 -> ../dm-17
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data5 -> ../dm-18
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data6 -> ../dm-19
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data7 -> ../dm-20
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data8 -> ../dm-21
lrwxrwxrwx 1 root root       8 Jul 20 16:53 vg_halfserver4-lv_home -> ../dm-22
lrwxrwxrwx 1 root root       7 Jul 20 16:53 vg_halfserver4-lv_root -> ../dm-4
lrwxrwxrwx 1 root root       7 Jul 20 16:53 vg_halfserver4-lv_swap -> ../dm-5

The 512e devices are shown in green and the 4k devices shown in red. The other devices here can be ignored as they are related to the default filesystem layout of the operating system.

Installing Oracle 12.1.0.2.3 Grid Infrastructure (software only)

This is where the first challenge comes. When you perform a standard install of Oracle 12c Grid Infrastructure you are asked for storage devices on which you can locate items such as the ASM SPFILE, OCR and voting disks. In the old days of using ASMLib you would have prepared these in advance, because ASMLib is a separate kernel module located outside of the Oracle GI home. But ASMFD is part of the Oracle Home and so doesn’t exist prior to installation. Thus we have a chicken and egg situation.

Even worse, I know from bitter experience that I need to install some patches prior to labelling my disks, but I can’t install patches without installing the Oracle home either.

So the only thing for it is to perform a Software Only installation from the Oracle Universal Installer, then apply the PSU, then create an ASM instance and finally label the LUNs with ASMFD. It’s all very long winded. It wouldn’t be a problem if I was migrating from an existing ASMLib setup, but this is a clean install. Such is the price of progress.

To save this post from becoming longer and more unreadable than a 12c AWR report, I’ve captured the entire installation and configuration of 12.1.0.2.3 GI and ASM on a separate installation cookbook page, here:

Installing 12.1.2.0.3 Grid Infrastructure with Oracle Linux 6 Update 5

It’s simpler that way. If you don’t want to go and read it, just take it from me that we now have a working ASM instance which currently has no devices under its control. The PSU has been applied so we are ready to start labelling.

Using ASM Filter Driver to Label Devices

The next step is to start labelling my LUNs with ASMFD. I’m using what the documentation describes as an “Oracle Grid Infrastructure Standalone (Oracle Restart) Environment”, so I’m following this set of steps in the documentation.

Step one tells me to run a dsget command and then a dsset command to add a diskstring of ‘AFD:*’. Ok:

[oracle@server4 ~]$ asmcmd dsget
parameter:
profile:++no-value-at-resource-creation--never-updated-through-ASM++
[oracle@server4 ~]$ asmcmd dsset 'AFD:*'
[oracle@server4 ~]$ asmcmd dsget
parameter:AFD:*
profile:AFD:*

Next I need to stop CRS (I’m using a standalone config so actually it’s HAS):

[root@server4 ~]# crsctl stop has
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'server4'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'server4'
CRS-2673: Attempting to stop 'ora.asm' on 'server4'
CRS-2673: Attempting to stop 'ora.evmd' on 'server4'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'server4' succeeded
CRS-2677: Stop of 'ora.evmd' on 'server4' succeeded
CRS-2677: Stop of 'ora.asm' on 'server4' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'server4'
CRS-2677: Stop of 'ora.cssd' on 'server4' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'server4' has completed
CRS-4133: Oracle High Availability Services has been stopped.

And then I need to run the afd_configure command (all as the root user). Before and after doing so I will check for any loaded kernel modules with oracle in the name, so see what changes:

[root@server4 ~]# lsmod | grep oracle
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm
[root@server4 ~]# asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
[root@server4 ~]# lsmod | grep oracle
oracleafd             211540  0
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm
[root@server4 ~]# asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED' on host 'server4'

Notice the new kernel module called oracleafd. Also, AFD is showing that “filtering is enabled” – I guess this relates to the protection against invalid writes.

Time to start up HAS or CRS again:

[root@server4 ~]# crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Ok, let’s start labelling those devices.

Labelling (Incorrectly)

Now remember that I am testing with two sets of devices here: 512e and 4k. The 512e devices are emulating a 512 byte blocksize, so they should result in ASM creating diskgroups with a blocksize of 512 bytes – thus avoiding all the tedious bugs from which Oracle suffers when using 4096 byte diskgroups.

So let’s just test a 512e LUN to see what happens when I label it and present it to ASM. First, the label is created using the afd_label command:

[oracle@server4 ~]$ ls -l /dev/mapper/v512data1
lrwxrwxrwx 1 root root 8 Jul 24 10:30 /dev/mapper/v512data1 -> ../dm-14
[oracle@server4 ~]$ ls -l /dev/dm-14
brw-rw---- 1 oracle dba 252, 14 Jul 24 10:30 /dev/dm-14
[oracle@server4 ~]$ asmcmd afd_label v512data1 /dev/mapper/v512data1
[oracle@server4 ~]$ asmcmd afd_lsdsk
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
V512DATA1                   ENABLED   /dev/sdpz

Well, it worked.. sort of. The path we can see in the lsdsk output does not show the /dev/mapper/v512data1 multipath device I specified… instead it’s one of the non-multipath /dev/sd* devices. Why?

Even worse, look what happens when I check the SECTOR_SIZE column of the v$asm_disk view in ASM:

SQL> select group_number, name, sector_size, block_size, state
  2  from v$asm_diskgroup;

GROUP_NUMBER NAME	SECTOR_SIZE BLOCK_SIZE STATE
------------ ---------- ----------- ---------- -----------
	   1 V512DATA	       4096	  4096 MOUNTED

Even though my LUNs are presented as 512e, ASM has chosen to see them as 4096 byte. That’s not wanted I want. Gaah!

Labelling (Correctly)

To fix this I need to unlabel that LUN so that AFD has nothing under its control, then update the oracleafd_use_logical_block_size parameter via the special SYSFS files /sys/modules/oracleafd:

[root@server4 ~]# cd /sys/module/oracleafd
[root@server4 oracleafd]# ls -l
total 0
-r--r--r-- 1 root root 4096 Jul 20 14:43 coresize
drwxr-xr-x 2 root root    0 Jul 20 14:43 holders
-r--r--r-- 1 root root 4096 Jul 20 14:43 initsize
-r--r--r-- 1 root root 4096 Jul 20 14:43 initstate
drwxr-xr-x 2 root root    0 Jul 20 14:43 notes
drwxr-xr-x 2 root root    0 Jul 20 14:43 parameters
-r--r--r-- 1 root root 4096 Jul 20 14:43 refcnt
drwxr-xr-x 2 root root    0 Jul 20 14:43 sections
-r--r--r-- 1 root root 4096 Jul 20 14:43 srcversion
-r--r--r-- 1 root root 4096 Jul 20 14:43 taint
--w------- 1 root root 4096 Jul 20 14:43 uevent
[root@server4 oracleafd]# cd parameters
[root@server4 parameters]# ls -l
total 0
-rw-r--r-- 1 root root 4096 Jul 20 14:43 oracleafd_use_logical_block_size
[root@server4 parameters]# cat oracleafd_use_logical_block_size
0
[root@server4 parameters]# echo 1 > oracleafd_use_logical_block_size
[root@server4 parameters]# cat oracleafd_use_logical_block_size
1

After making this change, AFD will present the logical blocksize of 512 bytes to ASM rather than the physical blocksize of 4096 bytes. So let’s now label those disks again:

[root@server4 mapper]# for lun in 1 2 3 4 5 6 7 8; do
> asmcmd afd_label v512data$lun /dev/mapper/v512data$lun
> done
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
[root@server4 mapper]# asmcmd afd_lsdsk
Connected to an idle instance.
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
V512DATA1                   ENABLED   /dev/mapper/v512data1
V512DATA2                   ENABLED   /dev/mapper/v512data2
V512DATA3                   ENABLED   /dev/mapper/v512data3
V512DATA4                   ENABLED   /dev/mapper/v512data4
V512DATA5                   ENABLED   /dev/mapper/v512data5
V512DATA6                   ENABLED   /dev/mapper/v512data6
V512DATA7                   ENABLED   /dev/mapper/v512data7
V512DATA8                   ENABLED   /dev/mapper/v512data8

Note the correct multipath devices (“/dev/mapper/*”) are now being shown in the lsdsk command output. If I now create an ASM diskgroup on these LUNs, it will have a 512 byte sector size:

SQL> get afd.sql
  1  CREATE DISKGROUP V512DATA EXTERNAL REDUNDANCY
  2  DISK 'AFD:V512DATA1', 'AFD:V512DATA2',
  3	  'AFD:V512DATA3', 'AFD:V512DATA4',
  4	  'AFD:V512DATA5', 'AFD:V512DATA6',
  5	  'AFD:V512DATA7', 'AFD:V512DATA8'
  6  ATTRIBUTE
  7	  'compatible.asm' = '12.1',
  8*	  'compatible.rdbms' = '12.1'
SQL> /

Diskgroup created.

SQL> select disk_number, mount_status, header_status, state, sector_size, path
  2  from v$asm_disk;

DISK_NUMBER MOUNT_S HEADER_STATU STATE	  SECTOR_SIZE PATH
----------- ------- ------------ -------- ----------- --------------------
	  0 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA1
	  1 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA2
	  2 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA3
	  3 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA4
	  4 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA5
	  5 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA6
	  6 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA7
	  7 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA8

8 rows selected.

SQL> select group_number, name, sector_size, block_size, state
  2  from v$asm_diskgroup;

GROUP_NUMBER NAME	SECTOR_SIZE BLOCK_SIZE STATE
------------ ---------- ----------- ---------- -----------
	   1 V512DATA		512	  4096 MOUNTED

Phew.

Failing To Label 4kN Devices

So what about my 4k native mode devices, the ones with a 4096 byte logical block size? What happens if I try to label them?

[root@server4 ~]# asmcmd afd_label V4KDATA1 /dev/mapper/v4kdata1
Connected to an idle instance.
ASMCMD-9513: ASM disk label set operation failed.

Yeah, that didn’t work out did it? Let’s look in the trace file:

[root@server4 ~]# tail -5 /u01/app/oracle/log/diag/asmcmd/user_root/server4/alert/alert.log
24-Jul-15 12:38 ASMCMD (PID = 8695) Given command - afd_label V4KDATA1 '/dev/mapper/v4kdata1'
24-Jul-15 12:38 NOTE: Verifying AFD driver state : loaded
24-Jul-15 12:38 NOTE: afdtool -add '/dev/mapper/v4kdata1' 'V4KDATA1'
24-Jul-15 12:38 NOTE:
24-Jul-15 12:38 ASMCMD-9513: ASM disk label set operation failed.

I’ve struggled to find any more meaningful message, even when I manually run the afdtool command shown in the log – but it seems pretty likely that this is failing due to the device being 4kN. I therefore assume that AFD still isn’t 4kN ready. I do wish Oracle would make some meaningful progress on its support of 4kN devices…

I/O Filter Protection

So now let’s investigate this protection that ASMFD claims to have against non-Oracle I/Os. First of all, what do those files in /dev/oracleafd/disks actually contain?

[root@server4 ~]# cd /dev/oracleafd/disks
[root@server4 disks]# ls -l
total 32
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA1
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA2
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA3
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA4
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA5
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA6
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA7
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA8
[root@server4 disks]# cat V512DATA1
/dev/mapper/v512data1

Aha. This is what I got wrong in my original post last year, because – keen as I was to start my summer vacation – I didn’t spot that these files are simply pointers to the relevant multipath device in /dev/mapper. So let’s follow the pointers this time.

Let’s remind ourselves that the files in /dev/mapper are actually symbolic links to /dev/dm-* devices:

root@server4 disks]# ls -l /dev/mapper/v512data1
lrwxrwxrwx 1 root root 8 Jul 24 12:34 /dev/mapper/v512data1 -> ../dm-14
[root@server4 disks]# ls -l /dev/dm-14
brw-rw---- 1 oracle dba 252, 14 Jul 24 12:34 /dev/dm-14

So it’s these /dev/dm-* devices that are at the end of the trail we just followed. If we dump the first 64 bytes of this /dev/dm-14 device, we should be able to see the AFD label:

[root@server4 disks]# od -c -N 64 /dev/dm-14
0000000                           (   o   u   t
0000020                                
0000040   O   R   C   L   D   I   S   K   V   5   1   2   D   A   T   A
0000060   1

There it is. We can also read it with kfed to see what ASM thinks of it:

[root@server4 ~]# kfed read /dev/dm-14
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                  1953853224 ; 0x00c: 0x74756f28
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 74756F28  [............(out]
000000010 00000000 00000000 00000000 00000000  [................]
000000020 4C43524F 4B534944 32313556 41544144  [ORCLDISKV512DATA]
000000030 00000031 00000000 00000000 00000000  [1...............]
000000040 00000000 00000000 00000000 00000000  [................]
  Repeat 251 times

So what happens if I overwrite it, as the root user, with some zeros? And maybe some text too just for good luck?

root@server4 ~]# dd if=/dev/zero of=/dev/dm-14 bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.00570833 s, 735 MB/s
[root@server4 ~]# echo CORRUPTION > /dev/dm-14
[root@server4 ~]# od -c -N 64 /dev/dm-14
0000000   C   O   R   R   U   P   T   I   O   N  \n          
0000020

It looks like it’s changed. I also see that if I dump it from another session which opens a fresh file descriptor. Yet in the /var/log/messages file there is now a new entry:

F 4626129.736/150724115533 flush-252:14[1807]  afd_mkrequest_fn: write IO on ASM managed device (major=252/minor=14)  not supported i=0 start=0 seccnt=8  pstart=0  pend=41943040
Jul 24 12:55:33 server4 kernel: quiet_error: 1015 callbacks suppressed
Jul 24 12:55:33 server4 kernel: Buffer I/O error on device dm-14, logical block 0
Jul 24 12:55:33 server4 kernel: lost page write due to I/O error on dm-14

Hmm. It seems like ASMFD has intervened to stop the write, yet when I query the device I see the “new” data. Where’s the old data gone? Well, let’s use kfed again:

[root@server4 ~]# kfed read /dev/dm-14
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                  1953853224 ; 0x00c: 0x74756f28
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 74756F28  [............(out]
000000010 00000000 00000000 00000000 00000000  [................]
000000020 4C43524F 4B534944 32313556 41544144  [ORCLDISKV512DATA]
000000030 00000031 00000000 00000000 00000000  [1...............]
000000040 00000000 00000000 00000000 00000000  [................]
  Repeat 251 times

The label is still there! Magic.

I have to confess, I don’t really know how ASM does this. Indeed, I struggled to get the system back to a point where I could manually see the label using the od command. In the end, the only way I managed it was to reboot the server – yet ASM works fine all along and the diskgroup was never affected:

SQL> alter diskgroup V512DATA check all;
Mon Jul 20 16:46:23 2015
NOTE: starting check of diskgroup V512DATA
Mon Jul 20 16:46:23 2015
GMON querying group 1 at 5 for pid 7, osid 9255
GMON checking disk 0 for group 1 at 6 for pid 7, osid 9255
GMON querying group 1 at 7 for pid 7, osid 9255
GMON checking disk 1 for group 1 at 8 for pid 7, osid 9255
GMON querying group 1 at 9 for pid 7, osid 9255
GMON checking disk 2 for group 1 at 10 for pid 7, osid 9255
GMON querying group 1 at 11 for pid 7, osid 9255
GMON checking disk 3 for group 1 at 12 for pid 7, osid 9255
GMON querying group 1 at 13 for pid 7, osid 9255
GMON checking disk 4 for group 1 at 14 for pid 7, osid 9255
GMON querying group 1 at 15 for pid 7, osid 9255
GMON checking disk 5 for group 1 at 16 for pid 7, osid 9255
GMON querying group 1 at 17 for pid 7, osid 9255
GMON checking disk 6 for group 1 at 18 for pid 7, osid 9255
GMON querying group 1 at 19 for pid 7, osid 9255
GMON checking disk 7 for group 1 at 20 for pid 7, osid 9255
Mon Jul 20 16:46:23 2015
SUCCESS: check of diskgroup V512DATA found no errors
Mon Jul 20 16:46:23 2015
SUCCESS: alter diskgroup V512DATA check all

So there you go. ASMFD: it does what it says on the tin. Just don’t try using it with 4kN devices…

The Great Hypervisor Bake-off: VMware ESX vs Oracle VM

lock-horns

This is a very simple post to show the results of some recent testing that Tom and I ran using Oracle SLOB on Violin to determine the impact of using virtualization. But before we get to that, I am duty bound to write a paragraph of text featuring lots of long sentences peppered with industry buzz words. Forgive me, it’s just the way I’m wired.

It is increasingly common these days to find database environments running in virtual machines – even large, business critical ones. The driver is the trend to commoditize I.T. services and build consolidated, private-cloud style solutions in order to control operational expense and increase agility (not to mention reduce exposure to Oracle licenses). But, as I’ve said in previous posts, the catalyst has been the unblocking of I/O as legacy disk systems are replaced by flash memory. In the past, virtual environments caused a kind of I/O blender effect whereby I/O calls become increasingly randomized – and this sucked for the performance of disk drives. Flash memory arrays on the other hand can deliver random I/O all day long because… well, if you don’t know the reasons by now can I just recommend starting at the beginning. The outcome is that many large and medium-sized organisations are now building database-as-a-service platforms with Oracle databases (other database products are available) running in virtual machines. It’s happening right now.

Phew. Anyway, that last paragraph was just a wordy way of telling you that I’m often seeing Oracle running in virtual machines on top of hypervisors. But how much of a performance impact do those hypervisors have? Step this way to find out.

The Contenders

boxersWhen it comes to running Oracle on a hypervisor using Intel x86 hardware (for that is what I have available), I only know of three real contenders:

Hyper-V has been an option for a couple of years now, but I’ll be honest – I have neither the time nor the inclination to test it today. It’s not that I don’t rate it as a product, it’s just that I’ve never used it before and don’t have enough time to learn something new right now. Maybe someday I’ll come back and add it to the mix.

In the meantime, it’s the big showdown: VMware versus Oracle VM. Not that Oracle VM is really in the same league as VMware in terms of market share… but you know, I’m trying to make this sound exciting.

The Test

This is going to be an Oracle SLOB sustained throughput test. In other words, I’m going to build an Oracle database and then shovel a massive amount of I/O through it (you can read all about SLOB here and here). SLOB will be configured to run with 25% of statements being UPDATEs (the remainder are SELECTs) and will run for 8 hours straight. What we want to see is a) which hypervisor configuration allows the greatest I/O bandwidth, and b) which hypervisor configuration exhibits the most predictable performance.

This is the configuration. First the hardware:

Violin Memory 6616 flash Memory Array

Violin Memory 6616 flash Memory Array

  • 1x Dell PowerEdge R720 server
  • 2x Intel Xeon CPU E5-2690 v2 10-core @ 3.00GHz [so that’s 2 sockets, 20 cores, 40 threads for this server]
  • 128GB DRAM
  • 1x Violin Memory 6616 (SLC) flash memory array [the one that did this]
  • 8GB fibre-channel

And the software:

  • Hypervisor: VMware ESXi 5.5.1
  • Hypervisor: Oracle VM for x86 3.3.1
  • VM: Oracle Linux 6 Update 5 (with the Unbreakable Enterprise v3 Kernel 3.6.18)
  • Oracle Grid Infrastructure 11.2.0.4 (for Automatic Storage Management)
  • Oracle Database Enterprise Edition 11.2.0.4

Each VM is configured with 20 vCPUs and is using Linux Device Mapper Multipath and Oracle ASMLib. ASM is configured to use one single +DATA disgroup comprising 8 ASM disks (LUNs from Violin) with external redundancy. The database parameters and SLOB settings are all listed on the SLOB sustained throughput test page.

Results: Bare Metal (Baseline)

First let’s see what happens when we don’t use a hypervisor at all and just run OL6.5 on bare metal:

Oracle SLOB- 8 Hour Sustained Throughput Test with no hypervisor (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         232,431.0       194,452.3        37,978.7
         Database Requests:         228,909.4       194,447.9        34,461.5
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           3,515.1             0.3         3,514.8
                Total (MB):           1,839.6         1,519.2           320.4

Ok so we’re looking at 1519 MB/sec of read throughput and 320 MB/sec of write throughput. Crucially, the lines are nice and consistent – with very little deviation from the mean. By dividing the amount of time spent waiting on db file sequential read (i.e. random physical reads) with the number of waits, we can calculate that the average latency for random reads was 438 microseconds.

Now we know what to expect, let’s look at the result from the hypervisor tests.

Results: VMware vSphere

VMware is configured to use Raw Device Mapping (RDM) which essentially gives the benefits of raw devices… read here for more details on that. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with VMware ESXi 5.5.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         173,141.7       145,066.8        28,075.0
         Database Requests:         170,615.3       145,064.0        25,551.4
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,522.8             0.1         2,522.7
                Total (MB):           1,370.0         1,133.4           236.7

Average read throughput for this test was 1133 MB/sec and write throughput averaged at 237 MB/sec. Average read latency was 596 microseconds. That’s an increase of 36%.

In comparison to the bare metal test, we see that total bandwidth dropped by around 25%. That might seem like a lot but remember, we are absolutely hammering this system. A real database is unlikely to ever create this level of sustained I/O. In my role at Violin I’ve been privileged to work on some of the busiest databases in Europe – nothing is ever this crazy (although a few do come close).

Results: Oracle VM

Oracle VM is based on the Xen hypervisor and therefore uses Xen virtual disks to present block devices. For this test I downloaded the Oracle Linux 6 Update 5 template from Oracle’s eDelivery site. You can see more about the way this VM was configured here. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with Oracle VM 3.3.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         160,563.8       134,592.9        25,970.9
         Database Requests:         158,538.1       134,587.3        23,950.8
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,017.2             0.2         2,016.9
                Total (MB):           1,273.4         1,051.6           221.9

This time we see average read bandwidth of 1052MB/sec and average write bandwidth of 222MB/sec, with the average read latency at 607 microseconds, which is 39% higher than the baseline test.

Meanwhile, total bandwidth dropped by 31%. That’s slightly worse than VMware, but what’s really interesting is the deviation. Look at how ragged the lines are on the OVM test! There is a much higher degree of variance exhibited here than on the VMware test.

Conclusion

This is only one test so I’m not claiming it’s conclusive. VMware does appear to deliver slightly better performance than OVM in my tests, but it’s not a huge difference. However, I am very much concerned by the variance of the OVM test in comparison to VMware. Look, for example, at the wait event histograms for db file sequential read:

Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was <.05%; value of null is truly 0
-> % of Waits: column heading of <=1s is truly <1024ms, >1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                             % of Waits
                                          -----------------------------------------------
                                    Total
Hypervisor  Event                   Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
----------- ----------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
Bare Metal: db file sequential read 5557.  98.7   1.3    .0    .0    .0    .0
VMware ESX: db file sequential read 4164.  92.2   6.7   1.1    .0    .0    .0
Oracle VM : db file sequential read 3834.  95.6   4.1    .1    .1    .0    .0    .0    .0

The OVM tests show occasional results in the two highest buckets, meaning once or twice there were waits in excess of 1 second! However, to be fair, OVM also had more millisecond waits than VMware.

Anyway, for now – and for this setup at least – I’m sticking with VMware. You should of course test your own workloads before choosing which hypervisor works for you…

Thanks as always to Kevin for bringing Oracle SLOB to the community.

ASM Rebalance Too Slow? 3 Tips To Improve Rebalance Times

see-saw

I’ve run into a few customers recently who have had problems with their ASM rebalance operations running too slowly. Surprisingly, there were some simple concepts being overlooked – and once these were understood, the rebalance times were dramatically improved. For that reason, I’m documenting the solutions here… I hope that somebody, somewhere benefits…

1. Don’t Overbalance

Every time you run an ALTER DISKGROUP REBALANCE operation you initiate a large amount of I/O workload as Oracle ASM works to evenly stripe data across all available ASM disks (i.e. LUNs). The most common cause of rebalance operations running slowly that I see (and I’m constantly surprised how much I see this) is to overbalance, i.e. cause ASM to perform more I/O than is necessary.

It almost always goes like this. The customer wants to migrate some data from one set of ASM disks to another, so they first add the new disks:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
rebalance power 11 wait;

Then they drop the old disks like this:

alter diskgroup data
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

Well guess what? That causes double the amount of I/O that is actually necessary to migrate, because Oracle evenly stripes across all disks and then has to rebalance a second time once the original disks are dropped.

This is how it should be done – in one single operation:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

A customer of mine tried this earlier this week and reported back that their ASM rebalance time had reduced by a factor of five!

By the way, the WAIT command means the cursor doesn’t return until the command is finished. To have the command essentially run in the background you can simply change this to NOWAIT. Also, you could run the ADD and DROP commands separately if you used a POWER LIMIT of zero for the first command, as this would pause the rebalance and then the second command would kick it off.

2. Power Limit Goes Up To 1024

Simple one this, but easily forgotten. From the early days of ASM, the maximum power limit for rebalance operations was 11. See here if you don’t know why.

From 11.2.0.2, if the COMPATIBLE.ASM disk group attribute is set to 11.2.0.2 or higher the limit is now 1024. That means 11 really isn’t going to cut it anymore. If you are asking for full power, make sure you know what number that is.

3. Avoid The Compact Phase (for Flash Storage Systems)

An ASM rebalance operation comprises three phases, where the third one is the compact phase. This attempts to move data as close as possible to the outer tracks of the disks ASM is using.

Did you spot the issue there? Disks. This I/O-heavy phase is completely pointless on a flash system, where I/O is served evenly from any logical address within a LUN.

You can therefore avoid that potentially-massive I/O hit by disabling the compact phase, using the underscore parameter _DISABLE_REBALANCE_COMPACT=TRUE. Remember that you need to get Oracle Support’s permission before setting underscore parameters! Point your SR in the direction of the following My Oracle Support note:

What is ASM rebalance compact Phase and how it can be disabled (Doc ID 1902001.1)

Unfortunately it appears the parameter was deprecated in 12c, so from now on you have to set the ASM diskgroup attribute “_rebalance_compact” to FALSE (note the opposite value to that set at the instance level!), for example:

ALTER DISKGROUP  SET ATTRIBUTE "_rebalance_compact”="FALSE";

If you want to know more about this topic (for example, what the first two rebalance phases are), or indeed anything about ASM in general, I highly recommend the legendary ASM blogger that is Bane Radulovic a.k.a. ASM Support Guy.

Conclusion

An ASM rebalance potentially creates a lot of I/O, which means you may need to wait for a long time before it finishes. For that reason, make sure you understand what you are doing and make every effort to perform only as much I/O as you actually need. Don’t forget you can use the EXPLAIN WORK command to gauge in advance how much work is required.

Happy rebalancing!

Oracle Exadata X5: The Road To Ten Billion Dollars

money

Now that the dust has settled on the announcement of Oracle’s new Exadata X5 Database Machine, I’ve been doing some research in order to update my History of Exadata post (it’ll be ready soon). While reviewing the datasheets and other collateral for the X5 I was struck by the meteoric increase in one particular statistic: the number of processor cores on each database server. Oracle is riding that Moore’s Law train all the way to the bank.

The thing is, the number of cores per database server is directly linked to the cost of licensing the Oracle Database for each Exadata machine – and that means trouble if you are the one paying the bills. Assuming you buy a full rack – and that you license every core in every database server (which is the most common choice, since only a very brave minority would consider the Oracle VM Trusted Partitions option), your license cost has been increasing by 50% for each of the last two Exadata releases (X3-2 to X4-2 to X5-2). Let’s have a look at that in graphical form (click on the image to enlarge):

Exadata DB License Cost Comparison X2-2 to X5-2

Let’s not forget here that I am only plotting the cost of licensing the database software. We are not taking into account the extra costs associated with paying for the hardware, licensing the storage servers or purchasing any of the pretty-much essential enterprise edition options such as Oracle RAC, partitioning, multitenancy or the diagnostic pack license. Nor are we considering the infamous 22% per annum software support costs. I’m also using the list price – which you would never expect to pay – but even if you used discounted prices the percentage increases would remain the same.

Anyway, after crunching the numbers it turns out there is good news and bad news…

Good News for Oracle

The good news for Oracle is that if Exadata continues to increase the number of cores at the rate of 50% extra per release, the list price for licensing the database software on the future X23-2 model will be $10.1 billion:

Exadata DB License Cost Projection

There’s no doubt about it – this will pay for a lot of yachts.

Bad News for Oracle

There is a downside though. As Kevin Closson has already shown, Oracle appears to be having trouble balancing the I/O capabilities of the X5 against this tremendously-increased compute power. Even after abandoning previous claims that a memory hierarchy with “low-cost disk” as the bottom tier would bring the “highest performance at the lowest cost” to customers, the new Oracle Exadata X5 “Extreme Flash” model (because apparently flash on its own isn’t enough, it has to be extreme flash) struggles to deliver an improvement in IOPS per flash device, as Kevin has shown.

exadata-x5-ef-price-per-io

Excerpt from the Oracle Exadata X5-2 Press Release

You wouldn’t think this from reading the press release, which promises “breakthrough performance and price per I/O” (emphasis added by me). Price per I/O, eh? That sounds like we take the overall price and divide it by the number of IOPS the system can deliver, right?

So let’s do that. And to be generous, let’s only look at the database license cost (ignoring all those other costs again) and take the maximum IOPS number from the datasheets (which in most cases is for unrealistic, 100% read only, fully cached workloads). How will it look? I’ll overlay it as a line (in blue) on top of the first graph:

Exadata DB License Cost Comparison

Well it turns out that the price per I/O is actually falling: it’s down from $1.71 on the X4-2 (HP) model to $1.65 on the X5-2 EF. But three and a half percent is not much of an improvement considering the 50% extra on the price tag, is it? And as for offering a “breakthrough” price per I/O, the X2-2 was better at only $1.52 per I/O per second!

Summary

In my view, something is not right about the balance between compute and storage on the Exadata X5. It feels as though Oracle is bumping up the compute power faster than would be architecturally prudent because this results in a higher purchase price. Maybe I’m wrong – I have no insider knowledge and can only speculate… but when the Exadata X23-2 finally comes out in a couple of decades time, maybe we’ll know for sure.

Footnote

The comments section of this article makes for interesting reading, with responses from a number of Oracle employees – although not necessarily speaking on behalf of the mothership. The noteworthy (not to mention calm and measured) comments come from the Exadata Product Manager, Gurmeet Goindi. In addition I’d like to draw your attention to the following two URLs: