Oracle ASM and Thin Provisioning – How To Reclaim Space

It came to my attention last November that I had crossed the one year anniversary since my last post on flashdba.com. I was so surprised that I immediately decided to write a new post, which took another three months. There are reasons why I’m no longer posting technical blogs about databases and flash, but I’ll cover them in a later post. No, not that late – I hope.

In the meantime, I thought I’d write a note on this subject because I’ve lost count of the number of times I’ve been asked questions on the topic of Oracle ASM and Thin Provisioning. Normally, I’m asked by customers or prospects who think there is an issue with their storage system… whereas, in fact, the problem is entirely storage-agnostic.

But first, some background.

Thin Provisioning

Thin Provisioning (TP) is used to describe the overcommitment of storage capacity. Your host may think it’s been allocated 10TB of capacity and is currently using 2TB, but the storage platform has only really allocated the 2TB used and the remaining 8TB may not even exist. Why would you want this? Because in a multi-host environment (where hosts could be virtual or physical), the amount of allocated-but-unused capacity could be significant. Without TP, serious amounts of capacity would need to be provisioned which may never be used, but with TP all the hosts can be “fooled” into thinking they have been allocated what they want while the actual utilised capacity is only the sum of what each host has used.

Where things can get a bit complicated with TP is that many layers in your stack may be thin provisioning storage to the layers above them. Most storage arrays are capable of TP (or indeed mandate its use), but hypervisors often have thin provisioning options too. Meanwhile, some applications which create data store structures have options which can help or hinder the use of TP. For example, VMware has the ability to create virtual disks which are thin, thick (lazy zeroed) or thick (eager zeroed). As a result, it isn’t always obvious to the underlying storage whether a particular set of allocated blocks are really in use or not. Won’t somebody think of the poor storage array?

Trim and Unmap

Consider the situation where a large file is created and then deleted in a filesystem on a typical operating system. Commonly, the deletion process doesn’t really delete anything other than the metadata telling the filesystem where the file resided. Thus the underlying file data remains until such time as something else comes along and overwrites it. This is beneficial because it is faster and requires less work than trying to overwrite the file with (for example) zeros. But if the filesystem resides on a storage array which uses TP, how will the storage array know that the space allocated to the file is now free? It can’t – unless the filesystem has a way of telling it.

For this purpose there exists a set of OS calls known as trim commands – and for the SCSI protocol (used by most block storage devices such as SANs) the command is known as UNMAP. Issuing one of these commands allows the calling layer (the filesystem, or perhaps a volume manager) to notify the storage platform that a specific set of blocks are no longer in use and can be “unmapped”, freeing space. As a side note, large calls to UNMAP can often have temporary but unexpected consequences on storage performance, as large amounts of metadata may need to be updated.

Oracle ASM: Unmap is for Wimps

Let’s get straight to the point here: Oracle’s Automatic Storage Manager doesn’t natively use UNMAP commands. Quelle surprise. But there are still ways to free up space back to thin provisioned arrays. Two in fact: let’s call them the bad way and the good way. First though, let’s set up the scenario:

Test Scenario

Consider the situation where an Oracle ASM diskgroup is created on a 10TB volume group presented from a thin provisioning All-Flash storage array. The DBA then creates a large “bigfile” tablespace in the diskgroup, with a 5TB datafile (the rest of the database resides elsewhere). Anyone who has sat waiting for the CREATE TABLESPACE command for any period of time will be aware that, during the datafile creation process, Oracle likes to fill the whole file with empty blocks. From Oracle’s perspective, this has the advantage of ensuring that the entire datafile capacity has been marked as used by the storage array. In other words, it’s not “fake” thin provisioned space which may or may not be available, but real available capacity which now belongs to Oracle. (You may also recall that Oracle no longer takes this approach with tempfiles, instead using the faster “sparse” allocation method.)

At this point, what will the volumes on the storage array will be showing? We know that 10TB has been allocated, of which 5TB has been used. So shouldn’t that leave 5TB free? Probably not, because almost every All-Flash storage array uses data reduction technologies such as compression, deduplication and zero-detect. Since each block in the tablespace contains a unique block number, deduplication isn’t going to add any value (which is why arrays like the Kaminario allow dedupe to be disabled on a per-volume basis), but compression is going to have great fun with all the emptiness inside each Oracle block so the storage array will probably show significantly less than 5TB used.

Next, our enterprising DBA watches a Connor McDonald video about DBMS_RANDOM and gets a little overexcited, then fills the entire tablespace with random data to the point that the storage array can hardly achieve any compression. The outcome? Allocated = 10TB, Used = 5TB, Free = 5TB.

Finally, after watching a video of Larry Ellison explaining that the Oracle Autonomous Database needs “no human intervention” and thus fearing for his job, the DBA deletes the tablespace and goes home. Back to 10TB free? No.
The tablespace deletion command does a number of things, including notifying Oracle ASM that the file’s allocation units are no longer in use and removing the datafile from the database’s controlfile. But at no point does anybody bother to tell the storage array that the used space is now free, so the array’s capacity statistics remain: Allocated = 10TB, Used = 5TB, Free = 5TB.

ASRU: The Bad Way

ASRU is Oracle’s ASM Reclamation Utility, a PERL script developed in conjunction with 3PAR (a storage array now owned by HPE) and designed to free up space from scenarios such as the one above. It is, in my personal opinion, a terrible botched solution which was created to serve a purpose which no longer exists – although, interestingly, many storage vendors still seem to recommend it by default (for example, Pure Storage still describe it as the only solution for reclaiming unused space with Oracle ASM).
ASRU doesn’t issue UNMAP commands. Instead, it takes advantage of the fact that most modern storage platforms (including 3PAR, Pure Storage and Kaminario) treat blocks full of zeros as free space (a feature known as zero detect). Thus what ASRU does – when manually run by a DBA (presumably during a change window in the middle of the night while rubbing a lucky rabbit’s foot and praying to the gods of all major religions) – is compact the remaining data in any diskgroup toward the start of the volume and then write zeros above the high watermark where this compacted data ends.
In our example above, this should return the capacity statistics to approximately: Allocated = 10TB, Used = 0TB, Free = 10TB. However, because zero detect is often considered to be a type of data reduction, some arrays then show horribly-skewed data reduction ratios as a result of ASRU.
Don’t get me wrong, many people have successfully used ASRU – and in some situations it may be your only choice. But there is another way…

ASM Filter Driver: The Good Way

Since Oracle Database version 12.1.0.2, the option has been available to install a piece of software called ASMFD, the ASM Filter Driver. ASMFD is a kernel module which resides in the I/O path of Oracle ASM disks – and is the natural successor to the Linux-only ASMLib kernel driver. Unlike ASMLib, or indeed native ASM, the ASMFD module contains support for SCSI UNMAP commands, which really is the missing piece of the jigsaw. Providing you use ASMFD, the deletion of files from within ASM will result in the storage array being notified as allocation units are freed up, resulting in the correct recalculation of Free and Used Capacity statistics – and without the unnecessary hack of writing zeros all over the place. It really is a no brainer.
Unless, of course, you’ve already installed your database and ASM and are now looking for some way to return freed capacity. In which case, installing ASMFD on an existing system may seem even more challenging than running ASRU. But you know what they say: it’s better to do it right first time than to be constantly forced into bodging it with PERL scripts.

TL;DR

If you want Oracle ASM to correctly free space back to your thin provisioned storage array, you need to choose between the correct method of using ASM Filter Driver or the botched method of running the ASRU reclamation tool, which comes in the form of a PERL script. Either way, it’s nothing to do with the storage platform, so don’t blame the storage guy…

New Installation Cookbook: Oracle Linux 6.7 with Oracle 11.2.0.4 RAC

cookbookI’ve updated my install cookbooks page to include a new cookbook for installation of Oracle 11.2.0.4 Real Application Clusters on Oracle Linux 6.7.

This is also the first one I’ve published since I left the employment of Violin Memory to work for Kaminario, so this install uses a Kaminario K2 All Flash Array. However, it applies very well to any Oracle RAC installation which uses relatively capable storage.

Enjoy:

https://flashdba.com/install-cookbooks/oracle-linux-6-7-with-oracle-11-2-0-4-rac/

Was I Mentioned During Oracle’s Q4 2015 Results Call?

hurd

In a proud moment for me, it appears that Mark Hurd, CEO of Oracle, has mentioned my flashdba blog during the Oracle Q4 2015 results call. At least, that’s what I’m reading into this section from the transcript published by Seeking Alpha:

We grew in storage in the quarter and this is — really we are going through a shift in storage now. We released our SAN product FS1 in the quarter which saw some bookings. This is really the first quarter we got any bookings out of FS1 or GFS product, somebody’s renamed that but I haven’t recently – BS1. I wish they wouldn’t do that to me but so they renamed BS1 – so I missed the – but anyway so we had good growth in PaaS – as well.

I’m pretty sure that my blog post entitled Postcard from Oracle OpenWorld 2014: The Oracle FS1 Flash Array was the first place in which Oracle’s newly-announced FS1 Flash Storage System was ironically described as the “BS1 Flash Storage Array” due to some of the baffling marketing claims made at its announcement during Oracle OpenWorld 2014. Claims like, “The Oracle FS1 is the first mainstream, general purpose flash array”.

I haven’t heard the recording of the call, just read the transcript, but it appears to me that Mr Hurd uses the BS1 phrase to get some laughs from the analysts on the call.

So hey, thanks Mark! It’s exciting to know that finally, even in a small way, I’ve been able to make a contribution at the highest levels within Oracle. I am open to discussions about filling a new role as Oracle’s SVP of Investor Comedy Moments. And with those results, it could be an increasingly essential role…!

The Great Hypervisor Bake-off: VMware ESX vs Oracle VM

lock-horns

This is a very simple post to show the results of some recent testing that Tom and I ran using Oracle SLOB on Violin to determine the impact of using virtualization. But before we get to that, I am duty bound to write a paragraph of text featuring lots of long sentences peppered with industry buzz words. Forgive me, it’s just the way I’m wired.

It is increasingly common these days to find database environments running in virtual machines – even large, business critical ones. The driver is the trend to commoditize I.T. services and build consolidated, private-cloud style solutions in order to control operational expense and increase agility (not to mention reduce exposure to Oracle licenses). But, as I’ve said in previous posts, the catalyst has been the unblocking of I/O as legacy disk systems are replaced by flash memory. In the past, virtual environments caused a kind of I/O blender effect whereby I/O calls become increasingly randomized – and this sucked for the performance of disk drives. Flash memory arrays on the other hand can deliver random I/O all day long because… well, if you don’t know the reasons by now can I just recommend starting at the beginning. The outcome is that many large and medium-sized organisations are now building database-as-a-service platforms with Oracle databases (other database products are available) running in virtual machines. It’s happening right now.

Phew. Anyway, that last paragraph was just a wordy way of telling you that I’m often seeing Oracle running in virtual machines on top of hypervisors. But how much of a performance impact do those hypervisors have? Step this way to find out.

The Contenders

boxersWhen it comes to running Oracle on a hypervisor using Intel x86 hardware (for that is what I have available), I only know of three real contenders:

Hyper-V has been an option for a couple of years now, but I’ll be honest – I have neither the time nor the inclination to test it today. It’s not that I don’t rate it as a product, it’s just that I’ve never used it before and don’t have enough time to learn something new right now. Maybe someday I’ll come back and add it to the mix.

In the meantime, it’s the big showdown: VMware versus Oracle VM. Not that Oracle VM is really in the same league as VMware in terms of market share… but you know, I’m trying to make this sound exciting.

The Test

This is going to be an Oracle SLOB sustained throughput test. In other words, I’m going to build an Oracle database and then shovel a massive amount of I/O through it (you can read all about SLOB here and here). SLOB will be configured to run with 25% of statements being UPDATEs (the remainder are SELECTs) and will run for 8 hours straight. What we want to see is a) which hypervisor configuration allows the greatest I/O bandwidth, and b) which hypervisor configuration exhibits the most predictable performance.

This is the configuration. First the hardware:

Violin Memory 6616 flash Memory Array

Violin Memory 6616 flash Memory Array

  • 1x Dell PowerEdge R720 server
  • 2x Intel Xeon CPU E5-2690 v2 10-core @ 3.00GHz [so that’s 2 sockets, 20 cores, 40 threads for this server]
  • 128GB DRAM
  • 1x Violin Memory 6616 (SLC) flash memory array [the one that did this]
  • 8GB fibre-channel

And the software:

  • Hypervisor: VMware ESXi 5.5.1
  • Hypervisor: Oracle VM for x86 3.3.1
  • VM: Oracle Linux 6 Update 5 (with the Unbreakable Enterprise v3 Kernel 3.6.18)
  • Oracle Grid Infrastructure 11.2.0.4 (for Automatic Storage Management)
  • Oracle Database Enterprise Edition 11.2.0.4

Each VM is configured with 20 vCPUs and is using Linux Device Mapper Multipath and Oracle ASMLib. ASM is configured to use one single +DATA disgroup comprising 8 ASM disks (LUNs from Violin) with external redundancy. The database parameters and SLOB settings are all listed on the SLOB sustained throughput test page.

Results: Bare Metal (Baseline)

First let’s see what happens when we don’t use a hypervisor at all and just run OL6.5 on bare metal:

Oracle SLOB- 8 Hour Sustained Throughput Test with no hypervisor (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         232,431.0       194,452.3        37,978.7
         Database Requests:         228,909.4       194,447.9        34,461.5
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           3,515.1             0.3         3,514.8
                Total (MB):           1,839.6         1,519.2           320.4

Ok so we’re looking at 1519 MB/sec of read throughput and 320 MB/sec of write throughput. Crucially, the lines are nice and consistent – with very little deviation from the mean. By dividing the amount of time spent waiting on db file sequential read (i.e. random physical reads) with the number of waits, we can calculate that the average latency for random reads was 438 microseconds.

Now we know what to expect, let’s look at the result from the hypervisor tests.

Results: VMware vSphere

VMware is configured to use Raw Device Mapping (RDM) which essentially gives the benefits of raw devices… read here for more details on that. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with VMware ESXi 5.5.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         173,141.7       145,066.8        28,075.0
         Database Requests:         170,615.3       145,064.0        25,551.4
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,522.8             0.1         2,522.7
                Total (MB):           1,370.0         1,133.4           236.7

Average read throughput for this test was 1133 MB/sec and write throughput averaged at 237 MB/sec. Average read latency was 596 microseconds. That’s an increase of 36%.

In comparison to the bare metal test, we see that total bandwidth dropped by around 25%. That might seem like a lot but remember, we are absolutely hammering this system. A real database is unlikely to ever create this level of sustained I/O. In my role at Violin I’ve been privileged to work on some of the busiest databases in Europe – nothing is ever this crazy (although a few do come close).

Results: Oracle VM

Oracle VM is based on the Xen hypervisor and therefore uses Xen virtual disks to present block devices. For this test I downloaded the Oracle Linux 6 Update 5 template from Oracle’s eDelivery site. You can see more about the way this VM was configured here. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with Oracle VM 3.3.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         160,563.8       134,592.9        25,970.9
         Database Requests:         158,538.1       134,587.3        23,950.8
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,017.2             0.2         2,016.9
                Total (MB):           1,273.4         1,051.6           221.9

This time we see average read bandwidth of 1052MB/sec and average write bandwidth of 222MB/sec, with the average read latency at 607 microseconds, which is 39% higher than the baseline test.

Meanwhile, total bandwidth dropped by 31%. That’s slightly worse than VMware, but what’s really interesting is the deviation. Look at how ragged the lines are on the OVM test! There is a much higher degree of variance exhibited here than on the VMware test.

Conclusion

This is only one test so I’m not claiming it’s conclusive. VMware does appear to deliver slightly better performance than OVM in my tests, but it’s not a huge difference. However, I am very much concerned by the variance of the OVM test in comparison to VMware. Look, for example, at the wait event histograms for db file sequential read:

Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was <.05%; value of null is truly 0
-> % of Waits: column heading of <=1s is truly <1024ms, >1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                             % of Waits
                                          -----------------------------------------------
                                    Total
Hypervisor  Event                   Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
----------- ----------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
Bare Metal: db file sequential read 5557.  98.7   1.3    .0    .0    .0    .0
VMware ESX: db file sequential read 4164.  92.2   6.7   1.1    .0    .0    .0
Oracle VM : db file sequential read 3834.  95.6   4.1    .1    .1    .0    .0    .0    .0

The OVM tests show occasional results in the two highest buckets, meaning once or twice there were waits in excess of 1 second! However, to be fair, OVM also had more millisecond waits than VMware.

Anyway, for now – and for this setup at least – I’m sticking with VMware. You should of course test your own workloads before choosing which hypervisor works for you…

Thanks as always to Kevin for bringing Oracle SLOB to the community.

ASM Rebalance Too Slow? 3 Tips To Improve Rebalance Times

see-saw

I’ve run into a few customers recently who have had problems with their ASM rebalance operations running too slowly. Surprisingly, there were some simple concepts being overlooked – and once these were understood, the rebalance times were dramatically improved. For that reason, I’m documenting the solutions here… I hope that somebody, somewhere benefits…

1. Don’t Overbalance

Every time you run an ALTER DISKGROUP REBALANCE operation you initiate a large amount of I/O workload as Oracle ASM works to evenly stripe data across all available ASM disks (i.e. LUNs). The most common cause of rebalance operations running slowly that I see (and I’m constantly surprised how much I see this) is to overbalance, i.e. cause ASM to perform more I/O than is necessary.

It almost always goes like this. The customer wants to migrate some data from one set of ASM disks to another, so they first add the new disks:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
rebalance power 11 wait;

Then they drop the old disks like this:

alter diskgroup data
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

Well guess what? That causes double the amount of I/O that is actually necessary to migrate, because Oracle evenly stripes across all disks and then has to rebalance a second time once the original disks are dropped.

This is how it should be done – in one single operation:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

A customer of mine tried this earlier this week and reported back that their ASM rebalance time had reduced by a factor of five!

By the way, the WAIT command means the cursor doesn’t return until the command is finished. To have the command essentially run in the background you can simply change this to NOWAIT. Also, you could run the ADD and DROP commands separately if you used a POWER LIMIT of zero for the first command, as this would pause the rebalance and then the second command would kick it off.

2. Power Limit Goes Up To 1024

Simple one this, but easily forgotten. From the early days of ASM, the maximum power limit for rebalance operations was 11. See here if you don’t know why.

From 11.2.0.2, if the COMPATIBLE.ASM disk group attribute is set to 11.2.0.2 or higher the limit is now 1024. That means 11 really isn’t going to cut it anymore. If you are asking for full power, make sure you know what number that is.

3. Avoid The Compact Phase (for Flash Storage Systems)

An ASM rebalance operation comprises three phases, where the third one is the compact phase. This attempts to move data as close as possible to the outer tracks of the disks ASM is using.

Did you spot the issue there? Disks. This I/O-heavy phase is completely pointless on a flash system, where I/O is served evenly from any logical address within a LUN.

You can therefore avoid that potentially-massive I/O hit by disabling the compact phase, using the underscore parameter _DISABLE_REBALANCE_COMPACT=TRUE. Remember that you need to get Oracle Support’s permission before setting underscore parameters! Point your SR in the direction of the following My Oracle Support note:

What is ASM rebalance compact Phase and how it can be disabled (Doc ID 1902001.1)

Unfortunately it appears the parameter was deprecated in 12c, so from now on you have to set the ASM diskgroup attribute “_rebalance_compact” to FALSE (note the opposite value to that set at the instance level!), for example:

ALTER DISKGROUP  SET ATTRIBUTE "_rebalance_compact”="FALSE";

If you want to know more about this topic (for example, what the first two rebalance phases are), or indeed anything about ASM in general, I highly recommend the legendary ASM blogger that is Bane Radulovic a.k.a. ASM Support Guy.

Conclusion

An ASM rebalance potentially creates a lot of I/O, which means you may need to wait for a long time before it finishes. For that reason, make sure you understand what you are doing and make every effort to perform only as much I/O as you actually need. Don’t forget you can use the EXPLAIN WORK command to gauge in advance how much work is required.

Happy rebalancing!

Oracle Exadata X5: The Road To Ten Billion Dollars

money

Now that the dust has settled on the announcement of Oracle’s new Exadata X5 Database Machine, I’ve been doing some research in order to update my History of Exadata post (it’ll be ready soon). While reviewing the datasheets and other collateral for the X5 I was struck by the meteoric increase in one particular statistic: the number of processor cores on each database server. Oracle is riding that Moore’s Law train all the way to the bank.

The thing is, the number of cores per database server is directly linked to the cost of licensing the Oracle Database for each Exadata machine – and that means trouble if you are the one paying the bills. Assuming you buy a full rack – and that you license every core in every database server (which is the most common choice, since only a very brave minority would consider the Oracle VM Trusted Partitions option), your license cost has been increasing by 50% for each of the last two Exadata releases (X3-2 to X4-2 to X5-2). Let’s have a look at that in graphical form (click on the image to enlarge):

Exadata DB License Cost Comparison X2-2 to X5-2

Let’s not forget here that I am only plotting the cost of licensing the database software. We are not taking into account the extra costs associated with paying for the hardware, licensing the storage servers or purchasing any of the pretty-much essential enterprise edition options such as Oracle RAC, partitioning, multitenancy or the diagnostic pack license. Nor are we considering the infamous 22% per annum software support costs. I’m also using the list price – which you would never expect to pay – but even if you used discounted prices the percentage increases would remain the same.

Anyway, after crunching the numbers it turns out there is good news and bad news…

Good News for Oracle

The good news for Oracle is that if Exadata continues to increase the number of cores at the rate of 50% extra per release, the list price for licensing the database software on the future X23-2 model will be $10.1 billion:

Exadata DB License Cost Projection

There’s no doubt about it – this will pay for a lot of yachts.

Bad News for Oracle

There is a downside though. As Kevin Closson has already shown, Oracle appears to be having trouble balancing the I/O capabilities of the X5 against this tremendously-increased compute power. Even after abandoning previous claims that a memory hierarchy with “low-cost disk” as the bottom tier would bring the “highest performance at the lowest cost” to customers, the new Oracle Exadata X5 “Extreme Flash” model (because apparently flash on its own isn’t enough, it has to be extreme flash) struggles to deliver an improvement in IOPS per flash device, as Kevin has shown.

exadata-x5-ef-price-per-io

Excerpt from the Oracle Exadata X5-2 Press Release

You wouldn’t think this from reading the press release, which promises “breakthrough performance and price per I/O” (emphasis added by me). Price per I/O, eh? That sounds like we take the overall price and divide it by the number of IOPS the system can deliver, right?

So let’s do that. And to be generous, let’s only look at the database license cost (ignoring all those other costs again) and take the maximum IOPS number from the datasheets (which in most cases is for unrealistic, 100% read only, fully cached workloads). How will it look? I’ll overlay it as a line (in blue) on top of the first graph:

Exadata DB License Cost Comparison

Well it turns out that the price per I/O is actually falling: it’s down from $1.71 on the X4-2 (HP) model to $1.65 on the X5-2 EF. But three and a half percent is not much of an improvement considering the 50% extra on the price tag, is it? And as for offering a “breakthrough” price per I/O, the X2-2 was better at only $1.52 per I/O per second!

Summary

In my view, something is not right about the balance between compute and storage on the Exadata X5. It feels as though Oracle is bumping up the compute power faster than would be architecturally prudent because this results in a higher purchase price. Maybe I’m wrong – I have no insider knowledge and can only speculate… but when the Exadata X23-2 finally comes out in a couple of decades time, maybe we’ll know for sure.

Footnote

The comments section of this article makes for interesting reading, with responses from a number of Oracle employees – although not necessarily speaking on behalf of the mothership. The noteworthy (not to mention calm and measured) comments come from the Exadata Product Manager, Gurmeet Goindi. In addition I’d like to draw your attention to the following two URLs:

Oracle AWR Reports: When IOStats Lie

graph

If you’ve been unfortunate enough to follow my dithering on Twitter recently you’ll know that I’ve been lurching between thinking that there is and isn’t a problem with Oracle’s tracking of I/O statistics in its AWR reports.

I’m now convinced there is a problem, but I can’t work out what causes it… so step 1 is to describe the problem here, after which step 2 will probably be to sit back and hope someone far more intelligent than me will solve it.

But first some background:

AWR I/O Statistics

I’ve written about the I/O statistics contained in Oracle AWR Reports before, so I won’t repeat myself too much other than to highlight two critical areas which we’ll focus on today: Instance Statistics and IOStat Summaries. By the way, the format of AWR reports changed in 11.2.0.4 and 12c to include a new IO Profile section, but today we’re covering reports from 11.2.0.3.

First of all, here are some sensible statistics. I’m going to show you the IOStat by Function Summary section of a report from a real life database:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg  
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Buffer Cache Re   14.9G   504.1 4.22491      0M     0.0      0M 1690.3K     0.0
Direct Reads      12.2G     3.5 3.45612      0M     0.0      0M       0     N/A
DBWR                 0M     0.0      0M    7.9G   186.6 2.25336       0     N/A
Others             4.2G    12.1 1.18110    3.6G     2.5 1.03172   41.9K     0.1
LGWR                 1M     0.0 .000277    5.8G    11.9 1.65642   17.4K     1.0
Direct Writes        0M     0.0      0M      4M     0.0 .001110       0     N/A
TOTAL:            31.2G   519.7 8.86241   17.4G   201.0 4.94263 1749.6K     0.0
          -------------------------------------------------------------

This section of the report is breaking down all I/O operations into the functions that caused them (e.g. buffer cache reads, direct path reads or writes, redo log writes by LGWR, the writing of dirty buffers by DBWR and so on). Let’s ignore that level of detail now and just focus on the TOTAL row at the bottom.

To try and make this simple to describe, I’ve gone a bit crazy with the colours. In green I’ve highlighted the labels for reads and writes – and now let’s walk through the columns:

  1. The first column is the function name, but we’re just focussing on TOTAL
  2. The second column is the total amount of reads that happened in this snapshot: 31.2GB
  3. The third column is the average number of read requests per second, i.e. 519.7 read IOPS
  4. The fourth column is the average volume of data read per second, i.e. 8.86 MB/sec read throughput
  5. The fifth column is the total amount of writes that happened in this snapshot: 17.4GB
  6. The sixth column is the average number of write requests per second, i.e. 201 write IOPS
  7. The seventh column is the average volume of data written per second, i.e. 4.94 MB/sec write throughput
  8. The eighth and ninth columns are not of interest to us here

The whole section is based on the DBA_HIST_IOSTAT_FUNCTION view. What we care about today is the IOPS figures (shown in red) and the Throughput figures (shown in blue). Pay attention to the comments in the view header which show that Data columns (including throughput) are multiples of 1024 while other columns (including IOPS) are multiples of 1000. It’s interesting that the two throughput values are obviously measured in MB/sec and yet are missing the “M” suffix – I assume this “falls off the end” of the column because of the number of decimal places displayed.

Now that we have these figures explained, let’s compare them to what we see in the Instance Activity Stats section of the same AWR report:

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                 1,848,576          513.3         121.5
physical read bytes                  29,160,300,544    8,096,201.8   1,917,053.5
physical read total IO requests           1,892,366          525.4         124.4
physical read total bytes            33,620,594,176    9,334,578.5   2,210,281.7
physical read total multi block              17,096            4.8           1.1
physical reads                            3,559,607          988.3         234.0

physical write IO requests                  671,728          186.5          44.2
physical write bytes                  8,513,257,472    2,363,660.5     559,677.7
physical write total IO requests            723,348          200.8          47.6
physical write total bytes           18,657,198,080    5,180,071.5   1,226,559.6
physical write total multi block             26,192            7.3           1.7
physical writes                           1,039,216          288.5          68.3

For this section we care about the per Second column because both IOPS and Throughput are measured using units per second. For both reads and writes there are two sets of statistics: those with the word total in them and those without. You can find the full description of 11gR2 statistics in the documentation, but the difference between the two is best summed up by this snippet which describes physical read total bytes:

Total size in bytes of disk reads by all database instance activity including application reads, backup and recovery, and other utilities. The difference between this value and “physical read bytes” gives the total read size in bytes by non-application workload.

I’ve underlined non-application workload here because this is critical. If you merely look at the Load Profile section at the top of an AWR report you will only see values for “application workload” I/O but this does not include stuff like RMAN backups, archive logging and so on… important stuff if you care about the actual I/O workload. For this reason, we only care about the following four statistics:

  1. physical read total IO requests (per second) = the average number of read IOPS
  2. physical read total bytes (per second) = the average read throughput in bytes per second
  3. physical write total IO requests (per second) = the average number of write IOPS
  4. physical write total bytes (per second) = the average write throughput in bytes per second

Again I’ve coloured the IOPS figures in red and the throughput figures in blue.

Tying It Together

Now that we have our two sets of values, let’s just compare them to make sure they align. The IOPS figures do not require any conversion but the throughput figures do: the values in the Instance Activity Stats section are in bytes/second and we want them to be in MB/second so we need to divide by 1048576 (i.e. 1024 * 1024).

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

519.7

525.4

1.09%

Write IOPS

201.0

200.8

0.10%

Read Throughput

8.86241 MB/sec

9,334,578.5 bytes/sec

8.90215 MB/sec

0.45%

Write Throughput

4.94263 MB/sec

5,180,071.5 bytes/sec

4.94010 MB/sec

0.05%

I’ve calculated the error percentages here to see how far the figures vary. It is my assumption that the Instance Activity Stats are accurate and that any margin of error in the IOStat figures comes as a result of sampling issues. The highest error percentage we see here is just over 1%, which is hardly a problem in my opinion.

Don’t Believe The Stats

So far I have no complaints about the matching of statistics in the AWR report. But now let me introduce you to the AWR report that has been puzzling me recently:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Direct Reads       3.8T 2.7E+04    3.2G    405M     3.6 .335473       0     N/A
Direct Writes        0M     0.0      0M  164.5G  1159.0 139.519       0     N/A
Buffer Cache Re   63.2G  3177.6 53.6086      0M     0.0      0M 3292.5K    23.4
DBWR                 0M     0.0      0M      2G    45.3 1.71464       0     N/A
LGWR                 0M     0.0      0M    485M    19.2 .401739   19.7K     0.9
Others             220M    12.3 .182232     26M     0.7 .021536   15.6K    34.5
Streams AQ           0M     0.0      0M      0M     0.0      0M       1     9.0
TOTAL:             3.9T 3.0E+04    3.3G  167.4G  1227.8 141.992 3327.8K    23.3
          -------------------------------------------------------------

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read total IO requests          35,889,568       29,728.4       3,239.4
physical read total bytes         4,261,590,852,608 3.52999864E+09 3.8465483E+08

physical write total IO requests          1,683,381        1,394.4         151.9
physical write total bytes          205,090,714,624  169,882,555.1  18,511,663.0

Again I’ve coloured the IOPS measurements in red and the throughput measurements in blue. And as before we need to convert the bytes per second values shown in the Instance Activity Stats section to MB/sec as shown in the IOStat by Function Summary section.

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

30,000

29,728.4

0.91%

Write IOPS

1227.8

1,394.4

11.95%

Read Throughput

3.3 GB/sec

3,379.2 MB/sec

3,529,998,640 bytes/sec

3,366.469 MB/sec

0.38%

Write Throughput

141.992 MB/sec

169,882,555.1 bytes/sec

162.01263 MB/sec

12.36%

Do you see what’s bugging me here? The write values for both IOPS and throughput are out by over 10% when I compare the values in IOStat by Function against the Instance Activity Stats. Ten percent is a massive margin of error at this level – we’re talking 20MB/sec. To translate that into something easier to understand, if 20MB/sec were sustained over a 24 hour period it would amount to over 1.6TB of data. I’ve seen smaller data warehouses.

So why is this happening? Unfortunately I don’t have access to the system where this AWR report was created, so I cannot tell if, for example, there was an instance restart between the start and end snapshots (although the elapsed time of the report was just 20 minutes so it seems unlikely).

The truth is I don’t know. Which is why I’m writing about it here… if you think you have the answer, or just as importantly if you see the same behaviour, let me know!

New Cookbook: Oracle Linux 6 Update 5 within an Oracle VM Template

Oracle-VMI’ve posted a new installation cookbook for using Oracle within a virtual machine running on Oracle VM. Surprisingly, I was unable to come up with a satisfactory method of accessing external storage that did not involve the use of Oracle ASMLib

Oracle Linux 6 Update 5 within an Oracle VM Template

Oracle AWR Reports: When Averages Lie

truth-sign

I was recently contacted by a customer who runs a busy ERP system on their Oracle database. Their users have been screaming for some time about unpredictable performance, where application screens which would usually open in under a second suddenly take an age. ERP systems are legendary for this, of course, because as each screen loads the application tends to fire off a glut of queries to populate all the various fields, graphs and so on. When performance problems strike, the elapsed time for each query rises, which usually results in “stuff going slowly”. And unfortunately, because of something known to psychologists as the availability heuristic, users tend not to remember the times when performance was acceptable… just the occasions when they were left frustrated.

Initial Investigation

The first thing I ask for in situations like this is a set of AWR reports. Customers are generally happy to part with such information, whereas the running of bespoke SQL scripts to harvest data can often cause delays as the scripts have to be investigated and tested. AWR reports, I’m sure you know, contain information about the behaviour and performance of the database between two points in time – and much of the data is averaged. For this reason, I always always ask for a set of multiple, consecutive reports covering the problematic period where each report covers no more than one hour. I’d prefer less than one hour per report, but the default configuration of Oracle results in hourly snapshots so that’s usually the limit… as it was for this customer.

As a first pass, I like to run the AWR reports through my awr-parser.sh script. This parses them all and produces a CSV file as output. This can then be easily used to graph information such the amount of time spent working on CPU versus time spent waiting on I/O requests to complete. Often this is enough to get a basic idea of what’s happening….

awr-averages-lie-work-vs-wait

In the above graph, each column shows data from a single (hourly) AWR report; blue represents DB CPU while red represents User I/O wait time. If we are looking to increase the user experience with low-latency storage then we would hope to find two things: a) lots of User I/O wait time, and b) average latency which is higher than we would expect a flash array to deliver.

Unfortunately, the first requirement isn’t really there in the above graph. There just isn’t enough red, except for peaks towards the end of the day where there appears to be some batch job or backup taking place. A peak at the AWR reports shows that this is in fact a scheduled job which performs an online rebuild of an index. With more capable storage we could make this go a lot faster, so would that perhaps help the customer?

Sadly, the answer was no, it wouldn’t. The customer specifically stated that it’s the online day where they struggle and not the index rebuild. Hmmm.

More Graphs

Let’s have a look at some of the other information that came out of the AWR reports to see if anything interesting shows up:

awr-averages-lie-cpu-utilization

Plenty of spare CPU capacity here. In fact, if you think about how much Oracle licenses cost per CPU – and how much CPU is not being used… it looks like this database server represents a significant over-investment.

awr-averages-lie-throughput

Not much in the way of I/O… there are spikes here and there, including those for the index rebuilds, but overall the I/O rates are very low and for many samples we barely scratch 10MiB/sec.

So what have we got? A system which hardly performs any I/O and doesn’t seem to spend any time waiting on I/O during the period the users are complaining about. Time to go back to the customer and ask for some more detail…

Second Attempt

After a plea for more details, I received another batch of AWR reports covering a single 24 hour period, as well as the crucial information that complaints were received between 5am and 6am. Perfect. Let’s have a look at that throughput graph again:

awr-averages-lie-2nd-throughput

Again, the throughput is very low – including around the critical 5am-6am period where complaints were received – which might fool us into thinking any problems are not related to the storage. But what about if we plot a graph of I/O Operations Per Second (IOPS)?

awr-averages-lie-2nd-iops

Bingo! There’s a big spike of read IOPS at exactly the time where problems were reported, which we can put down to a large amount of random physical reads. We know from the customer that they have a multi-controller monolithic disk array – the top of the range product from the world’s biggest disk array vendor. But it’s still a disk array, which means random I/O frequently results in high latency as somewhere a disk head has to move across the platter and then the platter has to rotate to the correct sector.

So we should be able to just check the average wait time for random physical reads (the wait event db file sequential read) and it will reveal itself as the problem. Right? We’ll expect to see waits in the region of 4-8ms or perhaps even worse if it’s really struggling.

Here’s a graph of the average user I/O wait time seen across each AWR report – don’t forget we’re especially interested in the troublesome 5am-6am period:

awr-averages-lie-average-wait

Not what we expected. The 5am-6am slot is showing the best latency figures for the entire day! And this is when users are complaining?

Averages Lie

A disk array like this usually has a reasonably large DRAM cache and quite possibly a set of SSDs being used for further caching or tiering of data. If a physical read can be satisfied from the cache, it will likely be serviced in under a millisecond. But if we experience a cache miss, we’re forced to pay the penalty of mechanical latency and wait for many milliseconds. Thus there are two ranges of response times: sub-millisecond and many milliseconds.

Average wait times won’t show this because they are simply the average of all waits over the period of the AWR report (an entire hour in this case). What if many of the reads are cache hits but some still miss? Maybe it’s possible that there are enough misses to cause users to complain, but enough hits to make the average look acceptable?

Luckily, there is another section in the AWR report which can help here: the Wait Event Histogram. Here’s the histograms covering db file sequential read waits (random physical reads) for our 24 hour period:

$ cat header.txt; for file in `ls -1`; do
grep -A 40 "Wait Event Histogram" $file | grep "db file sequential read" | head -1
done
                                                    % of Waits                  
                                 -----------------------------------------------
                           Total                                                
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----  
db file sequential read     3329  99.2    .8    .0    .0    .0    .0    .0    .0
db file sequential read     3487  99.4    .6    .0    .0
db file sequential read    35.4K  21.0   1.2   2.5  23.2  39.0  11.9   1.1
db file sequential read    82.9K  34.6   1.8   3.4  15.9  31.9  11.2   1.2
db file sequential read    428.6  67.4   2.1   2.5  10.1  15.2   2.4    .3
db file sequential read    4440.  92.6   1.3   1.2   1.9   2.6    .4    .0
db file sequential read    2013K  91.7   1.4   1.2   2.1   3.0    .5    .0
db file sequential read    131.7  45.7   1.1   2.8  15.6  29.1   5.4    .3
db file sequential read    76.3K  20.7   1.7   2.8  21.7  44.0   8.7    .4
db file sequential read    83.7K  31.1   1.1   2.4  16.6  38.1  10.0    .7
db file sequential read    430.4  79.6   1.5   2.0   4.4   9.3   2.9    .3
db file sequential read    106.6  32.3   1.7   3.0  15.4  36.1  10.3   1.2
db file sequential read      94K  23.8   1.8   2.7  16.6  41.7  12.1   1.2
db file sequential read    248.6  36.0   1.8   2.9  14.8  34.5   8.9   1.1
db file sequential read    202.5  31.9   1.4   2.0   9.0  38.0  16.1   1.6
db file sequential read    237.7  20.2   1.9   2.9  17.7  43.5  12.5   1.2
db file sequential read    49.5K   9.4   1.0   2.7  24.1  48.0  13.6   1.1
db file sequential read    75.6K  42.4   1.5   3.0  16.9  29.9   5.9    .5
db file sequential read    88.7K  61.4   1.3   2.5  12.4  18.2   3.8    .4
db file sequential read    644.3  84.3   1.0   1.5   4.3   7.3   1.5    .2
db file sequential read    61.3K  34.9   6.9   4.4  11.0  31.2  10.5   1.1
db file sequential read      330  96.7   3.1    .2    .0    .0
db file sequential read      12K  88.7   9.5   1.2    .4    .2    .1    .0    .0

Now the problem becomes clear. Although many reads are hitting the cache (<1ms bucket), there are an awful lot that are missing. The <2ms and <4ms buckets are virtually empty and then around <8ms we start to see big numbers again. Those are cache misses which incur the penalty of seek time on a hard disk drive somewhere in the array.

What’s more, it’s clear that the array is under some pressure because in many of the AWR reports there are substantial waits in the <16ms and above buckets. I’ve highlighted in red the occasions where over one third of read requests took over 8ms. That’s exactly the sort of behaviour that is making people replace disks with flash.

Let’s visualise the above data. I’ve gone back to the older set of AWR reports and graphed the histogram buckets to see if we can get a feel for the distribution of waits:

awr-averages-lie-random-read-histogram

As you can see, I’ve tried to go for a heat map-style colour scheme, with green indicating cache hits on the storage array and the colours moving from yellow through orange to red indicating cache misses which result in disk drive seek time. There are also some real outliers in black where requests took over one thousand milliseconds to be serviced (ouch).

Another way to look at it is to make the picture clearer by reducing the detail. In the graph below I’ve simply assumed that everything sub-millisecond is a storage array cache hit and everything else is a miss:

awr-averages-lie-cache-hits-and-misses

Clearly there are a lot of physical reads paying the price of mechanical latency here.

Conclusion

What this customer does next is up for debate – there are more than just technical issues to consider. But the point of this post is to illustrate that while average wait times are great for getting a rough feel of how a system is behaving, they also offer huge scope to lose details. And sometimes those details are important. The devil, as they say, is truly in the detail…

Update: January 2015 – I have now added the ability to read Wait Event Histograms to my awr-parser.sh script which you can find on my Useful Scripts page…

Oracle, Parallelism and Direct Path Reads… on Flash

3000-open-case

Guest Post

This is another guest post from my buddy Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Because he’s an American, Nate believes that “football” is played using your hands and that the ball is actually egg-shaped. This is of course ridiculous, because as the entire rest of the world knows, this is football whereas the game Nate is thinking of is actually called “HandEgg”. Now that we’ve cleared that up, over to you, Nate:

Lately, I’ve been running into much confusion around Oracle’s direct path IO functionality (11g+) and, unusually, not all of that confusion is my own. There is a perplexing lack of literature and experimentation with direct path IO on the Internet today. Seriously, I’ve looked. I decided I needed to better understand this event and its timing in order to properly extend suggestions to customers. I set about trying to prove some things I thought I knew, and I managed to confirm several suspicions but also surprised myself with some unexpected results. I’d like to share these in hopes of clarifying this event for everyone in practical terms.

Direct Path IO Background

To set the stage a bit, at the highest level, Oracle created the direct path IO event to describe an IO executed by an Oracle process that reads into (or writes from) the process global area (think of this as the session’s private memory) directly from (to) storage, bypassing the Oracle buffer cache. The rationale is this: full table scans of large tables into the buffer cache consume a lot of space, pushing out likely useful buffers in favor of buffers unlikely to be needed again in the near future. Reading directly into the process global area instead of the shared global area keeps full table scans from polluting the buffer cache and diminishing its overall effectiveness. Since the direct path IO is used for full scanning large objects, it looks to the database’s DB_FILE_MULTIBLOCK_READ_COUNT (henceforth referred to as DBMRC) setting for guidance on the size of IO calls to issue.

Makes sense. But what’s been confusing me is the apparently inconsistent performance of direct path reads and writes, even against Violin’s all-flash arrays. With random and other multi-block IO events showing very low, consistent performance, direct path reads can still be all over the board. How is that? Is it truly impacting performance? How can I make it better, or should I even try? After seeing this at a number of customer installations, I decided to run some tests on a smallish lab server attached to a single Violin array.

The Setup

I have a test database with a number of tables almost exactly 125GB in size full of randomized data. Full-scanning one of these tables via “select count(*)” was plenty to exercise the direct path read repeatedly, varying parallelism and DBMRC. My goal was to see the effect of these settings on both elapsed time and perceived latency. With an 8K database block size (RHEL 6.3, Oracle 12.1.0.1, ASM), I ran the test with DBMRC set to 4, then 8, 16, and finally 128. I ran each test with no parallelism, then with “parallel 16” hinted. So what did I see?

Test Results

direct-path-read-testing

Note that elapsed times represent the time my query returned to the SQL*Plus prompt with the “set timing on” directive applied to my session and are not 100% representative of time spent on the database but are close enough for my purposes. Total direct path read (DPR) time was pulled from the respective AWR report after execution finished. I asterisked the Physical Read Requests column because some of the reports showed 0 physical reads for the test SQL, while it was clear from the total physical reads that my read operation was the only possible culprit; therefore I felt justified in attributing that total (minus a few here and there from the AWR snapshotting process) to the test SQL. Note also that the best elapsed time was achieved with the lowest DBMRC and parallel 16. Worst time by far was also obtained with DBMRC set to 4 but without parallelism—although it accumulated the least amount of wait time on DPR. In general, throwing more cores at the problem improved performance hugely; not surprising, but noteworthy. We know that flash does not benefit from multi-block IO as a rule: at the lowest level, every IO is effectively a random IO, and larger blocks / groups of blocks are fetched independently, assembled, and returned to the caller as a single unit. However, there is a definite overhead in issuing IO requests, waiting for the calls to return, and consuming the requested data. This is evidenced by the high elapsed time for the single-threaded run with DBMRC set to 4: the least amount of reported IO wait time still contributed to the longest overall elapsed time.

So what do these values tell us? For one thing, parallelism is your friend. One core performing a FTS just isn’t going to get the job done nearly as quickly as multiple cores. Also, parallelism vastly trumps DBMRC as a tool for improving performance on flash when CPU resources are available. Performance between parallel processing runs was within 2%, no matter what the DBMRC setting. This I expected, having come into the testing with the assumption that DBMRC was irrelevant when working with flash. I was surprised at the exceedingly high elapsed time with the single-threaded query using small DBMRC. I would expect that to be higher than the others, but not nearly 3X longer than the single-threaded run with DBMRC at 128.

These revelations are mildly interesting, but what I find much more curious is the difference in reported DPR latency. Certainly, a highly parallel execution can accumulate more database time than wall clock time for any event. But we can tell from the elapsed times that, when we’re not starving the database for parallelism, DBMRC is practically meaningless when applied to flash. Yet the calculation of the average latency of the event is mysterious in that 1) 16 threads operating with DBMRC of 128 experiences roughly 4X the number of waits the single-threaded execution performs; 2) it does so apparently at about 13X the average latency of the single-threaded run; and 3) it racks up about 51X the amount of total DPR wait time.

What’s worse is that DPR stats are very strangely represented in the Tablespace IO stats section of the report. Here’s the snippet from test run #2:

                     Av       Av     Av      1-bk  Av 1-bk          Writes   Buffer  Av Buf
Tablespace   Reads   Rds/s  Rd(ms) Blks/Rd   Rds/s  Rd(ms)  Writes   avg/s    Waits  Wt(ms)
---------- ------- ------- ------- ------- ------- ------- ------- ------- -------- -------
DEMO       4.1E+06  49,395     0.0     4.0       2     0.0       0       0        0     0.0

We have to cut Oracle some slack on the Av Rds/s value here because it’s now averaging over the time it took me to start the test after my initial snapshot, then realize the test was done and execute another AWR snapshot to end the reporting period. Fine. But an average read time of 0.0ms?! Clearly, Oracle is recording some number of reads, but it’s not reporting timing on them at all in this section of the report. We have to look to the SQL Ordered by Physical Reads (Unoptimized) section of the report to confirm it’s actually doing a relevant number of IO requests:

-> Total Physical Read Requests:       4,092,389
-> Captured SQL account for    0.0% of Total
-> Total UnOptimized Read Requests:       4,092,389
-> Captured SQL account for    0.0% of Total
-> Total Optimized Read Requests:               1
-> Captured SQL account for    0.0% of Total
 
[some lines removed]
 
UnOptimized   Physical              UnOptimized
  Read Reqs   Read Reqs Executions Reqs per Exe   %Opt %Total    SQL Id
----------- ----------- ---------- ------------ ------ ------ -------------
          0           0          1          0.0    N/A    0.0 4kpvpt49hm3nf
Module: SQL*Plus
   PDB: DEMO
select /*+ parallel 16 */ count(*) from demo.length100_1

Oh, wait.  Oracle doesn’t credit my query with any physical read requests.  I have to look at the total just above in the report, and see that only the AWR snapshot performed any other IO on the system, and subtract that from the total.  Sigh.  At least ~4.1M reads at 32KB comes close to 125GB.

So what gives, Oracle?  I’ve read some Oracle notes and other blogs on the subject of DPR, and they suggest the wait event is not necessarily triggered when the IO call is initially issued, but instead when the session decides it needs all outstanding DPR IOs it has issued to complete before moving on—or it fills up all its “slots” and has to wait for those to free up.  Thus the under-reporting of the actual number of DPR waits and the artificially high wait time for each of those waits:  fewer waits, along with potentially many IO requests outstanding when the wait is triggered and timing starts.  But nowhere in all of this is there a set of numbers that I can trust to accurately describe my DPR performance.  The fact that DPR IO is completely left out of tablespace timings is seriously troubling:  we trust these stats to determine “hot” tablespaces and under-performing mount points.  This throws all kinds of doubt into the mix.

What can I say about Oracle’s DPR at this point?  While it works just fine and serves its purpose, the instrumentation appears to be lacking, even in Oracle 12.1.  After this testing, I feel even more confident telling customers to ignore the latency reported for this event—at least for now.  And I’ve confirmed my belief that, with any sort of parallelism enabled on your database, DBMRC is largely irrelevant for flash storage and only adds a mystery factor to reported latencies.  Yes, setting this to a low value will affect costing of FTS vs. index access, so you should verify that plans currently employing FTS that you want to remain that way still do.  This is easy enough with an alter session and explain plan.  With that, Oracle, the ball is in your court:  please define your terms, fix your instrumentation around DPR, or tell customers to stop worrying about DPR latencies.  Meanwhile, I’m going to advise people who are otherwise happy with their performance but want better latency numbers in their reports to set DBMRC lower and get on with their lives.