New Installation Cookbook: Oracle Linux 6.7 with Oracle 11.2.0.4 RAC

cookbookI’ve updated my install cookbooks page to include a new cookbook for installation of Oracle 11.2.0.4 Real Application Clusters on Oracle Linux 6.7.

This is also the first one I’ve published since I left the employment of Violin Memory to work for Kaminario, so this install uses a Kaminario K2 All Flash Array. However, it applies very well to any Oracle RAC installation which uses relatively capable storage.

Enjoy:

https://flashdba.com/install-cookbooks/oracle-linux-6-7-with-oracle-11-2-0-4-rac/

Oracle’s ASM Filter Driver Revisited

filter

Almost exactly a year ago I published a post covering my first impressions of the ASM Filter Driver (ASMFD) released in Oracle 12.1.0.2, followed swiftly by a second post showing that it didn’t work with 4k native devices.

When I wrote that first post I was about to start my summer holidays, so I’m afraid to admit that I was a little sloppy and made some false assumptions toward the end – assumptions which were quickly overturned by eagle-eyed readers in the comments section. So I need to revisit that at some point in this post.

But first, some background.

Some Background

If you don’t know what ASMFD is, let me just quote from the 12.1 documentation:

Oracle ASM Filter Driver (Oracle ASMFD) is a kernel module that resides in the I/O path of the Oracle ASM disks. Oracle ASM uses the filter driver to validate write I/O requests to Oracle ASM disks.

The Oracle ASMFD simplifies the configuration and management of disk devices by eliminating the need to rebind disk devices used with Oracle ASM each time the system is restarted.

The Oracle ASM Filter Driver rejects any I/O requests that are invalid. This action eliminates accidental overwrites of Oracle ASM disks that would cause corruption in the disks and files within the disk group. For example, the Oracle ASM Filter Driver filters out all non-Oracle I/Os which could cause accidental overwrites.

This is interesting, because ASMFD is considered a replacement for Oracle ASMLib, yet the documentation for ASMFD doesn’t make all of the same claims that Oracle makes for ASMLib. Both ASMFD and ASMLib claim to simplify the configuration and management of disk devices, but ASMLib’s documentation also claims that it “greatly reduces kernel resource usage“. Doesn’t ASMFD have this effect too? What is definitely a new feature for ASMFD is the ability to reject invalid (i.e. non-Oracle) I/O operations to ASMFD devices – and that’s what I got wrong last time.

However, before we can revisit that, I need to install ASMFD on a brand new system.

Installing ASMFD

Last time I tried this I made the mistake of installing 12.1.0.2.0 with no patch set updates. Thanks to a reader called terry, I now know that the PSU is a very good idea, so this time I’m using 12.1.0.2.3. First let’s do some preparation.

Preparing To Install

I’m using an Oracle Linux 6 Update 5 system running the Oracle Unbreakable Enterprise Kernel v3:

[root@server4 ~]# cat /etc/oracle-release
Oracle Linux Server release 6.5
[root@server4 ~]# uname -r
3.8.13-26.2.3.el6uek.x86_64

As usual I have taken all of the necessary pre-installation steps to make the Oracle Universal Installer happy. I have disabled selinux and iptables, plus I’ve configured device mapper multipathing. I have two sets of 8 LUNs from my Violin storage: 8 using 512e emulation mode (512 byte logical block size but 4k physical block size) and 8 using 4kN native mode (4k logical and physical block size). If you have any doubts about what that means, read here.

[root@server4 ~]# ls -l /dev/mapper
total 0
crw-rw---- 1 root root 10, 236 Jul 20 16:52 control
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpatha -> ../dm-0
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap1 -> ../dm-1
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap2 -> ../dm-2
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap3 -> ../dm-3
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata1 -> ../dm-6
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata2 -> ../dm-7
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata3 -> ../dm-8
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata4 -> ../dm-9
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata5 -> ../dm-10
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata6 -> ../dm-11
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata7 -> ../dm-12
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata8 -> ../dm-13
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data1 -> ../dm-14
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data2 -> ../dm-15
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data3 -> ../dm-16
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data4 -> ../dm-17
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data5 -> ../dm-18
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data6 -> ../dm-19
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data7 -> ../dm-20
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data8 -> ../dm-21
lrwxrwxrwx 1 root root       8 Jul 20 16:53 vg_halfserver4-lv_home -> ../dm-22
lrwxrwxrwx 1 root root       7 Jul 20 16:53 vg_halfserver4-lv_root -> ../dm-4
lrwxrwxrwx 1 root root       7 Jul 20 16:53 vg_halfserver4-lv_swap -> ../dm-5

The 512e devices are shown in green and the 4k devices shown in red. The other devices here can be ignored as they are related to the default filesystem layout of the operating system.

Installing Oracle 12.1.0.2.3 Grid Infrastructure (software only)

This is where the first challenge comes. When you perform a standard install of Oracle 12c Grid Infrastructure you are asked for storage devices on which you can locate items such as the ASM SPFILE, OCR and voting disks. In the old days of using ASMLib you would have prepared these in advance, because ASMLib is a separate kernel module located outside of the Oracle GI home. But ASMFD is part of the Oracle Home and so doesn’t exist prior to installation. Thus we have a chicken and egg situation.

Even worse, I know from bitter experience that I need to install some patches prior to labelling my disks, but I can’t install patches without installing the Oracle home either.

So the only thing for it is to perform a Software Only installation from the Oracle Universal Installer, then apply the PSU, then create an ASM instance and finally label the LUNs with ASMFD. It’s all very long winded. It wouldn’t be a problem if I was migrating from an existing ASMLib setup, but this is a clean install. Such is the price of progress.

To save this post from becoming longer and more unreadable than a 12c AWR report, I’ve captured the entire installation and configuration of 12.1.0.2.3 GI and ASM on a separate installation cookbook page, here:

Installing 12.1.2.0.3 Grid Infrastructure with Oracle Linux 6 Update 5

It’s simpler that way. If you don’t want to go and read it, just take it from me that we now have a working ASM instance which currently has no devices under its control. The PSU has been applied so we are ready to start labelling.

Using ASM Filter Driver to Label Devices

The next step is to start labelling my LUNs with ASMFD. I’m using what the documentation describes as an “Oracle Grid Infrastructure Standalone (Oracle Restart) Environment”, so I’m following this set of steps in the documentation.

Step one tells me to run a dsget command and then a dsset command to add a diskstring of ‘AFD:*’. Ok:

[oracle@server4 ~]$ asmcmd dsget
parameter:
profile:++no-value-at-resource-creation--never-updated-through-ASM++
[oracle@server4 ~]$ asmcmd dsset 'AFD:*'
[oracle@server4 ~]$ asmcmd dsget
parameter:AFD:*
profile:AFD:*

Next I need to stop CRS (I’m using a standalone config so actually it’s HAS):

[root@server4 ~]# crsctl stop has
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'server4'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'server4'
CRS-2673: Attempting to stop 'ora.asm' on 'server4'
CRS-2673: Attempting to stop 'ora.evmd' on 'server4'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'server4' succeeded
CRS-2677: Stop of 'ora.evmd' on 'server4' succeeded
CRS-2677: Stop of 'ora.asm' on 'server4' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'server4'
CRS-2677: Stop of 'ora.cssd' on 'server4' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'server4' has completed
CRS-4133: Oracle High Availability Services has been stopped.

And then I need to run the afd_configure command (all as the root user). Before and after doing so I will check for any loaded kernel modules with oracle in the name, so see what changes:

[root@server4 ~]# lsmod | grep oracle
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm
[root@server4 ~]# asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
[root@server4 ~]# lsmod | grep oracle
oracleafd             211540  0
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm
[root@server4 ~]# asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED' on host 'server4'

Notice the new kernel module called oracleafd. Also, AFD is showing that “filtering is enabled” – I guess this relates to the protection against invalid writes.

Time to start up HAS or CRS again:

[root@server4 ~]# crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Ok, let’s start labelling those devices.

Labelling (Incorrectly)

Now remember that I am testing with two sets of devices here: 512e and 4k. The 512e devices are emulating a 512 byte blocksize, so they should result in ASM creating diskgroups with a blocksize of 512 bytes – thus avoiding all the tedious bugs from which Oracle suffers when using 4096 byte diskgroups.

So let’s just test a 512e LUN to see what happens when I label it and present it to ASM. First, the label is created using the afd_label command:

[oracle@server4 ~]$ ls -l /dev/mapper/v512data1
lrwxrwxrwx 1 root root 8 Jul 24 10:30 /dev/mapper/v512data1 -> ../dm-14
[oracle@server4 ~]$ ls -l /dev/dm-14
brw-rw---- 1 oracle dba 252, 14 Jul 24 10:30 /dev/dm-14
[oracle@server4 ~]$ asmcmd afd_label v512data1 /dev/mapper/v512data1
[oracle@server4 ~]$ asmcmd afd_lsdsk
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
V512DATA1                   ENABLED   /dev/sdpz

Well, it worked.. sort of. The path we can see in the lsdsk output does not show the /dev/mapper/v512data1 multipath device I specified… instead it’s one of the non-multipath /dev/sd* devices. Why?

Even worse, look what happens when I check the SECTOR_SIZE column of the v$asm_disk view in ASM:

SQL> select group_number, name, sector_size, block_size, state
  2  from v$asm_diskgroup;

GROUP_NUMBER NAME	SECTOR_SIZE BLOCK_SIZE STATE
------------ ---------- ----------- ---------- -----------
	   1 V512DATA	       4096	  4096 MOUNTED

Even though my LUNs are presented as 512e, ASM has chosen to see them as 4096 byte. That’s not wanted I want. Gaah!

Labelling (Correctly)

To fix this I need to unlabel that LUN so that AFD has nothing under its control, then update the oracleafd_use_logical_block_size parameter via the special SYSFS files /sys/modules/oracleafd:

[root@server4 ~]# cd /sys/module/oracleafd
[root@server4 oracleafd]# ls -l
total 0
-r--r--r-- 1 root root 4096 Jul 20 14:43 coresize
drwxr-xr-x 2 root root    0 Jul 20 14:43 holders
-r--r--r-- 1 root root 4096 Jul 20 14:43 initsize
-r--r--r-- 1 root root 4096 Jul 20 14:43 initstate
drwxr-xr-x 2 root root    0 Jul 20 14:43 notes
drwxr-xr-x 2 root root    0 Jul 20 14:43 parameters
-r--r--r-- 1 root root 4096 Jul 20 14:43 refcnt
drwxr-xr-x 2 root root    0 Jul 20 14:43 sections
-r--r--r-- 1 root root 4096 Jul 20 14:43 srcversion
-r--r--r-- 1 root root 4096 Jul 20 14:43 taint
--w------- 1 root root 4096 Jul 20 14:43 uevent
[root@server4 oracleafd]# cd parameters
[root@server4 parameters]# ls -l
total 0
-rw-r--r-- 1 root root 4096 Jul 20 14:43 oracleafd_use_logical_block_size
[root@server4 parameters]# cat oracleafd_use_logical_block_size
0
[root@server4 parameters]# echo 1 > oracleafd_use_logical_block_size
[root@server4 parameters]# cat oracleafd_use_logical_block_size
1

After making this change, AFD will present the logical blocksize of 512 bytes to ASM rather than the physical blocksize of 4096 bytes. So let’s now label those disks again:

[root@server4 mapper]# for lun in 1 2 3 4 5 6 7 8; do
> asmcmd afd_label v512data$lun /dev/mapper/v512data$lun
> done
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
[root@server4 mapper]# asmcmd afd_lsdsk
Connected to an idle instance.
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
V512DATA1                   ENABLED   /dev/mapper/v512data1
V512DATA2                   ENABLED   /dev/mapper/v512data2
V512DATA3                   ENABLED   /dev/mapper/v512data3
V512DATA4                   ENABLED   /dev/mapper/v512data4
V512DATA5                   ENABLED   /dev/mapper/v512data5
V512DATA6                   ENABLED   /dev/mapper/v512data6
V512DATA7                   ENABLED   /dev/mapper/v512data7
V512DATA8                   ENABLED   /dev/mapper/v512data8

Note the correct multipath devices (“/dev/mapper/*”) are now being shown in the lsdsk command output. If I now create an ASM diskgroup on these LUNs, it will have a 512 byte sector size:

SQL> get afd.sql
  1  CREATE DISKGROUP V512DATA EXTERNAL REDUNDANCY
  2  DISK 'AFD:V512DATA1', 'AFD:V512DATA2',
  3	  'AFD:V512DATA3', 'AFD:V512DATA4',
  4	  'AFD:V512DATA5', 'AFD:V512DATA6',
  5	  'AFD:V512DATA7', 'AFD:V512DATA8'
  6  ATTRIBUTE
  7	  'compatible.asm' = '12.1',
  8*	  'compatible.rdbms' = '12.1'
SQL> /

Diskgroup created.

SQL> select disk_number, mount_status, header_status, state, sector_size, path
  2  from v$asm_disk;

DISK_NUMBER MOUNT_S HEADER_STATU STATE	  SECTOR_SIZE PATH
----------- ------- ------------ -------- ----------- --------------------
	  0 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA1
	  1 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA2
	  2 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA3
	  3 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA4
	  4 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA5
	  5 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA6
	  6 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA7
	  7 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA8

8 rows selected.

SQL> select group_number, name, sector_size, block_size, state
  2  from v$asm_diskgroup;

GROUP_NUMBER NAME	SECTOR_SIZE BLOCK_SIZE STATE
------------ ---------- ----------- ---------- -----------
	   1 V512DATA		512	  4096 MOUNTED

Phew.

Failing To Label 4kN Devices

So what about my 4k native mode devices, the ones with a 4096 byte logical block size? What happens if I try to label them?

[root@server4 ~]# asmcmd afd_label V4KDATA1 /dev/mapper/v4kdata1
Connected to an idle instance.
ASMCMD-9513: ASM disk label set operation failed.

Yeah, that didn’t work out did it? Let’s look in the trace file:

[root@server4 ~]# tail -5 /u01/app/oracle/log/diag/asmcmd/user_root/server4/alert/alert.log
24-Jul-15 12:38 ASMCMD (PID = 8695) Given command - afd_label V4KDATA1 '/dev/mapper/v4kdata1'
24-Jul-15 12:38 NOTE: Verifying AFD driver state : loaded
24-Jul-15 12:38 NOTE: afdtool -add '/dev/mapper/v4kdata1' 'V4KDATA1'
24-Jul-15 12:38 NOTE:
24-Jul-15 12:38 ASMCMD-9513: ASM disk label set operation failed.

I’ve struggled to find any more meaningful message, even when I manually run the afdtool command shown in the log – but it seems pretty likely that this is failing due to the device being 4kN. I therefore assume that AFD still isn’t 4kN ready. I do wish Oracle would make some meaningful progress on its support of 4kN devices…

I/O Filter Protection

So now let’s investigate this protection that ASMFD claims to have against non-Oracle I/Os. First of all, what do those files in /dev/oracleafd/disks actually contain?

[root@server4 ~]# cd /dev/oracleafd/disks
[root@server4 disks]# ls -l
total 32
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA1
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA2
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA3
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA4
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA5
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA6
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA7
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA8
[root@server4 disks]# cat V512DATA1
/dev/mapper/v512data1

Aha. This is what I got wrong in my original post last year, because – keen as I was to start my summer vacation – I didn’t spot that these files are simply pointers to the relevant multipath device in /dev/mapper. So let’s follow the pointers this time.

Let’s remind ourselves that the files in /dev/mapper are actually symbolic links to /dev/dm-* devices:

root@server4 disks]# ls -l /dev/mapper/v512data1
lrwxrwxrwx 1 root root 8 Jul 24 12:34 /dev/mapper/v512data1 -> ../dm-14
[root@server4 disks]# ls -l /dev/dm-14
brw-rw---- 1 oracle dba 252, 14 Jul 24 12:34 /dev/dm-14

So it’s these /dev/dm-* devices that are at the end of the trail we just followed. If we dump the first 64 bytes of this /dev/dm-14 device, we should be able to see the AFD label:

[root@server4 disks]# od -c -N 64 /dev/dm-14
0000000                           (   o   u   t
0000020                                
0000040   O   R   C   L   D   I   S   K   V   5   1   2   D   A   T   A
0000060   1

There it is. We can also read it with kfed to see what ASM thinks of it:

[root@server4 ~]# kfed read /dev/dm-14
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                  1953853224 ; 0x00c: 0x74756f28
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 74756F28  [............(out]
000000010 00000000 00000000 00000000 00000000  [................]
000000020 4C43524F 4B534944 32313556 41544144  [ORCLDISKV512DATA]
000000030 00000031 00000000 00000000 00000000  [1...............]
000000040 00000000 00000000 00000000 00000000  [................]
  Repeat 251 times

So what happens if I overwrite it, as the root user, with some zeros? And maybe some text too just for good luck?

root@server4 ~]# dd if=/dev/zero of=/dev/dm-14 bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.00570833 s, 735 MB/s
[root@server4 ~]# echo CORRUPTION > /dev/dm-14
[root@server4 ~]# od -c -N 64 /dev/dm-14
0000000   C   O   R   R   U   P   T   I   O   N  \n          
0000020

It looks like it’s changed. I also see that if I dump it from another session which opens a fresh file descriptor. Yet in the /var/log/messages file there is now a new entry:

F 4626129.736/150724115533 flush-252:14[1807]  afd_mkrequest_fn: write IO on ASM managed device (major=252/minor=14)  not supported i=0 start=0 seccnt=8  pstart=0  pend=41943040
Jul 24 12:55:33 server4 kernel: quiet_error: 1015 callbacks suppressed
Jul 24 12:55:33 server4 kernel: Buffer I/O error on device dm-14, logical block 0
Jul 24 12:55:33 server4 kernel: lost page write due to I/O error on dm-14

Hmm. It seems like ASMFD has intervened to stop the write, yet when I query the device I see the “new” data. Where’s the old data gone? Well, let’s use kfed again:

[root@server4 ~]# kfed read /dev/dm-14
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                  1953853224 ; 0x00c: 0x74756f28
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 74756F28  [............(out]
000000010 00000000 00000000 00000000 00000000  [................]
000000020 4C43524F 4B534944 32313556 41544144  [ORCLDISKV512DATA]
000000030 00000031 00000000 00000000 00000000  [1...............]
000000040 00000000 00000000 00000000 00000000  [................]
  Repeat 251 times

The label is still there! Magic.

I have to confess, I don’t really know how ASM does this. Indeed, I struggled to get the system back to a point where I could manually see the label using the od command. In the end, the only way I managed it was to reboot the server – yet ASM works fine all along and the diskgroup was never affected:

SQL> alter diskgroup V512DATA check all;
Mon Jul 20 16:46:23 2015
NOTE: starting check of diskgroup V512DATA
Mon Jul 20 16:46:23 2015
GMON querying group 1 at 5 for pid 7, osid 9255
GMON checking disk 0 for group 1 at 6 for pid 7, osid 9255
GMON querying group 1 at 7 for pid 7, osid 9255
GMON checking disk 1 for group 1 at 8 for pid 7, osid 9255
GMON querying group 1 at 9 for pid 7, osid 9255
GMON checking disk 2 for group 1 at 10 for pid 7, osid 9255
GMON querying group 1 at 11 for pid 7, osid 9255
GMON checking disk 3 for group 1 at 12 for pid 7, osid 9255
GMON querying group 1 at 13 for pid 7, osid 9255
GMON checking disk 4 for group 1 at 14 for pid 7, osid 9255
GMON querying group 1 at 15 for pid 7, osid 9255
GMON checking disk 5 for group 1 at 16 for pid 7, osid 9255
GMON querying group 1 at 17 for pid 7, osid 9255
GMON checking disk 6 for group 1 at 18 for pid 7, osid 9255
GMON querying group 1 at 19 for pid 7, osid 9255
GMON checking disk 7 for group 1 at 20 for pid 7, osid 9255
Mon Jul 20 16:46:23 2015
SUCCESS: check of diskgroup V512DATA found no errors
Mon Jul 20 16:46:23 2015
SUCCESS: alter diskgroup V512DATA check all

So there you go. ASMFD: it does what it says on the tin. Just don’t try using it with 4kN devices…

The Great Hypervisor Bake-off: VMware ESX vs Oracle VM

lock-horns

This is a very simple post to show the results of some recent testing that Tom and I ran using Oracle SLOB on Violin to determine the impact of using virtualization. But before we get to that, I am duty bound to write a paragraph of text featuring lots of long sentences peppered with industry buzz words. Forgive me, it’s just the way I’m wired.

It is increasingly common these days to find database environments running in virtual machines – even large, business critical ones. The driver is the trend to commoditize I.T. services and build consolidated, private-cloud style solutions in order to control operational expense and increase agility (not to mention reduce exposure to Oracle licenses). But, as I’ve said in previous posts, the catalyst has been the unblocking of I/O as legacy disk systems are replaced by flash memory. In the past, virtual environments caused a kind of I/O blender effect whereby I/O calls become increasingly randomized – and this sucked for the performance of disk drives. Flash memory arrays on the other hand can deliver random I/O all day long because… well, if you don’t know the reasons by now can I just recommend starting at the beginning. The outcome is that many large and medium-sized organisations are now building database-as-a-service platforms with Oracle databases (other database products are available) running in virtual machines. It’s happening right now.

Phew. Anyway, that last paragraph was just a wordy way of telling you that I’m often seeing Oracle running in virtual machines on top of hypervisors. But how much of a performance impact do those hypervisors have? Step this way to find out.

The Contenders

boxersWhen it comes to running Oracle on a hypervisor using Intel x86 hardware (for that is what I have available), I only know of three real contenders:

Hyper-V has been an option for a couple of years now, but I’ll be honest – I have neither the time nor the inclination to test it today. It’s not that I don’t rate it as a product, it’s just that I’ve never used it before and don’t have enough time to learn something new right now. Maybe someday I’ll come back and add it to the mix.

In the meantime, it’s the big showdown: VMware versus Oracle VM. Not that Oracle VM is really in the same league as VMware in terms of market share… but you know, I’m trying to make this sound exciting.

The Test

This is going to be an Oracle SLOB sustained throughput test. In other words, I’m going to build an Oracle database and then shovel a massive amount of I/O through it (you can read all about SLOB here and here). SLOB will be configured to run with 25% of statements being UPDATEs (the remainder are SELECTs) and will run for 8 hours straight. What we want to see is a) which hypervisor configuration allows the greatest I/O bandwidth, and b) which hypervisor configuration exhibits the most predictable performance.

This is the configuration. First the hardware:

Violin Memory 6616 flash Memory Array

Violin Memory 6616 flash Memory Array

  • 1x Dell PowerEdge R720 server
  • 2x Intel Xeon CPU E5-2690 v2 10-core @ 3.00GHz [so that’s 2 sockets, 20 cores, 40 threads for this server]
  • 128GB DRAM
  • 1x Violin Memory 6616 (SLC) flash memory array [the one that did this]
  • 8GB fibre-channel

And the software:

  • Hypervisor: VMware ESXi 5.5.1
  • Hypervisor: Oracle VM for x86 3.3.1
  • VM: Oracle Linux 6 Update 5 (with the Unbreakable Enterprise v3 Kernel 3.6.18)
  • Oracle Grid Infrastructure 11.2.0.4 (for Automatic Storage Management)
  • Oracle Database Enterprise Edition 11.2.0.4

Each VM is configured with 20 vCPUs and is using Linux Device Mapper Multipath and Oracle ASMLib. ASM is configured to use one single +DATA disgroup comprising 8 ASM disks (LUNs from Violin) with external redundancy. The database parameters and SLOB settings are all listed on the SLOB sustained throughput test page.

Results: Bare Metal (Baseline)

First let’s see what happens when we don’t use a hypervisor at all and just run OL6.5 on bare metal:

Oracle SLOB- 8 Hour Sustained Throughput Test with no hypervisor (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         232,431.0       194,452.3        37,978.7
         Database Requests:         228,909.4       194,447.9        34,461.5
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           3,515.1             0.3         3,514.8
                Total (MB):           1,839.6         1,519.2           320.4

Ok so we’re looking at 1519 MB/sec of read throughput and 320 MB/sec of write throughput. Crucially, the lines are nice and consistent – with very little deviation from the mean. By dividing the amount of time spent waiting on db file sequential read (i.e. random physical reads) with the number of waits, we can calculate that the average latency for random reads was 438 microseconds.

Now we know what to expect, let’s look at the result from the hypervisor tests.

Results: VMware vSphere

VMware is configured to use Raw Device Mapping (RDM) which essentially gives the benefits of raw devices… read here for more details on that. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with VMware ESXi 5.5.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         173,141.7       145,066.8        28,075.0
         Database Requests:         170,615.3       145,064.0        25,551.4
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,522.8             0.1         2,522.7
                Total (MB):           1,370.0         1,133.4           236.7

Average read throughput for this test was 1133 MB/sec and write throughput averaged at 237 MB/sec. Average read latency was 596 microseconds. That’s an increase of 36%.

In comparison to the bare metal test, we see that total bandwidth dropped by around 25%. That might seem like a lot but remember, we are absolutely hammering this system. A real database is unlikely to ever create this level of sustained I/O. In my role at Violin I’ve been privileged to work on some of the busiest databases in Europe – nothing is ever this crazy (although a few do come close).

Results: Oracle VM

Oracle VM is based on the Xen hypervisor and therefore uses Xen virtual disks to present block devices. For this test I downloaded the Oracle Linux 6 Update 5 template from Oracle’s eDelivery site. You can see more about the way this VM was configured here. Here are the test results:

Oracle SLOB- 8 Hour Sustained Throughput Test with Oracle VM 3.3.1 (SLC)

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:         160,563.8       134,592.9        25,970.9
         Database Requests:         158,538.1       134,587.3        23,950.8
        Optimized Requests:               0.0             0.0             0.0
             Redo Requests:           2,017.2             0.2         2,016.9
                Total (MB):           1,273.4         1,051.6           221.9

This time we see average read bandwidth of 1052MB/sec and average write bandwidth of 222MB/sec, with the average read latency at 607 microseconds, which is 39% higher than the baseline test.

Meanwhile, total bandwidth dropped by 31%. That’s slightly worse than VMware, but what’s really interesting is the deviation. Look at how ragged the lines are on the OVM test! There is a much higher degree of variance exhibited here than on the VMware test.

Conclusion

This is only one test so I’m not claiming it’s conclusive. VMware does appear to deliver slightly better performance than OVM in my tests, but it’s not a huge difference. However, I am very much concerned by the variance of the OVM test in comparison to VMware. Look, for example, at the wait event histograms for db file sequential read:

Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was <.05%; value of null is truly 0
-> % of Waits: column heading of <=1s is truly <1024ms, >1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                             % of Waits
                                          -----------------------------------------------
                                    Total
Hypervisor  Event                   Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
----------- ----------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
Bare Metal: db file sequential read 5557.  98.7   1.3    .0    .0    .0    .0
VMware ESX: db file sequential read 4164.  92.2   6.7   1.1    .0    .0    .0
Oracle VM : db file sequential read 3834.  95.6   4.1    .1    .1    .0    .0    .0    .0

The OVM tests show occasional results in the two highest buckets, meaning once or twice there were waits in excess of 1 second! However, to be fair, OVM also had more millisecond waits than VMware.

Anyway, for now – and for this setup at least – I’m sticking with VMware. You should of course test your own workloads before choosing which hypervisor works for you…

Thanks as always to Kevin for bringing Oracle SLOB to the community.

ASM Rebalance Too Slow? 3 Tips To Improve Rebalance Times

see-saw

I’ve run into a few customers recently who have had problems with their ASM rebalance operations running too slowly. Surprisingly, there were some simple concepts being overlooked – and once these were understood, the rebalance times were dramatically improved. For that reason, I’m documenting the solutions here… I hope that somebody, somewhere benefits…

1. Don’t Overbalance

Every time you run an ALTER DISKGROUP REBALANCE operation you initiate a large amount of I/O workload as Oracle ASM works to evenly stripe data across all available ASM disks (i.e. LUNs). The most common cause of rebalance operations running slowly that I see (and I’m constantly surprised how much I see this) is to overbalance, i.e. cause ASM to perform more I/O than is necessary.

It almost always goes like this. The customer wants to migrate some data from one set of ASM disks to another, so they first add the new disks:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
rebalance power 11 wait;

Then they drop the old disks like this:

alter diskgroup data
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

Well guess what? That causes double the amount of I/O that is actually necessary to migrate, because Oracle evenly stripes across all disks and then has to rebalance a second time once the original disks are dropped.

This is how it should be done – in one single operation:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

A customer of mine tried this earlier this week and reported back that their ASM rebalance time had reduced by a factor of five!

By the way, the WAIT command means the cursor doesn’t return until the command is finished. To have the command essentially run in the background you can simply change this to NOWAIT. Also, you could run the ADD and DROP commands separately if you used a POWER LIMIT of zero for the first command, as this would pause the rebalance and then the second command would kick it off.

2. Power Limit Goes Up To 1024

Simple one this, but easily forgotten. From the early days of ASM, the maximum power limit for rebalance operations was 11. See here if you don’t know why.

From 11.2.0.2, if the COMPATIBLE.ASM disk group attribute is set to 11.2.0.2 or higher the limit is now 1024. That means 11 really isn’t going to cut it anymore. If you are asking for full power, make sure you know what number that is.

3. Avoid The Compact Phase (for Flash Storage Systems)

An ASM rebalance operation comprises three phases, where the third one is the compact phase. This attempts to move data as close as possible to the outer tracks of the disks ASM is using.

Did you spot the issue there? Disks. This I/O-heavy phase is completely pointless on a flash system, where I/O is served evenly from any logical address within a LUN.

You can therefore avoid that potentially-massive I/O hit by disabling the compact phase, using the underscore parameter _DISABLE_REBALANCE_COMPACT=TRUE. Remember that you need to get Oracle Support’s permission before setting underscore parameters! Point your SR in the direction of the following My Oracle Support note:

What is ASM rebalance compact Phase and how it can be disabled (Doc ID 1902001.1)

Unfortunately it appears the parameter was deprecated in 12c, so from now on you have to set the ASM diskgroup attribute “_rebalance_compact” to FALSE (note the opposite value to that set at the instance level!), for example:

ALTER DISKGROUP  SET ATTRIBUTE "_rebalance_compact”="FALSE";

If you want to know more about this topic (for example, what the first two rebalance phases are), or indeed anything about ASM in general, I highly recommend the legendary ASM blogger that is Bane Radulovic a.k.a. ASM Support Guy.

Conclusion

An ASM rebalance potentially creates a lot of I/O, which means you may need to wait for a long time before it finishes. For that reason, make sure you understand what you are doing and make every effort to perform only as much I/O as you actually need. Don’t forget you can use the EXPLAIN WORK command to gauge in advance how much work is required.

Happy rebalancing!

Oracle Exadata X5: The Road To Ten Billion Dollars

money

Now that the dust has settled on the announcement of Oracle’s new Exadata X5 Database Machine, I’ve been doing some research in order to update my History of Exadata post (it’ll be ready soon). While reviewing the datasheets and other collateral for the X5 I was struck by the meteoric increase in one particular statistic: the number of processor cores on each database server. Oracle is riding that Moore’s Law train all the way to the bank.

The thing is, the number of cores per database server is directly linked to the cost of licensing the Oracle Database for each Exadata machine – and that means trouble if you are the one paying the bills. Assuming you buy a full rack – and that you license every core in every database server (which is the most common choice, since only a very brave minority would consider the Oracle VM Trusted Partitions option), your license cost has been increasing by 50% for each of the last two Exadata releases (X3-2 to X4-2 to X5-2). Let’s have a look at that in graphical form (click on the image to enlarge):

Exadata DB License Cost Comparison X2-2 to X5-2

Let’s not forget here that I am only plotting the cost of licensing the database software. We are not taking into account the extra costs associated with paying for the hardware, licensing the storage servers or purchasing any of the pretty-much essential enterprise edition options such as Oracle RAC, partitioning, multitenancy or the diagnostic pack license. Nor are we considering the infamous 22% per annum software support costs. I’m also using the list price – which you would never expect to pay – but even if you used discounted prices the percentage increases would remain the same.

Anyway, after crunching the numbers it turns out there is good news and bad news…

Good News for Oracle

The good news for Oracle is that if Exadata continues to increase the number of cores at the rate of 50% extra per release, the list price for licensing the database software on the future X23-2 model will be $10.1 billion:

Exadata DB License Cost Projection

There’s no doubt about it – this will pay for a lot of yachts.

Bad News for Oracle

There is a downside though. As Kevin Closson has already shown, Oracle appears to be having trouble balancing the I/O capabilities of the X5 against this tremendously-increased compute power. Even after abandoning previous claims that a memory hierarchy with “low-cost disk” as the bottom tier would bring the “highest performance at the lowest cost” to customers, the new Oracle Exadata X5 “Extreme Flash” model (because apparently flash on its own isn’t enough, it has to be extreme flash) struggles to deliver an improvement in IOPS per flash device, as Kevin has shown.

exadata-x5-ef-price-per-io

Excerpt from the Oracle Exadata X5-2 Press Release

You wouldn’t think this from reading the press release, which promises “breakthrough performance and price per I/O” (emphasis added by me). Price per I/O, eh? That sounds like we take the overall price and divide it by the number of IOPS the system can deliver, right?

So let’s do that. And to be generous, let’s only look at the database license cost (ignoring all those other costs again) and take the maximum IOPS number from the datasheets (which in most cases is for unrealistic, 100% read only, fully cached workloads). How will it look? I’ll overlay it as a line (in blue) on top of the first graph:

Exadata DB License Cost Comparison

Well it turns out that the price per I/O is actually falling: it’s down from $1.71 on the X4-2 (HP) model to $1.65 on the X5-2 EF. But three and a half percent is not much of an improvement considering the 50% extra on the price tag, is it? And as for offering a “breakthrough” price per I/O, the X2-2 was better at only $1.52 per I/O per second!

Summary

In my view, something is not right about the balance between compute and storage on the Exadata X5. It feels as though Oracle is bumping up the compute power faster than would be architecturally prudent because this results in a higher purchase price. Maybe I’m wrong – I have no insider knowledge and can only speculate… but when the Exadata X23-2 finally comes out in a couple of decades time, maybe we’ll know for sure.

Footnote

The comments section of this article makes for interesting reading, with responses from a number of Oracle employees – although not necessarily speaking on behalf of the mothership. The noteworthy (not to mention calm and measured) comments come from the Exadata Product Manager, Gurmeet Goindi. In addition I’d like to draw your attention to the following two URLs:

Deprecation of Non-CDB Architecture in Oracle 12c

dead-end

Back in July 2013, Oracle released the latest version of its flagship database product, Oracle 12c. Among the usual fanfare was information about a number of new options – including one known as Multitenant. With the Multitenant option, databases use a new architecture which features a container database (or CDB) which in turn contains one or more pluggable databases (or PDBs). Use of Multitenant requires a licence – which at the time of writing retails at $17,500 per processor (perpetual) plus 22% per annum for support.

This post is not intended to discuss the way Multitenant works – if you want to read more about it, Tim has a great set of articles about Multitenant here. But keep in mind that you can choose to install the Multitenant feature or not. If you do install it, you can create a single PDB within your CDB without requiring the license. As soon as you use more than one PDB the license is required.

What I want to talk about is Oracle’s attitude to its customers and what seems to me to be breathtaking arrogance. Personally I can think of three very good reasons why I might not want to use the single PDB within a CDB configuration which does not require a Multitenant license:

  • Multitenant requires additional configuration and the use of new administrative commands, which means re-writing admin procedures and re-training operations staff
  • Multitenant is an entirely new feature, with new code paths – which means it carries a risk of bugs (the list of bug fixes for the 12.1.0.2 patchset contains a section on Pluggable/Container Databases which lists no fewer than 105 items)
  • With the Multitenant option installed it is possible to trigger the requirement for an expensive set of licenses due to human error… without the option installed this is not possible

So it seems to me that, while Multitenant might be an interesting and useful new feature to evaluate, it is not something that I would want to be forced into using on production environments just yet. As always, people who manage production environments are conservative in their attitude to risk.

And that’s why I’m surprised to see this deprecation notice in the 12.1 documentation:

cdb_dep

The non-CDB architecture (i.e. the old way of building a database without CDBs and PDBs) has been deprecated in 12c and “may be desupported or unavailable” in later releases of Oracle Database. In other words, you need to change to using the CDB and PDB configuration now, even if you do not plan to purchase the Multitenant option.

It would be nice to have the choice, wouldn’t it?

Deprecated versus Desupported

OK so first let’s just remember that the term deprecated does not mean the same as the term desupported. We can dip back into the documentation to define these two important terms:

“By deprecate, we mean that the feature is no longer being enhanced but is still supported for the full life of the 12.1 release. By desupported, we mean that Oracle will no longer fix bugs related to that feature and may remove the code altogether.”

– Oracle Database 12c Upgrade Guide

It seems that Oracle will still support the use of non-CDB databases and will continue to do so for the lifetime of the 12.1 release. But, if you were designing a new system right now, it would take some confidence to choose a configuration which is deprecated and already living on death row.

And there’s more. The deprecation notice says there are some features that still do not work with the CDB architecture – and that if you want to use these you should use the deprecated non-CDB architecture. The list of features which are restricted or not available includes Automatic Data Optimization, Heat Maps, DBVERIFY and Flashback Pluggable Database (you can see the complete lists here for 12.1.0.1 and 12.1.0.2).

So we can add a fourth reason to our list of three drawn up earlier:

  • one-wayThe Multitenant causes a number of other options or features to be unusable

Now, given that I think all four reasons stated here are good enough to stand up on their own, what does this say about Oracle’s decision to deprecate non-CDB architectures?

You can draw your own conclusion, but I can’t help see it as arrogance on Oracle’s part as they force customers to use a specific new configuration with little regard for how it affects their operations. At worst, I don’t like being forced into changes by the vendor (to whom customers pay large amounts of money) while at best I would at least expect them to get all the features working before forcing the issue…

Update – 18 February 2015

Since I published this article back in late January I’ve had a lot of comments – both here and on Twitter. Some agree with me, some disagree – and unsurprisingly many of the latter are Oracle employees. One particular Oracle employee took to his corporate blog to post a four-part response!

One of those responses was in regard to my concern that, since it seemingly cannot be unlinked, customers may be able to inadvertently trigger usage of the Multitenant feature and thus incur an expensive and unexpected license bill. [I have no knowledge that this has ever happened, I am merely concerned that it may be possible.]

I’d like to quote the following response from our friend on the Oracle blog (his own opinion, not the views of his employer) who is apparently looking to rubbish that concern. I have placed the most enlightening part of the text in bold for emphasis:

This bit of FUD is silly. First of all, this risk already exists with various features of the Oracle database. For example, many of the OEM packs can be inadvertently used without a license as could several of the views in the database itself. Partitioning is another example that comes to mind. Often it’s installed in a database but it’s use requires a license.

So, how is this any different? Well, it’s not. Simply put, this is an argument for enterprise compliance auditing/management.

I hope this convinces you more than it convinces me.

Oracle AWR Reports: When IOStats Lie

graph

If you’ve been unfortunate enough to follow my dithering on Twitter recently you’ll know that I’ve been lurching between thinking that there is and isn’t a problem with Oracle’s tracking of I/O statistics in its AWR reports.

I’m now convinced there is a problem, but I can’t work out what causes it… so step 1 is to describe the problem here, after which step 2 will probably be to sit back and hope someone far more intelligent than me will solve it.

But first some background:

AWR I/O Statistics

I’ve written about the I/O statistics contained in Oracle AWR Reports before, so I won’t repeat myself too much other than to highlight two critical areas which we’ll focus on today: Instance Statistics and IOStat Summaries. By the way, the format of AWR reports changed in 11.2.0.4 and 12c to include a new IO Profile section, but today we’re covering reports from 11.2.0.3.

First of all, here are some sensible statistics. I’m going to show you the IOStat by Function Summary section of a report from a real life database:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg  
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Buffer Cache Re   14.9G   504.1 4.22491      0M     0.0      0M 1690.3K     0.0
Direct Reads      12.2G     3.5 3.45612      0M     0.0      0M       0     N/A
DBWR                 0M     0.0      0M    7.9G   186.6 2.25336       0     N/A
Others             4.2G    12.1 1.18110    3.6G     2.5 1.03172   41.9K     0.1
LGWR                 1M     0.0 .000277    5.8G    11.9 1.65642   17.4K     1.0
Direct Writes        0M     0.0      0M      4M     0.0 .001110       0     N/A
TOTAL:            31.2G   519.7 8.86241   17.4G   201.0 4.94263 1749.6K     0.0
          -------------------------------------------------------------

This section of the report is breaking down all I/O operations into the functions that caused them (e.g. buffer cache reads, direct path reads or writes, redo log writes by LGWR, the writing of dirty buffers by DBWR and so on). Let’s ignore that level of detail now and just focus on the TOTAL row at the bottom.

To try and make this simple to describe, I’ve gone a bit crazy with the colours. In green I’ve highlighted the labels for reads and writes – and now let’s walk through the columns:

  1. The first column is the function name, but we’re just focussing on TOTAL
  2. The second column is the total amount of reads that happened in this snapshot: 31.2GB
  3. The third column is the average number of read requests per second, i.e. 519.7 read IOPS
  4. The fourth column is the average volume of data read per second, i.e. 8.86 MB/sec read throughput
  5. The fifth column is the total amount of writes that happened in this snapshot: 17.4GB
  6. The sixth column is the average number of write requests per second, i.e. 201 write IOPS
  7. The seventh column is the average volume of data written per second, i.e. 4.94 MB/sec write throughput
  8. The eighth and ninth columns are not of interest to us here

The whole section is based on the DBA_HIST_IOSTAT_FUNCTION view. What we care about today is the IOPS figures (shown in red) and the Throughput figures (shown in blue). Pay attention to the comments in the view header which show that Data columns (including throughput) are multiples of 1024 while other columns (including IOPS) are multiples of 1000. It’s interesting that the two throughput values are obviously measured in MB/sec and yet are missing the “M” suffix – I assume this “falls off the end” of the column because of the number of decimal places displayed.

Now that we have these figures explained, let’s compare them to what we see in the Instance Activity Stats section of the same AWR report:

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                 1,848,576          513.3         121.5
physical read bytes                  29,160,300,544    8,096,201.8   1,917,053.5
physical read total IO requests           1,892,366          525.4         124.4
physical read total bytes            33,620,594,176    9,334,578.5   2,210,281.7
physical read total multi block              17,096            4.8           1.1
physical reads                            3,559,607          988.3         234.0

physical write IO requests                  671,728          186.5          44.2
physical write bytes                  8,513,257,472    2,363,660.5     559,677.7
physical write total IO requests            723,348          200.8          47.6
physical write total bytes           18,657,198,080    5,180,071.5   1,226,559.6
physical write total multi block             26,192            7.3           1.7
physical writes                           1,039,216          288.5          68.3

For this section we care about the per Second column because both IOPS and Throughput are measured using units per second. For both reads and writes there are two sets of statistics: those with the word total in them and those without. You can find the full description of 11gR2 statistics in the documentation, but the difference between the two is best summed up by this snippet which describes physical read total bytes:

Total size in bytes of disk reads by all database instance activity including application reads, backup and recovery, and other utilities. The difference between this value and “physical read bytes” gives the total read size in bytes by non-application workload.

I’ve underlined non-application workload here because this is critical. If you merely look at the Load Profile section at the top of an AWR report you will only see values for “application workload” I/O but this does not include stuff like RMAN backups, archive logging and so on… important stuff if you care about the actual I/O workload. For this reason, we only care about the following four statistics:

  1. physical read total IO requests (per second) = the average number of read IOPS
  2. physical read total bytes (per second) = the average read throughput in bytes per second
  3. physical write total IO requests (per second) = the average number of write IOPS
  4. physical write total bytes (per second) = the average write throughput in bytes per second

Again I’ve coloured the IOPS figures in red and the throughput figures in blue.

Tying It Together

Now that we have our two sets of values, let’s just compare them to make sure they align. The IOPS figures do not require any conversion but the throughput figures do: the values in the Instance Activity Stats section are in bytes/second and we want them to be in MB/second so we need to divide by 1048576 (i.e. 1024 * 1024).

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

519.7

525.4

1.09%

Write IOPS

201.0

200.8

0.10%

Read Throughput

8.86241 MB/sec

9,334,578.5 bytes/sec

8.90215 MB/sec

0.45%

Write Throughput

4.94263 MB/sec

5,180,071.5 bytes/sec

4.94010 MB/sec

0.05%

I’ve calculated the error percentages here to see how far the figures vary. It is my assumption that the Instance Activity Stats are accurate and that any margin of error in the IOStat figures comes as a result of sampling issues. The highest error percentage we see here is just over 1%, which is hardly a problem in my opinion.

Don’t Believe The Stats

So far I have no complaints about the matching of statistics in the AWR report. But now let me introduce you to the AWR report that has been puzzling me recently:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Direct Reads       3.8T 2.7E+04    3.2G    405M     3.6 .335473       0     N/A
Direct Writes        0M     0.0      0M  164.5G  1159.0 139.519       0     N/A
Buffer Cache Re   63.2G  3177.6 53.6086      0M     0.0      0M 3292.5K    23.4
DBWR                 0M     0.0      0M      2G    45.3 1.71464       0     N/A
LGWR                 0M     0.0      0M    485M    19.2 .401739   19.7K     0.9
Others             220M    12.3 .182232     26M     0.7 .021536   15.6K    34.5
Streams AQ           0M     0.0      0M      0M     0.0      0M       1     9.0
TOTAL:             3.9T 3.0E+04    3.3G  167.4G  1227.8 141.992 3327.8K    23.3
          -------------------------------------------------------------

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read total IO requests          35,889,568       29,728.4       3,239.4
physical read total bytes         4,261,590,852,608 3.52999864E+09 3.8465483E+08

physical write total IO requests          1,683,381        1,394.4         151.9
physical write total bytes          205,090,714,624  169,882,555.1  18,511,663.0

Again I’ve coloured the IOPS measurements in red and the throughput measurements in blue. And as before we need to convert the bytes per second values shown in the Instance Activity Stats section to MB/sec as shown in the IOStat by Function Summary section.

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

30,000

29,728.4

0.91%

Write IOPS

1227.8

1,394.4

11.95%

Read Throughput

3.3 GB/sec

3,379.2 MB/sec

3,529,998,640 bytes/sec

3,366.469 MB/sec

0.38%

Write Throughput

141.992 MB/sec

169,882,555.1 bytes/sec

162.01263 MB/sec

12.36%

Do you see what’s bugging me here? The write values for both IOPS and throughput are out by over 10% when I compare the values in IOStat by Function against the Instance Activity Stats. Ten percent is a massive margin of error at this level – we’re talking 20MB/sec. To translate that into something easier to understand, if 20MB/sec were sustained over a 24 hour period it would amount to over 1.6TB of data. I’ve seen smaller data warehouses.

So why is this happening? Unfortunately I don’t have access to the system where this AWR report was created, so I cannot tell if, for example, there was an instance restart between the start and end snapshots (although the elapsed time of the report was just 20 minutes so it seems unlikely).

The truth is I don’t know. Which is why I’m writing about it here… if you think you have the answer, or just as importantly if you see the same behaviour, let me know!