New AWR Report Format: Oracle 11.2.0.4 and 12c

This is a post about Oracle Automatic Workload Repository (AWR) Reports. If you are an Oracle professional you doubtless know what these are – and if you have to perform any sort of performance tuning as part of your day job it’s likely you spend a lot of time immersed in them. Goodness knows I do – a few weeks ago I had to analyse 2,304 of them in one (long) day. But for anyone else, they are (huge) reports containing all sorts of information about activities that happened between two points of time on an Oracle instance. If that doesn’t excite you now, please move along – there is nothing further for you here.

AWR Reports have been with us since the introduction of the Automatic Workload Repository back in 10g and can be considered a replacement for the venerable Statspack tool. Through each major incremental release the amount of information contained in an AWR Report has grown; for instance, the 10g reports didn’t even show the type of operating system, but 11g reports do. More information is of course a good thing, but sometimes it feels like there is so much data now it’s hard to find the truth hidden among all the distractions.

I recently commented in another post about the change in AWR report format introduced in 11.2.0.4. This came as a surprise to me because I cannot remember report formatting previously changing mid-release, especially given the scale of the change. Not only that, but I’m sure I’ve seen reports from 11.2.0.3 in the new format too (implying it was added via a patch set update), although I can’t find the evidence now so am forced to concede I may have imagined it. The same new format also continues into 12.1.0.1 incidentally.

The 11.2.0.4 New Features document doesn’t mention anything about a new report format. I can’t find anything about it on My Oracle Support (but then I can never find anything about anything I’m looking for on MOS these days). So I’m taking it upon myself to document the new format and the changes introduced – as well as point out a nasty little issue that’s caught me out a couple of times already.

Comparing Old and New Formats

From what I can tell, all of the major changes except one have taken place in the Report Summary section at the start of the AWR report. Oracle appears to have re-ordered the subsections and added a couple of new ones:

  • Wait Classes by Total Wait Time
  • IO Profile

The new Wait Classes section is interesting because there is already a section called Foreground Wait Class down in the Wait Event Statistics section of the Main Report, but the additional section appears to include background waits as well. The IO Profile section is especially useful for people like me who work with storage – and I’ve already blogged about it here.

In addition, the long-serving Top 5 Timed Foreground Events section has been renamed and extended to become Top 10 Foreground Events by Total Wait Time.

Here are the changes in tabular format:

Old Format                        New Format
-------------------------------   -------------------------------------------
Cache Sizes                       Load Profile
Load Profile                      Instance Efficiency Percentages
Instance Efficiency Percentages   Top 10 Foreground Events by Total Wait Time
Shared Pool Statistics            Wait Classes by Total Wait Time
Top 5 Timed Foreground Events     Host CPU
Host CPU                          Instance CPU
Instance CPU                      IO Profile
Memory Statistics                 Memory Statistics
Time Model Statistics             Cache Sizes
                                  Shared Pool Statistics
                                  Time Model Statistics

I also said there was one further change outside of the Report Summary section. It’s the long-standing Instance Activity Stats section, which has now been divided into two:

Old Format                        New Format
-------------------------------   -------------------------------
Instance Activity Stats           Key Instance Activity Stats
                                  Other Instance Activity Stats

I don’t really understand the point of that change, nor why a select few statistics are deemed to be more “key” than others. But hey, that’s the mystery of Oracle, right?

Tablespace / Filesystem IO Stats

Another, more minor, change is the addition of some cryptic-looking “1-bk” columns to the Tablespace IO Stats and File IO Stats sections:

Tablespace
------------------------------
          Av       Av     Av      1-bk  Av 1-bk          Writes   Buffer  Av Buf
  Reads   Rds/s  Rd(ms) Blks/Rd   Rds/s  Rd(ms)  Writes   avg/s    Waits  Wt(ms)
------- ------- ------- ------- ------- ------- ------- ------- -------- -------
UNDOTBS1
8.4E+05      29     0.7     1.0 6.3E+06    29.2       1     220    1,054     4.2
SYSAUX
 95,054       3     0.8     1.0  11,893     3.3       1       0        1    60.0
SYSTEM
    745       0     0.0     1.0   1,055     0.0       0       0       13     0.8
USERS
    715       0     0.0     1.0     715     0.0       0       0        0     0.0
TEMP
      0       0     0.0     N/A       7     0.0       0       0        0     0.0

I have to confess it took me a while to figure out what they meant – in the end I had to consult the documentation for the view DBA_HIST_FILESTATXS:

Column           Datatype   Description
---------------  ---------  -------------------------------------------------
SINGLEBLKRDS     NUMBER     Number of single block reads
SINGLEBLKRDTIM   NUMBER     Cumulative single block read time (in hundredths
                            of a second)

Aha! So the AWR report is now giving us the number of single block reads (SINGLEBLKRDS) and the average read time for them (SINGLEBLKRDTIM / SINGLEBLKRDS). That’s actually pretty useful information for testing storage, since single block reads tell no lies. [If you want to know what I mean by that, visit Frits Hoogland’s blog and download his white paper on multiblock reads…]
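If you’d rather pull this straight out of the repository, something like the following works. This is a minimal sketch only: the values in DBA_HIST_FILESTATXS are cumulative, so for a real interval you would difference two snapshots, and the snap_id below is just a placeholder.

select file#,
       singleblkrds,
       round(singleblkrdtim * 10 / nullif(singleblkrds, 0), 2) as avg_1bk_rd_ms
  from dba_hist_filestatxs
 where snap_id = 1234          -- placeholder: pick your snapshot
 order by singleblkrds desc;

(The multiplication by 10 converts hundredths of a second into milliseconds.)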

Top 10: Don’t Believe The Stats

One thing you might want to be wary about is the new Top 10 section… Here are the first two lines from mine after running a SLOB test:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
db file sequential read        3.379077E+09 2527       1   91.4 User I/O
DB CPU                                      318.           11.5

Now, normally when I run SLOB and inspect the post-run awr.txt file I work out the average wait time for db file sequential read so I can work out the latency. Since AWR reports do not have enough decimal places for the sort of storage I use (the wait shows simply as 0 or 1), I have to divide the total wait time by the number of waits. But in the report above, the total wait time of 2,527 divided by 3,379,077,000 waits gives me an average of 0.000747 milliseconds. Huh? Looking back at the numbers above it’s clear that the Total Time column has been truncated and some of the digits are missing. That’s bad news for me, as I regularly use scripts to strip this information out and parse it.

This is pretty poor in my opinion, because there is no warning and the number is just wrong. I assume this is an edge case because the number of waits contains so many digits, but for extended SLOB tests that’s not unlikely. Back in the good old Top 5 days it looked like this, which worked fine:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                              Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read         119,835,425      50,938      0   84.0 User I/O
latch: cache buffers lru chain   20,051,266       6,221      0   10.3 Other

Unfortunately, in the new 11.2.0.4 and above Top 10 report, the Total Time column simply isn’t wide enough. Instead, I have to scan down to the Foreground Wait Events section to get my true data:

                                                             Avg
                                        %Time Total Wait    wait    Waits   % DB
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
db file sequential read    3.379077E+09     0  2,527,552       1     11.3   91.4

This is something worth looking out for, especially if you also use scripts to fetch data from AWR files. Of course, the HTML reports don’t suffer from this problem, which just makes it even more annoying as I can’t parse HTML reports automatically (and thus I despise them immensely).
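If you are scripting against the text reports anyway, a section-aware filter avoids accidentally picking up the truncated Top 10 line. Here’s a rough sketch only – it assumes the column layout shown above, so adjust the field numbers if your report differs:

awk '/^Foreground Wait Events/ { in_fg = 1 }
     in_fg && /^db file sequential read/ {
         waits = $5; secs = $7
         gsub(/,/, "", waits); gsub(/,/, "", secs)   # strip thousands separators
         if (waits + 0 > 0)
             printf "avg wait: %.1f us\n", (secs / waits) * 1000000
         exit
     }' awr.txt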

12.1.0.2 AWR Reports

One final thing to mention is the AWR report format of 12.1.0.2 (which had just been released at the time of writing). There aren’t many changes from 12.1.0.1, but a few extra lines have crept in, which I’ll highlight here. In the main, they are related to the new In-Memory Database option.

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):               0.3               4.8      0.00      0.02
              DB CPU(s):               0.1               1.2      0.00      0.00
      Background CPU(s):               0.0               0.1      0.00      0.00
      Redo size (bytes):          50,171.6         971,322.7
  Logical read (blocks):             558.6          10,814.3
          Block changes:             152.2           2,947.0
 Physical read (blocks):              15.1             292.0
Physical write (blocks):               0.2               4.7
       Read IO requests:              15.1             292.0
      Write IO requests:               0.2               3.3
           Read IO (MB):               0.1               2.3
          Write IO (MB):               0.0               0.0
           IM scan rows:               0.0               0.0
Session Logical Read IM:
             User calls:              16.1             312.0
           Parses (SQL):              34.0             658.0
      Hard parses (SQL):               4.6              88.0
     SQL Work Area (MB):               0.9              17.2
                 Logons:               0.1               1.7
         Executes (SQL):              95.4           1,846.0
              Rollbacks:               0.0               0.0
           Transactions:               0.1

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   97.55       Redo NoWait %:  100.00
            Buffer  Hit   %:   97.30    In-memory Sort %:  100.00
            Library Hit   %:   81.75        Soft Parse %:   86.63
         Execute to Parse %:   64.36         Latch Hit %:   96.54
Parse CPU to Parse Elapsd %:   19.45     % Non-Parse CPU:   31.02
          Flash Cache Hit %:    0.00

<snip!>

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:       960M       960M  Std Block Size:         8K
           Shared Pool Size:     4,096M     4,096M      Log Buffer:   139,980K
             In-Memory Area:         0M         0M

One other thing of note is that the Top 10 section now (finally) displays average wait times to two decimal places. This took a surprising amount of time to arrive, but it’s most welcome:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
db file parallel read               63,157      828.6      13.12   86.1 User I/O
DB CPU                                          234.2              24.3
db file sequential read            113,786       67.8       0.60    7.0 User I/O

New section: Oracle SLOB Testing

For some time now I have preferred Oracle SLOB as my tool for generating I/O workloads using Oracle databases. I’ve previously blogged some information on how to use SLOB for PIO testing, as well as shared some scripts for running tests and extracting results. I’ve now added a whole new landing page for SLOB and a complete guide to running sustained throughput testing.

Why would you want to run sustained throughput tests? Well, one great reason is that not all storage platforms can cope with sustained levels of write workload. Flash arrays, or any storage array which contains flash, have a tendency to suffer from garbage collection issues when sustained write workloads hit them hard enough.

Find out more by following the links on the new SLOB landing page.

New My Oracle Support note on Advanced Format (4k) storage

In the past I have been a little critical of Oracle’s support notes and documentation regarding the use of Advanced Format 4k storage devices. I must now take that back, as my new friends in Oracle ASM Development and Product Management very kindly offered to let me write a new support note, which they have just published on My Oracle Support. It’s only supposed to be high level, but it does confirm that the _DISK_SECTOR_SIZE_OVERRIDE parameter can be safely set in database instances when using 512e storage and attempting to create 4k online redo logs.

The new support note is:

Using 4k Redo Logs on Flash and SSD-based Storage (Doc ID 1681266.1)

Don’t forget that you can read all about the basics of using Oracle with 4k sector storage here. And if you really feel up to it, I have a 4k deep dive page here.

Oracle SLOB On Solaris

Guest Post

This is another guest post from my buddy Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Nate believes that all English people live in the Dickensian London of the 19th century and speak in Cockney rhyming slang. I hate to disappoint, so have a butcher’s below and feast your mince pies on his attempts to make SLOB work on Solaris without going chicken oriental. Over to you Nate, me old china plate.

Note: The Silly Little Oracle Benchmark, or SLOB, is a Linux-only tool designed and released for the community by Kevin Closson. There are no ports for other operating systems – and Kevin has always advised that the solution for testing on another platform is to use a Linux VM and connect via TNS. The purpose of this post is simply to show what happens when you have no other choice but to try and get SLOB working natively on Solaris…

I wrestled with SLOB 2 for a couple hours last week for a demo build we had in-house to show our capabilities to a prospective customer. I should mention I’ve had great success—and ease!—with SLOB 2 previously. But that was on Linux. This was on Solaris 10—to mimic the setup the customer has in-house. No problem, I thought; there’s some C files to compile, but then there’s just shell scripts to drive the thing. What could go wrong?

Well, it would seem Kevin Closson, the creator of SLOB and SLOB 2, did his development on an OS with a better sense of humor than Solaris. The package unzipped, and the setup.sh script appeared to run successfully, but runit.sh would load up the worker threads and wait several seconds before launching them—and then immediately call it “done” and bail out, having executed on the database only a couple seconds. Huh? I had my slob.conf set to execute for 300 seconds.

I had two databases created: one with 4K blocks and one with 8K blocks. I had a tablespace created for SLOB data called SLOB4K and SLOB8K, respectively. I ran setup.sh SLOB4K 128, and the log file showed no errors. All good, I thought. Now run runit.sh 12, and it stops as quickly as it starts. Oof.

It took Bryan Wood, a much better shell script debugger (hey, I said DEbugger) than myself, to figure out all the problems.

First, there was this interesting line of output from the runit.sh command:

NOTIFY: Connecting users 1 2 3 Usage: mpstat [-aq] [-p | -P processor_set] [interval [count]]
4 5 6 7 8 9 10

Seems Solaris doesn’t like mpstat -P ALL. However, it seems that on Solaris 10 the mpstat command shows all processors even without the -P option.

Next, Solaris doesn’t like Kevin’s “sleep .5” command inside runit.sh; it wants whole numbers only. That raises the question in my mind why he felt the need to check for running processes every half second rather than just letting it wait a full second between checks, but fine. Modify the command in the wait_pids() function to sleep for a full second, and that part is happy.

But it still kicks out immediately and kills the OS level monitoring commands, even though there are active SQL*Plus sessions out there. It seems on Solaris the ps -p command to report status on a list of processes requires the list of process IDs to be quoted, where Linux does not. For example:

-bash-3.2$ ps -p 1 2 3
usage: ps [ -aAdeflcjLPyZ ] [ -o format ] [ -t termlist ]
        [ -u userlist ] [ -U userlist ] [ -G grouplist ]
        [ -p proclist ] [ -g pgrplist ] [ -s sidlist ] [ -z zonelist ]
  'format' is one or more of:
        user ruser group rgroup uid ruid gid rgid pid ppid pgid sid taskid ctid
        pri opri pcpu pmem vsz rss osz nice class time etime stime zone zoneid
        f s c lwp nlwp psr tty addr wchan fname comm args projid project pset

But with quotes:

-bash-3.2$ ps -p "1 2 3"
   PID TTY         TIME CMD
     1 ?           0:02 init
     2 ?           0:00 pageout
     3 ?          25:03 fsflush

After some messing about, Bryan had the great idea to simply replace the command:

while ( ps -p $pids > /dev/null 2>&1 )

With:

while ( ps -p "$pids" > /dev/null 2>&1 )

Just thought I might save someone else some time and hair pulling by sharing this info… Here are the finished file diffs:

-bash-3.2$ diff runit.sh runit.sh.original
31c30
< while ( ps -p "$pids" > /dev/null 2>&1 )
---
> while ( ps -p $pids > /dev/null 2>&1 )
33c32
<       sleep 1
---
>       sleep .5
219c218
<       ( mpstat 3  > mpstat.out ) &
---
>       ( mpstat -P ALL 3  > mpstat.out ) &

New installation cookbook for SUSE Linux Enterprise Server 11 SP3

Exactly what it says on the tin: I’ve added a new installation cookbook for SUSE 11 SP3 which creates an Oracle database on a set of Violin 4k devices.

I’ve started setting the add_random tunable to zero (alongside the noop I/O scheduler) because it seems to give a boost in performance during benchmarking runs. If I can find the time, I will blog about this at some point…
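For reference, both settings live under sysfs. A quick sketch – the device name is a placeholder, and in practice you would persist these via udev rules rather than setting them by hand:

# select the noop elevator for the device
echo noop > /sys/block/sdb/queue/scheduler
# stop the device contributing to the kernel entropy pool
echo 0 > /sys/block/sdb/queue/add_random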

For more details read this document from Red Hat.

More Problems with Oracle’s Support of 4k Devices

This is going to be another one of those posts, a bit like this one, that discuss the use of Oracle’s database product with Advanced Format devices. I wish there weren’t so many of these posts, but it seems that Oracle has a lot of issues with its implementation of 4k support.

(Before reading on, if you aren’t sure what I’m talking about here then please have a read of this page…)

In the last post I built a database which used Oracle ASM (and the Linux ASMLib kernel driver) but found that if the database used an SPFILE which was located on a 4k device (within an ASM diskgroup) it didn’t work. Today, I’m going to forego ASM and use a filesystem instead (something I would never do in real life).

Building a 4k Filesystem

Let’s start with a single 4k LUN being presented from my Violin array. I’ve already configured the Linux device mapper multipathing so that it presents itself as a nicely-named device in the /dev/mapper directory:

[oracle@half-server4 ~]$ ls -l /dev/mapper/fs4ktest 
lrwxrwxrwx 1 root root 7 Feb 25 15:53 /dev/mapper/fs4ktest -> ../dm-7
[oracle@half-server4 ~]$ fdisk -l /dev/mapper/fs4ktest

Note: sector size is 4096 (not 512)

Disk /dev/mapper/fs4ktest: 215.8 GB, 215822106624 bytes
255 heads, 63 sectors/track, 3279 cylinders
Units = cylinders of 16065 * 4096 = 65802240 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes

We can see that this is indeed a 4k device, i.e. it has not only a 4096 byte physical blocksize, but a 4096 byte logical blocksize too. The fdisk command has even taken the time to print a special “Note” to ensure we see the sector size is not the usual 512 bytes. The next thing to do is format it with a filesystem so I’m going to use ext4:

[root@half-server4 ~]# mkfs.ext4 /dev/mapper/fs4ktest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=128 blocks
2097152 inodes, 8388608 blocks
419430 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
256 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Now it needs to be mounted. I’m just going to stick it on a mount point in a new top-level directory called /fstest:

[root@half-server4 ~]# mkdir -p /fstest/fs4ktest
[root@half-server4 ~]# chown -R oracle:oinstall /fstest
[root@half-server4 ~]# mount /dev/mapper/fs4ktest /fstest/fs4ktest

Finally, I am going to create an Oracle database using this filesystem. I’m not going to cut and paste all the output for that, because it’s all a bit dull… so let’s just skip to the bit where DBCA has completed and the database is open.

Oracle On A 4k Filesystem

So the good news is, it worked. The database is up and running and the datafiles are located on the 4k filesystem:

[oracle@half-server4 fstest]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Mon Mar 3 21:30:02 2014
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select name from v$datafile;

NAME
---------------------------------------------------------------------------------------
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_system_9k9wxmw6_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_sysaux_9k9ww6r6_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_undotbs1_9k9wz257_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_users_9k9wz123_.dbf

Cool. We can end this post here then, right? Well, no… because there is a bit of a problem with this database. Let’s just have a quick check of the FILESYSTEMIO_OPTIONS parameter:

SQL> show parameter filesystem

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options		     string	 none

This parameter controls the way that I/O is performed for files located on filesystems. It isn’t relevant for databases using Oracle ASM (for which the DISK_ASYNCH_IO parameter exists instead), but here it’s making a massive difference. According to the Oracle documentation, it has four possible options:

  • ASYNCH: enable asynchronous I/O on file system files, which has no timing requirement for transmission.
  • DIRECTIO: enable direct I/O on file system files, which bypasses the buffer cache.
  • SETALL: enable both asynchronous and direct I/O on file system files.
  • NONE: disable both asynchronous and direct I/O on file system files.

Normally, when I see filesystem-based databases, I find this parameter set to SETALL. This means asynchronous and direct I/O, but here it is set to NONE which means neither. And it’s the DIRECTIO that we are interested in.

One Buffer Cache Is Enough

As you are no doubt aware, Oracle databases have a buffer cache which is used to cache copies of database blocks. However, the Linux operating system also has its own buffer cache for filesystems. Most people would consider it ineffective to use two levels of cache – and if that is the case, it will obviously be the Oracle buffer cache that needs to be used. So let’s set the parameter to use direct I/O and then restart the database (as the parameter is not dynamic):

SQL> alter system set filesystemio_options='directIO' scope=spfile;

System altered.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 1.3896E+10 bytes
Fixed Size		    4663568 bytes
Variable Size		 2751465200 bytes
Database Buffers	 1.1107E+10 bytes
Redo Buffers		   33673216 bytes
ORA-00205: error in identifying control file, check alert log for more info

Oh dear. What happened?

[oracle@half-server4 ~]$ tail /u01/app/oracle/diag/rdbms/fstest/fstest/trace/alert_fstest.log 
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/fstest/fs4ktest/oracle/oradata/FSTEST/controlfile/o1_mf_9k9wzp31_.ctl'
ORA-27047: unable to read the header block of file
Linux-x86_64 Error: 22: Invalid argument
Additional information: 1
ORA-205 signalled during: ALTER DATABASE   MOUNT...

The answer, which you can find in My Oracle Support note 1133713.1, is that Oracle does not support 4k devices with direct I/O. This has been the case for a long time – I remember first discovering this nearly two years ago, on 11.2.0.2, yet there is no sign of it being fixed. According to the note, “It is not yet known in which version this support will be available.” Pah.

There’s More: Diagnostic Destination on 4k

And then there’s the diagnostic destination. How about if I choose to locate this on a 4k filesystem?

SQL> show parameter diagnostic_dest

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
diagnostic_dest 		     string	 /u01/app/oracle

SQL> alter system set diagnostic_dest='/fstest/fs4ktest/oracle' scope=spfile;

System altered.

I’ll give it a few minutes and then go and look in some of the files… guess what I see?

ORA-48101: error encountered when attempting to read a file [block] [/fstest/fs4ktest/oracle/diag/rdbms/fstest/fstest/metadata/INCIDENT.ams] [0]
ORA-27072: File I/O error
Linux-x86_64 Error: 22: Invalid argument
Additional information: 4
Additional information: 1
Additional information: -1

Look familiar? (This is nothing to do with direct I/O by the way, I disabled that again before this test.)

So let’s be honest, things aren’t going all that well here. There are still a lot of things that do not appear to work properly when using 4k devices. Luckily, my Violin array can present storage as 512 byte to avoid this sort of issue, but really I feel that Oracle needs to get cracking on its Advanced Format support. This is not just a flash memory thing: pretty much every major disk vendor is making Advanced Format devices now, from Western Digital through HGST to Seagate.

Time to get with the programme?

Oracle ASMLib: Physical and Logical Blocksize

This article is about the use of Advanced Format devices with Oracle’s ASMLib kernel library for Linux. For background, read this page on 4k sector sizes first, otherwise it might all sound like nonsense. Mind you, it might sound like nonsense anyway, I can’t guarantee anything here. By the way, a big hello to my buddy Nate who asked for this information: you rock, dude.

In more recent versions of ASMLib, Oracle introduced a new parameter into the /etc/sysconfig/oracleasm file:

[root@half-server4 mapper]# tail -5 /etc/sysconfig/oracleasm
# ORACLEASM_USE_LOGICAL_BLOCK_SIZE: 'true' means use the logical block size
# reported by the underlying disk instead of the physical. The default
# is 'false'
ORACLEASM_USE_LOGICAL_BLOCK_SIZE=false

To understand what this parameter does, consider this device which I am presenting from a Violin array:

[root@half-server4 ~]# ls -l /dev/mapper/testlun
lrwxrwxrwx 1 root root 8 Feb 27 15:33 /dev/mapper/testlun -> ../dm-19
[root@half-server4 ~]# fdisk -l /dev/mapper/testlun

Disk /dev/mapper/testlun: 34.4 GB, 34359738368 bytes
255 heads, 63 sectors/track, 4177 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes

The important bit is the Sector size line: this device has a 4k physical blocksize (as all Violin devices do, as well as many other modern storage systems) but a 512 byte logical blocksize. Essentially, this LUN is pretending to be a 512 byte device.

Now that’s all well and good. Operating systems and applications that cannot support 4k block sizes (e.g. Red Hat 5 and Oracle Linux 5) will happily use this, because they believe it to be 512 byte. But later versions of ASMLib have started being too clever for their own good.

Don’t Look Behind The Curtain

Let’s create an ASMLib label on this device:

[root@half-server4 ~]# oracleasm createdisk TESTLUN /dev/mapper/testlun
Writing disk header: done
Instantiating disk: done

And now we can attempt to put an ASM diskgroup on it:

SQL> CREATE DISKGROUP TEST EXTERNAL REDUNDANCY
DISK 'ORCL:TESTLUN'
ATTRIBUTE
     'sector_size'='512',
     'compatible.asm' = '11.2',
     'compatible.rdbms' = '11.2';  
CREATE DISKGROUP TEST EXTERNAL REDUNDANCY
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15038: disk '' mismatch on 'Sector Size' with target disk group [4096]
[512]

What happened? Well, ASMLib has looked behind the smoke and mirrors and decided that this is actually a 4k device. It’s therefore presenting this to Oracle ASM as 4k, which causes the problem (because I explicitly asked for sector size to be 512 byte on this diskgroup).

One possible solution is to change the ASM_DISKSTRING from its default value of NULL (meaning ‘ORCL:*’) to ‘/dev/oracleasm/disks/*’, i.e. the location where ASMLib creates its own block devices. We can test this theory with fdisk:

[oracle@half-server4 ~]$ ls -l /dev/oracleasm/disks/TESTLUN 
brw-rw---- 1 oracle dba 252, 19 Feb 27 15:38 /dev/oracleasm/disks/TESTLUN
[oracle@half-server4 ~]$ fdisk -l /dev/oracleasm/disks/TESTLUN | grep "Sector size"
Sector size (logical/physical): 512 bytes / 4096 bytes
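For completeness, the change itself would be a one-liner against the ASM instance – a sketch only, leaving scope and sid at their defaults:

SQL> alter system set asm_diskstring = '/dev/oracleasm/disks/*';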

So that would work. But it would lose many of the claimed benefits of ASMLib such as reduced file descriptors and context switching. Also, it feels like a hack.

Setting ORACLEASM_USE_LOGICAL_BLOCK_SIZE

The answer, as you probably guessed, is to set this new parameter. It defaults, wrongly in my opinion, to using the physical block size. We can either edit the value in the file to be true in order to use the logical blocksize, or preferably use the oracleasm configure command:

[root@half-server4 ~]# oracleasm configure -b
Writing Oracle ASM library driver configuration: done
[root@half-server4 ~]# oracleasm configure | grep ORACLEASM_USE_LOGICAL_BLOCK_SIZE
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="true"

It can be set back to using the physical blocksize with the following command:

[root@half-server4 ~]# oracleasm configure -p
Writing Oracle ASM library driver configuration: done
[root@half-server4 ~]# oracleasm configure | grep ORACLEASM_USE_LOGICAL_BLOCK_SIZE
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"

Finally, a word of warning. If you are like me, then you are a bit stupid and can’t follow instructions properly. I set the value of the parameter to TRUE in upper case and then spent hours wondering why it didn’t work. The answer, to my embarrassment, is that it’s case sensitive. TRUE is not a valid value so it defaults to false. Doh.

Oracle AWR Reports: Understanding I/O Statistics

One consequence of my job is that I spend a lot of time looking at Oracle Automatic Workload Repository reports, specifically at information about I/O. I really do mean a lot of time (honestly, I’m not kidding, I have had dreams about AWR reports). One thing that comes up very frequently is confusion about how the measurements of IOPS and throughput are displayed in the AWR report Load Profile section. The answer is that they aren’t. Well, not exactly… let me explain.

Physical Read and Write I/O

Right at the top of an AWR report, just after the Database and Host details, you’ll find the familiar Load Profile section. Until recently, it had changed very little through the releases of Oracle since its introduction in 10g. Here’s a sample from 11g Release 2:

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               44.1                0.4       0.07       1.56
       DB CPU(s):                1.6                0.0       0.00       0.06
       Redo size:      154,034,644.3        1,544,561.0
    Logical read:          154,436.1            1,548.6
   Block changes:           82,491.9              827.2
  Physical reads:              150.6                1.5
 Physical writes:           18,135.2              181.9
      User calls:               28.3                0.3
          Parses:              142.7                1.4
     Hard parses:                7.5                0.1
W/A MB processed:                2.1                0.0
          Logons:                0.1                0.0
        Executes:              607.7                6.1
       Rollbacks:                0.0                0.0
    Transactions:               99.7

In my role I have to look at the amount of I/O being driven by a database, so I can size a solution based on flash memory. This means knowing two specific metrics: the number of I/Os per second (IOPS) and the throughput (typically measured in MB/sec). I need to know these values for both read and write I/O so that I can understand the ratio. I also want to understand things like the amount of random versus sequential I/O, but that’s beyond the scope of this post.

The first thing to understand is that none of this information is shown above. There are values for Physical reads and Physical writes but these are actually measured in database blocks. Even if we knew the block size (which we don’t because Oracle databases can have multiple block sizes) we do not know how many I/Os were required. Ten Oracle blocks could be written in one sequential I/O or ten individual “random” I/Os, completely changing the IOPS measurement. To find any of this information we have to descend into the depths of the AWR report to find the Instance Activity Stats section.

In Oracle 12c, the format of the AWR report changed, especially the AWR Load Profile section, which was modified to show the units that each measurement uses. It also includes some new lines such as Read/Write IO Requests and Read/Write IO. Here’s a sample from a 12c database (taken during a 30 second run of SLOB):

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):              44.1               0.4      0.07      1.56
              DB CPU(s):               1.6               0.0      0.00      0.06
      Redo size (bytes):     154,034,644.3       1,544,561.0
  Logical read (blocks):         154,436.1           1,548.6
          Block changes:          82,491.9             827.2
 Physical read (blocks):             150.6               1.5
Physical write (blocks):          18,135.2             181.9
       Read IO requests:             150.3               1.5
      Write IO requests:          15,096.9             151.4
           Read IO (MB):               1.2               0.0
          Write IO (MB):             141.7               1.4
             User calls:              28.3               0.3
           Parses (SQL):             142.7               1.4
      Hard parses (SQL):               7.5               0.1
     SQL Work Area (MB):               2.1               0.0
                 Logons:               0.1               0.0
         Executes (SQL):             607.7               6.1
              Rollbacks:               0.0               0.0
           Transactions:              99.7

Now, you might be forgiven for thinking that the Read/Write IO requests and Read/Write IO (MB) values above tell me the very IOPS and throughput information I need. If this were the case, we could say that this system performed 150 physical read IOPS and 15k write IOPS, with throughput of 1.2 MB/sec reads and 141.7 MB/sec writes. Right?

But that isn’t the case – and to understand why, we need to page down five thousand times through the increasingly-verbose AWR report until we eventually find the Other Instance Activity Stats section (or just Instance Activity Stats in pre-12c reports) and see this information (edited for brevity):

Other Instance Activity Stats                  DB/Inst: ORCL/orcl  Snaps: 7-8
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                     5,123          150.3           1.5
physical read bytes                      42,049,536    1,233,739.3      12,371.2
physical read total IO requests              37,162        1,090.3          10.9
physical read total bytes            23,001,900,544  674,878,987.9   6,767,255.2
physical read total multi block              21,741          637.9           6.4
....
physical write IO requests                  514,547       15,096.9         151.4
physical write bytes                  5,063,483,392  148,563,312.9   1,489,698.0
physical write total IO requests            537,251       15,763.0         158.1
physical write total bytes           18,251,309,056  535,495,967.4   5,369,611.4
physical write total multi block             18,152          532.6           5.3

The request and byte figures match up with those above, albeit with the throughput values using different units of bytes/sec instead of MB/sec. But the problem is, these aren’t the “total” values. So what are those total values showing us?

Over to the Oracle Database 12c Reference Guide:

physical read IO requests:  Number of read requests for application activity (mainly buffer cache and direct load operation) which read one or more database blocks per request. This is a subset of “physical read total IO requests” statistic.

physical read total IO requests: Number of read requests which read one or more database blocks for all instance activity including application, backup and recovery, and other utilities. The difference between this value and “physical read total multi block requests” gives the total number of single block read requests.

The values that don’t have the word total in them, i.e. the values shown in the Load Profile section at the start of a report, are only showing what Oracle describes as “application activity“. That’s all very well, but it’s meaningless if you want to know how much your database is driving your storage. This is why the values with total in the name are the ones you should consider. And in the case of my sample report above, there is a massive discrepancy between the two: for example, the read throughput value for application activity is just 1.2 MB/sec while the total value is actually 644 MB/sec – over 500x higher! That extra non-application activity is definitely worth knowing about. (In fact, I was running a highly parallelised RMAN backup during the test just to make the point…)
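Incidentally, if you want to watch these totals live rather than waiting for an AWR snapshot, the same cumulative statistics are exposed in v$sysstat. A sketch only – the values are counters since instance startup, so sample twice and difference them to get a rate:

select name, value
  from v$sysstat
 where name in ('physical read total IO requests',
                'physical read total bytes',
                'physical write total IO requests',
                'physical write total bytes');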

[Note: There was another section here detailing how to find and include the I/O generated by redo into the totals, but after consultation with guru and legend Tanel Poder it’s come to my attention that this is incorrect. In fact, reads and writes to redo logs are included in the physical read/write total statistics…]

Oracle 12c IO Profile Section

Luckily, Oracle 12c now has a new section which presents all the information in one table. Here’s a sample extracted from the same report as above:

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:          16,853.4         1,090.3        15,763.0
         Database Requests:          15,247.2           150.3        15,096.9
        Optimized Requests:               0.1             0.0             0.0
             Redo Requests:             517.5             1.2           516.3
                Total (MB):           1,154.3           643.6           510.7
             Database (MB):             142.9             1.2           141.7
      Optimized Total (MB):               0.0             0.0             0.0
                 Redo (MB):             295.7             0.0           295.7
         Database (blocks):          18,285.8           150.6        18,135.2
 Via Buffer Cache (blocks):          18,282.1           150.0        18,132.0
           Direct (blocks):               3.7             0.6             3.1

Suddenly life is more simple. You want to know the total IOPS and throughput? It’s all in one place. You want to calculate the ratio of reads to writes? Just compare the read and write columns. Happy days.

One word of warning though: there are other database processes driving I/O which may not be tracked in these statistics. I see no evidence for control file reads and writes being shown, although these are insignificant in magnitude. More significant would be I/O from the archiver process for databases running in archive log mode, as each redo log must be sequentially read and re-written out as an archive log. Are these included? Yet another possibility would be the Recovery Writer (RVWR) process which is responsible for writing flashback logs when database flashback logging is enabled. [Discussions with Jonathan Lewis suggest these stats are all included – and let’s face it, he wrote the book on the subject…!]  It all adds up… Oracle really needs to provide better clarity on what these statistics are measuring.

Conclusion

If you want to know how much I/O is being driven by your database, do not use the information in the Load Profile section of an AWR report. Use the I/O Profile section if available, or otherwise skip to the Instance Activity Stats section and look at the total values for physical reads and writes (and redo). Everything else is just lies, damned lies and (I/O) statistics.

Oracle Fixes The 4k SPFILE Problem…But It’s Still Broken

As anyone familiar with the use of Oracle on Advanced Format storage devices will know to their cost, Oracle has had some difficulties implementing support of 4k devices. Officially, support for devices with a 4096 byte sector size was introduced in Oracle 11g Release 2 (see section 4.8.1.4 of the New Features Guide) but actually, if the truth be told, there were some holes.

(Before reading on, if you aren’t sure what I’m talking about here then please have a read of this page…)

I should say at this point that most 4k Advanced Format storage products have the ability to offer 512 byte emulation, which means any of the problems shown here can be avoided with very little effort (or performance overhead), but since 4096 byte devices are widely expected to take over, it would be nice if Oracle could tighten up some of the problems. After all, it’s not just flash memory devices that tend to be 4k-based: Toshiba, HGST, Seagate and Western Digital are all making hard disk drives that use Advanced Format too.

The SPFILE Problem in <11.2.0.4

Given that 4k devices are allegedly supported in Oracle 11g Release 2 you would think it would make sense that you can provision a load of 4k LUNs and then install the Oracle Grid Infrastructure and Database software on them. But no, in versions up to and including 11.2.0.3 this caused a problem with the SPFILE.

Here’s my sample system. I have 10 LUNs all with a physical and logical blocksize of 4k. I’m using Oracle’s ASMLib kernel driver to present them to ASM and the 4k logical and physical properties are preserved through into the ASMLib device too:

[root@half-server4 mapper]# fdisk -l /dev/mapper/slobdata1 | grep "Sector size"
Sector size (logical/physical): 4096 bytes / 4096 bytes
[root@half-server4 mapper]# oracleasm querydisk /dev/mapper/slobdata1
Device "/dev/mapper/slobdata1" is marked an ASM disk with the label "SLOBDATA1"
[root@half-server4 mapper]# fdisk -l /dev/oracleasm/disks/SLOBDATA1 | grep "Sector size"
Sector size (logical/physical): 4096 bytes / 4096 bytes

Next I’ve installed 11.2.0.3 Grid Infrastructure and created an ASM diskgroup on these LUNs. As you can see, Oracle has successfully spotted the devices are 4k and correspondingly set the ASM diskgroup sector size to 4096:

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N        4096   4096  1048576    737280   737207                0          737207              0             N  DATA/

Time to install the database. I’ll just fire up the 11.2.0.3 OUI and install the database software complete with a default database, choosing to locate the database files in this +DATA diskgroup. What could possibly go wrong?

The installer gets as far as running the database configuration assistant and then crashes out with the message:

PRCR-1079 : Failed to start resource ora.orcl.db
CRS-5017: The resource action "ora.orcl.db start" encountered the following error:
ORA-01078: failure in processing system parameters
ORA-15081: failed to submit an I/O operation to a disk
ORA-27091: unable to queue I/O
ORA-17507: I/O request size 512 is not a multiple of logical block size
ORA-06512: at line 4
. For details refer to "(:CLSN00107:)" in "/home/OracleHome/product/11.2.0/grid/log/half-server4/agent/ohasd/oraagent_oracle/oraagent_oracle.log".

Why did this happen? The clue is in the ORA-17507 message about the I/O request size. Over the years that this has been happening (and trust me, it’s been happening for far too long) various notes have appeared on My Oracle Support, such as 1578983.1 and 14626924.8. The cause is the following bug:

Bug 14626924  Not able to read spfile from ASM diskgroup and disk with sector size of 4096

At the time of writing, this bug is shown as fixed in 11.2.0.4 and the 12.2 forward code stream, with backports available for 11.2.0.2.0, 11.2.0.3.0, 11.2.0.3.7, 12.1.0.1.0 and 12.1.0.1.2. Alternatively, there is the simple workaround (documented in my Install Cookbooks) of placing the SPFILE in a non-4k location.
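That workaround boils down to relocating the SPFILE onto 512-byte storage, for example into the local $ORACLE_HOME. A sketch only – the paths here are illustrative:

SQL> create pfile='/tmp/initorcl.ora' from spfile;
SQL> create spfile='/u01/app/oracle/product/11.2.0/dbhome_1/dbs/spfileorcl.ora'
       from pfile='/tmp/initorcl.ora';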

What the heck, I have the 11.2.0.4 binaries stored locally, so let’s fire it up and see the fixed SPFILE in action.

11.2.0.4 with 4k Devices

As with the previous example, I have a set of 4k LUNs presented via ASMLib – I won’t repeat the output from above as it’s identical. The ASM diskgroup correctly shows the sector size as 4096, so we’re ready to install the database software and let it create a database. As before the database files will be located in the diskgroup – including the SPFILE – but this time, it won’t fail because bug 14626924 is fixed in 11.2.0.4 right? Right?

[screenshot: oracle-11204-spfile-error]

Oh dear. That’s not the error-free installation we were hoping for. Also, one of my pet hates, why are these Clusterware messages so incredibly unhelpful? Looking in the oraagent_oracle.log file we find the following nugget of information:

ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Process ID: 0
Session ID: 0 Serial number: 0

That’s not entirely useful either, so let’s try and fire up the database manually:

[oracle@half-server4 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Wed Feb 12 17:56:29 2014
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.
SQL> startup nomount
ORA-01078: failure in processing system parameters
ORA-17510: Attempt to do i/o beyond file size

Aha! The problem happens even attempting to start in NOMOUNT mode, so it seems likely this is related to reading the PFILE or SPFILE. Let’s just check to see what we have:

[oracle@half-server4 ~]$ ls -l $ORACLE_HOME/dbs/initorcl.ora
-rw-r----- 1 oracle oinstall 35 Feb 12 17:26 /u01/app/oracle/product/11.2.0/dbhome_1/dbs/initorcl.ora
[oracle@half-server4 ~]$ cat $ORACLE_HOME/dbs/initorcl.ora
SPFILE='+DATA/orcl/spfileorcl.ora'

Sure enough, we have an SPFILE located in the ASM diskgroup… and we cannot read it. Could it possibly be that even in 11.2.0.4 there are problems with SPFILEs being located in 4k devices? Searching My Oracle Support for ORA-17510 initially draws a blank. But a search of the Oracle bug database for the previous bug number (14626924) brings up some interesting new bugs:

Bug 16870214 : DB STARTUP FAILS WITH ORA-17510 IF SPFILE IS IN 4K SECTOR SIZE DISKGROUP

In the description of this bug, the following statement is made:

PROBLEM:
--------
ORA-17510: Attempt to do i/o beyond file size after applying patch 14626924 
TO READ SPFILE FROM ASM DISKGROUP AND DISK WITH SECTOR SIZE OF 4096 

DIAGNOSTIC ANALYSIS:
--------------------
1. create init file initvarial.ora in dbs directory with below
spfile='+DATA1/VARIAL/spfilevarial.ora'

2. startup pfile=/u01/app/oracle/product/11.2.0.3/db_1/dbs/initvarial.ora

SQL> startup pfile=/u01/app/oracle/product/11.2.0.3/db_1/dbs/initvarial.ora
ORA-17510: Attempt to do i/o beyond file size

This looks very similar. Unfortunately this bug is marked as a duplicate of base bug 18016679, which sadly is unpublished. All we know about it is that, at the time of writing, it isn’t fixed – the status of the duplicate is still “Waiting for the base bug fix“.

So there we have it. The infamous 4k SPFILE issue is fixed in 11.2.0.4 and replaced with something else that makes it equally unusable. For now, we’ll just have to keep those SPFILEs in 512 byte devices…

Update Feb 14th 2014

I kind of had a feeling that the above problem was in some way related to the use of ASMLib, so I thought I’d repeat the entire 11.2.0.4 install using normal block devices. Essentially this means changing the ASM discovery path from its default value of ‘ORCL:*’ to the path of the device mapper multipath devices, which in my case is ‘/dev/mapper/slob*’.

This time we don’t even get through the Grid Infrastructure installation, which fails while running the Oracle ASM Configuration Assistant (asmca), giving the following error messages:

[main] [ 2014-02-14 15:55:22.112 GMT ] [UsmcaLogger.logInfo:143]  CREATE DISKGROUP SQL: CREATE DISKGROUP DATA EXTERNAL REDUNDANCY  DISK '/dev/mapper/slobdata1', '/dev/mapper/slobdata2', '/dev/mapper/slobdata3', '/dev/mapper/slobdata4',
'/dev/mapper/slobdata5', '/dev/mapper/slobdata6', '/dev/mapper/slobdata7', '/dev/mapper/slobdata8'
ATTRIBUTE 'compatible.asm'='11.2.0.0.0','au_size'='1M'
[main] [ 2014-02-14 15:55:22.206 GMT ] [SQLEngine.done:2189]  Done called
[main] [ 2014-02-14 15:55:22.224 GMT ] [UsmcaLogger.logException:173]  SEVERE:method oracle.sysman.assistants.usmca.backend.USMDiskGroupManager:createDiskGroups
[main] [ 2014-02-14 15:55:22.224 GMT ] [UsmcaLogger.logException:174]  ORA-15018: diskgroup cannot be created
ORA-27061: waiting for async I/Os failed

It seems Oracle still has some way to go before this will work properly…

Playing The Data Reduction Lottery

Storage for DBAs: Do you want to sell your house? Or your car? Let’s go with the car – just indulge me on this one. You have a car, which you weren’t especially planning on selling, but I’m making you an offer you can’t refuse. I’m offering you one million dollars so how can you say no?

The only thing is, when we come to make the trade I turn up not with a suitcase full of cash but a single Mega Millions lottery ticket. How would you feel about that? You may well feel aggrieved that I am offering you something which cost me just $1 but my response is this: it has an effective value of well over $1m. Does that work for you?

Blurred Lines

The thing is, this happens all the time in product marketing and we just put up with it. Oracle’s new Exadata Database Machine X4-2 has 44.8TB of raw flash in a full rack configuration, yet the datasheet states it has an effective flash capacity of 448TB. Excuse me? Let’s read the small print to find out what this means: apparently this is “the size of the data files that can often be stored in Exadata and be accessed at the speed of flash memory“. No guarantees then, you just might get that, if you’re lucky. I thought datasheets were supposed to be about facts?

Meanwhile, back in storageland, a look at some of the datasheets from various flash array vendors throws up a similar practice. One vendor shows the following flash capacity figures for their array:

  • 2.75 – 11 TBs raw capacity
  • 5 – 50 TBs effective capacity

In my last two posts I covered deduplication and data compression as part of an overall data reduction strategy in storage. To recap, I gave my opinion that dedupe has no place with databases (although it has major benefits in workloads such as VDI) while data compression has benefits but is not necessarily best implemented at the storage level.

Here’s the thing. Your database vendor’s software has options that allow you to perform data reduction. You can also buy host-level software to do this. And of course, you can buy storage products that do this too. So which is best? It probably depends on which vendor you ask (i.e. database, host-level or storage), since each one is chasing revenue for that option – and in some storage vendor cases the data reduction is “always on”, which means you get it whether you want it or not (and whether you want to pay for it or not). But what you should know is this: your friendly flash storage vendor has the most to gain or lose when it comes to data reduction software.

Lies, Damned Lies and Capacities

When you purchase storage, you invariably buy it at a value based on price per usable capacity, most commonly using the unit of dollars per GB. This is simply a convenient way of comparing the price of competing products which may otherwise have different capacities: if a storage array costs $X and gives you Y GB of usable capacity, then the price in $/GB (dollars per gig) is therefore X/Y.

Now this practice originally developed when buying disk arrays – and there are some arguments to be made that $/GB carries less significance with flash… but everyone does it. Even if you aren’t doing it, chances are somebody in your purchasing department is. And even though it may not be the best way to compare two different products, you can bet that the vendor whose product has the lowest $/GB price will be the one looking most comfortable when it comes to decision day.

But what if there was a way to massage those figures? Each vendor wants to beat the competition, so they start to say things like, “Hey, what about if you use our storage compression features? On average our customers see a 10x reduction in data. This means the usable capacity is actually 10Y!“. Wouldn’t you know it? The price per gig (which is now X/10Y) just came down by 90%!
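To put purely illustrative numbers on it: a $1,000,000 array with 20TB usable works out at $50/GB. Wave a claimed 10:1 data reduction ratio at it and the “effective” capacity becomes 200TB, so exactly the same box can now be marketed at $5/GB – without a single extra gigabyte of flash being shipped.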

The First Rule of Compression

You all know this, but I’m going to say it anyway. Different sets of data result in different levels of compression (and deduplication). It’s obvious. Yet in the sterile environment of datasheets and TCO calculations it often gets overlooked. So let me spell it out for once and for all:

The first rule of compression is that the compression ratio is entirely dependent on the data being compressed.

Thus if you are buying or selling a product that uses compression, deduplication and data reduction, you cannot make any guarantees. Sure you can talk about “average compression ratios”, but what does that mean? Is there really such a thing as the average dataset?

Conclusion: Know What You Are Paying For

It’s a very simple message: when you buy a flash array (or indeed any storage array) be sure to understand the capacity values you are buying and paying for. Dollar per GB values are only relevant with usable capacities, not so-called effective or logical capacities. Also, don’t get too hung up on raw capacity values, since they won’t help you when you run out of usable space.

Definitions are important. Without them, nothing we talk about is … well, definite. So here are mine:
