Oracle AWR Reports: Understanding I/O Statistics

One consequence of my job is that I spend a lot of time looking at Oracle Automatic Workload Repository reports, specifically at information about I/O. I really do mean a lot of time (honestly, I’m not kidding, I have had dreams about AWR reports). One thing that comes up very frequently is confusion about how IOPS and throughput are displayed in the Load Profile section of the AWR report. The answer is that they aren’t. Well, not exactly… let me explain.

Physical Read and Write I/O

Right at the top of an AWR report, just after the Database and Host details, you’ll find the familiar Load Profile section. Until recently, it had changed very little through the releases of Oracle since its introduction in 10g. Here’s a sample from 11g Release 2:

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               44.1                0.4       0.07       1.56
       DB CPU(s):                1.6                0.0       0.00       0.06
       Redo size:      154,034,644.3        1,544,561.0
    Logical read:          154,436.1            1,548.6
   Block changes:           82,491.9              827.2
  Physical reads:              150.6                1.5
 Physical writes:           18,135.2              181.9
      User calls:               28.3                0.3
          Parses:              142.7                1.4
     Hard parses:                7.5                0.1
W/A MB processed:                2.1                0.0
          Logons:                0.1                0.0
        Executes:              607.7                6.1
       Rollbacks:                0.0                0.0
    Transactions:               99.7

In my role I have to look at the amount of I/O being driven by a database, so I can size a solution based on flash memory. This means knowing two specific metrics: the number of I/Os per second (IOPS) and the throughput (typically measured in MB/sec). I need to know these values for both read and write I/O so that I can understand the ratio. I also want to understand things like the amount of random versus sequential I/O, but that’s beyond the scope of this post.

The first thing to understand is that none of this information is shown above. There are values for Physical reads and Physical writes but these are actually measured in database blocks. Even if we knew the block size (which we don’t because Oracle databases can have multiple block sizes) we do not know how many I/Os were required. Ten Oracle blocks could be written in one sequential I/O or ten individual “random” I/Os, completely changing the IOPS measurement. To find any of this information we have to descend into the depths of the AWR report to find the Instance Activity Stats section.
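To make that concrete, here is a minimal Python sketch showing how the same block rate maps to very different IOPS depending on how many blocks each I/O carries. The function name and the 8-blocks-per-I/O figure are illustrative assumptions of mine, not anything AWR reports.

```python
def iops_from_blocks(blocks_per_sec: float, blocks_per_io: float) -> float:
    """Convert a block rate into an I/O rate for an assumed average I/O size."""
    if blocks_per_io <= 0:
        raise ValueError("blocks_per_io must be positive")
    return blocks_per_sec / blocks_per_io


# "Physical writes" per second, taken from the Load Profile sample above.
physical_writes = 18135.2

# Two hypothetical extremes: every write is one block, or every write is 8 blocks.
single_block = iops_from_blocks(physical_writes, 1)
multi_block = iops_from_blocks(physical_writes, 8)

print(f"{single_block:,.1f} IOPS if every write is a single block")
print(f"{multi_block:,.1f} IOPS if every write carries 8 blocks")
```

The block count alone cannot tell you which of these two worlds you are in, which is why the request counts further down the report matter.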

In Oracle 12c, the format of the AWR report changed, especially the AWR Load Profile section, which was modified to show the units that each measurement uses. It also includes some new lines such as Read/Write IO Requests and Read/Write IO. Here’s a sample from a 12c database (taken during a 30 second run of SLOB):

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
             DB Time(s):              44.1               0.4      0.07      1.56
              DB CPU(s):               1.6               0.0      0.00      0.06
      Redo size (bytes):     154,034,644.3       1,544,561.0
  Logical read (blocks):         154,436.1           1,548.6
          Block changes:          82,491.9             827.2
 Physical read (blocks):             150.6               1.5
Physical write (blocks):          18,135.2             181.9
       Read IO requests:             150.3               1.5
      Write IO requests:          15,096.9             151.4
           Read IO (MB):               1.2               0.0
          Write IO (MB):             141.7               1.4
             User calls:              28.3               0.3
           Parses (SQL):             142.7               1.4
      Hard parses (SQL):               7.5               0.1
     SQL Work Area (MB):               2.1               0.0
                 Logons:               0.1               0.0
         Executes (SQL):             607.7               6.1
              Rollbacks:               0.0               0.0
           Transactions:              99.7

Now, you might be forgiven for thinking that the four new lines above (Read IO requests, Write IO requests, Read IO (MB) and Write IO (MB)) tell me the very IOPS and throughput information I need. If this were the case, we could say that this system performed 150 physical read IOPS and 15k write IOPS, with throughput of 1.2 MB/sec reads and 141.7 MB/sec writes. Right?

But that isn’t the case – and to understand why, we need to page down five thousand times through the increasingly-verbose AWR report until we eventually find the Other Instance Activity Stats section (or just Instance Activity Stats in pre-12c reports) and see this information (edited for brevity):

Other Instance Activity Stats                  DB/Inst: ORCL/orcl  Snaps: 7-8
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                     5,123          150.3           1.5
physical read bytes                      42,049,536    1,233,739.3      12,371.2
physical read total IO requests              37,162        1,090.3          10.9
physical read total bytes            23,001,900,544  674,878,987.9   6,767,255.2
physical read total multi block              21,741          637.9           6.4
....
physical write IO requests                  514,547       15,096.9         151.4
physical write bytes                  5,063,483,392  148,563,312.9   1,489,698.0
physical write total IO requests            537,251       15,763.0         158.1
physical write total bytes           18,251,309,056  535,495,967.4   5,369,611.4
physical write total multi block             18,152          532.6           5.3

The physical read/write IO requests and physical read/write bytes values match up with the Load Profile figures above, albeit with the throughput values using different units of bytes/sec instead of MB/sec. But the problem is, these aren’t the “total” values, which are the statistics with total in their names. So what are those total values showing us?

Over to the Oracle Database 12c Reference Guide:

physical read IO requests: Number of read requests for application activity (mainly buffer cache and direct load operation) which read one or more database blocks per request. This is a subset of “physical read total IO requests” statistic.

physical read total IO requests: Number of read requests which read one or more database blocks for all instance activity including application, backup and recovery, and other utilities. The difference between this value and “physical read total multi block requests” gives the total number of single block read requests.

The values that don’t have the word total in them, i.e. the values shown in the AWR Load Profile section at the start of a report, only show what Oracle describes as “application activity”. That’s all very well, but it’s meaningless if you want to know how much I/O your database is driving to your storage. This is why the values with total in the name are the ones you should consider. And in the case of my sample report above, there is a massive discrepancy between the two: for example, the read throughput value for application activity is just 1.2 MB/sec while the total value is actually 644 MB/sec – over 500x higher! That extra non-application activity is definitely worth knowing about. (In fact, I was running a highly parallelised RMAN backup during the test just to make the point…)
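As a rough illustration, the sketch below parses a few of the statistic lines from the sample above and converts the byte rates into MB/sec. The regular expression and the helper name per_second_stats are my own assumptions about the text layout of this particular report, not an official AWR parser. I have taken 1 MB to be 1,048,576 bytes because that is what reproduces the figures in the Load Profile.

```python
import re

# Lines copied from the Other Instance Activity Stats sample above.
AWR_TEXT = """\
physical read IO requests                     5,123          150.3           1.5
physical read bytes                      42,049,536    1,233,739.3      12,371.2
physical read total IO requests              37,162        1,090.3          10.9
physical read total bytes            23,001,900,544  674,878,987.9   6,767,255.2
physical write IO requests                  514,547       15,096.9         151.4
physical write bytes                  5,063,483,392  148,563,312.9   1,489,698.0
physical write total IO requests            537,251       15,763.0         158.1
physical write total bytes           18,251,309,056  535,495,967.4   5,369,611.4
"""

# Assumed layout: statistic name, two or more spaces, then Total, per Second, per Trans.
LINE_RE = re.compile(r"^(physical .+?)\s{2,}([\d,]+)\s+([\d,.]+)\s+([\d,.]+)\s*$")

MIB = 1024 * 1024


def per_second_stats(text):
    """Return {statistic name: per-second value} for every matching line."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(3).replace(",", ""))
    return stats


stats = per_second_stats(AWR_TEXT)
for direction in ("read", "write"):
    total_iops = stats[f"physical {direction} total IO requests"]
    total_mb = stats[f"physical {direction} total bytes"] / MIB
    app_mb = stats[f"physical {direction} bytes"] / MIB
    print(f"{direction}: {total_iops:,.1f} total IOPS, "
          f"{total_mb:,.1f} MB/sec total vs {app_mb:,.1f} MB/sec application")
```

A real report has many more statistics and the column widths vary between versions, so treat the pattern as a starting point rather than something to rely on.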

[Note: There was another section here detailing how to find and include the I/O generated by redo into the totals, but after consultation with guru and legend Tanel Poder it’s come to my attention that this is incorrect. In fact, reads and writes to redo logs are included in the physical read/write total statistics…]

Oracle 12c IO Profile Section

Luckily, Oracle 12c now has a new section which presents all the information in one table. Here’s a sample extracted from the same report as above:

IO Profile                  Read+Write/Second     Read/Second    Write/Second
~~~~~~~~~~                  ----------------- --------------- ---------------
            Total Requests:          16,853.4         1,090.3        15,763.0
         Database Requests:          15,247.2           150.3        15,096.9
        Optimized Requests:               0.1             0.0             0.0
             Redo Requests:             517.5             1.2           516.3
                Total (MB):           1,154.3           643.6           510.7
             Database (MB):             142.9             1.2           141.7
      Optimized Total (MB):               0.0             0.0             0.0
                 Redo (MB):             295.7             0.0           295.7
         Database (blocks):          18,285.8           150.6        18,135.2
 Via Buffer Cache (blocks):          18,282.1           150.0        18,132.0
           Direct (blocks):               3.7             0.6             3.1

Suddenly life is simpler. You want to know the total IOPS and throughput? It’s all in one place. You want to calculate the ratio of reads to writes? Just compare the read and write columns. Happy days.
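For example, here is a small sketch of the read/write comparison using the Total Requests and Total (MB) rows from the sample above. The helper names are hypothetical, and the average I/O size is a derived figure of my own (throughput divided by request rate), not something the report prints.

```python
# Per-second values from the "Total Requests" and "Total (MB)" rows above.
total_requests = {"read": 1090.3, "write": 15763.0}
total_mb = {"read": 643.6, "write": 510.7}


def read_write_split(values):
    """Return the read and write shares of the combined total, as percentages."""
    combined = values["read"] + values["write"]
    return {name: 100.0 * value / combined for name, value in values.items()}


def avg_io_size_kb(mb_per_sec, requests_per_sec):
    """Average size of one request in KB, assuming 1 MB = 1024 KB."""
    return mb_per_sec * 1024 / requests_per_sec


iops_split = read_write_split(total_requests)
mb_split = read_write_split(total_mb)
print(f"IOPS:       {iops_split['read']:.1f}% read / {iops_split['write']:.1f}% write")
print(f"Throughput: {mb_split['read']:.1f}% read / {mb_split['write']:.1f}% write")

for direction in ("read", "write"):
    size = avg_io_size_kb(total_mb[direction], total_requests[direction])
    print(f"Average {direction} size: {size:.1f} KB")
```

In this sample the reads are few but very large (roughly 600 KB each, which fits the RMAN backup) while the writes are many and small, so the read/write ratio looks completely different depending on whether you measure IOPS or throughput.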

One word of warning though: there are other database processes driving I/O which may not be tracked in these statistics. I see no evidence for control file reads and writes being shown, although these are insignificant in magnitude. More significant would be I/O from the archiver process for databases running in archive log mode, as each redo log must be sequentially read and re-written out as an archive log. Are these included? Yet another possibility would be the Recovery Writer (RVWR) process which is responsible for writing flashback logs when database flashback logging is enabled. [Discussions with Jonathan Lewis suggest these stats are all included – and let’s face it, he wrote the book on the subject…!] It all adds up… Oracle really needs to provide better clarity on what these statistics are measuring.

Conclusion

If you want to know how much I/O is being driven by your database, do not use the information in the Load Profile section of an AWR report. Use the IO Profile section if available, or otherwise skip to the Instance Activity Stats section and look at the total values for physical reads and writes (which, as noted above, already include redo). Everything else is just lies, damned lies and (I/O) statistics.

Oracle Fixes The 4k SPFILE Problem…But It’s Still Broken

As anyone familiar with the use of Oracle on Advanced Format storage devices will know to their cost, Oracle has had some difficulties implementing support of 4k devices. Officially, support for devices with a 4096 byte sector size was introduced in Oracle 11g Release 2 (see section 4.8.1.4 of the New Features Guide) but actually, if the truth be told, there were some holes.

(Before reading on, if you aren’t sure what I’m talking about here then please have a read of this page…)

I should say at this point that most 4k Advanced Format storage products have the ability to offer 512 byte emulation, which means any of the problems shown here can be avoided with very little effort (or performance overhead), but since 4096 byte devices are widely expected to take over, it would be nice if Oracle could tighten up some of the problems. After all, it’s not just flash memory devices that tend to be 4k-based: Toshiba, HGST, Seagate and Western Digital are all making hard disk drives that use Advanced Format too.

The SPFILE Problem in <11.2.0.4

Given that 4k devices are allegedly supported in Oracle 11g Release 2 you would think it would make sense that you can provision a load of 4k LUNs and then install the Oracle Grid Infrastructure and Database software on them. But no, in versions up to and including 11.2.0.3 this caused a problem with the SPFILE.

Here’s my sample system. I have 10 LUNs all with a physical and logical blocksize of 4k. I’m using Oracle’s ASMLib kernel driver to present them to ASM and the 4k logical and physical properties are preserved through into the ASMLib device too:

[root@half-server4 mapper]# fdisk -l /dev/mapper/slobdata1 | grep "Sector size"
Sector size (logical/physical): 4096 bytes / 4096 bytes
[root@half-server4 mapper]# oracleasm querydisk /dev/mapper/slobdata1
Device "/dev/mapper/slobdata1" is marked an ASM disk with the label "SLOBDATA1"
[root@half-server4 mapper]# fdisk -l /dev/oracleasm/disks/SLOBDATA1 | grep "Sector size"
Sector size (logical/physical): 4096 bytes / 4096 bytes
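If you want to script the same check, one option is to read the values the Linux kernel exposes under sysfs. This is only a sketch under a few assumptions: a Linux host, a whole block device rather than a partition, and a /dev/mapper name that is a symlink to the kernel’s dm-N device. The function names are my own.

```python
import os


def sector_sizes(device_path, sysfs_root="/sys/block"):
    """Return (logical, physical) sector sizes in bytes for a whole block device.

    The path is resolved first so that a symlink such as /dev/mapper/slobdata1
    becomes the kernel device name (for example dm-3) used under sysfs.
    """
    kernel_name = os.path.basename(os.path.realpath(device_path))
    queue_dir = os.path.join(sysfs_root, kernel_name, "queue")
    sizes = []
    for attribute in ("logical_block_size", "physical_block_size"):
        with open(os.path.join(queue_dir, attribute)) as handle:
            sizes.append(int(handle.read().strip()))
    return tuple(sizes)


def is_4k_native(device_path, sysfs_root="/sys/block"):
    """True when both the logical and physical sector sizes are 4096 bytes."""
    return sector_sizes(device_path, sysfs_root) == (4096, 4096)


# Example call on a host that has the device from the output above:
# print(sector_sizes("/dev/mapper/slobdata1"))
```

A device offering 512 byte emulation would report a logical size of 512 with a physical size of 4096, so is_4k_native would return False for it.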

Next I’ve installed 11.2.0.3 Grid Infrastructure and created an ASM diskgroup on these LUNs. As you can see, Oracle has successfully spotted the devices are 4k and correspondingly set the ASM diskgroup sector size to 4096:

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N        4096   4096  1048576    737280   737207                0          737207              0             N  DATA/

Time to install the database. I’ll just fire up the 11.2.0.3 OUI and install the database software complete with a default database, choosing to locate the database files in this +DATA diskgroup. What could possibly go wrong?

The installer gets as far as running the database configuration assistant and then crashes out with the message:

PRCR-1079 : Failed to start resource ora.orcl.db
CRS-5017: The resource action "ora.orcl.db start" encountered the following error:
ORA-01078: failure in processing system parameters
ORA-15081: failed to submit an I/O operation to a disk
ORA-27091: unable to queue I/O
ORA-17507: I/O request size 512 is not a multiple of logical block size
ORA-06512: at line 4
. For details refer to "(:CLSN00107:)" in "/home/OracleHome/product/11.2.0/grid/log/half-server4/agent/ohasd/oraagent_oracle/oraagent_oracle.log".

Why did this happen? The clue is in the ORA-17507 line of the error message: a 512 byte I/O request is not a multiple of the 4096 byte logical block size. Over the years that this has been happening (and trust me, it’s been happening for far too long) various notes have appeared on My Oracle Support, such as 1578983.1 and 14626924.8. The cause is the following bug:

Bug 14626924 Not able to read spfile from ASM diskgroup and disk with sector size of 4096

At the time of writing, this bug is shown as fixed in 11.2.0.4 and the 12.2 forward code stream, with backports available for 11.2.0.2.0, 11.2.0.3.0, 11.2.0.3.7, 12.1.0.1.0 and 12.1.0.1.2. Alternatively, there is the simple workaround (documented in my Install Cookbooks) of placing the SPFILE in a non-4k location.
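The rule behind ORA-17507 is easy to model. The sketch below is my own illustration of the constraint the error message describes, not Oracle’s code: on a device with a given logical block size, both the offset and the size of every request must be a whole number of logical blocks.

```python
def is_valid_request(offset_bytes, size_bytes, logical_block_size):
    """True if the request starts and ends on logical block boundaries."""
    return (offset_bytes % logical_block_size == 0
            and size_bytes % logical_block_size == 0)


# A 512 byte read is fine on a 512 byte (or 512e) device...
print(is_valid_request(0, 512, 512))
# ...but cannot be issued against a device with 4096 byte logical blocks.
print(is_valid_request(0, 512, 4096))
```

This is also why the workaround works: on a 512 byte device the same 512 byte request is perfectly legal.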

What the heck, I have the 11.2.0.4 binaries stored locally, so let’s fire it up and see the fixed SPFILE in action.

11.2.0.4 with 4k Devices

As with the previous example, I have a set of 4k LUNs presented via ASMLib – I won’t repeat the output from above as it’s identical. The ASM diskgroup correctly shows the sector size as 4096, so we’re ready to install the database software and let it create a database. As before the database files will be located in the diskgroup – including the SPFILE – but this time, it won’t fail because bug 14626924 is fixed in 11.2.0.4 right? Right?

Oh dear. That’s not the error-free installation we were hoping for. Also, one of my pet hates, why are these Clusterware messages so incredibly unhelpful? Looking in the oraagent_oracle.log file we find the following nugget of information:

ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Process ID: 0
Session ID: 0 Serial number: 0

That’s not entirely useful either, so let’s try and fire up the database manually:

[oracle@half-server4 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Wed Feb 12 17:56:29 2014
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.
SQL> startup nomount
ORA-01078: failure in processing system parameters
ORA-17510: Attempt to do i/o beyond file size

Aha! The problem happens even attempting to start in NOMOUNT mode, so it seems likely this is related to reading the PFILE or SPFILE. Let’s just check to see what we have:

[oracle@half-server4 ~]$ ls -l $ORACLE_HOME/dbs/initorcl.ora
-rw-r----- 1 oracle oinstall 35 Feb 12 17:26 /u01/app/oracle/product/11.2.0/dbhome_1/dbs/initorcl.ora
[oracle@half-server4 ~]$ cat $ORACLE_HOME/dbs/initorcl.ora
SPFILE='+DATA/orcl/spfileorcl.ora'

Sure enough, we have an SPFILE located in the ASM diskgroup… and we cannot read it. Could it possibly be that even in 11.2.0.4 there are problems with SPFILEs being located in 4k devices? Searching My Oracle Support for ORA-17510 initially draws a blank. But a search of the Oracle bug database for the previous bug number (14626924) brings up some interesting new bugs:

Bug 16870214 : DB STARTUP FAILS WITH ORA-17510 IF SPFILE IS IN 4K SECTOR SIZE DISKGROUP

In the description of this bug, the following statement is made:

PROBLEM:
--------
ORA-17510: Attempt to do i/o beyond file size after applying patch 14626924 
TO READ SPFILE FROM ASM DISKGROUP AND DISK WITH SECTOR SIZE OF 4096 

DIAGNOSTIC ANALYSIS:
--------------------
1. create init file initvarial.ora in dbs directory with below
spfile='+DATA1/VARIAL/spfilevarial.ora'

2. startup pfile=/u01/app/oracle/product/11.2.0.3/db_1/dbs/initvarial.ora

SQL> startup pfile=/u01/app/oracle/product/11.2.0.3/db_1/dbs/initvarial.ora
ORA-17510: Attempt to do i/o beyond file size

This looks very similar. Unfortunately this bug is marked as a duplicate of base bug 18016679, which sadly is unpublished. All we know about it is that, at the time of writing, it isn’t fixed – the status of the duplicate is still “Waiting for the base bug fix”.

So there we have it. The infamous 4k SPFILE issue is fixed in 11.2.0.4 and replaced with something else that makes it equally unusable. For now, we’ll just have to keep those SPFILEs in 512 byte devices…

Update Feb 14th 2014

I kind of had a feeling that the above problem was in some way related to the use of ASMLib, so I thought I’d repeat the entire 11.2.0.4 install using normal block devices. Essentially this means changing the ASM discovery path from its default value of ‘ORCL:*’ to the path of the device mapper multipath devices, which in my case is ‘/dev/mapper/slob*’.

This time we don’t even get through the Grid Infrastructure installation, which fails while running the Oracle ASM Configuration Assistant (asmca), giving the following error messages:

[main] [ 2014-02-14 15:55:22.112 GMT ] [UsmcaLogger.logInfo:143]  CREATE DISKGROUP SQL: CREATE DISKGROUP DATA EXTERNAL REDUNDANCY  DISK '/dev/mapper/slobdata1', '/dev/mapper/slobdata2', '/dev/mapper/slobdata3', '/dev/mapper/slobdata4',
'/dev/mapper/slobdata5', '/dev/mapper/slobdata6', '/dev/mapper/slobdata7', '/dev/mapper/slobdata8'
ATTRIBUTE 'compatible.asm'='11.2.0.0.0','au_size'='1M'
[main] [ 2014-02-14 15:55:22.206 GMT ] [SQLEngine.done:2189]  Done called
[main] [ 2014-02-14 15:55:22.224 GMT ] [UsmcaLogger.logException:173]  SEVERE:method oracle.sysman.assistants.usmca.backend.USMDiskGroupManager:createDiskGroups
[main] [ 2014-02-14 15:55:22.224 GMT ] [UsmcaLogger.logException:174]  ORA-15018: diskgroup cannot be created
ORA-27061: waiting for async I/Os failed

It seems Oracle still has some way to go before this will work properly…

Playing The Data Reduction Lottery

Storage for DBAs: Do you want to sell your house? Or your car? Let’s go with the car – just indulge me on this one. You have a car, which you weren’t especially planning on selling, but I’m making you an offer you can’t refuse. I’m offering you one million dollars so how can you say no?

The only thing is, when we come to make the trade I turn up not with a suitcase full of cash but a single Mega Millions lottery ticket. How would you feel about that? You may well feel aggrieved that I am offering you something which cost me just $1 but my response is this: it has an effective value of well over $1m. Does that work for you?

Blurred Lines

The thing is, this happens all the time in product marketing and we just put up with it. Oracle’s new Exadata Database Machine X4-2 has 44.8TB of raw flash in a full rack configuration, yet the datasheet states it has an effective flash capacity of 448TB. Excuse me? Let’s read the small print to find out what this means: apparently this is “the size of the data files that can often be stored in Exadata and be accessed at the speed of flash memory”. No guarantees then, you just might get that, if you’re lucky. I thought datasheets were supposed to be about facts?

Meanwhile, back in storageland, a look at some of the datasheets from various flash array vendors throws up a similar practice. One vendor shows the following flash capacity figures for their array:

2.75 – 11 TBs raw capacity
5 – 50 TBs effective capacity

In my last two posts I covered deduplication and data compression as part of an overall data reduction strategy in storage. To recap, I gave my opinion that dedupe has no place with databases (although it has major benefits in workloads such as VDI) while data compression has benefits but is not necessarily best implemented at the storage level.

Here’s the thing. Your database vendor’s software has options that allow you to perform data reduction. You can also buy host-level software to do this. And of course, you can buy storage products that do this too. So which is best? It probably depends on which vendor you ask (i.e. database, host-level or storage), since each one is chasing revenue for that option – and in some storage vendor cases the data reduction is “always on”, which means you get it whether you want it or not (and whether you want to pay for it or not). But what you should know is this: your friendly flash storage vendor has the most to gain or lose when it comes to data reduction software.

Lies, Damned Lies and Capacities

When you purchase storage, you invariably buy it at a value based on price per usable capacity, most commonly using the unit of dollars per GB. This is simply a convenient way of comparing the price of competing products which may otherwise have different capacities: if a storage array costs $X and gives you Y GB of usable capacity, then the price in $/GB (dollars per gig) is therefore X/Y.

Now this practice originally developed when buying disk arrays – and there are some arguments to be made that $/GB carries less significance with flash… but everyone does it. Even if you aren’t doing it, chances are somebody in your purchasing department is. And even though it may not be the best way to compare two different products, you can bet that the vendor whose product has the lowest $/GB price will be the one looking most comfortable when it comes to decision day.

But what if there was a way to massage those figures? Each vendor wants to beat the competition, so they start to say things like, “Hey, what about if you use our storage compression features? On average our customers see a 10x reduction in data. This means the usable capacity is actually 10Y!”. Wouldn’t you know it? The price per gig (which is now X/10Y) just came down by 90%!
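Here is that arithmetic as a short sketch. The price, capacity and reduction ratios are made-up numbers purely for illustration; only the X/Y and X/10Y formulas come from the text above.

```python
def price_per_gb(price, usable_gb, reduction_ratio=1.0):
    """Price per GB, optionally inflating capacity by a data reduction ratio."""
    return price / (usable_gb * reduction_ratio)


price = 500_000.0      # hypothetical array price in dollars (X)
usable_gb = 10_000.0   # hypothetical usable capacity in GB (Y)

usable = price_per_gb(price, usable_gb)           # X / Y
marketed = price_per_gb(price, usable_gb, 10.0)   # X / 10Y, the vendor's "average"
achieved = price_per_gb(price, usable_gb, 2.0)    # what you pay if your data only shrinks 2x

print(f"Usable:   ${usable:.2f}/GB")
print(f"Marketed: ${marketed:.2f}/GB ({100 * (1 - marketed / usable):.0f}% lower)")
print(f"Achieved: ${achieved:.2f}/GB")
```

The gap between the marketed and achieved figures is the part nobody can promise you in advance.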

The First Rule of Compression

You all know this, but I’m going to say it anyway. Different sets of data result in different levels of compression (and deduplication). It’s obvious. Yet in the sterile environment of datasheets and TCO calculations it often gets overlooked. So let me spell it out for once and for all:

The first rule of compression is that the compression ratio is entirely dependent on the data being compressed.

Thus if you are buying or selling a product that uses compression, deduplication and data reduction, you cannot make any guarantees. Sure you can talk about “average compression ratios”, but what does that mean? Is there really such a thing as the average dataset?

Conclusion: Know What You Are Paying For

It’s a very simple message: when you buy a flash array (or indeed any storage array) be sure to understand the capacity values you are buying and paying for. Dollar per GB values are only relevant with usable capacities, not so-called effective or logical capacities. Also, don’t get too hung up on raw capacity values, since they won’t help you when you run out of usable space.

Definitions are important. Without them, nothing we talk about is … well, definite. So here are mine:

Oracle Exadata X4 (Part 2): The All Flash Database Machine?

This article looks at the new Oracle Exadata X4-2 Database Machine from Big Red. In part one I looked at the changes made from the X3 model (more stuff) as well as the implications (more license bills). I also covered some of the confusing and bewildering descriptions Oracle has used to describe the flash capacity of the X4. To recap, here are some of the quotes made in various Oracle literature:

Source	Quote
Oracle Exadata X4-2 datasheet	“44.8 TB of raw physical flash memory”
Oracle Exadata X4 Press Release	“logical flash cache capacity to 88 TB”
Oracle X3 to X4 Changes slide deck	“flash cache compression expands capacity to 88TB (raw)”
Oracle Exadata X4-2 datasheet	“effective flash capacity of 440 TB”

The source of this confusion appears to be the claim that a new feature called Exadata Smart Flash Cache Compression will allow more data to fit into flash. Noticeably absent from the press release and datasheet is the information that this new feature apparently requires the Advanced Compression license, potentially adding over $1m to the list price of a full rack (see slide 22 of this Oracle presentation).

This second part of the article will look at the implications of these changes, but to make things more interesting there’s one specific change I haven’t mentioned until now. And it’s the change that I think gives the biggest insight into Oracle’s thinking.

The Hybrid Database Machine


Right now, in the storage industry, there is a paradigm shift taking place as primary data moves from rusty old spinning disks to semiconductor-based NAND flash storage. Most storage vendors now offer all-flash arrays as part of their product lineup, although one or two still insist on the hybrid approach where data is located on disk but flash is used as a tiering or caching layer to improve performance.

Oracle, despite being one of the early adopters of flash with its Sun Oracle Database Machine (i.e. the Exadata v2), still uses the hybrid approach in Exadata. Each full rack contains 14 storage cells, with each cell containing 12 rotating magnetic disks as well as four PCIe flash cards (made by LSI and then rebranded as Sun). The disks can be bought in two options: high performance or high capacity (known as HP and HC respectively). It’s fair to say that the majority of customers buy the high performance version (* see comments below) – after all, Exadata is a very expensive solution aimed at solving performance problems, so performance is generally high up on a customer’s list of requirements.

Upgrading to Slower Performance?

See if you can spot the most important change to be made since the introduction of flash back in the Sun Oracle v2 (second generation) machine:

Product	Raw Flash	High Performance Disks	HP Disk Capacity
Sun Oracle Database Machine (v2)	5.3 TB	600GB 15,000 RPM	100 TB
Exadata Database Machine X2-2	5.3 TB	600GB 15,000 RPM	100 TB
Exadata Database Machine X3-2	22.4 TB	600GB 15,000 RPM	100 TB
Exadata Database Machine X4-2	44.8 TB	1.2TB 10,000 RPM	200 TB

Did you notice? In the X4 model storage cells, the HP disks have now doubled in capacity. That’s not the important bit though, it’s the sacrifice that Oracle had to make to do this: 10k RPM disk drives instead of 15k RPM. In Exadata X4, the high performance disks are slower than in Exadata X3.

How much slower are we talking? Well, a 15k RPM drive takes 4ms to complete one rotation, so its average rotational latency (half a rotation) is 2ms. A 10k RPM drive takes 6ms per rotation, giving an average rotational latency of 3ms. That’s an extra 50% average rotational latency. Why on Earth would Oracle make that change? If customers wanted more capacity, couldn’t they just buy the storage expansion racks?
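The rotational latency figures follow directly from the spindle speed: one rotation takes 60,000 ms divided by the RPM, and on average the head waits half a rotation for the right sector to come round. A quick sketch (the function names are mine):

```python
def rotation_time_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def avg_rotational_latency_ms(rpm):
    """Average wait for a sector: half of one rotation."""
    return rotation_time_ms(rpm) / 2


for rpm in (15_000, 10_000):
    print(f"{rpm:>6} RPM: {rotation_time_ms(rpm):.1f} ms per rotation, "
          f"{avg_rotational_latency_ms(rpm):.1f} ms average rotational latency")

increase = avg_rotational_latency_ms(10_000) / avg_rotational_latency_ms(15_000) - 1
print(f"Increase going from 15k to 10k: {increase:.0%}")
```

This only covers rotational latency; seek time and transfer time are separate components of a disk’s service time and are not modelled here.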

Design Dilemmas

The answer lies in two of Oracle’s fundamental design choices for the Exadata architecture:

the reliance on ASM software mirroring (meaning all data is stored either twice or three times), and
the use of flash as cache only (meaning all data in flash is eventually destaged to disk) rather than a tier of storage.

Remember that Oracle claims the Exadata Smart Flash Cache can now contain 88TB of data? But if all data on disk must be mirrored, then with ASM “normal redundancy” (i.e. double mirroring) the usable disk capacity with HP disks is just 90TB, according to the datasheet. If you want to perform zero-downtime upgrades then you need “high redundancy” (i.e. triple mirroring) which means even less capacity. What is the point of having less disk capacity than you have flash cache capacity? Clue: there is no point.
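A back-of-the-envelope version of that comparison is sketched below. It simply divides the 200 TB raw HP disk capacity from the table above by the number of mirror copies, so it gives an upper bound. The datasheet’s 90TB figure for normal redundancy is lower than this, presumably because of space that has to be kept free (that is my assumption, not something the datasheet quote here explains).

```python
def mirrored_capacity_tb(raw_tb, copies):
    """Upper bound on usable capacity when every extent is stored `copies` times."""
    return raw_tb / copies


RAW_HP_DISK_TB = 200.0          # HP disk capacity of a full X4-2 rack, from the table above
LOGICAL_FLASH_CACHE_TB = 88.0   # Oracle's claimed logical flash cache capacity

for name, copies in (("normal redundancy", 2), ("high redundancy", 3)):
    usable = mirrored_capacity_tb(RAW_HP_DISK_TB, copies)
    verdict = "more" if usable > LOGICAL_FLASH_CACHE_TB else "less"
    print(f"{name}: at most {usable:.1f} TB on disk, "
          f"{verdict} than the {LOGICAL_FLASH_CACHE_TB:.0f} TB flash cache")
```

Even as an upper bound, triple mirroring leaves less space on disk than the claimed capacity of the cache sitting in front of it.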

Which is where I finally get to my point. Oracle has taken the decision, almost by stealth, to make the Exadata X4 into an all-flash database machine. Except you still have to pay for the disks…

The All Flash Database Machine

Before we go any further, here’s a quote from Oracle’s Vice President of Product Management, Tim Shetler, discussing the increased flash capacity in Exadata X4:

Yes, that’s right: on Exadata X4, your entire database is now likely to be in flash. Yet in Exadata flash is only ever used as a cache, so the database in question is also going to be located on disk. And because ASM mirroring is required, it will actually be on disk twice – or, if you need zero-downtime upgrades, three times. Three copies on disk and one on flash? That doesn’t seem like the most efficient way to utilise what is, after all, extremely expensive storage.

What about the “inactive, colder data” that remains solely on disk? Well ok… let’s think about that for a minute. The flash cache, according to the sources in the first table above, holds between 88TB and 440TB of data – but, since it’s a cache, that data must be read from a persistent source somewhere. That source is the disks. If your disks contain “inactive, colder data” which doesn’t enter the cache, exactly how is that cache going to be efficiently populated? Keeping inactive data on Exadata’s disks is not only financially ruinous, it impacts the effect of having such an increased flash cache capacity.

Money Talks

What if Oracle ditched the disks and went for an all-flash architecture, as many storage vendors are now doing? Would that be a win for Oracle and its customers alike?

Whether it would be a win for customers is something that can be debated. What is undeniable though is the commercial problem Oracle would face if it made a technical decision to ditch the disks. Customers buying Oracle Exadata have to pay for Oracle Exadata Storage Software licenses… and guess what the licensing unit is? You license by the disk. Each storage cell has 12 disks and each full rack has 14 cells, meaning a full rack requires 168 storage licenses. These are currently listing at $10,000 per disk, bringing the total list price to $1.68m per rack.
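The sum is trivial, but it is worth seeing how directly the bill scales with the number of disks. The figures are the ones quoted above; the function name is mine.

```python
DISKS_PER_CELL = 12
CELLS_PER_FULL_RACK = 14
LIST_PRICE_PER_DISK_USD = 10_000  # list price quoted above


def storage_software_list_price(cells, disks_per_cell, price_per_disk):
    """Return (number of licenses, total list price) for a given configuration."""
    licenses = cells * disks_per_cell
    return licenses, licenses * price_per_disk


licenses, total = storage_software_list_price(
    CELLS_PER_FULL_RACK, DISKS_PER_CELL, LIST_PRICE_PER_DISK_USD
)
print(f"{licenses} licenses, ${total:,} list price per full rack")
```

Set disks_per_cell to zero and the list price goes to zero with it, which is the commercial problem in a nutshell.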

Hmm. Admitting that the disks are no longer necessary could be an expensive problem, couldn’t it?

Predictions for 2014: DataBase-as-a-Service

It’s that time of year again where lots of people write articles which begin with the words “It’s that time of year again…” and make endless references to crystal balls, tea leaves and the benefits of hindsight. But not me, I’m not descending into cliché. Apart from that first sentence, which with the benefit of hindsight could have been reworded.

Anyway, as 2013 draws to a close it’s time to look forward into 2014 and make some suitably vague predictions about cloud computing, big data 2.0 and the internet of things. But the thing is, my focus is on enterprise applications that use enterprise database software such as Oracle or Microsoft SQL Server. The people I meet in my day job – and to some extent the people that are kind enough to read my blog – tend to work in this field too. Cloud computing will definitely affect us all in the long term, but I’m not sure it will drastically change our lives in 2014. Likewise, the only way I see the internet of things affecting us next year is the possibility of more data in our data warehouses… and if I made the prediction that your data warehouses would get bigger, you would be pretty unimpressed.

What about the raft of technologies that come under the heading big data? (By the way, I only said “big data 2.0” to tease Gwen Shapira.) Will we see SQL-on-Hadoop threatening the Oracle ecosystem? Maybe even being adopted for OLTP workloads? Maybe some day, but it won’t be mainstream in 2014. And that kinda makes all of the usual predictions a bit … well, irrelevant to us.

So with that in mind, it’s time to gaze into the crystal ball, read the tea leaves and abandon any cliché-avoidance claims I made in the first paragraph.

A Lesson From Intel

There is a theory that Intel is suffering from the rise of what is called the mobile/cloud era. Instead of users sat behind desktop computers accessing application servers (and the database servers behind them) it’s now very common to find users with smart devices (i.e. phones and tablets) accessing applications which run in (public) cloud data centres. This shouldn’t be a surprise to anyone who has noticed the savage decline in PC sales. But what does it mean for Intel?

In general it’s bad news. Firstly and most obviously it’s bad because Intel has the desktop PC market sewn up but is struggling against ARM in the mobile device market (so much so that it now even makes ARM processors in its own fabs). But the second reason is more interesting: cloud computing is allowing data centres to run Intel enterprise-class processors at higher utilisations. The nature of cloud computing, i.e. shared and consolidated resources running flexible, virtualised workloads, means better value for money can be extracted from compute resources. Cloud computing means better efficiency, which is good for customers but bad for Intel.

Why am I talking about this? Because this is a problem based around the cost of CPUs. And as you may remember, the CPUs in your database server are the most expensive CPUs you own because they are tied to your database software licenses.

Moore’s Law: Diminishing Returns

We all know that Moore’s Law is bringing us more transistors on a circuit every couple of years, meaning increasing amounts of compute power in your servers. But there is a catch: average CPU utilisation in most private data centres remains the same, with industry reports claiming the average is between 4% and 15% (and I personally know of a global financial organisation with an average of 4% so these are realistic estimates). Considering the cost of server resources (including power, cooling, real estate, people to maintain them) that makes for uncomfortable reading; but if you add on top the price of database software licenses (licensed by the core) it becomes prohibitively expensive. The knock-on effect of Moore’s Law is that as compute resource increases, so does the money you are wasting. But the good news is, it’s also a massive potential for saving money: just like Intel’s mobile/cloud problem above, driving more compute from your CPUs means more efficiency for you and less money for Oracle.

DataBase-as-a-Service

The way to increase CPU utilisation is to virtualise and consolidate databases: regular readers will know that I’ve been banging on about this for ages. As part of my day job, I’ve been travelling around Europe for the last 18 months talking about DBaaS (but under various other names such as Database Virtualisation or Private Cloud) to customers big and small – and I know a number of large enterprises who are actively planning or building such solutions, so it came as no surprise to me when 451 Research issued this report in August 2013. This is my prediction for 2014: the adoption of database-as-a-service solutions will enter the mainstream. The benefits are too hard to ignore: increased agility, reduced operational costs and better utilisation of compute resources (meaning lower total cost of ownership). It also acts as an on-ramp to running your databases in the (public) cloud at some point in the future.

At one end there will be hyperscale customers such as the telcos and financial organisations that I have already seen embark on this journey. But at the other end, even smaller customers can benefit from a simple VMware, Hyper-V or OVM-based solution to drive up CPU utilisation. And it’s not just me who thinks this either. Just don’t forget to build your solution on flash.

Oracle Wants Its Piece Of The Action

Of course, this could be bad news for Oracle, since customers who use their compute resources more efficiently will require fewer Oracle licenses, along with fewer support and maintenance contracts. What does Oracle do to avoid this situation? Well… if you can’t beat them, join them:

Yes, Oracle wants (more of) your money and it’s prepared to use its mighty marketing machine to get it. This means:

Using the terms Oracle Cloud and DataBase-as-a-Service everywhere
Heavily marketing the (extra cost) Multitenant features of Oracle 12c as an alternative to third-party hypervisors
Firing up the Forbes Oracle Voice sockpuppet for some typical propaganda
Signing deals to support Oracle running on other (previously unthinkable) popular hypervisors
Rebranding the next generation of its Exadata Database Machine as a DataBase-as-a-Service solution
Aggressively taking on the established cloud players at all three layers: infrastructure, platform and applications

Personally I see DBaaS as an opportunity to embrace open systems and build flexible architectures. But, unsurprisingly, Oracle’s viewpoint is that you can build your DBaaS to use any solution as long as it’s red.

Conclusion

So there you have it. In a stunning leap into the unknown I’ve predicted that DBaaS will be widely adopted (even though this has already started happening), that your data warehouses will grow larger and that Oracle wants more of your money. And with that hat-trick in the bag, I’m taking the rest of the year off. See you in 2014…

Oracle Exadata X4 (Part 1): Bigger Than It Looks?

One of the results of my employment history is that I tend to take particular interest in the goings on at a certain enterprise software (and hardware!) company based in Redwood Shores. I love watching Oracle’s announcements, press releases, product releases and financial statements to see what they are up to – and I am never more intrigued than when they release a new version of one of their Engineered Systems.

In part this is because I used to work with Exadata a lot and still know many people who do. But the main reason I like Engineered Systems releases is because I believe there is no better indicator of Oracle’s future strategy. Sifting through the deluge of marketing clod, product collateral, datasheets and press releases is like reading the tea leaves – and I’ve been doing it for a long time.

A few weeks ago Oracle released the new “fifth-generation” Exadata Database Machine X4-2, along with the usual avalanche of marketing. Over the last couple of weeks I’ve been throwing it all up in the air to see what lands. Part one of this post will look at the changes, while part two will look at the underlying message.

Database as a Service

The first thing to notice about Exadata X4 is that Oracle Marketing has fallen in love with a new term: database as a service. Previous versions of Exadata were described as being suitable for database consolidation, but in the X4 launch this phrase has been superseded:

Personally I see little difference between consolidation and DBaaS, but I assume the latter has more connotations of cloud computing and so is more fitting for a company attempting to build its own cloud empire. The idea is presumably that you buy Exadata for use in private clouds and use Oracle Cloud for your public cloud service. That’s all very well, but what I find somewhat surprising is the claim that X4 is optimized for OLTP, data warehousing and database-as-a-service. Surely those three workloads encompass everything? Claiming that you have built a solution which is optimized for everything is … shall we say bold?

More Processor Cores = More Licenses

As with previous releases, Oracle has frozen the price of both the Exadata hardware and the Exadata Storage Software licenses (see price lists). This seems like a great result for customers given that the X4 contains significantly faster hardware (see comments section). For example, the Exadata compute nodes change from having 8-core Sandy Bridge versions of the Intel Xeon processor to 12-core Ivy Bridge models. What never ceases to amaze me is the number of people who do not immediately see the consequence of this change: 50% more cores means 50% more database software licenses are required to run the equivalent X4 machine. So while the Exadata storage license cost remains unchanged, the cost of running Oracle Database Enterprise Edition increases by 50%, as does the cost of options such as Oracle RAC, Partitioning, Advanced Compression, the Diagnostic and Tuning Packs, etc etc. And it just so happens that the bits which increase by 50% happen to form the majority of the cost (and don’t forget that the 22% annual fee for support and maintenance will also be going up for them):

Prices are estimates – contact Oracle for correct pricing
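The 50% figure falls straight out of the core counts. A minimal sketch, assuming nothing changes between generations except the number of cores per processor (the function name is mine, purely for illustration):

```python
def license_increase_pct(old_cores: int, new_cores: int) -> float:
    """Percentage increase in core-based licenses when only the core count changes."""
    return (new_cores - old_cores) / old_cores * 100


# 8-core Sandy Bridge to 12-core Ivy Bridge, as described above.
print(f"{license_increase_pct(8, 12):.0f}% more licensable cores")
```

Anything licensed per core (the database itself plus every option) scales by the same percentage, and so does the annual support bill that is calculated from it.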

So far so boring. Nobody expected something for nothing, despite some of the altruistic statements made to the press. But there’s something much more interesting going on if you look at the X4’s use of flash memory…

Ever-Increasing Capacity

The new Exadata X4 model now contains 44.8TB of raw flash in the form of rebranded LSI Nytro PCIe cards placed in the storage cells. The term “raw”, as always in the storage industry, is used to denote the total amount of flash available prior to any overhead such as RAID, formatting, areas kept aside for garbage collection, etc. Once all of these overheads are added, you end up with a new figure known as “usable” – and it is this amount which describes the area where you can store data.

But hold on, what’s this new term “logical flash capacity” in the press release promising “88 TB per full rack”?

That’s twice the raw capacity! This is an incredible statement, because this so-called “logical” capacity is in fact a complete guess based on compression ratios – which are entirely dependent on your data. And it gets worse when you read the datasheet, which makes the following claim: an “effective flash capacity” of “Up to 448TB”! This is now ten times the raw capacity!
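To put those claims side by side, here is a small sketch that expresses each advertised capacity as a multiple of the raw flash. The three numbers are the ones quoted above; the dictionary keys are just labels I have made up.

```python
RAW_FLASH_TB = 44.8  # raw flash per full rack, as quoted above

# Advertised capacities, keyed by my own shorthand labels.
claims_tb = {
    "logical": 88,     # press release
    "effective": 448,  # datasheet ("up to")
}

# Each claim as a multiple of the raw capacity.
ratios = {label: tb / RAW_FLASH_TB for label, tb in claims_tb.items()}

for label, ratio in ratios.items():
    print(f"{label} capacity: {ratio:.1f}x raw")
```

Neither multiplier describes hardware you are buying. Both are assumptions about how well your data will compress.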

But what is an “effective flash capacity”? Let’s read the small print of the datasheet to find out… Apparently this is the size of the data files that can often be stored in Exadata and be accessed at the speed of flash memory. No guarantees then, you just might get that, if you’re lucky. I thought datasheets were supposed to be about facts?

I am very uncomfortable about this sort of claim, partly because it carries no guarantees, but mainly because it often confuses customers. It’s not inconceivable that a potential customer will mistakenly think they are buying more raw flash capacity than they actually are. You think not? Then take a look at slide 21 of this Oracle presentation and consider the use of the word “raw”:

Maybe someone can explain to me how that statement can possibly be valid, because to me it looks utterly bewildering.

Exadata Smart Flash Cache Compression

The Exadata Smart Flash Cache has been a stalwart of the Exadata machine for many generations, so it is no surprise to see its feature set continually expanding. For the Exadata X4 release, the big feature appears to be Exadata Smart Flash Cache Compression (read more about it here), which allows Oracle to transparently compress data and store it on the PCIe flash cards. It is this feature which Oracle is describing when it claims a “logical flash cache capacity” of 88TB in the press release and the datasheet. Yet according to slide 22 of this Oracle presentation it is a feature which requires the Advanced Compression Option:

As you can see, the author of this slide deck makes the rather brave assumption that most Exadata customers already have licenses for Advanced Compression (something I strongly contest). But either way, does it not seem reasonable that the press release and/or the datasheet should include this statement if they are going to promise such enlarged flash capacities? I’ve looked and looked, but I cannot see this mentioned – even in the infamous small print.

The thing is, right now on the Oracle Store, the Advanced Compression Option is retailing at $11,500 per core. Given that the new Exadata X4 machine now has 192 cores in a full rack (and taking into account the core multiplication factor of 0.5 for Intel Xeon), I calculate the list price of this option as being over $1.1m. Personally, I think that’s a large enough add-on that it ought to be mentioned up front.
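Here is how I arrive at that figure, as a minimal sketch. The inputs are the ones quoted above (list price, core count and core factor); the variable names are mine, and the real price you pay will depend on whatever discount you negotiate.

```python
LIST_PRICE_PER_LICENSE_USD = 11_500  # Advanced Compression list price quoted above
CORES_PER_FULL_RACK = 192            # Exadata X4 full rack, as quoted above
CORE_FACTOR = 0.5                    # Intel Xeon core multiplication factor

# Licenses required = physical cores x core factor.
licenses_required = CORES_PER_FULL_RACK * CORE_FACTOR
list_price = licenses_required * LIST_PRICE_PER_LICENSE_USD

print(f"Licenses required: {licenses_required:.0f}")
print(f"Advanced Compression list price: ${list_price:,.0f}")
```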

Conclusion

As always with Oracle’s Exadata products, there is much to read between the lines. In the second part of this article I’ll be drawing my own conclusions about what the X4 means… stay tuned.

Storage Myths: Storage Compression Has No Downside

Storage for DBAs: My last post in this blog series was aimed at dispelling the myth that dedupe is a suitable storage technology for databases. To my surprise it became the most popular article I’ve ever published (based on reads per day). Less surprisingly though, it led to quite a backlash from some of the other flash storage vendors who responded with comments along the lines of “well we don’t need dedupe because we also have compression”. Fair enough. So today let’s take a look at the benefits and drawbacks of storage-level compression as part of an overall data reduction strategy. And by the way, I’m not against either dedupe or storage-level compression. I just think they have drawbacks as well as benefits – something that isn’t always being made clear in the marketing literature. And being in the storage industry, I know why that is…

What Is Compression?

In storageland we tend to talk about the data reduction suite of tools, which comprises deduplication, compression and thin provisioning. The latter is a way of freeing up capacity which is allocated but not used… but that’s a topic for another day.

Dedupe and compression have a lot in common: they both fundamentally involve the identification and removal of patterns, which are then replaced with keys. The simplest way to explain the difference would be to consider a book shelf filled with books. If you were going to dedupe the bookshelf you would search through all of the books, removing any duplicate titles and making a note of how many duplicates there were. Easy. Now you want to compress the books, so you need to read each book and look for duplicate patterns of words. If you find the same sentence repeated numerous times, you can replace it with a pointer to a notebook where you can jot down the original. Hmmm…. less easy. You can see that dedupe is much more of a quick win.

Of course there is more to compression than this. Prior to any removal of duplicate patterns data is usually transformed – and it is the method used in this transformation process that differentiates all of the various compression algorithms. I’m not going to delve into the detail in this article, but if you are interested then a great way to get an idea of what’s involved is to read the Wikipedia page on the BZIP2 file compression tool and look at all the processes involved.

Why Compress?

Data compression is essentially a trade-off, where reduced storage footprint is gained at the expense of extra CPU cycles – and therefore as a consequence, extra time. This additional CPU and time must be spent whenever the compressed data is read, thus increasing the read latency. It also needs to be spent during a write, but – as with dedupe – this can take place during the write process (known as inline) or at some later stage (known as post-process). Inline compression will affect the write latency but will also eliminate the need for a staging area where uncompressed data awaits compression.

Traditionally, compression has been used for archive data, i.e. data that must be retained but is seldom accessed. This is a good fit for compression, since the additional cost of decompression will rarely be paid. However, the use of compression with primary data is a different story: does it make sense to repeatedly incur time and CPU penalties on data that is frequently read or written? The answer, of course, is that it’s entirely down to any business requirements. However, I do strongly believe that – as with dedupe – there should be a choice, rather than an “always on” solution where you cannot say no. One vendor I know makes this rather silly claim: “so important it’s always-on”. What a fine example of a design limitation manifesting itself as a marketing claim.

Where to Compress?

As with all applications, data tends to flow down from users through an application layer, into a database. This database sits on top of a host, which is connected to some sort of persistent storage. There are therefore a number of possible places where data can be compressed:

Database-level compression, such as using basic compression in Oracle (part of the core product), or the Advanced Compression option (extra license required).
Host-level compression, such as you might find in products like Symantec’s Veritas Storage Foundation software suite.
Storage-level compression, where the storage array compresses data either at a global level, or more ideally, at some configurable level (e.g. by the LUN).

Of course, compressed data doesn’t easily compress again, since all of the repetitive patterns will have been removed. In fact, running compression algorithms on compressed data is, at the very least, a waste of time and CPU – while in the worst case it could actually increase the size of the compressed data. This means it doesn’t really make sense to use multiple levels of compression, such as both database-level and storage-level. Choosing the correct level is therefore important. So which is best?
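You can see this effect for yourself with a few lines of Python. This is only a sketch: zlib stands in for whatever algorithm your database or storage array really uses, and the test data (random words from a small vocabulary I made up) is simply something that compresses well the first time around.

```python
import random
import zlib

# Build some compressible "data": random words drawn from a small vocabulary.
# The vocabulary, seed and size are arbitrary choices for this illustration.
rng = random.Random(42)
vocabulary = ["oracle", "flash", "storage", "database", "block", "cache", "redo", "index"]
raw = " ".join(rng.choice(vocabulary) for _ in range(20_000)).encode("ascii")

once = zlib.compress(raw)    # first level, e.g. database-level compression
twice = zlib.compress(once)  # second level applied on top, e.g. storage-level

print(f"raw:              {len(raw):>7} bytes")
print(f"compressed once:  {len(once):>7} bytes")
print(f"compressed twice: {len(twice):>7} bytes")
```

The first pass should shrink the data considerably. The second pass has no patterns left to work with, so it burns CPU for essentially no gain and will typically come out a few bytes larger.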

Benefits and Drawbacks

If you read some of the marketing literature I’ve seen recently you would soon come to the conclusion that compressing your data at the storage level is the only way to go. It certainly has some advantages, such as ease of deployment: just switch it on and sit back, all of your data is now compressed. But there are drawbacks too – and I believe it pays to make an informed decision.

Performance

The most obvious and measurable drawback is the addition of latency to I/O operations. In the case of inline compression this affects both reads and writes, while post-process compression inevitably results in more background I/O operations taking place, increasing wear and potentially impacting other workloads. Don’t take it for granted that this additional latency won’t affect you, especially at peak workload. Everyone in the flash industry knows about a certain flash vendor whose inline dedupe and compression software has to switch into post-process mode under high load, because it simply cannot cope.

Influence

This one is less obvious, but in my opinion far more important. Let’s say you compress your database at the storage-level so that as blocks are written to storage they are compressed, then decompressed again when they are read back out into the buffer cache. That’s great, you’ve saved yourself some storage capacity at the overhead of some latency. But what would have happened if you’d used database-level compression instead?

With database-level compression the data inside the data blocks would be compressed. This means not just that data which resides on storage, but also the data in memory – inside the buffer cache. That means you need less physical memory to hold the same amount of data, because it’s compressed in memory as well as on storage. What will you do with the excess physical memory? You could increase the size of the buffer cache, holding more data in memory and possibly improving performance through a reduction in physical I/O. Or you could run more instances on the same server… in fact, database-level compression is very useful if you want to build a consolidation environment, because it allows a greater density of databases per physical host.

There’s more. Full table scans will scan a smaller number of blocks because the data is compressed. Likewise any blocks sent over the network contain compressed data, which might make a difference to standby or Data Guard traffic. When it comes to compression, the higher up in the stack you begin, the more benefits you will see.

Don’t Believe The Hype

The moral of this story is that compression, just like deduplication, is a fantastic option to have available when and if you want to use it. Both of these tools allow you to trade time and CPU resource in favour of a reduced storage footprint. Choices are a good thing.

They are not, however, guaranteed wins – and they should not be sold as such. Take the time to understand the drawbacks before saying yes. If your storage vendor – or your database vendor (“storage savings of up to 204x”!!) – is pushing compression maybe they have a hidden agenda? In fact, they almost definitely will have.

And that will be the subject of the next post…

Storage Myths: Dedupe for Databases

Storage for DBAs: Data deduplication – or “dedupe” – is a technology which falls under the umbrella of data reduction, i.e. reducing the amount of capacity required to store data. In very simple terms it involves looking for repeating patterns and replacing them with a marker: as long as the marker requires less space than the pattern it replaces, you have achieved a reduction in capacity. Deduplication can happen anywhere: on storage, in memory, over networks, even in database design – for example, the standard database star or snowflake schema. However, in this article we’re going to stick to talking about dedupe on storage, because this is where I believe there is a myth that needs debunking: contrary to what you may have been told, databases are not a great use case for dedupe.

Deduplication Basics: Inline or Post-Process

If you are using data deduplication either through a storage platform or via software on the host layer, you have two basic choices: you can deduplicate it at the time that it is written (known as inline dedupe) or allow it to arrive and then dedupe it at your leisure in some transparent manner (known as post-process dedupe). Inline dedupe affects the time taken to complete every write, directly affecting I/O performance. The benefit of post-process dedupe therefore appears to be that it does not affect performance – but think again: post-process dedupe first requires data to be written to storage, then read back out into the dedupe algorithm, before being written to storage again in its deduped format – thus magnifying the amount of I/O traffic and indirectly affecting I/O performance. In addition, post-process dedupe requires more available capacity to provide room for staging the inbound data prior to dedupe.

Deduplication Basics: (Block) Size Matters

In most storage systems dedupe takes place at a defined block size, whereby each block is hashed to produce a unique key before being compared with a master lookup table containing all known hash keys. If the newly-generated key already exists in the lookup table, the block is a duplicate and does not need to be stored again. The block size is therefore pretty important, because the smaller the granularity, the higher the chances of finding a duplicate:

In the picture you can see that the pattern “1234” repeats twice over a total of 16 digits. With an 8-digit block size (the lower line) this repeat is not picked up, since the second half of the 8-digit pattern does not repeat. However, by reducing the block size to 4 digits (the upper line) we can now get a match on our unique key, meaning that the “1234” pattern only needs to be stored once.
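The same idea can be sketched in code. This is a simplified illustration rather than how any particular storage product works: I’ve assumed fixed-size blocks, SHA-256 as the hash, a Python set as the lookup table, and a 16-digit stream of my own invention in which “1234” appears twice.

```python
import hashlib


def dedupe_stats(data: bytes, block_size: int) -> tuple[int, int]:
    """Return (total blocks, unique blocks) for fixed-size block dedupe."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # The set of hash keys stands in for the master lookup table.
    lookup_table = {hashlib.sha256(block).hexdigest() for block in blocks}
    return len(blocks), len(lookup_table)


digits = b"1234567812349999"  # hypothetical 16-digit stream; "1234" appears twice

for size in (8, 4):
    total, unique = dedupe_stats(digits, size)
    print(f"block size {size}: {total} blocks written, {unique} stored")
```

At a block size of 8 both blocks are unique, so nothing is saved. At a block size of 4 the second “1234” hashes to a key that is already in the table, so only three of the four blocks need to be stored.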

This sounds like great news, let’s just choose a really small block size, right? But no, nothing comes without a price – and in this case the price comes in the size of the hashing lookup table. This table, which contains one key for every unique block, must range in size from containing just one entry (the “ideal” scenario where all data is duplicated) to having one entry for each block (the worst case scenario where every block is unique). By making the block size smaller, we are inversely increasing the maximum size of the hashing table: half the block size means double the potential number of hash entries.
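A quick sketch of that worst case, where every block is unique and the table needs one entry per block. The 10 TiB capacity and the two block sizes are hypothetical numbers I picked for illustration.

```python
def max_hash_entries(capacity_bytes: int, block_size_bytes: int) -> int:
    """Worst case: every block is unique, so the table holds one entry per block."""
    return capacity_bytes // block_size_bytes


capacity = 10 * 1024**4  # a hypothetical 10 TiB of stored data

for block_size in (8192, 4096):
    entries = max_hash_entries(capacity, block_size)
    print(f"{block_size}-byte blocks: up to {entries:,} hash entries")
```

Halve the block size and the potential number of entries doubles, along with the space needed to hold them and the work needed to search them.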

Hash Abuse

Why do we care about having more hash entries? There are a few reasons. First there is the additional storage overhead: if your data is relatively free of duplication (or the block size does not allow duplicates to be detected) then not only will you fail to reclaim any space but you may end up using extra space to store all of the unique keys associated with each block. This is clearly not a great outcome when using a technology designed to reduce the footprint of your data. Secondly, the more hash entries you have, the more entries you need to scan through when comparing freshly-hashed blocks during writes or locating existing blocks during reads. In other words, the more of a performance overhead you will suffer in order to read your data and (in the case of inline dedupe) write it.

If this is sounding familiar to you, it’s because the hash data is effectively a database in which storage metadata is stored and retrieved. Just like any database the performance will be dictated by the volume of data as well as the compute resource used to manipulate it, which is why many vendors choose to store this metadata in DRAM. Keeping the data in memory brings certain performance benefits, but with the price of volatility: changes in memory will be lost if the power is interrupted, so regular checkpoints are required to persistent storage. Even then, battery backup is often required, because the loss of even one hash key means data corruption. If you are going to replace your data with markers from a lookup table, you absolutely cannot afford to lose that lookup table, or there will be no coming back.

Database Deduplication – Don’t Be Duped

Now that we know what dedupe is all about, let’s attempt to apply it to databases and see what happens. You may be considering the use of dedupe technology with a database system, or you may simply be considering the use of one of a number of recent storage products that have inline dedupe in place as an “always on” option, i.e. you cannot turn it off regardless of whether it helps or hinders. The vendor may make all sorts of claims about the possibilities of dedupe, but how much benefit will you actually see?

Let’s consider the different components of a database environment in the context of duplication:

Oracle datafiles contain data blocks which have block headers at the start of the block. These contain numbers which are unique for each datafile, making deduplication impossible at the database block size. In addition, the end of each block contains a tailcheck section which features a number generated using data such as the SCN, so even if the block were divided into two the second half would offer limited opportunity for dedupe while the first half would offer none.
Even if you were able to break down Oracle blocks into small enough chunks to make dedupe realistic, any duplication of data is really a massive warning about your database design: normalise your data! Also, consider features like index key compression which are part of the Enterprise Edition license.
Most Oracle installations have multiplexed copies of important files like online redo logs and controlfiles. These files are so important that Oracle synchronously maintains multiple copies in order to ensure against data loss. If your storage system is deduplicating these copies, this is a bad thing – particularly if it’s an always on feature that gives you no option.
While unallocated space (e.g. in an ASM diskgroup) might appear to offer the potential for dedupe, this is actually a problem which you should solve using another storage technology: thin provisioning.
You may have copies of datafiles residing on the same storage as production, which therefore allow large-scale deduplication to take place; perhaps they are used as backups or test/development environments. However, in the latter case, test/dev environments are a use case for space-efficient snapshots rather than dedupe. And if you are keeping your backups on the same storage system as your production data, well… good luck to you. There is nothing more for you here.
Maybe we aren’t talking about production data at all. You have a large storage array which contains multiple copies of your database for use with test/dev environments – and thus large portions of the data are duplicated. Bingo! The perfect use case for storage dedupe, right? Wrong. Database-level problems require database-level solutions, not storage-level workarounds. Get yourself some licenses for Delphix and you won’t look back.
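To see why the block headers in the first point above defeat dedupe at the database block size, here is a toy model. To be very clear, this is not the real Oracle block format: I’ve assumed an 8KB block consisting of an 8-byte header (a file number and block number, standing in for the unique values Oracle stores) followed by the row data, and I’ve left out the tailcheck entirely.

```python
import hashlib
import struct

BLOCK_SIZE = 8192
HEADER_SIZE = 8


def make_block(file_no: int, block_no: int, payload: bytes) -> bytes:
    """Toy block: an 8-byte header holding a unique address, then padded payload.

    This is NOT the real Oracle block layout, just enough structure to make the point.
    """
    header = struct.pack(">II", file_no, block_no)
    body = payload.ljust(BLOCK_SIZE - HEADER_SIZE, b"\x00")
    return header + body


# A thousand blocks that all contain exactly the same row data.
payload = b"identical row data" * 100
blocks = [make_block(1, n, payload) for n in range(1000)]

# Hash whole blocks (what block-level dedupe sees) versus the payload alone.
whole_block_hashes = {hashlib.sha256(b).digest() for b in blocks}
payload_hashes = {hashlib.sha256(b[HEADER_SIZE:]).digest() for b in blocks}

print(f"unique whole-block hashes: {len(whole_block_hashes)}")
print(f"unique payload hashes:     {len(payload_hashes)}")
```

Even though every block carries identical data, the handful of unique header bytes means every whole-block hash is different, so a dedupe engine working at the database block size finds nothing to remove.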

To conclude, while dedupe is great in use cases like VDI, it offers very limited benefit in database environments while potentially making performance worse. That in itself is worrying, but what I really see as a problem is the way that certain storage vendors appear to be selling their capacity based on assumed levels of dedupe, i.e. “Sure we are only giving you X terabytes of storage for Y price, but actually you’ll get 10:1 dedupe which means the price is really ten times lower!”

Sizing should be based on facts, not assumptions. Just like in the real world, nothing comes for free in I.T. – and we’ve all learnt that the hard way at some point. Don’t be duped.

The Most Expensive CPUs You Own

Storage for DBAs: Take a look in your data centre at all those humming boxes and flashing lights. Ignore the storage and networking gear for now and just concentrate on the servers. You probably have many different models, with different types and numbers of CPUs and DRAM inside. My question is, which CPUs are the most expensive? Almost without exception, the answer will be the CPUs inside your database servers…

In the last couple of posts I talked about the real cost of enterprise database software in general and Oracle RAC in particular. The point I was making was that database software, which is traditionally licensed by the CPU core, is expensive in comparison to the cost of the hardware on which it runs. But since the hardware fundamentally affects the performance – and therefore value for money – of the software, it’s important to make the right choices when building a database system. And yes, predictably, I believe that this means using flash memory instead of disk – but don’t worry, that’s not the main message behind this post.

Lawn Mower Tax

Think of any consumer item which comes in multiple sizes and price brackets. I don’t know, let’s say a lawn mower. To simplify, let’s assume you can buy three different types of mower: small ($250), medium ($500) and large ($1000). The small one is cheaper but less powerful, so it takes longer to cut your grass, while the large one is the most expensive but requires the shortest amount of time. Which would you pick?

There’s no right answer because it depends on your requirements. But let’s introduce an unexpected complication into the mix: lawn mower tax. The government, in their wisdom, imposes a $50,000 tax on the purchase of any new lawn mower regardless of size. You still need a mower so you are forced to pay the tax, but is your choice influenced? The chances are you would buy the larger model, because a) the percentage difference in overall price is much less, and b) it avoids the risk of needing to upgrade in the future and having to pay the tax again. The $51,000 large mower represents better value for money than the two smaller models.
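The arithmetic behind that choice can be sketched in a few lines of Python. The prices and the tax are the ones from the example above; the function and variable names are just assumptions for illustration.

```python
TAX = 50_000  # flat lawn mower tax from the example above
mowers = {"small": 250, "medium": 500, "large": 1000}


def premium_over_small(prices: dict[str, int], tax: int = 0) -> dict[str, float]:
    """Percentage extra paid for each model relative to the small one, tax included."""
    base = prices["small"] + tax
    return {name: 100.0 * (price + tax - base) / base for name, price in prices.items()}


without_tax = premium_over_small(mowers)
with_tax = premium_over_small(mowers, TAX)

for name in mowers:
    print(f"{name:>6}: +{without_tax[name]:.1f}% without tax, +{with_tax[name]:.2f}% with tax")
```

Without the tax the large mower costs three times more than the small one on top of its price; with the tax the extra is well under two percent of the total bill.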

CPU Tax

You can think of database software in the same way. There are countless types of CPU available on the market right now: Intel, AMD, ARM, IBM Power, Oracle / Fujitsu SPARC, etc. Each vendor has many models and architectures, clock speeds and power ratings, yet they all share one important property: core count. And that core count is subject to the massive “CPU tax” that is the database software license. I’m sticking to the Oracle Database in this post but the same applies to Microsoft SQL Server (where licenses are core-based from SQL2012 onwards), Sybase and so on.

Take a standard two-socket sixteen-core Intel Xeon-based server as an example: there are a multitude of CPU models fitting that description. Even if we restrict ourselves to the Sandy Bridge-EP range, Wikipedia shows there are 11 different models fitting the description of “8 cores per socket”. Yet not all CPUs are equal. Wouldn’t it make sense, given the massive cost associated with core-based licensing, to ensure you are using the processor which gives you the best performance, i.e. value for money, per license?

Performance Per Licensable Core

The problem of determining which CPUs provide the best value for money was one I struggled with for a while. Looking at benchmarks like SPECint and the datasheets from Intel and co, it’s hard not to be overwhelmed by data – and if I’m honest I probably don’t have the systems-level knowledge to interpret it accurately. Ironically, the solution came from someone who does have that knowledge, but showed me that it isn’t required because there’s a much simpler way. More importantly, benchmarks like SPECint don’t take into account what we want these CPUs to do, which is to run the Oracle Database.

Kevin Closson’s elegant and annoyingly simple solution was to use TPC benchmarks – specifically the transactional TPC-C benchmark from Oracle databases, results from which are freely available here. All we need to do then is simply download the spreadsheet, filter out the non-Oracle workloads and then divide the value of tpmC (the number of orders that can be fully processed per minute) by the number of CPU cores to get the performance per core.

Since this is an Oracle-specific calculation we also then need to account for Oracle’s Processor Core Factor (see link on this page) to get the ultimate figure we need to know, the performance per license. The number of licenses required is the core count multiplied by the core factor, so the performance per license is the performance per core divided by the core factor. Here’s my working copy of the spreadsheet, but I make no claims to its accuracy and will not keep this screenshot up-to-date. You should recalculate every time you want to make a judgement on which servers to use; it’s a very simple exercise.
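For anyone who prefers code to spreadsheets, the same calculation can be sketched in Python. This assumes the usual Oracle rule that licenses required equals cores multiplied by the core factor. The class name, the system labels and every number below are hypothetical placeholders, not published TPC-C results or real core factors; substitute figures from the TPC spreadsheet and Oracle’s core factor table.

```python
from dataclasses import dataclass


@dataclass
class TpccResult:
    system: str         # label for the benchmarked server
    tpmc: float         # published tpmC figure
    cores: int          # total CPU cores in the database server
    core_factor: float  # Oracle Processor Core Factor for that CPU family

    @property
    def perf_per_core(self) -> float:
        return self.tpmc / self.cores

    @property
    def perf_per_license(self) -> float:
        # Licenses required = cores * core_factor, so divide tpmC by that product.
        return self.tpmc / (self.cores * self.core_factor)


# Hypothetical rows for illustration only; these are NOT published TPC-C results.
results = [
    TpccResult("server A", 1_000_000, 16, 0.5),
    TpccResult("server B", 3_000_000, 64, 1.0),
]

# Rank by the figure that matters: performance per license, best first.
for r in sorted(results, key=lambda r: r.perf_per_license, reverse=True):
    print(f"{r.system}: {r.perf_per_core:,.0f} tpmC/core, {r.perf_per_license:,.0f} tpmC/license")
```

Note how the bigger hypothetical server posts the higher headline tpmC yet delivers less per license, which is exactly the kind of result the raw benchmark figure hides.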

Performance per licensable core (based on published TPC-C benchmark results using Oracle)

The red column is the performance per licensable core, marked “Perf / license”. Hopefully it’s obvious that this is just a re-work of Kevin’s ideas, many of which he posted in this blog article, which I highly recommend reading. As such I can claim no credit, except for any mistakes.

The Flash Angle

Of course, this wouldn’t be a flashdba article without some mention of flash memory. As discussed above there are many different types and models of CPU, but there is one great leveller: CPUs are all equally good at doing nothing. If your processors are waiting on I/O then they are not working – and that has a direct negative effect on the value you are realising from them.
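As a rough illustration of that effect, the sketch below treats time spent waiting on I/O as time for which the license is paid but no work is done. This is a deliberate simplification, and the license cost and wait fractions are hypothetical values, not measurements.

```python
def cost_per_working_core(license_cost_per_core: float, io_wait_fraction: float) -> float:
    """License cost per core's worth of useful work, given the share of time lost to I/O waits."""
    if not 0.0 <= io_wait_fraction < 1.0:
        raise ValueError("io_wait_fraction must be in the range [0, 1)")
    return license_cost_per_core / (1.0 - io_wait_fraction)


# Hypothetical license cost per core, and two hypothetical I/O wait profiles.
LICENSE_COST = 20_000.0
for label, wait in [("high-latency storage", 0.50), ("low-latency storage", 0.05)]:
    cost = cost_per_working_core(LICENSE_COST, wait)
    print(f"{label}: ${cost:,.0f} per working core")
```

A core that spends half its time waiting effectively costs twice as much per unit of work, regardless of which CPU model it is.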

In the above chart, the last benchmark result (with the best value for performance per licensable core) is this one performed by Cisco. Now, I honestly didn’t engineer this article to work out this way, but it so happens that Cisco used a pair of Violin Memory 6616 flash memory arrays to achieve this workload. (I’d almost* be happier if this had been a competitor’s flash array, because I don’t want this to look like an advert for my employer and therefore detract from my point…)

The point I’m aiming to make here is that it’s worth using the best-performing processors in order to see value for money from your database licenses. But to enable that, the processors need to be released from the chains of high-latency storage – and that, quite simply, means using flash.

* almost, but not quite

OOW13: The Future Is Here (Just Don’t Mention “Legacy”)

Last week I attended Oracle OpenWorld 2013 in the stunning city of San Francisco, along with 60,000 other attendees. At times it felt like we’d taken over the entire city, with every street, bus, billboard and hotel plastered in Oracle logos and pictures of engineered systems… although apparently there was some other stuff going on too.

I learnt a lot from OOW this year. I met many customers and potential customers, attended sessions from Oracle and its partners (including Violin’s competitors) and spent some time with friends at OaktableWorld as an antidote to the marketing hype. Oracle is many things to many people, but one thing that’s hard to deny is the company’s drive for innovation. Every year there are new products, new features, new options to learn – it’s very impressive. Of course, each one of these invariably means paying more license money – but those yachts don’t come cheap. This year as we looked to the future there were discussions about In Memory, Big Data, the Internet of Things and M2M. But what about the present? And more importantly, what about those of us still tied to the past?

The Database In Memory Option

In his opening keynote this year, Larry Ellison announced the Oracle Database In Memory Option, seen by many as an attempt to counter SAP’s HANA In-Memory database and Microsoft’s In Memory OLTP option for SQL Server 2014. This was by no means the only announcement of the week, or even the night (take, for example, the Oracle Big Memory Machine which, with 384 cores, I can’t help feeling would have been better named the Big License Bill Machine), but it’s a great example of the problem I want to discuss.

The obvious criticism is Oracle’s tiresome policy of pre-announcing and re-announcing the same thing (perfectly described by Doug Henschen here). The In-Memory Option isn’t available yet, nor will it be until “sometime next year”, which could conceivably be after OpenWorld 2014. But my real issue is that, like many other announcements, it’s a feature of the 12c database… which means almost everyone running in production won’t be able to use it.

OK, so maybe by the time Oracle finally rolls it out there will be some early adopters running 12c on their critical systems. But as the saying goes, you can always spot the pioneers by the arrows sticking out of their backs. Many people will refuse to upgrade to Oracle 12c until at least the release of version 2. And many, many more people simply won’t have a choice. We all spent a week talking about new or unreleased features that will change our lives, but how many customers will use them in production before next year’s slew of announcements?

Legacy Applications

The majority of organisations that I speak to are running legacy applications to support their businesses. The more risk-averse the business, the more ancient and convoluted the applications being supported (which is ironic if you consider the risk associated with maintaining old, complex code). Speak to any bank or telco and you’ll find applications from the previous decade running on versions of Oracle (or MSSQL, Sybase, etc) that you’d almost forgotten about. Scratch the surface and you’ll find lots of stuff on 11g Release 2, lots on 11gR1, plenty of stuff on 10.2 and maybe even 10.1. Dig really deep and, horror of horrors, 9i is only just the beginning.

Not only that, but you’ll often find these databases aren’t even running on the terminal (i.e. supported) patchset! Why? Because upgrading an application or a database is a mammoth task, filled with risk and cost. I know I’m not the only one who has worked on 18+ month database upgrade projects which never even tasted success. Even applying a patchset requires full regression testing of an application – and if it’s a legacy application what are the support implications?

List of Legacy Refresh tasks in order of increasing risk and time/cost

In my view, despite all the talk of new technologies and paradigm shifts, the need to refresh legacy applications is more relevant now than ever. I guess I see it more working for a company like Violin because replacing legacy storage with flash memory offers a massive win with relatively little risk. Upgrading to 12c, on the other hand, is not a project to be treated lightly – despite the promises of features such as the Database In Memory option. Many customers simply cannot afford the time, money or risk associated with upgrades and migrations, despite any potential rewards. Yet who is championing them?

Footnote

I’m excited and intrigued by the new product launched by my employer Violin Memory, the Force 2510 Memory Appliance. I don’t usually use my blog to directly promote our products but this one interests me because it fits in below the list in the above picture, offering memory speeds without application or database changes. I hope to get one in my lab soon so I can blog what I see…