Oracle AWR Reports: When IOStats Lie

graph

If you’ve been unfortunate enough to follow my dithering on Twitter recently you’ll know that I’ve been lurching between thinking that there is and isn’t a problem with Oracle’s tracking of I/O statistics in its AWR reports.

I’m now convinced there is a problem, but I can’t work out what causes it… so step 1 is to describe the problem here, after which step 2 will probably be to sit back and hope someone far more intelligent than me will solve it.

But first some background:

AWR I/O Statistics

I’ve written about the I/O statistics contained in Oracle AWR Reports before, so I won’t repeat myself too much other than to highlight two critical areas which we’ll focus on today: Instance Statistics and IOStat Summaries. By the way, the format of AWR reports changed in 11.2.0.4 and 12c to include a new IO Profile section, but today we’re covering reports from 11.2.0.3.

First of all, here are some sensible statistics. I’m going to show you the IOStat by Function Summary section of a report from a real life database:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg  
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Buffer Cache Re   14.9G   504.1 4.22491      0M     0.0      0M 1690.3K     0.0
Direct Reads      12.2G     3.5 3.45612      0M     0.0      0M       0     N/A
DBWR                 0M     0.0      0M    7.9G   186.6 2.25336       0     N/A
Others             4.2G    12.1 1.18110    3.6G     2.5 1.03172   41.9K     0.1
LGWR                 1M     0.0 .000277    5.8G    11.9 1.65642   17.4K     1.0
Direct Writes        0M     0.0      0M      4M     0.0 .001110       0     N/A
TOTAL:            31.2G   519.7 8.86241   17.4G   201.0 4.94263 1749.6K     0.0
          -------------------------------------------------------------

This section of the report is breaking down all I/O operations into the functions that caused them (e.g. buffer cache reads, direct path reads or writes, redo log writes by LGWR, the writing of dirty buffers by DBWR and so on). Let’s ignore that level of detail now and just focus on the TOTAL row at the bottom.

To try and make this simple to describe, I’ve gone a bit crazy with the colours. In green I’ve highlighted the labels for reads and writes – and now let’s walk through the columns:

  1. The first column is the function name, but we’re just focussing on TOTAL
  2. The second column is the total amount of reads that happened in this snapshot: 31.2GB
  3. The third column is the average number of read requests per second, i.e. 519.7 read IOPS
  4. The fourth column is the average volume of data read per second, i.e. 8.86 MB/sec read throughput
  5. The fifth column is the total amount of writes that happened in this snapshot: 17.4GB
  6. The sixth column is the average number of write requests per second, i.e. 201 write IOPS
  7. The seventh column is the average volume of data written per second, i.e. 4.94 MB/sec write throughput
  8. The eighth and ninth columns are not of interest to us here

The whole section is based on the DBA_HIST_IOSTAT_FUNCTION view. What we care about today is the IOPS figures (shown in red) and the Throughput figures (shown in blue). Pay attention to the comments in the view header which show that Data columns (including throughput) are multiples of 1024 while other columns (including IOPS) are multiples of 1000. It’s interesting that the two throughput values are obviously measured in MB/sec and yet are missing the “M” suffix – I assume this “falls off the end” of the column because of the number of decimal places displayed.

Now that we have these figures explained, let’s compare them to what we see in the Instance Activity Stats section of the same AWR report:

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                 1,848,576          513.3         121.5
physical read bytes                  29,160,300,544    8,096,201.8   1,917,053.5
physical read total IO requests           1,892,366          525.4         124.4
physical read total bytes            33,620,594,176    9,334,578.5   2,210,281.7
physical read total multi block              17,096            4.8           1.1
physical reads                            3,559,607          988.3         234.0

physical write IO requests                  671,728          186.5          44.2
physical write bytes                  8,513,257,472    2,363,660.5     559,677.7
physical write total IO requests            723,348          200.8          47.6
physical write total bytes           18,657,198,080    5,180,071.5   1,226,559.6
physical write total multi block             26,192            7.3           1.7
physical writes                           1,039,216          288.5          68.3

For this section we care about the per Second column because both IOPS and Throughput are measured using units per second. For both reads and writes there are two sets of statistics: those with the word total in them and those without. You can find the full description of 11gR2 statistics in the documentation, but the difference between the two is best summed up by this snippet which describes physical read total bytes:

Total size in bytes of disk reads by all database instance activity including application reads, backup and recovery, and other utilities. The difference between this value and “physical read bytes” gives the total read size in bytes by non-application workload.

I’ve underlined non-application workload here because this is critical. If you merely look at the Load Profile section at the top of an AWR report you will only see values for “application workload” I/O but this does not include stuff like RMAN backups, archive logging and so on… important stuff if you care about the actual I/O workload. For this reason, we only care about the following four statistics:

  1. physical read total IO requests (per second) = the average number of read IOPS
  2. physical read total bytes (per second) = the average read throughput in bytes per second
  3. physical write total IO requests (per second) = the average number of write IOPS
  4. physical write total bytes (per second) = the average write throughput in bytes per second

Again I’ve coloured the IOPS figures in red and the throughput figures in blue.

Tying It Together

Now that we have our two sets of values, let’s just compare them to make sure they align. The IOPS figures do not require any conversion but the throughput figures do: the values in the Instance Activity Stats section are in bytes/second and we want them to be in MB/second so we need to divide by 1048576 (i.e. 1024 * 1024).

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

519.7

525.4

1.09%

Write IOPS

201.0

200.8

0.10%

Read Throughput

8.86241 MB/sec

9,334,578.5 bytes/sec

8.90215 MB/sec

0.45%

Write Throughput

4.94263 MB/sec

5,180,071.5 bytes/sec

4.94010 MB/sec

0.05%

I’ve calculated the error percentages here to see how far the figures vary. It is my assumption that the Instance Activity Stats are accurate and that any margin of error in the IOStat figures comes as a result of sampling issues. The highest error percentage we see here is just over 1%, which is hardly a problem in my opinion.

Don’t Believe The Stats

So far I have no complaints about the matching of statistics in the AWR report. But now let me introduce you to the AWR report that has been puzzling me recently:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Direct Reads       3.8T 2.7E+04    3.2G    405M     3.6 .335473       0     N/A
Direct Writes        0M     0.0      0M  164.5G  1159.0 139.519       0     N/A
Buffer Cache Re   63.2G  3177.6 53.6086      0M     0.0      0M 3292.5K    23.4
DBWR                 0M     0.0      0M      2G    45.3 1.71464       0     N/A
LGWR                 0M     0.0      0M    485M    19.2 .401739   19.7K     0.9
Others             220M    12.3 .182232     26M     0.7 .021536   15.6K    34.5
Streams AQ           0M     0.0      0M      0M     0.0      0M       1     9.0
TOTAL:             3.9T 3.0E+04    3.3G  167.4G  1227.8 141.992 3327.8K    23.3
          -------------------------------------------------------------

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read total IO requests          35,889,568       29,728.4       3,239.4
physical read total bytes         4,261,590,852,608 3.52999864E+09 3.8465483E+08

physical write total IO requests          1,683,381        1,394.4         151.9
physical write total bytes          205,090,714,624  169,882,555.1  18,511,663.0

Again I’ve coloured the IOPS measurements in red and the throughput measurements in blue. And as before we need to convert the bytes per second values shown in the Instance Activity Stats section to MB/sec as shown in the IOStat by Function Summary section.

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

30,000

29,728.4

0.91%

Write IOPS

1227.8

1,394.4

11.95%

Read Throughput

3.3 GB/sec

3,379.2 MB/sec

3,529,998,640 bytes/sec

3,366.469 MB/sec

0.38%

Write Throughput

141.992 MB/sec

169,882,555.1 bytes/sec

162.01263 MB/sec

12.36%

Do you see what’s bugging me here? The write values for both IOPS and throughput are out by over 10% when I compare the values in IOStat by Function against the Instance Activity Stats. Ten percent is a massive margin of error at this level – we’re talking 20MB/sec. To translate that into something easier to understand, if 20MB/sec were sustained over a 24 hour period it would amount to over 1.6TB of data. I’ve seen smaller data warehouses.

So why is this happening? Unfortunately I don’t have access to the system where this AWR report was created, so I cannot tell if, for example, there was an instance restart between the start and end snapshots (although the elapsed time of the report was just 20 minutes so it seems unlikely).

The truth is I don’t know. Which is why I’m writing about it here… if you think you have the answer, or just as importantly if you see the same behaviour, let me know!

Advertisements

4 Responses to Oracle AWR Reports: When IOStats Lie

  1. The first anomaly is that the function called “Direct Read” is logged at 405MB written. Have you checked the arithmetic to see what the best fit would be if some other reads and writes and been accounted the wrong way round ?

    I assume you’ve checked all the read and write stats on the instance activity to see if
    a) there are stats where the first set of instance activity shows close to zero and the second shows something like the discrepancy.
    or
    b) there are stats that look as if they may be the right size to account for double-counting in one area or missed accounting in the other.

    • flashdba says:

      Hi Jonathan – and thanks for taking up some of your valuable time to look at my humble post.

      You are right, I have spent a lot of time looking at the various instance activity stats trying to find something that would tally up with the 20,993,152 bytes/sec of missing write throughput or 166.6 write IOPS. I’ve had a lot of false dawns, but overall I haven’t been successful – although it has lead me to invent a new term: AWR blindness.

      It’s interesting that if I look at these values of IOPS and throughput I can calculate that whatever this missing data stream is (assuming it’s mainly caused by just one thing) the average block size is around 123k. If we round that up to 128k it perfectly equals the database block size (8k) multiplied by the value for db_file_multiblock_read_count (16). But is that relevant? Is anything relevant? Or have I been staring at this report for so long that I am now able to prove that black is white and so will get myself run over on the next zebra crossing.

      • jo says:

        The “nearly 128KB” is the sort of thing that could be relevant – especially since you’ve got direct path WRITE stats under the direct path READ function …. it hints at a bug with direct path reads, which would tend to be the full size read except for the initial small extents if you’ve got system managed extent allocation.

        If you can post the second AWR somewhere I’ll see if I can find a little time tomorrow to take a look at it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s