New section: Oracle SLOB Testing

slob ghost

For some time now I have preferred Oracle SLOB as my tool for generating I/O workloads using Oracle databases. I’ve previously blogged some information on how to use SLOB for PIO testing, as well as shared some scripts for running tests and extracting results. I’ve now added a whole new landing page for SLOB and a complete guide to running sustained throughput testing.

Why would you want to run sustained throughput tests? Well, one great reason is that not all storage platforms can cope with sustained levels of write workload. Flash arrays, or any storage array which contains flash, have a tendency to suffer from garbage collection issues when sustained write workloads hit them hard enough.

Find out more by following the links below:

Advertisement

Oracle SLOB On Solaris

Guest Post

This is another guest post from my buddy Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Nate believes that all English people live in the Dickensian London of the 19th century and speak in Cockney rhyming slang. I hate to disappoint, so have a butcher’s below and feast your mince pies on his attempts to make SLOB work on Solaris without going chicken oriental. Over to you Nate, me old china plate.

slob ghost

Note: The Silly Little Oracle Benchmark, or SLOB, is a Linux-only tool designed and released for the community by Kevin Closson. There are no ports for other operating systems – and Kevin has always advised that the solution for testing on another platform is to use a Linux VM and connect via TNS. The purpose of this post is simply to show what happens when you have no other choice but to try and get SLOB working natively on Solaris…

I wrestled with SLOB 2 for a couple hours last week for a demo build we had in-house to show our capabilities to a prospective customer. I should mention I’ve had great success—and ease!—with SLOB 2 previously. But that was on Linux. This was on Solaris 10—to mimic the setup the customer has in-house. No problem, I thought; there’s some C files to compile, but then there’s just shell scripts to drive the thing. What could go wrong?

Well, it would seem Kevin Closson, the creator of SLOB and SLOB 2, did his development on an OS with a better sense of humor than Solaris. The package unzipped, and the setup.sh script appeared to run successfully, but runit.sh would load up the worker threads and wait several seconds before launching them—and then immediately call it “done” and bail out, having executed on the database only a couple seconds. Huh? I had my slob.conf set to execute for 300 seconds.

I had two databases created: one with 4K blocks and one with 8K blocks. I had a tablespace created for SLOB data called SLOB4K and SLOB8K, respectively. I ran setup.sh SLOB4K 128, and the log file showed no errors. All good, I thought. Now run runit.sh 12, and it stops as quickly as it starts. Oof.

It took Bryan Wood, a much better shell script debugger (hey, I said DEbugger) than myself, to figure out all the problems.

First, there was this interesting line of output from the runit.sh command:

NOTIFY: Connecting users 1 2 3 Usage: mpstat [-aq] [-p | -P processor_set] [interval [count]]
4 5 6 7 8 9 10

Seems Solaris doesn’t like mpstat –P ALL. However it seems that on Solaris 10 the mpstat command shows all processors even without the -P option.

Next, Solaris doesn’t like Kevin’s “sleep .5” command inside runit.sh; it wants whole numbers only. That raises the question in my mind why he felt the need to check for running processes every half second rather than just letting it wait a full second between checks, but fine. Modify the command in the wait_pids() function to sleep for a full second, and that part is happy.

But it still kicks out immediately and kills the OS level monitoring commands, even though there are active SQL*Plus sessions out there. It seems on Solaris the ps –p command to report status on a list of processes requires the list of process IDs to be escaped where Linux does not. IE:

-bash-3.2$ ps -p 1 2 3
usage: ps [ -aAdeflcjLPyZ ] [ -o format ] [ -t termlist ]
        [ -u userlist ] [ -U userlist ] [ -G grouplist ]
        [ -p proclist ] [ -g pgrplist ] [ -s sidlist ] [ -z zonelist ]
  'format' is one or more of:
        user ruser group rgroup uid ruid gid rgid pid ppid pgid sid taskid ctid
        pri opri pcpu pmem vsz rss osz nice class time etime stime zone zoneid
        f s c lwp nlwp psr tty addr wchan fname comm args projid project pset

But with quotes:

-bash-3.2$ ps -p "1 2 3"
   PID TTY         TIME CMD
     1 ?           0:02 init
     2 ?           0:00 pageout
     3 ?          25:03 fsflush

After some messing about, Bryan had the great idea to simply replace the command:

while ( ps -p $pids > /dev/null 2>&1 )

With:

while ( ps -p "$pids" > /dev/null 2>&1 )

Just thought I might save someone else some time and hair pulling by sharing this info… Here are the finished file diffs:

-bash-3.2$ diff runit.sh runit.sh.original
31c30
< while ( ps -p "$pids" > /dev/null 2>&1 )
---
> while ( ps -p $pids > /dev/null 2>&1 )
33c32
<       sleep 1
---
>       sleep .5
219c218
<       ( mpstat 3  > mpstat.out ) &
---
>       ( mpstat -P ALL 3  > mpstat.out ) &

SLOB2: Testing The Effect Of Oracle Blocksize

I recently posted a test harness for generating physical I/O using the new version of SLOB (the Silly Little Oracle Benchmark) known as SLOBv2. This test harness can be used for driving varying workloads and then processing the results for use in … well, wherever really. Some friends of mine are getting very adept with R recently, but I have yet to board that train, so I’m still plugging my data into Excel. Here’s an example.

We know that Oracle allows varying database block sizes with the parameter DB_BLOCK_SIZE, which typically has values of 4k, 8k (the default), 16k or 32k. Do you ever change this value? In my experience the vast majority of customers use 8k and a small number of data warehouse users choose 32k. I almost never see 16k and absolutely never see 4k or lower. Yet the choice of value can have a big effect on performance…

Simple Test with 8k Block Size

In the storage world we like to talk about IOPS and throughput as well as latency (see this article for a description of these terms). IOPS and throughput are related: multiply the number of IOPS by the block size and you get the throughput as a result.

So let’s see what happens if we run SLOB PIO tests using the test harness for the default 8k block size, testing workloads with 0% update DML, 10%, 20% and 30%:

SLOB2-initial-testing

You can see four loops or “petals”, with each loop starting at the bottom left and working out towards the upper right, moving anti-clockwise before coming back down. This is expected behaviour, because each line tracks an increasing number of SLOB processes from 1 to 64. As the number of processes increases, so does the number of IOPS – and as a result there is a small increase in the latency. At some point near the high end of 64 the server CPU gets saturated, causing the number of IOPS to drop and therefore resulting in the latency coming down again – this is because the compute resource is exhausted while the storage has plenty left to spare (the server has 2 sockets each with an E5-2470 8-core processor, but crucially I am pinning* Oracle to one core with CPU_COUNT=1). To put that even more simply, there is not enough CPU available to drive the storage any harder (and the storage can go a LOT faster – after all, it’s the same storage used during this). [* This is really bad wording from me – see the comments section below]

From this graph we can deduce the optimum number of SLOB processes needed to drive the maximum possible I/O through Oracle for each workload: the point where the line bends back on itself marks this spot. We can also see, in graphical form, evidence of what we already know: read-only workloads can drive much higher amounts of IOPS at a lower latency than mixed workloads.

Multiple Tests with Varying Block Sizes

Now let’s take that relatively simple graph and cover it in cluttered “petal-shaped” lines like the ones above, but with a set for each of the following block sizes: 4k, 8k, 16k and 32k.

slob-blocksize-tests-latency-versus-IOPS

Ok so it’s not easy on the eyes – but look closely because there’s a story in there. In this graph, each block size has the same four-petal pattern as above, with the colour used to denote the block size. The 32k block size, for example, is in purple – and quite clearly exhibits the highest latencies at its peaks. The blue 4k blocksize line, on the other hand, has very low latencies and extends the furthest to the right – indicating that 4k would be the better choice if you were aiming to drive as many IOPS as possible.

So 4k has lower latency and more IOPS… must be the way to go, right?

Two Sides To Every Story

What happens if we stop thinking about IOPS and start thinking about throughput? By multiplying the IOPS by the block size we can draw up the same graph but with throughput on the horizontal axis instead:

slob-blocksize-tests-latency-versus-throughput

Well now. The blue 4k line may indeed have the lowest latency figures but if throughput is important it’s nowhere on this scale. The purple 32k line, on the other hand, is able to drive over 3,500MB/sec of throughput at its peak (and still stay around the 300 microsecond latency mark). Maybe 32k is the way to go then?

Conclusion

As always, the truth lies somewhere in between. In the case of SLOB the workload is extremely random, meaning that each update is probably only affecting one row per block. It therefore makes no sense to have large 32k blocks as this is just an overhead – the throughput may be high, but the majority of the data being read is waste. Your real life workload, on the other hand, is likely to be more diverse and unpredictable. SLOB is a brilliant tool for using Oracle itself to generate load, but not intended as a substitute for proper testing. What it is great for though is learning, so use the test harness (or write your own) and get testing.

Also, don’t overlook the impact of DB_BLOCK_SIZE when building your databases – as you can see above it has a potentially dramatic effect on I/O.

New SLOB2 Physical I/O Harness

slob ghostShort post to point out that I’ve now posted the updated PIO Test Harness for SLOB2. This can be used to run multiple SLOB tests with varying numbers of workers and values of UPDATE_PCT. In addition there is also a revised version of the AWR analyzer shell script which can be used to extract various performance values from the AWR reports outputted by SLOB2.

Having said that, SLOB2 also comes with its own data extraction tool, awr_info.sh, so it may be that you prefer to use this depending on what data you are interested in.

SLOB: PL/SQL Commit Optimization

slob ghostI ran some SLOB tests over the weekend using the new SLOBv2 kit and noticed some interesting results. I was using SLOB to generate physical I/O but the “anomaly” is best demonstrated by putting SLOB in “Logical I/O mode”, i.e. by having a large enough buffer cache to satisfy all reads.

I’m calling SLOB with the following configuration parameters and 32 worker processes:

UPDATE_PCT=20
RUN_TIME=30000
WORK_LOOP=1000
SCALE=10000
WORK_UNIT=256
REDO_STRESS=HEAVY
LOAD_PARALLEL_DEGREE=8
SHARED_DATA_MODULUS=0

Notice the WORK_LOOP value is non-zero and the RUN_TIME is fairly large – I’m choosing to run a specific set of SLOBops rather than use elapsed time to define each test length. With WORK_LOOP at 1,000 and 32 worker processes that should generate 32,000 SLOBops. Since UPDATE_PCT is 20% I would expect to see around (32,000 * 20%) = 6,400 update statements. So let’s have a look at a couple of interesting statistics in the AWR report generated from this run:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
redo synch writes                                97            2.0           0.0
user commits                                  6,400          134.5           1.0

That’s exactly the number of user commits we expected. But the number of redo synch writes is interesting…

Redo Synch Writes

When a session places a commit record into the log buffer it posts the log writer process and then puts itself into a log file sync wait until LGWR notifies it that the record has been written to persistent storage. Actually there are times when the session will not post LGWR (because it can see via a flag that LGWR is already writing) but one thing it always does is increment the counter redo synch writes. So in the above AWR output we would expect to see a matching number of redo synch writes to user commits… yet we don’t. Why?

There’s a little-known optimization in Oracle PL/SQL which means that Oracle will not always wait for the log buffer flush to complete, but will instead carry on processing – effectively sacrificing the D (durability) of ACID compliance. This is best explained by Jonathan Lewis in Chapter 6 of his excellent book Oracle Core – if you haven’t read it, consider putting it at the top of your reading list.

Because SLOB’s engine is a PL/SQL block containing a WHILE … LOOP, Oracle decides that the concept of durability can be rather loosely defined to be at the level of the PL/SQL block rather than the transactions being created within it. According to Jonathan, one way of persuading Oracle not to use this optimization is to use a database link; so let’s modify the slob.sql update statement to include the use of a loopback database link and see if the number of redo synch writes now rises to around 6,400:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
redo synch writes                             6,564          302.3           0.5
user commits                                 12,811          590.0           1.0

Indeed it does… but now the number of user commits has doubled, presumably as the result of Oracle performing a two-phase commit (Oracle doesn’t know the loopback database link points to the same database so assumes the transactions are distributed).

Conclusion

I blogged this because I found it interesting, rather than because I had a point I was trying to prove. However, if there were to be any conclusions to this entry they would be the following:

  1. SLOB is a great tool for experimenting with the behaviour of Oracle under load
  2. Jonathan’s Oracle Core book is essential reading for anyone who wants to understand Oracle to a deeper level

It’s probably also worth keeping in mind that SLOB’s use of PL/SQL blocks may result in slightly different behaviour from the log writer than you might see from alternative load generation tools or applications which generate I/O.

SLOB2: Essential for Every DBA Toolkit

slob ghostA couple of weeks ago, Kevin released the second version of SLOB – the Silly Little Oracle Benchmark. Readers will know that I was already a big fan of the original version, but version 2 (which I was fortunate enough to test prior to its release) now has extra features and functionality which make it the definitive load generation tool for Oracle databases. To put it simply, every DBA should now have SLOB2 in their toolkit.

Why? Well, SLOB allows you to learn things. It allows you to study performance and test the effect of changes that you introduce to a stable system. It allows you to find bottlenecks. And most importantly for me, it allows you to write test harnesses which build up a picture of the complete performance characteristics of a system.

Here’s an example using my two-socket eight-core Sandy Bridge-EN server. What’s the maximum number of IOPS I can drive through an Oracle database when running a read-only workload? What about a workload with 10% writes, or 20% or 30%? And what latencies does the storage deliver? Using SLOB2 with my PIO test harness (located here) I can easily generate all of these results and plot them on a graph:

SLOB2-initial-testing

Each line on the graph represents a SLOB test with a specific UPDATE_PCT value (0%, 10%, 20% or 30%). The line plots the IOPS vs Latency results as the number of SLOB worker processes ramps up from 1 to 64. You can see that, as you might expect, each line reaches a peak value of IOPS which the system can deliver (limited in this case by the server, not the storage) then starts to fall back down. From this graph you can fairly easily predict the behaviour as the UPDATE_PCT values increase further. Sometimes a picture really is worth a thousand words.

I’ll be posting some more SLOB test results and ideas soon. For now, my recommendation is to download it and use it. SLOB isn’t just a benchmark tool, it’s a learning tool. Use it wisely and you will come to understand a lot more about Oracle performance.

Using SLOB to Test Physical I/O

slob ghostFor some time now I’ve been using the Silly Little Oracle Benchmark (SLOB) tool to drive physical I/O against the various high performance flash memory arrays I have in my lab (one of the benefits of working for Violin Memory is a lab stuffed with flash arrays!)

I wrote a number of little scripts and tools to automate this process and always intended to publish some of them to the community. However, time flies and I never seem to get around to finishing them off or making them presentable. I’ve come to the conclusion now that this will probably never happen, so I’ve published them in all their nasty, badly-written and uncommented glory. I can only apologise in advance.

You can find them here.

SLOB using Violin 6616 on Fujitsu Servers

I’ve been too busy to blog recently, which is frustrating because I want to talk more about subjects that are important to me such as Database Virtualization and the great hype around In Memory Databases. However, in the meantime I’d like to share some results from running SLOB (the Silly Little Oracle Benchmark) on Violin Memory flash memory arrays.

I’ve posted some SLOB results in the past, such as these. However, at the time I only had some Nehalem generation CPUs in my lab servers, so I was struggling to generate anything like the level of load that a Violin flash memory array can create.

Now I have some power :-). Thanks to my good friends at Fujitsu I have some powerful servers including an RX300 on which to run some SLOB testing. This is a two-socket machine containing two quad-core Intel Xeon E5-2643 Sandy Bridge processors. Let’s see what it can do.

First let’s run a physical IO test using a Violin Memory 6616 array connected by 8GFC:

Load Profile              Per Second    Per Transaction
~~~~~~~~~~~~         ---------------    ---------------
      DB Time(s):               95.9            5,511.9
       DB CPU(s):                7.1              406.0
       Redo size:            2,729.7          156,939.6
   Logical reads:          192,361.2       11,059,614.8
   Block changes:                6.0              347.5
  Physical reads:          189,430.7       10,891,128.1
 Physical writes:                1.9              110.6

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read         119,835,425      50,938      0   84.0 User I/O
latch: cache buffers lru chain   20,051,266       6,221      0   10.3 Other
DB CPU                                            4,466           7.4
wait list latch free                567,154         572      1     .9 Other
latch: cache buffers chains           1,653           1      0     .0 Concurrenc

The database block size is 8k so that’s 189k random read 8k IOPS. However, as you can see the Oracle AWR report does not calculate the average wait time with enough accuracy for testing flash – the average wait of my random I/O is showing as zero milliseconds. However it’s very simple to recalculate this in microseconds since we just need to take the total time (50,938 seconds) and divide by the total number of waits (119,835,425) then multiply by one million to get the answer: 425 microseconds average latency.

Now let’s run a redo generation test. This time I’m on a Fujitsu PRIMEQUEST 8-socket E7-8830 (Westmere-EX) NUMA box, again connected via 8GFC to the Violin Memory 6616 array. Remember, this test is the one where you create a large enough SGA and buffer cache that the Database Writer does not get posted to flush dirty blocks to disk. As a result, the Log Writer is the only process performing physical I/O. This is a similar workload to that produced during OLTP benchmarks such as TPC-C – a benchmark for which Oracle, Cisco, HP and Fujitsu have all recently used Violin Memory arrays to set records. Here’s the output from my redo SLOB test:

Load Profile              Per Second    Per Transaction
~~~~~~~~~~~~         ---------------    ---------------
      DB Time(s):              197.6                2.8
       DB CPU(s):               18.8                0.3
       Redo size:    1,477,126,876.3       20,568,059.6
   Logical reads:          896,951.0           12,489.5
   Block changes:          672,039.3            9,357.7
  Physical reads:           15,529.0              216.2
 Physical writes:          166,099.8            2,312.8

Over 1.4GB/second of sustained redo generation (the snapshot was for a five minute period). That’s an impressive result. Unfortunately I’ve had to sit on it for a month now whilst the details of the Violin and Fujitsu agreement were worked out… but now I am able to share it, I am looking forward to setting some new records using Violin Memory storage and Fujitsu servers. Stay tuned…!

[Update: I confess that in the month or so since I ran these tests I’d forgotten that I didn’t use SLOB to produce the results of the redo test, although it has a perfectly capable redo generation ability… the redo test output above is actually from a TPC-C-like workload generator. The physical read tests are from SLOB though – and it still remains my favourite benchmarking tool!]

SLOB on Violin 3000 Series with PCIe Direct Attach

A reader Alex asked if I could post a comparative set of tests from my previous 3000 series Infiniband testing but using the PCIe direct-attached method. I was actually very keen to test this myself as I wanted to see how close the Infiniband connectivity method could get to the PCIe latencies. Why? Well, PCIe offers the lowest overhead but also causes some HA problems.

When SSDs first came out they were just that, solid state disks – or at least they looked like them. They had the same form factor and plugged into existing disk controllers, but had no spinning magnetic parts. This offered performance benefits but those benefits were restricted to the performance of those very disk controllers, which were never designed for this sort of technology. We call this the first generation of flash.

To overcome this architectural limitation, flash vendors came out with a new solution – placing flash on PCIe cards which can then attached direct to the system board, reducing latency and providing extreme performance. This is what we call the second generation of flash. It is what vendors such as Fusion IO provide – and looking at FIO’s share price you would have to congratulate them on getting to market and making a success of this.

However, there are other architectural limitations to this PCIe approach. One is that you cannot physical share the storage provided by PCIe – sure you can run some sort of sharing software to make it available outside of the server it is plugged into, but that increases latency and defeats the object of having super-fast flash storage plugged right into the system board. Even worse, if the system goes down then that flash (and everything that was on it) is unavailable. This makes PCIe flash cards a non-starter for HA solutions. If you want HA then the best you can do with them is use them for caching data which is still available on shared storage elsewhere (the Oracle Database Smart Flash Cache being one possible solution).

At Violin we don’t like that though. We don’t believe in spending time and CPU resources (or even worse, human resources) managing a cache of data trying to improve the probability and predictability of cache hits. Not when flash is now available as a tier 1 storage medium, giving faster results whilst using less space, power and cooling.

Another problem with PCIe is that the number of slots on a system board will always be limited – for reasons of heat, power, space etc there will always be a limit beyond which you cannot expand.

And there’s another even more major problem with PCIe flash cards, which no PCIe flash vendor can overcome: you cannot replace a PCIe card without taking the server down. That’s hardly the sort of enterprise HA solution that most customers are looking for.

This is where we get to the third generation of flash storage, which is to place the flash memory into arrays which connect via storage fabrics such as fibre-channel or Infiniband. This allows for the flash storage to be shared, to be extended, to offer resilience (e.g. RAID) and to have high-availability features such as online patching and maintenance, hot-swappable components etc.

This is the approach that Violin Memory took when designing their flash memory arrays from the ground up. And it’s an approach which has resulted in both families of array having a host of connectivity features: PCIe (for those who don’t want HA), iSCSI, Fibre-Channel and now Infiniband.

But what does the addition of a fibre-channel gateway do to the latency? Well, it adds a few hundred microseconds to the latency… In the scheme of things, when legacy disk arrays deliver latencies of >5ms that’s nothing, but when we are talking about flash memory with latencies of <1ms that suddenly becomes a big deal. And that’s why the Infiniband connectivity is so important – because it ostensibly offers the latency of PCIe but with the HA and management features of FC.

So let’s have a look at the latencies of the 3000 series using PCIe direct attach to see how the latency measures up against the Infiniband testing in my previous post:

Filename      Event                          Waits  Time(s)  Latency       IOPS
------------- ------------------------ ------------ -------- ------- ----------
awr_0_1.txt   db file sequential read      308,185       33     107     7,139.2
awr_0_4.txt   db file sequential read    4,166,252      510     122    24,883.1
awr_0_8.txt   db file sequential read    9,146,095    1,245     136    41,569.2
awr_0_16.txt  db file sequential read   19,496,201    3,112     160    70,121.9
awr_0_32.txt  db file sequential read   40,159,185   11,079     275    92,185.0
awr_0_64.txt  db file sequential read   81,342,725   49,049     602    99,060.1

We can see that again the latency is pretty much scaling at a linear rate. And up to 16 readers (which is double the number of CPU cores I have available) the latency remains under 200us. This is very similar to the Infiniband results, where up to (and including) 16 readers I also had <200us latency.

A couple of points to note:

  • Again the lack of CPU capability in my Supermicro servers is prohibiting me from really pushing the arrays – causing the tests above 16 readers to get skewed. I have requested a new set of lab servers with ten-core Westmere-EX CPUs so I just need to sit back and wait for Father Christmas to visit
  • The database block size is 8k
  • To make matters even more complicated, this was actually a RAC system (although I ran the SLOB tests from a single instance)

That last point is worth expanding. I said that PCie does not allow for HA. That’s not strictly true for Violin however. In this system I have a pair of Supermicro servers, each connected via PCie to my single 3205 SLC array and presenting a single LUN, which I have partitioned and presented to ASM as a series of ASM disks.

Because ASM does not require SCSI-3 persistent reservations or any other such nastiness, I am able to use this as shared storage and run a 11.2.0.3 RAC and Grid Infrastructure system on it. I’ve run all the usual cable-pulling tests and not managed to break it yet, although I’m not convinced it is a design I would choose over Infiniband if I had to choose… mainly because the PCIe method does not incorporate the Violin Memory HA Gateway, which gives me the management GUI and an additional layer of protection from partial / unaligned IO.

I now need to go and beg for that bigger server so I can get some serious testing done on the 6000 series array which is currently laughing at me every time I tickle it with SLOB

SLOB on Violin 3000 Series with Infiniband

Last week I invited Martin Bach to the Violin Memory EMEA headquarters to do some testing on both our 3000 and 6000 series arrays. Martin was very interested in seeing how the Violin flash memory arrays performed, having already had some experience with PCIe-based flash card vendor.

There are a few problems with PCIe flash cards, but perhaps the two most critical are that a) the storage cannot be shared, meaning it isn’t highly-available; and b) the replacement of any PCIe card requires the server to be taken offline.

Violin’s approach is fundamentally different because the flash memory is contained in a separate unit which can then be presented over one of a number of connections: PCIe direct-attached, Fibre Channel, iSCSI… and now Infiniband. All of those, with the exception of PCIe, allow for the storage to be shared and highly-available. So why do we still provide PCIe?

There are two answers. The first and most simple answer is for flexibility – the design of the arrays makes it simple to provide multiple connectivity options, so why not? The second and more important (in terms of performance) is for latency. The overhead of adding fibre-channel to a flash memory is only in the order of one or two hundred microseconds, but if you consider that the 6216 SLC array has a read and write latency of 90 and 25 microseconds respectively that’s quite an additional overheard.

The new and exciting addition to these options is therefore Infiniband, which allows for extremely low latencies yet with the ability to avoid the pitfalls of PCIe around sharing and HA.

To demonstrate the latency figures achievable through a 3205 SLC array connected via Infiniband, Martin and I ran a series of SLOB physical IO tests and monitored the latency. The tests consisted of gradually ramping up the number of readers to see how the latency fared as the number of IOPS increased – we always kept the number of writers as zero. As usual the database block size was 8k. Here are the results:

Filename      Event                          Waits  Time(s)  Latency       IOPS
------------- ------------------------ ------------ -------- ------- ----------
awr_0_1.txt   db file sequential read        9,999        1     100     2,063.8
awr_0_4.txt   db file sequential read       29,992        5     166     5,998.8
awr_0_8.txt   db file sequential read       39,965        6     150     8,285.5
awr_0_16.txt  db file sequential read       79,958       15     187    13,897.8
awr_0_32.txt  db file sequential read      159,914       43     269    18,133.9
awr_0_64.txt  db file sequential read   21,595,919    6,035     280   115,461.1
awr_0_128.txt db file sequential read   99,762,808   69,007     691   124,907.4

The interesting thing is to note how the latency scales linearly. The tests were performed on a 2s8c16t Supermicro server with 2x QDR Infiniband connections via a switch to the array. The Supermicro starts having trouble driving the IO once we get beyond 32 readers – and by the time we get to 128 the load average is so high on the machine that even logging on is hard work. I guess it’s time to ask for a bigger server in the lab…