SLOB on Violin 3000 Series with PCIe Direct Attach

A reader, Alex, asked if I could post a comparative set of tests from my previous 3000 series Infiniband testing but using the PCIe direct-attached method. I was actually very keen to test this myself as I wanted to see how close the Infiniband connectivity method could get to the PCIe latencies. Why? Well, PCIe offers the lowest overhead but also causes some HA problems.

When SSDs first came out they were just that, solid state disks – or at least they looked like them. They had the same form factor and plugged into existing disk controllers, but had no spinning magnetic parts. This offered performance benefits but those benefits were restricted to the performance of those very disk controllers, which were never designed for this sort of technology. We call this the first generation of flash.

To overcome this architectural limitation, flash vendors came out with a new solution – placing flash on PCIe cards which can then be attached directly to the system board, reducing latency and providing extreme performance. This is what we call the second generation of flash. It is what vendors such as Fusion IO provide – and looking at FIO’s share price you would have to congratulate them on getting to market and making a success of this.

However, there are other architectural limitations to this PCIe approach. One is that you cannot physically share the storage provided by PCIe – sure, you can run some sort of sharing software to make it available outside of the server it is plugged into, but that increases latency and defeats the object of having super-fast flash storage plugged right into the system board. Even worse, if the system goes down then that flash (and everything that was on it) is unavailable. This makes PCIe flash cards a non-starter for HA solutions. If you want HA then the best you can do with them is use them for caching data which is still available on shared storage elsewhere (the Oracle Database Smart Flash Cache being one possible solution).

At Violin we don’t like that though. We don’t believe in spending time and CPU resources (or even worse, human resources) managing a cache of data trying to improve the probability and predictability of cache hits. Not when flash is now available as a tier 1 storage medium, giving faster results whilst using less space, power and cooling.

Another problem with PCIe is that the number of slots on a system board will always be limited: for reasons of heat, power, space, etc., there is a point beyond which you cannot expand.

And there’s an even bigger problem with PCIe flash cards, which no PCIe flash vendor can overcome: you cannot replace a PCIe card without taking the server down. That’s hardly the sort of enterprise HA solution that most customers are looking for.

This is where we get to the third generation of flash storage, which is to place the flash memory into arrays which connect via storage fabrics such as fibre-channel or Infiniband. This allows for the flash storage to be shared, to be extended, to offer resilience (e.g. RAID) and to have high-availability features such as online patching and maintenance, hot-swappable components etc.

This is the approach that Violin Memory took when designing their flash memory arrays from the ground up. And it’s an approach which has resulted in both families of array having a host of connectivity features: PCIe (for those who don’t want HA), iSCSI, Fibre-Channel and now Infiniband.

But what does the addition of a fibre-channel gateway do to the latency? Well, it adds a few hundred microseconds to the latency… In the scheme of things, when legacy disk arrays deliver latencies of >5ms that’s nothing, but when we are talking about flash memory with latencies of <1ms that suddenly becomes a big deal. And that’s why the Infiniband connectivity is so important – because it ostensibly offers the latency of PCIe but with the HA and management features of FC.

So let’s have a look at the latencies of the 3000 series using PCIe direct attach to see how the latency measures up against the Infiniband testing in my previous post:

Filename      Event                          Waits  Time(s)  Latency(us)       IOPS
------------- ------------------------ ------------ -------- ----------- ----------
awr_0_1.txt   db file sequential read      308,185       33         107     7,139.2
awr_0_4.txt   db file sequential read    4,166,252      510         122    24,883.1
awr_0_8.txt   db file sequential read    9,146,095    1,245         136    41,569.2
awr_0_16.txt  db file sequential read   19,496,201    3,112         160    70,121.9
awr_0_32.txt  db file sequential read   40,159,185   11,079         275    92,185.0
awr_0_64.txt  db file sequential read   81,342,725   49,049         602    99,060.1

We can see that again the latency is pretty much scaling at a linear rate. And up to 16 readers (which is double the number of CPU cores I have available) the latency remains under 200us. This is very similar to the Infiniband results, where up to (and including) 16 readers I also had <200us latency.
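As a sanity check on the table, the latency column can be reproduced by dividing the total wait time by the number of waits. The sketch below is illustrative only: the function name is made up, and it assumes the number in each AWR filename is the reader count. Because Time(s) is rounded to whole seconds in the report, the recomputed values can differ from the table by about a microsecond.

```python
# Rows copied from the table above:
# (readers, waits, total wait time in seconds, reported latency in microseconds).
# Assumption: the N in awr_0_N.txt is the number of SLOB readers.
ROWS = [
    (1, 308_185, 33, 107),
    (4, 4_166_252, 510, 122),
    (8, 9_146_095, 1_245, 136),
    (16, 19_496_201, 3_112, 160),
    (32, 40_159_185, 11_079, 275),
    (64, 81_342_725, 49_049, 602),
]


def avg_latency_us(waits, time_s):
    """Average time per wait event, converted from seconds to microseconds."""
    return time_s / waits * 1_000_000


for readers, waits, time_s, reported in ROWS:
    computed = avg_latency_us(waits, time_s)
    print(f"{readers:>2} readers: {computed:6.1f} us (table: {reported} us)")
```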

A couple of points to note:

  • Again the lack of CPU capability in my Supermicro servers is preventing me from really pushing the arrays – causing the tests above 16 readers to get skewed. I have requested a new set of lab servers with ten-core Westmere-EX CPUs so I just need to sit back and wait for Father Christmas to visit.
  • The database block size is 8k.
  • To make matters even more complicated, this was actually a RAC system (although I ran the SLOB tests from a single instance).
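A rough cross-check on the CPU point, again assuming the number in each AWR filename is the reader count: Little's law says the mean number of reads in flight equals IOPS multiplied by mean latency. The sketch below applies that to the results table and also prints IOPS per reader. The function name and the reading of the numbers are illustrative, not something reported by SLOB or AWR.

```python
# (readers, IOPS, latency in microseconds) copied from the results table.
RESULTS = [
    (1, 7_139.2, 107),
    (4, 24_883.1, 122),
    (8, 41_569.2, 136),
    (16, 70_121.9, 160),
    (32, 92_185.0, 275),
    (64, 99_060.1, 602),
]


def reads_in_flight(iops, latency_us):
    """Little's law: mean concurrency = throughput x mean latency."""
    return iops * latency_us / 1_000_000


for readers, iops, latency_us in RESULTS:
    in_flight = reads_in_flight(iops, latency_us)
    print(
        f"{readers:>2} readers: {in_flight:5.2f} reads in flight, "
        f"{iops / readers:7.1f} IOPS per reader"
    )
```

Per-reader IOPS falls steadily as readers are added, which is at least consistent with the servers running out of CPU, although the table alone cannot show where the non-I/O time goes.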

That last point is worth expanding. I said that PCIe does not allow for HA. That’s not strictly true for Violin however. In this system I have a pair of Supermicro servers, each connected via PCIe to my single 3205 SLC array and presenting a single LUN, which I have partitioned and presented to ASM as a series of ASM disks.

Because ASM does not require SCSI-3 persistent reservations or any other such nastiness, I am able to use this as shared storage and run an 11.2.0.3 RAC and Grid Infrastructure system on it. I’ve run all the usual cable-pulling tests and not managed to break it yet, although I’m not convinced it is a design I would choose over Infiniband if I had to choose… mainly because the PCIe method does not incorporate the Violin Memory HA Gateway, which gives me the management GUI and an additional layer of protection from partial / unaligned IO.

I now need to go and beg for that bigger server so I can get some serious testing done on the 6000 series array, which is currently laughing at me every time I tickle it with SLOB.


4 Responses to SLOB on Violin 3000 Series with PCIe Direct Attach

  1. Alex says:

    Hello,

    Thanks for running that. There is something I am trying to understand: based on these results we have around 5 times more IOPS with PCIe compared to Infiniband (with 8 readers, 40k vs 8k IOPS)? But then the latency of the two is very comparable? I must be missing something here.

    Regards,
    Alex

    • flashdba says:

      Hi Alex

      You are absolutely right – but it’s down to a learning curve with SLOB rather than the array. When Martin and I ran those IB tests we originally used a larger buffer cache and didn’t set the recycle pool up, so we were getting >90% buffer cache hits, hence a lot of CPU cycles were “wasted” on logical I/O rather than the all-important physical I/O.

      I will post some more results with the recycle pool set up properly in due course. I also keep meaning to plot some graphs of latency versus IOPS but … as always there aren’t enough hours in the day. I guess that’s the price you pay when working for a startup!

  2. kevinclosson says:

    If we settle on no less than 8 SLOB sessions, you should find that these parameters are fit for 2s8c16t. I push 189,435 db file sequential read/s at 320us on my kit with 64 sessions.

    Let us know how it goes.

    db_name = SLOB
    compatible = 11.2.0.2
    UNDO_MANAGEMENT=AUTO
    db_block_size = 8192
    db_files = 2000
    processes = 500
    shared_pool_size = 1500M
    db_recycle_cache_size=500M
    db_cache_size=16G
    filesystemio_options=setall
    parallel_max_servers=0
    _db_block_prefetch_limit=0
    _db_block_prefetch_quota=0
    _db_file_noncontig_mblock_read_count=0
    log_buffer=268435456
    pga_aggregate_target=8G
    _disk_sector_size_override=TRUE
    resource_manager_plan=''
    db_block_checksum = FALSE
    db_block_checking = FALSE
    disk_asynch_io=TRUE
    job_queue_processes = 0
    aq_tm_processes = 0
    audit_trail = FALSE
    result_cache_max_size = 0
    use_large_pages=ONLY
    undo_tablespace=myundo

    • flashdba says:

      Thanks Kevin – it’s not in your post here but you mentioned to me recently that SLOB has around an 80MB user footprint, so ramping up those users will quickly get us outside the boundary of the recycle cache here.
      I don’t have the 3000 array to hand right now but I do have a 6616 so I will run SLOB and make some new posts. What I’d really like to do is connect up a few arrays against one of my lab servers, but right now that’s not an option.
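To put rough numbers on that, here is a back-of-envelope sketch. It assumes the "around 80MB" per-user footprint quoted above is accurate, that each session's data is cached entirely in the recycle pool, and that the 500M setting can be treated as 500 MB; the function names are made up for illustration.

```python
RECYCLE_CACHE_MB = 500   # db_recycle_cache_size=500M from the parameter list
USER_FOOTPRINT_MB = 80   # approximate per-user SLOB footprint quoted above


def working_set_mb(sessions, footprint_mb=USER_FOOTPRINT_MB):
    """Combined footprint of all SLOB sessions, in MB."""
    return sessions * footprint_mb


def sessions_that_fit(cache_mb=RECYCLE_CACHE_MB, footprint_mb=USER_FOOTPRINT_MB):
    """Largest whole number of sessions whose combined footprint fits in the cache."""
    return cache_mb // footprint_mb


print(f"Sessions that fit in the recycle cache: {sessions_that_fit()}")
for sessions in (1, 4, 8, 16, 32, 64):
    total = working_set_mb(sessions)
    verdict = "fits" if total <= RECYCLE_CACHE_MB else "exceeds the cache"
    print(f"{sessions:>2} sessions: {total:>5} MB ({verdict})")
```

Under these assumptions only six sessions fit, so any run with 8 or more sessions would already have a working set larger than the recycle cache.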
