parallelism – flashdba

Guest Post

Nate Fuzi is my friend and colleague, based out in the US fulfilling the same role that I perform in EMEA. He is also the person with whom I have drunk more sake jello shots than I ever thought probable / sensible / acceptable. Nate recently wrote this note regarding the use of Oracle’s Automatic Degree of Parallelism with Violin Memory flash storage, and I liked it so much that I asked him if I could re-blog it for the Internet community. I suspect I will have to offer him some sort of sake jello shot-based payment, but I am prepared to suffer this so that you, the reader, do not have to. My pain is your gain. Over to you Nate…

Who among us is not a fan of Auto DOP (Automatic Degree of Parallelism) in Oracle 11gR2? This easy button was supposed to take all the stress out of handling parallelism inside the database: no more setting non-default degrees on tables, no need to put parallel hints in SQL, etc. According to so many blogs, all you had to do was set PARALLEL_DEGREE_POLICY to AUTO, and the parallelism fairy sprinkled her dust in just the right places to make your multi-threaded dreams come true.

[If you don’t care to read ALL about my pain and suffering, skip down to #MEAT]

But she vexed me time and again, in multiple POCs where high core count systems failed to launch more than a couple of oracle processes at a time and where I finally resorted to the old ways to achieve what I knew the Violin array was capable of delivering in terms of IOPS and bandwidth, but more importantly application elapsed times. And so it was with this customer, a POC where we had to deliver a new platform for their QA system consisting of a large 80-core x86 server and a single 3000 series Violin Memory flash memory array. Full table scans (single-threaded) could drive the array over 1300 MB/s, so the basic plumbing seemed to be in order. Yet once a copy of their production database had been loaded onto the array and the standard application reports run against it, the new system actually yielded worse results than production, an older Solaris box with fewer processors, less RAM (and less SGA/PGA for Oracle), and a spinning-disk SAN behind it. What the…? In the mix also were lots of small and not-so-small differences: parameter values derived from physical differences such as the core count, the processes setting, the size of the PGA, etc.; the fact that they wanted to test 11.2.0.3 as part of the overall testing (production was still on 11.2.0.1); “system stats” being a term their DBAs had never heard; all kinds of differences in object stats; the list went on. So the new system isn’t keeping up with prod, let alone beating it, you say? Where do I start?

I arrived on site Monday morning with 2 days scheduled to determine the problem and get this engine cranking the kind of horsepower we knew it could. The owner of the POC made it clear his concern was getting maximum performance out of the rig without touching the application code. Standard reports were showing spotty improvements: some were down to 5 minutes from 17 minutes, others were unchanged at 12 minutes, and still others had worsened from 18 minutes to 22 minutes. A batch job that ran 2 hours in production twice a day (more times would be lovely, of course) had run out of TEMP space after 5 hours on the test system. Clearly something was afoot. He had already tried everything I could suggest over email: turning OPTIMIZER_FEATURES_ENABLE back to 11.2.0.1, gathering object stats, gathering workload system stats, trying auto DOP, enabling and then disabling hyper-threading, and more. The only consistent result was an increase in his frustration level.

I asked how he wanted to run this rig, assuming there were no inhibitors. He wanted most of the memory allocated to Oracle, taking advantage of 11.2.0.3’s features and fixes. OK, then let’s set those and start diagnosing from there; no sense fixing a system running in a config you don’t want—especially since efforts to make the test system look and act like prod only with faster storage had all failed. A quick run through of the report test battery showed results similar to what they’d seen before. We broke it down to the smallest granule we could: run a single report, see the SQL it generates on the test system, compare its explain plan there to what production would do with it. This being my first run-in with MicroStrategy reports, I had the fun of discovering every report run generates a “temporary” table with a unique name, inserts its result set into that table, and then returns that result set to the report server. Good luck trending performance of a single SQL ID while making your tuning changes. Well, let’s just compare the SELECT parts for each report’s CTAS and subsequent INSERTs then. What we saw was that the test system was consistently doing more work—a lot more work—to satisfy the same query. More buffer gets, more looping, more complicated plans. Worse, watching it run from the top utility, one oracle process had a single processor at about a trot and was driving unimpressive IO. Here we have this fleet of Porsches to throw at the problem, and we’re leaving all but one parked—and that one we’re driving like we feel guilty for not buying a Prius.

New object stats, workload system stats, optimizer features, hyper-threading—all make negligible difference. Histogram bucket counts and values are too close to be a factor. Let’s try hinting some parallelism. BANG. 2 minutes goes down to 15 seconds. Awesome. But we can’t make changes to the code, and certainly not when the SQL is generated by the report server each time. This also means no SQL profiles or baselines. And setting degrees on their tables might open floodgates I don’t want to open. Plus I have DOP set to AUTO. Why isn’t that thing doing anything when it clearly helps the SQL to run in parallel?! Enter the Google.

#MEAT

Ah, the rave reviews of auto DOP. Oh, the ease with which it operates. My, the results you’ll get. But we aren’t getting it. Auto DOP isn’t doing anything for us. How can we make it see parallel execution as a more viable option? Turns out there are several knobs for turning “auto” parallelism up or down, and I was unaware of how all of them combine.
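To see where these knobs currently sit on a given instance, a query along the following lines can help. It is only a sketch: it assumes a session with SELECT access to V$PARAMETER, and the list of names is an illustrative selection, not something prescribed by Oracle or by the support note discussed below. PARALLEL_MIN_TIME_THRESHOLD and PARALLEL_SERVERS_TARGET are not part of this story, but they are included because they also feed into Auto DOP decisions.

```sql
-- Sketch: list the initialization parameters that influence Auto DOP.
-- Assumes SELECT privilege on V$PARAMETER; the IN-list is illustrative,
-- not exhaustive.
SELECT name, value, isdefault
FROM   v$parameter
WHERE  name IN ('parallel_degree_policy',
                'parallel_degree_limit',
                'parallel_min_time_threshold',
                'parallel_max_servers',
                'parallel_servers_target',
                'parallel_threads_per_cpu',
                'cpu_count',
                'pga_aggregate_target')
ORDER  BY name;
```

The ISDEFAULT column makes it easy to spot which values were derived by the instance and which were set explicitly, which matters when comparing two systems with different core counts and memory sizes.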

Sure, there’s your PARALLEL_DEGREE_LIMIT, possibly affected by your core count and PARALLEL_THREADS_PER_CPU, and of course your options for PARALLEL_DEGREE_POLICY, plus PARALLEL_MAX_SERVERS, which is derived from some unspecified mix of CPU_COUNT, PARALLEL_THREADS_PER_CPU and PGA_AGGREGATE_TARGET. But have you run DBMS_RESOURCE_MANAGER.CALIBRATE_IO? Have a look at Automatic Degree of Parallelism in 11.2.0.2 [ID 1269321.1] to see if you might be limiting yourself parallel-wise because the database has no idea what your IO subsystem is capable of. Note that it does not mention workload system stats in this context. I couldn’t find verification that these play no part in DOP, but the note seems to suggest the values in DBA_RSRC_IO_CALIBRATE play a much more significant role now. Following a link about a bug (10180307), I learned that in 11.2.0.2 and below the CALIBRATE_IO procedure could produce unpredictable results. But more important was the comment that “The per process maximum throughput (MAX_PMBPS) value [might be] too large, resulting in a low DOP while running AutoDOP.” We confirmed this by allowing CALIBRATE_IO to run for about 10 minutes and checking the results. What we saw was 31K IOPS, 328 MB/s total, 334 MB/s per process, and a latency of 0. Interesting numbers, but the explain plan still said the computed DOP was 1. So what does Oracle suggest you do if you don’t like the parallelism? Cheat and set the values manually. According to the same note, start with:

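-- Connect as SYS: resource_io_calibrate$ is a SYS-owned dictionary table.
delete from resource_io_calibrate$;
-- The fifth value (200) is the per-process maximum throughput in MB/s (MAX_PMBPS).
insert into resource_io_calibrate$ values(current_timestamp, current_timestamp, 0, 0, 200, 0, 0);
commit;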

This tells the database a single process can drive at most 200 MB/s from the storage system. If you want more parallelism, tell Oracle each process drives less IO, and the database suggests creating more processes to go after that data. I had already seen a single process driving the array to maximum bandwidth, so ~330 MB/s seemed low, yet it was still high enough to hinder our parallelism efforts. Even the 200 MB/s setting drove no parallelism in our test. We dropped the value to 50 MB/s and finally parallelism picked itself up off the floor, suggesting a degree of 8. We cranked that puppy down to 5 MB/s and suddenly Oracle wanted to throw all 80 cores at our little query. Booya. We fell back to 50 MB/s, and ran the tests again. We hit a record time of 49s on a report that took 21 minutes in prod. 28 minutes went to 59s for another report. 22 minutes went to 44s. Not everything was <1m. Some only went from 17m to 5m, but this was still enough to make the testers ask if something was wrong. And the 2 hour batch job that wasn’t finishing in 5 hours completed in 28 minutes.
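For anyone repeating this kind of experiment, it helps to check after each change both what the database believes about the IO subsystem and what degree the optimizer then computes. The following is a minimal sketch of that loop, not the exact statements used in this POC: SALES_FACT is a hypothetical table name, and the query is a stand-in for whatever representative SQL you are tuning.

```sql
-- Sketch: confirm the calibration values the database currently holds.
-- DBA_RSRC_IO_CALIBRATE is the documented view over the calibration results;
-- MAX_PMBPS is the per-process throughput figure discussed above.
SELECT start_time, max_iops, max_mbps, max_pmbps, latency
FROM   dba_rsrc_io_calibrate;

-- Re-check the degree the optimizer computes for a representative query.
-- SALES_FACT is a hypothetical table used only for illustration.
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM sales_fact;

-- With PARALLEL_DEGREE_POLICY = AUTO, the Note section at the bottom of the
-- plan output reports the computed degree of parallelism.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the computed degree does not move after a manual change to the calibration table, it is worth re-reading the support note to see whether an instance restart is required before the new values take effect. That detail is not covered in the account above, so treat it as something to verify, not a given.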

Maybe you already knew all this, and you’ve loved auto DOP long time. But if you didn’t, I thought I would share my experience in hopes you won’t lose as much time or hair on it. I find potential customers frequently ask whether tuning is required with a flash memory solution, and my honest answer is “sometimes”. This was one of those sometimes.

Addendum: Here are the final results from the tuning by Nate – with the report names removed to protect confidentiality: