ASM Rebalance Too Slow? 3 Tips To Improve Rebalance Times

see-saw

I’ve run into a few customers recently who have had problems with their ASM rebalance operations running too slowly. Surprisingly, there were some simple concepts being overlooked – and once these were understood, the rebalance times were dramatically improved. For that reason, I’m documenting the solutions here… I hope that somebody, somewhere benefits…

1. Don’t Overbalance

Every time you run an ALTER DISKGROUP <NAME> REBALANCE operation you initiate a large amount of I/O workload as Oracle ASM works to evenly stripe data across all available ASM disks (i.e. LUNs). The most common cause of rebalance operations running slowly that I see (and I’m constantly surprised how much I see this) is to overbalance, i.e. cause ASM to perform more I/O than is necessary.

It almost always goes like this. The customer wants to migrate some data from one set of ASM disks to another, so they first add the new disks:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
rebalance power 11 wait;

Then they drop the old disks like this:

alter diskgroup data
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

Well guess what? That causes double the amount of I/O that is actually necessary to migrate, because Oracle evenly stripes across all disks and then has to rebalance a second time once the original disks are dropped.

This is how it should be done – in one single operation:

alter diskgroup data
add disk  'ORCL:NEWDATA1','ORCL:NEWDATA2','ORCL:NEWDATA3','ORCL:NEWDATA4',
          'ORCL:NEWDATA5','ORCL:NEWDATA6','ORCL:NEWDATA7','ORCL:NEWDATA8'
drop disk 'DATA1','DATA2','DATA3','DATA4',
          'DATA5','DATA6','DATA7','DATA8'
rebalance power 11 wait;

A customer of mine tried this earlier this week and reported back that their ASM rebalance time had reduced by a factor of five!

By the way, the WAIT command means the cursor doesn’t return until the command is finished. To have the command essentially run in the background you can simply change this to NOWAIT. Also, you could run the ADD and DROP commands separately if you used a POWER LIMIT of zero for the first command, as this would pause the rebalance and then the second command would kick it off.

2. Power Limit Goes Up To 1024

Simple one this, but easily forgotten. From the early days of ASM, the maximum power limit for rebalance operations was 11. See here if you don’t know why.

From 11.2.0.2, if the COMPATIBLE.ASM disk group attribute is set to 11.2.0.2 or higher the limit is now 1024. That means 11 really isn’t going to cut it anymore. If you are asking for full power, make sure you know what number that is.

3. Avoid The Compact Phase (for Flash Storage Systems)

An ASM rebalance operation comprises three phases, where the third one is the compact phase. This attempts to move data as close as possible to the outer tracks of the disks ASM is using.

Did you spot the issue there? Disks. This I/O-heavy phase is completely pointless on a flash system, where I/O is served evenly from any logical address within a LUN.

You can therefore avoid that potentially-massive I/O hit by disabling the compact phase, using the underscore parameter _DISABLE_REBALANCE_COMPACT=TRUE. Remember that you need to get Oracle Support’s permission before setting underscore parameters! Point your SR in the direction of the following My Oracle Support note:

What is ASM rebalance compact Phase and how it can be disabled (Doc ID 1902001.1)

If you want to know more about this (for example, what the first two rebalance phases are), or indeed anything about ASM in general, I highly recommend the legendary ASM blogger that is Bane Radulovic a.k.a. ASM Support Guy.

Conclusion

An ASM rebalance potentially creates a lot of I/O, which means you may need to wait for a long time before it finishes. For that reason, make sure you understand what you are doing and make every effort to perform only as much I/O as you actually need. Don’t forget you can use the EXPLAIN WORK command to gauge in advance how much work is required.

Happy rebalancing!

Implementing Linux native multipathing or DM-MPIO together with EMC PowerPath

puzzle

Guest Post

I’m delighted to say that this is another guest post from my good friend Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Because he is American, Nate thinks that scones are called “biscuits”, that chips are called “fries” and that there is nothing – *nothing* – that cannot be improved with the simple addition of bacon. Clearly, something is fundamentally wrong with him – and yet he is like a brother to me. Like the strange, American step-brother I only see a few times a year and whom I cannot understand without the use of a translator. But he’s family all the same. So over to you Nate… and remember: Mom loved me more.

Remember when your parent answering your whiny “but whyyyyyyy???” with “Because I said so” was something you just had to accept? It meant there was no more explanation coming, and it was time for you to move on. Over the years, that answer broke down, and you grew confident you were owed more. And parents agreed: the more you demonstrated your ability to reason, the more reason you got to help you over the denial. It’s a sign of respect that we pay each other in adult life. And it can feel like disrespect if the reason offered feels weak or like it is intended to discourage further inquiry.

I was recently faced with solving what seemed a straightforward problem: take an existing Linux server running EMC’s multipathing software, PowerPath or “PP” as I will refer to it here, to access LUNs presented from that company’s SAN product, the VNX array, and attach and run Violin storage alongside the VNX. PP didn’t then support Violin arrays (still doesn’t at the time of this writing), so what was the client to do when they wanted to try out Violin’s AFA for their database environments? Just run PP and native Linux multipathing, called DM-MPIO, side by side, letting PP manage the VNX LUNs and DM-MPIO manage I/Os bound for Violin, right?

PowerPath versus DM-MPIO

PowerPath versus DM-MPIO

Wrong. Won’t work, I read. PowerPath does something at the HBA layer, I read a seemingly helpful web poster explain, that will corrupt either the VNX data or the Violin data. Well… maybe it will work, suggested another poster, but EMC might not support customers running in such a configuration. Others suggested ominously that PP and DM-MPIO don’t work well together… leaving it to the reader’s imagination what might result. I’m no master Googler, but I couldn’t find where anyone had put aside the rumors and vaguely threatening suggestions and actually tried it. Well, I did it, and I want to write about it so others know it can be done and how to do it because, well, those explanations I read didn’t stand up to question and felt like they were meant to scare me into not trying it. Of course I had to try it! Now, let’s be clear about what I am and am not saying: I am saying I have done this and it works. It’s in production at a customer site, running for months without issue. I am not saying that I have spoken to your EMC support rep and that you’ve been green-lighted to do this in your production environment. I’m not an EMC customer, and I don’t have a buddy in EMC support. So let’s consider this for informational purposes only for the time being.

First off, as several folks rightly pointed out, DM-MPIO could easily manage LUNs from both SAN products. Drop PP, configure DM-MPIO, and done. Well, that just sounds too simple. But it’s true: DM-MPIO has come a long way over the last few years and offers a pretty good set of features for free. PP costs money but is not without added value, as it does have additional configurability for reserve paths that become active in the event of a failure scenario, as well as IO distribution models beyond those offered by DM-MPIO, for example. My customer wanted to keep running PP, so this option was off the table for me.

Next up is the fun fact that PP advises you upon installation that you should “Blacklist all devices in /etc/multipath.conf and stop multipathd service”. The installer doesn’t say what will happen if you don’t do this, only that it is “*** IMPORTANT ***”. Check. Easy enough to ignore if this is the first thing you do. But if DM-MPIO is already running on the system and you try to start PP, it tells you this (verified in 5.7 and 6.0 only):

[root@host] # /etc/init.d/PowerPath start
Starting PowerPath:                                                      [FAILED]
Aborting PowerPath start since DM-Multipath is active.
Refer to PowerPath for Linux Installation and Administration Guide for more information

That’s a bummer. You actually have to stop multipathd and flush its paths before PP will start up. OK, I can do that. And, to be sure, you do NOT want both products attempting to manage IOs for the same device at the same time. That really is a bad thing. As we’ll see shortly, we might even want to segregate traffic across different FC ports, although this is strictly for optimization, not because you can’t mix traffic. But, as soon as we’ve installed the device-mapper-multipath-* packages, let’s honor this restriction right away by blacklisting the EMC devices in /etc/multipath.conf like this:

blacklist {
       devnode "^(control|vg|ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"    # standard stuff
       devnode "^hd[a-z][0-9]*"                                          # this line too
       device {
              vendor "DGC"
              product "*"
       }
       device {
              vendor "EMC"
              product "*"
       }
}

Note that the VNX line grew up in the Clariion company later acquired by EMC and presents a vendor string of “DGC”. Don’t ask me why. [Because the Clariion was a product from Data General Corporation? — flashdba :-)] It is my understanding that VMAX arrays do present “EMC” as their vendor string. Having done this, we want to explicitly except Violin devices from getting blacklisted:

blacklist_exceptions {
       device {
              vendor "VIOLIN"
              product "*"
       }
}

This isn’t completely necessary, but it does make clear our intentions: don’t manage VNX/EMC devices but do manage Violin devices. Having both entries in the file means that adding some third storage product to the FC SAN won’t cause it to get picked up by DM-MPIO without us consciously making it so. Belt and suspenders, they used to say.

Verify your multipath configuration without actually running it. Do this by adding the “-d” flag to your multipath command:

[root@host] # multipath -v3 -d

The “-v3″ flag gives us a verbose parsing of the configuration file so we can see each device and whether, what, and why DM-MPIO is going to do that with device. Make changes ad nauseum, and once you like what you see, run the command without “-d”, and create your multipath devices.

Cool. But remember when PP refused to start up earlier, saying DM-MPIO was found running? Guess what: PP’s inexplicable method of editing your /etc/rc.d/rc.sysinit script to insert its startup lines means it doesn’t attempt to start up until after DM-MPIO gets started on reboot. (Take a look for yourself; it’s there. It also makes you manually start PP if you apply a kernel update that resets the contents of rc.sysinit, at which point it reinserts the startup lines. Sweet.) How to get around this? I’m sure there are lots of solutions. I created a script to flush existing multipaths and start up PP in /etc/init.d and linked it as /etc/rc.d/rc3.d/S86PowerPath. This makes it so PP gets called just prior to DM-MPIO, and each is happy. The later call in /etc/rc.d/rc.sysinit is then redundant but causes no harm. I suppose you could almost as easily edit the rc.sysinit script to remove the check–just remember to make the same edit if/when you update PP.

Now, what was that I said a bit ago about segregating traffic on different HBA ports? This is not required; no magic is happening on the HBA with either product. Each one will discover the devices it is concerned with via its own callout routine and handle that device how you configure it to. But let’s imagine you have 4 FC ports on your host and choose to allow PP and DM-MPIO to each manage devices across all those ports. Neither will be aware what the other is doing in terms of trying to optimize IO distribution across all paths available, and you could well end up shooting yourself in the foot with sub-optimal end results. Segregating traffic also allows you to set different HBA queue depths or optimization settings as recommended by each storage vendor, and we all want to comply with best practices, right?!

Conclusion

None of this is meant to disparage EMC. Well, OK: the part about having the PowerPath startup script insert lines into /etc/rc.d/rc.sysinit is meant to disparage. I think that’s archaic and clunky. I have to believe there’s a more elegant way to do that today. I do hope I save some other soul the frustration I went through determining if this could be done and then how. If anyone has implemented more elegant solutions, I’d love to read about them.

Postcards from Storageland: Three Years At Violin

calendar

A few weeks ago, in what seems to be a truly modern phenomenon, I became aware that it was my third anniversary of joining Violin after I noticed a number of people congratulating me on LinkedIn. In many ways it feels like I’ve already been here for a lifetime, but it was only twelve months ago I was trying to think of a suitable flash-based pun for the title of an article just like this one. This year I opted out of the “Three years in a flash” headline, it seemed a bit too lame. Those NAND-based puns were only ever a flash in the pan.*

So what’s happened in the three years since I joined Violin? Well, quite a lot. When I signed up in early 2012 Violin was pioneering the flash array industry – and when I say pioneering I mean that, unlike in today’s crowded AFA market, it was a pretty lonely place. The only other all-flash array vendor with a presence was Texas Memory Systems (TMS), but they had seemingly gone into hibernation in the markets I had exposure to (as it turned out they were looking for a buyer, which they found in the form of IBM).

I was one of the first employees in EMEA, part of a business which was rapidly expanding due to a global reseller agreement with HP for our 3000 series array. The main enemy was the status quo – monolithic disk arrays from EMC, IBM, HP, HDS etc, perhaps with a smattering of SSDs to try and alleviate the terrible performance of random I/O. With the 3000 on HP’s price list and no real competition to worry about it seemed like the world was there for the taking. Time to pay of the mortgage.

Were we overconfident? Guilty of hubris, perhaps? We must have alienated a few people in the industry because I know not everyone felt sympathy for what happened next.

Pride Cometh Before A Fall

With hindsight, the $2.35 billion that HP paid for 3PAR meant it was unlikely to continue using Violin as a strategic product. HP may have a history of write downs, but it simply couldn’t justify OEMing the new 6000 series array with 3PAR still on the books so… it didn’t.

Meanwhile, EMC purchased a company that hadn’t yet shipped a product, IBM did its deal with TMS, Cisco bizarrely purchased Whiptail (which now appears to be suspended as a product) and a number of SSD-based flash array startups (e.g. Pure Storage) appeared on the market.

crash-chartAll of which meant that, when Violin went to IPO, things didn’t exactly go to plan. In fact, it eventually resulted in a change of management and the introduction of a new CEO and management team who have systematically transformed the company over the last year. But at the time, it felt like a roller coaster.

So why am I reminiscing about the bad times? Partly because I don’t want to gloss over the past, but also because I genuinely think that Violin has had to do a lot of growing up in the last year or so – and that’s a good thing. When I look at other flash vendors throwing FUD at each other, getting into legal disputes over employees or burning bridges with their channel partners to try and get their pre-IPO books look more attractive… I can’t help a wry smile. Youth, eh? Some people still have harsh lessons to learn.

From Niche to Platform

This year, on the third anniversary of my joining Violin, we announced an important new product – the 7000 series Flash Storage Platform. Until the FSP, Violin had generally competed in the niche performance-optimized market – what some people call Tier 0 – where the single most important attribute is… well, performance (think database workloads). We’ve been pretty successful there, mainly because the 6000 series was (and still is) unbelievably fast, but also partly because much of the competition competes lower down in the capacity-optimized market (where price per GB is key – think VDI workloads). But we also attracted a surprising amount of criticism for the lack of certain Data Services features, such as deduplication (a feature that I’ve never coveted for database workloads).

But with the Flash Storage Platform, Violin – and flash in general – is moving into a new, larger and much more demanding market: Tier 1 primary storage. This is the big playground where all the major disk array vendors are desperately trying to stem the losses from their legacy SAN products. flash-market-venn-diagramIt’s also a market which is nearly 15 times larger than the one we used to operate in. And most importantly, it’s the one where you need to be able to deliver on all three requirements of the Primary Storage Trinity:

  • Performance (high IOPS and low latency)
  • Data Services (lots of features, fully integrated)
  • Capacity Optimization (low $/GB price)

By complete coincidence, this product launch also coincides with the end of the Understanding Flash section of my blog series on Storage for DBAs (when I started the flashdba blog it was aimed at database administrators, but over time the intended audience has expanded to anyone with an interest in flash storage).

With that in mind, in the next set of posts I’ll be turning my attention to the concepts and architecture of All Flash Arrays. What defines an AFA? What needs to be considered when designing one? And why doesn’t it make sense to stuff a load of SSDs into an existing disk array in the hope that it will deliver the performance of All Flash?

This is a really exciting time to be working in the storage industry – there’s lots to do and a massive opportunity to embrace. Because of this, the blog posts haven’t been coming as quickly as I’d hoped. But I still have much I want to talk about… so don’t worry, the next one will be back in a flash.**

* I really will stop making flash-based puns now

** Apart from this one

Understanding Flash: Fabrication, Shrinkage and the Next Big Thing

Semiconductor Fabrication Plant (picture courtesy of SemiWiki.com)

Semiconductor Fabrication Plant (picture courtesy of SemiWiki.com)

Before I draw this series on Understanding Flash to a close, I wanted to briefly touch on the subject of manufacturing. Don’t worry, I’ve taken heed of the kind feedback I had after my floating gate transistor blog post (“Please stop talking about electrons!“) and will instead focus on the commercial aspects, because ultimately they affect the price you will be paying for your flash-based storage. If you’re not familiar with the way things are you might be surprised…

The Supplier Landscape

Semiconductor chips, such as the NAND flash packages found in your storage, phones and tablets, are manufactured in fabrication plants known as “fabs” or flash “foundries”. To say that fabs are not cheap to build would be somewhat of an understatement – they are mind-bogglingly, ludicrously expensive.

In 2014 the global semiconductor industry posted record sales totalling $335.8 billion. That’s the entire semiconductor industry, not just the subset that produces NAND flash… and I think you’ll agree that’s quite a lot of money. But to put that figure into perspective, when Samsung decided to build an entirely new fab in late 2014, it had to commit $15 billion for a project that won’t be completed “until mid-2017″.

Clearly a fab is an eye-watering investment – and it is mainly for this reason that there are (at the time of writing) only six key companies worldwide who run flash foundries. What’s more, because of that staggering cost, four of those six are working together in pairs to share the investment burden. The four teams are:

  • Toshiba and SanDisk
  • Samsung
  • Intel and Micron
  • SK Hynix

With only four sets of fabs in operation, the market is hardly awash with an abundance of NAND flash – which of course suits the fab operators just fine as it keeps the price of flash higher. Oversupply would be unsustainable for an industry with such high costs.

Process Shrinking

Talking of costs, the fab operators are never allowed to stand still because of the relentless drive to make things smaller – Moore’s Law doesn’t just apply to processors; NAND flash is a semiconductor too. The act of taking the design of a microchip and reducing it in size is known as a process shrink – and it brings all sorts of benefits. Remember that a NAND flash cell is basically constructed from floating gate transistors? Well if those transistors are reduced in size:

  • Less current is needed per transistor, reducing the overall power consumption of the flash
  • This in turn reduces the heat output of an identical design with the same clock frequency (or alternatively the clock frequency can be increased)
  • Most importantly, more flash dies can be produced from the same silicon wafers (the raw material), resulting in either reduced costs or increased density

It may sound like smaller is better, but as always there is a flip side. Retooling a flash foundry to move to a smaller process geometry takes time and money – and the return on investment for the fab is reduced with each shrink. But we’ll come back to that in a minute.

Process Geometries

The different sizes in which flash is fabricated are known as process geometries. Traditionally, they are defined by measuring the distance (in nanometers) between the source and drain of each transistor (although these days this practice has become much less well-defined). In a strange quirk of algebra, the two digit numbers are written as <digit><letter> so that, for example, the first geometry to be in the range 29-20nm (e.g. 25nm) is called 2X, then a second in that range (e.g. 21nm) is called 2Y. Once into the teens (e.g. 19nm) we go back to 1X. [Although interestingly, Toshiba and SanDisk have a second generation NAND which they call 1Y to distinguish it from first-generation 1X even though both are 19nm]

Here’s the current technology roadmap for NAND flash courtesy of my friends at TechInsights:

Technology Roadmap for NAND Flash (Image courtesy of TechInsights)

Technology Roadmap for NAND Flash (Image courtesy of TechInsights)

So what’s the problem with flash getting smaller? Well, hopefully for the very last time, cast your mind back to the concept of floating gate transistors. On these tiny devices the floating gates effectively capture and store electrons that are passing from one side of the transistor to another, retaining charge – and that charge determines the cell’s value of 0 or 1. But at the tiny geometries that flash is approaching now, the number of electrons involved is desperately small, resulting in large margins for error. Electrons, after all, are notoriously good at hide and seek.

Floating Gate Limitations (SK Hynix Presentation - Aug 2012)

Floating Gate Limitations (SK Hynix Presentation – Aug 2012)

This means that flash cells are less reliable, that even more error correction codes are required… and that ultimately, we are reaching what is known as the scaling limit. It looks like the end is it sight for flash as we know it.

3D NAND Flash

Except it isn’t. If you look again at the Technology Roadmap picture you’ll see entries appearing alongside each foundry operator for 3D NAND. This is a significant new fabrication process that looks like it will be the direction taken by the flash industry for the foreseeable future. I would love to attempt a detailed explanation of 3D NAND design here, but a) it’s beyond me, b) I promised to listen to the feedback after the last time I went sub-atomic, and c) Jim Handy (AKA The Memory Guy) already has a fantastic set of articles all about it.

The point is, 3D NAND maps out a future for NAND flash beyond the scaling limits of 2D planar NAND – and the big fab operators have already invested. That puts 3D NAND in pole position as we increasingly look to non-volatile memory to store our data.

But there are other technologies trying to compete…

The Next Big Thing

The holy grail of memory technology is universal memory. This is a hypothetical technology which brings all the benefits, but none of the drawbacks, of the multitude of memory technologies in use today: SRAM, DRAM, NAND Flash and so on. SRAM (often used on-chip as a CPU cache) is fast but expensive, DRAM is cheaper but must be constantly refreshed (using considerably more power) while NAND flash is comparatively slower and wears out with use. If only there were some new technology that met all our requirements?

Well, first of all, it’s worth mentioning that if universal memory did come about it would have far-reaching consequences on the way we design and build computers… but all the same, there are various technologies in R&D right now which make claims to be the NVM solution of the future: PCM, MRAM, RRAM, FeRAM, Memristor and so on. But these – and any other – technologies all face the same challenge: that massive initial investment required to go from prototype to production. In other words, the cost and time associated with building a fabrication plant.

R&D Centre for SHRAM

R&D Centre for SHRAM

Say you converted your shed into a clean room and invented a wonderful new memory technology called Shed-RAM (SHRAM). SHRAM is fast, cheap, dense and only uses a tiny amount of power. You’ve still got to convince somebody to splurge billions of dollars on a foundry before it can be productised. And, as I’m sure you’ll agree, that’s a pretty big bet on an untested technology – especially if it takes years to build. What happens if, one year in, your neighbour (who recently converted his garage into a clean room) invents Garage-RAM (GaRAM) which makes your SHRAM worthless? Nobody is going to cope well with the loss of billions of dollars on a bad bet.

Conclusion

The investment hurdle required to turn a new NVM technology into a product is both a challenge and a stabiliser to the NVM industry. In theory it could stifle potential new technologies, but at the same time (at least for those of us in enterprise storage) it means there is plenty of warning when the tide turns: you are likely to see any successful new tech in your phone before it appears in your data centre.

Now, who can lend me $15 billion to build a new fab? I can’t afford anything more than 0% interest, but I can offer a lifetime’s supply of USB sticks to sweeten the deal…

Understanding Flash: Floating Gates and Wear

OLYMPUS DIGITAL CAMERA

One of the important characteristics of flash memory is wear. We know from previous articles in this series that flash packages consist of dies, which contain planes, which contain blocks, which in turn contain pages. We also know that these pages contain individual cells which store the bits of data… but to understand what wear is we need to look a little bit closer at those cells, where we will find something called a floating gate transistor.

Now don’t run away. Electronics might not be your thing, but I’m won’t be getting too deep. [After all, I once attempted degree-level education in Electronic Engineering but had to move to another course because I kept burning my fingers on the soldering irons during practical laboratory sessions…] If you really don’t want to think about transistors you can just skip to the section called Wear, or ignore this blog post entirely and spend a few minutes looking at picture of cats instead.

Field Effect Transistors

If you are reading this blog on a computer or a phone (and how else could you be reading it?) you owe a debt of gratitude to a humble device called the metal oxide semiconductor field effect transistor, or MOSFET. These tiny devices revolutionised the world and are considered by some to be the most important invention of the 20th century.

Students being stimulated to go to a bar and drink alcohol

Students being stimulated to go to a bar and drink alcohol

They therefore deserve to be explained in a serious and respectful manner, but sadly I don’t have time for that so I’m going to resort to one of my silly analogies. Again.

Imagine you have a house full of students, which we’ll call the source. Two doors down the road you have a nightclub full of free alcohol, which we’ll call the drain. However, it’s freezing cold outside and the students won’t venture out of the door, meaning there is no flow of students from source to drain. Now let’s set up a banging sound system behind the wall of the property in between. If we play music loud enough through the wall then we can excite the students into action, causing them to run down the road and enter the nightclub, which creates the flow. The loud music (which we’re going to call the gate) is the stimulus which causes this flow, essentially acting as a switch, while the volume of the music which first drives the students into action is called the threshold. The beauty of this design is that we can control the flow of students from behind the wall, therefore avoiding having to come into direct contact with them. Phew.

I really cannot tell you how bad that analogy is, but I’m afraid it’s only going to get worse later on. If you don’t know how a transistor works I implore you to watch this short video from the excellent folk at Veritasium, which describes it far better than I ever intend to.

Metal Oxide Semiconductor Field Effect Transistor (MOSFET)

Metal Oxide Semiconductor Field Effect Transistor (MOSFET)

In a MOSFET, electrons (students!) can be allowed to flow from the source terminal to the drain terminal by the application of charge to the gate terminal. This charge creates an electric field which alters the behaviour of the silicon layers (the pinkish parts of the diagram) and thus controls the flow. The important bit here is the yellow rectangle which represents an insulating layer (commonly known as the oxide layer). This means the gate is never physically connected to either the source or drain terminals – and you’ll see why this is important as we turn our attention now to something called the floating grate transistor, or FGMOS.

There are many things I haven’t explained in that clumsy analogy, but this is not an electronics blog. Of specific interest is the way that the two types of silicon (n and p) are doped in order to create free electrons and electron holes. Seriously, watch the video. If you haven’t ever learnt how a semiconductor works and my student party analogy is the nearest you get, it would be a crime. Although that’s not going to stop me from using it again in the next section…

Floating Gate Transistors

Floating Gate MOSFET (FGMOS)

Floating Gate MOSFET (FGMOS)

The diagram on the right (labelled “FGMOS”) is of a Floating Gate MOSFET, which is essentially what you will find in a flash memory cell. If you play spot the difference with the previous diagram above (the one labelled “MOSFET”) you’ll see that there are now two gates, one above the yellow oxide layer as before but the other entirely surrounded by it. This second gate is known as the floating gate because it is completely electrically isolated.

Notice also that the oxide layer beneath the floating gate is deliberately thinner than that above it in the diagram. Now, back to my analogy…

strip-doorsThe students are still there, as is the nightclub. The property in between is still there, but this time we have replaced the brick wall at the front with one of those sets of PVC strip curtains that you sometimes find covering the doors of factories, or butchers shops to keep insects out (or even lining the edge of the cold aisle in a data centre). Inside, we’ll put some comfy chairs and maybe a beer fridge. This is now our floating gate party room and the PVC strips are our thinner section of oxide layer.

Excited students going to a nightclub are stimulated so that some fall into the "Floating Gate" room

Excited students going to a nightclub are stimulated so that some fall into the “Floating Gate” room

As we now play the music loud enough to exceed the threshold, the excited students will run up the road towards the bar as before, but this time – with enough excitement – some of them will pass through the divider into our floating gate room and remain there, thus trapping electrons (sorry, I mean students) on the floating gate. Meanwhile, as the floating gate room fills up with people, the sound of the music becomes more muffled which slows down the flow of students in the road outside. To put it another way, the volume threshold at which stimulation will occur rises if there are students (sorry, I mean electrons) on the floating gate.

In a FGMOS, if a high charge is applied to the control gate in the same manner as with a MOSFET, electrons flowing from source to drain can get excited and “jump” through the oxide layer into the floating gate, increasing its retained charge. This is the program operation we have talked about so many times before: the floating gate is the “bucket of electrons” from my previous posts (a classic case of mixed metaphors). To erase the charge stored on the floating gate, a high voltage is applied across the source and drain while a negative voltage is applied to the control gate, causing the retained electrons to “jump” back off the floating gate (through the oxide layer). I’m putting the word “jump” in inverted commas there because it’s slightly more complicated and usually involves a process called Fowler–Nordheim tunnelling. I’ll explain everything I understand about that in the next paragraph.

[This paragraph intentionally left blank]

Yeah, it’s complicated, it involves quantum mechanics and it’s way over my head. I’m just taking it for granted that it works.

Read Operations

Now that we have methods for programming and erasing we just need a way of testing the value stored: a read operation. When we were talking about the MOSFET in the previous section we could control the flow of charge (which I had better start calling current) between source and drain by varying the voltage applied to the gate.

FGMOS Read Thresholds: The voltage threshold at which current begins to flow from drain to source is different depending on the charge stored on the floating gate. By testing at an intermediate reference voltage (VtREF) called the "read point" we can determine whether the floating gate contains charge (which we call ZERO) or not (which we call ONE)

FGMOS Read Thresholds: The voltage threshold at which current begins to flow from drain to source is different depending on the charge stored on the floating gate. By testing at an intermediate reference voltage (VtREF) called the “read point” we can determine whether the floating gate contains charge (which we call ZERO) or not (which we call ONE)

In the FGMOS this can be turned around so that by measuring the current we can determine the voltage on the floating gate, because electrons trapped on the floating gate cause the threshold we’ve previously mentioned to move. By applying a certain voltage (VtREF in the diagram on the right) across the source and drain and then testing the current we can determine if the voltage on the gate is above or below a specific point, called the read point.

So, if we play music at a certain threshold volume when the floating gate party room is empty, the sound travels far enough to stimulate a flow of students to the nightclub. But if the room is full (the floating gate contains charge) this specific volume will not stimulate a flow. Instead, it will require a louder threshold volume to get the students out of bed. And I think it’s time to abandon the student analogy now… let us never speak of it again.

Read Points for SLC and MLC

Read Points for SLC and MLC (image courtesy AnandTech)

If you remember, flash comes in different forms: SLC, MLC and TLC. So why are MLC reads slower than SLC, with TLC even slower still? Well, because SLC contains only one bit of data (two possible states: a zero or one), we only need to test one threshold voltage, i.e. SLC only has one read point. But for MLC, where there are two bits (and therefore four possible states), there are three read points… while for TLC there are even more.

Remember the bucket full of electrons analogy? When you test the SLC bucket to see if it’s above or below 50% full, the answer will tell you whether the stored value is a zero or a one. But for the MLC bucket, the answer to that test isn’t enough: based on the first answer you then need to perform a second test to see if the bucket is above or below 25% / 75% [delete as appropriate] full. All this additional testing takes time, which is why SLC reads are faster than MLC reads, which in turn are faster than TLC reads.

But what about the wear?

Flash Wear

broken-blindsFinally I’m getting to the point. Remember the PVC strip curtains, like the ones in the main picture at the top of this post? What do you think happens to them as all those excited students (sorry) hurtle through them? They get damaged. In a FGMOS the oxide layer which isolates the floating gate from the silicon substrate is designed to be thin enough to allow quantum tunnelling of electrons when a high enough charge is applied, but this process gradually damages the layer. Reads are not a problem because only lower voltages are used and no electron tunnelling takes place. But program and erase operations are a different story, which is why wear is measured by the number of program/erase cycles. As the layer gets more damaged, the isolation of the floating gate is increasingly affected and the probability of stored charge leaking out will increase.

For SLC this is less of an issue, because during a read we only need to measure at one read point – so there is a lot of room for error either side of the threshold. But for MLC, with three read points, we need to be much more exact. Thus the wear caused by using SLC, MLC and TLC isn’t really very different, it’s simply that the tolerances for error are much finer with the increased bit counts of MLC and TLC.

At some point, the oxide layer for a FGMOS will become sufficiently degraded so that it can no longer store charge properly on the floating gate. We won’t know about this until it happens, but at some point a read from the cell will no longer be trustworthy. And clearly that’s a problem for data storage, which is why (just like disk) flash stores error correction codes (ECC) alongside user data to ensure that incorrect information is spotted and dealt with while any underlying pages are marked as unusable – all without impacting users.

Comparison of SLC, MLC and EMLC (courtesy of EE Times)

Comparison of SLC, MLC and EMLC (courtesy of EE Times)

A final point to consider is that there is a (more expensive) form of MLC known as eMLC (the e stands for enterprise, with “normal” MLC then sometimes referred to as consumer or cMLC). The only difference between eMLC and the standard cMLC we have discussed in the past is that program and erase operations are “slowed down” for eMLC in order to cause less damage to the oxide layer. This gives slightly reduced performance but also significantly reduces wear, allowing up to 10x more P/E cycles. Opinion is divided on whether this is actually a worthwhile investment (not for me though, I think it’s a waste of money in the majority of cases).

Hot News

That’s the end of this post – and as usual I’ve committed two of my three regular blogging sins: writing too much, using silly analogies and finishing on a terrible pun. Time to complete the trio:

Some flash companies have been experimenting with methods of refreshing the lifetime of flash, with one avenue of exploration focussed on providing a short burst of heat to repair the damage to the oxide layer. It has been claimed that flash treated this way can sustain over 100 million P/E cycles with no noticeable degradation. If that is really the case – and this technology can be put into production – we might finally find that the Talking Heads were correct: in the world of flash memory we are on the road to no-wear…

Oracle AWR Reports: When IOStats Lie

graph

If you’ve been unfortunate enough to follow my dithering on Twitter recently you’ll know that I’ve been lurching between thinking that there is and isn’t a problem with Oracle’s tracking of I/O statistics in its AWR reports.

I’m now convinced there is a problem, but I can’t work out what causes it… so step 1 is to describe the problem here, after which step 2 will probably be to sit back and hope someone far more intelligent than me will solve it.

But first some background:

AWR I/O Statistics

I’ve written about the I/O statistics contained in Oracle AWR Reports before, so I won’t repeat myself too much other than to highlight two critical areas which we’ll focus on today: Instance Statistics and IOStat Summaries. By the way, the format of AWR reports changed in 11.2.0.4 and 12c to include a new IO Profile section, but today we’re covering reports from 11.2.0.3.

First of all, here are some sensible statistics. I’m going to show you the IOStat by Function Summary section of a report from a real life database:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg  
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Buffer Cache Re   14.9G   504.1 4.22491      0M     0.0      0M 1690.3K     0.0
Direct Reads      12.2G     3.5 3.45612      0M     0.0      0M       0     N/A
DBWR                 0M     0.0      0M    7.9G   186.6 2.25336       0     N/A
Others             4.2G    12.1 1.18110    3.6G     2.5 1.03172   41.9K     0.1
LGWR                 1M     0.0 .000277    5.8G    11.9 1.65642   17.4K     1.0
Direct Writes        0M     0.0      0M      4M     0.0 .001110       0     N/A
TOTAL:            31.2G   519.7 8.86241   17.4G   201.0 4.94263 1749.6K     0.0
          -------------------------------------------------------------

This section of the report is breaking down all I/O operations into the functions that caused them (e.g. buffer cache reads, direct path reads or writes, redo log writes by LGWR, the writing of dirty buffers by DBWR and so on). Let’s ignore that level of detail now and just focus on the TOTAL row at the bottom.

To try and make this simple to describe, I’ve gone a bit crazy with the colours. In green I’ve highlighted the labels for reads and writes – and now let’s walk through the columns:

  1. The first column is the function name, but we’re just focussing on TOTAL
  2. The second column is the total amount of reads that happened in this snapshot: 31.2GB
  3. The third column is the average number of read requests per second, i.e. 519.7 read IOPS
  4. The fourth column is the average volume of data read per second, i.e. 8.86 MB/sec read throughput
  5. The fifth column is the total amount of writes that happened in this snapshot: 17.4GB
  6. The sixth column is the average number of write requests per second, i.e. 201 write IOPS
  7. The seventh column is the average volume of data written per second, i.e. 4.94 MB/sec write throughput
  8. The eighth and ninth columns are not of interest to us here

The whole section is based on the DBA_HIST_IOSTAT_FUNCTION view. What we care about today is the IOPS figures (shown in red) and the Throughput figures (shown in blue). Pay attention to the comments in the view header which show that Data columns (including throughput) are multiples of 1024 while other columns (including IOPS) are multiples of 1000. It’s interesting that the two throughput values are obviously measured in MB/sec and yet are missing the “M” suffix – I assume this “falls off the end” of the column because of the number of decimal places displayed.

Now that we have these figures explained, let’s compare them to what we see in the Instance Activity Stats section of the same AWR report:

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read IO requests                 1,848,576          513.3         121.5
physical read bytes                  29,160,300,544    8,096,201.8   1,917,053.5
physical read total IO requests           1,892,366          525.4         124.4
physical read total bytes            33,620,594,176    9,334,578.5   2,210,281.7
physical read total multi block              17,096            4.8           1.1
physical reads                            3,559,607          988.3         234.0

physical write IO requests                  671,728          186.5          44.2
physical write bytes                  8,513,257,472    2,363,660.5     559,677.7
physical write total IO requests            723,348          200.8          47.6
physical write total bytes           18,657,198,080    5,180,071.5   1,226,559.6
physical write total multi block             26,192            7.3           1.7
physical writes                           1,039,216          288.5          68.3

For this section we care about the per Second column because both IOPS and Throughput are measured using units per second. For both reads and writes there are two sets of statistics: those with the word total in them and those without. You can find the full description of 11gR2 statistics in the documentation, but the difference between the two is best summed up by this snippet which describes physical read total bytes:

Total size in bytes of disk reads by all database instance activity including application reads, backup and recovery, and other utilities. The difference between this value and “physical read bytes” gives the total read size in bytes by non-application workload.

I’ve underlined non-application workload here because this is critical. If you merely look at the Load Profile section at the top of an AWR report you will only see values for “application workload” I/O but this does not include stuff like RMAN backups, archive logging and so on… important stuff if you care about the actual I/O workload. For this reason, we only care about the following four statistics:

  1. physical read total IO requests (per second) = the average number of read IOPS
  2. physical read total bytes (per second) = the average read throughput in bytes per second
  3. physical write total IO requests (per second) = the average number of write IOPS
  4. physical write total bytes (per second) = the average write throughput in bytes per second

Again I’ve coloured the IOPS figures in red and the throughput figures in blue.

Tying It Together

Now that we have our two sets of values, let’s just compare them to make sure they align. The IOPS figures do not require any conversion but the throughput figures do: the values in the Instance Activity Stats section are in bytes/second and we want them to be in MB/second so we need to divide by 1048576 (i.e. 1024 * 1024).

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

519.7

525.4

1.09%

Write IOPS

201.0

200.8

0.10%

Read Throughput

8.86241 MB/sec

9,334,578.5 bytes/sec

8.90215 MB/sec

0.45%

Write Throughput

4.94263 MB/sec

5,180,071.5 bytes/sec

4.94010 MB/sec

0.05%

I’ve calculated the error percentages here to see how far the figures vary. It is my assumption that the Instance Activity Stats are accurate and that any margin of error in the IOStat figures comes as a result of sampling issues. The highest error percentage we see here is just over 1%, which is hardly a problem in my opinion.

Don’t Believe The Stats

So far I have no complaints about the matching of statistics in the AWR report. But now let me introduce you to the AWR report that has been puzzling me recently:

IOStat by Function summary
-> 'Data' columns suffixed with M,G,T,P are in multiples of 1024
    other columns suffixed with K,M,G,T,P are in multiples of 1000
-> ordered by (Data Read + Write) desc

                Reads:   Reqs   Data    Writes:  Reqs   Data    Waits:    Avg
Function Name   Data    per sec per sec Data    per sec per sec Count    Tm(ms)
--------------- ------- ------- ------- ------- ------- ------- ------- -------
Direct Reads       3.8T 2.7E+04    3.2G    405M     3.6 .335473       0     N/A
Direct Writes        0M     0.0      0M  164.5G  1159.0 139.519       0     N/A
Buffer Cache Re   63.2G  3177.6 53.6086      0M     0.0      0M 3292.5K    23.4
DBWR                 0M     0.0      0M      2G    45.3 1.71464       0     N/A
LGWR                 0M     0.0      0M    485M    19.2 .401739   19.7K     0.9
Others             220M    12.3 .182232     26M     0.7 .021536   15.6K    34.5
Streams AQ           0M     0.0      0M      0M     0.0      0M       1     9.0
TOTAL:             3.9T 3.0E+04    3.3G  167.4G  1227.8 141.992 3327.8K    23.3
          -------------------------------------------------------------

Instance Activity Stats
-> Ordered by statistic name

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
physical read total IO requests          35,889,568       29,728.4       3,239.4
physical read total bytes         4,261,590,852,608 3.52999864E+09 3.8465483E+08

physical write total IO requests          1,683,381        1,394.4         151.9
physical write total bytes          205,090,714,624  169,882,555.1  18,511,663.0

Again I’ve coloured the IOPS measurements in red and the throughput measurements in blue. And as before we need to convert the bytes per second values shown in the Instance Activity Stats section to MB/sec as shown in the IOStat by Function Summary section.

IOStat by Function

Instance Activity Stats

Error Percentage

Read IOPS

30,000

29,728.4

0.91%

Write IOPS

1227.8

1,394.4

11.95%

Read Throughput

3.3 GB/sec

3,379.2 MB/sec

3,529,998,640 bytes/sec

3,366.469 MB/sec

0.38%

Write Throughput

141.992 MB/sec

169,882,555.1 bytes/sec

162.01263 MB/sec

12.36%

Do you see what’s bugging me here? The write values for both IOPS and throughput are out by over 10% when I compare the values in IOStat by Function against the Instance Activity Stats. Ten percent is a massive margin of error at this level – we’re talking 20MB/sec. To translate that into something easier to understand, if 20MB/sec were sustained over a 24 hour period it would amount to over 1.6TB of data. I’ve seen smaller data warehouses.

So why is this happening? Unfortunately I don’t have access to the system where this AWR report was created, so I cannot tell if, for example, there was an instance restart between the start and end snapshots (although the elapsed time of the report was just 20 minutes so it seems unlikely).

The truth is I don’t know. Which is why I’m writing about it here… if you think you have the answer, or just as importantly if you see the same behaviour, let me know!

New Cookbook: Oracle Linux 6 Update 5 within an Oracle VM Template

Oracle-VMI’ve posted a new installation cookbook for using Oracle within a virtual machine running on Oracle VM. Surprisingly, I was unable to come up with a satisfactory method of accessing external storage that did not involve the use of Oracle ASMLib

Oracle Linux 6 Update 5 within an Oracle VM Template

Follow

Get every new post delivered to your Inbox.

Join 990 other followers