Implementing Linux native multipathing or DM-MPIO together with EMC PowerPath

puzzle

Guest Post

I’m delighted to say that this is another guest post from my good friend Nate Fuzi, who performs the same role as me for Violin but is based in the US instead of EMEA. Because he is American, Nate thinks that scones are called “biscuits”, that chips are called “fries” and that there is nothing – *nothing* – that cannot be improved with the simple addition of bacon. Clearly, something is fundamentally wrong with him – and yet he is like a brother to me. Like the strange, American step-brother I only see a few times a year and whom I cannot understand without the use of a translator. But he’s family all the same. So over to you Nate… and remember: Mom loved me more.

Remember when your parent answering your whiny “but whyyyyyyy???” with “Because I said so” was something you just had to accept? It meant there was no more explanation coming, and it was time for you to move on. Over the years, that answer broke down, and you grew confident you were owed more. And parents agreed: the more you demonstrated your ability to reason, the more reason you got to help you over the denial. It’s a sign of respect that we pay each other in adult life. And it can feel like disrespect if the reason offered feels weak or like it is intended to discourage further inquiry.

I was recently faced with solving what seemed a straightforward problem: take an existing Linux server running EMC’s multipathing software, PowerPath or “PP” as I will refer to it here, to access LUNs presented from that company’s SAN product, the VNX array, and attach and run Violin storage alongside the VNX. PP didn’t then support Violin arrays (still doesn’t at the time of this writing), so what was the client to do when they wanted to try out Violin’s AFA for their database environments? Just run PP and native Linux multipathing, called DM-MPIO, side by side, letting PP manage the VNX LUNs and DM-MPIO manage I/Os bound for Violin, right?

PowerPath versus DM-MPIO

PowerPath versus DM-MPIO

Wrong. Won’t work, I read. PowerPath does something at the HBA layer, I read a seemingly helpful web poster explain, that will corrupt either the VNX data or the Violin data. Well… maybe it will work, suggested another poster, but EMC might not support customers running in such a configuration. Others suggested ominously that PP and DM-MPIO don’t work well together… leaving it to the reader’s imagination what might result. I’m no master Googler, but I couldn’t find where anyone had put aside the rumors and vaguely threatening suggestions and actually tried it. Well, I did it, and I want to write about it so others know it can be done and how to do it because, well, those explanations I read didn’t stand up to question and felt like they were meant to scare me into not trying it. Of course I had to try it! Now, let’s be clear about what I am and am not saying: I am saying I have done this and it works. It’s in production at a customer site, running for months without issue. I am not saying that I have spoken to your EMC support rep and that you’ve been green-lighted to do this in your production environment. I’m not an EMC customer, and I don’t have a buddy in EMC support. So let’s consider this for informational purposes only for the time being.

First off, as several folks rightly pointed out, DM-MPIO could easily manage LUNs from both SAN products. Drop PP, configure DM-MPIO, and done. Well, that just sounds too simple. But it’s true: DM-MPIO has come a long way over the last few years and offers a pretty good set of features for free. PP costs money but is not without added value, as it does have additional configurability for reserve paths that become active in the event of a failure scenario, as well as IO distribution models beyond those offered by DM-MPIO, for example. My customer wanted to keep running PP, so this option was off the table for me.

Next up is the fun fact that PP advises you upon installation that you should “Blacklist all devices in /etc/multipath.conf and stop multipathd service”. The installer doesn’t say what will happen if you don’t do this, only that it is “*** IMPORTANT ***”. Check. Easy enough to ignore if this is the first thing you do. But if DM-MPIO is already running on the system and you try to start PP, it tells you this (verified in 5.7 and 6.0 only):

[root@host] # /etc/init.d/PowerPath start
Starting PowerPath:                                                      [FAILED]
Aborting PowerPath start since DM-Multipath is active.
Refer to PowerPath for Linux Installation and Administration Guide for more information

That’s a bummer. You actually have to stop multipathd and flush its paths before PP will start up. OK, I can do that. And, to be sure, you do NOT want both products attempting to manage IOs for the same device at the same time. That really is a bad thing. As we’ll see shortly, we might even want to segregate traffic across different FC ports, although this is strictly for optimization, not because you can’t mix traffic. But, as soon as we’ve installed the device-mapper-multipath-* packages, let’s honor this restriction right away by blacklisting the EMC devices in /etc/multipath.conf like this:

blacklist {
       devnode "^(control|vg|ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"    # standard stuff
       devnode "^hd[a-z][0-9]*"                                          # this line too
       device {
              vendor "DGC"
              product "*"
       }
       device {
              vendor "EMC"
              product "*"
       }
}

Note that the VNX line grew up in the Clariion company later acquired by EMC and presents a vendor string of “DGC”. Don’t ask me why. [Because the Clariion was a product from Data General Corporation? — flashdba :-)] It is my understanding that VMAX arrays do present “EMC” as their vendor string. Having done this, we want to explicitly except Violin devices from getting blacklisted:

blacklist_exceptions {
       device {
              vendor "VIOLIN"
              product "*"
       }
}

This isn’t completely necessary, but it does make clear our intentions: don’t manage VNX/EMC devices but do manage Violin devices. Having both entries in the file means that adding some third storage product to the FC SAN won’t cause it to get picked up by DM-MPIO without us consciously making it so. Belt and suspenders, they used to say.

Verify your multipath configuration without actually running it. Do this by adding the “-d” flag to your multipath command:

[root@host] # multipath -v3 -d

The “-v3″ flag gives us a verbose parsing of the configuration file so we can see each device and whether, what, and why DM-MPIO is going to do that with device. Make changes ad nauseum, and once you like what you see, run the command without “-d”, and create your multipath devices.

Cool. But remember when PP refused to start up earlier, saying DM-MPIO was found running? Guess what: PP’s inexplicable method of editing your /etc/rc.d/rc.sysinit script to insert its startup lines means it doesn’t attempt to start up until after DM-MPIO gets started on reboot. (Take a look for yourself; it’s there. It also makes you manually start PP if you apply a kernel update that resets the contents of rc.sysinit, at which point it reinserts the startup lines. Sweet.) How to get around this? I’m sure there are lots of solutions. I created a script to flush existing multipaths and start up PP in /etc/init.d and linked it as /etc/rc.d/rc3.d/S86PowerPath. This makes it so PP gets called just prior to DM-MPIO, and each is happy. The later call in /etc/rc.d/rc.sysinit is then redundant but causes no harm. I suppose you could almost as easily edit the rc.sysinit script to remove the check–just remember to make the same edit if/when you update PP.

Now, what was that I said a bit ago about segregating traffic on different HBA ports? This is not required; no magic is happening on the HBA with either product. Each one will discover the devices it is concerned with via its own callout routine and handle that device how you configure it to. But let’s imagine you have 4 FC ports on your host and choose to allow PP and DM-MPIO to each manage devices across all those ports. Neither will be aware what the other is doing in terms of trying to optimize IO distribution across all paths available, and you could well end up shooting yourself in the foot with sub-optimal end results. Segregating traffic also allows you to set different HBA queue depths or optimization settings as recommended by each storage vendor, and we all want to comply with best practices, right?!

Conclusion

None of this is meant to disparage EMC. Well, OK: the part about having the PowerPath startup script insert lines into /etc/rc.d/rc.sysinit is meant to disparage. I think that’s archaic and clunky. I have to believe there’s a more elegant way to do that today. I do hope I save some other soul the frustration I went through determining if this could be done and then how. If anyone has implemented more elegant solutions, I’d love to read about them.

Paris Oracle Meetup (Jeudi 26 Mars 2015 @ 7pm)

A quick post to say that this week, on Thursday 26th, I will be giving a talk (in English!) on the subject of Flash for DBAs (and architects, designers, developers, managers etc) at the Paris Oracle Meetup:

global_197134042Flash for DBAs: A new technology is sweeping the world of storage. Flash, a type of non-volatile memory, is gradually replacing hard disk drives. It began in consumer electronic devices such as phones, cameras and tablets – then it moved into laptops in the form of SSDs. And now the last fortress of hard drives us under attack: the data centre. Find out what flash memory is, what it means for your Oracle databases and how to use it to give you performance that was never possible in the old days of disk…

If you are in the Paris area please do feel free to register here:

http://www.meetup.com/parisoracle/events/221062002/

Postcards from Storageland: Three Years At Violin

calendar

A few weeks ago, in what seems to be a truly modern phenomenon, I became aware that it was my third anniversary of joining Violin after I noticed a number of people congratulating me on LinkedIn. In many ways it feels like I’ve already been here for a lifetime, but it was only twelve months ago I was trying to think of a suitable flash-based pun for the title of an article just like this one. This year I opted out of the “Three years in a flash” headline, it seemed a bit too lame. Those NAND-based puns were only ever a flash in the pan.*

So what’s happened in the three years since I joined Violin? Well, quite a lot. When I signed up in early 2012 Violin was pioneering the flash array industry – and when I say pioneering I mean that, unlike in today’s crowded AFA market, it was a pretty lonely place. The only other all-flash array vendor with a presence was Texas Memory Systems (TMS), but they had seemingly gone into hibernation in the markets I had exposure to (as it turned out they were looking for a buyer, which they found in the form of IBM).

I was one of the first employees in EMEA, part of a business which was rapidly expanding due to a global reseller agreement with HP for our 3000 series array. The main enemy was the status quo – monolithic disk arrays from EMC, IBM, HP, HDS etc, perhaps with a smattering of SSDs to try and alleviate the terrible performance of random I/O. With the 3000 on HP’s price list and no real competition to worry about it seemed like the world was there for the taking. Time to pay of the mortgage.

Were we overconfident? Guilty of hubris, perhaps? We must have alienated a few people in the industry because I know not everyone felt sympathy for what happened next.

Pride Cometh Before A Fall

With hindsight, the $2.35 billion that HP paid for 3PAR meant it was unlikely to continue using Violin as a strategic product. HP may have a history of write downs, but it simply couldn’t justify OEMing the new 6000 series array with 3PAR still on the books so… it didn’t.

Meanwhile, EMC purchased a company that hadn’t yet shipped a product, IBM did its deal with TMS, Cisco bizarrely purchased Whiptail (which now appears to be suspended as a product) and a number of SSD-based flash array startups (e.g. Pure Storage) appeared on the market.

crash-chartAll of which meant that, when Violin went to IPO, things didn’t exactly go to plan. In fact, it eventually resulted in a change of management and the introduction of a new CEO and management team who have systematically transformed the company over the last year. But at the time, it felt like a roller coaster.

So why am I reminiscing about the bad times? Partly because I don’t want to gloss over the past, but also because I genuinely think that Violin has had to do a lot of growing up in the last year or so – and that’s a good thing. When I look at other flash vendors throwing FUD at each other, getting into legal disputes over employees or burning bridges with their channel partners to try and get their pre-IPO books look more attractive… I can’t help a wry smile. Youth, eh? Some people still have harsh lessons to learn.

From Niche to Platform

This year, on the third anniversary of my joining Violin, we announced an important new product – the 7000 series Flash Storage Platform. Until the FSP, Violin had generally competed in the niche performance-optimized market – what some people call Tier 0 – where the single most important attribute is… well, performance (think database workloads). We’ve been pretty successful there, mainly because the 6000 series was (and still is) unbelievably fast, but also partly because much of the competition competes lower down in the capacity-optimized market (where price per GB is key – think VDI workloads). But we also attracted a surprising amount of criticism for the lack of certain Data Services features, such as deduplication (a feature that I’ve never coveted for database workloads).

But with the Flash Storage Platform, Violin – and flash in general – is moving into a new, larger and much more demanding market: Tier 1 primary storage. This is the big playground where all the major disk array vendors are desperately trying to stem the losses from their legacy SAN products. flash-market-venn-diagramIt’s also a market which is nearly 15 times larger than the one we used to operate in. And most importantly, it’s the one where you need to be able to deliver on all three requirements of the Primary Storage Trinity:

  • Performance (high IOPS and low latency)
  • Data Services (lots of features, fully integrated)
  • Capacity Optimization (low $/GB price)

By complete coincidence, this product launch also coincides with the end of the Understanding Flash section of my blog series on Storage for DBAs (when I started the flashdba blog it was aimed at database administrators, but over time the intended audience has expanded to anyone with an interest in flash storage).

With that in mind, in the next set of posts I’ll be turning my attention to the concepts and architecture of All Flash Arrays. What defines an AFA? What needs to be considered when designing one? And why doesn’t it make sense to stuff a load of SSDs into an existing disk array in the hope that it will deliver the performance of All Flash?

This is a really exciting time to be working in the storage industry – there’s lots to do and a massive opportunity to embrace. Because of this, the blog posts haven’t been coming as quickly as I’d hoped. But I still have much I want to talk about… so don’t worry, the next one will be back in a flash.**

* I really will stop making flash-based puns now

** Apart from this one

Understanding Flash: Fabrication, Shrinkage and the Next Big Thing

Semiconductor Fabrication Plant (picture courtesy of SemiWiki.com)

Semiconductor Fabrication Plant (picture courtesy of SemiWiki.com)

Before I draw this series on Understanding Flash to a close, I wanted to briefly touch on the subject of manufacturing. Don’t worry, I’ve taken heed of the kind feedback I had after my floating gate transistor blog post (“Please stop talking about electrons!“) and will instead focus on the commercial aspects, because ultimately they affect the price you will be paying for your flash-based storage. If you’re not familiar with the way things are you might be surprised…

The Supplier Landscape

Semiconductor chips, such as the NAND flash packages found in your storage, phones and tablets, are manufactured in fabrication plants known as “fabs” or flash “foundries”. To say that fabs are not cheap to build would be somewhat of an understatement – they are mind-bogglingly, ludicrously expensive.

In 2014 the global semiconductor industry posted record sales totalling $335.8 billion. That’s the entire semiconductor industry, not just the subset that produces NAND flash… and I think you’ll agree that’s quite a lot of money. But to put that figure into perspective, when Samsung decided to build an entirely new fab in late 2014, it had to commit $15 billion for a project that won’t be completed “until mid-2017″.

Clearly a fab is an eye-watering investment – and it is mainly for this reason that there are (at the time of writing) only six key companies worldwide who run flash foundries. What’s more, because of that staggering cost, four of those six are working together in pairs to share the investment burden. The four teams are:

  • Toshiba and SanDisk
  • Samsung
  • Intel and Micron
  • SK Hynix

With only four sets of fabs in operation, the market is hardly awash with an abundance of NAND flash – which of course suits the fab operators just fine as it keeps the price of flash higher. Oversupply would be unsustainable for an industry with such high costs.

Process Shrinking

Talking of costs, the fab operators are never allowed to stand still because of the relentless drive to make things smaller – Moore’s Law doesn’t just apply to processors; NAND flash is a semiconductor too. The act of taking the design of a microchip and reducing it in size is known as a process shrink – and it brings all sorts of benefits. Remember that a NAND flash cell is basically constructed from floating gate transistors? Well if those transistors are reduced in size:

  • Less current is needed per transistor, reducing the overall power consumption of the flash
  • This in turn reduces the heat output of an identical design with the same clock frequency (or alternatively the clock frequency can be increased)
  • Most importantly, more flash dies can be produced from the same silicon wafers (the raw material), resulting in either reduced costs or increased density

It may sound like smaller is better, but as always there is a flip side. Retooling a flash foundry to move to a smaller process geometry takes time and money – and the return on investment for the fab is reduced with each shrink. But we’ll come back to that in a minute.

Process Geometries

The different sizes in which flash is fabricated are known as process geometries. Traditionally, they are defined by measuring the distance (in nanometers) between the source and drain of each transistor (although these days this practice has become much less well-defined). In a strange quirk of algebra, the two digit numbers are written as <digit><letter> so that, for example, the first geometry to be in the range 29-20nm (e.g. 25nm) is called 2X, then a second in that range (e.g. 21nm) is called 2Y. Once into the teens (e.g. 19nm) we go back to 1X. [Although interestingly, Toshiba and SanDisk have a second generation NAND which they call 1Y to distinguish it from first-generation 1X even though both are 19nm]

Here’s the current technology roadmap for NAND flash courtesy of my friends at TechInsights:

Technology Roadmap for NAND Flash (Image courtesy of TechInsights)

Technology Roadmap for NAND Flash (Image courtesy of TechInsights)

So what’s the problem with flash getting smaller? Well, hopefully for the very last time, cast your mind back to the concept of floating gate transistors. On these tiny devices the floating gates effectively capture and store electrons that are passing from one side of the transistor to another, retaining charge – and that charge determines the cell’s value of 0 or 1. But at the tiny geometries that flash is approaching now, the number of electrons involved is desperately small, resulting in large margins for error. Electrons, after all, are notoriously good at hide and seek.

Floating Gate Limitations (SK Hynix Presentation - Aug 2012)

Floating Gate Limitations (SK Hynix Presentation – Aug 2012)

This means that flash cells are less reliable, that even more error correction codes are required… and that ultimately, we are reaching what is known as the scaling limit. It looks like the end is it sight for flash as we know it.

3D NAND Flash

Except it isn’t. If you look again at the Technology Roadmap picture you’ll see entries appearing alongside each foundry operator for 3D NAND. This is a significant new fabrication process that looks like it will be the direction taken by the flash industry for the foreseeable future. I would love to attempt a detailed explanation of 3D NAND design here, but a) it’s beyond me, b) I promised to listen to the feedback after the last time I went sub-atomic, and c) Jim Handy (AKA The Memory Guy) already has a fantastic set of articles all about it.

The point is, 3D NAND maps out a future for NAND flash beyond the scaling limits of 2D planar NAND – and the big fab operators have already invested. That puts 3D NAND in pole position as we increasingly look to non-volatile memory to store our data.

But there are other technologies trying to compete…

The Next Big Thing

The holy grail of memory technology is universal memory. This is a hypothetical technology which brings all the benefits, but none of the drawbacks, of the multitude of memory technologies in use today: SRAM, DRAM, NAND Flash and so on. SRAM (often used on-chip as a CPU cache) is fast but expensive, DRAM is cheaper but must be constantly refreshed (using considerably more power) while NAND flash is comparatively slower and wears out with use. If only there were some new technology that met all our requirements?

Well, first of all, it’s worth mentioning that if universal memory did come about it would have far-reaching consequences on the way we design and build computers… but all the same, there are various technologies in R&D right now which make claims to be the NVM solution of the future: PCM, MRAM, RRAM, FeRAM, Memristor and so on. But these – and any other – technologies all face the same challenge: that massive initial investment required to go from prototype to production. In other words, the cost and time associated with building a fabrication plant.

R&D Centre for SHRAM

R&D Centre for SHRAM

Say you converted your shed into a clean room and invented a wonderful new memory technology called Shed-RAM (SHRAM). SHRAM is fast, cheap, dense and only uses a tiny amount of power. You’ve still got to convince somebody to splurge billions of dollars on a foundry before it can be productised. And, as I’m sure you’ll agree, that’s a pretty big bet on an untested technology – especially if it takes years to build. What happens if, one year in, your neighbour (who recently converted his garage into a clean room) invents Garage-RAM (GaRAM) which makes your SHRAM worthless? Nobody is going to cope well with the loss of billions of dollars on a bad bet.

Conclusion

The investment hurdle required to turn a new NVM technology into a product is both a challenge and a stabiliser to the NVM industry. In theory it could stifle potential new technologies, but at the same time (at least for those of us in enterprise storage) it means there is plenty of warning when the tide turns: you are likely to see any successful new tech in your phone before it appears in your data centre.

Now, who can lend me $15 billion to build a new fab? I can’t afford anything more than 0% interest, but I can offer a lifetime’s supply of USB sticks to sweeten the deal…

Oracle Exadata X5: The Road To Ten Billion Dollars

money

Now that the dust has settled on the announcement of Oracle’s new Exadata X5 Database Machine, I’ve been doing some research in order to update my History of Exadata post (it’ll be ready soon). While reviewing the datasheets and other collateral for the X5 I was struck by the meteoric increase in one particular statistic: the number of processor cores on each database server. Oracle is riding that Moore’s Law train all the way to the bank.

The thing is, the number of cores per database server is directly linked to the cost of licensing the Oracle Database for each Exadata machine – and that means trouble if you are the one paying the bills. Assuming you buy a full rack – and that you license every core in every database server (which is the most common choice, since only a very brave minority would consider the Oracle VM Trusted Partitions option), your license cost has been increasing by 50% for each of the last two Exadata releases (X3-2 to X4-2 to X5-2). Let’s have a look at that in graphical form (click on the image to enlarge):

Exadata DB License Cost Comparison X2-2 to X5-2

Let’s not forget here that I am only plotting the cost of licensing the database software. We are not taking into account the extra costs associated with paying for the hardware, licensing the storage servers or purchasing any of the pretty-much essential enterprise edition options such as Oracle RAC, partitioning, multitenancy or the diagnostic pack license. Nor are we considering the infamous 22% per annum software support costs. I’m also using the list price – which you would never expect to pay – but even if you used discounted prices the percentage increases would remain the same.

Anyway, after crunching the numbers it turns out there is good news and bad news…

Good News for Oracle

The good news for Oracle is that if Exadata continues to increase the number of cores at the rate of 50% extra per release, the list price for licensing the database software on the future X23-2 model will be $10.1 billion:

Exadata DB License Cost Projection

There’s no doubt about it – this will pay for a lot of yachts.

Bad News for Oracle

There is a downside though. As Kevin Closson has already shown, Oracle appears to be having trouble balancing the I/O capabilities of the X5 against this tremendously-increased compute power. Even after abandoning previous claims that a memory hierarchy with “low-cost disk” as the bottom tier would bring the “highest performance at the lowest cost” to customers, the new Oracle Exadata X5 “Extreme Flash” model (because apparently flash on its own isn’t enough, it has to be extreme flash) struggles to deliver an improvement in IOPS per flash device, as Kevin has shown.

exadata-x5-ef-price-per-io

Excerpt from the Oracle Exadata X5-2 Press Release

You wouldn’t think this from reading the press release, which promises “breakthrough performance and price per I/O” (emphasis added by me). Price per I/O, eh? That sounds like we take the overall price and divide it by the number of IOPS the system can deliver, right?

So let’s do that. And to be generous, let’s only look at the database license cost (ignoring all those other costs again) and take the maximum IOPS number from the datasheets (which in most cases is for unrealistic, 100% read only, fully cached workloads). How will it look? I’ll overlay it as a line (in blue) on top of the first graph:

Exadata DB License Cost Comparison

Well it turns out that the price per I/O is actually falling: it’s down from $1.71 on the X4-2 (HP) model to $1.65 on the X5-2 EF. But three and a half percent is not much of an improvement considering the 50% extra on the price tag, is it? And as for offering a “breakthrough” price per I/O, the X2-2 was better at only $1.52 per I/O per second!

Summary

In my view, something is not right about the balance between compute and storage on the Exadata X5. It feels as though Oracle is bumping up the compute power faster than would be architecturally prudent because this results in a higher purchase price. Maybe I’m wrong – I have no insider knowledge and can only speculate… but when the Exadata X23-2 finally comes out in a couple of decades time, maybe we’ll know for sure.

Footnote

The comments section of this article makes for interesting reading, with responses from a number of Oracle employees – although not necessarily speaking on behalf of the mothership. The noteworthy (not to mention calm and measured) comments come from the Exadata Product Manager, Gurmeet Goindi. In addition I’d like to draw your attention to the following two URLs:

Deprecation of Non-CDB Architecture in Oracle 12c

dead-end

Back in July 2013, Oracle released the latest version of its flagship database product, Oracle 12c. Among the usual fanfare was information about a number of new options – including one known as Multitenant. With the Multitenant option, databases use a new architecture which features a container database (or CDB) which in turn contains one or more pluggable databases (or PDBs). Use of Multitenant requires a licence – which at the time of writing retails at $17,500 per processor (perpetual) plus 22% per annum for support.

This post is not intended to discuss the way Multitenant works – if you want to read more about it, Tim has a great set of articles about Multitenant here. But keep in mind that you can choose to install the Multitenant feature or not. If you do install it, you can create a single PDB within your CDB without requiring the license. As soon as you use more than one PDB the license is required.

What I want to talk about is Oracle’s attitude to its customers and what seems to me to be breathtaking arrogance. Personally I can think of three very good reasons why I might not want to use the single PDB within a CDB configuration which does not require a Multitenant license:

  • Multitenant requires additional configuration and the use of new administrative commands, which means re-writing admin procedures and re-training operations staff
  • Multitenant is an entirely new feature, with new code paths – which means it carries a risk of bugs (the list of bug fixes for the 12.1.0.2 patchset contains a section on Pluggable/Container Databases which lists no fewer than 105 items)
  • With the Multitenant option installed it is possible to trigger the requirement for an expensive set of licenses due to human error… without the option installed this is not possible

So it seems to me that, while Multitenant might be an interesting and useful new feature to evaluate, it is not something that I would want to be forced into using on production environments just yet. As always, people who manage production environments are conservative in their attitude to risk.

And that’s why I’m surprised to see this deprecation notice in the 12.1 documentation:

cdb_dep

The non-CDB architecture (i.e. the old way of building a database without CDBs and PDBs) has been deprecated in 12c and “may be desupported or unavailable” in later releases of Oracle Database. In other words, you need to change to using the CDB and PDB configuration now, even if you do not plan to purchase the Multitenant option.

It would be nice to have the choice, wouldn’t it?

Deprecated versus Desupported

OK so first let’s just remember that the term deprecated does not mean the same as the term desupported. We can dip back into the documentation to define these two important terms:

“By deprecate, we mean that the feature is no longer being enhanced but is still supported for the full life of the 12.1 release. By desupported, we mean that Oracle will no longer fix bugs related to that feature and may remove the code altogether.”

- Oracle Database 12c Upgrade Guide

It seems that Oracle will still support the use of non-CDB databases and will continue to do so for the lifetime of the 12.1 release. But, if you were designing a new system right now, it would take some confidence to choose a configuration which is deprecated and already living on death row.

And there’s more. The deprecation notice says there are some features that still do not work with the CDB architecture – and that if you want to use these you should use the deprecated non-CDB architecture. The list of features which are restricted or not available includes Automatic Data Optimization, Heat Maps, DBVERIFY and Flashback Pluggable Database (you can see the complete lists here for 12.1.0.1 and 12.1.0.2).

So we can add a fourth reason to our list of three drawn up earlier:

  • one-wayThe Multitenant causes a number of other options or features to be unusable

Now, given that I think all four reasons stated here are good enough to stand up on their own, what does this say about Oracle’s decision to deprecate non-CDB architectures?

You can draw your own conclusion, but I can’t help see it as arrogance on Oracle’s part as they force customers to use a specific new configuration with little regard for how it affects their operations. At worst, I don’t like being forced into changes by the vendor (to whom customers pay large amounts of money) while at best I would at least expect them to get all the features working before forcing the issue…

Update – 18 February 2015

Since I published this article back in late January I’ve had a lot of comments – both here and on Twitter. Some agree with me, some disagree – and unsurprisingly many of the latter are Oracle employees. One particular Oracle employee took to his corporate blog to post a four-part response!

One of those responses was in regard to my concern that, since it seemingly cannot be unlinked, customers may be able to inadvertently trigger usage of the Multitenant feature and thus incur an expensive and unexpected license bill. [I have no knowledge that this has ever happened, I am merely concerned that it may be possible.]

I’d like to quote the following response from our friend on the Oracle blog (his own opinion, not the views of his employer) who is apparently looking to rubbish that concern. I have placed the most enlightening part of the text in bold for emphasis:

This bit of FUD is silly. First of all, this risk already exists with various features of the Oracle database. For example, many of the OEM packs can be inadvertently used without a license as could several of the views in the database itself. Partitioning is another example that comes to mind. Often it’s installed in a database but it’s use requires a license.

So, how is this any different? Well, it’s not. Simply put, this is an argument for enterprise compliance auditing/management.

I hope this convinces you more than it convinces me.

Understanding Flash: Floating Gates and Wear

OLYMPUS DIGITAL CAMERA

One of the important characteristics of flash memory is wear. We know from previous articles in this series that flash packages consist of dies, which contain planes, which contain blocks, which in turn contain pages. We also know that these pages contain individual cells which store the bits of data… but to understand what wear is we need to look a little bit closer at those cells, where we will find something called a floating gate transistor.

Now don’t run away. Electronics might not be your thing, but I’m won’t be getting too deep. [After all, I once attempted degree-level education in Electronic Engineering but had to move to another course because I kept burning my fingers on the soldering irons during practical laboratory sessions…] If you really don’t want to think about transistors you can just skip to the section called Wear, or ignore this blog post entirely and spend a few minutes looking at picture of cats instead.

Field Effect Transistors

If you are reading this blog on a computer or a phone (and how else could you be reading it?) you owe a debt of gratitude to a humble device called the metal oxide semiconductor field effect transistor, or MOSFET. These tiny devices revolutionised the world and are considered by some to be the most important invention of the 20th century.

Students being stimulated to go to a bar and drink alcohol

Students being stimulated to go to a bar and drink alcohol

They therefore deserve to be explained in a serious and respectful manner, but sadly I don’t have time for that so I’m going to resort to one of my silly analogies. Again.

Imagine you have a house full of students, which we’ll call the source. Two doors down the road you have a nightclub full of free alcohol, which we’ll call the drain. However, it’s freezing cold outside and the students won’t venture out of the door, meaning there is no flow of students from source to drain. Now let’s set up a banging sound system behind the wall of the property in between. If we play music loud enough through the wall then we can excite the students into action, causing them to run down the road and enter the nightclub, which creates the flow. The loud music (which we’re going to call the gate) is the stimulus which causes this flow, essentially acting as a switch, while the volume of the music which first drives the students into action is called the threshold. The beauty of this design is that we can control the flow of students from behind the wall, therefore avoiding having to come into direct contact with them. Phew.

I really cannot tell you how bad that analogy is, but I’m afraid it’s only going to get worse later on. If you don’t know how a transistor works I implore you to watch this short video from the excellent folk at Veritasium, which describes it far better than I ever intend to.

Metal Oxide Semiconductor Field Effect Transistor (MOSFET)

Metal Oxide Semiconductor Field Effect Transistor (MOSFET)

In a MOSFET, electrons (students!) can be allowed to flow from the source terminal to the drain terminal by the application of charge to the gate terminal. This charge creates an electric field which alters the behaviour of the silicon layers (the pinkish parts of the diagram) and thus controls the flow. The important bit here is the yellow rectangle which represents an insulating layer (commonly known as the oxide layer). This means the gate is never physically connected to either the source or drain terminals – and you’ll see why this is important as we turn our attention now to something called the floating grate transistor, or FGMOS.

There are many things I haven’t explained in that clumsy analogy, but this is not an electronics blog. Of specific interest is the way that the two types of silicon (n and p) are doped in order to create free electrons and electron holes. Seriously, watch the video. If you haven’t ever learnt how a semiconductor works and my student party analogy is the nearest you get, it would be a crime. Although that’s not going to stop me from using it again in the next section…

Floating Gate Transistors

Floating Gate MOSFET (FGMOS)

Floating Gate MOSFET (FGMOS)

The diagram on the right (labelled “FGMOS”) is of a Floating Gate MOSFET, which is essentially what you will find in a flash memory cell. If you play spot the difference with the previous diagram above (the one labelled “MOSFET”) you’ll see that there are now two gates, one above the yellow oxide layer as before but the other entirely surrounded by it. This second gate is known as the floating gate because it is completely electrically isolated.

Notice also that the oxide layer beneath the floating gate is deliberately thinner than that above it in the diagram. Now, back to my analogy…

strip-doorsThe students are still there, as is the nightclub. The property in between is still there, but this time we have replaced the brick wall at the front with one of those sets of PVC strip curtains that you sometimes find covering the doors of factories, or butchers shops to keep insects out (or even lining the edge of the cold aisle in a data centre). Inside, we’ll put some comfy chairs and maybe a beer fridge. This is now our floating gate party room and the PVC strips are our thinner section of oxide layer.

Excited students going to a nightclub are stimulated so that some fall into the "Floating Gate" room

Excited students going to a nightclub are stimulated so that some fall into the “Floating Gate” room

As we now play the music loud enough to exceed the threshold, the excited students will run up the road towards the bar as before, but this time – with enough excitement – some of them will pass through the divider into our floating gate room and remain there, thus trapping electrons (sorry, I mean students) on the floating gate. Meanwhile, as the floating gate room fills up with people, the sound of the music becomes more muffled which slows down the flow of students in the road outside. To put it another way, the volume threshold at which stimulation will occur rises if there are students (sorry, I mean electrons) on the floating gate.

In a FGMOS, if a high charge is applied to the control gate in the same manner as with a MOSFET, electrons flowing from source to drain can get excited and “jump” through the oxide layer into the floating gate, increasing its retained charge. This is the program operation we have talked about so many times before: the floating gate is the “bucket of electrons” from my previous posts (a classic case of mixed metaphors). To erase the charge stored on the floating gate, a high voltage is applied across the source and drain while a negative voltage is applied to the control gate, causing the retained electrons to “jump” back off the floating gate (through the oxide layer). I’m putting the word “jump” in inverted commas there because it’s slightly more complicated and usually involves a process called Fowler–Nordheim tunnelling. I’ll explain everything I understand about that in the next paragraph.

[This paragraph intentionally left blank]

Yeah, it’s complicated, it involves quantum mechanics and it’s way over my head. I’m just taking it for granted that it works.

Read Operations

Now that we have methods for programming and erasing we just need a way of testing the value stored: a read operation. When we were talking about the MOSFET in the previous section we could control the flow of charge (which I had better start calling current) between source and drain by varying the voltage applied to the gate.

FGMOS Read Thresholds: The voltage threshold at which current begins to flow from drain to source is different depending on the charge stored on the floating gate. By testing at an intermediate reference voltage (VtREF) called the "read point" we can determine whether the floating gate contains charge (which we call ZERO) or not (which we call ONE)

FGMOS Read Thresholds: The voltage threshold at which current begins to flow from drain to source is different depending on the charge stored on the floating gate. By testing at an intermediate reference voltage (VtREF) called the “read point” we can determine whether the floating gate contains charge (which we call ZERO) or not (which we call ONE)

In the FGMOS this can be turned around so that by measuring the current we can determine the voltage on the floating gate, because electrons trapped on the floating gate cause the threshold we’ve previously mentioned to move. By applying a certain voltage (VtREF in the diagram on the right) across the source and drain and then testing the current we can determine if the voltage on the gate is above or below a specific point, called the read point.

So, if we play music at a certain threshold volume when the floating gate party room is empty, the sound travels far enough to stimulate a flow of students to the nightclub. But if the room is full (the floating gate contains charge) this specific volume will not stimulate a flow. Instead, it will require a louder threshold volume to get the students out of bed. And I think it’s time to abandon the student analogy now… let us never speak of it again.

Read Points for SLC and MLC

Read Points for SLC and MLC (image courtesy AnandTech)

If you remember, flash comes in different forms: SLC, MLC and TLC. So why are MLC reads slower than SLC, with TLC even slower still? Well, because SLC contains only one bit of data (two possible states: a zero or one), we only need to test one threshold voltage, i.e. SLC only has one read point. But for MLC, where there are two bits (and therefore four possible states), there are three read points… while for TLC there are even more.

Remember the bucket full of electrons analogy? When you test the SLC bucket to see if it’s above or below 50% full, the answer will tell you whether the stored value is a zero or a one. But for the MLC bucket, the answer to that test isn’t enough: based on the first answer you then need to perform a second test to see if the bucket is above or below 25% / 75% [delete as appropriate] full. All this additional testing takes time, which is why SLC reads are faster than MLC reads, which in turn are faster than TLC reads.

But what about the wear?

Flash Wear

broken-blindsFinally I’m getting to the point. Remember the PVC strip curtains, like the ones in the main picture at the top of this post? What do you think happens to them as all those excited students (sorry) hurtle through them? They get damaged. In a FGMOS the oxide layer which isolates the floating gate from the silicon substrate is designed to be thin enough to allow quantum tunnelling of electrons when a high enough charge is applied, but this process gradually damages the layer. Reads are not a problem because only lower voltages are used and no electron tunnelling takes place. But program and erase operations are a different story, which is why wear is measured by the number of program/erase cycles. As the layer gets more damaged, the isolation of the floating gate is increasingly affected and the probability of stored charge leaking out will increase.

For SLC this is less of an issue, because during a read we only need to measure at one read point – so there is a lot of room for error either side of the threshold. But for MLC, with three read points, we need to be much more exact. Thus the wear caused by using SLC, MLC and TLC isn’t really very different, it’s simply that the tolerances for error are much finer with the increased bit counts of MLC and TLC.

At some point, the oxide layer for a FGMOS will become sufficiently degraded so that it can no longer store charge properly on the floating gate. We won’t know about this until it happens, but at some point a read from the cell will no longer be trustworthy. And clearly that’s a problem for data storage, which is why (just like disk) flash stores error correction codes (ECC) alongside user data to ensure that incorrect information is spotted and dealt with while any underlying pages are marked as unusable – all without impacting users.

Comparison of SLC, MLC and EMLC (courtesy of EE Times)

Comparison of SLC, MLC and EMLC (courtesy of EE Times)

A final point to consider is that there is a (more expensive) form of MLC known as eMLC (the e stands for enterprise, with “normal” MLC then sometimes referred to as consumer or cMLC). The only difference between eMLC and the standard cMLC we have discussed in the past is that program and erase operations are “slowed down” for eMLC in order to cause less damage to the oxide layer. This gives slightly reduced performance but also significantly reduces wear, allowing up to 10x more P/E cycles. Opinion is divided on whether this is actually a worthwhile investment (not for me though, I think it’s a waste of money in the majority of cases).

Hot News

That’s the end of this post – and as usual I’ve committed two of my three regular blogging sins: writing too much, using silly analogies and finishing on a terrible pun. Time to complete the trio:

Some flash companies have been experimenting with methods of refreshing the lifetime of flash, with one avenue of exploration focussed on providing a short burst of heat to repair the damage to the oxide layer. It has been claimed that flash treated this way can sustain over 100 million P/E cycles with no noticeable degradation. If that is really the case – and this technology can be put into production – we might finally find that the Talking Heads were correct: in the world of flash memory we are on the road to no-wear…

Follow

Get every new post delivered to your Inbox.

Join 977 other followers