How the Next Generation of Flash Storage is Changing the Economics Of SaaS Businesses (Recorded Webinar)

This week I had the opportunity to record a webinar on a subject very close to my heart, the Software-as-a-Service industry. From 2003 to 2007 I managed the production infrastructure for a global SaaS company through the transition from startup to acquisition (partly by Salesforce.com). At the time, SaaS was a relatively new phenomenon, predating any concept of “Cloud”, but the challenges we faced then are still very relevant today.

The company was run by charismatic American entrepreneur Mark Suster, now a well known venture capitalist and blogger. Looking back, it was an incredible learning experience – but I do also remember that I spent a lot of time trying to coax more performance our of multi-tenancy database platform, which was constantly being held back by … yes, you guessed it… disk performance.

The webinar was hosted by Kaminario (my employer) and co-presented by myself and Jeff Kaplan of ThinkStrategies. Here’s the synopsis and the link (registration required). Enjoy!

http://info.kaminario.com/how-flash-storage-is-changing-saas-businesses

Advances in all flash storage are reshaping infrastructure strategies for the modern data center. SaaS businesses are on the leading edge of adopting all flash storage as they build application delivery infrastructure that supports the performance, scalability, and agility required to deliver high quality business apps to users around the world.

Join Jeff Kaplan, Managing Director of ThinkStrategies and Chris Buckel, author of the FlashDBA Blog and Technology Evangelist from Kaminario for this webinar discussion of infrastructure strategies for modern SaaS Businesses.

Understanding Flash: The Fall and Rise of Flash Memory

This month sees the four year anniversary of some interesting events. Commonwealth countries around the world celebrated the Diamond Jubilee of Queen Elizabeth II. Whitney Houston was tragically found dead in a Beverly Hills hotel. The Caribbean was hit hard by sargassum seaweed invasion. And I made the decision to leave the comfort of Oracle databases and join the exciting new All-Flash Array industry.

Ok, I might have been stretching the use of the word “interesting” there. But for those with an interest in flash memory, February 2012 was still a very important month due to the publication of a research paper co-authored by the University of California’s Department of Computer Science and Engineering and Microsoft Research.

The paper was entitled The Bleak Future of NAND Flash Memory – and it wasn’t pleasant reading for somebody who had just abandoned a career in databases to bet everything on flash.

The Death of Flash Memory

I have never spoken to the authors of this paper so I don’t know where the “Bleak Future” title came from, but it seems reasonable to say that it was somewhat more inflammatory than the content. In the body of the paper, the authors examined the behaviour of NAND flash memory chips as the lithography shrank – and also as the number of bits per cell increased from SLC through MLC to TLC. At the time of publication the authors were examining 25nm technology but it was already obvious that this form of NAND (known as 2D planar NAND) was going to hit physical limitations beyond which it could no longer shrink. This is known in the semiconductor world as the scaling limit.

The paper concluded:

“SSDs will continue to improve by some metrics (notably density and cost per bit), but everything else about them is poised to get worse. This makes the future of SSDs cloudy”

This sentiment, along with the “bleak future” thing, caused a bit of a stir in the tech world. TheRegister, for example, ran a typically tongue-in-cheek headline: “Flash DOOMED to drive itself off a cliff – boffins“. Various industry bloggers discussed the potential of technologies like ReRAM to take over for the next decade, while HP made it’s annual claim that Memristor technology (a form of ReRAM) would soon be here to save the day. I started wondering if I should register the domain name ReRamDBA.com…

The Resurrection – Now Showing in 3D

Four years later, ReRAM is still just around the corner but now in the form of Intel and Micron’s 3D XPoint technology, while HP has significantly backtracked on its Memristor programme. Flash memory, meanwhile, is still going strong thanks to the introduction of vertical or 3D NAND.

“Reports of my death have been greatly exaggerated” – Mark Twain

Of course, hindsight is a wonderful thing. It’s easy to look back now at the publication of the Bleak Future… paper and consider it flawed. To see the flaws at the time of publication would have required a bit more thought.

So that’s why this month’s hero is Allyn Malventano, Storage Editor for PC Perspective, who published an article on 21st February 2012 (the same month!) called NAND Flash Memory – A Future Not So Bleak After All in which he described the original publication as “bad science”. Allyn’s conclusion was so prescient that I’m going to quote it right here (although you should read the whole article to get the full context):

“The point I want all of you to take home here is that just as with the CPU, RAM, or any other industry involving wafers and dies, the manufacturers will adapt and overcome to the hurdles they meet. There is always another way, and when the need arises, manufacturers will figure it out.”

Bravo. Samsung is now manufacturing its third generation V-NAND chips, while the Toshiba/SanDisk and Intel/Micro partnerships are both going 3D. Samsung’s V-NAND has already moved from 24 through 32 to 48 layers, while it has been theorised that there is no natural limit on the number of layers possible.

Of course, there’s always the spectre of a new technology sweeping everything before it – and the big story right now is Intel/Micron’s 3D XPoint technology. Will it take over from flash in the future? Who knows.

One thing I do know is that new technologies find their rightful place when they are both technically capable and economically viable. If 3D XPoint or any other non-volatile memory product can win the day, it will leave us all better off – and hopefully without the need for alarmist research papers.

Now, if you’ll excuse me, I’m off to check on the availability of the 3D-XPointDBA.com domain name…

(You can read more about 3D NAND here)

Understanding Flash: What is 3D NAND?

About 18 months ago I wrote a post describing the different types of NAND flash known as SLC, MLC and TLC. However, 18 months is a lifetime in the world of technology so now I need to clarify it based on the widespread adoption of a new type of NAND flash. Let me explain…

Recap: 2D Planar NAND

Until recently, most of the flash memory used for data storage was of a form known as 2D Planar NAND and could be found in three types called Single Level Cell (SLC), Multi-Level Cell (MLC) and TLC (Triple-Level Cell). I always used to use my bucket of electrons analogy to describe the difference between them:

Each cell within planar NAND flash memory stores charge in a way similar to how a bucket stores water. By considering an imaginary line half-way up the bucket we can assign a binary one or zero based on whether the bucket contains more or less water than the line. Thus a full bucket, or a fully-charged NAND cell, denotes a zero while an empty bucket / cell denotes a one… assuming we are considering SLC, where each bucket stores one bit.

Moving to MLC (two bits) or TLC (three bits) is therefore a case of adding more lines, allowing us to differentiate between more states within the same bucket. The benefit is double (MLC) or quadruple (TLC) density but the drawback is that there will be a lower margin for error when measuring the amount of water/charge stored. As a consequence, the actions of reading, writing and erasing take longer while the endurance of the cell also drops drastically (leaky buckets are more of a problem as you try to be more precise about the measurements). The original article covers this all in more detail.

Shrinking Lithographies

If you remember, I also talked about the way that flash memory manufacturers are constantly shrinking the size of NAND flash cells in order to make increasingly dense packages, thus reducing the cost – but that the technology was now approaching its physical limits. In the bucket example, just imagine that the buckets are getting smaller and smaller. This is initially a good thing because smaller buckets (actually floating gate transistors) mean more buckets can fit in the same overall space, but in time the buckets become so small that they are no longer manageable – and then the technology hits a brick wall.

So Why Is It 2D?

In NAND flash memory, sets of cells are connected together in a string to form a NAND gate:

NAND-flash-structure — Image courtesy of Warren Miller at Avnet

If you consider one of the pieces of silicon substrate contained inside a flash chip as a rectangle with dimensions X and Y, each one of these strings of cells will take up some space stretching out in one of these two dimensions. Shrinking the lithography, i.e. manufacturing everything on a smaller scale, will give us the opportunity to fit more strings on the same about of substrate. But as we previously discussed, there comes a point when things are simply too small and too close together, resulting in interference and leakage.

3D NAND: Going Vertical

Image courtesy of Kristian Vättö at AnandTech

The cost of a semiconductor is proportional to the die size. It is therefore a good thing for the cost if more electronics can be crammed into the same tiny piece of silicon. The fundamental difference in 3D NAND, which gives rise to its name, is that the strings previously described are now arranged vertically – another words in the Z dimension. For this reason, Samsung calls the technology V-NAND.

Imagine the string of cells shown earlier, but this time stood on its end and then folded in two to make a U shape. We now have a vertical string which takes up only a fraction of the original space in the X and Y dimensions. What’s more, we can continue to build in the Z dimension as manufacturing processes allow. Samsung’s first generation of V-NAND had strings of 24 layers, while the second generation had 32. The latest 3rd generation now has 48. And as Jim Handy explains, there are few theoretical limits on the number of layers possible. (Just to be clear, these layers are all within the same “wafer” of silicon, otherwise there would be no cost benefit…)

Crucially, since the move to a Z dimension relieves the pressure on the X and Y dimensions, 3D flash is actually free to return to a slightly larger lithography, thus avoiding all of the nasty problems that 2D planar NAND was starting to hit as it approached the 10nm range.

Charge Trap Flash and 3D TLC

Aside from the vertical stacking, there is another fundamental change with 3D NAND – it no longer uses floating gate transistors (yes, that’s right, the buckets from earlier). Instead, it uses a technology called Charge Trap Flash. I’m not going to attempt an explanation of CTF here, but it was memorably described by Samsung as like using cheese instead of water. So, instead of the buckets from earlier, picture cheese.

This cheese has a number of benefits over floating gate transistors in terms of endurance and power consumption, but it still works in a similar way in terms of the number of bits that can be stored – in other words SLC, MLC and TLC. However, because of its better endurance rates, with 3D NAND it is now a realistic proposition to use TLC to replace 2D planar MLC (something my employer Kaminario has already embraced).

This is big news. The cost per density of 3D TLC NAND flash is revolutionary, with plenty of room for further developments as the flash fabricators add more layers. Three years ago it looked like NAND flash was a technology in terminal decline, but with 3D techniques the future is bright. We might even get to a point soon where we see the introduction of…

Quad-Level Cell (QLC) Flash

If the endurance of CTF-based 3D NAND is acceptable, it’s not hard to envisage one of the flash fabricators releasing a quadruple-level cell version of the medium. The potential benefit is an order-of-magnitude increase in density for roughly the same cost.

After all, everybody wants more cheese… right?

All Flash Arrays: Hybrid Means Compromise

Sometimes the transition between two technologies is long and complicated. It may be that the original technology is so well established that it’s entrenched in people’s minds as simply “the way things are” – inertia, you might say. It could be that there is more than one form of the new technology to choose from, with smart customers holding back to wait and see which emerges as a stronger contender for their investment. Or it could just be that the newer form of technology doesn’t yet deliver all of the benefits of the legacy version.

hybrid-car-toyota The automotive industry seems like a good example here. After over a century of using internal combustion engines, we are now at the point where electric vehicles are a serious investment for manufacturers. However, fully-electric vehicles still have issues to overcome, while there is continued debate over which approach is better: batteries or hydrogen fuel cells. Needless to say, the majority of vehicles on the road today still use what you could call the legacy method of propulsion.

However, one type of vehicle which has been successful in gaining market share is the hybrid electric vehicle. This solution attempts to offer customers the best of both worlds: the lower fuel consumption and claimed environmental benefits of an electric vehicle, but with the range, performance and cost of a fuel-powered vehicle. Not everybody believes it makes sense, but enough do to make it a worthwhile venture for the manufacturers.

Now here’s the interesting thing about hybrid vehicles… the thing that motivated me to write two paragraphs about cars instead of flash arrays… Nobody believes that hybrid electric vehicles are the permanent solution. Everybody knows that hybrids are a transient solution on the way to somewhere else. Nobody at all thinks that hybrid is the end-game. But the people who buy hybrid cars also believe that this state of affairs will not change during the period in which they own the car.

Hybrid Flash Arrays (HFAs)

There are two types of flash storage architecture which could be labelled a hybrid – those where a disk array has been repopulated with flash and those which are designed specifically for the purpose of mixing flash and disk. I’ve talked about naming conventions before, it’s a tricky subject. But for the purposes of this article I am only discussing the latter: systems where the architecture has been designed so that disk and flash co-exist as different tiers of storage. Think along the lines of Nimble Storage, Tegile and Tintri.

Why do this? Well, as with hybrid electric vehicles the idea is to bridge the gap between two technologies (disk and flash) by giving customers the best of both worlds. That means the performance of flash plus its low power, cooling and physical space requirements – combined with the density of disk and its corresponding impact on price. In other words, if disk is cheap but slow while flash is fast but expensive, HFAs are aimed at filling the gap.

Hybrid (adjective)

of mixed character; composed of different elements.

bred as a hybrid from different species or varieties.

As you can see there are a lot of synergies between this trend and that of the electric vehicle. Also, most storage systems are purchased with a five-year refresh cycle in mind, which is not dissimilar to the average length of ownership of a car. But there’s a massive difference: the rate of change in the development of flash memory technology.

In recent years the density of NAND flash has increased by orders of magnitude, especially with the introduction of 3D NAND technology and the subsequent use of Triple-Level Cell (TLC). And when the density goes up, the price comes down – closing the gap between disk and flash. In fact we’re at the point now where Wikibon predicts that “flash … will become a lower cost media than disk … for almost all storage in 2016″:

Image courtesy of Wikibon "Evolution of All-Flash Array Architectures" by David Floyer (2015) — *Image courtesy of Wikibon “Evolution of All-Flash Array Architectures” by David Floyer (2015)*

That’s great news for customers – but definitely not for HFA vendors.

Conclusion

And so we reach the root of the problem with HFAs. It’s not just that they are slower than All Flash Arrays. It’s not even that they rely on the guesswork of automatic tiering algorithms to move data between their tiers of disk and flash. It’s simply that their entire existence is predicated on the idea of being a transitory solution designed to bridge a gap which is already closing faster than they can fill it.

If you want proof of this, just look at the three HFA vendors I name checked earlier – all of which are rushing to bring out All Flash versions of their arrays. Nimble Storage is the only one of the three to be publicly listed – and its recent results indicate a strategic rethink may be required.

When it comes to hybrid electric vehicles, it’s true that the concept of mass-owned fully-electric cars still belongs in the future. But when it comes to hybrid flash arrays, the adoption of All-Flash is already happening today. The advice to customers looking to invest in a five-to-seven year storage project is therefore pretty simple: Mind the gap.

Why Kaminario?

This summer I made the decision to leave my previous employer and join another vendor in the All Flash Array space – a company called Kaminario. A lot of people have been in touch to ask me about this, so I thought I’d answer the question here… Why Kaminario?

To answer the question, we first need to look at where the All Flash industry finds itself today…

The Path To Flash Adoption

We all know that disk-based storage has been struggling to deliver to the enterprise for many years now. And most of us are aware that flash memory is the technology most suitably placed to take over the mantle as the storage medium of choice. However, even keeping in mind the typical five-to-seven year refresh cycle for enterprise storage platforms, the journey to adopt flash in the enterprise data centre has been slower than some might have expected. Why?

There are three reasons, in my view. The first two are pretty obvious: cost and functionality. I’ll cover the third in another post – but cost and functionality have changed drastically over the four phases of flash:

Phase One: Extreme Performance

The early days of enterprise flash storage were pioneered by the likes of FusionIO with their PCIe flash cards. These things sold for a $/GB price that would seem obscene in today’s AFA marketplace – and (at least initially) they had almost no functionality in terms of thin provisioning, replication, snapshots, data reduction technologies and so on. They weren’t even shared storage! They were just fast blobs of flash that you could stick right inside your server to get performance which, at the time, seemed insanely fast – think <250 microseconds of latency.

This meant they were only really suitable for extreme performance requirements, where the cost and complexity was justified by the resultant improvement to the application.

Phase Two: Niche Performance Applications

The next step on the path to flash adoption was the introduction of flash as shared storage (i.e. SAN). These were the first All Flash Arrays, a marketplace pioneered by Violin Memory (my former employer) and Texas Memory Systems (subsequently acquired by IBM). The fact that they were shared allowed a larger number of applications to be migrated to flash, but they were still very much used as a niche performance play due to a lack of features such as data reduction, replication etc.

Phase Three: Virtualization for Servers and Desktops

The third phase was driven by the introduction of a very important feature: data reduction. By implementing deduplication and/or compression – therefore massively reducing the effective price in $/GB – a couple of new entrants to the AFA space were able to redefine the marketplace and leave the pioneering AFA vendors floundering. These new players were Pure Storage and EMC with its XtremIO system – and they were able to create and attack an entirely new market: virtualization. Initially they went after Virtual Desktop Infrastructure projects, which have lots of duplicate data and create lots of IOPS, but in time the market for Virtual Server Infrastructure (i.e. VMware, Hyper-V, Xen etc) became a target too.

Phase Four: General Purpose Storage

This is where we are now – or at least, it’s where we’ve just arrived. The price of flash storage has consistently dropped as the technology has advanced, while almost all of the features and functionality originally found on enterprise-class disk arrays are now available on AFAs. We’re finally at the point now where, with some caveats, customers are either moving to or planning the wholesale replacement of their general purpose disk arrays with All Flash. Indeed, with the constant evolution of NAND flash technology it’s no longer fanciful to believe that Backup and Archive workloads could also move to flash…

We are now at the inflection point where, thanks to the combination of data reduction features and constantly-evolving NAND flash development, the cost of All Flash storage has fallen as low as enterprise disk storage while delivering all the functionality required to replace disk entirely. We call this concept the All-Flash Data Centre.

So to answer the question at the start of this post, I have joined Kaminario because I believe they are ideally placed, architecturally and commercially, to lead the adoption of this new phase of flash storage – a technology that I fundamentally believe in.

Making The All-Flash Data Centre A Reality

I mentioned earlier that NAND flash is changing and evolving all the time. It reminds me a little of smartphones – you buy the latest and greatest model only for it to become yesterday’s news almost before you’ve worked out how to use it. But the typical refresh cycle of a smartphone is one-to-two years, while for enterprise storage it’s five-to-seven. That’s a long time to risk an investment in evolving technology.

Kaminario’s K2 All Flash Array is based on its SPEAR architecture. Essentially, what Kaminario has created is a high-performance, scalable framework for taking memory and presenting it as enterprise-class storage – with all the resilience, functionality and performance you would expect. When the company was first founded this memory was just that: DRAM. But since NAND flash became economically viable, Kaminario has been using flash – and the architecture is agile enough to adopt whichever technology makes the most sense in the future.

As an example of this agility, Kaminario was the first AFA vendor to adopt 3D NAND technology and the first to adopt 3D TLC. This obviously allows a major competitive advantage when it comes to providing the most cost-efficient All Flash Array. But what really drew me to Kaminario was their decision to allow customers to integrate future hardware (such as new types of flash) into their existing arrays rather than making them migrate to a new product as is typical in the industry. By protecting customers’ investments, Kaminario is taking some of the risk out of moving to an AFA solution. It calls this programme the Perpetual Array.

In addition to this, Kaminario has a unique ability to offer both scale out and scale up architecture (scalability is something I will discuss further in my Storage for DBAs series soon) and to deliver workload agnostic performance… all technical features that deliver real business value. But those are for discussion another day.

For today the message is simple: Kaminario is making the All Flash Data Centre a reality.. and I want to be here to help customers make that happen.

All Flash Arrays: SSD-based versus Ground-Up Design

In recent articles in this series I’ve been looking at the architectural choices for building All Flash Arrays (AFAs). I surmised that there are three main approaches:

Hybrid Flash Arrays
SSD-based All Flash Arrays
Ground-Up All Flash Arrays (which from here on I’ll refer to as Custom Flash Module arrays or CFM arrays)

I’ve already blown metaphorical raspberries at the hybrid approach, so now it’s time to cover the other two.

SSD or CFM: The Big Question

I think the most interesting question in the AFA industry right now is the one of whether the SSD or CFM design will win. Of course, it’s easy to say “win” like that as if it’s a simple race, but this is I.T. – there’s never a simple answer. However, the reality is that each method offers benefits and drawbacks, so I’m going to use this blog post to simply describe them as I see them.

Before I do that, let me just remind you of what the vendor landscape looks like at this time:

SSD-based architecture: Right now you can buy SSD-based arrays from EMC (XtremIO), Pure Storage, Kaminario, Solidfire, HP 3PAR and Huawei to name a few. It’s fair to say that the SSD-based design has been the most common in the AFA space so far.

CFM-based architecture: On the other hand, you can now buy ground-up CFM-based arrays from Violin Memory, IBM (FlashSystem), HDS (VSP), Pure Storage (FlashArray//m) and EMC (DSSD). The latter has caused some excitement because of DSSD’s current air of mystery in the marketplace – in other words, the product isn’t yet generally available.

So which approach is “the best”?

The SSD-based Approach

If you were going to start an All Flash Array company and needed to bring a product to market as soon as possible, it’s quite likely you would go down the SSD route. Apart from anything else, flash management is hard work – and needs constant attention as new types of flash come to market. A flash hardware engineer friend of mine used to say that each new flash chip is like a snowflake – they all behave slightly differently. So by buying flash in the ready-made form of an SSD you bypass the requirement to put in all this work. The flash controller from the SSD vendor does it for you, leaving you to concentrate on the other stuff that’s needed in enterprise storage: resilience, availability, data services, etc.

On the other hand, it seems clear that an SSD is a package of flash pretending to behave like a disk. That often means I/Os are taking place via protocols that were designed for disk, such as Serial Attached SCSI. Also, in a unit the size of an all flash array there are likely to be many SSDs… but because each one is an isolated package of flash, they cannot work together and manage the flash holistically. In other words, if one SSD is experiencing issues due to garbage collection (for example), the others cannot take the strain.

The Ground-Up Approach

For a number of years I worked for Violin Memory, which adopted the ground-up approach at its very core. Violin’s position was that only the CFM approach could unlock the full potential benefits from NAND flash. By tightly integrating the NAND flash into its array – and by using its own controllers to manage that flash – Violin believed it could deliver the best performance in the AFA market. On the other hand, many SSD vendors build products for the consumer market where the highest levels of performance simply aren’t necessary. All that’s required is something faster than disk – it doesn’t always have to be the fastest possible solution.

It could also be argued that any CFM vendor who has a good relationship with a flash fabricator (for example, Violin was partly-owned by Toshiba) could gain a competitive advantage by working on the very latest NAND flash technologies before they are available in SSD form. What’s more, SSDs represent an additional step in the process of taking NAND flash from chip to All Flash Array, which potentially means there’s an extra party needing to make their margin. Could it be that the CFM approach is more cost effective? [Update from Jan 2017: Violin Memory has now filed for chapter 11 bankruptcy protection]

SSD Economics

The argument about economics is an interesting one. Many technical people have a tendency to focus on what they know and love: technology. I’m as guilty of this as anyone – given two solutions to a problem I tend to gravitate toward the one that has the most elegant technical design, even if it isn’t necessarily the most commercially-favourable. Taking raw flash and integrating it into a custom flash module sounds great, but what is the cost of manufacturing those CFMs?

Manufacturing is all about economies of scale. If you design something and then build thousands of them, it will obviously cost you more per unit than if you build millions of them. How many ground-up all flash vendors are building their custom flash modules by the millions? In May 2015, IBM issued this press release in which they claimed that they were the “number one all-flash storage array vendor in 2014“. How many units did they ship? 2,100.

In just the second quarter of 2015, almost 24 million SSDs were shipped to customers, with Samsung responsible for 43.8% of that total (according to US analyst firm Trendfocus, Inc). Who do you think was able to achieve the best economy of scale?

Design Agility

The other important question is the one about New Stuff ™. We are always being told about fantastic new storage technologies that are going to change our lives, so who is best placed to adopt them first?

Again there’s an argument to be made on both sides. If the CFM flash vendor is working hand-in-glove with a fabricator, they may have access to the latest technology coming down the line. That means they can be prepared ahead of the pack – a clear competitive advantage, right?

But how agile is the CFM design? Changing the NVM media requires designing an entirely-new flash module, with all the associated hardware engineering costs such as prototyping, testing, QA and limited initial manufacturing runs.

For an SSD all flash array vendor, however, that work is performed by the SSD vendor… again somebody like Samsung, Intel or Micron who have vast infrastructures in place to perform that sort of work all the time. After all, a finished SSD must behave exactly like a disk, regardless of what NVM technology it uses under the covers.

Conclusion

There are obviously two sides to this argument. The SSD was designed to replace a fundamental bottleneck in storage systems: the hard disk drive. Ironically, it may be the fate of the SSD to become exactly what it replaced. For flash to become mainstream it was necessary to create a “flash-behaving-as-disk” package, but the flip side of this is the way that SSDs stifle the true potential of the underlying flash. (Although perhaps NVMe technologies will offer us some salvation…)

However, unless you are a company the size of Samsung, Intel or Micron it seems unlikely that you would be able to retain the manufacturing agility and economies of scale required to produce custom flash modules at the price point of SSDs. Nor would you be likely to have the agility to adopt new NVM technologies at the moment that they become economically preferable to whatever medium you were using previously.

Whatever happens, you can be sure that each side will claim victory. With the entire primary data market to play for, this is a high stakes game. Every vendor has to invest a large amount of money to enter the field, so nobody wants to end up being consigned to the history books as the Betamax of flash…

For younger readers, Betamax was the loser in a battle with VHS over who would dominate the video tape market. You can read about it here. What do you mean, “What is a video tape?” Those things your parents used to watch movies on before the days of DVDs. What do you mean, “What is a DVD?” Jeez, I feel old.

All Flash Arrays: Where’s My Capacity? Effective, Usable and Raw Explained

What’s the most important attribute to consider when you want to buy a new storage system? More critical than performance, more interesting than power and cooling requirements, maybe even more important than price? Whether it’s an enterprise-class All Flash Array, a new drive for your laptop or just a USB flash key, the first question on anybody’s mind is usually: how big is it?

Yet surprisingly, at least when it comes to All Flash Arrays, it is becoming increasingly difficult to get an accurate answer to this question. So let’s try and bring some clarity to that in this post.

Before we start, let’s quickly address the issue of binary versus decimal capacity measurements. For many years the computer industry has lived with two different definitions for capacity: memory is typically measured in binary values (powers of two) e.g. one kibibyte = 2 ^ 10 bytes = 1024 bytes. On the other hand, hard disk drive manufacturers have always used decimal values (powers of ten) e.g. one kilobyte = 10 ^ 3 bytes = 1000 bytes. Since flash memory is commonly used for the same purpose as disk drives, it is usually sold with capacities measured in decimal values – so make sure you factor this in when sizing your environments.

Definitions

Now that’s covered, let’s look at the three ways in which capacity is most commonly described: raw capacity, usable capacity and effective capacity. To ensure we don’t stray from the truth, I’m going to use definitions from SNIA, the Storage Networking Industry Association.

Raw Capacity: The sum total amount of addressable capacity of the storage devices in a storage system.

The raw capacity of a flash storage product is the sum of the available capacity of each and every flash chip on which data can be stored. Imagine an SSD containing 18 Intel MLC NAND die packages, each of which has 32GB of addressable flash. This therefore contains 576GB of raw capacity. The word addressable is important because the packages actually contain additional unaddressable flash which is used for purposes such as error correction – but since this cannot be addressed by either you or the firmware of the SSD, it doesn’t count towards the raw value.

Usable Capacity: (synonymous with Formatted Capacity in SNIA terminology) The total amount of bytes available to be written after a system or device has been formatted for use… [it] is less than or equal to raw capacity.

Possibly one of the most abused terms in storage, usable capacity (also frequently called net capacity) is what you have left after taking raw capacity and removing the space set aside for system use, RAID parity, over-provisioning (i.e. headroom for garbage collection) and so on. It is guaranteed capacity, meaning you can be certain that you can store this amount of data regardless of what the data looks like.

That last statement is important once data reduction technologies come into play, i.e. compression, deduplication and (arguably) thin provisioning. Take 10TB of usable space and write 5TB of data into it – you now have 5TB of usable capacity remaining. Sounds simple? But take 10TB of usable space and write 5TB of data which dedupes and compresses at a 5:1 ratio – now you only need 1TB of usable space to store it, meaning you have 9TB of usable capacity left available.

Effective Capacity: The amount of data stored on a storage system … There is no way to precisely predict the effective capacity of an unloaded system. This measure is normally used on systems employing space optimization technologies.

The effective capacity of a storage system is the amount of data you could theoretically store on it in certain conditions. These conditions are assumptions, such as “my data will reduce by a factor of x:1”. There is much danger here. The assumptions are almost always related to the ability of a dataset to reduce in some way (i.e. compress, dedupe etc) – and that cannot be known until the data is actually loaded. What’s more, data changes… as does its ability to reduce.

For this reason, effective capacity is a terrible measurement on which to make any firm plans unless you have some sort of guarantee from the vendor. This takes the form of something like, “We guarantee you x terabytes of effective capacity – and if you fail to realise this we will provide you with additional storage free of charge to meet that guarantee”. This would typically then be called the guaranteed effective capacity.

The most commonly used assumptions in the storage industry are that databases reduce by around 2:1 to 4:1, VSI systems around 5:1 to 6:1 and VDI systems anything from 8:1 right up to 18:1 or even further. This means an average data reduction of around 6:1, which is the typical ratio you will see on most vendor’s data sheets. If you take 10TB of usable capacity and assume an average of 6:1 data reduction, you therefore end up with an effective capacity of 60TB. Some vendors use a lower ratio, such as 3:1 – and this is good for you the customer, because it gives you more protection from the risk of your data not reducing.

But it’s all meaningless in the real world. You simply cannot know what the effective capacity of a storage system is until you put your data on it. And you cannot guarantee that it will remain that way if your data is going to change. Never, ever buy a storage system based purely on the effective capacity offered by the vendor unless it comes with a guarantee – and always consider whether the assumed data reduction ratio is relevant to you.

Think of it this way: if you are being sold effective capacity you are taking on the financial risk associated with that data reduction. However, if you are being sold guaranteed effective capacity, the vendor is taking on that financial risk instead. Which scenario would you prefer?

Use and Abuse of Capacities

Three different ways to measure capacity? Sounds complicated. And in complexity comes opportunities for certain flash array vendors to use smoke and mirrors in order to make their products seem more appealing. I’m going to highlight what I think are the two most common tactics here.

1. Confusing Usable Capacity with Effective Capacity

Many flash array vendors have Always-On data reduction services. This is often claimed to be for the customer’s benefit but is often more about reducing the amount of writes taking place to the flash media (to alleviate performance and endurance issues). For some vendors, not having the ability to disable data reduction can be spun around to their advantage: they simply make out that the terms usable and effective are synonymous, or splice them together into the unforgivable phrase effective usable capacity to make their products look larger. How convenient.

Let me tell you this now: every flash array has a usable capacity… it is the maximum amount of unique, incompressible data that can be stored, i.e. the effective capacity if the data reduction ratio were 1:1. I would argue that this is a much more important figure to know when you buy a flash system, because if you buy on effective capacity you are just buying into dreams. Make your vendor tell you the usable capacity at 1:1 data reduction and then calculate the price per GB based on that value.

2. My Data Reduction Is Better Than Yours

Every flash vendor thinks their data reduction technologies are the best. They can’t all be right. Yet sometimes you will hear claims so utterly ridiculous that you’ll think it’s some kind of joke. I guess we all believe what we want to hear.

Here’s the truth. Compression and deduplication are mature technologies – they have been around for decades. Nobody in the world of flash storage is going to suddenly invent something that is remarkably better than the competition. Sometimes one vendor’s tech might deliver better results than another, but on other days (and, crucially, with other datasets) that will reverse. For this reason, as well as for your own sanity, you should assume they will all be roughly the same… at least until you can test them with your data. When you evaluate competitive flash products, make them commit to a guaranteed effective capacity and then hold them to it. If they won’t commit, beware.

3. Including Savings From Thin Provisioning

Thin Provisioning is a feature whereby physical storage is allocated only when it is used, rather than being preallocated at the moment of volume creation. Thus if a 10TB volume is created and then 1TB of data written to it, only 1TB of physical storage is used. The host is fooled into thinking that it has all 10TB available, but in reality there may be far less physical capacity free on the storage system.

Some vendors show the benefits from thin provisioned volumes as a separate saving from compression and deduplication, but some roll all three up into a single number. This conveniently makes their data reduction ratios look amazing – but in my opinion this then becomes a joke. Thin provisioning isn’t data reduction, because no data is being reduced. It’s a cheap trick – and it says a lot about the vendor’s faith in their compression and deduplication technologies.

Conclusion

Don’t let your vendors set the agenda when it comes to sizing. If you are planning on buying a certain capacity of flash, make sure you know the raw and usable capacities, plus the effective capacity and the assumed data reduction ratio used to calculate it. Remember that usable should be lower than raw, while effective (which is only relevant when data reduction technologies are present) will commonly be higher.

Keep in mind that Effective Capacity = Usable Capacity X Data Reduction Factor.

Be aware that when a product with an “Always-On” data reduction architecture tells you how much capacity you have left, it’s basically a guess. In reality, it’s entirely dependent on the data you intend to write. I’ve always thought that “Always-On” was another bit of marketing spin; you could easily rename it as an “Unavoidable” or “No Choice” architecture.

In my opinion, the best data reduction technology will be selectable and granular. That means you can choose, at a LUN level, whether you want to take advantage of compression and/or deduplication or not – you aren’t tied in by the architecture. Like with all features, the architecture should allow you to have a choice rather then enforce a compromise.

Finally, remember the rule about effective capacity. If it’s guaranteed, the risk is on the vendor. If they won’t guarantee, the risk is on you.

So there we have it: clarity and choice. Because in my opinion – and no matter which way you measure it – one size simply doesn’t fit all.

Goodbye Violin…

There’s a joke among some of the longer-serving employees of Violin Memory that “Violin years” are like dog years: every normal year feels like a decade. It’s not meant in a negative way, it’s just that things change so quickly and dramatically in the flash industry that whenever we take a moment to look back it seems impossible to believe that it was only a few years ago when flash was considered a new technology and we were having to try and explain the point of its existence to a world totally reliant on disk.

So with that in mind it’s time to make an announcement… after thirty five “Violin years” I have made the decision to leave Violin Memory.

A lot of people have been emailing or private-messaging me recently, asking why I’ve been so quiet on social media – so I’m glad I can finally now answer some of those questions.

Why are you leaving Violin Memory?

I don’t really want to answer this at the moment. For now I will simply restrict myself to saying this: I have lots of friends who remain at Violin Memory and I consider them to be among the smartest and most genuine people in the industry. It’s been an absolute privilege to work alongside them, as well as a lot of fun. I’ll miss them, as well as our amazing customers and channel partners all around Europe.

Where are you going next?

I plan to take on a new role for another all flash array vendor. I will announce that sometime soon, so thank you to all who have shown an interest.

Can’t you just tell us the name?

No.

So what happens next?

Hopefully I’ll soon be in a position to talk about the future and fill in the “My Employer” box on the top right hand side of the page. But for now… stay tuned.

All Flash Arrays: Can’t I Just Stick Some SSDs In My Disk Array?

In the previous post of this series I outlined three basic categories of All Flash Array (AFA): the hybrid AFA, the SSD-based AFA and the ground-up AFA. This post addresses the first one and is therefore aimed at answering one of the questions I hear most often: why can’t I just stick a bunch of SSDs in my existing disk array?

Data Centre Dinosaurs

Disk arrays – and in this case we are mainly talking about storage area networks – have been around for a long time. Every large company has a number of monolithic, multi-controller cache-based disk arrays in their data centre. They are the workhorses of storage: ever reliable, able to host multiple, mixed workloads and deliver predictable performance. Predictably slow performance, of course – but you mustn’t underestimate just how safe these things feel to the people who are paid to ensure the safety of their data. Add to this the full suite of data services that come with them (replication, mirroring, snapshots etc) and you have all that you could ask for.

Except of course that they are horribly expensive, terribly slow and use up vast amounts of power, cooling and floor space. They are also a dying breed, memorably described by Chris Mellor of The Register as like outdated battleships in an era of modern warfare.

The SSD Power-Up

Every large vendor has a top-end product: EMC’s VMAX, IBM’s DS8000, HDS’s VSP… and pretty much every product has the ability to use SSDs in place of disk drives. So why not fill one of these monsters with flash drives and then call it an “All Flash Array”? Just like in those computer games where your spacecraft hits a power-up and suddenly it’s bigger and faster with better weapons… surely a bunch of SSDs would convert your ageing battleship into a modern cruiser with new-found agility and seaworthiness. Ahoy! [Ok, I’ll stop with the naval analogies now]

Well, no. And to understand why, let’s look at how most disk arrays are architected.

The Classic Disk Array Architecture

Let’s consider what it takes to build a typical SAN disk array. We’ll start with the most obvious component, which is the hard disk drive itself. These things have been around for decades so we are fairly familiar with them. They offer pretty reasonable capacity of up to 4TB (in fact there are even a few 6TB models out now) but they have a limitation with regard to performance: you are unlikely to be able to drive much more than 200 transactions per second.

At this point, stop and think about the performance characteristics of a hard disk drive. Disks don’t really care if you are doing read or write I/Os – the performance is fairly symmetrical. However, there is a drastic difference in the performance of random versus sequential I/Os: each I/O operation incurs the penalty of latency. A single, large I/O is therefore much more efficient than many, small I/Os.

This is the architectural constraint of every hard disk drive and therefore the design challenge around which we much architect our disk array.

In the next section we’re going to build a disk array from scratch, considering all the possibilities that need to be accounted for. If it looks like it will take too long to read, you can skip down to the conclusion section at the end.

Building A Disk Array

We’re now going to build a disk array, so the first ingredient is clearly going to be hard drives. So let’s start by taking a bunch of disks and putting them into a shelf, or some other form of enclosure:

At this point the density is limited by the number of disks we can fit into the enclosure, which might typically be 25 if they are of the smaller form factor or up to around 14 if they are the larger 3.5 inch variety.

Next we’re going to need a controller – and in that controller we’re going to want a large chunk of DRAM to act as a cache in order to try and minimise the number of I/Os hitting the disk enclosure. We’ll allocate some of that DRAM to work as a read cache, in the hope that many of the reads will be hitting a small subset of the data stored, i.e. the “hot” blocks. If this gamble is successful we will have taken some load off of the disks – and that is a Good Thing:

The rest of the DRAM will be allocated to a write cache, because clearly we don’t want to have to incur the penalty of rotational latency every time a write I/O is performed. By writing the data to the DRAM buffer and then issuing the acknowledgement back to the client, we can take our time over writing the data to the persistent storage in the disk enclosure.

Now, this is an enterprise-class product we are trying to build, so that means there are requirements for resiliency, redundancy, online maintenance etc. It therefore seems pretty obvious that having only one controller is a single point of failure, so let’s add another one:

This brings up a new challenge concerning that write cache we just discussed. Since we are acknowledging writes when they hit DRAM it could be possible for controller to crash before changed blocks are persisted to disk – resulting in data loss. Also, an old copy of a block could be in the cache of one controller while a newly-changed version exists in the other one. These possibilities cannot be allowed, so we will need to mirror our write cache between the controllers. In this setup we won’t acknowledge the write until it’s been written to both write caches.

Of course, this introduces a further delay, so we’ll need to add some sort of high speed interconnect into the design to make the process of mirroring as fast as possible:

This mirroring may protect us from losing data in the event of a single controller failure, but what about if power to the entire system was lost? Changed blocks in the mirrored write cache would still be lost, resulting in lost data… so now we need to add some batteries to each controller in order to provide sufficient power that cached writes can be flushed to persistent media in the event of any systemic power issue:

That’s everything we need from the controllers – so now we need to connect them together with the disks. Traditionally, disk arrays have tended to use serial architectures to attach disks onto a back end network which essentially acts as a loop. This has some limitations in terms of performance but when your fundamental building blocks are each limited to 200 IOPS it’s hardly the end of the world:

So there we have it. We’ve built a disk array complete with redundant controllers and battery-backed DRAM cache. Put a respectable logo on the front and you will find this basic design used in data centres around the world.

But does it still make sense if you switch to flash?

From HDD To SSD

Let’s take our finished disk array design and replace the disks with SSDs:

And now let’s take a moment to consider the performance characteristics of flash: the latency is much lower than disk, meaning the penalty for performing random I/Os instead of sequential I/Os is negligible. However, the performance of read versus write I/Os is asymmetrical: writes take substantially longer than reads – especially sustained writes. What does that do to all of the design principles in our previous architecture?

With so many more transactions per second available from the SSDs, it no longer makes sense to use a serial / loop based back end network. Some sort of switched infrastructure is probably more suitable.
Because the flash media is so much faster than disk (i.e. has a significantly lower latency), we can do away with the read cache. Depending on our architecture, we may also be able to avoid using a write cache too – resulting in complete removal of the DRAM in those controllers (although this would not be the case if deduplication were to be included in the design – more on that another time).
If we no longer have data in DRAM, we no longer need the batteries and may also be able to remove or at least downsize the high speed network connecting the controllers.

All we are left with now is the enclosure full of SSDs – and there is an argument to be made for whether that is the most efficient method of packaging NAND flash. It’s certainly not the most dense method, which is why Violin Memory and IBM’s FlashSystem both use their own custom flash modules to package their flash.

Conclusion

Did you notice how pretty much every design decision that we made building the disk array architecture turned out to be the wrong one for a flash-based solution? This shouldn’t really be a surprise, since flash is fundamentally different to disk in its behaviour and performance.

Architecture matters. Filling a legacy disk array with SSDs simply isn’t playing to the strengths of flash. Perhaps if it were a low cost option it would be a sensible stop-gap solution, but typically the SSD options for these legacy arrays are astonishingly expensive.

So next time you look at a hybrid disk array product that’s being marketed as “all flash”, do yourself a favour. Think about the architecture. If it was designed for disk, the chances are it’ll perform like it was designed for disk. And you don’t want to end up with that sinking feeling…

Thanks to my friend and former colleague Steve “yeah” Willson for the concepts behind this blog post. Steve, I dedicate the picture of a velociraptor at the top of this page to you. You have earned it.

Oracle’s ASM Filter Driver Revisited

Almost exactly a year ago I published a post covering my first impressions of the ASM Filter Driver (ASMFD) released in Oracle 12.1.0.2, followed swiftly by a second post showing that it didn’t work with 4k native devices.

When I wrote that first post I was about to start my summer holidays, so I’m afraid to admit that I was a little sloppy and made some false assumptions toward the end – assumptions which were quickly overturned by eagle-eyed readers in the comments section. So I need to revisit that at some point in this post.

But first, some background.

Some Background

If you don’t know what ASMFD is, let me just quote from the 12.1 documentation:

Oracle ASM Filter Driver (Oracle ASMFD) is a kernel module that resides in the I/O path of the Oracle ASM disks. Oracle ASM uses the filter driver to validate write I/O requests to Oracle ASM disks.

The Oracle ASMFD simplifies the configuration and management of disk devices by eliminating the need to rebind disk devices used with Oracle ASM each time the system is restarted.

The Oracle ASM Filter Driver rejects any I/O requests that are invalid. This action eliminates accidental overwrites of Oracle ASM disks that would cause corruption in the disks and files within the disk group. For example, the Oracle ASM Filter Driver filters out all non-Oracle I/Os which could cause accidental overwrites.

This is interesting, because ASMFD is considered a replacement for Oracle ASMLib, yet the documentation for ASMFD doesn’t make all of the same claims that Oracle makes for ASMLib. Both ASMFD and ASMLib claim to simplify the configuration and management of disk devices, but ASMLib’s documentation also claims that it “greatly reduces kernel resource usage“. Doesn’t ASMFD have this effect too? What is definitely a new feature for ASMFD is the ability to reject invalid (i.e. non-Oracle) I/O operations to ASMFD devices – and that’s what I got wrong last time.

However, before we can revisit that, I need to install ASMFD on a brand new system.

Installing ASMFD

Last time I tried this I made the mistake of installing 12.1.0.2.0 with no patch set updates. Thanks to a reader called terry, I now know that the PSU is a very good idea, so this time I’m using 12.1.0.2.3. First let’s do some preparation.

Preparing To Install

I’m using an Oracle Linux 6 Update 5 system running the Oracle Unbreakable Enterprise Kernel v3:

[root@server4 ~]# cat /etc/oracle-release
Oracle Linux Server release 6.5
[root@server4 ~]# uname -r
3.8.13-26.2.3.el6uek.x86_64

As usual I have taken all of the necessary pre-installation steps to make the Oracle Universal Installer happy. I have disabled selinux and iptables, plus I’ve configured device mapper multipathing. I have two sets of 8 LUNs from my Violin storage: 8 using 512e emulation mode (512 byte logical block size but 4k physical block size) and 8 using 4kN native mode (4k logical and physical block size). If you have any doubts about what that means, read here.

[root@server4 ~]# ls -l /dev/mapper
total 0
crw-rw---- 1 root root 10, 236 Jul 20 16:52 control
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpatha -> ../dm-0
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap1 -> ../dm-1
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap2 -> ../dm-2
lrwxrwxrwx 1 root root       7 Jul 20 16:53 mpathap3 -> ../dm-3
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata1 -> ../dm-6
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata2 -> ../dm-7
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata3 -> ../dm-8
lrwxrwxrwx 1 root root       7 Jul 20 16:53 v4kdata4 -> ../dm-9
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata5 -> ../dm-10
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata6 -> ../dm-11
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata7 -> ../dm-12
lrwxrwxrwx 1 root root       8 Jul 20 16:53 v4kdata8 -> ../dm-13
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data1 -> ../dm-14
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data2 -> ../dm-15
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data3 -> ../dm-16
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data4 -> ../dm-17
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data5 -> ../dm-18
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data6 -> ../dm-19
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data7 -> ../dm-20
lrwxrwxrwx 1 root root       8 Jul 20 17:00 v512data8 -> ../dm-21
lrwxrwxrwx 1 root root       8 Jul 20 16:53 vg_halfserver4-lv_home -> ../dm-22
lrwxrwxrwx 1 root root       7 Jul 20 16:53 vg_halfserver4-lv_root -> ../dm-4
lrwxrwxrwx 1 root root       7 Jul 20 16:53 vg_halfserver4-lv_swap -> ../dm-5

The 512e devices are shown in green and the 4k devices shown in red. The other devices here can be ignored as they are related to the default filesystem layout of the operating system.

Installing Oracle 12.1.0.2.3 Grid Infrastructure (software only)

This is where the first challenge comes. When you perform a standard install of Oracle 12c Grid Infrastructure you are asked for storage devices on which you can locate items such as the ASM SPFILE, OCR and voting disks. In the old days of using ASMLib you would have prepared these in advance, because ASMLib is a separate kernel module located outside of the Oracle GI home. But ASMFD is part of the Oracle Home and so doesn’t exist prior to installation. Thus we have a chicken and egg situation.

Even worse, I know from bitter experience that I need to install some patches prior to labelling my disks, but I can’t install patches without installing the Oracle home either.

So the only thing for it is to perform a Software Only installation from the Oracle Universal Installer, then apply the PSU, then create an ASM instance and finally label the LUNs with ASMFD. It’s all very long winded. It wouldn’t be a problem if I was migrating from an existing ASMLib setup, but this is a clean install. Such is the price of progress.

To save this post from becoming longer and more unreadable than a 12c AWR report, I’ve captured the entire installation and configuration of 12.1.0.2.3 GI and ASM on a separate installation cookbook page, here:

Installing 12.1.2.0.3 Grid Infrastructure with Oracle Linux 6 Update 5

It’s simpler that way. If you don’t want to go and read it, just take it from me that we now have a working ASM instance which currently has no devices under its control. The PSU has been applied so we are ready to start labelling.

Using ASM Filter Driver to Label Devices

The next step is to start labelling my LUNs with ASMFD. I’m using what the documentation describes as an “Oracle Grid Infrastructure Standalone (Oracle Restart) Environment”, so I’m following this set of steps in the documentation.

Step one tells me to run a dsget command and then a dsset command to add a diskstring of ‘AFD:*’. Ok:

[oracle@server4 ~]$ asmcmd dsget
parameter:
profile:++no-value-at-resource-creation--never-updated-through-ASM++
[oracle@server4 ~]$ asmcmd dsset 'AFD:*'
[oracle@server4 ~]$ asmcmd dsget
parameter:AFD:*
profile:AFD:*

Next I need to stop CRS (I’m using a standalone config so actually it’s HAS):

[root@server4 ~]# crsctl stop has
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'server4'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'server4'
CRS-2673: Attempting to stop 'ora.asm' on 'server4'
CRS-2673: Attempting to stop 'ora.evmd' on 'server4'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'server4' succeeded
CRS-2677: Stop of 'ora.evmd' on 'server4' succeeded
CRS-2677: Stop of 'ora.asm' on 'server4' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'server4'
CRS-2677: Stop of 'ora.cssd' on 'server4' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'server4' has completed
CRS-4133: Oracle High Availability Services has been stopped.

And then I need to run the afd_configure command (all as the root user). Before and after doing so I will check for any loaded kernel modules with oracle in the name, so see what changes:

[root@server4 ~]# lsmod | grep oracle
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm
[root@server4 ~]# asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
[root@server4 ~]# lsmod | grep oracle
oracleafd             211540  0
oracleacfs           3308260  0
oracleadvm            508030  0
oracleoks             506741  2 oracleacfs,oracleadvm
[root@server4 ~]# asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED' on host 'server4'

Notice the new kernel module called oracleafd. Also, AFD is showing that “filtering is enabled” – I guess this relates to the protection against invalid writes.

Time to start up HAS or CRS again:

[root@server4 ~]# crsctl start has
CRS-4123: Oracle High Availability Services has been started.

Ok, let’s start labelling those devices.

Labelling (Incorrectly)

Now remember that I am testing with two sets of devices here: 512e and 4k. The 512e devices are emulating a 512 byte blocksize, so they should result in ASM creating diskgroups with a blocksize of 512 bytes – thus avoiding all the tedious bugs from which Oracle suffers when using 4096 byte diskgroups.

So let’s just test a 512e LUN to see what happens when I label it and present it to ASM. First, the label is created using the afd_label command:

[oracle@server4 ~]$ ls -l /dev/mapper/v512data1
lrwxrwxrwx 1 root root 8 Jul 24 10:30 /dev/mapper/v512data1 -> ../dm-14
[oracle@server4 ~]$ ls -l /dev/dm-14
brw-rw---- 1 oracle dba 252, 14 Jul 24 10:30 /dev/dm-14
[oracle@server4 ~]$ asmcmd afd_label v512data1 /dev/mapper/v512data1
[oracle@server4 ~]$ asmcmd afd_lsdsk
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
V512DATA1                   ENABLED   /dev/sdpz

Well, it worked.. sort of. The path we can see in the lsdsk output does not show the /dev/mapper/v512data1 multipath device I specified… instead it’s one of the non-multipath /dev/sd* devices. Why?

Even worse, look what happens when I check the SECTOR_SIZE column of the v$asm_disk view in ASM:

SQL> select group_number, name, sector_size, block_size, state
  2  from v$asm_diskgroup;

GROUP_NUMBER NAME	SECTOR_SIZE BLOCK_SIZE STATE
------------ ---------- ----------- ---------- -----------
	   1 V512DATA	       4096	  4096 MOUNTED

Even though my LUNs are presented as 512e, ASM has chosen to see them as 4096 byte. That’s not wanted I want. Gaah!

Labelling (Correctly)

To fix this I need to unlabel that LUN so that AFD has nothing under its control, then update the oracleafd_use_logical_block_size parameter via the special SYSFS files /sys/modules/oracleafd:

[root@server4 ~]# cd /sys/module/oracleafd
[root@server4 oracleafd]# ls -l
total 0
-r--r--r-- 1 root root 4096 Jul 20 14:43 coresize
drwxr-xr-x 2 root root    0 Jul 20 14:43 holders
-r--r--r-- 1 root root 4096 Jul 20 14:43 initsize
-r--r--r-- 1 root root 4096 Jul 20 14:43 initstate
drwxr-xr-x 2 root root    0 Jul 20 14:43 notes
drwxr-xr-x 2 root root    0 Jul 20 14:43 parameters
-r--r--r-- 1 root root 4096 Jul 20 14:43 refcnt
drwxr-xr-x 2 root root    0 Jul 20 14:43 sections
-r--r--r-- 1 root root 4096 Jul 20 14:43 srcversion
-r--r--r-- 1 root root 4096 Jul 20 14:43 taint
--w------- 1 root root 4096 Jul 20 14:43 uevent
[root@server4 oracleafd]# cd parameters
[root@server4 parameters]# ls -l
total 0
-rw-r--r-- 1 root root 4096 Jul 20 14:43 oracleafd_use_logical_block_size
[root@server4 parameters]# cat oracleafd_use_logical_block_size
0
[root@server4 parameters]# echo 1 > oracleafd_use_logical_block_size
[root@server4 parameters]# cat oracleafd_use_logical_block_size
1

After making this change, AFD will present the logical blocksize of 512 bytes to ASM rather than the physical blocksize of 4096 bytes. So let’s now label those disks again:

[root@server4 mapper]# for lun in 1 2 3 4 5 6 7 8; do
> asmcmd afd_label v512data$lun /dev/mapper/v512data$lun
> done
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
Connected to an idle instance.
[root@server4 mapper]# asmcmd afd_lsdsk
Connected to an idle instance.
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
V512DATA1                   ENABLED   /dev/mapper/v512data1
V512DATA2                   ENABLED   /dev/mapper/v512data2
V512DATA3                   ENABLED   /dev/mapper/v512data3
V512DATA4                   ENABLED   /dev/mapper/v512data4
V512DATA5                   ENABLED   /dev/mapper/v512data5
V512DATA6                   ENABLED   /dev/mapper/v512data6
V512DATA7                   ENABLED   /dev/mapper/v512data7
V512DATA8                   ENABLED   /dev/mapper/v512data8

Note the correct multipath devices (“/dev/mapper/*”) are now being shown in the lsdsk command output. If I now create an ASM diskgroup on these LUNs, it will have a 512 byte sector size:

SQL> get afd.sql
  1  CREATE DISKGROUP V512DATA EXTERNAL REDUNDANCY
  2  DISK 'AFD:V512DATA1', 'AFD:V512DATA2',
  3	  'AFD:V512DATA3', 'AFD:V512DATA4',
  4	  'AFD:V512DATA5', 'AFD:V512DATA6',
  5	  'AFD:V512DATA7', 'AFD:V512DATA8'
  6  ATTRIBUTE
  7	  'compatible.asm' = '12.1',
  8*	  'compatible.rdbms' = '12.1'
SQL> /

Diskgroup created.

SQL> select disk_number, mount_status, header_status, state, sector_size, path
  2  from v$asm_disk;

DISK_NUMBER MOUNT_S HEADER_STATU STATE	  SECTOR_SIZE PATH
----------- ------- ------------ -------- ----------- --------------------
	  0 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA1
	  1 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA2
	  2 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA3
	  3 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA4
	  4 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA5
	  5 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA6
	  6 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA7
	  7 CACHED  MEMBER	 NORMAL 	  512 AFD:V512DATA8

8 rows selected.

SQL> select group_number, name, sector_size, block_size, state
  2  from v$asm_diskgroup;

GROUP_NUMBER NAME	SECTOR_SIZE BLOCK_SIZE STATE
------------ ---------- ----------- ---------- -----------
	   1 V512DATA		512	  4096 MOUNTED

Phew.

Failing To Label 4kN Devices

So what about my 4k native mode devices, the ones with a 4096 byte logical block size? What happens if I try to label them?

[root@server4 ~]# asmcmd afd_label V4KDATA1 /dev/mapper/v4kdata1
Connected to an idle instance.
ASMCMD-9513: ASM disk label set operation failed.

Yeah, that didn’t work out did it? Let’s look in the trace file:

[root@server4 ~]# tail -5 /u01/app/oracle/log/diag/asmcmd/user_root/server4/alert/alert.log
24-Jul-15 12:38 ASMCMD (PID = 8695) Given command - afd_label V4KDATA1 '/dev/mapper/v4kdata1'
24-Jul-15 12:38 NOTE: Verifying AFD driver state : loaded
24-Jul-15 12:38 NOTE: afdtool -add '/dev/mapper/v4kdata1' 'V4KDATA1'
24-Jul-15 12:38 NOTE:
24-Jul-15 12:38 ASMCMD-9513: ASM disk label set operation failed.

I’ve struggled to find any more meaningful message, even when I manually run the afdtool command shown in the log – but it seems pretty likely that this is failing due to the device being 4kN. I therefore assume that AFD still isn’t 4kN ready. I do wish Oracle would make some meaningful progress on its support of 4kN devices…

I/O Filter Protection

So now let’s investigate this protection that ASMFD claims to have against non-Oracle I/Os. First of all, what do those files in /dev/oracleafd/disks actually contain?

[root@server4 ~]# cd /dev/oracleafd/disks
[root@server4 disks]# ls -l
total 32
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA1
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA2
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA3
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA4
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA5
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA6
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA7
-rw-r--r-- 1 root root 22 Jul 24 12:34 V512DATA8
[root@server4 disks]# cat V512DATA1
/dev/mapper/v512data1

Aha. This is what I got wrong in my original post last year, because – keen as I was to start my summer vacation – I didn’t spot that these files are simply pointers to the relevant multipath device in /dev/mapper. So let’s follow the pointers this time.

Let’s remind ourselves that the files in /dev/mapper are actually symbolic links to /dev/dm-* devices:

root@server4 disks]# ls -l /dev/mapper/v512data1
lrwxrwxrwx 1 root root 8 Jul 24 12:34 /dev/mapper/v512data1 -> ../dm-14
[root@server4 disks]# ls -l /dev/dm-14
brw-rw---- 1 oracle dba 252, 14 Jul 24 12:34 /dev/dm-14

So it’s these /dev/dm-* devices that are at the end of the trail we just followed. If we dump the first 64 bytes of this /dev/dm-14 device, we should be able to see the AFD label:

[root@server4 disks]# od -c -N 64 /dev/dm-14
0000000                           (   o   u   t
0000020                                
0000040   O   R   C   L   D   I   S   K   V   5   1   2   D   A   T   A
0000060   1

There it is. We can also read it with kfed to see what ASM thinks of it:

[root@server4 ~]# kfed read /dev/dm-14
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                  1953853224 ; 0x00c: 0x74756f28
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 74756F28  [............(out]
000000010 00000000 00000000 00000000 00000000  [................]
000000020 4C43524F 4B534944 32313556 41544144  [ORCLDISKV512DATA]
000000030 00000031 00000000 00000000 00000000  [1...............]
000000040 00000000 00000000 00000000 00000000  [................]
  Repeat 251 times

So what happens if I overwrite it, as the root user, with some zeros? And maybe some text too just for good luck?

root@server4 ~]# dd if=/dev/zero of=/dev/dm-14 bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.00570833 s, 735 MB/s
[root@server4 ~]# echo CORRUPTION > /dev/dm-14
[root@server4 ~]# od -c -N 64 /dev/dm-14
0000000   C   O   R   R   U   P   T   I   O   N  \n          
0000020

It looks like it’s changed. I also see that if I dump it from another session which opens a fresh file descriptor. Yet in the /var/log/messages file there is now a new entry:

F 4626129.736/150724115533 flush-252:14[1807]  afd_mkrequest_fn: write IO on ASM managed device (major=252/minor=14)  not supported i=0 start=0 seccnt=8  pstart=0  pend=41943040
Jul 24 12:55:33 server4 kernel: quiet_error: 1015 callbacks suppressed
Jul 24 12:55:33 server4 kernel: Buffer I/O error on device dm-14, logical block 0
Jul 24 12:55:33 server4 kernel: lost page write due to I/O error on dm-14

Hmm. It seems like ASMFD has intervened to stop the write, yet when I query the device I see the “new” data. Where’s the old data gone? Well, let’s use kfed again:

[root@server4 ~]# kfed read /dev/dm-14
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                  1953853224 ; 0x00c: 0x74756f28
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 74756F28  [............(out]
000000010 00000000 00000000 00000000 00000000  [................]
000000020 4C43524F 4B534944 32313556 41544144  [ORCLDISKV512DATA]
000000030 00000031 00000000 00000000 00000000  [1...............]
000000040 00000000 00000000 00000000 00000000  [................]
  Repeat 251 times

The label is still there! Magic.

I have to confess, I don’t really know how ASM does this. Indeed, I struggled to get the system back to a point where I could manually see the label using the od command. In the end, the only way I managed it was to reboot the server – yet ASM works fine all along and the diskgroup was never affected:

SQL> alter diskgroup V512DATA check all;
Mon Jul 20 16:46:23 2015
NOTE: starting check of diskgroup V512DATA
Mon Jul 20 16:46:23 2015
GMON querying group 1 at 5 for pid 7, osid 9255
GMON checking disk 0 for group 1 at 6 for pid 7, osid 9255
GMON querying group 1 at 7 for pid 7, osid 9255
GMON checking disk 1 for group 1 at 8 for pid 7, osid 9255
GMON querying group 1 at 9 for pid 7, osid 9255
GMON checking disk 2 for group 1 at 10 for pid 7, osid 9255
GMON querying group 1 at 11 for pid 7, osid 9255
GMON checking disk 3 for group 1 at 12 for pid 7, osid 9255
GMON querying group 1 at 13 for pid 7, osid 9255
GMON checking disk 4 for group 1 at 14 for pid 7, osid 9255
GMON querying group 1 at 15 for pid 7, osid 9255
GMON checking disk 5 for group 1 at 16 for pid 7, osid 9255
GMON querying group 1 at 17 for pid 7, osid 9255
GMON checking disk 6 for group 1 at 18 for pid 7, osid 9255
GMON querying group 1 at 19 for pid 7, osid 9255
GMON checking disk 7 for group 1 at 20 for pid 7, osid 9255
Mon Jul 20 16:46:23 2015
SUCCESS: check of diskgroup V512DATA found no errors
Mon Jul 20 16:46:23 2015
SUCCESS: alter diskgroup V512DATA check all

So there you go. ASMFD: it does what it says on the tin. Just don’t try using it with 4kN devices…