The Battle For Your Databases

There’s a battle going on right now between all of the public cloud vendors – a war in the clouds. And you might be surprised to hear what they are fighting over… They are fighting over you. Or, more specifically, your business-critical databases.

Everybody has something in the cloud these days. On a personal level, we are all keeping our photos, our music and our emails in the cloud. Corporations have followed suit: email, document collaboration and workflow, backups, websites… Almost everything is in the cloud. Almost.

The Big Scary Stuff That Nobody Wants To Move

Pretty much every company with an on-prem presence will have one or more relational databases underpinning their critical applications. Oracle Database, Microsoft SQL Server, PostgreSQL, DB2 (the forgotten database of yesteryear: it’s still out there, but nobody likes to talk about it), MySQL… these products support mission-critical applications like CRM, ERM, e-commerce, all those SAP modules that I can never remember the names of… And in each industry vertical, there are critical systems: healthcare has Electronic Patient Records, retail has its warehouse management platforms, finance has all manner of systems labelled Do Not Touch.

These workloads are the last bastion of on-prem, the final stand of the privately-managed data centre. And just like mainframes, on-prem may never completely die, but we should expect to see it fade away this decade. The challenge, though, is the inertia caused by such massive amounts of complexity and the associated risk of disturbing it. I have witnessed DBA teams who draw lots over which unfortunate will have to log on to “that database”, the one in the corner that nobody understands or wants to touch when it’s working ok. So how are they going to migrate that entire thing into AWS or Azure? Everybody knows a story about an eighteen-month migration project that overran budget by 1000% and then failed, right?

The View From The Clouds

So you may ask, if all this complex, gnarly stuff is full of risk, why do the hyperscalers want it? The answer is, because this is the biggest game left on the hunting ground. These vast technology stacks are the crown jewels of on-prem data estates. If you are Cloud Vendor A, there are some important reasons why you really want to capture this workload into your cloud:

  1. Big applications and databases require a large recurring spend on premium cloud infrastructure
  2. Customers are used to spending large amounts of money to run these services
  3. The surrounding application ecosystem offers potential for the upsell of further cloud services (analytics, AI, business intelligence etc)
  4. Once that workload comes into your cloud, it’s probably never leaving. In other words, it’s a long-term guaranteed revenue stream.

The last point is especially important: vendors use the term sticky to describe workloads like this. The effort of migrating all that sensitive, critical data and all that impenetrable business logic (written ten years ago by developers who have long since moved on) means you are never going to want to do this more than once. Once it’s in, it’s in.

A Massive Anchor

Working with one of the hyperscalers, I have heard these databases described as anchor workloads (credit: Kellyn Pot’vin Gorman) because they are what holds back the migration of large, juicy and complex environments into the public cloud. Like the biggest beast on the savannah, they are the hardest to take down… but a successful capture means everybody gets to eat until they are full.

So if this is you – if you are in fact a massive anchor – it’s probably worth keeping this in mind. Migrating your complex, challenging workload to the public cloud might seem like a mammoth task from your perspective, but to the hyperscalers you are the goose that lays the golden egg. And they can’t wait to get cracking.

Side note: I originally planned to call this post “Cloud Wars”, but I discovered that my former Oracle colleague, the inestimable Bob Evans, had beaten me to it…

How To Look Stupid (Part #612)

Now is the winter of our discontent. But rather than dwell on what a terrible year 2020 has been, I thought I’d make my final post of the year something more positive… so I am going to look back on one of the (many) times I made a fool of myself, in the hope that 2021 will give me the chance to do so again.

When Computers Go Bad

In the late 1990s, I was fresh out of university and working in my first job, for a small company (5 people!) at London’s Heathrow Airport, as a developer and database admin. We provided cargo handling software for all of the big airlines and freight companies. And on this particular day, “Dave”* at Air Canada had a problem with his system.

My company’s software managed the customs clearance of all inbound air freight for most of the airport. In order for inbound freight to leave the secure warehouses on a truck, this software (which, for Air Canada, ran on their main HPUX server) would send a message to the central HM Customs computer and then, upon receiving clearance, print out an official “air waybill” document. The waybill was legal proof that goods had clearance to leave the warehouse: no waybill = no clearance = no freight.

An hour ago, Dave had called in with a major problem: goods were being cleared by customs, but no waybills were being printed. Air Canada now had a queue of lorries backed up at the warehouse and a crew that couldn’t do any work. There was nothing wrong with the printers, it was our software. Fix it, Dave begged us. Fix it now!

When DBAs Go Rogue

A senior colleague of mine, Denis**, was working on the problem and trying to test a fix on our lab system. He was also dialled in to Air Canada’s production system, on which our software ran – a crucial fact which turned out to be very important.

So when he called through to me from the server room to say, “Hey could you reboot the lab box?” I wandered over to his desktop and typed the magic reboot command on the first root window I found. Hey, one terminal session looks like another, right?

“Are you going to reboot it?” called Denis.

“I already have,” I yelled back, mildly irritated.

Denis stuck his head out of the door and stared at me, puzzled. I was then able to watch a whole range of emotions pass over his face: confusion changed to comprehension which in turn became outright horror.

I had just hard rebooted Air Canada’s entire UNIX platform with no warning to them at all.

Knowing When To Own Up

It took a little while for Air Canada to realise what (or who) had happened to them. Remember, this was the 1990s, so big iron UNIX systems took about 15-30 mins to restart – and everybody was connected via dumb terminals which would have just suddenly gone blank.

Fred was a DBA until he accidentally truncated the wrong table

I mainly spent this time in purgatory, thinking about alternative careers, planning my new life in a Tibetan monastery or hoping for a natural disaster to divert attention.

But eventually, my desk phone rang and our receptionist said, “Dave from Air Canada wants to speak to you”.

I can vividly remember the dry mouth, my sweaty palms holding the phone, my voice about three octaves too high.

“Yes?” I stammered.

“I don’t know what you’ve done,” said Dave, “but all the waybills are coming out again now. Thanks very much!”

It’s important, I think, to be honest in these situations. But not that honest. So I let Dave get back to his busy job and made a mental note to confess to what had really happened some time within the next 25 years. And then I filed that next to the other mental note – the one about never, ever typing reboot without triple checking which system you are connected to.
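
For what it’s worth, that second mental note can be turned into a few lines of code. The sketch below is purely illustrative and nothing like what we had in the 1990s: the function names are my own invention, and it only reports what it would do rather than actually rebooting anything. The idea is simply that the operator has to name the box they think they are on, and the script compares that with the host the session is really running on.

```python
import socket


def confirm_target(expected_host: str, actual_host: str, typed: str) -> bool:
    """True only if the session is on the expected host AND the operator
    has typed that host name back in full."""
    return actual_host == expected_host and typed.strip() == expected_host


def guarded_reboot(expected_host: str, typed: str) -> str:
    # socket.gethostname() reports the machine this session is really on,
    # which is the fact I failed to check.
    actual = socket.gethostname()
    if not confirm_target(expected_host, actual, typed):
        return f"REFUSED: this session is on '{actual}', not '{expected_host}'"
    # A real wrapper would invoke the system reboot command here; this
    # sketch deliberately stops at reporting the decision.
    return f"OK: would reboot '{actual}'"
```

Calling guarded_reboot("labbox", "labbox") from a window that is actually connected to production comes back with a refusal instead of fifteen minutes of blank terminals.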

Aspirations for 2021

When I look back at this story – and the many other times in my career when I made myself look stupid – I am grateful for the fact that things turned out ok. The whole year 2020 has felt like an elongated version of the purgatory I experienced above. But, as anybody who has ever rebooted a 1990s-era big iron UNIX server will attest, the login window only appears about ten seconds after you’ve finally admitted to yourself that it’s never coming back.

So let’s hope that 2021, like Dave and his waybill printouts, gets us back on track fast.

* The names of innocent parties have been changed to protect their identities

** Denis really was called Denis though

The Public Cloud: The Hotel For Your Applications

Unless you are Larry Ellison (hi Larry!), the chances are you probably live in a normal house or an apartment, maybe with your family. You have a limited number of bedrooms, so if you want to have friends or relatives come to stay with you, there will come a point where you cannot fit anybody else in without it being uncomfortable. Of course, for a large investment of time and money, you could extend your existing accommodation or maybe buy somewhere bigger, but that feels a bit extreme if you only want to invite a few people On to your Premises for the weekend.

Another option would be to sell up and move into a hotel. Pick the right hotel and you have what is effectively a limitless ability to scale up your accommodation – now everybody can come and stay in comfort. And as an added bonus, hotels take care of many dull or monotonous daily tasks: cooking, cleaning, laundry, valet parking… Freeing up your time so you can concentrate on more important, high-level tasks – like watching Netflix. And the commercial model is different too: you only pay for rooms on the days when you use them. There is no massive up-front capital investment in property, no need to plan for major construction works at the end of your five year property refresh cycle. It’s true pay-as-you-go!

It’s The Cloud, Stupid

The public cloud really is the hotel for your applications and databases. Moving from an investment model to a consumption-based expense model? Tick. Effectively limitless scale on demand? Tick. Being relieved of all the low-level operational tasks that come with running your own infrastructure? Tick. Watching more Netflix? Definite Tick.

But, of course, the public cloud isn’t better (or worse) than On Prem, it’s just different. It has potential benefits, like those above, but it also has potential disadvantages which stem from the fact that it’s a pre-packaged service, a common offering. Everyone has different, unique requirements but the major cloud providers cannot tailor everything they do to your individual needs – that level of customisation would dilute their profit margins. So you have to adapt your needs to their offering.

To illustrate this, we need to talk about car parking:

Welcome To The Hotel California

So… you decide to uproot your family and move into one of Silicon Valley’s finest hotels (maybe we could call it Hotel California?) so you can take advantage of all those cloud benefits discussed above. But here’s the problem: your $250/day suite only comes with one allocated parking bay in the hotel garage, yet your family has two cars. You can “burst” up by parking in the visitor spaces, but that costs $50/day and there is no guarantee of availability, so the only solution which guarantees you a second allocated bay is to rent a second room from the hotel!

This is an example of how the hotel product doesn’t quite fit with your requirements, so you have to bend your requirement to their offering – at the sacrifice of cost efficiency. (Incurring the cost of a second room that you don’t always need is called overprovisioning.) It happens all the time in every industry: any time a customer has to fit a specific requirement to a vendor’s generic offering, something somewhere won’t quite fit – and the only way to fix it is to pay more.
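
To put some numbers on the parking problem, here is a small sketch using the prices from the story. The 30-day month, and the idea that the second room has to be rented for the whole month to keep the bay, are my own assumptions for the sake of the arithmetic.

```python
SUITE_PER_DAY = 250        # from the story: one suite comes with one bay
VISITOR_BAY_PER_DAY = 50   # "burst" parking, with no guarantee of a space


def monthly_parking_cost(days_needed: int, days_in_month: int = 30) -> dict:
    """Compare bursting into visitor bays with renting a second room.

    Assumption: the second room must be held for the whole month to keep
    its allocated bay, even if the second car is only there some days.
    """
    burst = VISITOR_BAY_PER_DAY * days_needed
    second_room = SUITE_PER_DAY * days_in_month
    return {
        "burst": burst,
        "second_room": second_room,
        "premium_for_guarantee": second_room - burst,
    }
```

If the second car is only around for 10 days of the month, bursting costs $500 while the guaranteed bay costs $7,500. That $7,000 difference is the price of overprovisioning.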

The public cloud is full of situations like this. The hyperscalers have extensive offerings but their size means they are less flexible to individual needs. Smaller cloud companies can be more attentive to an individual customer’s requirements, but lack the economies of scale of companies like Amazon Web Services, Microsoft and Google, meaning their products are less complete and their prices potentially higher. The only real way to get exactly what you want 100% of the time is… of course… to host your data on your own kit, managed by you, on your premises.

Such A Lovely Place

I should state here for the record that I am not anti-public cloud. Far from it. I just think it’s important to understand the implications of moving to the public cloud. There are a lot of articles written about this journey – and many of them talk about “giving up control of your data”. I’m not sure I entirely buy that argument, other than in a literal data-sovereignty sense, but one thing I believe to be absolutely beyond doubt is that a move to the public cloud will require an inevitable amount of compromise.

That should be the end of this post, but I’m afraid that I cannot now pass up the opportunity to mention one other compromise of the public cloud, purely because it fits into the Hotel California theme. I know, I’m a sucker for a punchline.

You and your family have enjoyed your break at the hotel, but you feel that it’s not completely working – those car parking charges, the way you aren’t allowed to decorate the walls of your room, the way the hotel suddenly discontinued Netflix and replaced it with Crackle. What the …? So you decide to move out, maybe to another hotel or maybe back to your own premises. But that’s when you remember about the egress charges; for every family member checking out of the hotel, you have to pay $50,000. Yikes!

I guess it turns out that, just like with the cloud, you can check out anytime you like… but you can never leave.

Cloud DBA: The Next Generation of Database Administrator?

Don’t drop the ball…

In the previous post, I discussed (ok, ranted about) the evolution of the DBA role, looking at how many additional functions the database administrator has inherited over the years: code fixer, virtualisation tamer, Linux / Windows juggler, reluctant storage administrator, application server hater, firewall botherer and all round fixer of any product badged as Oracle.

But the real change I am interested in comes as a result of databases moving into the cloud. Because this exposes the DBA to ownership of a new problem: cost. Specifically, ongoing operational costs – or Opex. It is my belief that this is in fact A New Thing – and New Things are not to be trusted. Sure, in the on prem world, DBAs were involved in decisions concerning capital expenditure (Capex) like the scoping of database servers, the calculation of how many database licenses were needed, the justification of additional license options (e.g. Enterprise Edition instead of Standard Edition). But in most cases, those decisions were made by a collective and then signed off by the business.

My Public Cloud Bill Just Arrived…

Cloud is different. Everything you do in the public cloud costs money. You want to spin up an instance? Kerching. You want to use some SSD storage? Kerching! You want to download copies of your data to an on prem location? Egress charges ahoy… KERCHING!

Bills, Bills, Bills…

Decisions taken by DBAs in the normal course of their day jobs can now have a significant effect on the next invoice from the cloud vendor. Do you remember in the early days of cell phones, if you used your phone a lot you were never entirely sure what the bill would look like at the end of the month? Could be a little more than usual, could be so massive you need a loan from the World Bank. Sometimes, the cloud has a similar feel.

Most cloud vendors have remarkably complex pricing structures (some say this complexity is deliberate!) and this has in fact spawned a whole industry of experts (“cloud economists”) who can help customers understand and reduce their cloud costs, often using the two step principle of 1) turn stuff off, and 2) negotiate harder for discounts.
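
The two step principle is easy enough to sketch as arithmetic. Everything below is a made-up illustration: the hourly rate, the schedule and the discount are placeholders rather than any vendor’s real numbers, and the model assumes you pay by the hour only while the instance is running.

```python
def weekly_cost(hourly_rate: float,
                hours_per_day: int = 24,
                days_per_week: int = 7,
                discount: float = 0.0) -> float:
    """Weekly compute cost for an instance billed per running hour.

    hourly_rate and discount are hypothetical inputs; discount is a
    fraction, so 0.2 means 20% off the list price.
    """
    hours = hours_per_day * days_per_week
    return hourly_rate * hours * (1 - discount)


# Step 0: leave it running all week at list price.
always_on = weekly_cost(1.00)
# Step 1: turn stuff off (run 12 hours a day, weekdays only).
office_hours = weekly_cost(1.00, hours_per_day=12, days_per_week=5)
# Step 2: negotiate harder (a hypothetical 20% discount on top).
negotiated = weekly_cost(1.00, hours_per_day=12, days_per_week=5, discount=0.2)
```

With these placeholder numbers the bill falls from 168 hours’ worth of charges to 60, before the discount is even applied, which goes some way to explaining why step 1 is always step 1.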

Into this new minefield steps that brave warrior, the DBA. Often charged with the apparently simple task of “move that database into the cloud”, not only must a new technical language be learned (e.g. “it’s not a VM in the cloud, it’s an instance”) and a new set of TLAs be absorbed (“In my AWS VPC, I use EC2, EBS, S3 and ZXP”)… but also a new understanding must be gained of what each checkbox and pulldown option does to the operating cost.

Another Plate To Spin

It’s a whole new area of expertise to take on – and it’s complex. What’s more, it’s subtly different between cloud vendors – and even if you only use one cloud, it’s subject to change over time. Usually in the direction of more expensive.

Here’s a simple example: provisioning an instance. You are a DBA (congrats!) and you need to migrate your on prem database into, say, Amazon Web Services. You first of all need to configure a Linux instance and some disks. There are many different ways of doing this – including templates, infrastructure-as-code and so on – but let’s do it in the GUI for fun. First, you’ll need some compute power, so let’s provision some from the Elastic Compute Cloud (EC2). Which type shall we choose?

If you are new to this, there are a lot of options. I mean, really a lot. Let me see now, there are categories of General Purpose, Compute Optimized, Memory Optimized, Accelerated Computing, or Storage Optimized. These are just the categories… each one of which contains many types, which contains many options! But “General Purpose” sounds kinda normal, so let’s choose that. Now you need to choose the instance type:

Amazon Web Services – Elastic Compute Cloud choices for General Purpose instance types

If we go for an instance type of M5, we are told that “This family provides a balance of compute, memory, and network resources, and is a good choice for many applications”. Cool, so now you have to pick the instance size:

Amazon Web Services – EC2 M5 Large instance types

This screenshot only shows a fraction of the total choices, with each config of vCPUs and Memory replicated again in the m5d.* range (adds NVMe SSD storage), plus some further options around bare metal. It is a labyrinthine set of options to consider.

If you haven’t undertaken the myriad training courses for this cloud vendor, how do you know which instance size to choose? Well, maybe the same way that you specced up the config of your on prem database servers before… right? Except most DBAs didn’t do that, they were allocated servers without really playing a part in their procurement. But my real point here is that the choice you make reflects the ongoing monthly cost. And there are more choices to make! After all, you are going to need some storage from Elastic Block Store on which to place your database:

Amazon Web Services – Elastic Block Store volume types

Amazon recommends one of two different options for “I/O-intensive NoSQL and relational databases” plus a third for data warehouses. I’ll tell you right now, if your database is even mildly transactional, you will want to use io1 or io2. Whatever you choose, it will have an effect on the monthly cost – you can see this by checking it out on the AWS Calculator.

And you know what we didn’t even cover at the start? The region – the geographical location in which this instance runs – also changes the cost, sometimes significantly. Pricing for European regions is often surprisingly higher than for regions in the US.
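
To make the point concrete, here is a rough sketch of how those choices add up into a monthly number. The region names and every price in it are placeholders I have invented for illustration, not real AWS pricing (that is what the calculator is for). The shape of the sum is the part that matters: compute is charged by the hour, while provisioned-IOPS storage is, as I understand the pricing model, charged both per GB and per provisioned IOPS.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12 months


@dataclass(frozen=True)
class RegionPrices:
    instance_per_hour: float           # compute
    storage_per_gb_month: float        # block storage capacity
    per_provisioned_iops_month: float  # provisioned IOPS


# PLACEHOLDER prices in dollars: invented for this sketch, not real pricing.
PRICES = {
    "region-a": RegionPrices(0.10, 0.125, 0.065),
    "region-b": RegionPrices(0.12, 0.140, 0.072),
}


def monthly_estimate(region: str, storage_gb: int, provisioned_iops: int) -> float:
    """Rough monthly cost of one always-on instance plus one volume."""
    p = PRICES[region]
    compute = p.instance_per_hour * HOURS_PER_MONTH
    storage = p.storage_per_gb_month * storage_gb
    iops = p.per_provisioned_iops_month * provisioned_iops
    return compute + storage + iops
```

With these invented numbers, the same instance with 1,000 GB and 5,000 provisioned IOPS comes to about $523 a month in region-a and about $588 in region-b. Same database, same DBA, different bill.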

Why This Matters (TL;DR)

What I am trying to show here is that, in the course of provisioning databases in the cloud, DBAs are having to make complicated choices which not only affect the performance of their databases but also the ongoing cost. In fact, it’s a balancing act: performance and cost are two sides of the same coin. Amazon Web Services, in the example above, offers a huge and dazzling array of options which offer different trade offs for these two dimensions. That’s not a bad thing by the way – I am not criticising AWS for giving us a choice – but it’s bewildering to the uninitiated.
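
One way to picture the balancing act is as a search: find the cheapest configuration that still meets the database’s needs. The sketch below uses the vCPU and memory figures for four M5 sizes, but the hourly prices are placeholders I have made up, and the cheapest_fit function is my own illustration rather than anything AWS provides.

```python
from typing import NamedTuple, Optional, Sequence


class Size(NamedTuple):
    name: str
    vcpus: int
    memory_gib: int
    price_per_hour: float  # PLACEHOLDER price, not real AWS pricing


CATALOG = [
    Size("m5.large", 2, 8, 0.10),
    Size("m5.xlarge", 4, 16, 0.20),
    Size("m5.2xlarge", 8, 32, 0.40),
    Size("m5.4xlarge", 16, 64, 0.80),
]


def cheapest_fit(min_vcpus: int, min_memory_gib: int,
                 catalog: Sequence[Size] = CATALOG) -> Optional[Size]:
    """Return the lowest-priced size meeting both minimums, or None."""
    candidates = [s for s in catalog
                  if s.vcpus >= min_vcpus and s.memory_gib >= min_memory_gib]
    return min(candidates, key=lambda s: s.price_per_hour, default=None)
```

Ask for 6 vCPUs and 24 GiB and you get the m5.2xlarge: a quarter of its CPU and memory is paid for but not needed, which is the second hotel room all over again.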

What’s more, if you put a database in Microsoft Azure, or Google Cloud Platform, or Oracle Cloud Infrastructure, or Alibaba Cloud or … I can’t think of any other clouds … then be prepared for the fact that everything changes again.

It’s time for DBAs to learn to juggle with yet another ball.

 

Evolution of the DBA

In the previous post, I looked at Gartner’s recent assertion that 75% of databases will be deployed to the cloud by 2022 – and that the cloud is now the default platform for managing data.

The massive shift to the public cloud has a lot of implications, many of which have been written about at length over the last few years. But one question I don’t think has been asked enough is: what does this mean for the poor, beleaguered database administrator? Let’s start with a look at the journey DBAs have been on since “the old days”.

DBA 1.0: The (Good) Old Days

Data centres used to contain four distinct tribes of beings living in semi-peaceful co-existence: SysAdmins, DBAs, Network Admins and Storage Admins. Four groups of specialists, each with a distinct skillset and a fairly delineated boundary of responsibility. I say four, it was really three – as everybody who remembers this era will attest, Network Admins were actually mythical creatures who never inhabited their desks; historical evidence now suggests that they were actually just a simple script which automatically closed any ticket with the phrase “No problems were found with the network”.

The database administrator occupied a unique position in this family, because they lived further up in the application stack and so dealt with developers and application owners, business users and sometimes – whisper it – those wondrous beings, the “end users”. Conveniently, this made the DBA the perfect person to blame for almost any problem at any layer in the stack. Application slow? Must be a database problem. Query taking too long? MUST be a database problem. Never mind that the database server doesn’t have enough memory, the developers have no concept of how to code in SQL and the storage system is a RAID5 bag of spanners running on spinning rust… it’s always a database problem. And we know it’s not a networking problem because it says here that “No problems were found with the network”.

One outcome of this “unique” position was that many DBAs had to learn skills outside of their core profession (networking, Linux or Windows admin skills, SQL tuning, PL/SQL decoding, hostage negotiation etc). I’d love to say this thirst for knowledge was due to professional pride, but the best DBAs I ever met simply learned these skills so they could prove they weren’t in the wrong and thus get an easier life. “Oh you think your SQL runs slow because of my database huh? Well if you rewrote it like this, it runs in 10% of the time and doesn’t make all the lights go dim in the data centre, you imbecile…”

DBA 2.0: The IT Generalist

As the data centre evolved and new technologies such as Virtualization, NoSQL, Hadoop and the Cloud became prevalent, the clearly defined roles of yesteryear started to become blurred. In the last decade, we saw the rise of a new creature in the data centre: The IT Generalist. Of course, this is mainly just another way of saying DBA with Extra Responsibilities (but no extra pay). It is now commonplace for DBAs to be managing a multitude of different technologies outside of the traditional RDBMS: many DBAs are managing, at least at some level, VMware clusters or other virtualization platforms; I know DBAs who have had tangles with firewalls and software-defined networking… I have even met a large number of DBAs who admin their All-Flash storage arrays (simpler than the old fashioned disk array, after all).

As a side note, anyone with the job title of “Oracle DBA” also found themselves lumbered with managing any technology which was Oracle-badged – and that’s a lot of stuff. Fusion Middleware, Oracle Linux, WebLogic, Oracle ZFS Appliance, anything running under Automatic Storage Management, even Java! The list goes on… how long before somebody gets a ticket because TikTok isn’t working properly?

Larry Ellison might have famously said he wants to get rid of the DBA, but the reality is that the DBA role has just become even more wide-ranging.

DBA 3.0: The Cloud DevOps DBA

Fast forward to 2020, the DBA is now managing applications running on databases which run in containers on virtual machines in the cloud, probably deployed via some sort of infrastructure-as-code implementation. Hey, the dream of the modern IT organisation is to achieve some utopian level of automation – and it’s the DBA who has the most practice of automating cross-function tasks; they’ve been trying to do it for years just for an easier life. (Note how the dream of “an easier life” motivates so much of DBA behaviour!)

Of course, everything is now DevOps too… right? If you aren’t DevOps, you aren’t in the gang. Remember when everything had to be agile? But, when you scratched the surface, “agile” was just a way of saying “we haven’t documented any of this”. Well, DevOps has taken over from agile as the buzz word of choice. And the literal translation of “DevOps” is “we still didn’t document anything but also we aren’t going to follow any kind of change control procedures or put any of these code releases through anything more than the most primitive of testing routines, so good luck”.

But in this long evolutionary journey, there is one thing that DBAs have never been exposed to … until today. Cost. As a DBA, you may have had to argue for more powerful servers, faster CPUs, more database processor licenses, cost options (“I need the Tuning Pack, damnit!”), but the cloud is a different ball game. A DBA building a database in the public cloud is making decisions which have a direct effect on the (quite possibly massive) monthly bill from AWS / Azure / GCP / Oracle Cloud / other vendor of choice. This is what I wanted to look at in this post before I got massively carried away.

DBAs of the World, Unite!

I’ll be honest, I didn’t intend this post to become some sort of DBA Manifesto, but once I started typing I couldn’t stop. Blogging is like that sometimes. In the next post, we’ll delve a bit deeper into the future of DBAs and the cloud cost angle. In the meantime, let’s summarise:

Everybody knows that the DBA is the humble, hard-working hero of Enterprise IT: dedicated and underpaid, overburdened and undertrained, blamed for everything and thanked for nothing… the DBA really is the Morlock of the data centre, working long nights and hard weekends to keep all those wonderful, spoilt Eloi end users happy*. If you are a DBA, give yourself a pat on the back for surviving this evolutionary journey. If you’re a SysAdmin, be honest: you guys need to buy your DBAs a drink now and then. And if you are a Network Admin: stick to the script.

* If the Morlock and Eloi references aren’t working for you, read this.

Databases Now Live In The Cloud

 

I recently stumbled across a tech news post which surprised me so much I nearly dropped my mojito. The headline of this article screamed:

Gartner Says the Future of the Database Market Is the Cloud

Now I know what you are thinking… the first two words probably put your cynicism antenna into overdrive. And as for the rest, well duh! You could make a case for any headline which reads “The Future of ____________ is the Cloud”. Databases, Artificial Intelligence, Retail, I.T., video streaming, the global economy… But stick with me, because it gets more interesting:

On-Premises DBMS Revenue Continues to Decrease as DBMS Market Shifts to the Cloud

Yeah, not yet. That’s just a predictable sub-heading, I admit. But now we get to the meat of the article – and it’s the very first sentence which turns everything upside down:

By 2022, 75% of all databases will be deployed or migrated to a cloud platform, with only 5% ever considered for repatriation to on-premises, according to Gartner, Inc.

Boom! By the year 2022, 75% of all databases will be in the cloud! Even with the cloud so ubiquitous these days, that number caused me some surprise.

Also, I have so many questions about this:

  1. Does “a cloud platform” mean the public cloud? One would assume so but the word “public” doesn’t appear anywhere in the article.
  2. Does “all databases” include RDBMS, NoSQL, key-value stores, what? Does it include Microsoft Access?
  3. Is the “75%” measured by the number of individual databases, by capacity, by cost, by the number of instances or by the number of down-trodden DBAs who are trying to survive yet another monumental shift in their roles?
  4. How do databases perform in the public cloud?
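
Question 3 matters more than it might seem, because the same estate can give wildly different answers depending on what you count. The sketch below uses an entirely hypothetical four-database estate (every name and figure is invented) to show the effect.

```python
from typing import Callable, NamedTuple, Sequence


class Database(NamedTuple):
    name: str
    in_cloud: bool
    capacity_gb: int
    annual_cost: int


# HYPOTHETICAL estate: three small cloud databases and one on-prem monster.
ESTATE = [
    Database("intranet-wiki", True, 100, 2_000),
    Database("dev-sandbox", True, 200, 3_000),
    Database("marketing-site", True, 200, 5_000),
    Database("core-erp", False, 49_500, 490_000),
]


def cloud_share(estate: Sequence[Database],
                weight: Callable[[Database], float]) -> float:
    """Percentage of the estate that is in the cloud, by the given weight."""
    total = sum(weight(db) for db in estate)
    in_cloud = sum(weight(db) for db in estate if db.in_cloud)
    return 100 * in_cloud / total


by_count = cloud_share(ESTATE, lambda db: 1)
by_capacity = cloud_share(ESTATE, lambda db: db.capacity_gb)
by_cost = cloud_share(ESTATE, lambda db: db.annual_cost)
```

By number of databases, this imaginary company is 75% in the cloud and bang on target. By capacity it is 1%, and by cost it is 2%. The one database that actually matters hasn’t moved.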

Now, I’m writing this in mid-2020, in the middle of the global COVID-19 pandemic. The article, which is a year old and so pre-COVID-19, makes the prediction that this will come true within the next two years. It doesn’t allow for the possibility of a total meltdown of society or the likelihood that the human race will be replaced by Amazon robots within that timeframe. But, on the assumption that we aren’t all eating out of trash cans by then, I think the four questions above need to be addressed.

Questions 1, 2 and 3 appear to be the domain of the authors of this Gartner report. But question 4 opens up a whole new area for investigation – and that will be the topic of this next set of blogs. But let’s finish reading the Gartner notes first, because there’s more:

“Cloud is now the default platform for managing data”

One of the report’s authors, long-serving and influential Gartner analyst Merv Adrian, wrote an accompanying blog post in which he makes the assertion that “cloud is now the default platform for managing data”.

And just to make sure nobody misunderstands the strength of this claim, he follows it up with the following, even stronger, remark:

On-premises is the past, and only legacy compatibility or special requirements should keep you there.

Now, there will be people who read this who immediately dismiss it as either obvious (“we’re already in the cloud”) or gross exaggeration (“we aren’t leaving our data centre anytime soon”) – such is the fate of the analyst. But I think this is pretty big. Perhaps the biggest shift of the last few decades, in fact.

Why This Is A Big Deal

The move from mainframes to client/server put more power in the hands of the end users; the move to mobile devices freed us from the constraints of physical locations; the move to virtualization released us from the costs and constraints of big iron; but the move to the cloud is something which carries far greater consequences.

After all, the cloud offers many well-known benefits: almost infinite scalability and flexibility, immunity to geographical constraints, costs which are based on usage (instead of up-front capital expenditure), and a massive ecosystem of prebuilt platforms and services.

And all you have to give up in return is complete control of your data.

Oh and maybe also the predictability of your I.T. costs – remember in the old days of cell phones, when you never exactly knew what your bill would look like at the end of the month? Yeah, like that, but with more zeroes on the end.

Over to Merv to provide the final summary (emphasis is mine):

The message in our research is simple – on-premises is the new legacy.  Cloud is the future. All organizations, big and small, will be using the cloud in increasing amounts. While it is still possible and probable that larger organizations will maintain on-premises systems, increasingly these will be hybrid in nature, supporting both cloud and on-premises.

The two questions I’m going to be asking next are:

  1. What does this shift to the cloud mean for the unrecognised but true hero of the data center, the DBA?
  2. If we are going to be building or migrating all of our databases to the cloud, how do we address the ever-critical question of database performance?

Link to Source Article from Gartner

Link to Merv Adrian blog post

Don’t Call It A Comeback

I’ve Been Here For Years…

Ok, look. I know what I said before: I retired the jersey. But like all of the best superheroes, I’ve been forced to come out of retirement and face a fresh challenge… maybe my biggest challenge yet.

Back in 2012, I started this blog at the dawn of a new technology in the data centre: flash memory, also known as solid state storage. My aim was to fight ignorance and misinformation by shining the light of truth upon the facts of storage. Yes, I just used the phrase “the light of truth”, get over it, this is serious. Over five years and more than 200 blog posts, I oversaw the emergence of flash as the dominant storage technology for tier one workloads (basically, databases plus other less interesting stuff). I’m not claiming 100% of the credit here, other people clearly contributed, but it’s fair to say* that without me you would all still be using hard disk drives and putting up with >10ms latencies. Thus I retired to my beach house, secure in the knowledge that my legend was cemented into history.

But then, one day, everything changed…

Everybody knows that Information Technology moves in phases, waves and cycles. Mainframes, client/server, three-tier architectures, virtualization, NoSQL, cloud… every technology seems to get its moment in the sun…. much like me recently, relaxing by the pool with a well-earned mojito. And it just so happened that on this particular day, while waiting for a refill, I stumbled across a tech news article which planted the seed of a new idea… a new vision of the future… a new mission for the old avenger.

It’s time to pull on the costume and give the world the superhero it needs, not the superhero it wants…

Guess who’s back?

* It’s actually not fair to say that at all, but it’s been a while since I last blogged so I have a lot of hyperbole to get off my chest.

The Final Post: Hardware Is Dead

Hanging up the jersey

Well, my friends, this is it. The time has come to retire the flashdba jersey after more than seven years of fun and frolics. In part one of this post, I looked back at my time in the All-Flash storage industry and marvelled at the crazy, Game of Thrones-style chaos that saw so many companies arrive, fight, merge, split up and burn out. Throughout that time, I wrote articles on this blog site which attempted to explain the technical aspects of All-Flash as the industry went from niche to mainstream. Like many technical bloggers, I found this writing process enjoyable and fulfilling, because it helped me put some order to my own thoughts on the subject. But back in 2017, something changed and my blogs became less and less frequent… eventually leading here. I’ll explain why in a minute, but first we need to talk about the title of this post.

Hardware Is Meh

Back in my Oracle days, I worked with a product called Exadata – a converged database appliance which Oracle marketed as “hardware and software engineered to work together”. For a time, Oracle’s “Engineered Systems” were the future of the company and, therefore, the epicentre of their marketing campaigns. Today? It’s all about the Oracle Cloud. And this is actually a perfect representation of the I.T. industry as a whole… because, here in 2019, nobody wants to talk about hardware anymore. Whether it’s hyper-converged systems, All-Flash storage, “Engineered” database appliances or basic server and networking infrastructure, hardware is just not cool anymore.

For a long time, companies have purchased hardware systems as a capital expense, the cost then being written off over a number of years, at which point the dreaded hardware refresh is required. Choosing the correct hardware specifications (capacity, performance, number of ports etc) has always been extremely challenging because business is unpredictable: buy too small and you will need to upgrade at some point down the line, which could be expensive; buy too big and you are overpaying for resources you may never use. And also, if you are a small company or a startup, those capital expenses can be very hard to fund while you wait for revenue to build.

Today, nobody needs to do this anymore. The cloud – and in particular the public cloud – allows companies to consume exactly what they need, just when they need it – and fund it as an operating expense, with complete flexibility. One of the great joys of the public cloud is that hardware has been commoditised and abstracted to such a degree that you just don’t need to care about it anymore. Serverless, you might say… (IT has always been fond of a ridiculous buzzword)

The Vendor View: AWS Is The New Enemy

For infrastructure vendors, the industry has reached a new tipping point. A few years ago, if you worked in sales for a storage startup (like me), you found business by targeting EMC customers who were unhappy with the prices they were paying / service they were getting / quality of steaks being bought for them by their EMC rep. Ditto, to a lesser extent, with HP and IBM, but EMC was the big gorilla of the marketplace. Today, everybody in storage has a new #1 enemy: Amazon Web Services, with Microsoft Azure and Google Cloud Platform making up the top three. But make no mistake, AWS is eating everybody’s lunch – and the biggest challenge for the rest is that in many customers’ eyes, the public cloud is Amazon Web Services. (EMC, meanwhile, doesn’t even exist anymore but is instead a part of Dell… that would have been impossible to imagine five years ago).

Cloud ≠ Public Cloud

However, nobody (sane) is predicting that 100% of workloads will end up in the public cloud (and let’s be honest now, when we say “public cloud” we basically mean AWS, Azure and GCP). For some companies – where I.T. is not their core business – it makes perfect sense to do everything in the cloud. But for others, various reasons relating to control, risk, performance, security and regulation will mean that at least some data remains on premises, in private or hybrid clouds. You can argue among yourselves about how much.

So, for those people who still require their own infrastructure, what now? Once you’ve seen how easy it is to use the public cloud, sampled all the rich functionality of AWS and fallen into the trap of having staff paying for AWS instances on their credit cards (so-called “Shadow IT“), how do you go back to the old days of five-year up-front capital investments into large boxes of tin which sit in the corner of your data centre and remain stubbornly inflexible?

Consumption-Based Infrastructure

Ok let’s get to the conclusion. A couple of years ago, Kaminario (my employer) decided to exit the hardware business and become a software company. Like most (almost all) All-Flash storage vendors, Kaminario uses commodity whitebox components (basically, Intel x86 servers and enterprise-class SSDs) for the hardware chassis and then runs their own software on top to turn them into high-performance, highly-resilient and feature-rich storage platforms. Everybody does it: DellEMC’s XtremIO, Pure Storage, Kaminario, HP Nimble, NetApp… all of the differentiation in the AFA business is in software. So why purchase hardware components, manufacture and integrate them, keep them in inventory and then pass on all that extra cost to customers when your core business is actually software?

Kaminario decided to take a new route by disaggregating the hardware from the software and then handing over the hardware part to someone who already sells millions of hardware units all around the globe. Now, when you buy a Kaminario storage array, you get exactly the same physical appliance, but you (or your reseller) actually buy the hardware from Tech Data at commodity component cost. You then buy a consumption-based license to use the software from Kaminario based on the number of terabytes of data stored. This can be on a monthly Pay As You Go model or via a pre-paid subscription for a number of years. In a real sense, it is the cloud consumption model for people who require on-prem infrastructure.

There are all sorts of benefits to this (most customers never fill their storage arrays above 80% capacity, so why always pay for 100%?), but I’m not going to delve into them here because this is not a sales pitch, it’s an explanation for what I did next.

What I Did Next

Seeing as Kaminario decided to make a momentous shift, I thought it was a good time to make one of my own. So, two years ago, I took the decision to leave the world of technical presales and become a software sales executive. As in, a quota-carrying, non-technical, commercial sales guy with targets to hit and commission to earn. Presales people also earn commission, but are far more protected from the “lumpy” highs and lows that come with complex and lengthy high-value sales cycles (what sales people call “big ticket sales”). In commercial sales, the highs are higher and the lows are lower – and the risks are definitely riskier. Since my new role coincided with the company going through an entire change of business model, the risk was pretty hard to quantify, but I’m pleased to say that 2018 was the company’s best ever year, not just globally but also in the territory that I now manage (the United Kingdom).

More importantly for me, I’m now two years into this new journey and I have zero regrets about the decision to leave my technical past behind. I’ve learnt more than ever before (often the hard way) and I’ve experienced all the highs and lows one might expect, but I still get the same excitement from this role that I used to get in the early days of my technical career.

So, the time is right to hang up the technical jersey and bid flashdba farewell. It’s been fun and I want to say thank you to everybody who read, commented, agreed or disagreed with my content. There are almost 200 posts and pages on this site which I will leave here in the hope that they remain useful to others – and as a sort of virtual monument to my former career.

In the meantime, I’ve got to go now, because there are meetings to be had, customers to be entertained, dinners to be expensed and (hopefully) deals to be closed. Farewell, my friends, stay in touch… and remember, if you need to buy something… call me, yeah?

— flashdba —

[September 2020 Spoiler Alert: I couldn’t stay away]

Flash Debrief: The End (part 1)

Seven years ago this month, I created a blog and online presence called flashdba to mark the start of my journey away from Oracle databases (and DBAing) into the newly-born All-Flash Storage industry. Six years ago this month, I posted the first in what transpired to be a very long blog series attempting to explain the concepts of All Flash to those few who were interested. At least, I always assumed it would be a few, but now here we are in 2019 and the flashdba.com blog has been read over a million times, referenced in all sorts of surprising places and alluded to by Chris Mellor at The Register. One of my articles even (allegedly) got a mention by Mark Hurd during an Oracle forecast call!

But now, for various reasons that I will explain later, it’s time to draw it to a conclusion.

Review

Wow! What a ride it’s been, huh? Seven years ago, I joined a company called Violin Memory who were at the forefront of the infant (or should that be infantile?) flash industry. At one point, Violin had a global partnership with HP to make an “Exadata-killer” machine and had a valuation estimated to be around $2bn. EMC even wrote a secret briefing document in which they said, “Violin … is XtremIO’s #1 competitor in the all-flash storage market”. Meanwhile, numerous other small flash companies were being acquired for ridiculous, crazy and obscene money despite often being “pre-product” or pre-GA. So it took a particularly special effort for Violin Memory to take that head start and end up in Chapter 11 bankruptcy in December 2016. (The company is reborn as Violin Systems now, of course – and I still have friends there, so out of respect for them I have to keep my Violin stories under wraps. Which is a shame, because boy do I have some great stories…)

Meanwhile, back in 2015, I’d decided to leave Violin Memory and join another All Flash pioneer, Kaminario – where I remain today. It’s fair to say that Violin Memory didn’t appreciate that decision, with the result that I had to spend a lot of time dealing with their lawyers. You feel very small when you are a sole person engaged in a legal dispute with a corporation who can afford an expensive legal team – you become enormously aware of the difference in spending power (although, in hindsight, perhaps Violin could have used those legal fees elsewhere to better effect). So, when the CEO of Kaminario interrupted his family holiday to call me and assure me that they would stand by me throughout the dispute, it left me with a real glimpse into the difference in culture between my former employer and my current one. Also, Kaminario’s lawyers were a lot better!

The Flash Storage Wars (available now as a boxset)

The road from 2012 to 2019 is littered with the bloody carcasses of failed flash companies. From the disasters (Violin Memory, Skyera, FusionIO, Tintri, Whiptail, DSSD) to the acquisitions (Texas Memory Systems, XtremIO, Virident, SolidFire, Nimble) – not all of which could be considered successful – to the home-grown products which never really delivered (I’m looking at you, Oracle FS1). One company, Pure Storage, managed to beat the odds, ride out some stormy times and go from startup to fully-established player, although following their IPO the stock market has never really given them a lot of love. Meanwhile, EMC – the ultimate big dog of storage – was acquired by Dell, while HP split into two companies and NetApp continued to be linked with an acquisition by Lenovo or Cisco. Someday, somebody is going to turn the whole story into a boxset and sell it to Netflix for millions. Game of Thrones eat your heart out.

Yet there can be no doubt that All Flash itself has succeeded in its penetration of the previously disk-dominated enterprise storage market, with IDC regularly reporting huge year-on-year growth figures (e.g. 39.3% between Q3-2017 and Q4-2018). I vividly remember, back in 2012, having to explain to every prospective customer what flash was and why it was important. Today, every prospect has already decided they want All Flash. In fact, AFAs have become so mainstream now that, starting this year, Gartner will be merging its Solid State Array Magic Quadrant with the more traditional MQ for General-Purpose Disk Arrays. It just doesn’t make sense to have two separate models now.

So Who Won?

Good question. Was it DellEMC, the biggest company in storage and the current #1 in market share? Was it Pure Storage, who led Gartner’s most recent Solid State Array Magic Quadrant (but have it all to lose when the SSA MQ merges with the general-purpose MQ)? Or was it any number of investors and venture capitalists who managed to make money on the back of such market disruption? It’s a subjective question so you can choose your own answer. But for me, it’s very clear that there was only one winner… and back in 2012 we had no idea (although my old boss called it over a decade ago… I should have paid more attention). The ultimate winner of this war – and many other wars besides – is the cloud.

In part two – the final ever blog in this series (and possibly at all – spoiler alert), I’ll explain why I think the cloud is the ultimate winner… and why I’m calling time on flashdba after all these years. Wipe away those tears, my friends – not long now.

See also: this article apparently inspired the highly respected storage-industry journalist Chris Mellor to write A Potted History of All-Flash Arrays over at Blocks and Files. Thanks Chris, I’m honoured!

Oracle ASM and Thin Provisioning – How To Reclaim Space

It came to my attention last November that I had crossed the one year anniversary since my last post on flashdba.com. I was so surprised that I immediately decided to write a new post, which took another three months. There are reasons why I’m no longer posting technical blogs about databases and flash, but I’ll cover them in a later post. No, not that late – I hope.

In the meantime, I thought I’d write a note on this subject because I’ve lost count of the number of times I’ve been asked questions on the topic of Oracle ASM and Thin Provisioning. Normally, I’m asked by customers or prospects who think there is an issue with their storage system… whereas, in fact, the problem is entirely storage-agnostic.

But first, some background.

Thin Provisioning

Thin Provisioning (TP) is used to describe the overcommitment of storage capacity. Your host may think it’s been allocated 10TB of capacity and is currently using 2TB, but the storage platform has only really allocated the 2TB used and the remaining 8TB may not even exist. Why would you want this? Because in a multi-host environment (where hosts could be virtual or physical), the amount of allocated-but-unused capacity could be significant. Without TP, serious amounts of capacity would need to be provisioned which may never be used, but with TP all the hosts can be “fooled” into thinking they have been allocated what they want while the actual utilised capacity is only the sum of what each host has used.
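To put some numbers on that, here is a minimal sketch of the overcommitment arithmetic. The host names and capacities are made up purely for illustration; the only point is the gap between what the hosts have been promised and what has really been written.

```python
# Hypothetical hosts: (name, TB presented to the host, TB actually written)
hosts = [
    ("db01", 10, 2),
    ("db02", 10, 4),
    ("app01", 5, 1),
]

# What the hosts believe they own, added together.
provisioned_tb = sum(presented for _, presented, _ in hosts)

# What a thin provisioning array really has to back with physical capacity.
used_tb = sum(written for _, _, written in hosts)

# Capacity that has been promised but does not need to exist yet.
overcommitted_tb = provisioned_tb - used_tb

print(f"Presented to hosts: {provisioned_tb} TB")
print(f"Actually consumed:  {used_tb} TB")
print(f"Promised but not yet backed: {overcommitted_tb} TB")
```

Without TP, the array in this example would need all 25TB on day one; with TP it only needs to find the 7TB that has been written.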

Where things can get a bit complicated with TP is that many layers in your stack may be thin provisioning storage to the layers above them. Most storage arrays are capable of TP (or indeed mandate its use), but hypervisors often have thin provisioning options too. Meanwhile, some applications which create data store structures have options which can help or hinder the use of TP. For example, VMware has the ability to create virtual disks which are thin, thick (lazy zeroed) or thick (eager zeroed). As a result, it isn’t always obvious to the underlying storage whether a particular set of allocated blocks are really in use or not. Won’t somebody think of the poor storage array?

Trim and Unmap

Consider the situation where a large file is created and then deleted in a filesystem on a typical operating system. Commonly, the deletion process doesn’t really delete anything other than the metadata telling the filesystem where the file resided. Thus the underlying file data remains until such time as something else comes along and overwrites it. This is beneficial because it is faster and requires less work than trying to overwrite the file with (for example) zeros. But if the filesystem resides on a storage array which uses TP, how will the storage array know that the space allocated to the file is now free? It can’t – unless the filesystem has a way of telling it.

For this purpose there exists a set of OS calls known as trim commands – and for the SCSI protocol (used by most block storage devices such as SANs) the command is known as UNMAP. Issuing one of these commands allows the calling layer (the filesystem, or perhaps a volume manager) to notify the storage platform that a specific set of blocks are no longer in use and can be “unmapped”, freeing space. As a side note, large calls to UNMAP can often have temporary but unexpected consequences on storage performance, as large amounts of metadata may need to be updated.
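The difference between deleting a file and actually telling the storage about it can be sketched with a toy model. To be clear, the ThinVolume class below is hypothetical and is not any real array or operating system API; it simply tracks which blocks the array has been forced to back with real capacity.

```python
class ThinVolume:
    """Toy model of a thin provisioned volume (hypothetical, not a real array API)."""

    def __init__(self, size_blocks):
        self.size_blocks = size_blocks  # what the host believes it has
        self.mapped = set()             # blocks the array has really allocated

    def write(self, blocks):
        # Any write forces the array to back those blocks with real capacity.
        self.mapped.update(blocks)

    def unmap(self, blocks):
        # The host tells the array that these blocks are no longer needed.
        self.mapped.difference_update(blocks)

    @property
    def used(self):
        return len(self.mapped)


volume = ThinVolume(size_blocks=1000)
file_blocks = range(0, 400)

volume.write(file_blocks)        # create a large file
used_after_write = volume.used

# Deleting the file only changes filesystem metadata, so the array sees nothing
# and nothing is called on the volume here.
used_after_delete = volume.used

volume.unmap(file_blocks)        # the filesystem issues trim/UNMAP for the freed range
used_after_unmap = volume.used

print(used_after_write, used_after_delete, used_after_unmap)
```

In this model the used count only drops once unmap is called, which is exactly the notification that a plain delete never sends.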

Oracle ASM: Unmap is for Wimps

Let’s get straight to the point here: Oracle’s Automatic Storage Management doesn’t natively use UNMAP commands. Quelle surprise. But there are still ways to free up space back to thin provisioned arrays. Two in fact: let’s call them the bad way and the good way. First though, let’s set up the scenario:

Test Scenario

Consider the situation where an Oracle ASM diskgroup is created on a 10TB volume group presented from a thin provisioning All-Flash storage array. The DBA then creates a large “bigfile” tablespace in the diskgroup, with a 5TB datafile (the rest of the database resides elsewhere). Anyone who has sat waiting for the CREATE TABLESPACE command for any period of time will be aware that, during the datafile creation process, Oracle likes to fill the whole file with empty blocks. From Oracle’s perspective, this has the advantage of ensuring that the entire datafile capacity has been marked as used by the storage array. In other words, it’s not “fake” thin provisioned space which may or may not be available, but real available capacity which now belongs to Oracle. (You may also recall that Oracle no longer takes this approach with tempfiles, instead using the faster “sparse” allocation method.)

At this point, what will the volumes on the storage array be showing? We know that 10TB has been allocated, of which 5TB has been used. So shouldn’t that leave 5TB free? Probably not, because almost every All-Flash storage array uses data reduction technologies such as compression, deduplication and zero-detect. Since each block in the tablespace contains a unique block number, deduplication isn’t going to add any value (which is why arrays like the Kaminario allow dedupe to be disabled on a per-volume basis), but compression is going to have great fun with all the emptiness inside each Oracle block so the storage array will probably show significantly less than 5TB used.

Next, our enterprising DBA watches a Connor McDonald video about DBMS_RANDOM and gets a little overexcited, then fills the entire tablespace with random data to the point that the storage array can hardly achieve any compression. The outcome? Allocated = 10TB, Used = 5TB, Free = 5TB.
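If you want to see why random data ruins the party, here is a rough sketch using Python’s zlib as a stand-in for the array’s compression engine. That is an assumption on my part: real arrays use their own algorithms, and a real Oracle block carries a header rather than being pure zeros, so treat the numbers as indicative only. The 8KB block size is also just an assumed value.

```python
import os
import zlib

BLOCK_SIZE = 8192  # assumed block size, for illustration only

# Stand-in for a freshly formatted, mostly empty block.
empty_block = bytes(BLOCK_SIZE)

# Stand-in for a block stuffed with random data.
random_block = os.urandom(BLOCK_SIZE)

empty_compressed = len(zlib.compress(empty_block))
random_compressed = len(zlib.compress(random_block))

print(f"empty block:  {BLOCK_SIZE} -> {empty_compressed} bytes")
print(f"random block: {BLOCK_SIZE} -> {random_compressed} bytes")
```

The empty block shrinks to almost nothing, while the random block stays at roughly its original size, which is why the array ends up reporting the full 5TB as used.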

Finally, after watching a video of Larry Ellison explaining that the Oracle Autonomous Database needs “no human intervention” and thus fearing for his job, the DBA deletes the tablespace and goes home. Back to 10TB free? No.
The tablespace deletion command does a number of things, including notifying Oracle ASM that the file’s allocation units are no longer in use and removing the datafile from the database’s controlfile. But at no point does anybody bother to tell the storage array that the used space is now free, so the array’s capacity statistics remain: Allocated = 10TB, Used = 5TB, Free = 5TB.

ASRU: The Bad Way

ASRU is Oracle’s ASM Reclamation Utility, a Perl script developed in conjunction with 3PAR (a storage array now owned by HPE) and designed to free up space from scenarios such as the one above. It is, in my personal opinion, a terrible botched solution which was created to serve a purpose which no longer exists – although, interestingly, many storage vendors still seem to recommend it by default (for example, Pure Storage still describe it as the only solution for reclaiming unused space with Oracle ASM).
ASRU doesn’t issue UNMAP commands. Instead, it takes advantage of the fact that most modern storage platforms (including 3PAR, Pure Storage and Kaminario) treat blocks full of zeros as free space (a feature known as zero detect). Thus what ASRU does – when manually run by a DBA (presumably during a change window in the middle of the night while rubbing a lucky rabbit’s foot and praying to the gods of all major religions) – is compact the remaining data in any diskgroup toward the start of the volume and then write zeros above the high watermark where this compacted data ends.
In our example above, this should return the capacity statistics to approximately: Allocated = 10TB, Used = 0TB, Free = 10TB. However, because zero detect is often considered to be a type of data reduction, some arrays then show horribly-skewed data reduction ratios as a result of ASRU.
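The compact-then-zero idea can be sketched in a few lines. This is emphatically not ASRU’s actual code, just a conceptual model: the volume is a list of blocks, a value of 0 means a block full of zeros, and the array_used and compact_and_zero functions are hypothetical names invented for the example.

```python
ZERO = 0  # represents a block containing nothing but zeros


def array_used(blocks):
    # Zero detect: the array does not count all-zero blocks as used capacity.
    return sum(1 for block in blocks if block != ZERO)


def compact_and_zero(blocks, live):
    """Move live blocks to the start of the volume, then zero everything above them."""
    kept = [blocks[i] for i in sorted(live)]
    return kept + [ZERO] * (len(blocks) - len(kept))


# A ten block volume: the first five blocks were written with non-zero data.
volume = [7, 3, 9, 4, 8, ZERO, ZERO, ZERO, ZERO, ZERO]

# The tablespace has been dropped, so ASM considers none of the blocks live,
# yet the array still sees five blocks of non-zero data.
used_before = array_used(volume)
volume_after_drop = compact_and_zero(volume, live=set())
used_after = array_used(volume_after_drop)

# Variation: blocks 1 and 3 still hold live data and are compacted to the front.
volume_partial = compact_and_zero(volume, live={1, 3})
used_partial = array_used(volume_partial)

print(used_before, used_after, used_partial)
print(volume_partial)
```

Note that nothing here resembles an UNMAP: the space comes back purely because the array notices the zeros, which is also why the data reduction ratios can end up looking so strange.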
Don’t get me wrong, many people have successfully used ASRU – and in some situations it may be your only choice. But there is another way…

ASM Filter Driver: The Good Way

Since Oracle Database version 12.1.0.2, the option has been available to install a piece of software called ASMFD, the ASM Filter Driver. ASMFD is a kernel module which resides in the I/O path of Oracle ASM disks – and is the natural successor to the Linux-only ASMLib kernel driver. Unlike ASMLib, or indeed native ASM, the ASMFD module contains support for SCSI UNMAP commands, which really is the missing piece of the jigsaw. Providing you use ASMFD, the deletion of files from within ASM will result in the storage array being notified as allocation units are freed up, resulting in the correct recalculation of Free and Used Capacity statistics – and without the unnecessary hack of writing zeros all over the place. It really is a no brainer.
Unless, of course, you’ve already installed your database and ASM and are now looking for some way to return freed capacity. In which case, installing ASMFD on an existing system may seem even more challenging than running ASRU. But you know what they say: it’s better to do it right first time than to be constantly forced into bodging it with Perl scripts.

TL;DR

If you want Oracle ASM to correctly free space back to your thin provisioned storage array, you need to choose between the correct method of using ASM Filter Driver or the botched method of running the ASRU reclamation tool, which comes in the form of a Perl script. Either way, it’s nothing to do with the storage platform, so don’t blame the storage guy…