The Hidden Cost of Letting AI Agents Query Live Systems

Every DBA has seen it at least once. You are looking at the production database and something feels wrong. CPU is climbing. Logical reads are surging. One or two sessions are consuming far more resources than everything else combined.

You drill down into the SQL.

The query is unfamiliar. It joins more tables than you would expect, sometimes even joining the same table multiple times. The execution plan is complex, the logical reads enormous.

The first question is always the same: where did this query come from?

Often the answer is simple. A new application feature has been released. Somewhere inside it is a query that looked perfectly reasonable in testing. Run once or twice, it behaved well enough.

Production tells a different story.

Now the query is executed repeatedly. Sometimes concurrently. Sometimes triggered by user behaviour the test harness never simulated. Occasionally the query runs long enough for the application to time out, so the user refreshes the page, launching another execution while the first one is still running.

What looked harmless in isolation becomes dangerous when multiplied across real workloads.

But a DBA usually has one advantage: you can usually find the human behind the query.

Someone wanted to know something. Once you understand that intent, the workload can usually be controlled. Run it less often. Move it elsewhere. Rewrite it entirely.

Operational databases survive because their workloads are ultimately bounded by human intent. AI agents change this equation in a subtle but important way.

When the Database Becomes Part of the Reasoning Process

Traditional enterprise applications interact with databases in predictable ways.

An application developer writes the query. It is tested, reviewed, and eventually deployed. Over time the database develops a recognisable workload pattern. DBAs learn those patterns, optimise them, and build operational safeguards around them.

Even complex systems ultimately follow this model. A user action triggers application logic, which triggers a known set of database queries.

AI agents behave differently.

Instead of executing a predefined query pattern, an agent may generate queries dynamically as part of a reasoning process. It retrieves data, evaluates the result, refines the question… and queries again.

In other words, the database is no longer simply answering a question. It becomes part of the mechanism the system uses to figure out the answer.

This subtle shift has an important consequence.

Applications constrain database workloads.

Agents amplify them.

The Amplification Effect

A traditional application might translate a single user action into one or two database queries.

An agent may translate the same request into many.

The agent retrieves information, evaluates it, and then decides what to ask next. It may retry a query with different parameters, join additional tables, or explore alternative paths through the data.

Some systems do this only once or twice. More autonomous agents may repeat the process multiple times in a reasoning loop.

From the database’s perspective, a single human question can now generate many queries. In simple retrieval pipelines this may only be a handful. In more autonomous reasoning systems it can grow to dozens — sometimes even hundreds — as the agent explores the data.

Multiply that pattern across thousands of employees interacting with AI systems, and the database workload begins to look very different from the one the system was originally designed to support.

This is not necessarily a flaw. It is simply how exploratory systems behave. But it does change the shape of the workload interacting with production systems.

Why Traditional Controls Are Not Enough

Experienced DBAs will recognise that production databases already include many mechanisms designed to control runaway workloads.

Connection limits, query governors, resource groups, and timeouts exist precisely because poorly behaved queries can destabilise shared systems.

These controls work well when applications generate predictable query patterns.

What they are less suited to is a workload whose shape is not known in advance.

Agentic systems can generate new queries dynamically, retry failed steps, or explore alternative reasoning paths. The individual queries may not be problematic on their own. The challenge is the emergent behaviour that appears when many such queries interact with a production workload.

The result is a new class of operational uncertainty. In practice this can manifest as unexpected infrastructure consumption, runaway query costs in cloud databases, or production incidents triggered by exploratory workloads interacting badly with operational traffic.

When a traditional query misbehaves, a DBA can usually trace it back to a specific application feature or report. With agentic systems, the activity observed in the database may simply be a side effect of an evolving reasoning process.

Understanding why a query is running becomes much harder.

Enterprise Architectures Will Evolve

For this reason, the architecture implied by many early AI demonstrations is unlikely to survive unchanged in large enterprise environments.

Those demonstrations often show AI agents querying operational databases directly. For simple scenarios, that approach works well enough.

At scale, however, exploratory workloads interacting directly with systems of record create operational and economic pressures that are difficult to manage.

The likely response is architectural evolution.

Instead of allowing agents to query operational systems directly, enterprises will increasingly introduce intermediate layers designed to absorb exploratory behaviour. These layers might include analytical replicas, semantic data services, curated retrieval surfaces, vector indexes, materialised views, or other forms of controlled query infrastructure.

What matters is not the specific technology. It is the architectural separation.

Systems of record remain the authoritative source of enterprise data. But the exploratory behaviour of AI agents will increasingly occur one step removed from those systems.

Until architectures evolve to contain that behaviour, many enterprise AI initiatives will struggle to deliver the level of return on investment that early demonstrations appear to promise.

Not because the models fail, but because the surrounding architecture has not yet caught up with how those models actually behave.

This article is part of the Databases in the Age of AI series.

The Cloud Is Built on Uniformity. Databases Are Not

When engineers build infrastructure, they aim for stability. Systems must behave predictably under load. They must survive success as well as failure. Growth is assumed. If the system is successful, it will be stressed. The challenge is ensuring that growth does not destabilise the structure meant to support it.

Uniform Supply, Irregular Demand

The public cloud addresses this challenge through uniformity. Compute, storage and network capacity are delivered as standardised building blocks. Instance sizes are predefined. Performance characteristics are abstracted. Capacity can be replicated across availability zones and regions with remarkable consistency. This is the operational achievement of Amazon Web Services, Microsoft Azure and Google Cloud. Uniform supply enables hyperscale.

It is an extraordinary achievement. It is also deliberately impersonal.

Enterprise systems of record operate under different conditions. They reflect the behaviour of real customers and real businesses. Demand is rarely smooth. It clusters around product launches, reporting cycles, market opens, regulatory deadlines and seasonal peaks. A transactional database sits at the centre of this behaviour. It absorbs whatever the business generates.

It does not get to choose the shape of that demand.

Cloud infrastructure assumes that workloads can be distributed horizontally across uniform units. Systems of record often experience concentrated pressure at specific moments in time. This is not a flaw, although it can feel like one at 9:01 on Monday when performance dashboards are flashing red. It reflects the difference between uniform supply and irregular demand.

Statistical Guarantees, Deterministic Commit

The distinction becomes clearer when examining reliability. Cloud platforms express resilience statistically. Availability is measured in percentages. Durability in strings of nines. Performance in ranges and envelopes. At fleet scale, these models are extraordinarily effective.

Most of the time, that is enough.

Transactional systems optimise for a different property. When a transaction commits, it must commit correctly and in order. When state changes, it must change once. The authoritative copy of truth cannot be approximate. Modern database engines distribute storage and parallelise reads. But the semantics of correctness still converge at the transaction boundary. Correctness is deterministic even when the infrastructure beneath it is statistical.

Cloud platforms optimise for aggregate behaviour across fleets. Transactional systems optimise for the one transaction that must not be wrong. Both approaches are rational. They are aimed at different targets — and those targets matter most under stress.

Where the Models Converge

The economic model reinforces the distinction. Elasticity allows capacity to expand and contract in response to demand. Stateless services align naturally with this pattern. Horizontal replication tends to make cost proportional to usage.

That is the promise.

Systems of record often scale differently. Reducing latency, increasing memory and tightening the coupling between compute and data frequently precede horizontal distribution. Cloud providers offer very large instance types and strong performance isolation. It is entirely possible to achieve impressive stability and scale in this way — and not infrequently, equally impressive cloud bills.

However, as performance guarantees increase, elasticity typically narrows. Dedicated capacity replaces shared pools. Provisioned throughput replaces best-effort scheduling. Predictability rises. Flexibility declines. These are trade-offs rather than failures. They always have been.

For many years, architects could separate these optimisation philosophies. Operational systems preserved deterministic state transitions. Analytical platforms absorbed variability downstream.

AI Removes the Insulation

AI reduces that separation. Machine agents increasingly interact directly with live operational data rather than curated extracts. Inference engines that are inherently probabilistic now depend more directly on infrastructure expected to be exact. The probabilistic layer now relies on deterministic state in real-time.

This does not invalidate the cloud model, nor does it imply that systems of record do not belong there. It makes the architectural assumptions more visible. Uniform, statistical infrastructure interacts with non-uniform, deterministic state, and the optimisation choices at each layer become more tightly coupled.

Architects must therefore decide explicitly where determinism resides in their stack and what elasticity they are willing to trade to preserve it. Elasticity, isolation and correctness can coexist. But not infinitely, and never without cost.

Someone always pays for certainty.

The cloud is built on uniformity. Databases are built on authority. That authority carries operational consequence.

As AI tightens the feedback loop between them, trade-offs once hidden inside abstraction become explicit design decisions. And explicit design decisions are where responsibility ultimately resides.

This article is part of the Databases in the Age of AI series.

Why We Spent 20 Years Protecting Databases from Analytics (and Why AI Just Broke That Truce)

8:53am on a Monday

I want to take you back roughly twenty years.

It is 8:53am on a Monday morning in central London and I am sitting at my desk staring at the screen. My coffee is untouched. My palms are sweating.

The ETL job is still running.

If that phrase means nothing to you, it was enough in the mid-2000s to strike fear into the heart of any production DBA. At the time I was the database lead for a SaaS platform serving customers across the UK, most of whom would start logging in from 9am.

The weekend ETL process (Extract, Transform, Load) pulled data out of the operational database, reshaped it and pushed it into the enterprise data warehouse. It ran every weekend and it was supposed to finish long before Monday morning.

It had not.

CPU pinned. Redo logs churning. User sessions beginning to queue.

In a few minutes customers would start calling support. Support would start calling me. And somewhere upstairs, the CEO would notice that his dashboard was still showing last week’s numbers.

I had two options. Let it run and hope it finished before the system buckled under real user traffic. Or kill it and spend the next hour watching rollback potentially take even longer, while guaranteeing that the warehouse would be stale until the following weekend.

How did we end up building systems where that was a normal Monday morning dilemma?

Why We Built the Wall

For the better part of two decades, the industry answered that question in one consistent way: keep analytics away from operational systems.

Transactional databases were designed to process orders, update accounts and record events predictably. Analytical workloads were different. They were heavy, exploratory and often poorly constrained. They scanned large portions of data, built aggregates, joined everything to everything and consumed CPU and I/O in bursts that were difficult to forecast.

Putting the two together in the same system was a recipe for contention.

So we separated them.

We built ETL pipelines. We built data warehouses. Later, we built data lakes and lakehouses. We introduced replication, change data capture and streaming. Each innovation was, in its own way, an attempt to preserve the integrity of the system of record while still making data available for analysis.

This was not fashion. It was defensive architecture.

The separation protected revenue-generating systems from analytical curiosity. It provided workload isolation. It gave operations teams a fighting chance of keeping Monday morning uneventful.

It Was Never Just About Performance

Enterprise environments rarely have a single system of record. A CRM system holds customer interactions. An ordering platform tracks transactions. Billing lives somewhere else. Supply chain somewhere else again.

The warehouse became not just a safety valve for analytics, but a unifying layer. It was the place where disparate operational systems could be reconciled into something coherent.

For years, this model worked.

Dashboards were allowed to be slightly stale. Reports could reflect yesterday’s state. Humans tolerate delay. In fact, they often prefer it. Analysis takes time, and decisions are rarely made in milliseconds.

The truce held because the consumer was human.

AI Changes the Consumer

AI agents change that.

An AI agent does not log in at 9am. It does not wait for a dashboard refresh. It does not tolerate yesterday’s numbers if it is expected to act on what is true right now.

Inference is not reporting. It is decision execution at machine speed.

If an agent is recommending a next action, approving a transaction, adjusting a price or triggering a workflow, the freshness of the underlying data becomes materially important. Close enough is no longer good enough. Staleness is no longer cosmetic. It alters outcomes.

The architectural assumption that analytics can safely run on a delayed copy of operational data begins to fracture.

This does not mean warehouses were a mistake. It does not mean lakehouses are obsolete. It does not mean streaming pipelines were misguided.

It means they were optimised for a different consumer.

For years, we optimised for human analysis. Now we are increasingly optimising for machine-driven action.

That is a different problem.

The Balance of Trade-Offs

For two decades, the answer was clear: keep them apart.

Protect the system of record. Move the data. Analyse it somewhere else. Accept a little delay in exchange for stability and control.

That architecture was forged in moments exactly like that Monday morning at 8:53am, CPU pinned, redo logs churning, business users about to log in.

AI does not invalidate that history. It simply changes the balance of trade-offs.

The truce between operational databases and analytics was built for a world where humans consumed insight.

We are now entering a world where machines consume state.

And that changes the conversation.

This article is part of the Databases in the Age of AI series.

AI Doesn’t Read Dashboards… and That Changes Everything for Databases

A bank executive opens a fraud dashboard in Microsoft Power BI.

Losses by region, chargeback ratios, transaction velocity trends and a heatmap of anomalous activity. The numbers refresh within minutes. Data flows out of the system of record, is reshaped and aggregated, then presented for interpretation.

This is contemporary analytics: fast and operationally impressive. But it remains interpretive. It explains what is happening, while intervention occurs elsewhere – inside a fraud model embedded in the execution path, deciding in milliseconds whether money moves or an account is frozen.

Reporting systems describe what has already occurred. Even when refreshed every few minutes, they are retrospective. Inference systems are anticipatory. They evaluate the present in order to shape what happens next.

For two decades, enterprise data platforms were built around a deliberate separation between systems of record and analytical platforms. The system of record handled revenue-generating transactions; analytics operated on copies refreshed hourly or even every few minutes. Latency narrowed, but the boundary remained.

AI systems do not consume summaries, however fresh. They make decisions inside the transaction itself. A real-time fraud model does not want a recently refreshed extract; it requires the authoritative state of the business at the moment of decision. When automation replaces interpretation, data freshness becomes a decision integrity requirement. That shift changes the role of the database entirely.

Snapshots ≠ State

The difference is not batch versus real time. It is snapshot versus canonical state.

A snapshot is a materialised representation of state at a prior point in time – even if that point is only moments earlier. It may be refreshed frequently or continuously streamed, but it remains a copy. The system of record contains the canonical state of the enterprise – balances, limits, flags and relationships – reflecting the legal and financial truth when a transaction commits.

In fraud detection, that distinction is decisive. A dashboard can tolerate slight delay because its purpose is explanation. A model embedded in the execution path cannot. It must evaluate current balance, velocity and account status, not a recently materialised representation.

For years, we increased the distance between analytics and the system of record to protect transactional stability. That separation reduced risk in a world where insight followed action.

Automation reverses that order. Insight now precedes action. Once decisions are automated, the gap between a copy of the data and the authoritative source becomes consequential.

When Almost Right Is Still Wrong

If a fraud dashboard is slightly stale, an analyst may adjust a threshold next week. When a fraud model evaluates incomplete or delayed state, the error is executed immediately and repeated at scale.

False declines can lock customers out within minutes. False approvals can leak substantial losses before discrepancies surface in reporting. Automation compresses time and amplifies mistakes because there is no interpretive buffer.

Real-time intervention is inevitable. Competitive pressure and regulatory scrutiny demand it. But once decisions are automated, tolerance for architectural distance shrinks. A delay harmless in reporting can be material in a decision stream. A dataset “close enough” for analytics may be insufficient for automated intervention.

The risk is not that dashboards are wrong; it is that forward-looking systems may act on something almost right.

Databases in the Age of Intervention

When fraud detection becomes automated intervention rather than retrospective analysis, the requirements on the data platform change. Freshness is defined at the moment of decision, not by refresh intervals.

Replication patterns take on new significance. Asynchronous copies and downstream materialisations were designed to protect the system of record. They optimise scale and isolation, but every layer introduces potential lag or divergence. For reporting, that trade-off is acceptable. For automated decisions in revenue-generating workflows, it becomes risk.

Workload separation also looks different. When analytics is retrospective, distance protects performance. When inference is embedded in operational workflows, proximity to live transactional state matters. The challenge is enabling safe, predictable access without compromising correctness.

Fraud detection is simply the clearest example. Dynamic pricing, credit approvals, supply chain routing and clinical triage all follow the same pattern. The model is not generating a report about what happened; it is evaluating the present to influence what happens next.

For decades, enterprise architecture assumed intelligence followed events. As AI systems become anticipatory and automated, intelligence precedes action. The database is no longer simply the foundation of record-keeping.

It becomes part of how the future is decided – whether we are comfortable with that or not.

This article is part of the Databases in the Age of AI series.

Databases Were Built for Humans – AI Agents Change the Equation

For more than a decade, the industry has been preparing for a data explosion.

Zettabytes. Exponential curves. Hockey sticks on slides. Whether it was IDC’s DataSphere forecasts or countless vendor keynotes, the message was consistent: the amount of data created and stored worldwide was about to grow very, very fast.

And to be fair, that part largely went to plan.

Enterprises adapted. Storage scaled out. Cloud elasticity became normal. Analytical workloads were pushed away from systems of record. The industry did the work required to survive – and even thrive – in a world of exploding data volumes.

What almost nobody questioned, however, was a much quieter assumption baked into all of that planning.

The Assumption Nobody Revisited

All of those forecasts, explicit or implicit, assumed that the users of enterprise systems would remain human.

Humans are slow. Humans are bursty. Humans sleep.

Even power users have natural limits, predictable working patterns and an instinct for self-preservation when systems start pushing back. Entire generations of database design, connection management and capacity planning quietly depend on those characteristics.

It wasn’t a bad assumption. It was a reasonable one. Until it wasn’t.

A Step-Change, Not a Trend

What has changed is not just how much data exists, but who – or what – is accessing it.

AI agents introduce a new class of user into enterprise computing: non-human, machine-speed actors operating directly against application logic and data sources. This isn’t a continuation of an existing trend. It’s a step-change.

You’re not adding more users along the same curve. You’re changing the curve itself.

The data explosion was predicted. The user explosion – at least in this form – was not.

Why AI Agents Break Old Rules

AI agents don’t just behave like very enthusiastic humans.

They are fundamentally different:

Speed: they operate at machine speed, turning milliseconds into meaningful units of work
Relentlessness: they don’t pause, sleep or slow down unless explicitly forced to
Unpredictability: agentic workflows fan out, retry, amplify and cascade in ways humans never could

These aren’t “power users”. They’re closer to autonomous load generators.

When Agents Hit Systems of Record

Critically, AI agents don’t want last night’s report.

They want now.

That pulls them towards operational systems of record – the RDBMS platforms that were carefully protected for the last twenty years from exactly this kind of access pattern. Read replicas help, until they don’t. Caches help, until coherence matters. Copy lag becomes a business problem, not a technical detail.

The long-standing truce between OLTP and everything else is under strain.

Capacity Planning Enters the Chaos Zone

Traditional infrastructure planning assumes that tomorrow looks broadly like yesterday, just a bit bigger.

AI agents break that assumption.

Sudden workload spikes. Non-linear fan-out. Cost curves that move faster than budgeting cycles. Organisations are forced into an uncomfortable choice: over-provision aggressively and accept unpredictable cloud bills, or under-provision and risk outages in systems that now sit directly on critical decision paths.

Capacity planning stops being optimisation. It becomes risk management.

This Is Already Happening

None of this is theoretical.

Organisations are already talking openly about AI agents as part of their workforce – not as tools, but as actors performing work at scale.

Enterprises are comfortable counting tens of thousands of AI agents as “workers”, but it shouldn’t be surprising when those workers behave very differently to humans – and place very different demands on the systems beneath them.

The Equation Has Changed

The data explosion followed the forecast.

The explosion in users did not.

Databases were built for humans – slow, bursty, predictable ones – and that assumption shaped everything from architecture to cost models. AI agents don’t fit that mould… and pretending they do is how organisations drift into outages, runaway costs or both.

Databases were built for humans. AI agents didn’t get the memo – and they’re already in production.

This article is part of the Databases in the Age of AI series.

Inferencing Is a Database Problem Disguised as an AI Problem

I have a habit of becoming interested in technology trends only once they collide with reality. Flash memory wasn’t interesting to me because it was new – it was interesting because it broke long-held assumptions about how databases behaved under load.

Cloud computing wasn’t interesting to me because infrastructure became someone else’s problem. It became interesting when database owners started making uncomfortable compromises just to get revenue-affecting systems to run acceptably in the cloud. Compute was routinely overprovisioned to compensate for storage performance, leading to large bills for resources that were mostly idle. At the same time, “modernisation” began to feel less like an architectural necessity and more like a convenient justification for expensive consultancy services.

And now, just when I thought flashdba had nothing left to say, AI is following the same path.

We’ve Seen This Movie Before

For the last couple of years, most of the attention has been on training. Bigger models, more parameters, more GPUs, massive share prices. That focus made sense because training is visible, centralised and easy to reason about in isolation. But as inferencing starts to move up into the enterprise, something changes.

In the enterprise, inferencing stops being an interesting AI capability and starts becoming part of real business workflows. It gets embedded into customer interactions, operational decisions and automated processes that run continuously, not just when someone pastes a prompt into a chat window. At that point, the constraints change dramatically.

Enterprise inferencing is no longer about what a model knows. It is about what the business knows right now. And that is where things begin to feel very familiar to anyone responsible for systems of record.

Because once inferencing depends on real-time access to authoritative operational data, the centre of gravity shifts away from models and back towards databases. Latency matters. Consistency matters. Concurrency matters. Security boundaries matter. Above all, correctness matters.

This is the point at which inferencing stops looking like an AI problem and starts looking like what it actually is: a database problem, wearing an AI costume.

Inferencing Changes Once It Becomes Operational

While inferencing remains something that sits at the edge of the enterprise, its demands are relatively modest: a delayed response is tolerable… slightly stale data is acceptable. If an answer is occasionally wrong, the consequences are usually limited to a poor user experience rather than a failed business process.

That changes quickly once inferencing becomes operational. When it is embedded directly into business workflows, inferencing is no longer advisory… it becomes participatory. It influences decisions, triggers actions and – increasingly – operates in the same execution path as the systems of record themselves. At that point, inferencing stops consuming convenient snapshots of data and starts demanding access to live context data.

What is Live Context?

By live context, I don’t mean training data, feature stores or yesterday’s replica. I mean current, authoritative operational data, accessed at the point a decision is being made. Data that reflects what is happening in the business right now, not what was true at some earlier point in time. This context is usually scoped to a specific customer, transaction or event and must be retrieved under the same consistency, security and governance constraints as the underlying system of record. In other words, a relational database. Your relational database.

Live Context gravitates towards RDBMS systems of record. It does not appear spontaneously – it is created at the moment a business state changes: when an order is placed, a payment is authorised, an entitlement is updated or a limit is breached, that change becomes real only when the transaction is committed to the RDBMS. Until then, it is provisional.

Analytical platforms can consume that state later, but they do not create it. Feature stores, caches and replicas can approximate it, but they do so after the fact. The only place where the current state of the business definitively exists is inside the operational production databases that process and commit transactions.

As inferencing becomes dependent on live context, it is therefore pulled towards those databases. Not because they are designed for AI workloads, and certainly not because this is desirable, but because they are the source of truth. If an inference is expected to reflect what is true right now, it must, in some form, depend on the same data paths that make the business run.

This is where the tension becomes unavoidable.

Inferencing Is Now A Database Problem

Once inferencing becomes dependent on live context, it inherits the constraints of the systems that provide that context. Performance, concurrency, availability, security and correctness are no longer secondary considerations. They become defining characteristics of whether inferencing can be trusted to operate inside business-critical workflows at all.

This is why enterprise AI initiatives are unlikely to succeed or fail based on model accuracy alone. They will succeed or fail based on how well inferencing workloads coexist with production databases that were never designed, built or costed with AI in mind. At that point, inferencing stops being an AI problem to be delegated elsewhere and becomes a database concern that must be understood, designed for and owned accordingly.

This article is part of the Databases in the Age of AI series.

The Biggest Gap In The Clouds? High Performance RDBMS

Over the course of the last few blog posts, we’ve looked at how an increasing number of database workloads are migrating to the cloud, how there is more than one path to get there… and why overprovisioning is one of the biggest challenges to overcome.

We’re talking about business-critical application workloads here: big, complex, demanding, mission-critical, sensitive, performance-hungry… When on-prem, they are almost certainly running on dedicated, high-end infrastructure. And that’s a potential issue when you then migrate them to run on “someone else’s computer“.

As we’ve discussed before, the cloud is really a big pool of discrete resources and services, all of which are available on demand. You want a managed PostgreSQL instance? Click! It’s yours. You want three hundred virtual machines on which you can install your own software? Clickety-click! Off you go. If you’ve got the budget, the cloud has got a way for you to spend it. But underneath it all, whether you are using PaaS databases supplied by the cloud provider or installing the database software on IaaS systems, you are sharing that infrastructure – and the available performance – with the rest of the world.

Cloud Outcomes: Optimization versus Modernization

For some database workloads moving to the cloud, the modernization path will be the best fit, which means they will likely move to Platform-as-a-Service solutions where the day-to-day management of the database, operating system and infrastructure is taken care of by the cloud provider. Some examples of this path: on-prem SQL Server databases moving to Azure SQL Managed Instances; Oracle Databases moving to AWS’s Oracle RDS solution, etc.

But there is usually a certain class of database workload which doesn’t easily fit into these pre-packaged PaaS solutions: the big, the complex, the gnarly… the monsters of your data centre. And they inevitably end up in Infrastructure-as-a-Service… or stuck on-prem. For customers choosing the IaaS route (the “optimization path” in cloud-speak), the cloud provider manages the infrastructure but the customer is still responsible for the database and operating system.

Obviously, IaaS has a higher management overhead than PaaS, but often the journey to IaaS is simpler (essentially more of a lift and shift approach), while PaaS solutions often require a more complex migration. Especially with some cloud providers, where the recommended PaaS solution is actually a different database product entirely (for example, Oracle customers moving to Google Cloud or Microsoft Azure will be recommended by those cloud providers to move to Cloud SQL and Managed PostgreSQL respectively).

I/O Performance Is The Biggest Challenge

My view is that PaaS solutions are the best path for all appropriate workloads, but there will always be some outliers which need to move to IaaS. Almost by definition, those are the most high-profile, demanding, expensive, revenue-affecting… in fact… the most interesting workloads. And in all the cases I’ve seen, I/O performance has been the limiting factor.

It’s relatively easy to get a lot of compute power in the cloud. But as soon as you start ramping up the amount of data you need to read and write, or demanding that those reads and writes have very fast, predictable response times, you hit problems. In other words, if latency, IOPS or throughput are your metrics of choice, you’d better be ready to start doing unnatural things.

And it’s not necessarily the case that your required level of performance cannot be achieved. Often, it’s more correct to say that your required levels of performance cannot be achieved at an acceptable cost. Because it turns out that the following statement is just as true in the cloud as it ever was on-prem:

Performance and Cost are two sides of the same coin…

This is why I believe that the biggest gap in the cloud providers’ product portfolios today is in the area of high performance relational databases: primarily Oracle Database and Microsoft SQL Server. The PaaS solutions are designed for the average workloads, not the high-end. A complex database running on, for example, Oracle Exadata will struggle to run on a vanilla IaaS deployment – while the refactoring required to take that database and migrate it to Managed PostreSQL is almost unimaginable.

Cloud Compromises: Constrained and Optimized CPUs

Imagine the scenario where you wonder into a clothing store to buy a t-shirt. You find a design you like in size “Medium” but it’s too tight (I guess #lockdown has been unkind to us all…) so you ask for the next size up. But when it arrives, you notice something bizarre: the “Large” is not only wider and longer, it also has an extra arm hole. Yes, there are enough holes for three arms as well as your head. Even more bizarrely, the “XL” size has four sets of sleeves, while the “Small” has only one and the “XS” none at all!

Surprisingly, this analogy is very applicable to cloud computing, where properties like compute power, memory, network bandwidth, capacity and performance are often tied together. As we saw in the previous post, a requirement for a certain amount of read I/O Operations Per Second (IOPS) can result in the need to overprovision unwanted capacity and possibly even unnecessary amounts of compute power.

But there is one situation where this causes extra levels of pain: when the workload in question is database software which is licensable by CPU cores (e.g. Oracle Database, Microsoft SQL Server).

To extend the opening analogy into total surrealism, imagine that the above clothing store exists in a state which collects a Sleeve Tax of %100 of the item value per sleeve. Now, your chosen t-shirt might be $40 but the Medium size will cost you $120, the Large $160 and the XXXXXL (suitable for octopods) a massive $360.

Luckily, the cloud providers have a way to help you out here. But it kind of sucks…

Constrained / Optimized VM Sizes

If you need large amounts of memory or I/O, the chances are you will have to pick a VM type which has a larger number of cores. But if you don’t want to buy databases licenses for these additional cores (because you don’t need the extra CPU power), you can choose to restrict the VM instance so that it only uses a subset of the total available cores. This is similar to the concept of logical partitioning which you may already have used on prem. Here are two examples of this practice from the big hyperscalers:

Microsoft Azure: Constrained vCPU capable VM sizes

Amazon Web Services: Introducing Optimize CPUs for Amazon EC2 Instances

As you can see, Microsoft and AWS have different names for this concept, but the idea is the same. You provision, let’s say, a 128 vCPU instance and then you restrict it to only using, for example, 32 vCPUs. Boom – you’ve dropped your database license requirement to 25% of the total number of vCPUs. Ok so you only get the compute performance of 25% too, but that’s still a big win on the license cost… right?

Well yes but…

There’s a snag. You still have to pay the full cost of the virtual machine despite only using a fraction of its resources. The monthly cost from the cloud provider is the same as if you were using the whole machine!

To quote Amazon (emphasis mine):

Please note that CPU optimized instances will have the same price as full-sized EC2 instances of the same size.

Or to quote the slightly longer version from Microsoft (emphasis mine):

The licensing fees charged for SQL Server or Oracle are constrained to the new vCPU count, and other products should be charged based on the new vCPU count. This results in a 50% to 75% increase in the ratio of the VM specs to active (billable) vCPUs. These new VM sizes allow customer workloads to use the same memory, storage, and I/O bandwidth while optimizing their software licensing cost. At this time, the compute cost, which includes OS licensing, remains the same one as the original size.

It’s great to be able to avoid the (potentially astronomical) cost of unnecessary database licences, but this is still a massive compromise – and the cost will add up over each month you are billed for compute cores that you literally cannot use. Again, this is the public cloud demonstrating that inefficiency and overprovisioning are to be accepted as a way of life.

Surely there must be a better way?

Spoiler alert: there IS a better way

Overprovisioning: The Curse Of The Cloud

I want you to imagine that you check in to a nice hotel. You’ve had a good day and you feel like treating yourself, so you decide to order breakfast in your room for the following morning. Why not? You fill out the menu checkboxes… Let’s see now: granola, toast, coffee, some fruit. Maybe a juice. That will do nicely.

You hang the menu on the door outside, but later a knock at the door brings bad news: You can only order a maximum of three items for breakfast. What? That’s crazy… but no amount of arguing will change their rules. Yet you really don’t want to choose just three of your five items. So what do you do? The answer is simple: you pay for a second hotel room so you can order a second breakfast.

Welcome to the world of overprovisioning.

Overprovisioning = Inefficiency

Overprovisioning is the act of deploying – and paying for – resources you don’t need, usually as a compromise to get enough of some other resource. It’s a technical challenge which results in a commercial or financial penalty. More simply, it’s just inefficiency.

The history of Information Technology is full of examples of this as well as technologies to overcome it: virtualization is a solution designed to overcome the inefficiency of deploying multiple physical servers; containerisation overcomes the inefficiency of virtualising a complete operating system many times… it’s all about being more efficient so you don’t have to pay for resources you don’t really need.

In the cloud, the biggest source of overprovisioning is the way that cloud resources like compute, memory, network bandwidth, storage capacity and performance are packaged together. If you need one of these in abundance, the chances are you will need to pay for more of the others regardless of whether they are required or not.

Overprovisioning = Compromise

As an example, at the time of writing, Google Cloud Platform’s pd-balanced block storage options provide 6 read IOPS and 6 write IOPS per GB of capacity:

* Persistent disk IOPS and throughput performance depends on disk size, instance vCPU count, and I/O block size, among other factors.

Consider a 1TB database with a reasonable requirement of 30,000 read IOPS during peak load. To build a solution capable of this, 5000GB (i.e. 5TB) of capacity would need to be provisioned… meaning 80% of the capacity is wasted!

Worse still, the “Read IOPS per instance” row of the table tells us that some of the available GCP instance types may not be able to hit our 30,000 requirement, meaning we may have to (over)provision a larger virtual machine type and pay for cores and RAM that aren’t necessary (by the way, I’m not picking on GCP here, this is common to all public clouds).

But the real sucker punch is that, if this database is licensed by CPU cores (e.g. Oracle, SQL Server) and we are having to overprovision CPU cores to get the required IOPS numbers, we now have to pay for additional, unwanted – and very expensive – database licenses.

Overprovisioning = Overpaying

Let’s not imagine that this is a new phenomenon. If you’ve ever over-specced a server in your data centre (me), if you’ve ever convinced your boss that you need the Enterprise Edition of something because you thought it would be better for your career prospects (also me), or if you’ve ever spent £350 on a thermal imaging camera just so you can win an argument about whether you need a new front door (I neither admit nor deny this) then you have been overprovisioning.

It’s just that the whole nature of cloud computing, with it’s self-service, on-demand, limitlessly-scalable charateristics make it so easy to overprovision things all the time. So while the amounts may seem small when shown on the cloud provider’s Price per hour list, when you multiply them by the number of VMs, the number of regions and the number of hours in a year, they start to look massive on your bill.

And when you consider the knock on effects on database licensing, things really get painful. But let’s save that for the next blog post…

The Public Cloud: The Hotel For Your Applications

Unless you are Larry Ellison (hi Larry!), the chances are you probably live in a normal house or an apartment, maybe with your family. You have a limited number of bedrooms, so if you want to have friends or relatives come to stay with you, there will come point where you cannot fit anybody else in without it being uncomfortable. Of course, for a large investment of time and money, you could extend your existing accommodation or maybe buy somewhere bigger, but that feels a bit extreme if you only want to invite a few people On to your Premises for the weekend.

Another option would be to sell up and move into a hotel. Pick the right hotel and you have what is effectively a limitless ability to scale up your accommodation – now everybody can come and stay in comfort. And as an added bonus, hotels take care of many dull or monotonous daily tasks: cooking, cleaning, laundry, valet parking… Freeing up your time so you can concentrate on more important, high-level tasks – like watching Netflix. And the commercial model is different too: you only pay for rooms on the days when you use them. There is no massive up-front capital investment in property, no need to plan for major construction works at the end of your five year property refresh cycle. It’s true pay-as-you-go!

It’s The Cloud, Stupid

The public cloud really is the hotel for your applications and databases. Moving from an investment model to a consumption-based expense model? Tick. Effectively limitless scale on demand? Tick. Being relieved of all the low-level operational tasks that come with running your own infrastructure? Tick. Watching more Netflix? Definite Tick.

But, of course, the public cloud isn’t better (or worse) than On Prem, it’s just different. It has potential benefits, like those above, but it also has potential disadvantages which stem from the fact that it’s a pre-packaged service, a common offering. Everyone has different, unique requirements but the major cloud providers cannot tailor everything they do to you individual needs – that level of customisation would dilute their profit margins. So you have to adapt your needs to their offering.

To illustrate this, we need to talk about car parking:

Welcome To The Hotel California

So… you decide to uproot your family and move into one of Silicon Valley’s finest hotels (maybe we could call it Hotel California?) so you can take advantage of all those cloud benefits discussed above. But here’s the problem, your $250/day suite only comes with one allocated parking bay in the hotel garage, yet your family has two cars. You can “burst” up by parking in the visitor spaces, but that costs $50/day and there is no guarantee of availability, so the only solution which guarantees you a second allocated bay is to rent a second room from the hotel!

This is an example of how the hotel product doesn’t quite fit with your requirements, so you have to bend your requirement to their offering – at the sacrifice of cost efficiency. (Incurring the cost of a second room that you don’t always need is called overprovisioning.) It happens all the time in every industry: any time a customer has to fit a specific requirement to a vendor’s generic offering, something somewhere won’t quite fit – and the only way to fix it is to pay more.

The public cloud is full of situations like this. The hyperscalers have extensive offerings but their size means they are less flexible to individual needs. Smaller cloud companies can be more attentive to an individual customer’s requirements, but lack the economies of scale of companies like Amazon Web Services, Microsoft and Google, meaning their products are less complete and their prices potentially higher. The only real way to get exactly what you want 100% of the time is… of course… to host your data on your own kit, managed by you, on your premises.

Such A Lovely Place

I should state here for the record that I am not anti-public cloud. Far from it. I just think it’s important to understand the implications of moving to the public cloud. There are a lot of articles written about this journey – and many of them talk about “giving up control of your data”. I’m not sure I entirely buy that argument, other than in a literal data-sovereignty sense, but one thing I believe to be absolutely beyond doubt is that a move to the public cloud will require an inevitable amount of compromise.

That should be the end of this post, but I’m afraid that I cannot now pass up the opportunity to mention one other compromise of the public cloud, purely because it fits into the Hotel California theme. I know, I’m a sucker for a punchline.

You and your family have enjoyed your break at the hotel, but you feel that it’s not completely working – those car parking charges, the way you aren’t allowed to decorate the walls of your room, the way the hotel suddenly discontinued Netflix and replaced it with Crackle. What the …? So you decide to move out, maybe to another hotel or maybe back to your own premises. But that’s when you remember about the egress charges; for every family member checking out of the hotel, you have to pay $50,000. Yikes!

I guess it turns out that, just like with the cloud, you can check out anytime you like… but you can never leave.