So, You Want To Build A DBaaS Part 2: Technical Bits

2023-05-21

In my first post in this pseudo-series, we covered more of the theory and organisational practices around building and operating a database-as-a-service platform. In this post, I’ll cover some of the more technical aspects of this that I think are worth considering, drawn from my own experience.

We’ll cover this in no particular order, and again, here is a table of contents to make navigation a little easier.

Your language choice
Safety, Performance and Style
Data plane, control plane
Replicas, replicas, replicas
Some mutability, not a lot
Feature flags and control rods
State machines and tick rates
Cellular architecture, blast radius
Kernel tunables and other low-level configuration

Your language choice

Getting this out of the way, there is no real wrong language to build your DBaaS in. However, it’s worth considering the following when choosing your language:

Debuggability. Can you easily debug your code? Can you easily attach a debugger to it? Can you easily profile it? Can you easily trace it?
Performance. As mentioned previously, you need to maximise your margins and get the most out your deployment substrate.
Safety. Can you easily write safe code? Can you easily write code that is safe by default? Can you easily write code that is safe by default and performant? We’ll get into this a little more later.
Community. Is there a robust community around the language, providing well-tested libraries and ideally a community interested in using the language to build control planes? Can you easily hire people who know the language?

I’ve worked on DBaaS platforms built in a few languages, such as Go, Ruby and Python. If I had to give a strong recommendation, it would probably be Go. Ruby and Python are fine, but the language features can lead to excessive complexity, and I often wished for the clarity and simplicty of Go.

Safety, Performance and Style

The folks over at TigerBeetle has a fantastic document on what they call Tiger Style. In it, they describe a set of rules that they follow when writing code for their database, which I think can apply pretty broadly to most projects, but especially building a DBaaS. To note:

Use only very simple, explicit control flow for clarity. Do not use recursion to ensure that all executions that should be bounded are bounded.
Put a limit on everything because, in reality, this is what we expect—everything has a limit
A function must not operate blindly on data it has not checked.
Make invalid states unrepresentable through assertions.
Whenever your program has to interact with external entities, don’t do things directly in reaction to external events. Instead, your program should run at its own pace.
All errors must be handled. No ifs, ands, or buts. For why, see this paper

The document goes on, including useful notes on performance (use back-of-the-envelope sketches in regards to resource usage), developer experience and tooling. I highly recommend reading the whole thing - this and sled.rs theoretical performance guide have informed a lot of my thinking around programming recently.

The content described is mostly for lower-level implementations, versus writing a control plane or larger system, but rules for small things scale pretty well in my experience.

Data plane, control plane

For those unfamiliar with the terms, the data plane is is responsible for the movement of data across a network, while the control plane is responsible for the management of the data plane. In the context of a DBaaS, the data plane is the database itself, while the control plane is the software that manages the database.

Depending on your product, you will either have significant control over the data plane (i.e. you are building your own data service), or you will have relatively little (i.e. you are building a control plane for an existing data service). Much of your focus will be on getting the control plane right, as that will be managing the core features of your platform, such as provisioning, backups, upgrades, etc.

However, that does not mean you should ignore the data plane, even if not building your own database - this is the plane that your customers will be interacting with the most. Optimise for the customer experience, while trying to minimise performance impacts. As an example, sticking a protocol-agnostic proxy in front of the database to handle TLS termination and give you some control rods to throttle traffic can be worth the cost of some (small) additional latency.

Replicas, replicas, replicas

It is professionally negligent to not provide replicas, and I would argue that it is also negligent to not provide them by default, though frequently cost is the limiting factor in doing this - it’s not cheap to run a database, and it’s even more expensive to run multiple copies of it. DBaaS customers are also generally price sensitive, so being 2x the price of your competitors, by default, is risky.

Replicas form the foundation for much of the ongoing maintenance of the platform, and offer an attractive option for customers to provide some read-load shedding and high availability. If taking a mostly-immutable approach to infrastructure, as recommended below, this will also be the way of updating the underlying host and so on.

I am always in two minds about giving customers access to replicas that are reserved for high-availability and maintenance purposes - on the one hand, restricting access allows for changes to be made without interrupting the customer and without the possibility of the customer workload impacting the availability of the replica. On the other hand, it helps justify the cost if they can use it - it’s a balance you will have to strike, or even abstract away entirely.

Some mutability, not a lot

I am a fan of immutable infrastructure. It makes it much easier to reason about the overall homogeniety of the fleet, but having to replace machines wholesale can be expensive and time consuming, especially for small configuration changes. I think a good balance is to have a small amount of mutability, but to make it as safe as possible and limit it to the smallest possible scope.

A good example of this is the configuration of the database itself. If you are using a database that supports configuration reloads, such as PostgreSQL, you can allow customers to change the configuration of their database, but only allow them to do so via configuration that is tracked in the control plane. This allows you to validate the configuration before applying it, and to ensure that the configuration is applied to all replicas and so on.

Another example may be increasing the amount of allocated disk space for the database, but again, should still be tracked in the control plane and given sensible limits.

When it comes to the mutable parts of your platform, ensure that the changes are atomic and follow the ususal rules around safe changes. Be doubly careful with config changes - config is often faster to change than code, and that can come with increased risk.

Feature flags and control rods

As you are in charge with a customer’s most valuable resource, with very high availability demands, being able to roll out new features and configuration safely is paramount. Feature flags have had reams of material written about them, so I won’t go into too much detail here. However, something I will discuss in a bit more detail is how they are realted to “control rods”.

A control rod is a term I’ve borrowed from nuclear power plants, and is a mechanism for controlling the rate of fission in a reactor. In the context of a DBaaS, it is a mechanism for controlling the rate of change in the platform. Being able to adjust the rate of a job queue being filled, the rate a new configuration change is applied across the fleet or even changing the network throughput of a backup job are all examples of being control rods, versus the slightly more binary feature flags.

Additionally, invest in feature flag tooling that allows you to manage flags at different levels of granularity, and tooling that supports “sticky” flags, tying groups of flags to a specific identifier - being able to opt a customer in to an early-access feature and keep their services opted in without additional changes will make your operators happy.

That said, a warning - prune flags and associated code regularly after you have achieved full rollout, and ensure that “snowflakes” are not tolerated for long in your fleet - maximise homogeniety where possible.

State machines and tick rates

As Craig Kerstiens mentions, the original design for Heroku’s Postgres DBaaS was around a central finite state machine, or as it is now, a collection of finite state machines (see my post on Unreasonably Effective Patterns for more on the nesting).

Modelling the system as a set of state machines allows for easy reasoning about a given entity’s state and the transitions into new states. It also allows you to more easily follow the advice from earlier - “don’t do things directly in reaction to external events”. Instead, we can have a central “tick rate” (to further borrow from video game design) and “tick” each state machine at that rate.

The upshot to this is that we have a minimum baseline for how often we will check for changes, and we can also use this to batch up changes and apply them at once, which increases the overall throughput of the system. The downside is that this is a minimum - we can’t change state instantly, and we have to wait for the next tick to apply changes. This is where control rods come in - we can adjust the tick rate to increase or decrease the rate of change, and we can also use them to apply changes immediately.

You may wish to have different queues that operate at different tick rates, for example a “normal” queue operating at a 1-minute tick rate, and a “high priority” queue operating at 10-second tick rate for things like progressing through a failover state machine.

Another feature of state machines is that they are (resonably) easy to distribute - store the state somewhere, have something pop the job off a queue and then hydrate an entity with that state, and then run the state machine. This allows you to scale out the processing of state machines, and also to retry failed state machines. Tools such as Temporal can help with this.

Cellular architecture and reducing blast radius

So far, we’ve spoken about “control plane” in the singular. In order to reduce blast radius, and to allow for more granular rollouts of features, the control plane should be split into multiple “cells”, each of which is responsible for a subset of the overall fleet. This could be sharded by region or a similarly suitable “sharding key”.

Although this dribes up the overall complexity, being able to operate cells independently allows for capping the amount of work a give control plane has to do, reducing the blast radius of new feature rollout and helping keep tick rates constant.

That isn’t to say “go full microservices” - a constellation of “macroservices” is likely easier to reason about, especially when each deployment is the same code, only operating on a different subset of the overall fleet.

Kernel tunables and other low-level configuration

Over time, as you become more familiar with post the data service(s) your are operating, you will likely need to tune certain elements of the kernel or other low-level configuration to get the most bang-for-buck out of a given substrate. What exactly you need to tune will depend on the service and the particular OS flavour you run, but a few things to keep in mind are:

vm.overcommit - this is a kernel tunable that controls how much memory the kernel will allow to be allocated. The default is to allow overcommitting, which can lead to OOMs. I recommend setting this to 2, which will allow overcommitting of swap space, but not of physical memory, or 0, which will disable overcommitting entirely. This is a tradeoff between performance and safety, and will depend on your workload.
vm.overcommit_ratio - this is a kernel tunable that controls how much memory the kernel will allow to be allocated, as a percentage of physical memory. The default is 50, but you probably want this to be around 80/90.
vm.swappiness - this is a kernel tunable that controls how aggressively the kernel will swap out memory to disk. The default is 60, but you probably want this to be around 10/20.
noatime,nodiratime - these are mount options that control whether the kernel will update the access time of files and directories. The default is to update the access time, but this can be expensive for databases - it will depend on your data service(s), but this can shave some cycles if you don’t need atime.
net.ipv4.tcp_timestamps - this is a kernel tunable that controls whether the kernel will use TCP timestamps. If you have some high throughput services, it can be somewhat expensive to set these. Again, if your software doesn’t require the use of TCP timestamps, you can disable them for a small bump.

Conclusion

In my career building and operating DBaaS’ of all shapes and sizes, across many different product offerings, I’ve found that the above have helped me build and operate reliable, scalable and performant data services. I hope that you find them useful too.

Considering building your own DBaaS, or starting out on this journey? More than happy to chat. Drop me a line at any of my socials above.

Matt Blewitt