So, You Want To Build A DBaaS

2023-03-29

So you’ve made the decision to build a Database-as-a-Service product (or something similar). As a result, you are making state and coordination your problem. Here are some things I’ve learned to keep in mind as you embark on this project, in no particular order (other than the first). This is a long one, so a table of contents is provided to skip around to the points that interest you.

Contents

  1. Data integrity above all things, including availability
  2. (Meaningful) Availability above the rest
  3. Solve a real problem
  4. Compliance is a real problem
  5. Set boundaries
  6. Demonstrate mastery
  7. Predictability over performance
  8. Don’t ignore performance
  9. Optimise for 95% of workloads
  10. Maximise your margins
  11. Closing thoughts

1. Data integrity above all things

As a relational-and-non-relational database management system-as-a-service (RaNRDBMSaaS) provider, your customers trust you with their data. Your golden rule, rule 0, first axiom - do not lose a customer’s data.

This extends beyond data loss to cover the full Parkerian Hexad: confidentiality, possession (or control), integrity, authenticity, availability and utility.

These are the table stakes of operating this product. So follow all the rules of backups (3-2-1, testing backups, continuous assurance of backup security and utility etc) and ensure you are checking for data corruption (e.g. using amcheck for PostgreSQL, mysqlcheck for MySQL). Your business will simply cease to exist if you do not meet the hexad.
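As a minimal sketch of the "continuous assurance" part, assuming backups are plain files and that a SHA-256 digest was recorded when each one was taken (the function names and the manifest layout are hypothetical, not taken from any particular backup tool), a periodic job could re-hash what is in storage and compare:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest: dict[str, str], backup_dir: Path) -> list[str]:
    """Return the names of backups that are missing or whose digest has changed.

    `manifest` maps a backup file name to the hex digest recorded at backup time
    (a hypothetical layout chosen for this sketch).
    """
    failures = []
    for name, expected in manifest.items():
        candidate = backup_dir / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(name)
    return failures
```

A matching digest only tells you the bytes have not changed since the backup was taken. It does not replace restoring the backup into a scratch instance and running the corruption checks mentioned above.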

A notable part of this is control. Customers always own their data, and should be able to both import and export their data from your product as smoothly as possible - do not hold data hostage. You want your customer to stick with your product because of how good it is, not because it is impossible to export. This level of control may or may not extend to how the data is encrypted, such as using customer managed keys.

A side effect of walking the Parkerian Hexad is that you will need to implement things like deletion receipts so that customers (and you) can continue to assert customers maintain control over their data should they wish to destroy data, including backups.
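What a deletion receipt looks like is up to you. One possible shape, sketched here under my own assumptions (the field names, the canonical JSON encoding and the use of HMAC-SHA256 are all illustrative choices, not a standard), is a small signed record of what was destroyed and when:

```python
import hashlib
import hmac
import json


def _canonical(body: dict) -> bytes:
    """Serialise deterministically so the same receipt always signs the same bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def issue_receipt(signing_key: bytes, customer_id: str, resource_id: str, deleted_at: str) -> dict:
    """Build a receipt (hypothetical fields) and sign it with HMAC-SHA256."""
    body = {"customer_id": customer_id, "resource_id": resource_id, "deleted_at": deleted_at}
    signature = hmac.new(signing_key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def verify_receipt(signing_key: bytes, receipt: dict) -> bool:
    """Recompute the signature over every field except the signature itself."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    expected = hmac.new(signing_key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(receipt.get("signature", "")))
```

HMAC is symmetric, so only whoever holds the key can verify the receipt. If customers need to verify receipts independently, an asymmetric signature scheme would be the more appropriate choice.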

I recommend using tried-and-true backup technologies to support this goal. If you are developing a novel database technology, spend a large amount of your innovation tokens on getting your data integrity story rock solid.

Remember: trust is earned in drops and lost in buckets.

2. (Meaningful) Availability above the rest

The Parkerian Hexad above includes availability as one of the pillars. This can mean a lot of different things to different people, but for your customers this is about being able to access their data when they want to.

Perhaps the best way I’ve seen of conceiving of this is the meaningful availability paper - capturing what users experience in a proportional and actionable way.

Similarly, you may wish to consider implementing an uptime detector based around the phi accrual failure detector.

Why these two in particular? These represent the “lived experience” of your customers most closely - otherwise it is easy to get trapped in synthetic testing that doesn’t expose “perceived failure”, like long-tail latencies.
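To make the phi accrual idea concrete, here is a minimal sketch. It assumes heartbeat inter-arrival times are roughly normally distributed, and the class name, window size and the floor on the standard deviation are my own choices for illustration. Rather than a binary up/down answer, it outputs a suspicion level that grows the longer a heartbeat is overdue:

```python
import math
from collections import deque
from statistics import fmean, pstdev


class PhiAccrualDetector:
    """Suspicion level (phi) based on how late the next heartbeat is."""

    def __init__(self, window_size: int = 100, min_std: float = 0.1):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_heartbeat = None
        self.min_std = min_std  # assumed floor so perfectly regular beats don't give std = 0

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat observed at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Return -log10 of the probability that a heartbeat is still to come."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = fmean(self.intervals)
        std = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_heartbeat
        # Upper tail of the normal distribution: P(interval > elapsed).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # clamp to avoid log10(0) on underflow
        return -math.log10(p_later)
```

You would then pick a phi threshold above which the node counts as "down" for your availability figures. The threshold is a trade-off between detection speed and false positives, and needs tuning against your own heartbeat data.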

Once you’ve cracked how you are measuring (your “SLIs”), you will need to establish your objectives. A common approach is to pick a suitably large number of “nines” - 99.99% availability is a common goal. As per uptime.is, 99.99% affords you a little over 4 wall-clock minutes of non-availability to burn per month. It is important that you burn these minutes - customers should be conditioned to tolerate less than perfect uptime within a given time range, as it will help them build more robust interfaces to your product. You’ll likely need these minutes for maintenance operations (software patching etc).
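The arithmetic behind that budget is simple enough to keep in a helper. This is a small sketch, assuming a 30-day month (the function name and default are mine):

```python
def downtime_budget_minutes(slo_percent: float, period_days: float = 30.0) -> float:
    """Minutes of allowed unavailability in the period for a given availability SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99):
        print(f"{slo}% over 30 days: {downtime_budget_minutes(slo):.2f} minutes")
```

For 99.99% over 30 days this works out to roughly 4.32 minutes, which is why a single slow failover can consume most of a month's budget.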

You will also find you will need to make some architectural decisions on how to maximise availability - the boundary between a data service and its clients is fraught with challenges. You may need to provide some educational material to your customers on how to architect for high availability and resiliency with your product offering.

3. Solve a real problem

What does it mean to solve a real problem? This means that you have done a sales safari and understand the pain you are trying to solve with your DBaaS product. If you aren’t solving real pain, forget about it - you don’t have product-market fit.

Some good examples of problems that DBaaS can solve are:

  1. Providing a “no-ops”/“low-ops” experience for operationally complex software.
  2. Allowing scaling businesses to comply with regulatory requirements without additional effort.
  3. Solving novel data storage and query requirements.

This isn’t exhaustive, but they need to be real problems you have observed in the wild. Aiven was born out of solving the same problems over and over in consulting engagements. QuestDB was born out of a need for high-performance time series ingest. Neon was born out of a deep desire from Postgres users to have a split storage/compute layer. Solve a real problem.

4. Compliance is a real problem

As mentioned above, one option to pursue is offering a DBaaS that meets various bars for regulation. This is especially useful for companies that wish to serve US government needs, healthcare software providers and so on. Across different countries, there are different requirements to meet.

Often, setting up a data service to comply with these regulations is quite challenging, such as picking ciphers for on-disk encryption and TLS, enforcing appropriate management of user accounts, data deletion receipts and passing relevant audits. Doing this to a high enough level (such as PCI Level 1, HIPAA etc) while meeting the table stakes discussed in Data integrity above all things, including availability solves a real problem for many businesses.

As part of this, have “enterprise features” as standard - it’s far easier to build with the most restrictive environments and regulations in mind and feature flag them out for folks that don’t need it than to make a less-compliant product more compliant. You don’t want to end up on SSO Tax, do you?

5. Set boundaries

Good advice in life and in software - set boundaries, and keep them. In most cases, this is about setting boundaries between your users and your product, typically along the lines of operational limits (both hard and soft limits) and drawing a bright line with the shared responsibility model.

From experience, if you give an inch, a customer will take a mile. They will do things with your product that you could not even conceive of. A personal favourite was storing Minecraft world state in a bytea field in Postgres. A single row taking up TiBs.

Part of having boundaries is enforcing them, and this is a hard thing to do. Adequate warnings to customers in advance of turning off features, like writes, are necessary, but also explain why you have these limits. Give customers a chance to course correct and provide an escape hatch.
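A soft limit that warns and a hard limit that turns off writes can be modelled very simply. The following is a sketch only, with hypothetical names and storage as the example resource:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OK = "ok"
    WARN = "warn"                  # notify the customer and explain the limit
    BLOCK_WRITES = "block_writes"  # enforce the boundary


@dataclass(frozen=True)
class StorageLimit:
    soft_bytes: int  # crossing this triggers warnings only
    hard_bytes: int  # crossing this turns off writes

    def evaluate(self, used_bytes: int) -> Action:
        """Map current usage onto the action the platform should take."""
        if used_bytes >= self.hard_bytes:
            return Action.BLOCK_WRITES
        if used_bytes >= self.soft_bytes:
            return Action.WARN
        return Action.OK
```

The gap between the two thresholds is the customer's window to course correct, so it should be wide enough that a warning arrives well before enforcement does.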

However, be firm. There are some workloads that will eat at your margins, some customers that will continue to push your boundaries and always demand more. It is totally okay to fire customers that no longer fit your profile - give them a smooth off ramp.

6. Demonstrate mastery

Your customers pay you for expertise. Demonstrate this every day. Crunchy Data do one thing and one thing only - Postgres. By focusing on a single data service, they are deep experts in the technology and how it runs in a number of different configurations across different topologies.

As you increase the number of products in your portfolio, you will need to increase your team size appropriately to maintain enough depth. If you cannot demonstrate mastery, this is probably not the business for you.

Good ways of demonstrating mastery range from hiring core contributors to the project you are offering as a service (if going the FOSS-as-a-service route) and contributing regularly to development through mailing lists etc, to simply producing regular content about tricky problems you’ve encountered and how to solve them.

Similarly, demonstrate mastery of the substrate(s) you have chosen to run on - know the quirks of the major (and minor) cloud providers and ultimately hide these from your customers.

7. Predictability over performance

Customers generally value predictability over raw performance in 95% of cases. This means having a service that serves their data within a stable latency band (for example, a consistent 95th percentile) the majority of the time, rather than having extremely high performance that is not predictable.
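To illustrate why averages hide unpredictability, here is a small sketch with made-up latency samples (the numbers are invented for illustration, not measurements). Both services have the same mean latency, but very different tails:

```python
from statistics import fmean, quantiles


def p95(samples: list[float]) -> float:
    """95th percentile using the standard library's default (exclusive) method."""
    return quantiles(samples, n=100)[94]


# Invented samples in milliseconds: one steady service, one fast but spiky.
steady = [10.0] * 100
spiky = [5.0] * 95 + [105.0] * 5

if __name__ == "__main__":
    for name, samples in (("steady", steady), ("spiky", spiky)):
        print(f"{name}: mean={fmean(samples):.1f}ms p95={p95(samples):.1f}ms")
```

The spiky service is faster for most requests, yet its tail is an order of magnitude worse. That tail is what customers tend to notice and build timeouts around.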

Similarly, all operations should be predictable or otherwise explainable to customers. A good example of this is logical backups with Postgres. If you have a timeout, or some memory limits (e.g. max_locks_per_transaction), eventually pg_dump will hit the timeout or resource constraint and fail as the number of objects grows, but if the service is oscillating around the “failure point”, it is less predictable. It is better to set an artificial limit and fail fast.
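A fail-fast guard of this kind can be very small. In this sketch the limit of 50,000 objects is an arbitrary placeholder and the names are hypothetical. In practice the count might come from the system catalogs (for example, counting rows in pg_class), and the limit from your own testing of where logical backups stop completing reliably:

```python
class TooManyObjects(Exception):
    """Raised before a logical backup starts when it is unlikely to finish."""


def check_dump_is_feasible(object_count: int, max_objects: int = 50_000) -> None:
    """Refuse up front instead of letting the backup time out part way through.

    `max_objects` is an artificial, documented limit (placeholder value here).
    """
    if object_count > max_objects:
        raise TooManyObjects(
            f"{object_count} objects exceeds the supported limit of {max_objects} "
            "for logical backups"
        )
```

The error message is part of the product: it tells the customer which limit they hit, so the failure is explainable rather than an opaque timeout.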

8. Don’t ignore performance

That said, don’t ignore performance. Customers are paying you for expertise and part of that is to eke out as much performance as possible with as little customisation as possible.

Set good defaults, benchmark, and refine them as hardware generations change. You may need particular performance profiles for given workloads, or even depending on the substrate(s) you operate the service on. Ensure that you leverage all possible tunables, from managing maximum concurrent connections to determining if socket options like TCP_NODELAY have a statistically significant impact on workload performance.
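One way to check whether a tunable has a statistically significant effect is a permutation test over two sets of benchmark runs. This is a sketch using only the standard library. The function name and defaults are mine, and a permutation test is just one reasonable choice of test, not the only one:

```python
import random
from statistics import fmean


def permutation_p_value(baseline, candidate, rounds: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in mean between two samples.

    Shuffles the pooled measurements and counts how often a random split gives
    a difference at least as large as the one actually observed.
    """
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    observed = abs(fmean(candidate) - fmean(baseline))
    pooled = list(baseline) + list(candidate)
    split = len(baseline)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[split:]) - fmean(pooled[:split]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)
```

A small p-value suggests the difference between the two configurations is unlikely to be noise. It says nothing about whether the difference is large enough to matter, so look at the effect size as well.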

Try and make your benchmarks as repeatable as possible, and publicly talk about performance improvements for measured workloads.

9. Optimise for 95% of workloads

Generally speaking, you will want to optimise for the 95% of workloads you encounter. Depending on the data service you run, there will be a typical “shape” of work. For example, a data service that is an OLTP database will probably be evenly split between reads and writes, but an OLAP database will typically be significantly more read heavy with intense computation.

You will need to determine what the shape is for most of your customers, and work towards making that a great experience on your service. You will lose customers who need things outside of that shape, but they are likely not worth pursuing versus getting extremely good at serving the 95%.

10. Maximise your margins

DBaaS is a hideously marginal business, and the economics of your business being successful are heavily dependent on your margins. Outside of key differentiators, such as horizontal scalability a la Snowflake or PlanetScale, the main thing you need to do is maximise them.

This means heavy investment in automation and self-healing (to reduce operator overhead), and careful cost controls across your main billable resources (typically compute, storage and network). Tag your resources and ensure you have insights into where your expenditure is going. A common refrain from DBaaS companies I’ve worked at has been “how will this impact cost-to-serve” - running data services is an expensive business, so every little thing helps.
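As a sketch of what tagging buys you, here is one way to roll billing line items up by tag and turn that into a margin figure. The line-item layout, tag key and numbers are all hypothetical, since every cloud provider exports billing data differently:

```python
from collections import defaultdict


def cost_by_tag(line_items: list[dict], tag_key: str) -> dict[str, float]:
    """Sum cost per value of `tag_key`; untagged spend is surfaced explicitly."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


def gross_margin(revenue: float, cost_to_serve: float) -> float:
    """Fraction of revenue left after the direct cost of serving the customer."""
    return (revenue - cost_to_serve) / revenue


if __name__ == "__main__":
    items = [  # invented example data
        {"cost": 30.0, "tags": {"customer": "acme"}},
        {"cost": 20.0, "tags": {"customer": "acme"}},
        {"cost": 5.0, "tags": {}},
    ]
    costs = cost_by_tag(items, "customer")
    print(costs, gross_margin(200.0, costs["acme"]))
```

Keeping an explicit "untagged" bucket is deliberate: spend you cannot attribute is spend you cannot reason about when someone asks how a change will impact cost-to-serve.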

Closing thoughts

Despite how complex operating a DBaaS is, this is definitely a niche I’m glad I’ve found myself in. The Future Database from PlanetScale is a great manifesto on what the future can, and maybe should, look like for DBaaS products, but for the rest of us mortals, maybe this can be somewhat of a guide.

I have another 15 points or so to cover, which are more technical and opinionated (e.g. exposing logs and Prometheus-compatible metrics, modelling everything as finite state machines), but that is a story for another time. I hope this was interesting and useful to you.