Unreasonably Effective Patterns

2020-10-11

Much of my current job is maintaining and enhancing control planes for Heroku’s managed data services. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: state machines (and associated state-transition tables), transducers, and re-entrant and idempotent operations.

Unfortunately, I failed to take Jaana’s career advice to make both state and coordination someone else’s problem, and now my job revolves around both, at scale. We want to maximise the availability (and meaningful availability at that) of these services and minimise time to recovery. We also want to ensure recovery is automatable, repeatable and safe.

This sounds like a tall order, but there are three patterns, amongst others, that we use repeatedly to make both our lives as operators and our customers' lives easier. Let’s explore them.

State Machines

The humble state machine is a powerful way of modelling components of a system, including physical systems. Let’s take a naïve example, that of a managed Postgres service:

A state machine diagram showing the flow from provisioning, to provisioned, to deprovisioned.

Here, we start provisioning the service. Once provisioned, we enter into a set of internal states: available, uncertain and unavailable. The goal is to maximise the amount of time a service spends in available. We use uncertain to represent a state where the service has temporarily missed a check-in with our monitoring tooling, and unavailable is where the service is consistently unreachable.

It is important at this point to define healthy and unhealthy states, and stable and unstable states. Healthy states are states that we desire a service to be in, while unhealthy states are states we want to get out of. Stable states are states we expect to be in for a while, and unstable states are states we expect to pass through.

The states above break down as follows:

State Healthy? Stable?
Provisioning
Provisioned
Available
Uncertain
Unavailable
De-provisioning
De-provisioned

Being able to reason about the state of a service in this way is powerful – we know that a service can only be in one state at a time, there are certain expectations about those states and we can represent the states and the transitions between them as a directed graph.

In reality, there are many more states and sub-states that a service can be in, but these are the big ones. If a service stays in an unstable state for too long, something is wrong. If a service is in an unhealthy state, something is wrong. Additionally, we model almost all system components as state machines, from the infrastructure the services run on through to service-specific entities, like Apache Kafka principals or TLS certificates.
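The "too long in an unstable state" rule can be sketched as a simple time limit per state. This is a minimal illustration, not our actual implementation: the choice of which states carry a limit, the limits themselves and the is_stuck helper are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical time limits for states we only expect to pass through.
# States we expect to stay in for a while (e.g. "available") have no entry.
MAX_TIME_IN_STATE = {
    "provisioning": timedelta(minutes=30),
    "uncertain": timedelta(minutes=5),
    "deprovisioning": timedelta(minutes=30),
}


def is_stuck(state: str, entered_at: datetime, now: datetime) -> bool:
    """Return True if a service has sat in a time-limited state for too long."""
    limit = MAX_TIME_IN_STATE.get(state)
    if limit is None:
        # No limit recorded: the state is one we are happy to remain in.
        return False
    return now - entered_at > limit
```

A check like this only needs the current state and the time it was entered, which is one reason recording how long a service has been in a state is so useful.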

Being able to reason about what state a service has come from (and how long it was in that state) and what state we want to transition to has significant upsides – we can ensure that we do not transition to an invalid state, and that we must transition through other states to get to our desired one. This is especially important for eliminating cases where a service may lose writes during a failover or upgrade.
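A state-transition table makes these guarantees explicit. The sketch below uses the states from the example above, flattened into a single level for brevity; the specific edges in TRANSITIONS are assumptions for illustration, as are the names transition and InvalidTransition.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "provisioning"
    AVAILABLE = "available"
    UNCERTAIN = "uncertain"
    UNAVAILABLE = "unavailable"
    DEPROVISIONING = "deprovisioning"
    DEPROVISIONED = "deprovisioned"


# Hypothetical state-transition table: each state maps to the set of states
# it may move to next. Anything not listed is an invalid transition.
TRANSITIONS = {
    State.PROVISIONING: {State.AVAILABLE, State.DEPROVISIONING},
    State.AVAILABLE: {State.UNCERTAIN, State.DEPROVISIONING},
    State.UNCERTAIN: {State.AVAILABLE, State.UNAVAILABLE, State.DEPROVISIONING},
    State.UNAVAILABLE: {State.AVAILABLE, State.DEPROVISIONING},
    State.DEPROVISIONING: {State.DEPROVISIONED},
    State.DEPROVISIONED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested transition is not in the table."""


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the table does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

With this table, a service cannot jump straight from available to unavailable; it has to pass through uncertain first, which is the kind of "must transition through other states" constraint described above.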

In order to actually do the transition work, we look at the next pattern: transducers.

Transducers

Transducer is a collective term for sensors and actuators – elements of a system that produce a signal to act on and perform an action, respectively.

In the control plane, we map signals (and combinations of signals) to actions. For example, should a customer wish to promote their replica to a writable leader, this signal is set on the service. The control plane receives the signal, and then starts the actuator to execute the steps involved. Some of this model is also explored in an earlier post, Riding the Risk Railway.

By decoupling the signal from the action, we can develop a mapping of signals to actions, and also represent the relative priority of actions, as well as invalid signal combinations.

This decoupling also increases testability – we can inject different entities in different states into the action to determine what will happen, as a form of fuzzing. This has, unsurprisingly, uncovered bugs both in the control plane and in the actions themselves.

One can imagine the mapping of signals to actions as almost a complex decision table, which allows us to prove that a signal like NeedsUnfollow always produces the execution of the Unfollow action, which helps significantly with the problem of coordination.
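As a rough sketch of such a decision table: only the NeedsUnfollow signal and the Unfollow action come from this post. The other signals, the priorities, the invalid combination and the next_action function are hypothetical, chosen purely to show the shape of the mapping.

```python
from typing import Iterable, Optional

# Signal -> (action, priority). A lower number means the action runs first.
# Only NeedsUnfollow -> Unfollow is taken from the post; the rest is made up.
SIGNAL_ACTIONS = {
    "NeedsDeprovision": ("Deprovision", 0),  # hypothetical
    "NeedsUnfollow": ("Unfollow", 1),
    "NeedsRestart": ("Restart", 2),  # hypothetical
}

# Hypothetical combinations of signals that must never be acted on together.
INVALID_COMBINATIONS = [
    frozenset({"NeedsUnfollow", "NeedsDeprovision"}),
]


def next_action(signals: Iterable[str]) -> Optional[str]:
    """Pick the highest-priority action for the signals set on an entity."""
    active = frozenset(signals)
    for combo in INVALID_COMBINATIONS:
        if combo <= active:
            raise ValueError(f"invalid signal combination: {sorted(combo)}")
    # An unknown signal raises KeyError, so only expected actions are chosen.
    candidates = [SIGNAL_ACTIONS[signal] for signal in active]
    if not candidates:
        return None
    action, _priority = min(candidates, key=lambda pair: pair[1])
    return action
```

Because the table is plain data, it can be checked exhaustively: every signal, and every combination we care about, can be fed through next_action in a test to confirm it yields the expected action or is rejected.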

As each entity in the control plane has signals applied to it, only expected actions happen, and assertions prevent unexpected actions from taking place. For us, the ideal case is that no action happens directly – operators only give signals to the control plane for actions to be scheduled and executed appropriately. This is to make the system safer to operate.

We’ve been able to model both states and the signals that lead to actions, but what about the actions themselves, and how can we make them safer? This brings us to the final pattern: re-entrancy and idempotency.

Re-entrant and Idempotent Actions

Re-entrancy and idempotency are not strictly patterns, but rather properties of operations. However, in our control plane these properties are so desired that they appear again and again, much like a pattern.

In short, re-entrancy is the ability for an operation to be completed successfully, even if a previous invocation was interrupted, and idempotency is the ability for an operation to be called multiple times while producing the same end result as a single call.

To achieve this, operations must follow two rules: be wrapped in a database transaction (so we can ROLLBACK in case of an unexpected error) and be specific about data being operated on (e.g. avoiding non-idempotent queries like SELECT id FROM services ORDER BY updated_at DESC LIMIT 1).

This allows us to recover gracefully from interruptions, such as networking failures, restarting of hosts and so on. With the careful tracking of state provided by the state machine modelling, we can avoid getting “wedged” and safely execute operations multiple times. In an “at-least-once” architecture, these properties are necessary by design.
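The two rules can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions rather than our production code: the services table and its id column echo the query above, while the role column, its values and the mark_promoted function are hypothetical.

```python
import sqlite3


def mark_promoted(conn: sqlite3.Connection, service_id: int) -> bool:
    """Promote one specific service; safe to call again after an interruption."""
    # Used as a context manager, the connection commits on success and
    # rolls back if an exception is raised inside the block.
    with conn:
        cursor = conn.execute(
            # The row is named explicitly by id, and the role guard makes a
            # repeated call a no-op instead of a second promotion.
            "UPDATE services SET role = 'leader' WHERE id = ? AND role = 'replica'",
            (service_id,),
        )
        return cursor.rowcount == 1


# Hypothetical schema and data, only to make the example runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (id INTEGER PRIMARY KEY, role TEXT NOT NULL)")
conn.execute("INSERT INTO services (id, role) VALUES (1, 'replica')")
conn.commit()

first = mark_promoted(conn, 1)   # performs the promotion
second = mark_promoted(conn, 1)  # no-op: the service is already a leader
```

Calling the function a second time changes nothing and reports that no work was needed, so a retry after a crash or a duplicate delivery leaves the service in the same state as a single successful call.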

Conclusion

By modelling all entities handled by the control plane as state machines, modelling the interactions between the entities and state machines as transducers, and finally ensuring those interactions are both re-entrant and idempotent, we have a safer and more resilient control plane.

Sometimes things do go wrong, but with this system, we are able to determine why quickly, execute the correct actions, and restore services to a healthy state in a repeatable fashion. There are also other patterns in play, such as PID controllers and state-space representations, that are not explored here but are also unreasonably effective when developing control planes for stateful services.