Point and Call

2021-09-14

It’s 2AM. You’re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the SplineReticulatorBlocked alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.

Unreasonably Effective Patterns

2020-10-11

Much of my current job is maintaining and enhancing control planes for Heroku’s managed data services. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: state machines (and associated state-transition tables), transducers and re-entrant and idempotent operations.

Everything I Know About Operations, I Learned From NHS 111

2020-03-10

Ever heard someone say “It’s only software/money/<trivial thing>, not life or death”, in the context of incidents at your company? Although mostly true, I want to talk about a time in my career when sometimes, just sometimes, it was the latter, and how it shaped my approach to operating and owning services.

Riding the Risk Railway

2020-03-09

When building and operating a user-facing system, especially one that is open to the public, it is important to consider the riskiness of a user, which can also be characterised as trustworthiness. These will typically be negatively correlated, with low trust indicating high risk and vice versa, but this is not always the case.

Software Sprezzatura

2020-01-26

Sprezzatura is “a certain nonchalance, so as to conceal all art and make whatever one does or says appear to be without effort and almost without any thought about it”, coined by Castiglione in 1528’s The Book of the Courtier.