Tools for a Culture of Writing

2021-10-30

One of the hardest things we do, as humans, is try and communicate what is going on in our minds to each other. With significant room for misunderstanding, biases, assumptions and cultural differences, communicating with other engineers (or to stakeholders) appears fraught. However, there are tools we can leverage to make ourselves understood, and to smooth the passage of information to makes sure it gets to the right people at the right time.

Point and Call

2021-09-14

It’s 2AM. You’re paged to respond to a failing set of components that you are the Subject Matter Expert (SME) for. Sleepy, you load up the playbook for when the SplineReticulatorBlocked alert has gone off, and start executing. The Incident Commander (IC) is vaguely aware of what you are doing, and checks in now and then.

Unreasonably Effective Patterns

2020-10-11

Much of my current job is maintaining and enhancing control planes for Heroku’s managed data services. This post explores three patterns used to reduce operational burden and increase system safety and resiliency: state machines (and associated state-transition tables), transducers and re-entrant and idempotent operations.

Everything I Know About Operations, I Learned From NHS 111

2020-03-10

Ever heard someone say “It’s only software/money/<trivial thing>, not life or death”, in the context of incidents at your company? Although mostly true, I want to talk about a time in my career when sometimes, just sometimes, it was the latter, and how it shaped my approach to operating and owning services.

Riding the Risk Railway

2020-03-09

When building and operating a user-facing system, especially one that is open to the public, it is important to consider the riskiness of a user, which can also be characterised as trustworthiness. These will typically be negatively correlated, with low trust indicating high risk and vice versa, but this is not always the case.