Everything I Know About Operations, I Learned From NHS 111

2020-03-10

Ever heard someone say “It’s only software/money/<trivial thing>, not life or death”, in the context of incidents at your company? Although mostly true, I want to talk about a time in my career when sometimes, just sometimes, it was the latter, and how it shaped my approach to operating and owning services.

NHS 111 is a service folks in the UK can use to get help with non-emergency medical issues. Many years ago, I worked for a company that helped staff this service, and for just over 12 months I worked a variety of shift patterns (such as 2nd, 3rd, overnight and split) to help the British public with their medical maladies.

Notably, I’m not a nurse or a doctor. The systems in place that enabled debugging of the human body, and the human element of helping those in distress, have shaped the way I approach service ownership and operations to this day. Let’s get started with the first of those: decision support.

Pathways

Imagine the most complex system you hold the pager for, and try to hold that system in your mind. It is probably orders of magnitude less complex than the human body, and my colleagues and I were essentially on-call for the human body.

A call to NHS 111 has all the hallmarks of a classic operations problem: incomplete information, lack of direct access to key system components, and a demand for subject matter experts that far outstrips supply. To mitigate these constraints, we made extensive use of a decision support tool called Pathways. It was developed by clinicians, as the subject matter experts, so that the personnel required to support such a service could scale.

Outside of a clinical setting, decision support is generally seen as a tool for business operations rather than for software and security operations. However, the benefits of such a system in a technical setting can be significant.

An example

Let’s take stomach pains as an example, paired with the technological equivalent, increased long-tail latencies. Could be nothing, could be serious.

Rather than expecting folks taking NHS 111 calls to be experts in the system they are on-call for, Pathways attempts to deliver the least risky outcome in the smallest number of steps. As part of that, many folks who have interacted with the service roll their eyes at being asked “are you having severe difficulty breathing?”, but this is a useful razor for separating a life-threatening emergency from something more routine.

In order to rule out something more serious, Pathways essentially removes branches from a tree in order to arrive at a disposition: an outcome that can range from a fast-tracked ambulance to self-medication and rest.

In our technological example, we can apply the USE method as an overview to lop off as many branches in the decision tree as possible, then dive deeper down a branch to seek the most likely cause. This can also be supported by existing playbooks that suggest which avenues are worth exploring.
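As a rough illustration of the idea, here is a minimal sketch of a decision tree for the long-tail latency example. It is not how Pathways works internally; the questions, signal names (`error_rate`, `queue_depth`, `cpu_utilisation`), thresholds and dispositions are all hypothetical, loosely following the USE method's errors, saturation and utilisation checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Signals = Dict[str, float]


@dataclass
class Node:
    """One question in the tree. Leaves are plain strings (dispositions)."""
    question: str
    check: Callable[[Signals], bool]  # True means the answer is "yes"
    yes: Union["Node", str]
    no: Union["Node", str]


# Hypothetical tree: the riskiest branch is ruled in or out first.
TREE = Node(
    question="Is the error rate elevated?",
    check=lambda s: s.get("error_rate", 0.0) > 0.05,
    yes="page the service owner",
    no=Node(
        question="Is any resource saturated?",
        check=lambda s: s.get("queue_depth", 0) > 100,
        yes=Node(
            question="Is utilisation also high?",
            check=lambda s: s.get("cpu_utilisation", 0.0) > 0.9,
            yes="scale out",
            no="check downstream dependencies",
        ),
        no="monitor and review in working hours",
    ),
)


def triage(node: Union[Node, str], signals: Signals) -> Tuple[str, List[str]]:
    """Walk the tree, returning the disposition and the questions asked."""
    asked: List[str] = []
    while isinstance(node, Node):
        asked.append(node.question)
        node = node.yes if node.check(signals) else node.no
    return node, asked


if __name__ == "__main__":
    disposition, asked = triage(TREE, {"queue_depth": 500, "cpu_utilisation": 0.95})
    print(disposition, asked)
```

Each answer discards a whole subtree, so a responder only ever sees the questions that still matter for the situation in front of them.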

Outcome

Although applying the USE method and having some playbooks gets you towards a likely cause, it isn’t as efficient as a well-structured decision tree built to support responders for a given system. However, providing some key questions to focus efforts, as a pseudo-decision support tool, can help significantly. It helps with efficiency (for bonus points, automatically reduce the number of options to explore), and it helps psychologically: you know the questions you are trying to answer are the ones that the experts in the system want answered too.

This reduces cognitive load and increases the feeling of safety a responder can have. However, in order to make effective use of decision support systems, we need sufficient information.

Information (and the lack of it)

One of the key hallmarks of an operations problem is incomplete information. You will not know the states of all the systems involved, and depending on your existing monitoring and observability setup, you may not be able to get an accurate picture. However, despite this imperfect information, we are required to determine the root cause anyway.

In our stomach pain example, the caller may be in such significant pain that communication is difficult. However, not being able to get further information from the caller is useful information in and of itself. Similarly, with our increased long-tail latencies, not being able to reach a service for health metrics is useful information, just as much as the metrics themselves would be.
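One way to put this into practice is to make "could not ask" an explicit result rather than a gap in the data. The sketch below assumes a hypothetical `fetch_metrics` callable and a made-up `p99_latency_ms` field with an arbitrary 500 ms threshold; none of these come from a real monitoring API.

```python
from typing import Callable, Dict


def assess(fetch_metrics: Callable[[], Dict[str, float]]) -> str:
    """Classify a service, treating a failed health check as a finding."""
    try:
        metrics = fetch_metrics()
    except (ConnectionError, TimeoutError) as exc:
        # The service not answering is itself a symptom worth recording.
        return f"unreachable ({type(exc).__name__}): check from another vantage point"

    p99 = metrics.get("p99_latency_ms")
    if p99 is None:
        # Reachable, but the metric we need is missing: a different clue.
        return "reachable but not reporting latency: check instrumentation"
    return "degraded" if p99 > 500 else "healthy"


if __name__ == "__main__":
    def timed_out() -> Dict[str, float]:
        raise TimeoutError("no response")

    print(assess(timed_out))
    print(assess(lambda: {"p99_latency_ms": 900.0}))
```

The point of the design is that every branch returns something a responder can act on, including the branch where no metrics arrived at all.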

As a result, perhaps the biggest thing I learned is that you ultimately have to make do with what is available to you, and you might need to get that information in a roundabout way. On top of dealing with insufficient information, you also have to do all of this under pressure.

Safety

Being on-call for the human body is a psychologically punishing job. Much has been written about the impacts the job has on wellbeing among clinicians, and the knock-on effects it has on patients. Similarly, while on shift for NHS 111, being in a constant state of heightened alertness is exhausting, which can be problematic when mistakes can have outsized impacts.

So is on-call for systems. Being woken in the middle of your sleep cycle plays havoc with your cognition, and can have longer term health effects. For a much longer treatise, I can highly recommend this talk by Niall Murphy, but the key takeaway is that being regularly interrupted or forced into a constant state of heightened alertness has impacts, both on the physical health of those who are on-call and on their ability to do their jobs under pressure.

Mitigation

So how can we mitigate this? For telephone triage, much of this time was limited by working regulations. However, for on-call teams, perhaps the best solution is a follow-the-sun rotation. My favourite resource on this is the Google SRE Book’s chapter on being on-call, especially the concept of “safety”.

By supporting responders in making rational and deliberate decisions over intuitive ones, through the use of decision support, training and other support material, responders can feel safer even in high-stress scenarios. The chapter goes into further detail on other ways to do this. The same applied at NHS 111: training, review and comprehensive decision support empowered call handlers to get to the best outcome while under pressure.

Wrap up

My time working for NHS 111 was formative in ways I couldn’t have imagined, influencing my preferences to ensure those who operate and own systems feel safe, are well supported through decision support, training and processes, and have as much information as possible when dealing with a production incident.

It isn’t life or death, but sometimes it sure feels like it.

Matt Blewitt