Tools for a Culture of Writing

2021-10-30

One of the hardest things we do, as humans, is try to communicate what is going on in our minds to each other. With significant room for misunderstanding, biases, assumptions and cultural differences, communicating with other engineers (or with stakeholders) can appear fraught. However, there are tools we can leverage to make ourselves understood, and to smooth the passage of information to make sure it gets to the right people at the right time.

This is going to be a long one, so here is a table of contents if you wish to skip to a specific section:

  1. Preamble
  2. The RFC
  3. The Decision Record and Decision Log
  4. The Pre-mortem/Prospective and Post-mortem/Retrospective
  5. Closing Thoughts

Preamble

Humanity has a long history of oral tradition, and only recently have we had the ability to distribute our ideas, feelings and opinions in a more efficient way – through writing.

Many engineering organisations aspire to have a culture of writing – where decision making and communication happens primarily through writing, versus through synchronous means such as meetings. There are good reasons for this – writing and reading are generally asynchronous processes, writing can contain visual aids such as diagrams, and, most importantly, writing produces an artefact.

Artefacts are essential. Think of books in a library, or research papers in a journal. They can be referred to, they can be discovered (given correct organisation, which is the domain of library science) and, for engineers, they are greppable.

In pathological organisations that are power oriented, knowledge is power, so generating artefacts keeps power in the hands that desire it. In bureaucratic organisations, although there are artefacts, they are often extremely numerous and non-discoverable, and perhaps generating new artefacts is a struggle due to coordination headwinds. However, in generative organisations, information flow is paramount, and this typically requires a culture of writing.

So, we come to specific tools. I am not going to use the common umbrella term memorandum here, mostly due to negative connotations about memos that have grown from bureaucratic organisations. Additionally, it is often not appropriate to use one of these tools – in many cases it is much better to execute on something first (e.g. a prototype or spike) versus spending time writing. That said, the first is the Request For Comments, or RFC.

The RFC

The Request for Comments, or RFC, is perhaps one of the most well known forms of document for software engineers, primarily being associated with the Internet Engineering Task Force (my personal favourite RFC is RFC 3339).

What is the purpose?

The purpose of the RFC is to communicate “new concepts [or] information” to peers. They are intended to facilitate discussion around a concept, idea or other piece of information with peers.

What does it look like?

IETF RFCs are quite terse, and have a defined structure (see RFC 2026). Most organisations do not need this level of ceremony around RFCs, and a typical template might look like the following:

ID: RFC-0000001 // A unique serial number for the RFC.

Title: Beware of stobor! // A short title, max 60 characters or so

Metadata: // A block containing information about the document
    Author(s): Robert Heinlein           // Primary author(s) of the RFC.
    Advisor(s): C.W. Longbottom          // Collaborators and commenters on the RFC.
    Status: [Draft] Discussion Finalised // Status of the Document.
    Updated At: 2021-10-30               // When the document was last updated.

Summary:
    // Bottom line, up front. Short, snappy and if there is to be a call-to-action, this is where it needs to go.
    When arriving through a tunnel in the sky, you MUST beware of stobor!

Glossary:
    // A glossary of terms used in the RFC.

Body:
    // Further headings and detail to the RFC.

This is a template I’ve had success with in a variety of formats, including plain text. The status of the RFC is indicated by the selected item – Draft when it is still being worked on, Discussion when under peer review, and Finalised when the RFC is “done”, and may be published to the wider organisation.

I recommend adding diagrams where suitable, and ensuring the RFC is under version control.
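If RFCs are kept as plain text under version control, the bracketed status convention above is simple enough to read mechanically. The following is a minimal sketch in Python, not part of the template itself: the function name `parse_status` and the rule that exactly one option must be bracketed are assumptions made for illustration.

```python
import re

# Matches a template status line such as:
#     Status: [Draft] Discussion Finalised // Status of the Document.
# The trailing "// ..." comment is optional and is discarded.
STATUS_LINE = re.compile(r"^\s*Status:\s*(?P<options>.+?)\s*(?://.*)?$")


def parse_status(line: str) -> tuple[str, list[str]]:
    """Return (selected_status, all_options) from a status line.

    Assumption: exactly one option is wrapped in square brackets.
    """
    match = STATUS_LINE.match(line)
    if match is None:
        raise ValueError(f"not a status line: {line!r}")

    tokens = match.group("options").split()
    selected = [t[1:-1] for t in tokens if t.startswith("[") and t.endswith("]")]
    if len(selected) != 1:
        raise ValueError(f"expected exactly one bracketed status in {line!r}")

    options = [t.strip("[]") for t in tokens]
    return selected[0], options


if __name__ == "__main__":
    line = "    Status: [Draft] Discussion Finalised // Status of the Document."
    print(parse_status(line))
```

A check like this could run in a pre-commit hook to catch documents where no status, or more than one, has been selected.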

Why should I use it?

Because RFCs are one of the best ways to communicate what is in your head to your peers, and therefore expand your sphere of influence. Good RFCs are “viral” – they can end up shared far and wide through many levels of the organisation. RFCs are excellent tools for persuasion, and can often help settle disagreements due to their collaborative, instead of combative, nature.

RFCs are extremely useful for collaboration when fleshing out an idea or proposal that isn’t yet fully formed.

When shouldn’t I use it?

Don’t use RFCs to record decisions (that comes next!), and don’t use them for everything – it can be easy to overwhelm folks with an unending number of RFCs to review and collaborate on, moving back towards a bureaucratic culture. Remember – we’re trying to be generative and reduce coordination headwinds here!

How should I distribute it?

Typically, an RFC is emailed/Slack’d/whatevered to the group of initial reviewers in the Draft stage. Once initial collaboration is complete, the RFC can be sent far and wide.

I also recommend having a central index of RFCs that is searchable by ID and free text, as well as filterable by status. You may wish to consider additional extensions such as tagging/theming to enhance discoverability.
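As a rough illustration of what such an index needs to support, here is a small in-memory sketch in Python. The names `RfcEntry` and `RfcIndex` are hypothetical, and a real index would more likely sit on top of a wiki, a database or a search service; the point is only the three access paths: lookup by ID, free-text search, and filtering by status or tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RfcEntry:
    """One row of the index (hypothetical shape, mirroring the template)."""
    rfc_id: str
    title: str
    status: str  # e.g. "Draft", "Discussion" or "Finalised"
    summary: str = ""
    tags: frozenset[str] = frozenset()


class RfcIndex:
    """In-memory index: lookup by ID, free-text search, status/tag filters."""

    def __init__(self) -> None:
        self._by_id: dict[str, RfcEntry] = {}

    def add(self, entry: RfcEntry) -> None:
        if entry.rfc_id in self._by_id:
            raise ValueError(f"duplicate RFC ID: {entry.rfc_id}")
        self._by_id[entry.rfc_id] = entry

    def get(self, rfc_id: str) -> Optional[RfcEntry]:
        return self._by_id.get(rfc_id)

    def search(
        self,
        text: str = "",
        status: Optional[str] = None,
        tag: Optional[str] = None,
    ) -> list[RfcEntry]:
        """Case-insensitive substring match on title and summary."""
        needle = text.casefold()
        results = []
        for entry in self._by_id.values():
            if status is not None and entry.status != status:
                continue
            if tag is not None and tag not in entry.tags:
                continue
            haystack = f"{entry.title}\n{entry.summary}".casefold()
            if needle in haystack:
                results.append(entry)
        return sorted(results, key=lambda e: e.rfc_id)


if __name__ == "__main__":
    index = RfcIndex()
    index.add(RfcEntry("RFC-0000001", "Beware of stobor!", "Finalised",
                       "You MUST beware of stobor.", frozenset({"safety"})))
    for hit in index.search("stobor", status="Finalised"):
        print(hit.rfc_id, hit.title)
```

Substring matching is a deliberate simplification here; anything beyond a few hundred documents would probably want a proper full-text index.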

So you’ve written an RFC, and folks seem to be onboard with your proposal. What’s next? It’s time to make some decisions, and for that, we will need Decision Records and the Decision Log.

The Decision Record and Decision Log

Quite a lot has been written about Decision Records and what they are, notably by the Architecture Decision Records group and in Michael Nygard’s article, so I won’t reiterate the specifics here. I will, however, go into a bit more detail about the associated Decision Log.

What is the purpose?

Decision Records are intended to capture information about a decision being made. The decision can be architectural, made in the context of a business, or even part of your day-to-day life (though maybe not what colour socks you decided to put on today).

A Decision Log is a set of decision records, in reverse chronological order, that may or may not be linked to each other. The Decision Log is used both as an index of decisions and as a way to navigate what decisions led to others.

What does it look like?

Like RFCs, most organisations do not need a lot of ceremony around DRs. A popular Markdown template can be found here, and what follows is a simple plaintext one:

ID: DR-0000001 // A simple serial number for the Decision Record

Title: Preventing glibc change induced index corruption // A short title, max 60 characters or so

Metadata: // A block containing information about the document
    Author(s):                                      // Primary author(s) of the DR.
    Decider(s):                                     // Decider(s) of the DR.
    Status: [Proposed] Rejected Accepted Superseded // Status of DR.
    Updated At: 2021-10-30                          // When the DR was last updated.

Problem Statement:
    // Bottom line, up front. Describe the problem and context briefly.

Decision Drivers / Forces:
    // Enumeration of the drivers for the decision, and the forces acting that require the decision to be made.

Considered Options:
    // Enumeration of considered options.
    Option 0:
        // Option 0 is always "Do Nothing".
        Pros:
        Cons:

    Option 1:
        Pros:
        Cons:

Decision Outcome:
    // Outcome of the decision, and associated consequences

Implementation:
    // Optional – if there is an implementation of the outcome,
    // it should be linked here via changelists etc.

A Decision Log will generally look like this:

2021-10-30 | DR-0000003 | Kafka used for FooBar Service | Proposed  | #postgres
// Date    | // ID      | // Title                      | // Status | // Tags/Themes

The record and the log should be kept under version control. Folks have advocated for the records to be kept proximate to the code, but this should be considered with care. If this is the case, then the log must link to the document, as the log will be a central place of discovery.
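Because the log is just delimited text, it is straightforward to parse and keep in reverse chronological order. The sketch below, in Python, assumes the pipe-separated layout shown above, with lines beginning `//` treated as comments; `LogEntry`, `parse_log_line` and `parse_log` are illustrative names, and titles containing a `|` character are not handled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LogEntry:
    decided_on: date
    dr_id: str
    title: str
    status: str
    tags: tuple[str, ...]


def parse_log_line(line: str) -> LogEntry:
    """Parse one 'date | id | title | status | tags' line."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 pipe-separated fields: {line!r}")
    raw_date, dr_id, title, status, raw_tags = fields
    return LogEntry(
        decided_on=date.fromisoformat(raw_date),  # dates are ISO 8601, e.g. 2021-10-30
        dr_id=dr_id,
        title=title,
        status=status,
        tags=tuple(raw_tags.split()),
    )


def parse_log(text: str) -> list[LogEntry]:
    """Parse a whole log, newest decision first.

    Blank lines and lines starting with '//' are skipped.
    """
    entries = [
        parse_log_line(line)
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("//")
    ]
    # Sort by date, then ID, descending, so the log reads newest first.
    return sorted(entries, key=lambda e: (e.decided_on, e.dr_id), reverse=True)


if __name__ == "__main__":
    sample = "2021-10-30 | DR-0000003 | Kafka used for FooBar Service | Proposed | #postgres"
    print(parse_log(sample)[0])
```

Sorting on every read means contributors can append entries in any order and the rendered log still comes out newest first.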

Why should I use it?

Repeatedly making good decisions is the mark of a healthy engineering team, and so is learning from less-than-ideal outcomes. A decision record helps crystallise the context in which a decision was made, as well as the forces that influenced that decision.

The decision log helps answer the “why do we do it this way” problem – you can walk through the log for a given project to get a sense of why things were done in a particular way without relying on a historian to tell you the oral history of the project (as fun as that can be).

When shouldn’t I use it?

Don’t use decision records for trivial decisions, as otherwise it is easy to get bogged down in spending all your time writing versus implementing. They should be used for true forks in the road that will have lasting consequences, versus things that are easily changeable.

How should I distribute it?

As mentioned briefly above, there are generally two options – keep decision records proximate to the code that implements the decision option, or have a centralised repository for decision records. I have a mild preference for keeping records project local, but compiled into a central index. This central index can be used as an onboarding tool, as well as a useful way to quickly discover the whys behind the how and what for projects.
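One way to get project-local records into a central index is to walk the repositories and read the header fields from each record. The Python sketch below is only an illustration under stated assumptions: records follow the plaintext template above, and they live in files matching `**/decisions/DR-*.txt` (a hypothetical layout, not a standard). `read_record` and `compile_index` are likewise made-up names.

```python
import re
from pathlib import Path

# Header fields to lift out of each record. Trailing "// ..." comments are
# dropped, so a value that itself contains "//" (e.g. a URL) would be cut short.
FIELD = re.compile(r"^\s*(ID|Title|Status|Updated At):\s*(.*?)\s*(?://.*)?$")
SELECTED = re.compile(r"\[(\w+)\]")


def read_record(path: Path) -> dict[str, str]:
    """Extract ID, Title, Status and Updated At from one record file."""
    record = {"Path": str(path)}
    for line in path.read_text(encoding="utf-8").splitlines():
        match = FIELD.match(line)
        # Keep only the first occurrence of each field (the header block).
        if match and match.group(1) not in record:
            record[match.group(1)] = match.group(2)
    # Reduce "[Proposed] Rejected Accepted Superseded" to "Proposed".
    selected = SELECTED.search(record.get("Status", ""))
    if selected:
        record["Status"] = selected.group(1)
    return record


def compile_index(root: Path, pattern: str = "**/decisions/DR-*.txt") -> list[dict[str, str]]:
    """Collect every record under root, most recently updated first.

    The glob pattern is an assumed directory layout; adjust to taste.
    """
    records = [read_record(path) for path in root.glob(pattern)]
    # ISO 8601 dates sort correctly as plain strings.
    return sorted(
        records,
        key=lambda r: (r.get("Updated At", ""), r.get("ID", "")),
        reverse=True,
    )


if __name__ == "__main__":
    for row in compile_index(Path(".")):
        print(" | ".join(row.get(k, "") for k in ("Updated At", "ID", "Title", "Status", "Path")))
```

Including the path in each row keeps the requirement that the central log links back to the document, wherever it lives.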

The Pre-mortem/Prospective and Post-mortem/Retrospective

Before kicking off a project, it is worth conducting a pre-mortem (or prospective) to identify risks and consequences of failure, as well as things that can go right. A good example of this was shared by Atlassian in this playbook.

As a counter to the pre-mortem, we have the much loved/hated post-mortem (or retrospective). Perhaps the best approach to post-mortems is that of the blameless post-mortem, described by John Allspaw in this article.

What is the purpose?

The purpose of a pre-mortem is to communicate to your team and stakeholders how you foresee a project going, including risks, consequences and opportunities. A pre-mortem is also a venue to collaborate with, and gain insight from, folks who may not be directly related to your team.

The purpose of a post-mortem is to identify how something could have happened, and how we can do better next time. It is also a place to celebrate wins. Ultimately, post-mortems are tools to communicate to the rest of your organisation, and eventually your customers, about what went right and, more typically, what went wrong. The hallmark of a good post-mortem is that it is blameless, accurate and clear.

What does it look like?

A pre-mortem can look like the following:

ID: PRE-M-0000001 // A simple serial number for the Pre-Mortem

Title: Pre-mortem – Shard Production Database // A short title, max 60 characters or so

Metadata: // A block containing information about the document
    Author(s):                            // Primary author(s) of the Pre-mortem.
    Collaborator(s):                      // Collabs of the Pre-mortem
    Status: [Draft] Discussion Finalised  // Status of the Pre-mortem.
    Updated At: 2021-10-30                // When the Pre-mortem was last updated.

Threats:
    // Threats to the project

Successes:
    // What can go right?

Actions:
    // What actions are going to be taken to de-risk, or to compound success?

An internal post-mortem can look like the following:

ID: POST-M-0000001 // A simple serial number for the Post-Mortem

Title: Post-mortem – Shard Production Database // A short title

Metadata: // A block containing information about the document
    Author(s):                            // Primary author(s) of the Post-mortem.
    Collaborator(s):                      // Collabs of the Post-mortem
    Status: [Draft] Discussion Finalised  // Status of the Post-mortem.
    Updated At: 2021-10-30                // When the Post-mortem was last updated.
    Incident Date: 2021-10-25             // When the incident occurred, if an incident.

Summary:
    // Bottom line, up front.
    // Capture the salient details of what happened, impact, and actions to prevent recurrence.

Timeline:
    // Timeline of project/incident.

Impact:
    // What was the consequence of things going wrong?
    // Alternatively, what was the consequence of things going right?

Lessons Learned:
    // Lessons learned from the incident/project.

Actions:
    // Concrete follow up actions to be taken by teams.

You may note that there is no mention of “root cause” in this template. This is intentional, as there is no root cause. More discussion on a concrete example of a “good” Post-mortem can be found in the Google SRE Workbook.

Why should I use it?

You should use a pre-mortem to identify opportunities to de-risk a project, as well as report these risks to stakeholders. Additionally, it gives a venue for project members to voice their concerns, as well as celebrate the improvements if the project is successful.

You should use a post-mortem after incidents, as well as projects themselves. Incident post-mortems are special, and require somewhat special handling, but are not discussed in detail here. They are opportunities to learn, above all, and are an incredible source of information on how an organisation/system actually works, versus how it is stated to work. They are also important sources of information if/when customer communications are required.

Some excellent discussions on learning with post-mortems can be found here and here, both from Learning From Incidents In Software.

When shouldn’t I use it?

Not all projects require a pre-mortem, especially if they are low risk. Additionally, individual pieces of work don’t require this treatment.

Similarly, not everything needs a post-mortem, but all incidents should have one, and ideally most projects should as well.

How should I distribute it?

Pre- and post-mortems should be available in a central index, and should be emailed out to team members and stakeholders when they are completed.

Closing Thoughts

Phew, that was a lot! Ultimately, these are just tools – you cannot “just” use tools to develop a culture, as a culture is a gestalt of many different factors, and there are no magic bullets for organisational dysfunction.

Remember, the goal is to get information to the right people at the right time – writing is just one of the best ways we have of doing that, and structured documents like the above help simplify and reduce the amount of variance in how we accomplish this task.