Leadership Power Tools: SQL and Statistics

2024-12-04

A common pattern I’ve seen over the years is folks in engineering leadership positions who aren’t comfortable extracting and interpreting data from stores, be it databases, CSV files in an object store, or even just a spreadsheet. We’re going to cover SQL and DuckDB, then some useful statistical tools: summary statistics, distributions, confidence intervals and Bayesian reasoning.

All too often a request is put into a BI team or data analyst to “run a report” or something similar. Instead, pick up some power tools and do it yourself!

A line drawing of a power drill and table saw, continuing the power tools metaphor.


SQL and DuckDB

DuckDB is probably going to be your best entry point into the world of SQL. It’s a lightweight, in-process SQL database that you can use to run queries against CSV files, Parquet files, and more. It’s a great way to get started with SQL without having to worry about setting up a database server or anything like that. It can also connect to remote databases, like Postgres or MySQL.
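
If you’d rather stay in Python than use the DuckDB CLI, the duckdb package exposes the same engine. A minimal sketch (the file names here are just placeholders) of querying a CSV and a set of Parquet files:

import duckdb

# Query a local CSV directly - no server or schema setup required.
duckdb.sql("SELECT * FROM 'users.csv' LIMIT 5").show()

# Parquet works the same way, including glob patterns.
duckdb.sql("SELECT COUNT(*) FROM 'events/*.parquet'").show()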

If you aren’t familiar with SQL, I’d recommend going through a brief tutorial - SQLTutorial is a pretty good place to start - and using SQLite as a practice database. Once you’re comfortable with the basics, you can start using DuckDB to run queries against your data.

SQL is our lingua franca for data querying. Knowing how to write SQL queries is a powerful tool to have in your toolbox, and will make you more effective. It’s not just for data analysts or BI folks - it’s for anyone who needs to make decisions based on data. I’d recommend getting familiar with JOINs, GROUP BY, and HAVING clauses, as well as window functions. If you want to push the envelope, look into CTEs (Common Table Expressions) and recursive queries.

Let’s say you have an extract of data from your production database, and you want to find out how many users signed up per day over the last month. Given a CSV file users.csv with the following columns:

user_id,created_at
1,2024-11-01
2,2024-11-01
3,2024-11-02
4,2024-11-02
5,2024-11-02
6,2024-11-03

You can run the following query in DuckDB:

SELECT
  created_at,
  COUNT(user_id) AS signups
FROM "users.csv"
GROUP BY created_at
ORDER BY created_at ASC;

This will give you a table with the number of signups per day:

created_at | signups
2024-11-01 | 2
2024-11-02 | 3
2024-11-03 | 1
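
Since CTEs and window functions came up earlier, here’s a sketch extending the same query with a CTE and a window function to add a running total of signups (shown via DuckDB’s Python bindings, though it runs just as well in the CLI):

import duckdb

# Daily signups (the CTE), plus a cumulative total via a window function.
duckdb.sql("""
    WITH daily AS (
        SELECT created_at, COUNT(user_id) AS signups
        FROM 'users.csv'
        GROUP BY created_at
    )
    SELECT
        created_at,
        signups,
        SUM(signups) OVER (ORDER BY created_at) AS running_total
    FROM daily
    ORDER BY created_at ASC
""").show()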

A trivial example, but an example all the same. Now that we can get the data and do some surface-level aggregations, let’s start working through some statistical methods. I’m going to cover a few that I think are especially useful for leaders: summary statistics, distributions, confidence intervals and Bayesian reasoning.

Statistics

Sorry, R fans - Python is my go-to for analysis. Before we get into the specific methods, let’s talk a little about why you should bother with this stuff at all. I’m a big believer in data-informed decisions, and to do that we need ways to interpret the data and talk about it authoritatively. Statistical methods are your other set of tools.

Here’s a quick example, leveraging DuckDB again, but this time for summary statistics. Let’s grab some sample data from the IMDB Non-Commercial Datasets - title.ratings.tsv.gz - and SUMMARIZE it:

SELECT
    column_name as col,
    approx_unique as uniq,
    round(avg::double,2) avg,
    min,
    round(q25::double,2) q25,
    round(q50::double,2) q50,
    round(q75::double,2) q75,
    max
FROM (SUMMARIZE "title.ratings.tsv.gz");
col           | uniq    | avg     | min       | q25  | q50  | q75   | max
tconst        | 1394000 |         | tt0000001 |      |      |       | tt9916880
averageRating | 73      | 6.95    | 1.0       | 6.2  | 7.15 | 7.9   | 10.0
numVotes      | 23695   | 1027.92 | 5         | 11.0 | 26.0 | 100.0 | 2966204

Descriptive summary statistics like this, including all-important percentiles, give you a quick overview of the data. Here, we can see we’ve got almost 1.4M entries, but only 73 unique ratings between 1.0 and 10.0. We can also see that although the mean average is 6.95, the median is 7.15.

I generally prefer using percentiles when discussing data like this. For example, saying something like “an above average number of 5xx error codes” is not particularly illuminating, but saying “an increase of 10% in the p95 count of 5xx error codes month on month” is more precise. For more on this, see Tyler Neely’s notes on E-Prime and precise language.
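
To make that concrete, here’s a sketch of how you might pull that p95 figure, assuming a hypothetical error_counts.csv with a day date column and a count_5xx column:

import duckdb

# p95 of daily 5xx counts, per month, for month-on-month comparison.
duckdb.sql("""
    SELECT
        date_trunc('month', day) AS month,
        quantile_cont(count_5xx, 0.95) AS p95_5xx
    FROM 'error_counts.csv'
    GROUP BY 1
    ORDER BY 1
""").show()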

Applying Summary Statistics

Let’s have a look at something a little more interesting, and maybe more directly relevant to engineering leaders - issue tracker data. I’m going to be using a cleaned dataset from logpai/bughub for Firefox. First of all, let’s load it into DuckDB and get an overview of what we’re looking at:

DESCRIBE TABLE 'mozilla_firefox.csv';
column_name      | column_type | null | key | default | extra
Issue_id         | BIGINT      | YES  |     |         |
Priority         | VARCHAR     | YES  |     |         |
Component        | VARCHAR     | YES  |     |         |
Duplicated_issue | DOUBLE      | YES  |     |         |
Title            | VARCHAR     | YES  |     |         |
Description      | VARCHAR     | YES  |     |         |
Status           | VARCHAR     | YES  |     |         |
Resolution       | VARCHAR     | YES  |     |         |
Version          | VARCHAR     | YES  |     |         |
Created_time     | VARCHAR     | YES  |     |         |
Resolved_time    | VARCHAR     | YES  |     |         |

Alright, Duplicated_issue has the wrong type - it should refer to an Issue_id - but that’s nothing we can’t work with.

We don’t have ready-made values to summarize, so we’ll have to derive some ourselves: let’s break down time to resolve by Priority, and then by Component, excluding duplicates and unresolved issues. We need to do a bit of cleanup on the timestamp varchar fields first, though, as they aren’t in RFC 3339 format:

import csv
from datetime import datetime

def replace_timestamps(file):
    with open(file, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Columns 9 and 10 are Created_time and Resolved_time.
    for row in rows:
        for i in (9, 10):
            if row[i] == "":
                continue
            # Reformat "YYYY-MM-DD HH:MM:SS +HHMM" into "YYYY-MM-DDTHH:MM:SS+HHMM"
            # so that DuckDB can cast it cleanly.
            timestamp = datetime.strptime(row[i], "%Y-%m-%d %H:%M:%S %z")
            row[i] = timestamp.strftime("%Y-%m-%dT%H:%M:%S%z")

    with open(file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    replace_timestamps("mozilla_firefox.csv")

Then load this into DuckDB.

-- We need to install this extension first to handle timezones.
install ICU;
load ICU;

-- I prefer CTEs over subqueries
with filtered as (
    select
        "Priority",
        ("Resolved_time"::timestamptz - "Created_time"::timestamptz) as time_to_resolve
    from "mozilla_firefox.csv"
    where "Status" = 'RESOLVED'
    and "Resolution" <> 'DUPLICATE'
)

select
    "Priority",
    quantile_disc(time_to_resolve, 0.25 order by "Priority") as p25_ttr,
    quantile_disc(time_to_resolve, 0.50 order by "Priority") as p50_ttr,
    quantile_disc(time_to_resolve, 0.75 order by "Priority") as p75_ttr,
    quantile_disc(time_to_resolve, 0.95 order by "Priority") as p95_ttr
from filtered
group by "Priority"
order by "Priority" ASC;
Priority | p25_ttr          | p50_ttr           | p75_ttr            | p95_ttr
         | 6 days 04:17:51  | 176 days 15:37:52 | 622 days 20:05:33  | 1463 days 15:53:53
P1       | 8 days 04:43:33  | 64 days 03:00:19  | 511 days 06:08:24  | 1416 days 01:15:26
P2       | 38 days 01:14:05 | 204 days 00:29:32 | 895 days 15:40:19  | 1450 days 18:11:01
P3       | 56 days 04:59:32 | 243 days 18:57:23 | 830 days 03:02:28  | 1500 days 18:38:30
P4       | 83 days 23:55:59 | 508 days 04:25:40 | 1170 days 08:47:13 | 2568 days 02:57:14
P5       | 26 days 00:55:24 | 416 days 05:43:22 | 1356 days 05:23:26 | 2800 days 01:20:26

Neat. And using a similar query, let’s look at it by Component:

Component                  | p25_ttr           | p50_ttr           | p75_ttr            | p95_ttr
General                    | 8 days 16:47:07   | 256 days 13:06:32 | 626 days 11:20:33  | 1354 days 11:48:16
Bookmarks & History        | 198 days 21:35:22 | 691 days 08:23:16 | 1126 days 11:19:23 | 1705 days 18:23:34
Untriaged                  | 03:04:44          | 1 day 15:08:14    | 61 days 05:53:53   | 485 days 00:03:59
Tabbed Browser             | 8 days 01:04:35   | 201 days 14:07:15 | 719 days 14:09:05  | 2028 days 12:20:09
Toolbars and Customization | 20 days 08:06:01  | 186 days 06:34:57 | 836 days 12:22:37  | 1952 days 19:43:21
[snip]

Distributions And Histograms

DuckDB can also help us review histograms before moving into something like a Python notebook to render them as a proper graph. Let’s say we wanted to review the distribution of all the P1 issue closures:

with filtered as (
    select
        round(epoch(("Resolved_time"::timestamptz - "Created_time"::timestamptz)) / 3600) as time_to_resolve_hours
    from "mozilla_firefox.csv"
    where "Status" = 'RESOLVED'
    and "Resolution" <> 'DUPLICATE'
    and "Priority" = 'P1'
)

select * from histogram(filtered, time_to_resolve_hours, bin_count:=15);

This gives us an inline bar-chart:

bin                    | count | bar
x <= 5000.0            | 607   | ████████████████████████████████████████████
5000.0 < x <= 10000.0  | 73    | █████████▌
10000.0 < x <= 15000.0 | 55    | ███████▏
15000.0 < x <= 20000.0 | 27    | ███▌
20000.0 < x <= 25000.0 | 27    | ███▌
25000.0 < x <= 30000.0 | 24    | ███▏
30000.0 < x <= 35000.0 | 81    | ██████████▋
35000.0 < x <= 40000.0 | 15    | █▉
40000.0 < x <= 45000.0 | 5     |
45000.0 < x <= 50000.0 | 1     |
50000.0 < x <= 55000.0 | 4     |
55000.0 < x <= 60000.0 | 2     |
60000.0 < x <= 65000.0 | 8     |
65000.0 < x <= 70000.0 | 2     |
70000.0 < x <= 75000.0 | 0     |
75000.0 < x <= 80000.0 | 1     |
80000.0 < x <= 85000.0 | 1     |

If you tilt your head to the side, you can see this data exhibits a strongly Pareto-like distribution, with the bulk of the entries (roughly two-thirds, going by the counts above) falling within the first bucket of up to 5000 hours (~208 days) to complete. A larger bin_count would give a finer-grained breakdown.
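
Rather than eyeballing bar widths, you can ask for the exact share in that first bucket - a sketch against the same file via the Python bindings:

import duckdb

# Share of resolved, non-duplicate P1 issues closed within 5000 hours.
# If the timestamptz casts complain, run INSTALL icu; LOAD icu; first,
# as in the CLI example above.
duckdb.sql("""
    WITH p1 AS (
        SELECT
            epoch("Resolved_time"::timestamptz - "Created_time"::timestamptz) / 3600 AS ttr_hours
        FROM 'mozilla_firefox.csv'
        WHERE "Status" = 'RESOLVED'
          AND "Resolution" <> 'DUPLICATE'
          AND "Priority" = 'P1'
    )
    SELECT round(100.0 * count(*) FILTER (WHERE ttr_hours <= 5000) / count(*), 1) AS pct_in_first_bucket
    FROM p1
""").show()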

In my mind, there are about 5 core distributions that you should recognise by sight:

Being able to grasp probabilistic thinking allows you to make better predictions and “be right more often”.

Confidence Intervals

As an engineering leader, you are going to need to be right a lot. Part of that is coming up with high-confidence estimations. The very best seem to have an almost magical way of providing high quality estimates out of thin air. For the rest of us, there are statistical methods.

Given our historical data above, imagine a Firefox PM wanting to know how long it will take for us to solve a bug in our “Tabbed Browser” component. We’ll cover two sides of this - calculating the confidence interval, and using a Monte Carlo simulation.

Let’s Do Maths

There are a few different ways of calculating a confidence interval. As we know from above, our data is not normally distributed. Fortunately, we have enough samples to take advantage of the Central Limit Theorem and the distribution of the sample mean.

This lets us use the following formula:

CI = Mean ± Z * (stddev / √n)

where:

- Mean is the sample mean
- stddev is the sample standard deviation
- n is the number of samples
- Z is the z-score for the chosen confidence level (1.96 for 95%)

The confidence interval is best given as a tuple of the lower and upper bounds. Let’s prepare our data with DuckDB:

COPY (select
    "Issue_id",
    round(epoch(("Resolved_time"::timestamptz - "Created_time"::timestamptz)) / 3600) as time_to_resolve_hours
from "mozilla_firefox.csv"
where "Status" = 'RESOLVED'
and "Resolution" <> 'DUPLICATE'
and "Component" = 'Tabbed Browser') to 'tabbed_browser_ttr.csv' (header, delimiter ',');

select count(*) from 'tabbed_browser_ttr.csv';
count_star()
2737

Now we can leverage Python and polars to calculate our confidence interval:

import polars as pl

df = pl.read_csv("tabbed_browser_ttr.csv")
mean_ttr = df["time_to_resolve_hours"].mean()
std_ttr = df["time_to_resolve_hours"].std()
n = df.shape[0]

# 1.96 is the z-score for a 95% confidence interval
confidence_interval = (
    round(mean_ttr - 1.96 * (std_ttr / (n**0.5))),
    round(mean_ttr + 1.96 * (std_ttr / (n**0.5))),
)

print(f"Number of tickets: {n}")
print(f"Mean time to resolve a ticket: {round(mean_ttr)} hours")
print(f"Confidence interval: {confidence_interval} hours")
$ python confidence.py
Number of tickets: 2737
Mean time to resolve a ticket: 11678 hours
Confidence interval: (11075, 12281) hours

So we can say to our imaginary PM that “based on our historical data, the mean time to resolve a Tabbed Browser bug is somewhere between 461 and 511 days, with 95% confidence”.

That’s… not great for our users! Let’s dig in a bit deeper based on the Resolution field:

SELECT "Resolution", Count("Issue_id")
FROM "mozilla_firefox.csv"
WHERE "Status" = 'RESOLVED'
AND "Component" = 'Tabbed Browser'
GROUP BY 1 ORDER BY 2 DESC;
Resolution | count(Issue_id)
DUPLICATE  | 1867
WORKSFORME | 754
FIXED      | 585
INCOMPLETE | 535
INVALID    | 530
WONTFIX    | 264
EXPIRED    | 69

Alright, let’s just focus on FIXED issues, re-running the export query appropriately:

$ python confidence.py
Number of tickets: 585
Mean time to resolve a ticket: 9537 hours
Confidence interval: (8290, 10784) hours

Welp, sorry Firefox users who had issues with tabs during the period this dataset covers.

Let’s Do Simulations

As easy as that was, for some datasets it is better to run a simulation. In this example, we’re going to use a Monte Carlo simulation, resampling our historical data with replacement. We’ll use 10K iterations, and plot a graph marking the p10 and p90 to show the range that 80% of the simulated means fall within:

import polars as pl
import matplotlib.pyplot as plt

df = pl.read_csv("tabbed_browser_ttr.csv")

n_iter = 10000
n = df.shape[0]
ttr = df["time_to_resolve_hours"].drop_nulls()

# Bootstrap: resample with replacement and record the mean of each resample.
simulated_means = [ttr.sample(n, with_replacement=True).mean() for _ in range(n_iter)]
simulated_means_df = pl.DataFrame(simulated_means, schema=["mean_ttr"])

p10 = simulated_means_df.quantile(0.1).get_columns()[0][0]
p90 = simulated_means_df.quantile(0.9).get_columns()[0][0]

bg = "#f9ffff"
fg = "#33321a"

plt.figure(figsize=(10, 6), facecolor=bg)
ax = plt.axes()
ax.set_facecolor(bg)
ax.hist(
    simulated_means_df["mean_ttr"],
    bins=50,
    alpha=0.5,
    fill=True,
    color=fg,
    align="mid",
    width=75,
)
ax.axvline(
    x=p10,
    color=fg,
    linestyle="dotted",
    label="10th Percentile (~8743 hours)",
)
ax.axvline(
    x=p90,
    color=fg,
    linestyle="dashed",
    label="90th Percentile (~10370 hours)",
)
ax.tick_params(axis="both", colors=fg)
[ax.spines[spine].set_color(fg) for spine in ["left", "bottom"]]
[ax.spines[spine].set_color(bg) for spine in ["right", "top"]]
plt.title("Distribution of Simulated Mean Resolution Times", fontsize=12, color=fg)
plt.xlabel("Mean Time to Resolve Issues (hours)", fontsize=10, color=fg)
plt.ylabel("Frequency", fontsize=10, color=fg)
plt.legend(labelcolor=fg, shadow=False, frameon=False)
plt.tight_layout()
plt.savefig("monte_carlo.webp")
A graph showing the distribution of means from the Monte Carlo simulation. The 10th and 90th percentile are marked at 8743 hours and 10370 hours respectively.

(Honestly, matplotlib is a power tool in itself - very much recommend learning how to plot your common distributions with it.)
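
If you want to build that muscle, here’s a minimal sketch - the particular shapes (normal, exponential, lognormal) are just my picks for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Three classic shapes worth recognising on sight: symmetric vs long-tailed.
samples = {
    "Normal (mean=50, sd=10)": rng.normal(50, 10, 10_000),
    "Exponential (mean=50)": rng.exponential(50, 10_000),
    "Lognormal (mu=3, sigma=0.8)": rng.lognormal(3, 0.8, 10_000),
}

fig, axes = plt.subplots(1, len(samples), figsize=(12, 3.5))
for ax, (name, data) in zip(axes, samples.items()):
    ax.hist(data, bins=50)
    ax.set_title(name, fontsize=9)
plt.tight_layout()
plt.savefig("distributions.webp")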

Here, we can say that 80% of the simulations landed between 8743 hours and 10370 hours - a slightly tighter interval than our pure maths one above, but still pretty broad.

I find this method particularly useful for capacity estimation, though it relies on a relatively stable team. As the team changes, you have to throw out old data - past performance does not guarantee future results, after all.

You can also apply Monte Carlo simulations to other useful domains: incident response, budgeting and estimating defect rates during the lifetime of a project.
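
As a sketch of the capacity-estimation case - the weekly throughput history here is entirely made up - you can resample historical weekly ticket counts to forecast how long a fixed backlog might take:

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical history: tickets closed per week over the last quarter.
weekly_throughput = np.array([4, 6, 3, 7, 5, 2, 6, 8, 4, 5, 3, 6])
backlog = 60  # tickets we want to forecast
n_iter = 10_000

weeks_needed = []
for _ in range(n_iter):
    done, weeks = 0, 0
    while done < backlog:
        done += rng.choice(weekly_throughput)  # resample a historical week
        weeks += 1
    weeks_needed.append(weeks)

p10, p50, p90 = np.percentile(weeks_needed, [10, 50, 90])
print(f"p10: {p10:.0f} weeks, p50: {p50:.0f} weeks, p90: {p90:.0f} weeks")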

Bayesian Reasoning

As engineering leaders, we must strive to be “good Bayesians” - that is, we update our beliefs about the world/processes/system/people as new data becomes available.

Frankly, getting into this at a serious level of detail is beyond the scope of this post. However, we can give a pretty close-to-home example at a high level - project estimation.

Estimating Probability of Project Completion

We need to start with an initial belief. Imagine we had a dataset similar to the above covering all our previously completed projects, with estimated start and end dates and actual start and end dates. From that data, the team completes projects within the initial estimate 70% of the time, so our prior probability of completing on time is 0.7 and of not completing on time is 0.3.

We’re part way through our project and have completed 40% of the work so far. From our historical data, if we’re on time, normally 50% ± 10% of the work is complete by this point; if we’re off track, it’s 30% ± 10%.

We now need to apply Bayes’ theorem:

P(H|D) = (P(D|H) * P(H)) / P(D)

P(H|D) is our posterior probability, P(D|H) is our likelihood, P(H) is our prior probability, and P(D) is our evidence.

We can now calculate each component. If we assume a normal distribution of probabilities (like from the Monte Carlo simulation above), our likelihood value for “still on time” is 0.24. Conversely, for “not on time”, we get 0.13.

Our evidence is a normalising constant: P(D) = P(D|on time) * P(on time) + P(D|off track) * P(off track) = 0.24 * 0.7 + 0.13 * 0.3 ≈ 0.207.

From there we can plug our figures into Bayes’ theorem:

For on time, we get (0.24 * 0.7) / 0.207 ≈ 0.812, or ~81% probability.
For off track, we get (0.13 * 0.3) / 0.207 ≈ 0.188, or ~19% probability.
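
As a sanity check on the arithmetic, here’s a tiny sketch that just plugs the figures above into Bayes’ theorem (the likelihoods of 0.24 and 0.13 are taken as given from the assumptions earlier):

# Prior: 70% of past projects finished within the initial estimate.
prior_on_time = 0.7
prior_off_track = 0.3

# Likelihood of observing 40% of the work done at this point (values from above).
likelihood_on_time = 0.24
likelihood_off_track = 0.13

# Evidence: the normalising constant P(D).
evidence = likelihood_on_time * prior_on_time + likelihood_off_track * prior_off_track

posterior_on_time = likelihood_on_time * prior_on_time / evidence
posterior_off_track = likelihood_off_track * prior_off_track / evidence

print(f"P(D) = {evidence:.3f}")                                # ~0.207
print(f"P(on time | progress) = {posterior_on_time:.2f}")      # ~0.81
print(f"P(off track | progress) = {posterior_off_track:.2f}")  # ~0.19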

When we complete another project, we can update our model. Again, this ties into providing high quality estimates to stakeholders.

Wrap Up

We’ve covered getting access to the data that is important to you, using tools like DuckDB and Python to analyse it, and applying statistical methods to understand the data and speak precisely about what it means for you and your decision making. However, with great power comes great responsibility - always double-check your work and ensure it aligns with the business goals.

If you’d like to get even more into stats (and why wouldn’t you!), consider this post by Ivan Savov on a revised statistics curriculum. We’ve covered probability distributions, confidence intervals and Bayesian reasoning briefly, but it is worth looking into linear regression too. I also recommend Think Stats and Python for Stats.