Regular Restarts Are Good, Actually

2024-11-08

Anecdotally, one of the more maligned features of the Heroku platform is the 24-hour lifetime limit on its compute units, known as “dynos”. This is actually a good thing, but a very misunderstood one.

Regular restarts of compute units bring one obvious benefit - resolving memory leaks. Broadly, this is what the documentation means by “maintain the health of applications” running on the platform. Slow memory leaks are insidious, only becoming visible when you take a sufficiently zoomed-out view of the application’s memory utilisation, and a regular restart is one way of papering over that behaviour.

Rails applications were historically hit particularly hard by this problem, mostly because of how easy it is to accumulate objects. Real ones will remember the introduction of the frozen_string_literal: true magic comment. Given Heroku’s history as a home for Rails apps, it should come as no surprise that this regular cycling exists.

But there are other benefits, and for that, we need to re-visit The Twelve-Factor App.

The Twelve-Factor App is a little long in the tooth - Heroku CTO Gail Frederick is speaking at KubeCon Americas 2024 about what a re-imagined set of factors would look like today - but two factors remain as powerful today as they were 13 years ago: Processes and Disposability.

The Art of Throwing Things Away

The combination of an ephemeral local filesystem and regular restarts means that you will not have local state and you will like it, dammit. And honestly, no local state is the best way to be - don’t accumulate cruft.

By enforcing the lack of local state, we can ensure that state lives somewhere more appropriate - a queue, a database, an object store and so on. This makes coordination significantly easier - shared-nothing services can be thrown away, scaled up and down without having to worry about state.
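As a rough sketch of what that looks like in practice - ReportPublisher, the bucket name and the report shape are all hypothetical, and it assumes the aws-sdk-s3 gem - the output of a process goes straight to an object store rather than the dyno’s ephemeral disk:

```ruby
require "aws-sdk-s3"

# Hypothetical report publisher: instead of writing results to the
# dyno's ephemeral filesystem, push them straight to an object store
# so any replacement process can pick them up later.
class ReportPublisher
  def initialize(bucket:)
    @s3 = Aws::S3::Client.new
    @bucket = bucket
  end

  def publish(report_id, csv_body)
    # The object key is the only "state" we need to remember, and it
    # lives in the store, not on the local disk.
    key = "reports/#{report_id}.csv"
    @s3.put_object(bucket: @bucket, key: key, body: csv_body)
    key
  end
end

# Usage: ReportPublisher.new(bucket: "my-app-reports").publish(42, csv)
```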

After all - always make state and coordination someone else’s problem.

We can take this concept further and architect our processes under the assumption that they can, and will, be interrupted at any time. This is why this “feature” gets a lot of flak - what do we do about longer-running tasks?

If we’re fully embracing this model, we’re putting state elsewhere, like a queue. It’s entirely possible you will be mid-way through running a job and have your compute unit terminated. This introduces us to the architectural properties we need - idempotency, atomicity and reentrancy.

I’ve written about this elsewhere. By ensuring a given operation is idempotent and atomic, and the collection of operations is reentrant, we don’t need to concern ourselves with mid-task interruptions - “just” retry. Embrace “at least once” processing. Kill your darlings. Let the cruft be swept away by a pod reaper or the dyno scheduler. By ensuring repeatable, predictable side effects and the ability to safely resume processing from a given state without leaving half-baked results lying around, we reduce our headaches considerably.
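A minimal sketch of the shape this takes, assuming Rails-style models - Invoice, charge_customer! and the idempotency key scheme are hypothetical names, not a prescription:

```ruby
# A minimal sketch of an idempotent worker for "at least once" delivery.
class InvoiceJob
  def perform(invoice_id)
    invoice = Invoice.find(invoice_id)

    # Idempotency: if a previous (interrupted) run already finished this
    # invoice, retrying is a no-op rather than a double charge.
    return if invoice.charged_at.present?

    # Pass a stable idempotency key so the payment provider can also
    # deduplicate if we are killed after charging but before saving.
    charge_customer!(invoice, idempotency_key: "invoice-#{invoice.id}")

    # Single, atomic state change: either the charge is recorded, or the
    # whole job is retried from the top.
    invoice.update!(charged_at: Time.now.utc)
  end
end
```

If the dyno dies between the charge and the update!, the retry finds the invoice still unmarked, re-issues the charge with the same idempotency key, and the provider deduplicates - “at least once” processing does the rest.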

Time Keeps On Slipping

But what about tasks that are going to take a long time anyway? The classic example is the “billing run” or a similar end-of-month batch process that requires loading larger volumes of data into memory, doing some number crunching, then spitting out a result.

For these, we must instead decompose the work into smaller tasks, providing an intermediate representation that can be stored somewhere (again, not locally!) and loaded, while keeping track of where in the sequence we are so that processing can resume after cancellation. If you are thinking “this sounds a lot like the MapReduce pattern” - you are right! MapReduce isn’t just for “big data”.
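A sketch of that decomposition, assuming ActiveRecord-style models - BillingRun, BillingCheckpoint, Account and bill! are all hypothetical - where the cursor lives in the database, so an interrupted run resumes from the last completed batch rather than the beginning:

```ruby
# A rough sketch of a decomposed "billing run": each batch is a small,
# atomic task, and progress is checkpointed in the database rather than
# held in the process.
class BillingRun
  BATCH_SIZE = 500

  def perform(run_id)
    checkpoint = BillingCheckpoint.find_or_create_by!(run_id: run_id)

    loop do
      # Resume from wherever the last (possibly interrupted) run stopped.
      last_id = checkpoint.last_account_id || 0
      batch = Account.where("id > ?", last_id)
                     .order(:id)
                     .limit(BATCH_SIZE)
                     .to_a
      break if batch.empty?

      batch.each { |account| bill!(account) }

      # Record progress so a restarted dyno picks up the next batch,
      # not the whole run.
      checkpoint.update!(last_account_id: batch.last.id)
    end
  end
end
```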

Then we run into the hard problem - long tasks that cannot be broken up into small, atomic tasks. In my line of work, the most familiar is the database dump.

Although Postgres natively supports backups via the type of process outlined above (continuous WAL archival, enabling “hot” backups), this isn’t very useful if you just want the data. Outside of logical replication, typically the only alternative is to use something like pg_dump.

Unfortunately, dumping a database, or even parts of a database, with pg_dump is a monolithic task - there are no valid “partial” dumps, because the dump depends on a transaction snapshot that no longer exists once pg_dump is interrupted. For very large datasets, even on adequate hardware and with pg_dump running in parallel mode, the odds of the process being interrupted are high.

So what to do about these monolithic tasks? “Just don’t restart” I hear you cry. And for these very particular types of work, you are correct. However, such behaviour should be opt-in instead of opt-out - discourage this kind of task in favour of composable sequences of tasks.

Update 2024-11-09: An earlier version of this post referred to Gail Frederick as Heroku CEO instead of Heroku CTO. This has been corrected.