Table Stakes


This is a short post on what I see as table stakes, security-wise, for any new user-facing service. It is mostly concerned with user-facing, rather than intra-service, considerations.

It’s important to have a baseline set of security-adjacent requirements that every new user-facing service must meet. This is not just for consistency: from a user experience point of view, it follows the principle of least surprise, since any user of your services will have a similar experience across all of them.

Continuous Assurance

This sits at the top, even though it requires a relatively mature security engineering team to deliver, and isn’t directly user-focused. The reason for this is that you need a way to ensure that you are maintaining your baseline.

In software development, we typically take continuous deployment and/or continuous integration processes for granted – changes are made, tested, reviewed and deployed. However, unless we ensure that security fixes, such as mitigations for the OWASP Top Ten, are both live and have not regressed in production, all we have to go on is a closed ticket.

This is why continuous assurance forms part of your baseline capabilities, as it allows you to be much more confident about the state of the fixes that have been put in place. An example, from earlier posts, is credential stuffing resilience.

As a mixed team, you may wish to have assurance both that a sufficient level of resilience has been achieved, and that detection logic is still working as expected, by automating credential stuffing attacks against dummy users in your live (or shadow production) environment.
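Such a probe can be sketched as a pure check over the responses that a burst of scripted login attempts receives. The status codes and threshold here are illustrative assumptions, not any real service's contract:

```python
# Hedged sketch of a continuous-assurance check: replay a burst of
# credential stuffing attempts against dedicated dummy accounts and
# verify the service throttles them. `responses` is the list of HTTP
# status codes the probe observed, in order.

def assess_stuffing_resilience(responses, limit=5):
    """Return True if the service rate-limited (HTTP 429) the burst
    before `limit` attempts were processed un-throttled."""
    processed = 0
    for status in responses:
        if status == 429:
            return True   # defence kicked in
        processed += 1
        if processed >= limit:
            return False  # too many attempts handled without throttling
    return False


# A passing run: three rejections, then rate limiting engages.
print(assess_stuffing_resilience([401, 401, 401, 429]))  # True
# A regression: ten attempts processed, no throttling at all.
print(assess_stuffing_resilience([401] * 10))            # False
```

A real probe would drive this from an HTTP client against a shadow-production login endpoint; keeping the evaluation logic pure like this makes it trivial to unit-test the harness itself.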

Another example would be continuously testing that password strength is always checked against internal policy, or something like zxcvbn.
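A minimal sketch of such an internal policy check, assuming a fixed rule set for illustration (a real harness would more likely defer to an estimator such as zxcvbn than to hand-rolled rules):

```python
# Hedged sketch of an internal password policy of the kind a
# continuous-assurance harness might exercise. The rules and the
# common-password list are illustrative placeholders.

COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein"}

def meets_policy(password: str, min_length: int = 12) -> bool:
    """Reject short passwords, known-common passwords, and passwords
    drawn from only one character class."""
    if len(password) < min_length:
        return False
    if password.lower() in COMMON_PASSWORDS:
        return False
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
    ]
    return sum(classes) >= 2
```

The assurance check then becomes: periodically attempt to register accounts with passwords that violate the policy, and raise a regression if any succeed.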

This requires a relatively sophisticated testing harness, as well as a reporting mechanism to automatically create high-priority break-fix items should a regression be detected. I’ve often thought of relentless automation as one of the core tenets of a successful security engineering team, especially one that is size-constrained.
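The reporting side can be sketched as a small shim between the assurance harness and your issue tracker. The ticket fields and the `create_ticket` sink are hypothetical; in practice this would call your tracker's API:

```python
# Hedged sketch: turn failed assurance checks into high-priority
# break-fix items. `check_results` maps check name -> bool (passed);
# `create_ticket` is an injected sink standing in for a tracker API.

def report_regressions(check_results, create_ticket):
    tickets = []
    for name, passed in sorted(check_results.items()):
        if not passed:
            tickets.append(create_ticket({
                "title": f"Assurance regression: {name}",
                "priority": "P1",
                "labels": ["security", "break-fix"],
            }))
    return tickets


# Example run with an in-memory sink.
raised = []
report_regressions(
    {"stuffing_resilience": False, "password_policy": True},
    lambda t: (raised.append(t), t)[1],
)
print([t["title"] for t in raised])
# ['Assurance regression: stuffing_resilience']
```

Injecting the sink keeps the harness testable without a live tracker, which matters when the harness itself must not regress silently.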

Multi-factor Authentication

With the advent of the Google Titan keys, analogous features being built into the latest Pixel phones, and iOS and iPadOS finally supporting WebAuthn, we can start moving towards universal multi-factor authentication for users across all platforms.

At this stage in the game, modern multi-factor authentication is a core requirement for all user-facing services, and should be encouraged as a default. As we have seen in recent high-profile cases, SMS as a second factor should be considered insufficient, and due to password re-use, an email inbox should not be considered a root of trust either.

The latest European banking regulation around user identity, Strong Customer Authentication, gives a nice “pick-two” out of three necessary and sufficient options:

- knowledge: something only the user knows, such as a password or PIN;
- possession: something only the user has, such as a phone or security key;
- inherence: something the user is, such as a fingerprint.

If you are in a position to do so, I strongly recommend that your organisation ship security tokens to your users as part of the onboarding process. Ultimately, this is a small capital expenditure per customer that greatly reduces the chances of being phished, credential stuffed and so on. You will also need to provide sufficient recovery mechanisms.

Logging & Observability

This is a big topic, so for the purposes of this post I’m going to focus on what I feel is the minimum.

It is paramount that a service logs every action a user takes, along with every action that causes a change in state. All reasons for failure (such as rate limiting, insufficient access and server errors) should be logged in a structured format to a log drain that supports correlation and exploration.
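One line of JSON per event is a reasonable lowest common denominator for such a drain. A minimal sketch using Python's standard `logging` module, with field names that are illustrative rather than a standard schema:

```python
import json
import logging
import sys

# Sketch of structured audit logging: each user action or failure
# reason is emitted as one JSON object per line, ready for any log
# drain that ingests line-delimited JSON.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
        }
        # Callers attach structured context via `extra={"fields": ...}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


logger = logging.getLogger("audit")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A failed login, with the reason for failure captured explicitly.
logger.info("login_failed", extra={"fields": {
    "user_id": "u-123", "reason": "rate_limited", "ip": "203.0.113.7",
}})
```

Keeping the reason for failure as a first-class field, rather than buried in a message string, is what makes later correlation queries cheap.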

A tool like Honeycomb can be incredibly useful for security analysts, and not just operators. Capturing spans and being able to trace anomalous activity throughout a system is extremely powerful, and provides a wealth of data that you would not have otherwise.

Beyond that, you need a tool for log exploration. I’m partial to Splunk, but an ELK stack is also a good option. Being able to express complex alerting logic over a log stream is extremely powerful, so using something like Dropbox’s setup over Kafka, or an abstraction over Apache Flink, should also be considered – especially if the alerting logic you are constructing is more complex than what you can express in Splunk or Elasticsearch.
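To make the idea concrete, here is a hedged sketch of the kind of stateful rule such a stream processor would run: alert once an IP produces a threshold of failed logins inside a sliding window. The event shape is an illustrative assumption:

```python
from collections import defaultdict, deque

# Sketch of a stateful alerting rule over a log stream: flag an IP
# once it produces `threshold` failed logins within `window` seconds.
# Events are (timestamp, ip, outcome) tuples, assumed time-ordered.

def failed_login_alerts(events, threshold=5, window=60):
    recent = defaultdict(deque)   # ip -> timestamps of recent failures
    alerts = []
    for ts, ip, outcome in events:
        if outcome != "failure":
            continue
        q = recent[ip]
        q.append(ts)
        # Drop failures that have aged out of the sliding window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, ip))
            q.clear()  # avoid re-alerting on every subsequent failure
    return alerts


# Five failures from one IP inside a minute trips the rule.
events = [(t, "203.0.113.7", "failure") for t in range(5)]
print(failed_login_alerts(events))  # [(4, '203.0.113.7')]
```

In practice this logic would live in a Flink job or a Kafka consumer group rather than a single process, but the rule itself stays this simple.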

The core point is that you must be able to understand the what, when and why of your system. If you can answer all three confidently, you’ll be in a good place when handling a security incident.