Thoughts on User Safety: 2

2019-10-23

Following up from last time, let's explore the internal and insider fronts when moving beyond security towards safety for our users.

To reiterate, we are trying to move from mere security towards safety for our users, where we are proactively and reactively approaching areas in which user safety might be compromised.

Last time, we explored the external “front”, where users are exposed to the wider internet, and suggested some mitigations primarily to do with authentication and account takeover. This time, we're going to focus on protecting users from other users, and from our own employees.

Internal

Most platforms are multi-tenant – that is, many users share the same resources in one way or another, be it data stores, bandwidth, or compute. As a result, you can be a victim (or perpetrator) of the noisy neighbour problem, where one tenant causes a ruckus and others are impacted by it.

As a result, we must try to protect our users from both inadvertent and intentional disruption brought to their workloads by other users of the platform. Again, we'll look at proactive and reactive approaches to this problem.

Proactive

Quotas

The first line of defence against abuse in a multi-tenant environment is quotas, or limits. Wherever you have a finite resource, you must prevent users from impacting others by consuming all of it.

In a Linux environment, perhaps the most ubiquitous tool for this is cgroups, or “control groups”. Let's take an example where customer workloads are executed in processes under a cgroup hierarchy.

[image: an example cgroup hierarchy with a root cgroup and 3 sub-cgroups]

Here, we have 3 slices representing 3 customer workloads. We can use the cpu.shares and cpu.cfs_quota_us controls of the cpu subsystem to distribute CPU time and to place a hard cap on it. If customer1 were a very important customer, this would help prevent customer2 and customer3 from starving the machine of CPU time.

Similar controls exist in the memory and I/O (blkio) cgroup subsystems.
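As a rough sketch of what a control plane might compute for each slice, here's a minimal helper that maps a per-customer CPU cap onto the cpu.shares, cpu.cfs_period_us and cpu.cfs_quota_us files. The paths, share values and the 1.5-CPU cap below are hypothetical, purely for illustration:

```python
# Sketch: compute cgroup v1 CPU settings for a customer slice.
# The cgroup paths and the numbers chosen are illustrative assumptions.

CFS_PERIOD_US = 100_000  # the default CFS enforcement window (100ms)

def cpu_quota_settings(cgroup_path, shares, cpu_cap):
    """Map cgroup control files to values for one customer slice.

    shares:  relative weight vs. sibling cgroups (cpu.shares)
    cpu_cap: hard cap in CPUs, e.g. 1.5 -> 150% of one core
    """
    return {
        f"{cgroup_path}/cpu.shares": shares,
        f"{cgroup_path}/cpu.cfs_period_us": CFS_PERIOD_US,
        # quota is the runtime allowed per period; -1 would mean uncapped
        f"{cgroup_path}/cpu.cfs_quota_us": int(cpu_cap * CFS_PERIOD_US),
    }

# customer1 is "very important": a bigger weight and a higher hard cap
settings = cpu_quota_settings("/sys/fs/cgroup/cpu/customer1",
                              shares=2048, cpu_cap=1.5)
```

Actually writing these values out to the filesystem is left to the control plane; the point is simply that the hard cap is quota = cap × period.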

Of course, other workload schedulers can be used. A good concrete example of such a scheduler outside of cgroups is Salesforce's Apex Execution Governors.

Such quotas should be generous enough not to impact 99% of users, but should severely deter those who would take advantage of a quota-less system. Examples of such workloads are left to the reader's imagination. Quotas should be applied via a control plane, and be changeable on a case-by-case basis depending on the user trust/risk profile discussed previously.
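A common, general-purpose way to express such a limit is a token bucket: a steady refill rate plus a burst allowance, both tuneable per user. This is a minimal sketch – the rates are illustrative, and a real control plane would derive them from the trust/risk profile:

```python
import time

class TokenBucket:
    """Sketch of a per-user quota: steady refill plus a burst allowance."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum bucket size (burst allowance)
        self.tokens = burst   # start full
        self.now = now        # injectable clock, for testing
        self.last = now()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Raising or lowering rate and burst per user is exactly the kind of case-by-case change the control plane should support.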

Isolation

Outside of users indirectly impacting each other, we must also ensure that users cannot directly interfere with each other. In these cases, we must isolate workloads from one another.

Most commonly, this is accomplished through containerisation of workloads (through LXC, Docker, etc.) and/or namespacing. However, micro-VMs such as Firecracker, and user-space kernels such as gVisor, are also gaining traction in this area, and can provide a further level of isolation.

Merely using these technologies doesn't provide safety for free, however; they must be configured correctly, and used in conjunction with dropped capabilities, seccomp filters and so forth. Isolating customer workloads sufficiently to prevent both intentional and unintentional direct interaction with other customer workloads is paramount in a safe system.

Reactive

Observability

When quotas or isolation are breached, we must be made aware of these events. Although setting up the correct logging pipeline into a SIEM or data lake is a proactive step, we're going to focus on the reactive approaches here to minimise customer risk. For a logging setup that I will forever remain envious of, see this setup from Avery Pennarun – I really like the use of kmsg here and the overall architecture.

The goal is for the platform to react to such events automatically and take the necessary action, avoiding toil. Greg Burek gave an excellent talk in 2016 (link) that encapsulated this thinking: scale sub-linearly with people by writing playbooks and then automating them.

Let's take an example of a sustained quota violation from above, where a customer workload is consistently maxing out its CPU quota. This could be a sign of an inefficient loop, or something more nefarious; regardless, it requires investigation. However, having a human inspect the workload and take action does not scale sub-linearly with the size of the platform. Instead, the violation should be logged, and other tooling should automatically kick in based on the event to test hypotheses about the workload and take the necessary action. We can involve humans when a decision cannot be reached, or when it is an “oh shit” moment.
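Sketched as a playbook turned into code, the escalation might look like this. The thresholds and action names are entirely hypothetical; the point is that routine responses are automated, and humans are only paged when automation runs out of road:

```python
def react(consecutive_violations):
    """Sketch of an automated playbook for sustained CPU-quota violations.

    Thresholds are illustrative assumptions, not prescriptions.
    """
    if consecutive_violations < 3:
        return "observe"       # log it; could be a transient spike
    if consecutive_violations < 10:
        return "throttle"      # tighten the workload's CPU quota
    if consecutive_violations < 30:
        return "suspend"       # pull the plug pending inspection
    return "page_human"        # automation can't decide; "oh shit" territory
```

Each returned action would itself be a playbook that was once run by hand and has since been automated.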

Similarly, it is crucial to be able to spot impacts on customer workloads, to see if quotas are not implemented where they need to be (either due to oversight or for technical reasons). The more data you have around workloads, the better you can identify anomalies that may impact other users.

Ultimately, the platform should be mostly self-governing when it comes to responding to violations of user safety, and this should be kept in mind when building for it.

Plug Pulling

Often the only way to stop a runaway workload, be it actively impacting user safety or simply brushing up against its quotas all too often, is to pull the plug. You must have a control plane that allows this, both for operators and as part of the platform-stability feedback loop. In an ideal world, workloads that repeatedly necessitate plug pulling decrease trust in that user (increasing their risk profile), resulting in a feedback loop that reduces the quotas for their workloads.
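The feedback loop itself can be very simple. As a sketch – the halving decay and the 10% floor are illustrative numbers, not recommendations:

```python
def adjusted_quota(base_quota, plug_pulls, decay=0.5, floor=0.1):
    """Sketch of the trust feedback loop: each plug pull halves the
    user's quota, down to a floor so they aren't reduced to nothing.
    """
    return max(base_quota * decay ** plug_pulls, base_quota * floor)
```

A trusted user with no incidents keeps their full quota; a user whose workloads keep getting pulled quickly ends up at the floor, where any further abuse is cheap to contain.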

Additionally, workloads that impact other users not directly but indirectly (for example, by impacting the reputation of the service) may also need to be pulled, with the risk profile adjusted accordingly.

Insider

This is perhaps the most insidious form of threat that your users can face: their trust in the platform being broken intentionally by insiders. Any platform of a sufficient size to care about user safety is at risk from insider threat, and the forms this threat can take are many-faceted – from theft of intellectual property (as in the infamous Waymo case), to theft of customer data, to longer-term circumvention of information security apparatus. If you have something valuable, there will be attempts to steal it.

Most platforms will have some sort of administrator-level access to areas of the platform and underlying data, in order to handle outages, resolve customer problems and generally ease development of the platform. This, of course, is a trap. We often harp on about the principle of least privilege, and for good reason. For the superusers you do need, however, consider the following:

  1. Auditing. All interactions with the platform control plane, and the results of those interactions, must be logged and queryable. This will also be required by various compliance regimes.
  2. Time limiting. Staff entering a superuser state must alert loudly, and the state should be heavily time-limited to prevent accidental exposure.
  3. Plug pull. Your control plane must be able to kill a superuser session immediately, potentially triggered by automated detection or human intervention.
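A minimal sketch tying the three together – the TTL, names and in-memory audit log are hypothetical stand-ins (a real system would ship the log straight to your SIEM):

```python
import time

class SuperuserSession:
    """Sketch of a superuser grant: audited, time-limited, plug-pullable."""

    def __init__(self, operator, ttl=900, now=time.monotonic):
        self.operator = operator
        self.now = now                 # injectable clock, for testing
        self.expires = now() + ttl     # 2. heavily time-limited
        self.revoked = False
        self.audit_log = []            # 1. auditing: every interaction

    def active(self):
        return not self.revoked and self.now() < self.expires

    def run(self, action):
        """Record the interaction and its result, then allow or refuse it."""
        entry = {"operator": self.operator, "action": action,
                 "allowed": self.active(), "at": self.now()}
        self.audit_log.append(entry)   # denied attempts are logged too
        if not entry["allowed"]:
            raise PermissionError("superuser session expired or revoked")
        return entry

    def revoke(self):
        self.revoked = True            # 3. plug pull: kill it immediately
```

Revocation here could be triggered by an operator, or automatically by the same kind of anomaly detection used for customer workloads.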

Due to the asymmetric nature of trust you have with your users, you must be able to reduce their exposure to malfeasance by staff.

Epilogue

This isn't intended to be a comprehensive prescription, but something to think about when considering the safety of your users. There are no silver bullets here, and we have not covered complex topics such as content moderation. The most important takeaway is the asymmetric nature of trust between you and your users, and the need both to increase the trust you have in your users and to maintain the trust your users have in you.