Cyber Security
Cyberthreats are constantly evolving, which means that policy must also be regularly re-examined. This policy should be reviewed and revised at least once per year. It was last reviewed on November 13th, 2023.
Threat Model
Our cyber-risk primarily stems from bulk, financially motivated cyber-threats. We don’t currently have the size, profile, or political valence to justify positioning against targeted attacks or non-financial bad actors.
This stance will be continually adjusted based on the size, financial scope, and ideological sensitivities of our clients and the projects we take on for them.
Prevention
Access Management
The overwhelming majority of incidents are precipitated through credential mismanagement, and we can minimize that possibility through a few simple principles.
Logins
We use mandatory 2-factor authentication for our Google Workspace domain, with a preference, in descending order, for:
- Hardware security devices like YubiKeys
- Authenticator Apps (e.g. TOTP-based 2FA)
- Email 2FA
- SMS 2FA
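One property that plausibly explains why authenticator apps rank above email and SMS: a TOTP code is computed locally from a shared secret and the current time, so no code travels over a channel that could be intercepted. The following is a minimal sketch of the RFC 6238 computation using only the Python standard library. The function names are illustrative and not part of any tool named in this policy.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # low nibble of the last byte picks the window
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the current time step."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)
```

With the RFC 6238 test secret, `totp(b"12345678901234567890", at=59)` returns `"287082"`.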
All employees must use a password manager of their choosing, with a preference for open-source and audited tools, like Bitwarden. Passwords must be globally unique, and have high entropy. Passwords should be generated through an employee’s password manager.
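To give a rough feel for what “high entropy” means, the sketch below estimates the entropy of a uniformly random password and generates one with a cryptographically secure source. It is an illustration only: real passwords should still come from your password manager, and the 20-character default is an assumption, not a policy requirement.

```python
import math
import secrets
import string

# 52 letters + 10 digits + 32 punctuation characters = 94 symbols
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Entropy of a password whose characters are drawn uniformly at random."""
    return length * math.log2(alphabet_size)


def generate_password(length: int = 20, alphabet: str = ALPHABET) -> str:
    """Draw each character independently from a CSPRNG (illustration only)."""
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

Under these assumptions, a 20-character password over the 94-symbol alphabet carries about 131 bits of entropy.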
We centralize risk and protection on the Google identity, so when given the “Login with Google” option, take it.
Tokens
Prefer accessing resources through short-lived credentials generated for task-specific identities. When long-lived machine authentication tokens (API Keys, access tokens, etc) need to be generated, they should be exclusively stored in git (usually on GitHub), encrypted using sops.
Our usage of sops is described in our Developer Handbook. For most applications and use cases, our sops-encrypted files are backed by KMS Keys on Google Cloud, meaning that Google’s identity system is centralizing our access here.
When generating long-lived machine authentication tokens, document the process used to generate the credential (including the site, approximate time, commands run, etc), omitting sensitive data. This is critical for reproducibility and ease of key rotation. Keys should be rotated periodically through automated mechanisms when technologically feasible.
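One way to keep that documentation consistent is a small structured record per credential. The sketch below is hypothetical: the field names and the 90-day rotation period are assumptions, since this policy does not prescribe a record format or an interval. Note that the record holds no secret material.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class CredentialRecord:
    """Provenance of a long-lived token, with all sensitive data omitted."""
    name: str
    site: str                      # where the credential was generated
    created_at: datetime           # approximate generation time
    commands: List[str] = field(default_factory=list)  # commands run, secrets redacted
    rotation_period: timedelta = timedelta(days=90)    # assumed interval

    def rotation_due(self, now: datetime) -> bool:
        """True once the credential is at least one rotation period old."""
        return now - self.created_at >= self.rotation_period


record = CredentialRecord(
    name="ci-deploy-token",  # hypothetical credential
    site="https://example.com/settings/tokens",
    created_at=datetime(2023, 1, 1, tzinfo=timezone.utc),
    commands=["<command used to mint the token, with secrets redacted>"],
)
```

A scheduled job could iterate over such records and open a rotation task whenever `rotation_due` returns true.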
ACLs
When setting up cloud infrastructure, use standard IAM tools to authenticate and authorize machines to talk to one another. Avoid custom code and “rolling your own” authentication mechanisms.
Configure ACLs through Terraform so that our permission hierarchy can be code reviewed. Do not override or augment them through UI interactions unless those changes are then imported back into Terraform. We do not currently enforce this with automated tools, but plan to.
ACLs should be narrow and reflect a business purpose. During initial infrastructure creation, imprecise authorization is inevitable - that’s OK. Accompany every instance of potentially over-broad permission granting with a TODO in the code that references a GitHub Issue with the tag ACL to rectify it.
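A small lint can catch TODOs that were never tied to an issue. The sketch below assumes a comment convention of the form `TODO(ACL): ... #123` (or a full GitHub issue URL). That convention is an assumption for illustration; the policy only requires that the TODO reference a GitHub Issue tagged ACL.

```python
import re
from typing import List

# Assumed convention: the TODO comment mentions ACL and cites an issue.
TODO_ACL = re.compile(r"TODO\b.*\bACL\b")
ISSUE_REF = re.compile(r"#\d+|github\.com/[\w.-]+/[\w.-]+/issues/\d+")


def todos_missing_issue(source: str) -> List[int]:
    """Return 1-based line numbers of ACL TODOs that cite no GitHub Issue."""
    missing = []
    for number, line in enumerate(source.splitlines(), start=1):
        if TODO_ACL.search(line) and not ISSUE_REF.search(line):
            missing.append(number)
    return missing


sample = (
    'role = "roles/editor"  # TODO(ACL): narrow this role, see #42\n'
    'role = "roles/owner"   # TODO(ACL): narrow this role\n'
    'role = "roles/viewer"\n'
)
```

Here `todos_missing_issue(sample)` flags only line 2, the TODO without an issue reference.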
Use tooling like Google Cloud Platform’s (GCP) policy analyzer to detect overbroad ACLs, paying particular attention to this prior to public launch/access.
We’re less concerned with internal document/email access management, but follow sensible sharing practices in everything you create.
Devices
Employees should either have a dedicated machine for work-related software development, or treat their personal device as they would a work device. When using a personal device for work, use different profiles for work and personal use, or isolate them as separate virtual machines using something like Vagrant or Qubes OS.
Employees should only download and use software that is broadly trustworthy. If you have questions on how this should be interpreted, ask.
Isolate Test/Dev/Prod environments
Resources in one environment should never talk to resources in another.
Create symmetrical ACL structures in every environment (usually with Terraform). This allows issues to be discovered prior to prod, prevents overbroad access in non-prod environments, and allows for simpler, higher-confidence changes to ACLs.
In the rare instances where you do want shared state between environments, use a distinct, explicitly -shared environment. We commonly use this to have a single registry for all of our deployment artifacts within a single application/client, to ensure that we’re using the same binaries across environments, e.g. when using “Promote to {Staging,Prod}” workflows.
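The isolation rule can be expressed as a simple check. The sketch below assumes, purely for illustration, that every resource name ends in its environment (`-test`, `-dev`, `-prod`, or `-shared`). The naming scheme and function names are hypothetical, not our actual tooling.

```python
ENVIRONMENTS = ("test", "dev", "prod", "shared")


def environment_of(resource: str) -> str:
    """Read the environment from an assumed '<name>-<env>' naming scheme."""
    suffix = resource.rsplit("-", 1)[-1]
    if suffix not in ENVIRONMENTS:
        raise ValueError(f"cannot determine environment of {resource!r}")
    return suffix


def may_connect(source: str, target: str) -> bool:
    """Allow traffic within one environment, or into the shared environment."""
    src, dst = environment_of(source), environment_of(target)
    return src == dst or dst == "shared"
```

Under this sketch, `may_connect("api-dev", "db-prod")` is false, while `may_connect("deployer-prod", "registry-shared")` is true. Shared resources are not allowed to initiate connections into a specific environment, which is a design choice of the sketch rather than a stated rule.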
Code
Selectively use third-party libraries
Do not use unofficial third-party libraries that are designed to handle sensitive data. For example, github.com/<randomuser>/paypalclient would likely be a bad choice of library. Generally payments, authentication, and authorization fall under the umbrella of “sensitive”, but use your best judgement here.
The motivation here is that those third-party libraries are a bulk attack vector. It’s easier to hack a bunch of sites’ PayPal credentials by creating a PayPal client than it is to target those same credentials through a non-PayPal-specific third-party library (though the latter is still possible).
If there is code that provides some valuable functionality on top of these services: fork the project, audit the code (including its dependencies), and use the forked version.[1] This is a time-consuming process to do correctly, and should only be done in limited circumstances.
Libraries with significant adoption are more likely to be better audited and managed, but their popularity also makes them bigger targets for exploitation. Libraries provided by large, technologically sophisticated companies (ex: Mozilla, Google, Stripe) are usually safe, but make sure it’s officially supported and not just an employee’s project.
Be careful with version control
Make defensive assumptions about git: its history is immutable, and every git repo will eventually become public. Even though these aren’t strictly true, these assumptions provide strategic value.
For example, when you accidentally commit a secret, even if you overwrite it before opening a PR, discard that secret and generate a new one. Once the compromised secret is no longer in use, use a tool like git filter-branch or BFG Repo Cleaner to remove any trace of the secret from the repo history.
One approach we’ve used, especially for large projects we plan on open-sourcing, is to first create a private <project>-draft repo and do the initial development in there. Once things are in a publishable state, audit the current state of the repo, remove the .git directory, and publish that as the initial commit in a fresh, public repository.
Clients
In cases where we’re managing infrastructure on behalf of clients, we’ll follow all the guidelines on this page. In cases where clients are managing their infrastructure, there are some things we can do to help them manage their cybersecurity risk:
- Have regular conversations with clients about their cybersecurity risk profile, and the highest value steps they can take to mitigate it
- Develop a long term vision with the client about how they will manage secrets and ACLs for the infrastructure we build for them
- Use a distinct keyring per-project (prefer this over per-client)
- Encourage clients to have their own local, dev, and prod equivalents.
- In designing projects, avoid collecting sensitive data, and design out indefinite data retention. This minimizes the surface area of data exposure or compromise
Culture
Many cybersecurity risks can be mitigated with a healthy engineering culture. This includes things like:
- Ask questions about things you don’t understand, and give feedback and share information with kindness and best assumptions
- Get a more formal review on every security-pertinent choice you make
- Write documentation as you develop; prefer too much documentation to too little
- Security through obscurity never carries any weight as an argument
- Expenses incurred by employees in the maintenance of these preventative measures can be reimbursed. This includes security keys, password manager fees, etc
Detection
Since we’re centralizing our assets on GCP and Google, it’s where we’ll invest the most energy detecting issues. We’ll focus detection where we focus mitigation: at the authentication layer.
- Google Workspace Reporting - We’ll use the standard Google security dashboarding and reporting tools for monitoring and alerting on suspicious authentication.
- GCP Standard Tooling - We’ve enabled Security Command Center and Security Health Analytics, and will continue to use and expand standardized tooling for risk detection and notification. We can periodically audit resource definitions, permissions, and activity at an organizational level. We have data access and audit logs enabled for all of our projects.
- Resource Limits + Alerts - We impose reasonable limits and budget alerts on all infrastructure - cost and bandwidth limits in particular are powerful tools to combat and detect abuse and infiltration.
- Resource Inventory - We track our deployed assets to catch malicious actors, misconfigurations, and obsolete resources.
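The resource inventory check above amounts to comparing what we declared against what is actually deployed. A minimal sketch, assuming both lists are available as collections of resource identifiers (how they are obtained is out of scope here, and the names are hypothetical):

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class InventoryDiff:
    unexpected: FrozenSet[str]  # deployed but never declared: investigate
    missing: FrozenSet[str]     # declared but not deployed: drift or obsolete


def diff_inventory(declared: Iterable[str], deployed: Iterable[str]) -> InventoryDiff:
    """Compare declared resources (e.g. from code) with what is running."""
    declared_set, deployed_set = frozenset(declared), frozenset(deployed)
    return InventoryDiff(
        unexpected=deployed_set - declared_set,
        missing=declared_set - deployed_set,
    )


result = diff_inventory(
    declared=["bucket-prod", "api-prod"],
    deployed=["bucket-prod", "api-prod", "miner-prod"],
)
```

In this example `result.unexpected` contains only `"miner-prod"`, which is the kind of entry that warrants a closer look.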
Response
We’ll follow standard Data Breach Response best practices. If we think there may have been a data breach, we will:
- Assign a response coordinator responsible for managing the response
- Create two tracking documents for the incident: An internal doc to document investigation, mitigation and notification, and an updates doc where we can post notifications to potentially impacted parties.
- Proactively notify potentially impacted clients, regardless of incident severity, and point them to the updates doc for up-to-date information
- Use identity as the basis on which to determine the scope of potentially breached information.
- If cause can’t be identified in-house, seek external expertise until cause can be determined.
- Once cause has been identified, perform key-rotation on all potentially impacted resources.
- Notify impacted end-users if sensitive personal information has been breached.
We will test this plan with a realistic tabletop exercise to validate what works and what doesn’t.
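Scoping a breach by identity, as described in the steps above, can be sketched as a filter over audit log entries. The entry shape (`principal` and `resource` keys) is a simplifying assumption and does not match any particular log format.

```python
from typing import Dict, Iterable, Set


def resources_touched(entries: Iterable[Dict[str, str]], principals: Set[str]) -> Set[str]:
    """Collect every resource accessed by any potentially compromised identity."""
    return {e["resource"] for e in entries if e["principal"] in principals}


# Hypothetical audit log entries for illustration.
audit_log = [
    {"principal": "alice@example.com", "resource": "bucket/reports"},
    {"principal": "ci-bot@example.com", "resource": "registry/images"},
    {"principal": "alice@example.com", "resource": "kms/keyring-a"},
]
```

Calling `resources_touched(audit_log, {"alice@example.com"})` yields the two resources that identity accessed, which then become the candidates for key rotation and notification.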
Recovery
- Every project that we work on will perform regular backups on a cadence that is sensible for the project.
- We’ll test the restore process for each backed up resource on a regular basis.
- We’ll store backups in at least two distinct cloud provider zones. We’re unlikely to require this level of replication for serving data stores.
- We disable deletes of backups, relying exclusively on TTLs to initiate backup deletion.
[1] With the standard caveat to only use and fork appropriately licensed software.