How Reddit Scaled to Millions of Decisions Per Second
The Challenge of Scale in Applications
The challenge of scale is a multifaceted problem encountered by almost every developer. You must ensure your application’s operational capacity can handle millions of decisions per second while configuring the system so it can adapt to a constant flow of dynamic user demands. These goals must be achieved while identifying which components' scalability is vital for the entire system's performance.
But how do big organizations achieve such feats? Reddit’s journey in scaling the authorization system of its advertising platform is a prime example of meeting these challenges head-on and overcoming them.
In this blog, inspired by a post written by Reddit Staff Engineer Braden Groom, we look at how the Reddit team handled this challenge. Alongside Reddit’s in-house solution, we’ll also examine OPAL, an open-source solution that offers similar functionality while remaining simple and accessible to development teams of all sizes. Let’s dig in -
Reddit's Scaling Dilemma
Reddit is one of the largest online platforms out there, with its advertising platform being an integral part of its business model, allowing businesses to tap into its vast and varied user base. The platform enables advertisers to target audiences based on interests, demographics, and behaviors.
Because much of Reddit’s content is public, the company had never needed an overly complex authorization system. As a result, when the ads team looked for an existing generalized authorization service within the company, they couldn't find one, and they started exploring a homegrown solution within the ads organization.
Reddit faced a unique scaling challenge with its advertising platform's authorization system. The complexity of authorization decisions, coupled with a dynamic user hierarchy and the need for customizable permissions, presented a significant hurdle. The platform needed to support intricate advertiser requirements while remaining accessible for simpler, company-wide management needs. This complexity highlighted the necessity for a scalable, adaptable authorization system.
Setting Ambitious Goals
Reddit set an ambitious target: to process policy decisions in under 10 milliseconds. This goal underscored the importance of meticulous design and optimization in enterprise systems, with the aim of making authorization checks both fast and reliable.
Alongside this goal, there were additional essential requirements in ensuring that the system would not only meet their current needs but also be adaptable for future challenges:
Availability
An outage in the authorization service could mean a complete halt in the operation of the advertising platform, making it impossible to perform essential authorization checks. Therefore, high uptime is critical to maintain the continuous functioning of the platform.
Auditability
For security and compliance, a detailed log of all decisions made by the authorization service is necessary. Auditability is not just a regulatory requirement but also a fundamental part of managing this system, especially in the case of unauthorized access. It allows Reddit to track and review authorization decisions, ensuring the system functions correctly and adheres to all necessary policies and regulations.
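A decision log of this kind can be sketched as an append-only list of records, one per authorization check. The field names and the `record_decision` and `denied_for` helpers below are illustrative assumptions; the source does not describe the schema Reddit actually uses.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class DecisionRecord:
    # Hypothetical fields; the real service's log schema is not described in the source.
    timestamp: float
    subject: str
    action: str
    resource: str
    allowed: bool


def record_decision(log: List[str], subject: str, action: str,
                    resource: str, allowed: bool) -> DecisionRecord:
    """Append one JSON line per decision so the log can be searched or replayed later."""
    record = DecisionRecord(time.time(), subject, action, resource, allowed)
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record


def denied_for(log: List[str], subject: str) -> List[dict]:
    """Return every denied decision for a subject, e.g. when reviewing suspicious access."""
    entries = (json.loads(line) for line in log)
    return [e for e in entries if e["subject"] == subject and not e["allowed"]]
```

Writing one self-describing line per decision keeps the log easy to ship to whatever storage or search system is already in place.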
Flexibility
Reddit, like every player in the advertising landscape, must frequently evolve based on the expectations of its advertising partners - allowing them to define and manage their own roles. Therefore, the authorization system must be flexible and adaptable to changing requirements without significant overhauls.
Multi-Tenancy (stretch goal)
While not an explicit initial requirement, Reddit aimed for a multi-tenant capability in their authorization system. This goal was set with the understanding that a generalized authorization solution was lacking at Reddit, and thus, a system that could address multiple use cases across the company would be beneficial. Although focused on the advertising platform, this stretch goal would enhance the system’s flexibility and scalability, allowing it to potentially serve various needs across Reddit as a whole.
With the requirements laid out before us, there is one more important challenge, unique to ad tech, to consider -
Authorization for Anonymous Identities
An additional significant challenge with advertising on a platform like Reddit comes from the fact that a large portion of user interaction occurs through anonymous identities, as you don’t have to create an account in order to browse content (and see ads while you do). This presents a unique challenge in the context of advertising authorization.
When it comes to advertising, the platform needs to perform authorization checks to determine which ads to show to which users. These checks are straightforward when dealing with registered users, as their profiles, preferences, and histories can guide ad targeting. However, for anonymous users, the platform lacks this personalized data, requiring a different approach to authorization and ad targeting.
So how did the Reddit team seek to tackle these challenges, and what can we learn from their experience? Let’s start with the first principle they decided to implement -
Architectural Decisions for Scale
Decoupling Policy from Code
Inspired by Google's Zanzibar, Reddit opted to decouple policy from code, a strategic move to enhance scalability and flexibility. This approach allowed them to manage authorization policies independently from the application logic. It also meant their database performs no rule evaluation when fetching rules at query time, keeping the query patterns “simple, fast, and easily cacheable”. Rule evaluation happens only in the application, after the database has returned all of the relevant rules. Keeping the policy storage and evaluation engines clearly isolated also makes it relatively easy to replace either one in the future.
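The split between storage and evaluation can be illustrated with a minimal sketch. Everything here is hypothetical: the `Rule` shape, the in-memory `RULES_BY_ACCOUNT` store, and the deny-overrides convention are assumptions made for illustration, not Reddit's actual data model.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rule:
    role: str
    action: str
    effect: str  # "allow" or "deny"


# Stand-in for the rule store: a plain key lookup with no evaluation logic.
RULES_BY_ACCOUNT: Dict[str, Tuple[Rule, ...]] = {
    "account:42": (
        Rule("admin", "campaign:edit", "allow"),
        Rule("analyst", "campaign:view", "allow"),
        Rule("analyst", "campaign:edit", "deny"),
    ),
}


@lru_cache(maxsize=1024)
def fetch_rules(account: str) -> Tuple[Rule, ...]:
    """Storage layer: return all rules for an account. Cacheable because it is a pure lookup."""
    return RULES_BY_ACCOUNT.get(account, ())


def is_allowed(account: str, role: str, action: str) -> bool:
    """Evaluation layer: runs in the application after the rules have been fetched.

    Assumed convention: an explicit deny wins, and no matching rule means deny.
    """
    matching = [r for r in fetch_rules(account) if r.role == role and r.action == action]
    if any(r.effect == "deny" for r in matching):
        return False
    return any(r.effect == "allow" for r in matching)
```

Because `fetch_rules` only ever looks up rules by key, it can sit behind a cache or be swapped for a different store without touching `is_allowed`.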
By separating these concerns, Reddit could scale its authorization system more effectively, ensuring rapid decision-making without sacrificing flexibility or adding unnecessary complexity. To achieve this, Reddit employed Open Policy Agent (OPA).
Tackling Dynamic Configuration - Open Policy Agent
Open Policy Agent (OPA) is an open-source, general-purpose policy engine that decouples policy decision-making from policy enforcement. It provides a high-level declarative language (Rego) to specify policy as code and APIs to offload decision-making from your software.
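As a rough sketch of what offloading a decision looks like, an application can send an input document to OPA's REST Data API (`POST /v1/data/<policy path>`) and read the decision from the `result` key of the response. The helper names, the `ads/authz/allow` policy path, and the input fields below are assumptions for illustration; the source does not show how Reddit's service calls OPA.

```python
import json
import urllib.request
from typing import Any, Dict


def build_opa_query(base_url: str, policy_path: str,
                    input_doc: Dict[str, Any]) -> urllib.request.Request:
    """Build a POST to OPA's Data API with the input document wrapped in an "input" key."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def parse_opa_decision(response_body: bytes) -> bool:
    """OPA returns the decision under "result"; if the key is absent the rule was
    undefined, which this sketch treats as a deny."""
    return json.loads(response_body).get("result", False) is True


def check(base_url: str, policy_path: str, input_doc: Dict[str, Any]) -> bool:
    """Send the query to a running OPA instance and return the boolean decision."""
    request = build_opa_query(base_url, policy_path, input_doc)
    with urllib.request.urlopen(request, timeout=1.0) as response:
        return parse_opa_decision(response.read())
```

With an OPA server listening locally, a call such as `check("http://localhost:8181", "ads/authz/allow", {"user": "alice", "action": "edit"})` would return the decision of whichever rule is loaded at that hypothetical path.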
Already used at Reddit for Kubernetes-related authorization tasks, OPA allowed the Reddit team to further decouple policy evaluation from policy storage. This separation enabled the Reddit team to facilitate centralized rule management, allowing the system to adapt to changing requirements without significant overhauls. Policies could be defined in a unified manner and applied consistently across various parts of a system, maintaining consistency and manageability.
As mentioned, Reddit’s team developed a centralized service for managing authorization. While the Reddit team decided to develop this capability in-house, it is possible to achieve these results using OPAL.
Open Sourced Centralized Rule Management - Open Policy Administration Layer (OPAL)
Open Policy Administration Layer (OPAL) is an open-source administration layer for policy engines such as Open Policy Agent (OPA) and AWS' Cedar Agent. It detects changes to both policy and policy data in real time and pushes live updates to those agents.
In parallel to the bespoke solution developed by Reddit, OPAL stands out as a ready-made open-source solution offering similar capabilities. When used in conjunction with OPA, OPAL acts as a dynamic administration layer, ensuring the policy engine is continuously synchronized with the latest policies and data. This is achieved by deploying OPAL Clients alongside OPA, which then subscribe to topic-based Pub/Sub updates. These updates are managed and distributed by the OPAL Server, supplemented by data from various sources like databases, APIs, or third-party services.
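The topic-based Pub/Sub flow described above can be pictured with a small in-memory sketch. This is not OPAL's API: `UpdateServer`, `PolicyClient`, and the topic names are hypothetical stand-ins that only show the idea of clients subscribing to topics and keeping a local copy in sync.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[str, Any], None]


class UpdateServer:
    """Toy stand-in for the server side: tracks subscribers per topic and fans out updates."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> int:
        """Deliver the payload to every subscriber of the topic; return how many were notified."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


class PolicyClient:
    """Toy stand-in for a client running next to a policy engine: mirrors updates locally."""

    def __init__(self, server: UpdateServer, topics: List[str]) -> None:
        self.local_data: Dict[str, Any] = {}
        for topic in topics:
            server.subscribe(topic, self._on_update)

    def _on_update(self, topic: str, payload: Any) -> None:
        # In a real deployment this is where the policy engine would be refreshed.
        self.local_data[topic] = payload
```

Subscribing per topic means each client only receives the policy and data updates it actually needs.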
Using Git repositories and GitOps as a method for rule storage, OPAL provides several benefits:
- Version Control: Using Git repositories for rule storage means that every change is tracked. This is crucial for audit trails, allowing teams to see who made changes, when, and why.
- Rollback and History: In case of errors or unforeseen issues, it’s easy to roll back to previous versions of policies, enhancing the system's reliability.
- Collaboration and Review: GitOps facilitates collaboration among team members. Changes can be reviewed through merge requests, ensuring that updates to policies undergo scrutiny before implementation.
- Automated Deployment: Changes in the repository can trigger automated deployments, making the update process more efficient and reducing manual intervention.
OPAL facilitates real-time updates and policy management, aligning with Reddit's approach while making it available to everyone as open source. Additionally, OPAL offers the flexibility to work with simpler policy languages like AWS’ Cedar.
OPA, AWS’ Cedar, and OPAL
While Reddit's solution is tightly integrated with OPA, OPAL presents a versatile alternative that can accommodate various policy engines and languages - supporting both OPA and AWS’ new open-source policy language, Cedar.
AWS Cedar is a new open-source policy-as-code language developed by AWS to streamline identity and access management. It introduces a structured and scalable approach to managing permissions, making it a game-changer for application-level permissions.
Cedar presents two main benefits over OPA when it comes to separating policy and code -
- Readability: Cedar is designed to be highly readable, making it easier for both technical and non-technical team members to understand and work with the language.
- Application-Level Authorization: Cedar is specifically tailored for application-level authorization. It is well-suited for managing and enforcing permissions within applications, ensuring that access control requirements are effectively met at the application level.
You can dive deeper into the difference between the two policy engines here.
This flexibility makes OPAL an attractive option for developers looking to implement scalable authorization systems without committing to the complexity of OPA.
Performance, Results, and Lessons Learned
Following the implementation of their newly developed authorization service, the Reddit team reported outstanding results. The service has demonstrated impressive efficiency, with p99 latencies around 8 milliseconds and p50 latencies close to 3 milliseconds for authorization checks. These metrics indicate the service's ability to handle authorization requests swiftly and effectively.
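For readers less familiar with the notation, p50 and p99 are percentiles of the measured latencies: half of the checks finish at or below the p50 value, and 99% finish at or below the p99 value. The sketch below uses the nearest-rank method, which is one common definition and an assumption here; the source does not say how Reddit computes its percentiles, and the sample values are made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [3.0, 2.0, 8.0, 4.0, 3.0, 2.5, 3.5, 9.0, 3.0, 2.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
meets_target = p99 < 10.0  # compare against the sub-10 ms goal
```

Tracking p99 rather than the average matters for a target like this one, because a small share of slow checks can hide behind a healthy mean.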
Equally notable is the service's remarkable stability. Since its launch over a year ago (at the time Braden published his post), the service has operated without any outages, underscoring its reliability. Interestingly, the majority of issues encountered were related to logical errors within the policies themselves rather than the infrastructure or the software.
Reddit's journey in crafting and scaling its authorization system to support millions of decisions per second offers valuable insights into the architectural and strategic considerations necessary for scaling complex systems. The decision to decouple policy from code, along with the separation of policy evaluation from storage, is a pivotal step in creating scalable, flexible, and efficient systems. These strategies met Reddit's immediate needs and laid a foundation for future growth and adaptation.
The next step? Open-Source Implementation!
OPAL lets you take this solution one step further: you can implement it in your application without building a solution in-house, and you gain the flexibility of using a policy language such as Cedar, which can be more approachable for developers and their teams. You can support OPAL by giving it a star on GitHub ⭐
Want to learn more about Authorization? Join our Slack community, where there are hundreds of devs building and implementing authorization.
Written by
Daniel Bass
Application authorization enthusiast with years of experience in customer engineering, technical writing, and open-source community advocacy. Community Manager, Dev. Convention Extrovert and Meme Enthusiast.