Back in October we blogged about the Bitbucket team's recent and ongoing efforts to achieve world-class reliability. We've been making major investments in this area, and we're excited to share some really big wins in the coming weeks.
But as things go in the world of development, every win can come with a few setbacks as well. Acting on the commitment we made in our previous post, we want to provide greater visibility going forward and take you along on each step of this journey. In mid-November, our services experienced an incident that caused Pipelines builds to fail with errors when cloning repositories. We view this as an opportunity to shed some light on our incident management process and share just one example of the many ways we are working to improve reliability and incident response.
At Atlassian, all teams follow a well-defined incident response process, described in the Atlassian Incident Handbook. Part of this process is a postmortem exercise known as a post-incident review (PIR), a meeting held after every customer-facing incident. The goal of the meeting is to align on what happened, identify the root cause(s), and agree on priority actions to address those causes, mitigate future impact, and improve detection and response going forward.
Putting it into practice: a glimpse into our most recent post-incident review
To help illustrate what this process looks like, we can look at the post-incident review from our most recent incident. One of the key components of a successful PIR is the "Five Whys" exercise: we start by stating what went wrong from a customer's perspective, then ask "Why?" repeatedly to enumerate proximate causes until arriving at a root cause. This is what our Five Whys analysis looked like last week:
- Our customers’ Pipelines builds were failing, preventing teams from merging and deploying their software. Why were the builds failing?
- Our systems were experiencing an elevated rate of Git and Hg clone failures, which those Pipelines builds relied on. Why were clones failing?
- Database performance was severely degraded, causing errors and timeouts in the code responsible for handling clones. Why was DB performance degraded?
- A small number of accounts were making a disproportionately high rate of requests to expensive endpoints, causing anomalous load on our infrastructure. Why were these requests not throttled by rate limiting?
- Our existing rate limits are based on request volume rather than request cost as measured by resource usage, and the API endpoints involved in this incident have highly variable cost that those limits do not capture.
In fact, we might perform several Five Whys analyses during a PIR to identify multiple root causes or to address different concerns: for example, once to identify the technical root cause of a customer-facing bug, then again to identify shortcomings in our alerting (e.g. why didn't we detect this sooner?) or in our ability to resolve the incident (e.g. why couldn't we roll back more quickly?). The above is just one example of what this process can look like as our team works through how to prevent similar incidents from recurring in the future.
Throttling based on resource usage, not request volume
As stated in the above Five Whys analysis, we have had simple but fairly effective rate limiting in place on Bitbucket Cloud for some time. And while simplicity is great for code readability and maintainability, sometimes a more sophisticated solution is necessary. This recent incident highlighted some known shortcomings of our existing rate limiting mechanism:
- Our limits were manually defined, leaving gaps.
Our existing rate limits were defined by analyzing our most expensive API endpoints (on average) and defining caps on the number of requests a user could make per hour to each endpoint. These caps are intended to allow the vast majority of legitimate requests and only prevent abuse, whether malicious or accidental (e.g. overly aggressive automation). However, because these caps were based on average response time, they did not provide adequate coverage for endpoints with highly variable cost.
- We were only applying limits based on request volume, not resource usage.
For example, a certain API endpoint might allow only 1,000 requests from a given user per hour. This treats all requests equally: even if one request takes 100ms and the next takes 2s, both count the same as 1/1,000th of the user's hourly quota, despite the second request costing roughly 20x as much as the first (see the sketch after this list).
- We were limiting requests per endpoint, not per user.
A script or other integration might deplete its quota on one expensive API endpoint, then simply move on to another endpoint and resume making a large volume of requests. Outside of malicious activity this might not typically be a major issue, but it is not uncommon for integrations to access multiple APIs, and in those cases an integration can impose several times the load on our infrastructure that our limits were meant to allow.
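To make these shortcomings concrete, here is a minimal sketch of this style of limiter. It is not our production code: the endpoint paths, caps, and in-memory counters are hypothetical, standing in for the kind of manually defined, per-endpoint hourly caps described above.

```python
import time
from collections import defaultdict

# Hypothetical per-endpoint caps: a fixed number of requests per user per
# hour, chosen by manually analyzing average cost. Paths and numbers are
# made up for illustration.
REQUESTS_PER_HOUR = {
    "/2.0/repositories/{workspace}/{repo_slug}/commits": 1000,
    "/2.0/repositories/{workspace}/{repo_slug}/diffstat": 500,
}

WINDOW_SECONDS = 3600

# (user_id, endpoint) -> [window_start, request_count]
_counters = defaultdict(lambda: [0.0, 0])


def allow_request(user_id: str, endpoint: str) -> bool:
    """Fixed-window, volume-based check: every request counts the same,
    no matter how expensive it turns out to be."""
    cap = REQUESTS_PER_HOUR.get(endpoint)
    if cap is None:
        # Gap: endpoints without a manually defined cap are effectively unlimited.
        return True

    now = time.time()
    window_start, count = _counters[(user_id, endpoint)]
    if now - window_start >= WINDOW_SECONDS:
        _counters[(user_id, endpoint)] = [now, 1]
        return True
    if count >= cap:
        # Over the cap for *this* endpoint only; the same user can keep
        # making requests to other endpoints unimpeded.
        return False  # would surface to the client as an HTTP 429
    _counters[(user_id, endpoint)][1] = count + 1
    return True
```

Each of the shortcomings above shows up here: endpoints missing from the table are uncapped, a cheap request and an expensive one consume the same single unit of quota, and the counter is scoped to a single endpoint rather than to everything the user is doing.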
The good news here is that even before this PIR, we had already begun work on a project to introduce a new layer of throttling to our architecture. This new throttling is based on overall per-account resource usage and addresses each of the shortcomings listed above. Overall resource usage limits are defined based on the 99.99th percentile of what we see in production. This does not require any manual analysis because the limit is applied across all website and API requests, and endpoints with highly variable cost are handled correctly because the throttling logic accounts for the cost of each individual request.
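As a rough illustration of how such a limit could be derived, the following sketch computes a quota from the 99.99th percentile of per-account usage. The sample data is synthetic; the real quota is derived from production measurements.

```python
import numpy as np

# Hypothetical hourly resource-usage totals (abstract cost units), one value
# per account per hour, standing in for measurements collected in production.
hourly_account_usage = np.random.lognormal(mean=3.0, sigma=1.5, size=1_000_000)

# Set the per-account quota at the 99.99th percentile of observed usage, so
# only a tiny fraction of accounts would ever be throttled.
quota = np.percentile(hourly_account_usage, 99.99)
print(f"Per-account hourly quota: {quota:.0f} cost units")
```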
The work that has already been done consists of wrapping our request handling in a middleware capable of measuring resource usage. This middleware instruments the code responsible for handling every request and produces an abstract measure of "cost" that incorporates both I/O operations and CPU utilization. A running quota for every user (or IP address for anonymous requests) is tracked in a low-latency key/value store and effectively replenished via a recalculation on every call.
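The sketch below shows roughly what such a middleware could look like, including the shadow mode described in the next paragraph. It is written in the style of a Django middleware purely for illustration; the cost model, quota values, Redis schema, and header names are all assumptions, not our actual implementation.

```python
import time

import redis  # assumption: any low-latency key/value store would work here

# Assumed values for illustration only; the real per-account limit is derived
# from the 99.99th percentile of observed production usage.
QUOTA_PER_WINDOW = 100_000   # abstract "cost units" per account per window
WINDOW_SECONDS = 3600

store = redis.Redis()


def request_cost(io_ops: int, cpu_seconds: float) -> float:
    # Abstract cost combining I/O operations and CPU time. The weighting is
    # made up; the point is that cost, not request count, gets charged
    # against the quota.
    return io_ops + 100 * cpu_seconds


class ResourceUsageThrottle:
    """Sketch of a per-account, cost-based throttling middleware."""

    def __init__(self, get_response, shadow_mode=True):
        self.get_response = get_response
        self.shadow_mode = shadow_mode  # measure and annotate, but don't reject

    def __call__(self, request):
        # Track quota per account, falling back to IP for anonymous requests.
        account = getattr(request, "account_id", None) or request.META.get("REMOTE_ADDR")
        # Keying by time window means the quota is effectively replenished
        # as old keys expire and a new window's key takes over.
        key = f"usage:{account}:{int(time.time() // WINDOW_SECONDS)}"

        # Handle the request while (very roughly) measuring its CPU cost.
        cpu_start = time.process_time()
        response = self.get_response(request)
        cost = request_cost(
            io_ops=getattr(request, "io_ops", 0),  # hypothetical instrumentation
            cpu_seconds=time.process_time() - cpu_start,
        )

        # Charge the cost against the account's running quota.
        used = store.incrbyfloat(key, cost)
        store.expire(key, WINDOW_SECONDS)
        over_quota = used > QUOTA_PER_WINDOW

        if over_quota and not self.shadow_mode:
            # Enforcing mode: a production implementation would reject the
            # request up front rather than after doing the expensive work.
            response.status_code = 429
        elif self.shadow_mode:
            # Shadow mode: annotate instead of rejecting. The load balancer
            # logs and strips these (hypothetical) headers.
            response["X-Request-Cost"] = f"{cost:.2f}"
            response["X-Quota-Remaining"] = f"{max(QUOTA_PER_WINDOW - used, 0):.2f}"
            response["X-Would-Throttle"] = str(over_quota)
        return response
```

Keying the quota by time window is just one simple way to replenish it without a background job; it approximates the recalculation-on-every-call approach described above.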
For some time, we have been running this code and simulating the effect that resource usage-based throttling would have. Instead of actually throttling user requests (i.e. returning 429 responses), the application code populates a set of custom HTTP headers indicating the request's cost, the user's remaining quota, and whether or not the request would have been throttled. These headers are then logged and stripped from the response by the load balancer. This allows us to visualize the distribution of resource usage across accounts. The following graph shows this distribution for our top 1,000 accounts by resource usage. (Bear in mind these are just the top 1,000 accounts, which already represent a very small percentage of our total active accounts.)
Clearly, load on our infrastructure is not spread evenly across accounts. A very small percentage of active accounts on Bitbucket have an outsize impact on our systems.
Truthfully, on most days, a graph like the one above is OK. We should expect that some users are much more active than others. But when the spike to the left of the graph occasionally grows out of control, it isn't just bad for us; it's bad for our customers. It should not be possible for one team using Bitbucket to cause disruptions to our service that prevent other teams from accessing their code, even by accident.
What's validating about the above graph is that it lets us plainly see that throttling even just the top 1% of these top 1,000 accounts by resource usage will make a real difference in mitigating the risk of incidents like this in the future. It certainly would have prevented this latest incident. Over the next week or two we will be performing further analysis and tweaking thresholds, with the goal of enabling this new protection soon to shield the majority of our users from the ripple effects of service misuse that isn't already caught by our existing rate limiting.
This is just one example of how the Bitbucket team applies Atlassian's post-incident review process. Every case is slightly different, but ultimately PIRs enable our team to identify root cause, commit to a solution, and improve the resilience of our services on every iteration of the process.