Earlier today, at 2am San Francisco time, Bitbucket returned 500 error pages for about three hours to users attempting to access the user newsfeed and repository overview pages. The outage was caused by a kernel panic on our Redis server, which backs the pages that display recent events related to a user. We are very sorry for the inconvenience this outage has caused.
After we rebooted the Redis server, we found that the index Redis uses to serve the newsfeed content was corrupt, which caused certain pages on Bitbucket to fail. For users accessing pages deeper in the site, such as pull requests, commit views, wikis and issues, the site continued to work as expected. During this time, Git and Mercurial access continued to work over both HTTP and SSH. After identifying the cause of the problem, we turned off the newsfeed for all of Bitbucket, bringing an end to the 500 errors.
With the newsfeed temporarily disabled, we began investigating the corruption problem and discovered a forum post with instructions and a repair tool to fix the corrupted index. We then used the instructions to repair the index and restore full service to Bitbucket.
During this outage we have identified areas for improvement and are implementing changes to the way we manage the operations of Bitbucket:
- Improve our escalation procedures so that the response times are faster during non-office hours
- Update the Bitbucket codebase so that the dashboard and repository overview pages no longer fail when Redis becomes unavailable
- Increase the number of tests that status.bitbucket.org performs and that trigger our automatic phone alert system
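
To illustrate the second item, here is a minimal sketch of how a page could degrade gracefully when the newsfeed backend is down. It is not the actual Bitbucket code: `load_newsfeed`, `FeedUnavailable`, and the `fetch` callable are hypothetical names made up for this example, and a real client such as redis-py would raise its own exception types (for example `redis.exceptions.RedisError`) that would need to be caught instead.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("newsfeed")


class FeedUnavailable(Exception):
    """Hypothetical error raised when the feed backend is down or its data is unreadable."""


def load_newsfeed(fetch: Callable[[str], List[str]], username: str) -> dict:
    """Build the newsfeed part of a page context without ever raising on backend failure.

    `fetch` stands in for whatever function reads recent events from Redis.
    """
    try:
        events = fetch(username)
    except (FeedUnavailable, ConnectionError, TimeoutError) as exc:
        # Log the failure and fall back to an empty feed so the rest of the
        # page still renders instead of returning a 500.
        logger.warning("newsfeed unavailable for %s: %s", username, exc)
        return {"events": [], "feed_available": False}
    return {"events": events, "feed_available": True}


def broken_fetch(username: str) -> List[str]:
    # Simulates the failure mode described above.
    raise FeedUnavailable("index is corrupt")


def working_fetch(username: str) -> List[str]:
    return [f"{username} pushed to a repository"]


degraded = load_newsfeed(broken_fetch, "alice")
healthy = load_newsfeed(working_fetch, "alice")
```

The page template can then check `feed_available` and show a short "newsfeed is temporarily unavailable" notice in place of the events, which is effectively what turning the newsfeed off by hand achieved during the outage.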