GitHub Explains the Cause Behind the Past Week’s Outages
GitHub says recent service outages were caused by resource contention issues in their primary database cluster.
GitHub says these problems have caused four service outages since last week, on March 16th, March 17th, March 22nd, and March 23rd.
Today, GitHub explained that these outages were caused by “resource contention” issues with their primary MySQL cluster called ‘MySQL1.’
“The underlying theme of our issues over the past few weeks has been due to resource contention in our mysql1 cluster, which impacted the performance of a large number of our services and features during periods of peak load,” explains a GitHub post about the outages.
Resource contention is when multiple processes/requests compete for the same resources, whether that be memory, CPU, or disk utilization, or even access to a database table.
When there are not enough resources available, a database cannot finish queries as quickly, which leads to tables being locked and database connections piling up while they wait for queries to complete.
As requests pile up, the server ultimately reaches the maximum number of connections it is configured to handle and simply rejects all further requests until there is room for more.
This causes any services that require access to the database to fail.
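The pile-up described above can be illustrated with a toy simulation. This is only a minimal sketch, not a model of GitHub's infrastructure: the class ConnectionLimiter, the function simulate, and all of the numbers are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConnectionLimiter:
    """Toy model of a server that accepts at most max_connections at once."""
    max_connections: int
    active: int = 0
    rejected: int = 0

    def try_connect(self) -> bool:
        """Accept the connection if a slot is free, otherwise reject it."""
        if self.active >= self.max_connections:
            self.rejected += 1
            return False
        self.active += 1
        return True

    def finish_query(self) -> None:
        """Release a slot when a query completes."""
        if self.active > 0:
            self.active -= 1


def simulate(arrivals_per_tick: int, completions_per_tick: int,
             max_connections: int, ticks: int) -> ConnectionLimiter:
    """Each tick, some running queries finish and then new requests arrive."""
    server = ConnectionLimiter(max_connections)
    for _ in range(ticks):
        for _ in range(completions_per_tick):
            server.finish_query()
        for _ in range(arrivals_per_tick):
            server.try_connect()
    return server


if __name__ == "__main__":
    # Queries finish as fast as they arrive: no connection is ever rejected.
    healthy = simulate(arrivals_per_tick=5, completions_per_tick=5,
                       max_connections=10, ticks=20)
    # Slow queries (fewer completions per tick): the limit is hit and
    # new requests start being rejected.
    contended = simulate(arrivals_per_tick=5, completions_per_tick=2,
                         max_connections=10, ticks=20)
    print("healthy rejected:", healthy.rejected)
    print("contended rejected:", contended.rejected)
```

When queries complete more slowly than new requests arrive, the number of active connections climbs to the configured limit and stays there, and every request beyond that limit is turned away.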
Four service outages since March 16th
GitHub says that these problems caused four service outages: on March 16th, March 17th, March 22nd, and March 23rd.
On March 16th, GitHub saw increased load during peak hours combined with poorly written queries, which caused the database to reach its maximum number of connections and all write operations to fail.
“All write operations were unable to function during this outage, including git operations, webhooks, pull requests, API requests, issues, GitHub Packages, GitHub Codespaces, GitHub Actions, and GitHub Pages services,” explains the GitHub blog post.
The next outage occurred on March 17th, when they saw similar issues before they could resolve the poor query performance. After failing over to other servers, they once again quickly reached their maximum connections.
On March 22nd, GitHub says that although they had applied mitigations to improve query performance, they were still analyzing the queries and had enabled memory profiling on their database proxy. This once again led to the maximum number of connections being reached.
Finally, yesterday, on March 23rd, they once again saw increased load causing client connections to fail. To resolve these issues, GitHub decided to throttle webhook traffic to reduce the load on their servers.
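GitHub's post, as summarized here, does not say how the webhook throttling was implemented. As an illustration of the general idea only, a token bucket is one common way to cap the rate of a traffic source. The class TokenBucket and its parameters below are hypothetical and are not taken from GitHub's systems.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` events, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one event may proceed now, False if it is throttled."""
        now = self.clock()
        # Refill tokens for the time elapsed, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2, capacity=2)
    for i in range(5):
        status = "delivered" if bucket.allow() else "throttled"
        print(f"webhook {i}: {status}")
```

Events that are throttled would typically be queued or retried later rather than dropped, so that the database sees a steadier load.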
To prevent these types of outages in the future, GitHub states that they are auditing their systems during peak hours and will create performance fixes based on the results.
They are also rerouting traffic to other databases to reduce load, and adding infrastructure and sharding to improve performance.
Database sharding is when large database tables are split into multiple tables that can then be stored across multiple servers. Sharding a large, heavily used database into multiple smaller databases on different servers can increase performance and prevent intensive queries from locking the table.
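To make the idea concrete, the sketch below routes each record to one of several servers by hashing its key. It is a minimal illustration under stated assumptions, not GitHub's sharding scheme: the host names in SHARD_HOSTS and the functions shard_for and host_for are hypothetical.

```python
import hashlib

# Hypothetical shard servers, used only for illustration.
SHARD_HOSTS = [
    "db-shard-0.example.internal",
    "db-shard-1.example.internal",
    "db-shard-2.example.internal",
    "db-shard-3.example.internal",
]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def host_for(key: str, hosts=SHARD_HOSTS) -> str:
    """Return the server that stores the rows for this key."""
    return hosts[shard_for(key, len(hosts))]


if __name__ == "__main__":
    for repo in ["octocat/hello-world", "example/project", "another/repo"]:
        print(repo, "->", host_for(repo))
```

Because the same key always hashes to the same shard, a query for one record touches only one server. A simple modulo scheme like this one does require moving most of the data if the number of shards changes, which is one reason production systems often use more elaborate mappings.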
GitHub states that they will share more detailed information in their next Availability Report about what they are doing to prevent these types of outages in the future.