On 17 April 2019 and 29 May 2019, we experienced major service outages affecting all Rentman users. Both outages were caused by a bug on the side of our hosting provider, Amazon Web Services.
During the outages we were not able to migrate to our failover datacenter. On both occasions, the Rentman service was offline for approximately 1.5 hours.
To understand what happened, it's important to know a bit more about how our cloud infrastructure works:
Our servers are hosted in the data centers of Amazon Web Services (AWS). AWS has grown to be the worldwide leader in cloud computing. They operate 44 data centers around the world and currently handle a large part of global internet traffic. They not only provide us with the hardware; we also rely on their Relational Database Service (RDS), where specialized Amazon engineers operate and maintain our database servers. By outsourcing this part of our business, we guarantee that highly skilled, specialized engineers are standing by around the clock to assist our own engineers, and we can operate a database on a scale that would otherwise be nearly impossible, because we profit from development shared with tens of thousands of other companies.
Amazon RDS provides us with high availability and failover support for database instances using Multi-AZ deployments. This means that we keep active database servers (replicas) in sync in multiple data centers. In the event of a planned or unplanned outage of our database instance, Amazon RDS automatically switches to a standby replica in another datacenter. Failover times are typically between 60 and 120 seconds. In both incidents, however, we were not able to restore our service within this timeframe, and it took us longer to migrate to our standby instances.
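During a Multi-AZ failover, RDS repoints the database's DNS endpoint to the standby replica, so a client mostly just needs to reconnect and retry. As a minimal sketch (not our actual application code), a reconnect helper with exponential backoff might look like this; the `connect` callable is a stand-in for whatever database driver call an application uses:

```python
import time

def connect_with_retry(connect, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Try to (re)connect to a database endpoint, backing off between attempts.

    `connect` is any zero-argument callable that returns a connection or
    raises ConnectionError. During an RDS Multi-AZ failover the endpoint's
    DNS record is repointed to the standby, so retrying is usually enough.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... keeps total retry time
            # within the 60-120 second window RDS aims for.
            sleep(base_delay * (2 ** attempt))
```

In practice `connect` could wrap a real driver call against the RDS endpoint hostname; both the endpoint and driver choice are application-specific and not shown here.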
Root cause analysis: what went wrong?
Problem 1: The database stopped working
We suspect that the database servers stopped working because of an internal bug in MySQL 8.0.13 or the RDS service.
Problem 2: Migrating to our failover instances took too long
As soon as our monitoring showed that our primary database server had stopped working, RDS started the failover procedure. Because the database stopped in an error state, RDS triggered a procedure that cross-checks all tables to ensure that no data was corrupted by the abrupt stop. In this process the server compares the data of over 750,000 tables between multiple storage locations to verify that nothing was lost and to recover any missing data from the backup locations. Because of the large number of tables, this procedure took far longer than the 60-120 seconds RDS aims for.
We regularly test the RDS recovery procedures by shutting down a database server in a separate, isolated test environment. But because we shut down this server in an orderly manner, there was no chance of data corruption, and the data recovery process never started in our test scenarios. This caused us to overlook the impact of the data recovery process.
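A more realistic drill forces a failover instead of a clean shutdown: the RDS API's `reboot_db_instance` call accepts `ForceFailover=True`, which fails over to the standby rather than stopping the database gracefully. The sketch below is illustrative, not our actual tooling; the client object is expected to behave like a boto3 RDS client, and the instance identifier is a placeholder:

```python
def run_failover_drill(rds_client, instance_id, wait=True):
    """Force a Multi-AZ failover instead of a clean shutdown.

    A clean stop lets the database flush everything to disk, so crash
    recovery never runs; ForceFailover=True simulates an unplanned
    failure more closely. `rds_client` should behave like a boto3 RDS
    client, e.g. boto3.client("rds"); `instance_id` is a placeholder
    for the test instance's identifier.
    """
    resp = rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=True,  # fail over to the standby replica in another AZ
    )
    if wait:
        # Block until the instance reports "available" again.
        rds_client.get_waiter("db_instance_available").wait(
            DBInstanceIdentifier=instance_id
        )
    return resp
```

Passing the client in as a parameter keeps the drill easy to rehearse against a stub before pointing it at a real (isolated) instance.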
Steps we are taking to prevent this from happening in the future
1. Work closely with an internal team at Amazon to identify what caused the database to stop
We’re working with an internal team at Amazon to identify and resolve the specific bug that caused the database to stop functioning. They have deployed a dedicated team on the issue and are working on it with high priority. This is a long process, since the existing logs offer few clues about the probable cause.
2. Improve recovery procedure
We have a third database instance on standby in case RDS can’t perform a Multi-AZ failover. Previously, manual action was required to scale this instance up to the more than 300 GB of RAM needed to run our production database. We now permanently keep this instance at a size that can handle production traffic, so we can perform a quick manual failover if the RDS Multi-AZ failover fails.
3. Migrate database data to smaller servers with a different storage method
We have prioritized the development of a system that enables us to move Rentman services to smaller, isolated database servers that use a different storage method. Currently, each table is stored in its own file, and every file has to be compared one by one during a recovery process. The new storage engine can combine multiple tables into a single file, thereby reducing the number of operations needed after a recovery.
When we spread our data over more independent instances, there is also less data to compare if one of these instances has to perform a data recovery check in the future.
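As a rough back-of-envelope illustration (a toy model, not a measurement of our actual recovery times), the number of files a single recovering instance must check shrinks along both of these axes; all figures below are illustrative assumptions:

```python
def recovery_workload(total_tables, instances=1, tables_per_file=1):
    """Files a single recovering instance must check (toy model).

    Assumes tables are spread evenly over `instances` and that each
    storage file holds `tables_per_file` tables. With file-per-table
    storage (tables_per_file=1) on one instance, 750,000 tables means
    750,000 files to compare.
    """
    per_instance = -(-total_tables // instances)      # ceiling division
    return -(-per_instance // tables_per_file)        # ceiling division
```

For example, `recovery_workload(750_000)` gives 750,000 files today, while `recovery_workload(750_000, instances=10, tables_per_file=100)` gives 750, which is why combining both changes shortens the recovery check so dramatically.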
However, switching storage engines has a huge impact on performance and on the underlying backup and recovery services, and therefore needs to be carefully tested and gradually rolled out. We hope to launch these improvements to our infrastructure halfway through the summer. As the migration will require a small maintenance window to transfer the data, we will communicate when your data will be migrated to these new servers.
We couldn't be more sorry about this incident and the impact that these outages had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.
We thank you for your patience and continued support of Rentman. We are working hard to make sure we live up to the trust you have placed in us, and we will continue to pursue our goal of providing you with an excellent rental management experience, as we know the tools we provide are critical to your productivity.
If you have any questions or concerns that were not addressed in this postmortem, please reach out to us via firstname.lastname@example.org and we'll do our best to answer them.