Incident Report for Rentman

Postmortem
Rentman incident report and added stability measures

On Thursday, February 10th and Friday, February 11th, we experienced two service outages affecting all Rentman users. These incidents took Rentman’s service offline for a total of 45 minutes.

We recognize the severity of these outages, and we apologize to all of our customers for allowing them to occur. As of this writing, the root causes of both outages are fully understood, and we have taken several measures to prevent any recurrence. Additionally, our engineering team has been implementing extra detection and prevention measures that add further defense on top of our existing safeguards.

In this report, we give you an update on the root cause of the recent incidents and on the improvements we have made to prevent future occurrences. We also explain the extra measures we have implemented to further contain the impact of software, performance, or infrastructure failures.

Incident detection and preventive measures

To minimize the repercussions of any server error, Rentman’s clients are divided across multiple servers. In case of a server disruption, only the cluster of clients on the affected server is impacted.

The outages of February 10th and 11th were both the result of a disruption in our administration server. This server is shared by all accounts and is used only for our internal management: it stores active licenses, translations, and account information. All other client data is stored on separate servers to spread the load and the risk across multiple machines. After you log in to your account, the administration server is only queried once an hour to check whether your license has changed, minimizing the risk of disruption.

During the recent outages, this central database stopped responding. A normally harmless bug caused a loop that queried the database at a very high rate. Because this server provides information for all accounts, the overload took the complete infrastructure offline. After locating the cause, we deployed a fix for the responsible bug.

To prevent a similar outage from occurring in the future, we have eliminated the central database as a single point of failure for most calls. This means that customers will no longer be affected if this server malfunctions; a sketch of the approach follows below.
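As an illustration, here is a minimal sketch of how such a dependency can be removed by serving a cached license when the administration server is unreachable. All names, the helper fetch_license_from_admin_db, and the one-hour TTL are our own illustrative assumptions, not Rentman’s actual implementation.

```python
import time

LICENSE_TTL = 3600  # seconds; matches the hourly license check described above

# Hypothetical in-process cache: account id -> (license data, fetch timestamp)
_license_cache: dict[str, tuple[dict, float]] = {}

def fetch_license_from_admin_db(account_id: str) -> dict:
    """Placeholder for the real query against the administration server."""
    raise ConnectionError("administration server unreachable")

def get_license(account_id: str) -> dict:
    """Return the account license, preferring the local cache.

    The administration database is only consulted when the cached entry
    is older than LICENSE_TTL. If that database is unreachable, the
    stale cached copy is served instead, so a disruption of the
    administration server no longer takes accounts offline.
    """
    entry = _license_cache.get(account_id)
    now = time.time()
    if entry is not None and now - entry[1] < LICENSE_TTL:
        return entry[0]
    try:
        license_data = fetch_license_from_admin_db(account_id)
    except ConnectionError:
        if entry is not None:
            return entry[0]  # fall back to the stale copy rather than failing
        raise
    _license_cache[account_id] = (license_data, now)
    return license_data
```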

With both incidents resolved and the immediate risk over, the engineering team focused on extra steps for prevention and mitigation. We learned a number of lessons from these events, chief among them the need for additional safeguards to reduce the risk of server disruptions caused by errors in the system or infrastructure.

More direct and detailed monitoring

We have added monitoring tools that help us understand the root cause of any disruption more quickly and in more detail. We now monitor directly for a decrease in capacity or redundancy, even while the system is still functioning properly. Additionally, we increased the number of characteristics that are stored for each request. This gives our team a more detailed log of events and makes it easier to identify the root cause of a service disruption. With these additions, we can deploy fixes more rapidly and anticipate potential outages before they happen. A sketch of this kind of per-request logging follows below.
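To make the idea concrete, below is a minimal sketch of per-request logging, assuming a Python backend. The field names, the JSON log format, and the record_request helper are our own illustration, not Rentman’s actual tooling.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("request_audit")

@contextmanager
def record_request(endpoint: str, account_id: str):
    """Collect detailed characteristics of one request and log them as JSON.

    The handler can enrich the context while it runs (query counts,
    cache hits, ...), so the log contains enough detail afterwards to
    reconstruct the root cause of a disruption.
    """
    start = time.monotonic()
    context = {"endpoint": endpoint, "account": account_id, "queries": 0}
    try:
        yield context
        context["status"] = "ok"
    except Exception as exc:
        context["status"] = "error"
        context["error"] = type(exc).__name__
        raise
    finally:
        context["duration_ms"] = round((time.monotonic() - start) * 1000, 1)
        logger.info(json.dumps(context))

# Example: wrap a request handler and count its database queries.
logging.basicConfig(level=logging.INFO)
with record_request("/api/projects", "acct-42") as ctx:
    ctx["queries"] += 1  # one query executed while serving this request
```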

Additional preventions

To improve the stability of our platform, we have taken several measures that reduce the chance of system errors impacting the performance of our service. The most important ones are:

  • We use caching to prevent excessive database queries. In some regions, the cache instances were not large enough to handle rare occasions of high load, which impacted the performance of our databases. By increasing the cache size, we can now handle more accounts and reduce the load on our servers.
  • To reduce the load on available database resources, we reduced the maximum time a single database request is allowed to take. In the new setup, very lengthy requests are halted automatically, making resources available for future requests.
  • With the move from HTTP/1 to HTTP/2, the previous restriction of a maximum of 5 concurrent requests was lifted. When a large number of documents are created within Rentman, the resulting flood of simultaneous requests could overload a database and degrade server performance. To fix this, we added a batching system that restricts how many documents are generated concurrently (see the first sketch after this list). This removes the risk of overloading our databases and of such a spike leading to a service outage. In addition, we set a maximum on the allowed table size in MySQL; this caps the amount of memory a single request can consume and prevents MySQL from running out of memory.
  • To increase the maximum load capacity of our servers, we added two new database servers. This not only gives us more buffer to handle peak requests, but also reduces the number of accounts that would be affected in case of degraded server performance.
  • We fixed a previously undetected error in one of our scripts. This script is designed to loop over our database servers during high load or an outage and cancel requests that take too much time. On some occasions, it failed to run because the database server it had to retrieve the server list from was itself affected by the disruption. Additionally, we created a dedicated database user that has priority over all others. This allows us to still connect to an overloaded database server and cancel requests to make the database available again (see the second sketch after this list).
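The batching mentioned above can be as simple as a semaphore around the generation jobs. The sketch below assumes an asyncio-based worker; the limit of five and the render_document function are illustrative, not Rentman’s actual code.

```python
import asyncio

MAX_CONCURRENT_DOCUMENTS = 5  # illustrative limit, not Rentman's actual value

async def render_document(doc: str) -> None:
    """Placeholder for the real document generation work."""
    await asyncio.sleep(0.1)

async def generate_documents(documents: list[str]) -> None:
    """Generate documents in bounded batches.

    The semaphore caps how many generation jobs can hit the database at
    the same time, so a bulk export can no longer flood the server with
    an unbounded number of concurrent requests.
    """
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_DOCUMENTS)

    async def generate_one(doc: str) -> None:
        async with semaphore:
            await render_document(doc)

    await asyncio.gather(*(generate_one(d) for d in documents))

# Example: 200 documents are generated, but never more than 5 at once.
asyncio.run(generate_documents([f"quote-{i}.pdf" for i in range(200)]))
```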
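For the priority database user, a small utility along the following lines would do the job. We use PyMySQL here purely for illustration; the account name, threshold, and credentials are hypothetical. Note that MySQL itself reserves one extra connection slot for accounts with the CONNECTION_ADMIN (formerly SUPER) privilege, which is what makes connecting to a saturated server possible.

```python
import pymysql

MAX_QUERY_SECONDS = 30  # hypothetical threshold for "takes too much time"

def cancel_slow_queries(host: str, password: str) -> None:
    """Connect with the reserved account and cancel long-running queries.

    Because the account holds the CONNECTION_ADMIN privilege, MySQL
    keeps one connection slot available for it even when the server is
    saturated with ordinary traffic.
    """
    conn = pymysql.connect(host=host, user="priority_ops",  # hypothetical account
                           password=password, database="information_schema")
    try:
        with conn.cursor() as cur:
            # List queries that have been running longer than the threshold.
            cur.execute(
                "SELECT id FROM processlist WHERE command = 'Query' AND time > %s",
                (MAX_QUERY_SECONDS,),
            )
            for (query_id,) in cur.fetchall():
                cur.execute("KILL QUERY %d" % int(query_id))
    finally:
        conn.close()
```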

Since the implementation of these new preventive measures, we haven’t witnessed any disruption to our service. 

At Rentman, we take all outages seriously, but we are particularly concerned with outages that affect multiple zones simultaneously. We want you to understand why these incidents happened, what we have done about them, and what steps we are still taking. By being transparent about the incidents and our measures, we hope to demonstrate our ongoing commitment to offering you a reliable Rentman platform.

We thank you for your patience and continued support of Rentman; we are working hard to make sure we live up to the trust you have placed in us. If you have any questions or concerns, please don’t hesitate to reach out to us at support@rentman.io.

Posted Mar 07, 2022 - 14:07 CET

Resolved
This incident has been resolved.
Posted Feb 11, 2022 - 15:34 CET
Monitoring
A fix has been implemented and we are monitoring the situation.
Posted Feb 11, 2022 - 15:24 CET
Update
We are continuing to investigate this issue.
Posted Feb 11, 2022 - 15:04 CET
Investigating
Rentman is experiencing an issue at the moment. We are currently investigating it and will provide further updates as they are available.
Posted Feb 11, 2022 - 15:00 CET
This incident affected: Rentman Application.