On Thursday, February 10th, and Friday, February 11th, we experienced two service outages affecting all Rentman users. These incidents caused Rentman’s service to be offline for a total of 45 minutes.
We recognize the severity of these outages, and we apologize to all of our customers for allowing them to occur. As of this writing, the root causes of the outages are fully understood and we have taken several measures to prevent any recurrence. Additionally, our engineering team has been implementing extra detection and prevention measures that strengthen our existing safeguards.
In this report, we want to give you an update about the root cause behind the recent incidents and about the improvements we have made to prevent future occurrences. In addition, we explain the extra measures we have implemented to further contain the impact of software, performance, or infrastructure failures.
To minimize the repercussions of any server errors, Rentman’s clients are divided across multiple servers. In case of a server disruption, only the cluster of clients on the affected server is impacted.
The outages on February 10th and 11th were the result of a disruption in our administration server. This server is shared by all accounts and is used only for our internal management: it stores active licenses, translations, and account information. All other client data is stored on separate servers to spread the load and the risk. After you log in to your account, the administration server is queried only once an hour to check whether your license has changed, which minimizes the risk of disruptions.
During the recent outages, the central database stopped responding. A normally harmless bug caused a loop that queried this database repeatedly, putting it under very high load. Because this server provides information for all accounts, the complete infrastructure went offline. After locating the cause, we deployed a fix for the responsible bug.
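As an illustration only, and not Rentman’s actual code, the failure mode described above (a loop that keeps querying an unresponsive database) is commonly contained with a circuit breaker. The minimal Python sketch below assumes a hypothetical `CircuitBreaker` class with made-up thresholds; after a few consecutive failures it rejects further calls for a cool-down period instead of adding more load.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected to protect the backend."""


class CircuitBreaker:
    """Hypothetical helper: stops calling a failing backend for a while."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock                # injectable clock, useful for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the backend.
                raise CircuitOpenError("backend calls are paused")
            # Cool-down elapsed: allow one trial call; a single failure
            # will open the circuit again.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

With a guard like this, a buggy retry loop fails fast once the limit is reached, so the database gets room to recover instead of receiving ever more queries.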
To prevent a similar outage in the future, we eliminated the central database as a single point of failure for most calls. This means that, for those calls, customers will not be affected if the server malfunctions.
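One common way to remove a shared server as a single point of failure is to cache its answers and keep serving the last known value while it is unreachable. The Python sketch below illustrates that idea under stated assumptions; it is not Rentman’s implementation. `LicenseCache`, `fetch`, and the error handling are hypothetical, and the one-hour lifetime simply mirrors the hourly license check described earlier.

```python
import time


class LicenseCache:
    """Hypothetical cache that serves stale data if the backend is down."""

    def __init__(self, fetch, ttl=3600.0, clock=time.monotonic):
        self.fetch = fetch    # callable: account_id -> license (assumed)
        self.ttl = ttl        # seconds before a refresh is attempted
        self.clock = clock    # injectable clock, useful for testing
        self._entries = {}    # account_id -> (license, fetched_at)

    def get(self, account_id):
        entry = self._entries.get(account_id)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # fresh enough: no backend call at all
        try:
            value = self.fetch(account_id)
        except Exception:
            if entry is not None:
                # Backend unavailable: fall back to the last known value.
                return entry[0]
            raise  # nothing cached for this account, so the error surfaces
        self._entries[account_id] = (value, now)
        return value
```

The trade-off of this design is that a license change may take longer to become visible during a backend disruption, in exchange for accounts that are already known continuing to work.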
With both the incidents and the immediate risk over, the engineering team focused on extra steps for prevention and mitigation. We learned a number of lessons from these events, mainly the need for additional safeguards to reduce the risk of server disruptions caused by errors in the system or infrastructure.
We have added monitoring tools that help us understand the root causes of any disruption more quickly and in more detail. We now monitor directly for a decrease in capacity or redundancy, even when the system is still functioning properly. Additionally, we increased the number of characteristics that are stored for each request. This gives our team a more detailed log of events and makes it easier to identify the root cause of service disruptions. With these additions, we are able to deploy fixes more rapidly and to anticipate potential outages before they happen.
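To make the idea of monitoring redundancy while the system still works more concrete, here is a small Python sketch. It is illustrative only: the function name, the notion of counting healthy servers, and the thresholds are assumptions, not a description of Rentman’s monitoring.

```python
def redundancy_status(healthy, expected, minimum=1):
    """Classify capacity from server counts (hypothetical thresholds).

    healthy:  number of servers currently passing health checks
    expected: number of servers that should normally be available
    minimum:  fewest servers needed to keep the service running
    """
    if healthy < minimum:
        return "down"      # the service itself is affected
    if healthy < expected:
        return "degraded"  # still serving, but redundancy is reduced
    return "ok"
```

For example, `redundancy_status(2, 3)` reports a degraded state even though requests are still being served, which is the kind of early signal that allows a response before users notice anything.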
To improve the stability of our platform, we have taken several measures that reduce the chance of system errors impacting the performance of our service. Here are the most important ones:
Since the implementation of these new preventive measures, we haven’t witnessed any disruption to our service.
At Rentman, we take all outages seriously, but we are particularly concerned with outages that affect multiple zones simultaneously. We want you to understand why these incidents happened, what we have done about them, and what steps we are still taking. By being transparent about the incidents and our measures, we hope to demonstrate our ongoing commitment to offering you a reliable Rentman platform.
We thank you for your patience and continued support of Rentman. We are working hard to make sure we live up to the trust you have placed in us. If you have any questions or concerns, please don’t hesitate to reach out to us at support@rentman.io.