On Monday the 17th of February, the Rentman software was unavailable for two periods of 30 and 33 minutes respectively, with degraded performance for large parts of the day.
We realise that an outage of our services has a severe impact on our users' ability to perform their work. That’s why we consider availability to be one of the core parts of the service we offer. Any amount of downtime is unacceptable to us and we are sorry for the inconvenience it has caused. We have learned from this incident and are putting new measures in place to prevent problems like this in the future.
On the morning of Monday the 17th of February, we witnessed a sudden increase in memory allocation on our primary database server. This led to degraded performance and eventually to the server becoming unresponsive at 11:26 (UTC+1).
We have failover servers in place to which we can migrate if one of our servers suffers an outage. At 11:27 our engineers responded to this outage by starting the migration to the failover server. At 11:56 this failover was completed and operations resumed on the new server. A similar incident then occurred on the failover server, causing an exceptionally high load and resulting in the same sequence of an unresponsive server and a new failover process at 13:49.
To reduce the load on our servers, we temporarily stopped all non-vital services at 15:25. This allowed us to retain sufficient memory for our core services to stay online, but it also meant that no background data was refreshed and outgoing emails were temporarily held back. At 19:00 we decided that it was safe to restart all services. It took until 00:50 for the queue of pending emails to be sent in full.
Our investigation showed that the increased load was caused by unexpected behaviour of the algorithm that calculates the availability of equipment. This behaviour only occurred in a specific situation. At 15:55 we isolated all accounts with a dataset that could lead to this situation. This ensured that if one of these accounts caused a crash, no other accounts would be affected. With this procedure we contained the issue so that 99% of the accounts could resume normal operations from 16:00 onwards.
On Tuesday the 18th of February, we released a hotfix that solved the root problem in the availability algorithm. We also decided to temporarily disable real-time availability calculations on projects with more than 2000 equipment lines. In the coming weeks, we will work on a solution that lets us enable this feature again.
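To illustrate the kind of safeguard described above, here is a minimal sketch of a size-based guard. Only the limit of 2000 equipment lines comes from this report. The `Project` class, its fields and the function name are hypothetical and do not reflect Rentman's actual code.

```python
from dataclasses import dataclass

# Limit taken from the report; all names below are hypothetical.
MAX_REALTIME_EQUIPMENT_LINES = 2000


@dataclass
class Project:
    name: str
    equipment_line_count: int


def realtime_availability_enabled(project: Project) -> bool:
    """Return True if real-time availability may be calculated for this project.

    Projects with more than MAX_REALTIME_EQUIPMENT_LINES equipment lines are
    excluded, so a single very large project cannot trigger the expensive
    calculation on every change.
    """
    return project.equipment_line_count <= MAX_REALTIME_EQUIPMENT_LINES
```

A project with exactly 2000 equipment lines would still be calculated in real time under this sketch, since the text speaks of "more than 2000".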
How we will prevent this from occurring again
The Rentman software runs on many servers, which are located in multiple datacenters. In these datacenters we have one or more failovers (backups) for every single instance.
For the most part, the infrastructure behind Rentman is designed so that stopping any single server will not affect the availability of our software. Unfortunately, this is not possible for database servers, because the data must be stored physically on a single server in order to have a single source of truth.
Historically, our server park has grown from a small number of very big database servers to multiple small database servers. With this configuration we have fewer accounts per server and better isolation of accounts in the event of an outage.
Scheduled maintenance for all accounts
To reduce the impact of any future outage, we have decided to prioritise scheduled maintenance. During this maintenance we will migrate accounts from shared servers to more (and newer) small instances. Having fewer accounts on the same server decreases the likelihood that a server is affected by erroneous behaviour coming from other accounts. If an outage does occur, only a small percentage of all accounts will be affected.
What this means for you
During the nights from Monday the 24th of February until Monday the 9th of March, we will migrate all older accounts to this new configuration. To impact your work as little as possible, we will schedule the migration of your account between 01:00 and 05:00 (in your timezone). The migration will happen automatically and can take anywhere between 3 and 10 minutes, during which your account will not be available.
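For readers who want to translate this window to another timezone, the following sketch converts the 01:00 to 05:00 local window into UTC. The function name and parameters are hypothetical, and the example date is simply one night inside the announced period. This illustrates the time arithmetic only and does not describe how the migrations are actually scheduled.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo
from typing import Tuple


def migration_window_utc(night: date, account_tz: tzinfo) -> Tuple[datetime, datetime]:
    """Return the 01:00-05:00 local maintenance window as UTC datetimes.

    `night` is the calendar date (in the account's timezone) on which the
    window falls; `account_tz` is the account's timezone.
    """
    start_local = datetime.combine(night, time(1, 0), tzinfo=account_tz)
    end_local = datetime.combine(night, time(5, 0), tzinfo=account_tz)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


if __name__ == "__main__":
    # Example: an account in a UTC+1 timezone, on an assumed date in the period.
    utc_plus_1 = timezone(timedelta(hours=1))
    start, end = migration_window_utc(date(2020, 2, 25), utc_plus_1)
    print(start.isoformat(), end.isoformat())
```

Any `tzinfo` object can be passed in, for example `zoneinfo.ZoneInfo("Europe/Amsterdam")` on Python 3.9 or later, which also accounts for daylight saving time.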
On the 3rd of March at 05:00 (UTC+1), we will upgrade the datastore of all shared services to hardware that's better equipped to prevent similar issues from causing an outage. During this migration, all Rentman accounts will be unavailable for approximately 5 to 20 minutes.