Even when planned maintenance goes well, there are always things we can learn from it. And when it goes less well, as ours did last week, there's even more to learn. So either way, we think it's useful to review how things went and see what we can take from it for next time. It also gives us the chance to communicate to our users what happened, to reassure them that we care when things don't go smoothly, and to let them know that we will take action to prevent it happening again.
Some background: last Wednesday we had some planned maintenance. The main aim was to move our servers between racks at our data centre, but we also took the opportunity to perform some other small changes we had pending – redistributing some memory and CPUs between servers, performing some configuration changes, and so on. While we achieved almost all of our aims, the maintenance ended up taking three hours (~50%) longer than planned, and during the process there were several things we could have done better.
Starting with the latter, the main issue was a lack of communication with our users. We'd worked hard to give them plenty of notice before the event, but communication was lacking during the maintenance itself, especially once it became clear we were going to exceed our 4pm deadline. There's no real excuse for this. The wifi connection in the data centre was less than great, and the data halls themselves aren't comfortable places to work on a laptop (as you would expect – they're built for machines, not humans), but the main reason was simply that we hadn't allowed enough time for communication. Even with those overheads, it should have been one of our highest priorities.
So why did it take so much longer than planned? There's no single reason; rather, a lot of small issues added up. But there is a common theme: lack of testing. Almost all of the issues we identified could have been avoided if we had done more testing in advance. In addition, almost half of the identified issues are not specific to the maintenance, but are things that could happen under normal circumstances (e.g. we fully expect servers to require rebooting during normal operations). Here's an excerpt from our analysis to give you some examples:
We have logged all the issues in our issue tracker and we will be prioritising and working on them presently. We will also be taking a step back and reviewing the system as a whole to see where any gaps might be.
The takeaways
- Improve communication during maintenance periods, via:
  - Nominating a specific member of the team to handle communication.
  - Adding specific communication steps to the maintenance plan.
  - Setting periodic reminders.
- Talk through the maintenance plan in more detail as a team, so we can highlight any parts which need more investigation/planning to reduce the risk of something going wrong.
- Improve advance testing of the planned processes, where possible.
- Expand our general system testing to ensure we have confidence that we'll know how to recover gracefully from the failure of any given component.
We're pleased to report that everything has been running smoothly since the maintenance was completed. We're now looking forward to using the new rack space to expand our processing and storage capacity, allowing us to deliver the exciting new features and products we have planned for the next 12 months.