How to survive a system crash at your firm

As a startup CEO, you make so many decisions, and touch so many parts of the business, that mess-ups are almost always your fault. Stock image

Early last Saturday I was woken up by an urgent WhatsApp message from a staff member: a new user from one of our trial clients was not able to log in to the system and was blocked from doing some urgent work. In the events industry, deadlines are absolutely hard and cannot be pushed – events must take place at the advertised time and place. Our vision is to make life easier for event professionals, not harder by locking them out of our system.

The reality of starting an online business is that your website and service will inevitably crash. This happens to the best in the business. Amazon, Google, Apple, Facebook, and others have all had their moments. You can’t plan for all eventualities – an Amazon data centre was taken out by a lightning strike in 2012. No amount of disaster preparation is going to solve a problem like that.

However, you must plan to fail. You must figure out how you are going to deal with failure. I’ve put outage management systems in place before, but we’re only getting started with setting this up in voxgig. When it happens to you, there are three things you need to do, over the following hours, days, and weeks respectively.

Firstly, when an issue occurs, you must not lose your head. This is the single biggest mistake you can make as a leader. If you start running around shouting at your staff, you’ll make it much harder to resolve the problem. People make mistakes under pressure, and you’ll just be adding to it. Worse, you’ll make people hide things so that they can stay out of trouble, including things that could help resolve the situation more quickly.

The Canadian astronaut Chris Hadfield – famous for playing the guitar in space, and tweeting fantastic pictures of Ireland from the International Space Station – wrote a book after his adventures where he talks about crisis management. In space, a crisis means you’re dead in 60 seconds or less, so you need to have the best possible approach to fixing technical problems. Astronauts are not selected for their careless attitude to fear and risk. Instead, they are selected for their ability to work together in a team, put their egos aside, and “work the problem”.

Forget about all the axes you have to grind, and the “told you so’s”, and focus on fixing the problem in front of you. This has to be a team effort to get the best outcome. If you’ve seen the movie ‘Apollo 13’, think of the scene on the ground in Houston where they say: “you’ve got eight hours to make one of these (a carbon dioxide filter) out of this (a random selection of piping and supplies)”. The team on the ground had to figure out how to save the astronauts using only what was available on the spacecraft.

For a remote working team, this is where the use of online chat tools, such as WhatsApp and Slack, really kicks into gear. Not only does typing take the heat out of your problem-solving discussions, it also makes it much easier for new people to get up to speed on the problem quickly.

In our case, last Saturday, it quickly became clear that a ‘clever’ shortcut I had used on Friday had led to a series of unfortunate events that locked the user out. I called in one of our developers to help, and we ‘worked’ the problem in real time via Slack. It took about three hours, but we survived. Here’s another pro tip: as a startup CEO, you make so many decisions, and touch so many parts of the business, that mess-ups are almost always your fault – another reason not to shout at your people.

The second thing to do occurs in the days after the event, assuming it hasn’t killed your business (that does happen, unfortunately). You have to conduct a blame-free post-mortem. That means you write up an analysis of what happened, with a timeline and commentary. It’s essential that this is safe for everyone so that you can collect the maximum amount of data to understand why the issue occurred.

This is where the online chats come into their own – you have a recorded history of the entire event and how you tried to fix it, so the timeline is easy to write up, and easy to learn from.

What you’ll observe in these chats is a process of deduction: hypothesis, testing, and measurement. Unfortunately, this is usually occurring on your live system, so it’s a high-risk activity. That’s why people need to be left alone to focus. If you’re a non-technical founder, you have to resist the urge to keep asking how things are going. Instead, spend your time communicating with the client. If you are a technical founder, you’ll need to get someone on your team to stay in touch with the client as you resolve the issue.

In your analysis, you need to review this process of deduction: how easy was it to test your bug hypotheses? Could you find the internal technical information you needed quickly? How safe was it to make changes? Did people have the access they needed? Did people have the information they needed?

In our case, we learnt that we have an implicit and informal dependency on WhatsApp for clients to report problems. It actually works pretty well, but it’s accidental infrastructure, and we need to think about how to formalise our use of that application, and think about backups and alternatives.

It also became clear that our technical knowledge is too specialised. Each team member is very focused on their own part of the system, and does not have sufficient knowledge of the whole system to resolve an issue themselves. This is not surprising – it’s the most efficient way to build a system in a startup context, as you get maximum productivity. But as you move away from a Minimum Viable Product situation to one where you have live clients, the trade-offs become more expensive. Startups are full of these types of transitions. Just when you’ve got something working well, you need to change it again to handle the next phase of growth.

The last thing our post-mortem uncovered was that we do not have a process for handling failures in the system. Again, a common startup scenario. This is something we need to start building.

Finally, the third thing that you have to do is take action based on what you have learned. This is also where you have to make hard choices. Often, you know exactly what the problems are, but you don’t have enough resources to solve all of them. Being comfortable letting some fires burn is part of the startup mindset. The trick is to choose how you apportion risk – at this stage that is more art than science, and very context-dependent.

Here are the actions that we are going to take. We really need a status page. This is a separate website that shows our users what the current status of our system is. It can also be used for updates during a crisis. Google publishes one for its own services. Ours will be much simpler of course, but the idea is the same. One little tip: unlike Google, you’re not connected directly to the big internet backbones, so your domain name could also be compromised during an outage. It may be better to use a hosted status page service for full redundancy.
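To make the idea concrete, here is a minimal sketch in Python of what a very simple status page generator might look like. The component names, the three status labels, and the function names are all assumptions made up for illustration, not a description of what we run at voxgig.

```python
from html import escape

# Assumed status labels, ordered from best to worst.
SEVERITY = ["operational", "degraded", "outage"]


def overall_status(components):
    """Return the worst status among all components (hypothetical rule)."""
    if not components:
        return "operational"
    # Raises ValueError if a component uses a label not listed in SEVERITY.
    return max(components.values(), key=SEVERITY.index)


def render_status_page(components, updates):
    """Build a small static HTML page from component states and update notes."""
    rows = "\n".join(
        f"<li>{escape(name)}: {escape(state)}</li>"
        for name, state in sorted(components.items())
    )
    notes = "\n".join(f"<p>{escape(note)}</p>" for note in updates)
    return (
        "<html><body>\n"
        f"<h1>System status: {escape(overall_status(components))}</h1>\n"
        f"<ul>\n{rows}\n</ul>\n{notes}\n"
        "</body></html>"
    )


if __name__ == "__main__":
    # Hypothetical components and update text, for illustration only.
    page = render_status_page(
        {"website": "operational", "login": "outage"},
        ["We are investigating a login problem."],
    )
    print(page)
```

Because the output is a single static file, it could be published somewhere that does not share a domain or hosting with the main system, which is the point of the redundancy tip above.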

We should have a system checklist. This is a technical worksheet that establishes the technical health of the system and is the first thing to be done in an emergency. This is exactly the same idea as the checklists that airline pilots use before taking off. It ensures a basic level of safety. I’d highly recommend reading the book ‘The Checklist Manifesto’ by Atul Gawande for more on this topic. He was a surgeon who introduced checklists into the operating theatre to improve patient safety.
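A checklist like this can start life as a document and later be partly automated. The sketch below, in Python, shows one way that might look. The `Check` structure and the example checks are hypothetical placeholders; real checks would test things like the database, the login flow, and disk space.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str                  # human-readable description of the check
    run: Callable[[], bool]    # returns True when the check passes


def run_checklist(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check in order; a check that raises counts as a failure."""
    results = []
    for check in checks:
        try:
            ok = bool(check.run())
        except Exception:
            ok = False
        results.append((check.name, ok))
    return results


def format_report(results: List[Tuple[str, bool]]) -> str:
    """Produce one line per check, suitable for pasting into a chat channel."""
    return "\n".join(
        f"[{'OK' if ok else 'FAIL'}] {name}" for name, ok in results
    )


if __name__ == "__main__":
    # Placeholder checks for illustration only.
    checks = [
        Check("Website responds", lambda: True),
        Check("Test user can log in", lambda: False),
    ]
    print(format_report(run_checklist(checks)))
```

The checks always run in the same order and every result is reported, pass or fail, so nobody has to decide under pressure what to look at first.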

We should do better on our system monitoring. It was hard to understand exactly what some parts of the system were doing.

At the technical level, you use logs to record the activities of the live software. These are literally like the log book that a ship’s captain uses, but much, much more detailed. We don’t have the ability to fine-tune the level of detail. Normally you don’t want logs to be too verbose, as they take up too much space. In an emergency, however, you want all the fine details. We have no easy way to turn this on and off – that’s a definite technical action for the team.
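As an illustration of the kind of switch I mean, here is a minimal sketch using Python’s standard `logging` module. The logger name `app` and the function `set_emergency_logging` are assumptions for the example; our own system may end up doing this quite differently.

```python
import logging

# Hypothetical application logger; the name "app" is an assumption.
logger = logging.getLogger("app")


def set_emergency_logging(enabled: bool) -> None:
    """Switch between quiet day-to-day logging and full detail.

    Normal mode records warnings and errors only; emergency mode
    records everything down to DEBUG level.
    """
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)


if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")

    set_emergency_logging(False)
    logger.debug("login attempt details")   # suppressed in normal mode
    logger.warning("login failed")          # always recorded

    set_emergency_logging(True)
    logger.debug("login attempt details")   # now recorded
```

In a live system the switch would need to be reachable without a redeploy, for example through an admin command or a configuration setting, so that the extra detail can be turned on during an incident and off again afterwards.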

After an incident, you need to make sure to write up your post-mortem, and then keep it safe. There’ll be many more, and by adopting the attitude of an accident investigator, you’ll make your system more reliable over time. Complex systems are not stable by design – the ones that survive are the ones that are actively shepherded to stability.

Newsletter update: 3,313 subscribers, open rate 12.2pc.

Richard Rodger is the founder of voxgig. He is a former co-founder of Nearform, a technology consultancy firm based in Waterford.

Indo Business