We've all seen the heroic spaceship pilot sequence in one form or another. The pilot is flying his damaged craft as it plummets towards an unavoidable and unyielding mass. The ship is either on fire or breaking apart (probably both), alarms are blinking and wailing for attention, and the shaky cam is turned up to eleven. But, despite what Hollywood would have us think, this is not the 23rd century and Scotty cannot beam me up.
Our spacecraft are simultaneously far more delicate and more ham-fisted than anything out of Star Trek. The Enterprise was powered by dilithium crystals (a.k.a. rocks), whereas the Saturn V carried 16 lb of oxygen, hydrogen, and kerosene (a.k.a. scary flammable shit) for every 1 lb of "stuff". It is because of things like this that real space exploration is not at all like the movies.
There are a few lessons here that we, as developers, can learn from how modern space agencies really deal with alarms during a flight:
- Alarms are never ignored.
  - cost of false negative
  - annoying alarms
- Responding to an alarm follows a procedure.
  - acknowledgement by on-site responder only
  - collaboration protocol
  - escalation protocol
Alarms are never ignored during flight because the cost of a false negative can be huge; they are never dismissed as noise (more on that later). To make sure that our selfish human interests align with addressing the alarm, it is designed to be annoying.
Cost of False Negative
Alarms on spacecraft are placed on any system whose function is deemed critical to the mission. We should do the same in our software.
Ask yourself: what is the "mission" of your system? Which components of that system could cause the mission to fail if they malfunctioned? If there is a component on that list that would not raise an alarm when it began to fail, then you have some work to do.
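One way to make that audit concrete is to keep a small inventory of components and compare it against the alarms you actually have. The sketch below is only an illustration: the component names, the `critical` flag, and the `unmonitored_critical` helper are all made up for this example and do not come from any particular monitoring tool.

```python
def unmonitored_critical(components, alarms):
    """Return the mission-critical components that have no alarm, sorted by name.

    components: dict mapping component name -> True if mission-critical (hypothetical inventory)
    alarms: set of component names that currently have an alarm configured
    """
    return sorted(
        name
        for name, critical in components.items()
        if critical and name not in alarms
    )


# Hypothetical inventory for an online shop whose "mission" is taking orders.
components = {
    "checkout-api": True,
    "payment-db": True,
    "job-queue": True,
    "marketing-banner": False,  # nice to have, not mission-critical
}
alarms = {"checkout-api", "payment-db"}

print(unmonitored_critical(components, alarms))  # → ['job-queue']
```

Anything this prints is the "work to do" list: a critical component that could fail silently.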
There are plenty of open source or free (as in beer) tools to help you monitor your system.
Annoying Alarms
At the end of the day, the people responding to these alarms are just that: people. They are susceptible to drowsiness, distraction, hunger, and all the other weaknesses of this mortal coil.
NASA does not make alarms that only alert a single time or use a polite chime, and you shouldn't either. If you deem something important enough to trigger an alarm, it should annoy you until you acknowledge it.
Using a tool like PagerDuty for this is a natural supplement to whatever monitoring solution(s) you employ.
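If you were to hand-roll the "annoy until acknowledged" behaviour, a minimal sketch might look like the following. It is not how PagerDuty or any other product works internally; the `Alarm` class, the `nag_until_acknowledged` function, and its parameters are hypothetical names chosen for this example. The `notify` and `sleep` callables are passed in so the loop can be exercised without sending real pages or actually waiting.

```python
import time


class Alarm:
    """A hypothetical alarm that stays active until someone acknowledges it."""

    def __init__(self, name):
        self.name = name
        self.acknowledged = False

    def acknowledge(self):
        self.acknowledged = True


def nag_until_acknowledged(alarm, notify, interval_s=60.0, sleep=time.sleep,
                           max_attempts=None):
    """Call notify(alarm) repeatedly until the alarm is acknowledged.

    Returns the number of notifications sent. max_attempts is an optional
    safety valve; leave it as None to keep nagging indefinitely.
    """
    attempts = 0
    while not alarm.acknowledged:
        if max_attempts is not None and attempts >= max_attempts:
            break
        notify(alarm)
        attempts += 1
        sleep(interval_s)  # wait before checking for an acknowledgement again
    return attempts
```

In real use, `notify` would page, text, or call someone, and `acknowledge()` would be triggered by the responder. The point is only that a single notification is never treated as "done".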
Follows a Procedure
On-Site Responder Only
An alarm on a space flight is only acknowledged by the on-site responder. Mission Control might receive an alarm via telemetry and begin responding before an astronaut acknowledges the alarm, but they do not silence the alarms because they are not the on-site responder.
"On-site responder" takes on a funny meaning when dealing with software systems, but we can think of it as the person taking responsibility for managing that alarm. If you have flaky Wi-Fi because you're on a plane, or have three minutes of battery life left, don't acknowledge the alarm!
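The rule that others may watch an alarm but only the responder may acknowledge it can be sketched in a few lines. Everything here (the `Incident` class, `NotResponderError`, and the people's names) is hypothetical and only illustrates the idea; real paging tools model this differently.

```python
class NotResponderError(Exception):
    """Raised when someone other than the assigned responder tries to acknowledge."""


class Incident:
    def __init__(self, name, responder):
        self.name = name
        self.responder = responder   # the "on-site" person who owns the alarm
        self.observers = []          # people following along, like Mission Control
        self.acknowledged_by = None

    def observe(self, person):
        """Anyone may follow the incident and start investigating."""
        self.observers.append(person)

    def acknowledge(self, person):
        """Only the assigned responder may silence the alarm."""
        if person != self.responder:
            raise NotResponderError(
                f"{person} is not the responder for {self.name}"
            )
        self.acknowledged_by = person


incident = Incident("payment-db latency", responder="alice")
incident.observe("bob")        # Bob can watch and help...
try:
    incident.acknowledge("bob")  # ...but he cannot silence the alarm
except NotResponderError as err:
    print(err)
incident.acknowledge("alice")
```

If the responder cannot actually respond (the plane Wi-Fi case), the right move under this model is to reassign `responder` first, not to let someone acknowledge on their behalf.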
Collaboration Protocol
An astronaut rarely tackles an alarm solo; they almost always collaborate with other individuals down on the ground. The first steps of this collaboration begin immediately after acknowledging the alarm. The astronaut informs Mission Control that they are responding to the alarm, and Mission Control acknowledges this. From this point onward, both parties keep an open channel of communication, constantly updating each other on their investigation and seeking consultation on potential actions.
In software you may not always have to work with someone else to resolve an issue, but you should still have a process for how people will communicate when they do need to. Any problem can be made worse if people fail to communicate as they work to resolve it. You risk stepping on each other's toes, duplicating effort, or neglecting something because you thought someone else was handling it.
Escalation Protocol
If something starts to go seriously wrong or the responder can't resolve the issue on their own, there exists a process for escalating the issue to those with more knowledge of the system experiencing the problem.
You should have the same process within your team or organization. Have a process for reaching out as an issue becomes increasingly critical or takes too long to resolve. This could be as simple as sending an email to your whole team, or as drastic as paging an all-hours Reliability Engineer or contacting a vendor for support. Regardless of what your escalation protocol is, make sure that your team members know what it is, and review it occasionally to make sure it still reflects how you want to handle alarms and problems.
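An escalation protocol is easier to review when it is written down as data. The sketch below is one possible shape, not a recommendation: the tiers, the time thresholds, and the `contacts_to_notify` helper are assumptions invented for this example, so substitute whatever your team actually agrees on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    after_minutes: int  # escalate to this tier once the issue has been open this long
    contact: str        # who gets pulled in at this tier


# Hypothetical policy; the thresholds here are made up for illustration.
POLICY = [
    Tier(0, "on-call engineer"),
    Tier(15, "whole-team email"),
    Tier(45, "all-hours reliability engineer"),
    Tier(90, "vendor support"),
]


def contacts_to_notify(policy, minutes_unresolved):
    """Return every contact that should be involved by now, earliest tier first."""
    ordered = sorted(policy, key=lambda tier: tier.after_minutes)
    return [t.contact for t in ordered if t.after_minutes <= minutes_unresolved]


print(contacts_to_notify(POLICY, 20))  # → ['on-call engineer', 'whole-team email']
```

Keeping the policy in one small structure like this also gives the team a single thing to look at during the occasional review.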
Do What Works
At the end of the day, though, your software probably does not have the uptime requirements or the cost of failure of a manned space flight's software. Make reasonable choices about how much monitoring and logging you and your team need to feel confident in your system.