We're gonna need a montage.
Draft in help
Sometimes the people who have been working on the project are the ones who are asked to solve the problems. The common misconception is that they are best placed to do it because they are most familiar with the code.
This is not always true, because those very people can develop tunnel vision and view problems through their own biases and assumptions. This can manifest in quotes such as 'It's the web services, it always is.' Or 'There must be something wrong with the database server. Our sprocs are the best sprocs you will ever see.' This narrow field of view can cause the troubleshooter to miss the actual cause of the problem.
A fresh set of eyes and additional skills can sometimes be what is needed to find the needle in the haystack. The causes of site problems are often related to common areas such as memory leaks or threading and concurrency issues. Expert troubleshooters can quickly home in on these areas without needing to know what 90 percent of the code base does.
Finally, there is no shame in getting help and calling upon professional services to come onsite and troubleshoot with you. It could save you a lot of money, reputation damage and potentially important clients.
Deflect unnecessary help
In a crisis everybody wants to do their bit and transforms into Action Jackson. You will have ideas and theories blasted at you from all directions. It is best to make one person the point of contact for all the 'have-a-go heroes', someone who can, in a nice but firm manner, stop potentially noisy and disruptive input getting in the way of the heads-down troubleshooters. That said, sometimes outside input can be helpful, so never shut yourself off completely.
Communicate with the client regularly
Clients whose business depends on their online services are understandably the most panicky when things go awry. Good PR and communication are essential to convey a plan of attack and progress along the way. Even if no progress is made, updates are still necessary, because radio silence makes the client think that you are doing nothing, are stuck, or have given up and are racing away on the Thames in a speedboat. Account managers should work with tech to formulate palatable and digestible updates that keep the client comforted and confident in your effort.
Drastic Action is Okay
If you've found the source of the problem and are burning a lot of time trying to fix it, think about alternatives. Swapping out a software component can be painful, but it is sometimes a quicker route to resolution and may even be better for the long term. I can think of two occasions when there was something fundamentally wrong with the search implementation, and hacking off the ropey limb and sewing another one on was an effective solution. It's difficult making that call, but knowing when to bite the bullet is all about damage limitation. Obviously, some programming principles and patterns, like Dependency Injection, are there to make this less painful and should be followed.
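To make the Dependency Injection point concrete, here is a minimal sketch in Python. Every name in it (SearchProvider, LegacySearch, ReplacementSearch, ProductPage) is hypothetical and made up for illustration, not taken from any real project. The idea is only that the page code depends on an interface, so the ropey search can be swapped at the one place where things are wired together.

```python
from typing import Protocol


class SearchProvider(Protocol):
    """Hypothetical interface: anything that can answer a search query."""

    def search(self, query: str) -> list[str]: ...


class LegacySearch:
    """Stand-in for the troublesome implementation."""

    def search(self, query: str) -> list[str]:
        raise RuntimeError("legacy search index is unavailable")


class ReplacementSearch:
    """Stand-in for the component swapped in during the incident."""

    def __init__(self, documents: list[str]) -> None:
        self._documents = documents

    def search(self, query: str) -> list[str]:
        needle = query.lower()
        return [doc for doc in self._documents if needle in doc.lower()]


class ProductPage:
    """Depends only on the SearchProvider interface, not on a concrete class."""

    def __init__(self, search_provider: SearchProvider) -> None:
        self._search = search_provider

    def results_for(self, query: str) -> list[str]:
        return self._search.search(query)


# The swap happens in one place: where the dependency is wired up.
page = ProductPage(ReplacementSearch(["Red kettle", "Blue kettle", "Toaster"]))
print(page.results_for("kettle"))
```

Because ProductPage never names a concrete search class, replacing LegacySearch means changing the wiring line and nothing else. In a real system the wiring would more likely live in a container or configuration file than in a module-level assignment.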
Be proactive and reactive
Firefighting needs a two-pronged attack: reactive monitoring and proactive investigation.
Reactive monitoring means putting the logging and diagnostics in place to alert you when the site is on its last legs or when it goes down. Logging and tracing give you something to do an autopsy on after the event. The more logging, the better the chance of finding the source of the problem. When you have reached the end of the line you may have to go deep and start analysing crash dump files with tools like WinDbg. This happened recently and I marvelled at the artistry used with that tool.
Proactive investigation is an active hunt for the problem. Assessing the software and infrastructure in detail can identify scalability issues, bottlenecks, fragile dependencies and silly mistakes like not managing your resources carefully. Again, outside expertise can really help here.
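As a rough illustration of logging that also raises the alarm, the sketch below uses Python's standard logging module. The ErrorRateAlertHandler class, the "site" logger name and the thresholds are all assumptions made up for this example; a real setup would more likely lean on a monitoring product than hand-rolled code.

```python
import logging
import time
from collections import deque
from typing import Callable


class ErrorRateAlertHandler(logging.Handler):
    """Calls `alert` when too many ERROR records arrive within a time window."""

    def __init__(self, threshold: int, window_seconds: float,
                 alert: Callable[[str], None]) -> None:
        super().__init__(level=logging.ERROR)
        self._threshold = threshold
        self._window = window_seconds
        self._alert = alert
        self._timestamps = deque()  # arrival times of recent errors

    def emit(self, record: logging.LogRecord) -> None:
        now = time.monotonic()
        self._timestamps.append(now)
        # Drop errors that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self._window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self._threshold:
            self._alert(f"{len(self._timestamps)} errors in the last {self._window:g}s")
            self._timestamps.clear()  # avoid re-alerting on every later error


# Hypothetical wiring: alerts are collected in a list here; in practice the
# callback might page someone or post to a chat channel.
alerts = []
logger = logging.getLogger("site")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(ErrorRateAlertHandler(threshold=3, window_seconds=60, alert=alerts.append))

for attempt in range(3):
    logger.error("checkout failed (attempt %d)", attempt)
```

After the third error the callback fires once and the window is reset. A file or network handler would normally be attached to the same logger as well, so the full trail is still there for the autopsy.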
Vampire Launches are something that everyone can do without. Avoiding them is ideal, and thorough testing can stop you going live unprepared. But there are always issues that slip through the net, like a problem in a third-party platform, framework or library, which can catch you out. When this happens - and it will - how you act and the decisions you make maketh the man, woman or gung-ho American soldier.