DevOps - Rolling it out with the Production Support Team

DevOps has been all the rage lately, and it means different things to different people. To some it is about the process; to others it is about the tools, or even the culture. People are pushing for it from multiple angles, and others are resisting. Nobody likes change, especially if it means more work. Yet, is IT in some organisations sustainable without changing? Are people being crushed under the weight of technical debt? To push change, it is important to understand the resistance. DevOps champions need to understand their audience and tailor the message accordingly, addressing that audience's pain points.

Here is one particular story I like to tell, particularly to the production support team. In many cases, these teams resist any change. They are already overwhelmed with the various activities required to keep production running "business as usual". They work late, are underappreciated, and all people say is that it is IT's fault again when there are issues. They are upset and angry. And here comes a new DevOps champion who has just been hired to transform IT. He is not one of them. He is 'the other side'. He has not done any real work in the environment, and he thinks he knows better. We can hardly blame the production support team for thinking this way. To implement DevOps, we need to change people, and that is much harder than changing systems.

A Short Story of the Life of a Traditional Production Support Personnel

For those in production support, it must be pretty painful and scary to step out of the house. For some, that even applies to stepping out of the office. If you are lucky, you get a call saying that something weird is happening to your application. For others not so lucky, the whole world collapses without you being aware, while you are watching Armageddon in the theatre.

For those who are lucky, you are even luckier if you brought along your trusty laptop. You can now log in remotely to your enterprise network and check what is wrong, after a few failed VPN connection attempts. Never mind the 3G hotspot connectivity that maybe only offers 100kbps and barely renders the Remote Desktop properly. Just-in-time rendering, maybe 2 seconds slower than your action. You have just invented time travel, only that you are in the future and the Remote Desktop is in the past. Oh, and never mind that the laptop might weigh 3kg. A great workout you get daily when you are out at the movies with your loved one. You also pray hard that the laptop does not crash in the midst of troubleshooting.

Now that you have got in and found out what is not working (not the root cause yet), it is time to recover the system. It is a race against time again as you battle the weak connectivity. Your managers and the business are probably also calling you every other minute, each wanting a status update. You feel like switching the phone off and focusing on the recovery. You need to restart a service, but that requires approval from a few managers. A crisis management team is in place, and they have approved your plan and request. Time to get the password from an enterprise password vault and perform the manual recovery. Time ticks by as another team processes your emergency password request. Business gets upset, and Technology is anxious but sympathetic. It is this way because of all the checks and controls we need in place. Maybe two hours after the first phone call, the service is recovered. Now it is time to face the angry spouse you abandoned at the movies. Or maybe not. An incident report awaits, with more meetings and discussions through the night. The spouse will have to wait. Again. At least the laptop did not crash in the middle of the recovery.

That is the state of Production Support in most traditional enterprise IT departments. To try to sum up what happened above:

Service recovery is dependent on a human manually performing complex but authorised actions (after approvals to get system credentials as well) to troubleshoot and recover the system, using a machine that he needs to carry with him every minute, over a network of questionable quality, and at the same time needing to regularly report the recovery status.

And we wonder why people want a career switch after a few years in IT, or shy away from production support.

Can their life be better? Probably. With DevOps in the picture, how would this look?

A Short Story of the Life of a DevOps Production Support Personnel

You might have application monitoring in place. An alert is triggered, informing you of exactly what went wrong. To do further troubleshooting and confirmation, you might still log in remotely to your enterprise network and check what is wrong. You use your mobile phone to connect back to your enterprise network via VPN, and figure out what is wrong. Troubleshooting is easy, as you have a few automated health check scripts in place to trigger. The phone calls are still coming in though, except it is more painful now since you are using the same phone to troubleshoot. Perhaps it's time to get another phone, or a tablet. Time to recover the system. You seek the necessary approvals and trigger the automated recovery scripts. You have just saved the world, though you probably still missed half of the movie. The incident report still awaits you, though with some luck, management and the business could be so happy with the speed of resolution that they are willing to let you enjoy the rest of the night.

Let us try to sum this up as well:

Service recovery is now dependent on a human manually triggering simple, automated and authorised scripts (no additional access is required, just approval for the action) to troubleshoot and recover the system, using a mobile device that he needs to carry with him every minute, over a network of questionable quality (but with no Remote Desktop rendering this time!), and at the same time needing to regularly report the recovery status. At least the bulk of the pain is gone. We hope.

It could get better. Hook up the automated recovery script to listen to the alert. All you might receive is just an alert saying something is wrong, along with a quick alert saying a recovery script is triggered, and finally an alert saying everything is fine again. The incident report can wait for the next working day. Really.

Is this something special that only DevOps could offer? Probably not. Nothing depicted is unique to DevOps. DevOps, however, represents a mindset shift, and places heavy emphasis on automation. With these 'tools' in place, DevOps practitioners start automating themselves out of a job. And they are perfectly fine with that. People who believe in DevOps believe themselves to be dispensable. Surprisingly, the more they think so, the more the business thinks they are indispensable.

Bonus: A Short Story of the Life of a ChatOps Production Support Personnel

Now that we have DevOps, what benefits does ChatOps have for Production Support? Assuming we do not yet have the fully automated recovery in place, the entire process might go as below:

Bot: There is an issue with system A. Service B is not running

Is Service C running : You

Bot: Service C is running

Boss has joined the room

Boss: What happened?

Service B is not running, but Service C is still running. We just need to start the service : You

Boss: Ok.

Start Service B : You

Bot: Requesting approval from Boss to start Service B

Boss: Approved

Bot: Approval received from Boss. Starting Service B....

Bot: Service B Started

Run System A Health Check : You

Bot: Service B is running. Service C is running. System A is healthy.

Boss: Thanks for looking into this. Enjoy your night!

Boss has left the room

For some, this is an exciting thought. The world is just starting its journey of discovery with ChatOps, and the potential is limitless. ChatOps is about more than service recovery; it is all about visibility. Management and you are in the same chat room. They can see all the troubleshooting and resolution as it happens in real time. There is no need for separate reporting. If required, multiple chat rooms could be set up, so that they can observe the troubleshooting in one room and discuss crisis management in another. And all you need is a phone that supports the chat software.
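The approval step in the conversation above could be sketched roughly as follows. This is only an illustration in Python, not the API of any real chat platform: the ChatBot class, its handle method, and the start_service_b action are all hypothetical names, and a real bot would receive messages from the chat software rather than from direct method calls.

```python
from typing import Callable, Dict, List, Optional


class ChatBot:
    """Toy bot that only runs an action after the named approver says so."""

    def __init__(self, approver: str, actions: Dict[str, Callable[[], str]]):
        self.approver = approver
        self.actions = actions              # command text -> runbook function
        self.pending: Optional[str] = None  # command awaiting approval
        self.transcript: List[str] = []     # everything said in the room

    def say(self, text: str) -> None:
        self.transcript.append(f"Bot: {text}")

    def handle(self, user: str, message: str) -> None:
        # Every message is recorded, so the room itself is the report.
        self.transcript.append(f"{user}: {message}")
        if message in self.actions:
            self.pending = message
            self.say(f"Requesting approval from {self.approver}: {message}")
        elif message == "Approved" and self.pending is not None:
            if user != self.approver:
                self.say(f"Only {self.approver} can approve")
                return
            action, self.pending = self.pending, None
            self.say(f"Approval received from {self.approver}")
            self.say(self.actions[action]())


# Simulated system state and a hypothetical runbook action.
state = {"Service B": "stopped"}


def start_service_b() -> str:
    state["Service B"] = "running"
    return "Service B started"


bot = ChatBot("Boss", {"Start Service B": start_service_b})
bot.handle("You", "Start Service B")
bot.handle("You", "Approved")   # rejected: you cannot approve your own request
bot.handle("Boss", "Approved")  # accepted: the action runs
print("\n".join(bot.transcript))
```

The point of the sketch is that the request, the approval and the result all land in the same transcript, which is what gives management the visibility described above.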

DevOps and ChatOps - Is it an easy path?

Quite a few improvements need to be in place to paint the above scenarios. They do not come free, and it involves a lot of hardship and pain just to get them in place. You will have to fight with business, management, and maybe even your own team to get everything in place. At the same time, business initiatives are waiting for you to implement as well, and those are marked with a higher priority than all these 'fanciful technology tools' you are trying to introduce. Business seldom sees the value of automation until a crisis happens, since Technology is seldom able to articulate the benefits in a way business can appreciate. Pretty sure giving Production Support time to watch movies with a spouse is not something the Business cares much about. But proactive and quicker business service resumption, or in other words, business service resiliency, is something they can appreciate. Seriously, we are not doing this for fun. We want to ensure a stable and reliable environment and quick service recovery, so that we have enough capacity to continue working on new business initiatives.

How do we get there?

Let us just discuss the fully automated DevOps scenario. We would need:

  • Health check scripts that can be triggered on demand, and preferably regularly scheduled too. These could be in the form of synthetic transactions as well.
  • Continuous monitoring on processes, components, and results of health check scripts and synthetic transactions.
  • Runbook scripts that can be triggered on demand to recover systems.
  • Event-driven automation systems that can tie any monitoring failure to runbook scripts.
  • Belief in the vision of a better future
  • Clear roadmap to reach the end state, broken down into many small quick wins.
  • Dedicated team to implement all of the above
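To make the technical items in this list a little more concrete, here is a minimal sketch in Python of how health checks, runbook scripts and event-driven automation might fit together. Everything in it is an assumption made for illustration: the Automation class, the simulated services dictionary and the start_service_b runbook are hypothetical, and a real setup would call an actual monitoring tool and alerting channel instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Automation:
    """Ties failed health checks to runbook scripts (all names hypothetical)."""

    health_checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    runbooks: Dict[str, Callable[[], None]] = field(default_factory=dict)
    alerts: List[str] = field(default_factory=list)

    def notify(self, message: str) -> None:
        # Stand-in for a real alerting channel (pager, chat, email).
        self.alerts.append(message)

    def run_once(self) -> None:
        """One monitoring pass: check, recover if needed, then re-check."""
        for name, check in self.health_checks.items():
            if check():
                continue  # healthy, nothing to do
            self.notify(f"{name} is unhealthy")
            runbook = self.runbooks.get(name)
            if runbook is None:
                self.notify(f"{name} has no runbook, human needed")
                continue
            self.notify(f"{name}: recovery script triggered")
            runbook()
            if check():
                self.notify(f"{name} is healthy again")
            else:
                self.notify(f"{name} is still unhealthy, human needed")


# Simulated system state: Service B is down, Service C is up.
services = {"service_b": False, "service_c": True}


def start_service_b() -> None:
    services["service_b"] = True


automation = Automation(
    health_checks={
        "service_b": lambda: services["service_b"],
        "service_c": lambda: services["service_c"],
    },
    runbooks={"service_b": start_service_b},
)
automation.run_once()
for line in automation.alerts:
    print(line)
```

In this sketch run_once would be called on a schedule or from a monitoring event. The re-check after the runbook is what decides whether a human still needs to be called.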

DevOps champions must protect the team from being hijacked to do other 'small simple tasks'. Without 100% dedication, the initiative will fail. It is a brutal world out there, and once a team member is asked to help with something small, like a small business initiative launch, the priorities get skewed, and soon the member is lost forever to delivering 'critical business initiatives'. The team members themselves must also believe in the importance of the work they are doing, so that they prioritise it over everything else. They need to believe that investing time in this will bear fruit and yield a better life, freeing up capacity in future to work on 'critical business initiatives'.

Just as an agile project breaks a system into multiple small feature deliverables, the same needs to be planned for this. A clear roadmap gives stakeholders the assurance that this is not an experiment but is well thought out, and each small quick win helps strengthen the case. You could work backwards from runbook scripts to speed up service recovery, or you could implement health check scripts to speed up fault domain isolation, or you could implement continuous monitoring to be notified of issues immediately. Any of these could be the first step, but essentially pick a small, well-defined battlefield and win the battle. Do not fight on too many fronts and overextend your resources. The recommendation is to go with a small but problematic system and build out sufficient scripts and monitoring to cover all three areas. With that in place, take the time to automate the recovery. It just takes a single small incident that recovers by itself, and everyone will be 100% on board with the vision.

Avoid doing things on a huge enterprise scale. Start small. For DevOps, most of the time, a single battle is all it takes to win the war.