From Chaos to Structure on the Operations Team
Problem
At my previous technology job, I was assigned a team that was malfunctioning and had a reputation for not being able to deliver quality work. Little ever shipped, and what did ship was incomplete and full of bugs. Engineers were rarely available to answer questions, documentation was poor, and it was hard to figure out what the team was working on from day to day. It didn’t come as a surprise when my CEO asked me if I could help them improve their operations. I investigated the situation and decided on a plan of action.
Actions taken
I started off by joining their Scrum calls and participating in their discussions and planning meetings, just to get a feel for what they were going through day to day. Hands-on experience is very useful for getting the full picture of a situation. It didn’t take me long to discover that the team was using several different ticketing systems, which meant that they had no single view of their operations. As a result, the team could not see the progress of their tickets or catch the ones that were falling through the cracks. Each group of tickets was marching to the beat of its own drum.
I also noticed a lot of frustration and burnout, bordering on exhaustion, among the engineers. Most of the work went to a small group of experienced people while others waited for them to finish it. A number of people didn’t have the skills needed to tackle the problems the team had to deal with and didn’t have the time to train. This reflected a lack of guidance, or a lack of time, on the part of the senior members of the team, who were deeply embroiled in the problems the team found itself in. Responsibilities were unclear, and I identified areas of overlap where team members needed cross-training. I made sure that each individual had an opportunity to follow a training program and strengthen their cross-functional competencies.
I decided to separate support work from project work so that people on support didn’t have to do projects and could remain in reactive mode, while people working on projects were not distracted by support requests. Project work and support work are two different kinds of activity that the human brain cannot handle well at the same time without frustration. Keeping them separate is important for healthy operations. I set up a weekly rotation for the support team, so that instead of being on support all the time, each person took a turn. The person on support for the week was also on PagerDuty call in case something went wrong. This way each individual regained control over their time and near-term planning. They knew whether they had hours to dive deep into a specific technical problem or whether they should keep the day light in order to answer outside requests in a timely manner.
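To illustrate the weekly rotation described above, here is a minimal sketch of a round-robin schedule. It is not the tooling we used: the function name `weekly_rotation`, the engineer names, and the choice of Monday as the handover day are all assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, engineer) pairs, one engineer per week, round-robin.

    Hypothetical helper: weeks are assumed to begin on Monday.
    """
    # Align the schedule to the Monday of the week containing `start`.
    monday = start - timedelta(days=start.weekday())
    # Cycle through the engineers until the requested number of weeks is filled.
    names = islice(cycle(engineers), weeks)
    return [(monday + timedelta(weeks=i), name) for i, name in enumerate(names)]


# Example with made-up names: three engineers over four weeks.
schedule = weekly_rotation(["alice", "bob", "carol"], date(2024, 1, 3), 4)
for week_start, name in schedule:
    print(week_start.isoformat(), name)
```

Everyone not named for a given week stays on project work, which is what gives each person a predictable view of the weeks ahead.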
Furthermore, I identified roles and responsibilities for every individual on the team and worked out what we needed as a team in terms of skills and competencies. Anyone on support was responsible for resolving small or short tickets immediately and assigning more complex tickets to the queue for the people working on projects. We also started creating repeatable runbooks for anything we needed to do, from providing access to resolving quick permission issues. That allowed us to automate most processes and enabled junior engineers who had recently joined to follow the procedures with ease. Runbooks were also effective for tracing whether a given activity had been performed correctly: if the result was not what we expected, we could simply walk back through the checklist.
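A runbook of this kind can be thought of as an ordered checklist whose results are recorded. The sketch below is only an illustration of that idea, not our actual runbooks: the `Runbook` class, the `first_failure` helper, and the access-request steps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Runbook:
    """Hypothetical runbook: a named, ordered list of (description, check) steps."""
    name: str
    steps: List[Tuple[str, Callable[[], bool]]] = field(default_factory=list)

    def step(self, description: str, check: Callable[[], bool]) -> "Runbook":
        self.steps.append((description, check))
        return self

    def run(self) -> List[Tuple[str, bool]]:
        """Run steps in order, stop at the first failure, and return the log."""
        log = []
        for description, check in self.steps:
            ok = bool(check())
            log.append((description, ok))
            if not ok:
                break
        return log


def first_failure(log: List[Tuple[str, bool]]) -> Optional[str]:
    """Walk back through the log and return the first step that failed, if any."""
    for description, ok in log:
        if not ok:
            return description
    return None


# Made-up example: the second step fails, so the third is never attempted.
grant_access = (
    Runbook("grant repository access")
    .step("request approved by manager", lambda: True)
    .step("user added to access group", lambda: False)
    .step("user confirmed access works", lambda: True)
)
log = grant_access.run()
print(first_failure(log))
```

Because every step leaves a record, a junior engineer can follow the list without guessing, and anyone reviewing an unexpected result can see exactly where the procedure stopped.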
As things started to run more smoothly, we began to organize deliverables by priority. We initiated weekly conversations with external stakeholders to learn more about their needs and to adjust their expectations. When they needed something from us, we would prioritize the request based on the scope of the task, current resource allocation, and the company goals, and then communicate what a possible timeline could look like and what participation we would need from them in order to deliver on time. We would also explain how they could check in and track progress, which quieted some of the noise coming from the outside.
Following up on that, we established standard SLAs for our support tasks. Timelines for more complex projects were negotiable, but for smaller tasks (for example, access or permissions) we set short timelines, and the team was clearly instructed on how to hit them. This gave stakeholders predictability and assurance that the work would be performed.
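As a rough sketch of how fixed SLAs for small tasks might be checked, consider the example below. The categories come from the text, but the target durations, the `SLA_TARGETS` table, and the function names are assumptions for illustration, not the values or tooling we actually used.

```python
from datetime import datetime, timedelta

# Hypothetical targets: the real durations are not stated in this article.
SLA_TARGETS = {
    "access": timedelta(hours=4),
    "permissions": timedelta(hours=8),
}


def sla_deadline(category, opened_at):
    """Return the deadline for a ticket, or None if its timeline is negotiated."""
    target = SLA_TARGETS.get(category)
    return None if target is None else opened_at + target


def is_breached(category, opened_at, now):
    """True only when a fixed SLA exists for the category and its deadline has passed."""
    deadline = sla_deadline(category, opened_at)
    return deadline is not None and now > deadline


opened = datetime(2024, 1, 8, 9, 0)
print(sla_deadline("access", opened))
print(is_breached("access", opened, datetime(2024, 1, 8, 14, 0)))
print(is_breached("project", opened, datetime(2024, 3, 1, 9, 0)))
```

Categories without an entry, such as larger projects, have no fixed deadline here, which mirrors the idea that their timelines were agreed case by case.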
Finally, I established healthy hours for operations. Some of the engineers on the team were clocking 14- or 16-hour shifts, and that affected their well-being in and out of the office. We had to communicate our decision to external stakeholders and make sure they were aware of when our people were available. Most were quite flexible and willing to adjust their own working hours to support the well-being of our engineers.