©1999 Johanna Rothman
“With man's great ability to think, reason, and compute, we can now pinpoint most of our current problems. The trouble is that we can't solve them.”
I walked in, and put my briefcase down. I fished out my dress shoes, and had one sneaker off before Ted barged in. “JR, I can't figure out what to do here. I've been here all night, and the builds still won't work. What am I going to do?”
We were a week away from our scheduled release date. We needed our product built. We also needed a release engineer with judgement, one who'd slept. I told Ted we'd spend one hour working on this problem, and then if we couldn't solve it, his job would be to go home and sleep for a while. After he slept, we'd continue. I would figure out what to do with all the engineers while the builds were broken.
Although I'm not a release engineer, I could help Ted. I know how to solve problems, and that's what Ted needed that morning. Ted needed to know what the difference was between what he saw and what he wanted to see. Here are the steps Ted and I took to resolve the broken build problem:
Identify the problem. When you identify the problem, rely only on data you have actually collected, not on assumptions. Think about the whole system in which the problem occurs. Ted had looked at the output from the original build and smoke test (an initial small test to see whether the system works at all). He saw that the executables built, but none of the tests passed. That was our problem: none of the smoke tests passed.
Collect data and clarify the problem. There are a number of tools to collect data, including control charts (see page 8, Metrics tidbit: Control Charts). No matter what tool you use, make sure that you are collecting data to clarify the problem. If you collect data and find that you make the problem appear worse, then you may not have identified the problem. If you have no supporting data, then you haven't properly identified the problem.
In this case, Ted needed to look at which source files went into the build, to see why the sources that had passed the smoke test yesterday did not pass today. When we looked at the files, we saw files with the same names, but much earlier version numbers. Ted had been so tired the night before that he had missed the differences in version numbers. Ted had no supporting data, so he was frantic when I came into work that day.
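The comparison Ted and I did by eye can be sketched in a few lines of Python. This is only an illustration, not the tool we used: the function names, the file names, and the idea of a manifest that maps each file name to a dotted version string are all assumptions made for the example.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.12' into (1, 12).

    Comparing the tuples avoids the string-comparison trap where
    '1.4' sorts after '1.12'.
    """
    return tuple(int(part) for part in text.split("."))


def find_regressions(previous, current):
    """Return {file name: (old version, new version)} for every file whose
    version went backward, or that disappeared, between two builds.

    Both arguments are hypothetical manifests: dicts of file name -> version.
    """
    regressions = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None or parse_version(new) < parse_version(old):
            regressions[name] = (old, new)
    return regressions


# Made-up manifests for yesterday's passing build and today's failing one.
yesterday = {"parser.c": "1.12", "lexer.c": "1.7", "main.c": "1.3"}
today = {"parser.c": "1.4", "lexer.c": "1.7", "main.c": "1.3"}

print(find_regressions(yesterday, today))
# prints {'parser.c': ('1.12', '1.4')}
```

A check like this turns "same names, earlier version numbers" into supporting data that a tired release engineer cannot overlook.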
Analyze the root cause. Use the data you've collected to think about the root cause or causes of the problem you see. I've used mind maps, fishbone diagrams, and cause-and-effect diagrams to analyze the root cause. Test your analysis with other people, to make sure you've done the analysis correctly. Once Ted and I realized that the build had picked up the wrong sources, we could ask why. We needed more data to understand why. In this particular system, one privileged person could change the file-picking mechanism for builds. Inadvertently, one engineer had changed the way files were picked. We asked the engineer why, and got the “whoops” answer. We realized we had not considered the need for security on the file-picking mechanism.
Propose and evaluate solutions. When you propose solutions, brainstorm freely. You may have constraints around possible solutions, especially in resources. Consider the constraints only when you evaluate your solutions. Ted wanted to be the only one to allow changes to the file-picking mechanism. I was concerned that if Ted wanted to take a vacation, we would not be able to get work done. We met with several engineers over lunch, and proposed and discussed a number of solutions. We decided to add an “Are you sure” message with a confirming action to the original mechanism. This would make someone think twice, but not prevent people from doing their work.
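An “Are you sure” step of this kind might look like the following Python sketch. It is not the mechanism Ted changed. The settings dictionary, the rule names, and both function names are hypothetical, and the `ask` parameter exists only so the example can run without someone at the keyboard.

```python
def confirm(prompt, ask=input):
    """Return True only when the person explicitly answers 'yes'."""
    answer = ask(f"{prompt} Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"


def change_file_selection(settings, new_rule, ask=input):
    """Change the (hypothetical) file-picking rule, but only after a
    confirming action. Returns True if the rule was changed."""
    question = (
        f"Are you sure you want to change the file-picking rule "
        f"from {settings['rule']!r} to {new_rule!r}?"
    )
    if not confirm(question, ask):
        return False
    settings["rule"] = new_rule
    return True


# Simulated answers stand in for a person typing at the prompt.
settings = {"rule": "latest"}

changed = change_file_selection(settings, "release-label", ask=lambda _: "no")
print(changed, settings["rule"])   # prints False latest

changed = change_file_selection(settings, "release-label", ask=lambda _: "yes")
print(changed, settings["rule"])   # prints True release-label
```

Anything other than an explicit "yes" leaves the rule alone, so the default outcome of a slip is no change. Nobody is locked out, which was the constraint I cared about.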
Implement the solution. Try out this solution in the circumstances that brought the problem to your attention. You may have to try something special in this context. Ted fixed the file-picking mechanism, and tried it out. He liked it. We asked our lunch-mates to try it also, and they liked it.
Standardize the solution. Once you've implemented and tested the solution, you can standardize it. Sometimes this step is just as hard as the analysis. In fact, you may have to re-analyze the problem and solution, to verify you can standardize the solution. We implemented the solution, sent email to everyone, and went back to having daily builds that (mostly) worked.
Review/Improve. Learn from your actions. After the solution has been in place for a while (choose a period that makes sense for your problem), review it. Is there a way you can improve on the solution? I decided we needed to reconsider security on all of our in-house tools. After this solution had been in place for a month, I asked Ted to assess the rest of our in-house tools. We could see other places to ask confirming questions.
Problems sometimes come up in a sequence. When you find one and fix it, sometimes a new problem pops up. In this case, the original problem was “The builds don't work”. The build picked the wrong sources. We fixed things so the builds worked. Then we had a bigger problem: How do we allow people to get their work done and instill some security in our systems?
The new problem may be harder to fix than the original, but if you follow a coherent problem-solving approach, you'll take just a moment to say “darn.” Then you'll take a deep breath and bring yourself back to your familiar problem-solving mode.