Failure is always an option
2003-12-04 23:45:08.53673+01 by Dan Lyke 0 comments
Jay attributes this one to us, but I don't think we linked to it yet: The Atlantic on the space shuttle Columbia's last flight. In reading through it, I'm struck by the similarities to some flaws that I'm correcting right now, some of them my own. There is an easy trap to fall into, especially in software, where we're often building tools for specific purposes, testing the known case and a few boundary conditions, and then shipping: if it works, it must be correct. And software amplifies this, because we don't have a lot of manufacturing issues; if it works, burn a hundred thousand copies and be done with it.
As I go through the hassles of supply chain management for actual hardware, I get a better handle on how systems go wrong: "We can't get that part but I can build an equivalent out of..." becomes "...you can build an equivalent..." becomes "we don't have time to get the designed parts and these parts are equivalent anyway", and since they work, they must be, right? But systems are combinatorial, and that shortcut here becomes something I don't want to deal with, and the next thing I know I'm trying to suss out from some user's interpretation of the failure that something I'd filed as equivalent and ignored has become a failure point. In fact, I'll probably subconsciously try to bury any data that might point to that substitution or glossing over because my ego's on the line: at some point I said they were equivalent.
The reality is that those of us who work on tight development cycles have learned to accept a certain level of failure. "Here's the workaround", "Sure, we could re-engineer that part, or the user could just...", "I don't want to refactor that code because I might introduce bugs, so I'll just kludge around it." And in software you can get away with this almost all the time. Hardware catches up with you a little more often. And those "little more often"s eventually catch you, even if in the short term they let you run at lower cost.
But I find myself learning, even if slowly, that it's okay to say "no" to customers and managers, even when processes take several times longer to do right; and the longer the feedback loop between the decision and the correction, the more important it is to take the time. Also, there's no such thing as an occasional glitch; either it's right or it's not.
Because failure is always an option, and we have to make a conscious effort not to choose it.