Why Does Software Break?

It’s only natural to wonder why, after all this time and with all our collective experience, we still produce buggy, brittle software that breaks and crashes. It’s also only natural to compare “software engineers” with the other kinds of “engineers” – the people who build bridges, skyscrapers, cars, planes, etc., things that work for years and don’t (generally) break down and crash – and ask, “Why can’t we do the same thing with software?”

To answer that question, it’s important to make a distinction between the physical world of bridges, skyscrapers, planes, and such, and the “thought-stuff” world of software.

While software is, to use the words of Frederick Brooks in The Mythical Man-Month, made purely of insubstantial “thought-stuff,” it is, ultimately, made by man – and as man is fallible, so too are the things that he creates. (After all, some bridges fall down, some skyscrapers collapse, leak, or shake in the wind, and some planes crash.)

There’s also the “layer” aspect to keep in mind – software may be “thought-stuff,” but it doesn’t exist in a vacuum. It relies upon the perfect function of millions (or billions) of tiny, often microscopic physical components, which have been engineered with great precision and to tight tolerances. A few cosmic rays (or a clumsy user pulling out a cord) can screw up the perfect balance of all these components in unimaginable ways – sort of like pulling out the main support for a bridge, or blowing out the tire of a car. (Or, perhaps, like having a few large birds fly into the engine of a plane!) When these sorts of things happen, the system – be it bridge, plane, car, or computer – fails, often spectacularly.

So, it’s less accurate to think of a computer system (hardware and software together) as being like a bridge, and more accurate to think of it as being like a giant clockwork mechanism – a huge Rube Goldberg-type device – with hundreds of finely inter-meshing gears and sprockets. If just one gear pops out of place, or one sprocket cracks a tooth, the system stops working properly – perhaps just a little bit, or perhaps so much so that more gears are forced out of place, and more sprockets are broken, until the entire thing collapses in a pile of ruin.

To carry the bridge metaphor in the other direction (as it were), it might be more accurate to think of a computer system as being like a bridge that not only functions like a bridge (gets people from one side to the other), but also functions as a musical instrument capable of producing classical, jazz, and electronic/techno music; predicts the weather; washes your clothes; generates electrical power; can be quickly reconfigured into a skyscraper home for people or a hospital, as needed; can float up and down the river to a new crossing (dynamically expanding or shortening its length as it goes, of course); and can also fly, carrying everyone on it to a new river, with new road signs that instantly match the language and traffic patterns of the new location. It also has to do all this without disturbing the environment around it, while simultaneously accepting any impact its environment has on it, even if that impact might cause it to function in a manner contrary to the one for which it was designed.

If you were to try to build a physical bridge to do all of these things, it would probably break in much the same ways that software does.

To use a different analogy, consider the difference between a typewriter (a machine designed to do just one thing – type words) and a computer. No one would claim that the computer is the more reliable typing instrument – after all, the typewriter is fairly simple, and because it is designed to do just one thing, it can do it well. Also, when the typewriter fails, the cause is generally immediately apparent (e.g., out of ink ribbon) and can easily be understood – and fixed – by the user.

On the other hand, the computer – while on the surface just the same as the typewriter (a keyboard on which you type words) – is far more flexible. There is an almost infinite number of other things that the computer could do in addition to typing: it could play music, calculate your taxes, control millions of tiny light-producing elements to display an interactive 3D environment (or a photo of your dog), talk to you using a synthesized voice, control complex machining equipment, participate in a global network, and do almost anything else you could imagine.

When you consider that, it’s no wonder that computers have so many ways in which they can break. It’s exactly because they are so flexible that they are so fragile at times – their flexibility is their greatest strength, and at the same time, their greatest weakness. Because they are so generalized, getting them to do any one specific thing involves a lot of re-building of concepts (we call them “metaphors” in the world of software) just to get any useful work done, never mind actually taking care of the main task at hand.

In the end, software breaks because it, together with the computer on which it runs, forms a general-purpose machine that we ask to do an enormous number of things (some of them contrary to one another!). Even though we might only be asking it to do something simple on the surface (e.g., type a few words onto the screen), in reality there are innumerable hidden complexities involved in getting a general-purpose machine to do something so specific (and, we would hope, do it well). It’s only natural that there will be errors, both human-induced ones and artifacts of the system itself.

In other words, software breaks because computers are fantastically flexible general-purpose machines that, by their very nature, require complexity in order to do anything specific – and no layers of abstraction, big-M Methodologies, frameworks, or whatever else we come up with are going to change that simple and immutable fact.

By Keith Survell

A geek, programmer, amateur photographer, anime fan and crazy rabbit person.