One of the things we all struggle with is why computers have bugs. Why do they run perfectly well for months and then slow down? Worse, why are they slow one day and fast another? Or why does something work a dozen times in a row and then suddenly stop working? Well, I had an experience yesterday that let me explain this to a doctor, and he suggested I share it.
I was in the doctor’s office and he had his IT support person in, looking at why something worked from the computer in his office but not from the computer in the exam room. Actually, it had been working in the exam room but had suddenly stopped working correctly. He rhetorically directed this question at me, and I rhetorically asked him why a cell suddenly goes crazy and starts reproducing out of control (i.e., cancer), or why a drug can help 99.9% of patients and kill 0.1% of them. Then I explained that software (and hardware) have become so complex, with so much state information lying around, that we can no longer completely understand or control their behavior. You could see the lightbulb go on, and he said “I got it, they are Biological Systems”. He, like all of us, thinks of computers as being ruled by the laws of physics (as he put it, or mathematics as I tend to think of it), and of course at some level (just as with biological systems) they are. But when you look at things at the higher systems level, they really have started to resemble a biological system in which no two instances (just like no two people) are exactly alike.
Now I understand why a doctor would so easily jump to this understanding of modern computer systems, but I’ll dive into it in case you aren’t comfortable with the analogy.
People (as an example of biological systems) are each unique individuals. They receive their basic programming (e.g., DNA) from their parents, and while each Homo sapiens inherits mostly the same programming, we also inherit a bunch of unique programming. There are hundreds of BRCA1 gene mutations. If a woman inherits one of the wrong ones then she has a 60% chance of developing breast cancer. Humans have 20,000 to 25,000 genes, and I don’t know how many variations of each, and then there are the interactions between genes. So our variety and complexity are quite high. Well, you say, this doesn’t happen with computers? And I say BS :-) Nearly every computer out there is at least a slight variation on every other computer. They have different CPU chips, different graphics boards, different BIOS authors, different versions of that BIOS, different hard drive models, different collections and versions of software installed on them, etc. A PC is a PC in a similar way that a Person is a Person. They are the same, yet actually quite different, even in the aspects we think of as invariant.
What is even more interesting is that both People and Computers run around with both temporary and persistent state information, and that this state information alters their systems’ behaviors in seemingly unpredictable ways. For example, there are many drugs which make people photosensitive. For those not taking the drug, 30 minutes in the sun helps their tan and produces nice amounts of Vitamin D. For those taking the drug, 30 minutes in the sun produces a sunburn. Smoking produces all kinds of persistent state changes. Combine all the persistent (over a lifetime) and temporary state changes and you get strokes, heart disease, cancer, etc. Or take Vioxx. Like most drugs, it induced a temporary state change, in this case to fight inflammation (e.g., from arthritis). Unfortunately it turned out that in some patients the state change it caused interacted with other state (and perhaps genetic programming) in the body to cause heart damage. So how does this work in computers?
Let’s take something as simple as an e-mail message. Each message has a tremendous amount of state, and you are constantly altering that state. Read the message and the computer switches the state from Unread to Read. Reply and it records that the message has been replied to. Flag or otherwise categorize the message and that’s recorded too. Sync your Droid, your iPad, and Outlook on your PC, access that same message from IE7, IE8, IE9, Firefox, Chrome, Safari, etc., and you have a tremendous amount of both temporary and persistent state involved. Things would likely be simple if programming always dealt with one piece of state at a time, but often you deal with multiple pieces of state simultaneously. Since the amount of state being kept in a typical computer is now so large, from a practical standpoint the number of possible combinations is approaching dangerously close to infinity. The programming of how to behave in light of all that state, and how to modify that state information, is different for each of the ways of accessing your mail. And so you very quickly end up with everything from minor bugs, like the iPad’s (and iPhone’s) email application not being able to correctly maintain the Read/Unread count, to more serious problems, like having your iPhone or Outlook lose the ability to correctly sync with the email server without deleting and re-adding the email account, to disasters like email being completely lost. Now multiply this idea through everything running on your system, add that even seemingly independent things can have state interactions, and you start to see the picture. Why does killing and restarting an application, or rebooting your computer, often resolve problems? Because it clears temporary state information. Sometimes you never get back to that same set of temporary state and thus the problem never recurs. Sometimes you get back to it eventually. Occasionally you can reproduce it quickly, implying an interaction with more persistent state.
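For those who like to see this in code, here is a minimal Python sketch of the idea. It is a toy model, not how any real mail client works: the names `count_states` and `MailClient`, the five-flag figure, and the cached unread counter are all assumptions made up for illustration. It shows two things: how quickly the number of possible state combinations grows, and how one client's temporary state can drift away from the shared persistent state until a restart rebuilds it.

```python
def count_states(num_flags: int, num_messages: int) -> int:
    """Distinct on/off flag combinations across a whole mailbox.

    Each message has 2**num_flags combinations (read, replied,
    flagged, ...), and the messages vary independently.
    """
    return (2 ** num_flags) ** num_messages


class MailClient:
    """Hypothetical client that keeps a cached unread count."""

    def __init__(self, mailbox: dict):
        # Persistent state: message id -> has it been read?
        # The same dict is shared by every client, like a mail server.
        self.mailbox = mailbox
        # Temporary state: computed once and then maintained locally.
        self.cached_unread = self._recount()

    def _recount(self) -> int:
        return sum(1 for is_read in self.mailbox.values() if not is_read)

    def mark_read(self, msg_id: int) -> None:
        if not self.mailbox[msg_id]:
            self.mailbox[msg_id] = True
            # Only this client's cache is updated; others are not told.
            self.cached_unread -= 1

    def restart(self) -> None:
        # Throw away temporary state and rebuild it from persistent state.
        self.cached_unread = self._recount()


if __name__ == "__main__":
    # Five flags on one message is 32 combinations; on ten messages
    # the count is already in the quadrillions.
    print(count_states(5, 1), count_states(5, 10))

    mailbox = {1: False, 2: False, 3: False}
    phone = MailClient(mailbox)
    desktop = MailClient(mailbox)

    desktop.mark_read(1)        # read the message on the desktop
    print(phone.cached_unread)  # phone still shows 3: stale temporary state
    phone.restart()
    print(phone.cached_unread)  # after the restart the phone shows 2
```

The bug here is deliberately simple, a cache nobody invalidates. Real clients fail in subtler ways, but the shape is the same: behavior depends on temporary state that no longer matches the persistent state, and restarting makes the two agree again.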
As annoying as rebooting is, think of it as an advantage Computers have over People. We can’t just throw away our temporary state; we have to alter it quite tediously using drugs, nutrition, lifestyle change, etc. But if anything this strengthens the analogy.
Now you know why, after decades of trying to make software bug-free, computers are still so unreliable. 30 years ago most bugs were straightforward coding mistakes, and those now rarely make it out of the software development process. 20 years ago most bugs were about localized mishandling of a single piece of state, and once again those rarely make it through the software development process. But since then we’ve been struggling with the explosion of both temporary and persistent state on both a local and global basis. The trend to new app models (iOS, Windows Phone 7, and now Windows 8’s Metro app model) is largely driven by the realization that the industry needed greater isolation of state between applications (and greater control over each application’s impact on system state). That’s progress, and it explains a lot of why an iPhone or Windows Phone 7 device feels so much more reliable than either a Windows or Mac PC. At the same time the move to cloud computing, and thus the greater amount of state sharing between various clients and the cloud, is increasing the amount of distributed state. An explosion in the number of cores in a typical computer processor, and the growth in heterogeneity of cores (or auxiliary processors like GPUs), is also dramatically increasing complexity. So when we look back 10 years from now I don’t expect the overall reliability of computer systems to have improved. Doesn’t the fight to make computer systems reliable feel a lot like the fight to cure cancer?
So the next time you wonder why computers aren’t more reliable, or try to explain it to a friend, keep the biological system analogy in mind. Because those are the rules computer systems are now following.