I changed ISPs last Friday. At some point that day, my old ISP bounced my email with a message that was strange to me. This is the same ISP that had problems just a few months ago, so I was done. I need email up virtually 100% of the time. And if I can't receive email, I need my ISP to hold it for me, not lose it or return it to the sender before it reaches me.
I'm not privy to how my old ISP tested changes to the mail queue. But I can imagine the conversations people had about the changes and how they might test them. If they were anything like some of the systems I've worked on in the last 25+ years, there were comments like this:
- We'll wait till the system starts to degrade and then we'll intervene manually.
- We've tried simulating the system, and we just can't tell enough. We need to put it in place and respond to the degradation.
- We've measured and monitored the system; we won't have problems.
- We've architected the system to degrade gracefully. We'll watch for that and respond.
I have yet to see a system degrade gracefully. I've seen systems show small warning signs before degradation, signs that were sometimes too small to recognize as the first instance of degradation. But we, the human beings who were supposed to respond to those signs, more often than not completely missed them. It was too easy to talk ourselves into thinking the warning signs weren't really signs. (BTW, this applies to the human body, too.)
If you'd like to know how your system degrades and when, you gotta test it in a variety of ways: stress, load, performance, and reliability testing. This is not easy, and it requires significant knowledge of the internals of the system under test.
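To make "how and when" concrete, here's a minimal sketch of a load ramp against a toy model of a mail queue. Everything in it is an assumption for illustration: the names `simulate` and `find_degradation_point`, the service rate, the queue limit, and the one-minute run. It says nothing about how any real ISP's mail system works. It just shows the shape of the test: raise the load step by step and record the first point where mail starts bouncing.

```python
def simulate(arrival_rate, service_rate=100, queue_limit=500, ticks=60):
    """Run a toy bounded queue for `ticks` one-second steps.

    All parameters are hypothetical. Returns (bounced, max_depth):
    messages rejected because the queue was full, and the deepest
    the queue got before servicing.
    """
    depth = 0
    bounced = 0
    max_depth = 0
    for _ in range(ticks):
        depth += arrival_rate              # new mail arrives
        if depth > queue_limit:            # queue full: overflow is bounced
            bounced += depth - queue_limit
            depth = queue_limit
        max_depth = max(max_depth, depth)
        depth = max(0, depth - service_rate)  # deliver what we can this tick
    return bounced, max_depth


def find_degradation_point(rates, **kwargs):
    """Return the first arrival rate at which any message bounces, or None."""
    for rate in rates:
        bounced, _ = simulate(rate, **kwargs)
        if bounced > 0:
            return rate
    return None


if __name__ == "__main__":
    print(find_degradation_point(range(50, 201, 10)))
    print(simulate(105))
```

With these made-up numbers, the ramp first bounces mail at 110 messages per second. The more interesting result is `simulate(105)`: no bounces at all, but the queue climbs to 400 of its 500 slots within the minute. That's the kind of small warning sign that is easy to talk yourself out of, and a longer run at the same rate would overflow.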
I don't know what kinds of testing my old ISP did. But my new ISP seems to have a lot more redundancy, and it seems to test changes before rolling them out. And they don't have to roll a change out to everyone all at once, which is a nice risk management technique.
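I don't know how my new ISP decides who gets a change first, so what follows is only a sketch of one common way to do a staged rollout, not their method. The function names and the idea of keying on a customer ID string are my assumptions. Each customer is hashed into a stable bucket from 0 to 99, and a change applies only to customers whose bucket falls below the current rollout percentage.

```python
import hashlib


def rollout_bucket(customer_id: str) -> int:
    """Map a customer ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def gets_change(customer_id: str, percent: int) -> bool:
    """True if this customer is inside the first `percent` of the rollout."""
    return rollout_bucket(customer_id) < percent


if __name__ == "__main__":
    # Hypothetical customer IDs, for illustration only.
    for cid in ["customer-001", "customer-002", "customer-003"]:
        print(cid, rollout_bucket(cid), gets_change(cid, 10))
```

Because the bucket depends only on the ID, a customer who gets the change at 10% still has it at 50%, and raising the percentage only ever adds people. If the change misbehaves, dropping the percentage back to 0 turns it off for everyone.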
If you're concerned about system degradation, take the time to think about how you'll test it, especially if you're selling a commodity product where your customers have options. Otherwise, your customers may well do what I did: walk.
I am working through a similar problem with our ISP (I wonder if we have the same one?). They have been aware of “issues” for more than eight weeks, but we only found out when my partner’s email to me bounced while we were sitting in the same room. We are contacting our clients from an alternate e-mail address to learn how much communication we have lost. Needless to say, we are changing ISPs. When we asked, our current ISP told us they knew of the problem, but they were “unsure of the solution they would implement.”