bpendleton wrote:
So, it's a trade-off. Building soft-fail apps that don't also have cascading failures is hard. I saw the guy who ran CNN.com give a talk on what happened to them on 9/11... and having load balancing actually caused them to take longer to recover, because the load was so high that, once one server failed, the increased load on the others caused them to fail pretty readily as well. That's an extreme situation (and, arguably, a not-conservative-enough disaster response was employed), but it *does* take more work to administer two machines than one. And that's basically what you're going to be up against. If you get two, let us know how it goes!
Ahh, the war stories that I can't tell because they'd fall on the wrong side of corporate confidentiality. :/
Suffice it to say, try crashing *all* of your servers in a cascading failure, and then realizing that bringing said network back up is not as easy as it sounds. :/
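
For anyone who wants to see the mechanism play out, here is a toy Python sketch of the kind of cascade described in the quoted post. It is not a model of any real site: the function names, the capacities, and the assumption that the balancer splits load evenly across whatever servers are still up are all made up for illustration.

```python
def simulate_cascade(capacities, total_load, first_failure):
    """Toy model: load is split evenly across the servers still alive.

    capacities    -- hypothetical max load each server can carry
    total_load    -- total incoming load (same units as capacities)
    first_failure -- index of the server that dies first

    Returns (rounds, survivors), where rounds is a list of lists of
    server indices that failed in each step of the cascade.
    """
    alive = set(range(len(capacities)))
    alive.discard(first_failure)
    rounds = [[first_failure]]
    while alive:
        share = total_load / len(alive)  # even split is an assumption
        overloaded = sorted(i for i in alive if capacities[i] < share)
        if not overloaded:
            break  # the survivors can carry the load; cascade stops
        alive -= set(overloaded)
        rounds.append(overloaded)
    return rounds, sorted(alive)


def min_servers_to_restart(capacities, total_load):
    """Smallest number of servers that must come up *together* to survive.

    Uses the strongest servers first. Returns None if even the whole
    pool cannot carry the load under an even split.
    """
    ranked = sorted(capacities, reverse=True)
    for k in range(1, len(ranked) + 1):
        # The weakest of the k strongest servers must handle its share.
        if ranked[k - 1] >= total_load / k:
            return k
    return None


if __name__ == "__main__":
    caps = [100, 110, 120, 130]
    print(simulate_cascade(caps, 360, first_failure=0))
    print(min_servers_to_restart(caps, 360))
```

With these made-up numbers, each server starts at 90 units, comfortably under capacity. Kill server 0 and the cascade takes out server 1, then servers 2 and 3 together, leaving nothing standing. Drop the total load to 300 and the same first failure is absorbed by the other three. The second function is the recovery half of the story: at a load of 360 no partial set of servers survives on its own, so restarting them one at a time just knocks each one over again unless you shed load first or bring the whole pool up at once.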