I believe Linode has worked with folks in the past to reduce power (and perhaps network) SPOFs. It's a somewhat manual process, but it'd be worth opening a ticket to make sure.
I do know the provisioning system avoids, whenever possible, putting more than one instance on the same host for a given account. It will happen if it absolutely cannot avoid it, but this can be fixed with a ticket and migration when a better slot becomes available.
From a system standpoint, organizing hosts into "zones" isn't too tough at the Amazon Web Services scale, but within a single facility, it gets difficult and perhaps even meaningless. Here's some random thinking, of the sort I like to do when I don't want to mow the lawn.
The rest of this is going to consist of a few kilograms of 100% pure, uncut Colombian speculation, and does not describe Linode's infrastructure:
First, we'll assume there's a pattern to Linode's hardware deployment, and that it's done in a methodical, organized fashion. This is not a bad assumption: the hardware set is homogeneous (we'll neglect the backup storage beasts and the border routers), numbered sequentially, and installed/maintained by remote hands who must be told "run a blue patch cable from host123 port 2 to switch8 port 15".
Second, we'll assume the following limits. These are entirely guesses, but it helps to put numbers on things: 30 hosts per rack; four 24-port switches per rack, each serving 15 hosts (two for the top half, two for the bottom half, in a redundant configuration); and six remote-controlled power distribution units per rack, each feeding 5 servers (and two-thirds of a switch? Let's go with dual power supplies on the switches; it'll make things easier) from its own 20-ampere circuit. We'll also assume these circuits are provisioned from a three-phase wye system in sequential order (circuits 1, 4, 7, ... are phase X to neutral; 2, 5, 8, ... are Y to neutral; and so on).
From here, we can come up with a pattern like:
PDU1 depends on Circuit 1 depends on Phase X
PDU2 depends on Circuit 2 depends on Phase Y
PDU3 depends on Circuit 3 depends on Phase Z
PDU4 depends on Circuit 4 depends on Phase X
PDU5 depends on Circuit 5 depends on Phase Y
PDU6 depends on Circuit 6 depends on Phase Z
SWITCH1 depends on PDU1 or PDU2
SWITCH2 depends on PDU2 or PDU3
SWITCH3 depends on PDU4 or PDU5
SWITCH4 depends on PDU5 or PDU6
HOST{1,2,3,4,5} depends on PDU1 and (SWITCH1 or SWITCH2)
HOST{6,7,8,9,10} depends on PDU2 and (SWITCH1 or SWITCH2)
HOST{11,12,13,14,15} depends on PDU3 and (SWITCH1 or SWITCH2)
HOST{16,17,18,19,20} depends on PDU4 and (SWITCH3 or SWITCH4)
HOST{21,22,23,24,25} depends on PDU5 and (SWITCH3 or SWITCH4)
HOST{26,27,28,29,30} depends on PDU6 and (SWITCH3 or SWITCH4)
This would repeat per rack (with rack 2 containing hosts 31..60, circuits 7..12, PDUs 7..12, switches 5..8).
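Purely for illustration, that numbering pattern can be written down as a small Python sketch. Everything here is hypothetical (the function name, the 30/5/15 constants, the phase rotation) and follows the guesses above, not Linode's actual layout; it also doesn't model which PDUs feed which switch.

```python
# Speculative host-numbering scheme from the text above -- NOT Linode's
# actual layout. Host, rack, PDU, circuit, and switch numbers are 1-based.

def topology(host):
    """Map a host number to its guessed rack, PDU/circuit, phase,
    and redundant switch pair under the assumptions above."""
    rack = (host - 1) // 30 + 1           # 30 hosts per rack
    pdu = (host - 1) // 5 + 1             # 5 hosts per PDU, numbered globally
    circuit = pdu                         # one 20 A circuit per PDU
    phase = "XYZ"[(circuit - 1) % 3]      # wye phases assigned sequentially
    half = ((host - 1) % 30) // 15        # 0 = top half of rack, 1 = bottom
    first_switch = (rack - 1) * 4 + half * 2 + 1
    return {"rack": rack, "pdu": pdu, "circuit": circuit,
            "phase": phase, "switches": (first_switch, first_switch + 1)}
```

So host 31 lands in rack 2, on PDU/circuit 7 (phase X), behind switches 5 and 6, matching the "rack 2 contains hosts 31..60" pattern above.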
Fate-sharing groupings might be based on PDU/circuit (5 hosts) or switch pair (15 hosts). Since internal power distribution is usually in a tree configuration, grouping by power distribution panel (~16 circuits?) might also make sense. The main breaker on a panel will knock out a range of circuits, as happened in October 2009.
Failures of a single phase (which the Tuesday Fremont outage smells an awful lot like) would span multiple PDUs; losing Phase X would drop PDUs 1, 4, 7, 10, etc., or hosts 1-5, 16-20, 31-35, 46-50, etc. However, single-phase failures within a datacenter are rather rare, since internal three-phase breakers are ganged together (like the breakers powering your air conditioner in an American-style split-phase residential three-wire system), and a UPS will take care of upstream problems of this sort. It's pretty obvious at this point that FMT1 lacks something a reasonable person would consider a "UPS," at least as of two weeks ago, but that's a much bigger problem.
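The phase-failure blast radius is easy to enumerate under the same speculative numbering (again, guesses, not real topology): 5 hosts per PDU, one circuit per PDU, circuit n on phase "XYZ"[(n-1) % 3].

```python
# Which hosts share fate when one phase drops? Same speculative
# numbering as above: 5 hosts per PDU, circuits assigned to phases
# X, Y, Z in rotation. Not Linode's actual layout.

def hosts_on_phase(phase, total_hosts=60):
    """Return the host numbers whose PDU circuit sits on the given phase."""
    down = []
    for host in range(1, total_hosts + 1):
        circuit = (host - 1) // 5 + 1
        if "XYZ"[(circuit - 1) % 3] == phase:
            down.append(host)
    return down
```

For two racks' worth of hosts, losing phase X would take out hosts 1-5, 16-20, 31-35, and 46-50, exactly the stride described above.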
This analysis neglects (at a minimum) the core switches and border routers between the rack switches and the Internet (plus DNS resolvers, etc.), physical issues, and, most importantly, software/operational failures. Those will probably be rather unpredictable in scope, if they don't take out entire datacenters. (I'd link to coverage of the recent Amazon EC2/EBS problems here, but cloudfail.net is down.) This also neglects differences between datacenters: it's possible that different facilities have different specifications for servers per rack, amps per circuit, circuits per panel, delta vs. wye, and so on.
(As a reminder, the above is a COMPLETE AND TOTAL FABRICATION and is FULL OF LIES. If I see this cited as The Truth by ANYONE, I'll have to post a face of disapproval.)
All that said, I hypothesize that the fate-sharing probability between two hosts is inversely proportional to the difference between their numbers; that is, newark10 and newark11 are significantly more likely to be impacted by the same problem than newark10 and newark110. But this is more a rule of thumb than a strategy, and it fails in cases like the Newark partial power outage of 2009 (where newark121 and newark182 shared more fate than newark120 and newark121).
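If you wanted to act on that rule of thumb anyway, it amounts to nothing more than a toy scoring function like this (entirely made up, and wrong in exactly the panel-failure cases just mentioned):

```python
# Crude fate-sharing heuristic: treat numeric distance between host
# numbers as an inverse proxy for shared-failure likelihood. A toy,
# known to fail for panel-level outages that span distant numbers.

def fate_sharing_score(host_a, host_b):
    """Higher score = guessed more likely to share a failure."""
    return 1.0 / (1.0 + abs(host_a - host_b))
```

Under this heuristic, newark10 vs. newark11 scores far higher than newark10 vs. newark110, which is the whole hypothesis in one line.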
So, where was I... oh yes, open a ticket, explain the situation, and see what happens. I think your concern is extremely valid and your suggestion is a good one, but it's a complicated problem that most folks don't care about, so it probably won't become "automatic."