Linode Community Forums
Posted: Fri Aug 19, 2011 1:32 pm | Junior Member (joined Wed Jul 27, 2011; 31 posts; http://eschercms.org)
I'm relatively new here and so far I'm really enjoying Linode. You guys have done a great job with the Linode Manager. Of course, there's always room for improvement, right? Here are a few ideas gleaned from my first few weeks working with Linode.

1. The page for viewing an individual linode helpfully shows the name of the physical host the node resides on. Please also show an indication of the node's cabinet, local switch and power circuit so that I can tell if any of my instances that should be redundant (to the degree possible within a single DC) are sharing a failure point. For example, I want to ensure that a pair of database servers (in the same DC) are on completely separate hosts, switches and circuits. Currently, there is no way for me to see this.

2. On the same page, configuration profiles and disk images appear to be sorted according to creation date. It would be very helpful if I could determine the order of items in these lists. Drag and drop would be great. But even a simpler solution of sorting alphabetically by name would allow me to achieve a desired order by prepending a number to the name of the item. The ability to further organize these items into named groups would be icing on the cake.

3. Show the physical host information (including the info from #1) in the Linode list view page. I would like to see this information at a glance for all nodes. Also, make the column headers clickable links that sort the list of nodes according to the clicked column.

4. When I create a new Linode, I can choose the DC. For in-DC redundancy purposes, I would also like to choose a specific physical host, based on the host's cabinet or switch/circuit. Or at least indicate that the newly created node should not share any failure points with my other nodes in the same DC.

Thanks for creating such a great system and for considering these suggestions.

_________________
Got Escher? | @artagesw


Last edited by artagesw on Sat Aug 20, 2011 11:39 am, edited 1 time in total.

Posted: Fri Aug 19, 2011 4:16 pm | Senior Member (Montreal, QC; joined Tue May 26, 2009; 1691 posts)
I think you're asking to get deeper into Linode's infrastructure than they're willing to let customers go, and it's also not necessarily the point of cloud hosting. You're far more likely to suffer downtime from a failed host or a DC-wide issue than from some bit of shared infrastructure anyhow, judging by the outages seen over the years. I'm fairly certain Linode will never let you pick which host your node goes on.

You're far better off going for inter-datacenter redundancy than intra-datacenter redundancy. It's the only way to ensure proper isolation anyhow; there is *ALWAYS* going to be some single point of failure in any given DC. Witness all the power issues in Fremont, for example.


Posted: Fri Aug 19, 2011 4:24 pm | Senior Member (joined Fri May 02, 2008; 1121 posts)
I second #2 and #3. It would be great to have everything sorted alphabetically by default, and also make them sortable by any column. I don't know what Linode uses to store and fetch the data, but most of the time it's as trivial as appending ORDER BY NAME. Sorting by arbitrary columns can be done on the client side with JavaScript.
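To illustrate how cheap arbitrary-column sorting is, here's a minimal Python sketch (hypothetical column names, not Linode's actual schema); the client-side JavaScript version is the same idea with an array sort and a comparator:

```python
from operator import itemgetter

# Hypothetical rows, roughly what a Linode list view might display.
nodes = [
    {"label": "web2", "host": "newark110", "ram_mb": 512},
    {"label": "db1",  "host": "newark42",  "ram_mb": 1024},
    {"label": "web1", "host": "newark97",  "ram_mb": 512},
]

def sort_nodes(rows, column, descending=False):
    # Equivalent of appending ORDER BY <column> [DESC] to the query.
    return sorted(rows, key=itemgetter(column), reverse=descending)

print([n["label"] for n in sort_nodes(nodes, "label")])
# -> ['db1', 'web1', 'web2']
```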

As for #1 and #4, I agree with @Guspaz that the kind of information you're requesting is probably more than what Linode is willing to expose to customers. They'll probably help you if you raise a ticket asking specifically to have your linodes placed on different hosts, but as far as I can tell, they're usually quite secretive about their infrastructure. Also, the vast majority of connectivity and power issues affect the entire datacenter, so the benefit of having your linodes in different cabinets would be minimal.


Posted: Fri Aug 19, 2011 5:43 pm | Senior Member (Colorado, USA; joined Sun Dec 27, 2009; 1038 posts)
Probably a moot point, as HE is releasing pics of their new backup power plan.

[image]


Posted: Fri Aug 19, 2011 9:40 pm | Junior Member (joined Wed Jul 27, 2011; 31 posts; http://eschercms.org)
Guspaz wrote:
I think you're asking to get deeper into Linode's infrastructure than they're willing to let customers go, and it's also not necessarily the point of cloud hosting... I'm fairly certain Linode will never let you pick which host your node goes on.


Perhaps I worded my request poorly. I don't need to choose a specific physical host directly. I just need to be able to direct the system to choose a host based on a set of "availability" parameters. And I don't need to know anything about Linode's internal "magic sauce" either. All it would take would be for Linode to create an abstraction around the notion of a "zone" (for lack of a better term). A zone is an area of Linode's internal deployment that has the fewest possible components (switches, routers, power circuits, etc.) in common with any other zone. Then, give each zone a label and apply the appropriate label to each host. Now, let me choose a data center and a zone whenever I create a new node, and show the Linode's zone along with the other host info in Linode Manager.
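To make the idea concrete, here is a toy sketch of what zone-aware placement could look like from the provisioning side (all host names and zone labels are invented for illustration; this is not Linode's code or API):

```python
# Hypothetical host inventory: host name -> (datacenter, zone label).
# A "zone" is a set of hosts sharing as little infrastructure as possible
# with other zones in the same DC.
HOSTS = {
    "newark10":  ("newark", "zone-a"),
    "newark42":  ("newark", "zone-b"),
    "newark97":  ("newark", "zone-c"),
    "fremont12": ("fremont", "zone-a"),
}

def pick_host(dc, my_nodes):
    """Choose a host in `dc` whose zone isn't shared with any of my existing nodes."""
    used_zones = {HOSTS[h][1] for h in my_nodes if HOSTS[h][0] == dc}
    for host, (host_dc, zone) in HOSTS.items():
        if host_dc == dc and zone not in used_zones:
            return host
    return None  # no isolated zone free; fall back to a support ticket

print(pick_host("newark", ["newark10"]))  # -> newark42 (a different zone)
```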

I'm not sure what you're getting at re: "it's also not necessarily the point of cloud hosting." Isn't a major point of cloud hosting to remove the complexity associated with reliable redundant deployments? It is for me.

Guspaz wrote:
You're far more likely to suffer downtime from a failed host or a DC-wide issue than some bit of shared infrastructure, anyhow, judging by the outages seen over the years.


I disagree. I have seen downtime caused by switch and router failures many times: sometimes due to hardware failure, sometimes due to misconfiguration or a botched firmware upgrade (administrator error).

If you want your deployment to be as redundant as possible (within the limitations of the host/DC), you care about minimizing your SPFs. You certainly wouldn't want your entire load-balanced cluster of web server VPSes running on the same physical host, right? Then why would you want them all in the same cabinet attached to the same switch? Switches and routers are only marginally more reliable than a good quality server these days, especially if the server is outfitted with redundant power supplies and NICs and RAIDed drives.

Guspaz wrote:
You're far better off going for inter-datacenter redundancy than intra-datacenter redundancy.


This is not an either-or. I have inter-datacenter redundancy covered. But I need intra-datacenter redundancy as well. Cross-datacenter database replication still poses challenges for real-time transaction-oriented systems, for example.

Guspaz wrote:
There is *ALWAYS* going to be some single point of failure in any given DC. Witness all the power issues in Fremont, for example.


I've set up physical colo deployments that had full redundancy with no SPF all the way out to the power transformers on the utility poles. A configuration like that can survive a direct lightning strike to the power transformer. Yes, there will always be a point of failure somewhere. The goal is to minimize them, not to just give up and say "Well, I've got one point of failure, so might as well have four."

_________________
Got Escher? | @artagesw


Posted: Fri Aug 19, 2011 9:58 pm | Senior Member (joined Fri Jan 09, 2009; 634 posts)
artagesw wrote:
Guspaz wrote:
You're far more likely to suffer downtime from a failed host or a DC-wide issue than some bit of shared infrastructure, anyhow, judging by the outages seen over the years.


I disagree. I have seen downtime caused by switch and router failures many times: sometimes due to hardware failure, sometimes due to misconfiguration or a botched firmware upgrade (administrator error).


He was talking about experience with Linode specifically.


Posted: Sat Aug 20, 2011 12:27 am | Senior Member (joined Fri May 02, 2008; 1121 posts)
artagesw wrote:
All it would take would be for Linode to create an abstraction around the notion of a "zone" (for lack of a better term). A zone is an area of Linode's internal deployment that has the fewest possible components (switches, routers, power circuits, etc.) in common with any other zone.

You mean, like Amazon's "availability zones"? Interesting. I wonder how large a company needs to be in order for something like that to make sense from a business point of view.


Posted: Sat Aug 20, 2011 6:40 am | Senior Newbie (joined Fri Aug 05, 2011; 15 posts)
Off-Topic:
vonskippy wrote:
Probably a moot point, as HE is releasing pics of their new backup power plan. [image snipped]


I hope you always continue to be who you are vonskippy :)

I think you actually encouraged me in part to join linode. So if anyone starts passing around credit for my being here make sure you claim some for yourself.

On-Topic:

I also like the idea of being able to sort disk images and think it would be useful to others, though I wouldn't personally use it at this time.

Cheers!


Posted: Sat Aug 20, 2011 9:38 am | Senior Member (joined Fri Jan 09, 2009; 634 posts)
hybinet wrote:
artagesw wrote:
All it would take would be for Linode to create an abstraction around the notion of a "zone" (for lack of a better term). A zone is an area of Linode's internal deployment that has the fewest possible components (switches, routers, power circuits, etc.) in common with any other zone.

You mean, like Amazon's "availability zones"? Interesting. I wonder how large a company needs to be in order for something like that to make sense from a business point of view.


Except that my impression is that "availability zone" = datacenter or even a group of datacenters, no?


Posted: Sat Aug 20, 2011 11:20 am | Junior Member (joined Wed Jul 27, 2011; 31 posts; http://eschercms.org)
glg wrote:
hybinet wrote:
You mean, like Amazon's "availability zones"? Interesting. I wonder how large a company needs to be in order for something like that to make sense from a business point of view.


Except that my impression is that "availability zone" = datacenter or even a group of datacenters, no?


Right. My understanding is that Amazon's availability zones are regional entities that encompass multiple data centers. I'm proposing the simpler concept of an in-datacenter zone, since Linode's multiple data centers already provide the essential equivalent of "regional zones."

_________________
Got Escher? | @artagesw


Last edited by artagesw on Sat Aug 20, 2011 11:37 am, edited 1 time in total.

Posted: Sat Aug 20, 2011 11:36 am | Junior Member (joined Wed Jul 27, 2011; 31 posts; http://eschercms.org)
glg wrote:
artagesw wrote:
Guspaz wrote:
You're far more likely to suffer downtime from a failed host or a DC-wide issue than some bit of shared infrastructure, anyhow, judging by the outages seen over the years.


I disagree. I have seen downtime caused by switch and router failures many times: sometimes due to hardware failure, sometimes due to misconfiguration or a botched firmware upgrade (administrator error).


He was talking about experience with Linode specifically.


I understand that. And one of the reasons I chose to come to Linode is its pretty decent reliability track record. However, building a reliable deployment is about taking measures to protect against possible **future** failures, i.e. what might happen. The fact that a particular type of failure is rare or hasn't been experienced by a particular customer in the past, while nice to know, has no bearing on whether it will happen in the future.

_________________
Got Escher? | @artagesw


Posted: Sat Aug 20, 2011 1:06 pm | Senior Member (Rochester, New York; joined Sat Aug 30, 2008; 1739 posts)
I believe Linode has worked with folks in the past to reduce power (and perhaps network) SPOFs. It's a somewhat manual process, but it'd be worth opening a ticket to make sure.

I do know the provisioning system avoids, whenever possible, putting more than one instance on the same host for a given account. It will happen if it absolutely cannot avoid it, but this can be fixed with a ticket and migration when a better slot becomes available.

From a system standpoint, organizing hosts into "zones" isn't too tough at the Amazon Web Services scale, but within a single facility, it gets difficult and perhaps even meaningless. Here's some random thinking, of the sort I like to do when I don't want to mow the lawn.

The rest of this is going to consist of a few kilograms of 100% pure, uncut Colombian speculation, and does not describe Linode's infrastructure:

First, we'll assume there's a pattern to Linode's hardware deployment, and that it is done in a methodical, organized fashion. This is not a bad assumption, since the hardware set is homogeneous (we'll neglect the backup storage beasts and the border routers), numbered sequentially, and installed/maintained by remote hands who must be told "run a blue patch cable from host123 port 2 to switch8 port 15".

Second, we'll assume the following limits. Entirely guesses, but it helps to stick numbers here: 30 hosts per rack, 4 24-port switches per rack with 15 hosts each (two for the top half, two for the bottom half, in a redundant configuration), 6 remote-controlled power distribution units per rack with 5 servers (and 2/3rds of a switch? Let's go with dual power supplies on those; it'll make things easier) and a 20-ampere circuit each. We'll also assume these circuits are provisioned from a three-phase wye system in a sequential order (circuits 1, 4, 7, ... are phase X to neutral, 2, 5, 8, ... are Y to neutral, etc).

From here, we can come up with a pattern like:

PDU1 depends on Circuit 1 depends on Phase X
PDU2 depends on Circuit 2 depends on Phase Y
PDU3 depends on Circuit 3 depends on Phase Z
PDU4 depends on Circuit 4 depends on Phase X
PDU5 depends on Circuit 5 depends on Phase Y
PDU6 depends on Circuit 6 depends on Phase Z

SWITCH1 depends on PDU1 or PDU2
SWITCH2 depends on PDU2 or PDU3
SWITCH3 depends on PDU4 or PDU5
SWITCH4 depends on PDU5 or PDU6

HOST{1,2,3,4,5} depends on PDU1 and (SWITCH1 or SWITCH2)
HOST{6,7,8,9,10} depends on PDU2 and (SWITCH1 or SWITCH2)
HOST{11,12,13,14,15} depends on PDU3 and (SWITCH1 or SWITCH2)
HOST{16,17,18,19,20} depends on PDU4 and (SWITCH3 or SWITCH4)
HOST{21,22,23,24,25} depends on PDU5 and (SWITCH3 or SWITCH4)
HOST{26,27,28,29,30} depends on PDU6 and (SWITCH3 or SWITCH4)

This would repeat per rack (with rack 2 containing hosts 31..60, circuits 7..12, PDUs 7..12, switches 5..8).

Fate-sharing groupings might be based on PDU/Circuit (5 hosts) or switch pair (15 hosts). Since internal power distribution is usually in a tree configuration, grouping by power distribution panel (~16 circuits?) might also make sense. The main breaker on a panel will knock out a range of circuits, as happened in October 2009.

Failures of a single phase (which the Tuesday Fremont outage smells an awful lot like) would span multiple PDUs; losing Phase X would drop PDUs 1, 4, 7, 10, etc, or hosts 1-5, 16-20, 31-35, 46-50, etc. However, single phase failures within a datacenter are rather rare, since internal three-phase breakers are ganged together (like the breakers powering your air conditioner in an American-style split-phase residential three-wire system) and a UPS will take care of upstream problems of this sort. It's pretty obvious at this point that FMT1 lacks something a reasonable person would consider a "UPS," at least as of two weeks ago, but that's a much bigger problem.

This analysis neglects (at a minimum) the core switches and border routers between the rack switches and the Internet (plus DNS resolvers, etc), physical issues, and most importantly, software/operational failures. These are probably going to be rather unpredictable in their scope, if they don't take out entire datacenter(s). (I'd link to coverage of recent Amazon EC2/EBS problems here, but cloudfail.net is down.) This also neglects differences between datacenters: it is possible that different datacenters have different specifications for servers/rack, amps/circuit, circuits/panel, delta vs. wye, etc.

(As a reminder, the above is a COMPLETE AND TOTAL FABRICATION and is FULL OF LIES. If I see this cited as The Truth by ANYONE, I'll have to post a face of disapproval.)

All that said, I hypothesize that the fate-sharing probability between two hosts is inversely proportional to the difference between their numbers; that is, newark10 and newark11 are significantly more likely to be impacted by the same problem than newark10 and newark110. But, this is more of a rule-of-thumb than a strategy, and fails in cases like the Newark partial power outage of 2009 (where newark121 and newark182 shared more fate than newark120 and newark121).
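That rule of thumb is easy to encode. A throwaway Python sketch of the toy layout above (same invented numbers, and the same disclaimer applies: this is speculation, not Linode's real topology) that answers "which failure points do hosts A and B share?":

```python
def shared_failure_points(a, b):
    """Speculative fate-sharing between two host numbers, per the toy layout:
    30 hosts/rack, 6 PDUs/rack (5 hosts each), 2 redundant switch pairs/rack
    (15 hosts each), and circuits cycling through phases X, Y, Z."""
    def deps(n):
        rack, i = divmod(n - 1, 30)     # 0-based rack, index within rack
        pdu = rack * 6 + i // 5         # PDU == circuit in this model
        pair = rack * 2 + (0 if i < 15 else 1)
        return {
            "rack": f"rack{rack + 1}",
            "pdu": f"PDU{pdu + 1}",
            "switch_pair": f"pair{pair + 1}",
            "phase": "XYZ"[pdu % 3],
        }
    da, db = deps(a), deps(b)
    return {k: v for k, v in da.items() if db[k] == v}

print(shared_failure_points(1, 3))   # same rack, PDU, switch pair, and phase
print(shared_failure_points(1, 16))  # same rack and phase X only (PDU1 vs PDU4)
print(shared_failure_points(1, 31))  # different racks; only phase X shared
```

This reproduces the phase-failure pattern sketched earlier: hosts 1-5, 16-20, and 31-35 all land on phase X, so losing that phase takes out all three groups at once.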

So, where was I... oh yes, open a ticket, explain the situation, and see what happens. I think your concern is extremely valid and your suggestion is a good one, but it's a complicated problem that most folks don't care about, so it probably won't become "automatic."


Posted: Sat Aug 20, 2011 1:27 pm | Junior Member (joined Wed Jul 27, 2011; 31 posts; http://eschercms.org)
@hoopycat Interesting speculative analysis! I fully intend to request this via tickets when needed. But it would definitely be nice to see an indication of this in the Linode Manager (via a zone label of sorts), for confirmation, ease of management, and internal audit purposes. And if such a labeling mechanism were implemented, the next logical step would be the option to use it to further specify instance creation, eliminating the need for a support ticket and opening up the possibility of using the API to create instances in different "zones." Thanks for sharing your thoughts.

_________________
Got Escher? | @artagesw


Posted: Sat Aug 20, 2011 1:36 pm | Senior Member (Rochester, New York; joined Sat Aug 30, 2008; 1739 posts)
artagesw wrote:
The fact that a particular type of failure is rare or hasn't been experienced by a particular customer in the past, while nice to know, has no bearing on whether it will happen in the future.


As an optimist, I figure that problems that have happened in the past are less likely to happen in the future than the ones that haven't yet happened. :-) (Note, Fremont users, that the time between "the past" and "the future" is nonzero.)

_________________
Code:
/* TODO: need to add signature to posts */


Posted: Sat Aug 20, 2011 4:17 pm | Senior Member (joined Wed May 13, 2009; 681 posts)
artagesw wrote:
Right. My understanding is that Amazon's availability zones are regional entities that encompass multiple data centers. I'm proposing the simpler concept of an in-datacenter zone, since Linode's multiple data centers already provide the essential equivalent of "regional zones."

While I agree with other comments that this is a valid concern and would be interested in responses you may get to any tickets, I'm not sure I'd agree that what you are asking for is "simpler", particularly from Linode's viewpoint.

Inter-datacenter zones are essentially free from Linode's perspective (no coding, management, allocation policies, etc...) and about as isolated as you can get, sans perhaps common upstream paths.

However, intra-datacenter zones are a lot of work. First, the DC provider itself has to provide the necessary capabilities for identifying such hardware independence, which itself could be a major hurdle if Linode doesn't want to have to grab lots of partially utilized floor space to maintain independent power, which is often pre-run to fixed locations. Then Linode needs to track it all electronically, use it during provisioning and export the information to the manager. Plus stay tied into the data center for any engineering changes over time. Probably a ton of work, if even possible given a particular data center provider.

After all, building such redundancy into the core infrastructure is why you go for a Tier 1 data center to start with, recent Fremont failures notwithstanding. Fremont is a useful data point too, since virtually all of those failures would have taken out any intra-datacenter redundancy anyway, so the reliability gain may be minimal for all the provisioning work.

As above, I don't disagree that it's a valid desire. For me, though, I'd be surprised if Linode has this level of detail available for use by its provisioning systems, even just internally, much less provides an interface to it in the manager. I read your original request as almost assuming that such zone information was already available and just not exported to the manager. I suspect it's not even available to internal systems at this point.

-- David


Powered by phpBB® Forum Software © phpBB Group