Linode Forum
Linode Community Forums
Posted: Mon Nov 22, 2010 7:35 am
Senior Member

Joined: Mon Dec 07, 2009 6:46 am
Posts: 331
Did anyone else in Newark have gaps in their graphs until about 17:30 UTC yesterday, on all their Linodes? The nodes were up and fine; I know that from external monitoring.

Can this be related to the Fremont failure?


Posted: Mon Nov 22, 2010 9:44 am
Senior Member

Joined: Fri Oct 24, 2003 3:51 pm
Posts: 965
Location: Netherlands
Azathoth wrote:
Did anyone else in Newark have gaps in their graphs until about 17:30 UTC yesterday, on all their Linodes? The nodes were up and fine; I know that from external monitoring.

Yes.

Azathoth wrote:
Can this be related to the Fremont failure?

I'm guessing that this was the case.

_________________
/ Peter


Posted: Mon Nov 22, 2010 3:01 pm
Senior Newbie

Joined: Fri Nov 19, 2010 6:05 pm
Posts: 11
vonskippy wrote:
JasonTokoph wrote:
I'll be happy when a post mortem is in my inbox along with an apology for lack of communication.

...They were down, they fixed it, they reported it, it's a rare event, get over it already.


It's completely reasonable to ask why something that shouldn't happen happened. It sounds like it wasn't their fault, and they recovered beautifully. But we have a fragmentary bit of info on the status blog, and I'd like to know how (or if) they will prevent this from happening in the future, or if this is the expected response and everything went as well as could be.


Posted: Mon Nov 22, 2010 3:42 pm
Senior Member

Joined: Sat Feb 10, 2007 7:49 pm
Posts: 96
Website: http://www.arbitraryconstant.com/
JasonTokoph wrote:
While it's nice that the pseudo-redundancy was there, I'm still disappointed that I had to learn of the outage by searching Twitter for "linode". No email, no phone call, not even an @linode tweet. Unless I'm blind, there isn't a link to the status page on the main linode.com site.

A phone call? Seriously?

Jay3ld wrote:
I hope they really look into why the UPS system failed. I was shocked to hear a power outage would take down a hosting company. I can understand it now that I see the UPS failed. Still, the fact that the UPS systems failed tells me that something went wrong in testing them to make sure they worked.

It's entirely possible for this stuff to fail even with regular testing.

The reason it's so important for a company like Linode to be able to recover is because it simply isn't possible to completely protect yourself from failures like this. It happens.

Up and running before start of business the next day is an amazing effort, made possible by hard work from competent staff and planning ahead to have the spares available.


Posted: Mon Nov 22, 2010 6:02 pm
Senior Member

Joined: Wed May 13, 2009 1:18 am
Posts: 681
Internat wrote:
Email is useless as a status message medium, especially given that my front-end client access server for mail was one of the nodes that went down. Such is life though :)

I don't know - given how many people have smart devices on their person nowadays that can notify them of new email on a real time basis, it still seems a perfectly good communications channel to support. Certainly likely to be the most ubiquitous.

Besides, I think it's safe to assume that if email status updates were offered, you'd want to put in an email address that was independent from the host for which you were getting status...

-- David


Posted: Mon Nov 22, 2010 6:09 pm
Senior Member

Joined: Fri Dec 11, 2009 7:09 pm
Posts: 168
db3l wrote:
Besides, I think it's safe to assume that if email status updates were offered, you'd want to put in an email address that was independent from the host for which you were getting status...

-- David


I would never use a "mydomain" email address in dealing with my host, whether for support, billing or whatnot.
Though Google handles my "mydomain" email, and I'm using third-party DNS, so....
A dedicated Twitter account might be a good idea, something like @linodealerts. I guess RSS is getting pretty old school.
:)

_________________
--
Chris Bryant


Posted: Mon Nov 22, 2010 6:35 pm
Senior Member

Joined: Wed May 13, 2009 1:18 am
Posts: 681
ArbitraryConstant wrote:
Jay3ld wrote:
I hope they really look into why the UPS system failed. I was shocked to hear a power outage would take down a hosting company. I can understand it now that I see the UPS failed. Still, the fact that the UPS systems failed tells me that something went wrong in testing them to make sure they worked.

It's entirely possible for this stuff to fail even with regular testing.

I don't know that I'd be quite that forgiving or understanding, depending on what the actual underlying outage was. Modern data centers should be at least N+1 redundant, and no, I don't think it's unreasonable to expect failure scenarios to be regularly tested, and the redundant systems to function correctly in the very scenarios they are designed to cope with. More often than not, an error of some sort is involved.

With that said, the status mentioned a serious lightning storm, but not whether the data center "outage" came from a direct strike, or just from losing utility power as an indirect result. In the latter case it should be much easier to expect no interruption, while a direct strike could well fry enough front-end components while dissipating the energy involved to be a problem, even with an otherwise robust design.

In the end this is all verbal bit twiddling until we get further analysis and a post-mortem, but I certainly wouldn't fault (nor discourage) anyone at this point for wondering why the redundancies and UPSes failed to function as intended. To me, it's especially interesting that equipment within the protected zone behind the UPSes got damaged, since that seems to imply a potential issue in design (such as an unprotected electrical path somewhere). But even that could in theory be explained by induced currents if the strike was close enough.

-- David


Posted: Mon Nov 22, 2010 10:32 pm
Junior Member

Joined: Mon Feb 22, 2010 9:40 pm
Posts: 37
I can remember something being said about a surge? If it was something like that it could have knocked out the redundant power systems. Power seems to be very, very hard to get complete redundancy in, since you are dealing with huge amounts of power in a data center.

It would be good to get a report of the investigation into what happened, but I wouldn't be pointing fingers until we know what actually happened.


Posted: Mon Nov 22, 2010 10:46 pm
Senior Newbie

Joined: Fri Nov 19, 2010 6:05 pm
Posts: 11
jords wrote:
I can remember something being said about a surge? If it was something like that it could have knocked out the redundant power systems. Power seems to be very, very hard to get complete redundancy in, since you are dealing with huge amounts of power in a data center.


The status post said lightning strike, which knocked out power and "redundant" UPSes. I was in co-lo for nearly a decade, and spent a lot of time talking to them about power. Their setup, which is standard for Tier 1 data centers (which I presume from Linode's description is where they operate), requires that the capacity be in place to provide continuous power through UPSes to all servers. That is, the power doesn't come straight from the mains, but is conditioned through the UPS units, which are constantly charged. If the mains drop, the UPSes continue their function without interruption. UPSes have to be scaled to handle this, of course, and need regular testing and battery swapouts.

At Tier 1 facilities, sufficient generator capacity is in place to operate indefinitely with fuel deliveries, and for some number of days with fuel on hand. Generators can sometimes kick on in a matter of seconds.

The fact that Linode's facility had primary (mains) and secondary (UPS) failures can be perfectly reasonable if a lightning strike overwhelmed systems. It sounds like the power conditioning prevented hardware from melting down, which is awesome.

Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.


Posted: Mon Nov 22, 2010 10:53 pm
Junior Member

Joined: Mon Feb 22, 2010 9:40 pm
Posts: 37
Agreed - I think the thing is, redundancy costs. A lot. I'm sure that the facilities powering the backbone systems for the internet have more redundancy than a Linode host, but if you need that kind of reliability you have to be prepared to pay for it. Linode still has great uptime and it takes a major event to knock them offline, but if it can happen it will. And even if it can't happen it probably still will, eventually.


Posted: Tue Nov 23, 2010 12:36 am
Senior Member

Joined: Wed May 13, 2009 1:18 am
Posts: 681
pundit wrote:
Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.

I'm pretty sure the RAID cards in the Linode hosts do have BBUs[*], but of course that really only ensures that data won't get lost in the cache; it doesn't provide additional runtime in the absence of power, which will shut down the host processor in any event.

-- David

[*] A quick forum search only showed a status report for a host, but it did mention the card's BBU. I'm pretty sure I recall it being discussed in more general threads.


Posted: Tue Nov 23, 2010 12:46 am
Senior Member

Joined: Wed May 13, 2009 1:18 am
Posts: 681
pundit wrote:
UPSes have to be scaled to handle this, of course, and need regular testing and battery swapouts.

It might also be worth mentioning that the scaling isn't necessarily as bad as it may sound, since the job of the UPS is typically only to provide power long enough for the generator to spin up (done automatically on power loss) and its power output to stabilize. So the UPS needs enough capacity for maybe 10-20 seconds plus whatever additional latency the data center establishes for the generator. Non-battery, flywheel UPS solutions have also become more popular in recent years, I think, with materials advances cutting down on the size/weight of the flywheel and definite advantages in terms of maintenance (up to a 10-20 year cycle, and no batteries to replace).
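
For a rough sense of the numbers, here is a minimal Python sketch of that sizing argument. The 500 kW load, the 20-second generator start and the 10-second extra delay are made-up figures for illustration, not anything published by Linode or its data center, and the function name is hypothetical too.

```python
def bridge_energy_kwh(load_kw, generator_start_s, extra_delay_s=0.0):
    """Energy (kWh) a UPS must supply while the generator starts and stabilizes."""
    bridge_seconds = generator_start_s + extra_delay_s
    return load_kw * bridge_seconds / 3600.0  # kW * s -> kWh


# Hypothetical figures: 500 kW load, 20 s generator start, 10 s extra delay.
short_bridge = bridge_energy_kwh(500, 20, 10)
# Same hypothetical load carried for a full 5 minutes instead.
five_minutes = bridge_energy_kwh(500, 300)

print(f"30 s bridge:  {short_bridge:.2f} kWh")
print(f"5 min bridge: {five_minutes:.2f} kWh")
```

This only covers stored energy. The UPS still has to be rated to deliver the full load power for those seconds, so a short bridge shrinks the storage but not the power electronics.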

-- David


Posted: Tue Nov 23, 2010 10:20 am
Senior Member

Joined: Tue May 26, 2009 3:29 pm
Posts: 1691
Location: Montreal, QC
db3l wrote:
pundit wrote:
UPSes have to be scaled to handle this, of course, and need regular testing and battery swapouts.

It might also be worth mentioning that the scaling isn't necessarily as bad as it may sound, since the job of the UPS is typically only to provide power long enough for the generator to spin up (done automatically on power loss) and its power output to stabilize. So the UPS needs enough capacity for maybe 10-20 seconds plus whatever additional latency the data center establishes for the generator. Non-battery, flywheel UPS solutions have also become more popular in recent years, I think, with materials advances cutting down on the size/weight of the flywheel and definite advantages in terms of maintenance (up to a 10-20 year cycle, and no batteries to replace).

-- David


The problem is that if a genset fails to come fully online within the few seconds that the flywheel can provide power, then you're boned. From what I've seen, UPS installations can usually provide a few minutes of power, but I could be talking out of my ass here.


Posted: Tue Nov 23, 2010 1:29 pm
Senior Newbie

Joined: Fri Nov 19, 2010 6:05 pm
Posts: 11
db3l wrote:
pundit wrote:
Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.

I'm pretty sure the RAID cards in the Linode hosts do have BBUs[*], but of course that really only ensures that data won't get lost in the cache; it doesn't provide additional runtime in the absence of power, which will shut down the host processor in any event.


They had drive failures, which seems unlikely (but possible) if they had battery backups on the RAIDs. The battery would prevent a head crash and other hardware problems.


Posted: Tue Nov 23, 2010 2:57 pm
Senior Member

Joined: Wed May 13, 2009 1:18 am
Posts: 681
Guspaz wrote:
The problem is that if a genset fails to come fully online within the few seconds that the flywheel can provide power, then you're boned. From what I've seen, UPS installations can usually provide a few minutes of power, but I could be talking out of my ass here.

Well, it's certainly something to take into consideration, but my understanding is that current flywheel systems, which have been able to use newer materials (lighter and stronger) to spin faster (stored energy grows with the square of rpm but only linearly with mass), can bridge standard generator startup times. If a generator fails to start quickly, the odds of it getting fixed within the slightly longer window of a UPS aren't really that high, so the benefit of that window is marginal.
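
To put the rpm-versus-mass point in numbers, here is a small Python sketch using the textbook kinetic energy of a rotating body, E = 1/2 * I * omega^2. It assumes a solid disk rotor (I = 1/2 * m * r^2), and the 100 kg, 0.3 m, 10,000 rpm figures are invented for illustration; real flywheel UPS rotors will have different shapes and numbers.

```python
import math


def flywheel_energy_j(mass_kg, radius_m, rpm):
    """Kinetic energy (J) of a solid disk flywheel: E = 1/2 * I * omega^2."""
    inertia = 0.5 * mass_kg * radius_m ** 2  # solid disk: I = 1/2 * m * r^2
    omega = rpm * 2.0 * math.pi / 60.0       # rev/min -> rad/s
    return 0.5 * inertia * omega ** 2


# Hypothetical rotor: 100 kg, 0.3 m radius, 10,000 rpm.
base = flywheel_energy_j(100, 0.3, 10_000)
double_mass = flywheel_energy_j(200, 0.3, 10_000)
double_rpm = flywheel_energy_j(100, 0.3, 20_000)

print(f"double the mass: x{double_mass / base:.1f} energy")
print(f"double the rpm:  x{double_rpm / base:.1f} energy")
```

Doubling the mass doubles the stored energy, while doubling the rpm quadruples it, which is why stronger materials that tolerate higher speeds matter more than heavier rotors.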

I'm definitely not an expert in either system, but I think I've seen growing references to data centers choosing flywheel systems in recent years. Certainly from an operational (and maintenance) perspective they have a lot of attractive qualities.

-- David

