Linode Forum
Linode Community Forums
Forum locked  This topic is locked, you cannot edit posts or make further replies.
Posted: Thu Oct 29, 2009 10:04 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
On Tuesday, October 27th we became aware of a problem that affected some of our newer host deployments. Upon investigation, it was determined that a software library was not present on these hosts due to an oversight in our build procedure. This shared library, once distributed across our fleet, caused an unforeseen incompatibility on some of the machines.

The affected machines were then unable to process jobs (boots, shutdowns, deployments, etc), unable to determine the current state of their Linodes, and were unable to manipulate their Linodes in any way (such as issuing shutdowns). Additionally, the affected machines incorrectly deconstructed networking for some Linodes.

Once system administrators on duty became aware of the issues with these machines, all staff members were immediately alerted to the issue and began working diligently to develop and execute a plan of action to conduct emergency maintenance on the affected machines. Given the urgency of the situation, with consideration to the largely inoperable state of these hosts, the decision was made to begin corrective actions without further delay. System administrators worked together to divide the workload and reported results back to managing personnel as the situation evolved.

Administrators worked until all outstanding customer issues were properly resolved, providing direct assistance to affected customers via support tickets and real-time support in our IRC channel. As a consequence of these events, we have significantly revised our internal procedures to include the following changes:
  • We have adopted more formalized testing procedures for any and all updates to host machines.
  • We have devised and enacted a more conservative approach to deploying system updates across our network. A key part of this approach is a gradual, more tightly controlled deployment method for such updates.
  • Through customer feedback and our own internal discussions on this matter, we recognize the need for greatly improved customer communication. We are committed to developing and implementing reliable mechanisms for achieving this goal. We appreciate the feedback we have received on potential means of addressing these concerns, and are evaluating all options.
We are continuing to give this matter the full attention our customers deserve, and sincerely apologize for any issues experienced throughout the service outage. As both a company and as individuals, we place a high value on the lessons learned from these unfortunate circumstances, and we look forward to being better able to serve our customers as a result of the knowledge gained.

-Linode Team

-------------------------------

On a personal note, I'd like to add that I am incredibly proud of our team's response to these issues. Every team member demonstrated unending dedication and devotion to resolving problems faced by our customers. Multiple personnel insisted on staying through the night to ensure the complete resolution of outstanding issues, while others worked remotely and came in early to relieve those that stayed, only to stay overnight themselves. Again, we apologize for the downtime, would like to reiterate our commitment to our customers, and assure you that this will serve to make Linode better.

-Chris


Posted: Thu Oct 29, 2009 10:40 pm
Junior Member

Joined: Wed Feb 18, 2009 4:50 pm
Posts: 47
While I was more than a little frustrated with the situation, I have to admit that I am impressed with the outcome.

I could not imagine having to resolve issues for that many clients while keeping a cool head.

I'm confident that there will not be a repeat of this incident and I appreciate the honest breakdown in this thread.


Posted: Fri Oct 30, 2009 10:02 am
Senior Member

Joined: Fri Jun 13, 2008 4:11 pm
Posts: 65
Website: http://www.skafari.com
Thank you for the analysis and all your hard work. I'm glad you guys are putting some more formal procedures around how maintenance is performed.


Posted: Fri Oct 30, 2009 10:08 am
Newbie

Joined: Mon Mar 02, 2009 10:53 am
Posts: 3
AOL: soundfreak290
Location: Maine, US
This is why I am with Linode exclusively. I have yet to come across a host like you guys who are so dedicated to your customers, personable yet highly professional. Your communication in situations such as these is unparalleled. I sense the high level of morale in both your customer base and staff here at Linode.


David G


Posted: Fri Oct 30, 2009 11:12 am
Senior Newbie

Joined: Tue Aug 12, 2008 9:39 am
Posts: 6
Website: http://www.sandycovesw.com
Location: CT, USA
Thanks for the update, and I am glad to see that there are some changes being put in place to help keep something like this from happening again.


Posted: Fri Oct 30, 2009 11:37 am
Senior Newbie

Joined: Tue Oct 06, 2009 5:07 pm
Posts: 5
Website: http://plzkthxbai.com
AOL: mcowboye19
I also really appreciate the post and explanation of what actually happened. I was *extremely* impressed with the support on the IRC channel during Tuesday night when the hosts were starting to come back. The Linode admins were juggling hundreds of questions and answering them surprisingly quickly! (Jed you rock!)

I would have liked to see more updates via Twitter as the outage was happening. There were a few tweets from @linode but most of them were not as informative as I would have liked. Maybe a @linode_support account could be created to better inform customers when this type of thing is happening? Twitter seems like a great channel for network status issues like this one.

Thank you guys for the persistent effort to get everything back up and running as quickly as humanly possible. :D


Posted: Fri Oct 30, 2009 2:09 pm
Senior Member

Joined: Fri Oct 24, 2003 3:51 pm
Posts: 965
Location: Netherlands
caker wrote:
  • We have devised and enacted a more conservative approach to deploying system updates across our network. A key part of this approach is a gradual, more tightly controlled deployment method for such updates.

This is what we needed to hear. Now we can be pretty sure that our main and backup Linodes will not go down simultaneously.
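
For illustration only, the gradual deployment described in the quote could look something like the Python sketch below. It is not Linode's actual procedure: `staged_rollout`, `apply_update`, `is_healthy`, and the batch sizes are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Iterable, List


def staged_rollout(hosts: List[str],
                   apply_update: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   batch_sizes: Iterable[int] = (1, 5, 25)) -> List[str]:
    """Update hosts in growing batches and halt before the next batch
    if any host in the current batch is unhealthy.

    Returns the hosts that were updated, including those in a failing batch.
    """
    sizes = list(batch_sizes)
    if not sizes or min(sizes) < 1:
        raise ValueError("batch_sizes must contain positive integers")

    updated: List[str] = []
    remaining = list(hosts)
    step = 0
    while remaining:
        # Once the size list is exhausted, keep using the last size.
        size = sizes[min(step, len(sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for host in batch:
            apply_update(host)
            updated.append(host)
        # Verify the whole batch before touching any more hosts.
        if not all(is_healthy(host) for host in batch):
            return updated
        step += 1
    return updated
```

With ten hosts and the default sizes, a problem found in the second batch stops the rollout after six hosts, so the remaining four are never touched. That is the property that would keep a main and a backup Linode on different hosts from going down together.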

caker wrote:
  • Through customer feedback and our own internal discussions on this matter, we recognize the need for greatly improved customer communication. We are committed to developing and implementing reliable mechanisms for achieving this goal. We appreciate the feedback we have received on potential means of addressing these concerns, and are evaluating all options.

The Newark power loss thread seemed to work well, and locking the thread during the active phase to channel discussions to another thread was a good move.

_________________
/ Peter


Posted: Fri Oct 30, 2009 2:18 pm
Senior Newbie

Joined: Sat Jun 13, 2009 9:59 am
Posts: 13
Thanks for taking our feedback and working on improved processes. We really appreciate the effort!


Posted: Fri Oct 30, 2009 5:29 pm
Junior Member

Joined: Wed Mar 19, 2008 10:34 pm
Posts: 32
Website: http://www.claws-and-paws.com/
WLM: doug.muth@gmail.com
Yahoo Messenger: dmuthathome
AOL: Dmuth+At+Home
Location: Ardmore, PA
Just to chime in with the others, thank you for sharing the details and striving to improve your procedures.

_________________
Disclaimer: I am not a Linode staff member.


Posted: Fri Oct 30, 2009 5:54 pm
Senior Member

Joined: Sun Aug 31, 2008 4:29 pm
Posts: 177
Caker: thank you for this post.

At risk of being labelled a fanboy, I'll say that Linode recovered extremely well from this incident. A mistake was made, they recognized it and went balls-out to fix it. If you want a provider that is infallible, that never makes mistakes, send me $50 and I'll set you up -- and include a nice piece of land in the deal.


Posted: Fri Oct 30, 2009 7:01 pm
Senior Member

Joined: Mon Dec 10, 2007 4:30 pm
Posts: 341
Website: http://markwalling.org
pclissold wrote:
locking the thread during the active phase to channel discussions to another thread was a good move.

+1


Posted: Fri Oct 30, 2009 8:46 pm
Senior Member

Joined: Mon Sep 01, 2008 5:14 pm
Posts: 92
Thanks for the continued efforts. You guys still rock.


Post subject: Thank you.
Posted: Fri Oct 30, 2009 10:53 pm
Senior Member

Joined: Thu Jun 21, 2007 7:13 pm
Posts: 100
Website: http://neo101.org
Thank you for being as open about what happened as you are. And thank you all for all the effort you gave when you realized (almost) everything broke down simultaneously. Your suggested changes will very likely mean that these kinds of scenarios will not happen again. We are all humans who learn from our mistakes and you really seem to have learned from yours. I'll continue to recommend you whenever anyone asks me to recommend a good (the best even) Linux VPS provider.


Posted: Mon Nov 02, 2009 9:39 pm
Senior Newbie

Joined: Fri Jun 05, 2009 11:31 am
Posts: 5
Location: Australia
I REALLY appreciate that you heard the request for more communications, so I wanted to try and helpfully list what would have made my day easier during this outage.

I just wanted to say that as a sysadmin, I understand things go wrong and I'm 100% happy with the recovery action that was taken (great work and *hugs* all round!). Things like this happen; punishing people is not helpful, learning from it is.

Things like this are "another fsck'ing 'learning opportunity'" ;-) so let's try and be helpful and suggest what we'd like to see communications-wise next time round.

I know procedures suck, but I'd suggest a standard outage template (save thinking when under strain) and a policy to update at some regular interval (sucks, I know, but will make users love you!).

My list of 'what I wanted to know' (in nice 'standard form we have to fill in' format):
    * possible impact on my machine
      reboot
      possible file corruption
      down until further notice

    * who is impacted
      e.g. only some Dallas systems, more details in next update
      systems that are currently down, all others will need reboot
      only dallas123
      unknown, many hosts

    * who is NOT impacted
      dallas only, other datacenters ok
      only dallas systems unavailable as of date/time, all others ok
      only dallas123 is impacted
      unknown, may be all

    * time frame
      rebooting all hosts, will take 2-4 hours
      unknown, at least 1 hour
      unknown timeframe, still investigating

    * timestamps on the updates and when the next update will be (with a handy java countdown timer so I don't need to translate timezones or subtract time)

    * some detail on the issue
      typed rm -rf / by accident on dallas123
      unknown crash on dallas123
      unknown host shutdowns in dallas

    * any action I could take
      rebooting all hosts, please clean shutdown any host you can
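
As a rough illustration, the template above could be captured as a small Python structure that always carries a timestamp and a promised time for the next update. Everything here is an assumption made for the example: the `OutageUpdate` class, its field names, and the output layout are hypothetical, and timestamps are assumed to be timezone-aware UTC values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class OutageUpdate:
    """One outage status update; all field names are illustrative."""
    posted_at: datetime          # timezone-aware, expected to be UTC
    next_update_in: timedelta    # promised interval until the next update
    impact: str                  # e.g. "reboot", "down until further notice"
    affected: str                # who is impacted
    not_affected: str            # who is NOT impacted
    time_frame: str              # e.g. "unknown, at least 1 hour"
    detail: str = "unknown, still investigating"
    customer_actions: List[str] = field(default_factory=list)
    best_guess: bool = True      # flag estimates as estimates

    def next_update_at(self) -> datetime:
        return self.posted_at + self.next_update_in

    def minutes_until_next(self, now: datetime) -> int:
        """Whole minutes until the next promised update, never negative."""
        remaining = self.next_update_at() - now
        return max(0, int(remaining.total_seconds() // 60))

    def render(self) -> str:
        guess = " (best guess)" if self.best_guess else ""
        actions = "; ".join(self.customer_actions) or "none"
        return "\n".join([
            f"Posted:       {self.posted_at:%Y-%m-%d %H:%M} UTC",
            f"Next update:  {self.next_update_at():%Y-%m-%d %H:%M} UTC",
            f"Impact:       {self.impact}{guess}",
            f"Affected:     {self.affected}{guess}",
            f"Not affected: {self.not_affected}{guess}",
            f"Time frame:   {self.time_frame}{guess}",
            f"Detail:       {self.detail}",
            f"Your action:  {actions}",
        ])
```

Printing `render()` for each update gives every announcement the same shape, and `minutes_until_next(datetime.now(timezone.utc))` gives the countdown without anyone translating timezones by hand.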



I think the first 5-6 are critical, even if they are just 'best guesses' (and marked as such). Part of the pain in this downtime was not knowing if my un-rebooted hosts would be rebooted later (i.e., were they safe? was my downtime still to come?) and how long that window of doooooooom was.

Also could we maybe clean up the communications channels? It's great you have so many, but.....
      * forum
        announcement ("http://blog.linode.com/ supersedes the announcement forum"): had no info.
        sys/network status: at first the Linode site was down. Later I found the thread with the short post about a library update, but that didn't seem to be updated or really very helpful.

      * blog: had no info (still has no info)

      * twitter: had little info (just saying things were coming up, and an hour later another undetailed message saying the same)

      * front page on the Linode website (when it was up): had no info


None of these had any useful info and that just added to the painful gnashing of teeth.

The action was _all_ on the irc channel, but everyone who joined (me included) asked the same questions over and over again. There were helpful people answering the same questions over and over again, but it makes for a noisy communications channel and really the information would be better suited to a static page. Seems to me that you want people on the irc channel to have specific issues (or just to be on to monitor).

I think you need _one_ official place to post 'emergency' stuff that everyone knows to go to.

An ideal solution would be an external site (linodestatus.com so it does not need your dns either) with a front text status page, then a page for each datacenter with an xml/traffic light for each host (red = problem, orange = possible problem) or even a 'login and get a list of your host status'. Something that could be easily plugged into nagios etc.
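
To make the Nagios idea a little more concrete, here is a minimal Python sketch of a check that maps traffic-light colours to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The feed format is purely an assumption: a JSON object of host name to colour, such as `{"dallas123": "red"}`, stands in for whatever xml or other format a real status site would publish. No such Linode feed is known to exist, and `check_hosts` is a hypothetical name.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Traffic-light colours mapped to Nagios states.
LIGHT_TO_STATE = {"green": OK, "orange": WARNING, "red": CRITICAL}

# Ranking used to pick the worst state; a design choice for this sketch:
# UNKNOWN outranks WARNING but not CRITICAL.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def check_hosts(feed_text, my_hosts):
    """Return (exit_code, message) for the hosts we care about.

    feed_text is the hypothetical JSON status feed as a string.
    """
    try:
        feed = json.loads(feed_text)
    except ValueError:
        return UNKNOWN, "UNKNOWN - status feed is not valid JSON"
    if not isinstance(feed, dict):
        return UNKNOWN, "UNKNOWN - status feed is not a JSON object"

    worst = OK
    problems = []
    for host in my_hosts:
        light = feed.get(host)
        # Anything that is not a known colour string counts as UNKNOWN.
        state = LIGHT_TO_STATE.get(light, UNKNOWN) if isinstance(light, str) else UNKNOWN
        if state != OK:
            shown = light if light is not None else "missing"
            problems.append(f"{host}={shown}")
        worst = max(worst, state, key=SEVERITY.get)

    summary = ", ".join(problems) or "all hosts green"
    return worst, f"{STATE_NAMES[worst]} - {summary}"
```

A wrapper script would fetch the feed, print the message, and exit with the returned code, which is all Nagios needs in order to raise an alert.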

Adding 'email me if anything isn't green' would just be the icing on the cake. Hmm. cake.

Anyway, I think if I had an update with the info at the top, I would have been a happy(ish/er) camper on the day.

Cheers and thanks again.

_________________
---
Monkeys for the win....
.... but Labradors for the snooze.


Posted: Tue Nov 03, 2009 1:02 am
Senior Member

Joined: Sun Feb 08, 2004 7:18 pm
Posts: 562
Location: Austin
blackwd has many great points, but the one that jumped out at me was the idea of the official communications channel for emergencies.

For example, I've never found it necessary to figure out how to get on the IRC channel, which it appears was the primary place to get information about this, although I had no way of knowing that during the outage. (I wasn't affected, thankfully, but didn't know for sure...)

It would be really helpful to designate one official communications medium. Whatever it is, I'll be ready to use it for next time. Ideally all the other channels would point to that main one.

