Linode Community Forums
caker
Joined: 15 Apr 2003
Posts: 2906
Location: Galloway, NJ
|
| Posted: Thu Oct 29, 2009 9:04 pm Post subject: Host Reboots - October 27th, 2009 - Analysis |
|
|
On Tuesday, October 27th we became aware of a problem that affected some of our newer host deployments. Upon investigation, it was determined that a software library was not present on these hosts due to an oversight in our build procedure. This shared library, once distributed across our fleet, caused an unforeseen incompatibility on some of the machines.
The affected machines were then unable to process jobs (boots, shutdowns, deployments, etc), unable to determine the current state of their Linodes, and were unable to manipulate their Linodes in any way (such as issuing shutdowns). Additionally, the affected machines incorrectly deconstructed networking for some Linodes.
Once system administrators on duty became aware of the issues with these machines, all staff members were immediately alerted to the issue and began working diligently to develop and execute a plan of action to conduct emergency maintenance on the affected machines. Given the urgency of the situation, with consideration to the largely inoperable state of these hosts, the decision was made to begin corrective actions without further delay. System administrators worked together to divide the workload and reported results back to managing personnel as the situation evolved.
Administrators worked until all outstanding customer issues were properly resolved, providing direct assistance to affected customers via support tickets and real-time support in our IRC channel. As a consequence of these events, we have significantly revised our internal procedures to include the following changes:
We have adopted more formalized testing procedures for any and all updates to host machines.
We have devised and enacted a more conservative approach to deploying system updates across our network. A key part of this approach is a gradual, more tightly controlled deployment method for such updates.
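As an illustration only: the post does not describe how the gradual deployment is actually implemented, so the following is a minimal sketch of one common way to do it, assuming updates go out in growing batches and the rollout halts at the first host that fails a health check. The names `staged_rollout`, `apply_update`, `is_healthy`, and the batch sizes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)    # updated and passed the check
    failed: Optional[str] = None                         # first host that failed its check
    untouched: List[str] = field(default_factory=list)  # never received the update


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_sizes: Sequence[int] = (1, 5, 25),  # hypothetical schedule
) -> RolloutResult:
    """Update hosts in growing batches and stop at the first unhealthy host."""
    result = RolloutResult()
    remaining = list(hosts)
    step = 0
    while remaining:
        # Once the schedule runs out, keep using the last batch size.
        size = batch_sizes[min(step, len(batch_sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for i, host in enumerate(batch):
            apply_update(host)
            if not is_healthy(host):
                # Halt: everything after this host keeps the old software.
                result.failed = host
                result.untouched = batch[i + 1:] + remaining
                return result
            result.updated.append(host)
        step += 1
    return result
```

The point of the design is that a bad update reaches only the hosts processed before the first failed check, instead of the whole fleet at once.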
Through customer feedback and our own internal discussions on this matter, we recognize the need for greatly improved customer communication. We are committed to developing and implementing reliable mechanisms for achieving this goal. We appreciate the feedback we have received on potential means of addressing these concerns, and are evaluating all options.
We are continuing to give this matter the full attention our customers deserve, and sincerely apologize for any issues experienced throughout the service outage. As both a company and as individuals, we place a high value on the lessons learned from these unfortunate circumstances, and we look forward to being better able to serve our customers as a result of the knowledge gained.
-Linode Team
-------------------------------
On a personal note, I'd like to add that I am incredibly proud of our team's response to these issues. Every team member demonstrated unending dedication and devotion to resolving problems faced by our customers. Multiple personnel insisted on staying through the night to ensure the complete resolution of outstanding issues, while others worked remotely and came in early to relieve those that stayed, only to stay overnight themselves. Again, we apologize for the downtime, would like to reiterate our commitment to our customers, and assure you that this will serve to make Linode better.
-Chris |
|
|
nsajeff
Joined: 18 Feb 2009
Posts: 47
|
| Posted: Thu Oct 29, 2009 9:40 pm Post subject: |
|
|
While I was more than a little frustrated with the situation, I have to admit that I am impressed with the outcome.
I could not imagine having to resolve issues for that many clients and keeping a cool head.
I'm confident that there will not be a repeat of this incident and I appreciate the honest breakdown in this thread. |
|
|
ohkus
Joined: 13 Jun 2008
Posts: 61
|
| Posted: Fri Oct 30, 2009 9:02 am Post subject: |
|
|
| Thank you for the analysis and all your hard work. I'm glad you guys are putting some more formal procedures around how maintenance is performed. |
|
|
djg320
Joined: 02 Mar 2009
Posts: 3
Location: Maine, US
|
| Posted: Fri Oct 30, 2009 9:08 am Post subject: |
|
|
This is why I am with Linode exclusively. I have yet to come across a host like you guys who are so dedicated to your customers, personable yet highly professional. Your communication in situations such as these is unparalleled. I sense the high level of morale in both your customer base and staff here at Linode.
David G |
|
|
craversp
Joined: 12 Aug 2008
Posts: 6
Location: CT, USA
|
| Posted: Fri Oct 30, 2009 10:12 am Post subject: |
|
|
| Thanks for the update, and I am glad to see that there are some changes being put in place to help keep something like this from happening again. |
|
|
jcw5002
Joined: 06 Oct 2009
Posts: 5
|
| Posted: Fri Oct 30, 2009 10:37 am Post subject: |
|
|
I also really appreciate the post and explanation of what actually happened. I was *extremely* impressed with the support on the IRC channel during Tuesday night when the hosts were starting to come back. The Linode admins were juggling hundreds of questions and answering them surprisingly quickly! (Jed, you rock!)
I would have liked to see more updates via Twitter as the outage was happening. There were a few tweets from @linode but most of them were not as informative as I would have liked. Maybe a @linode_support account could be created to better inform customers when this type of thing is happening? Twitter seems like a great channel for network status issues like this one.
Thank you guys for the persistent effort to get everything back up and running as quickly as humanly possible. :D |
|
|
pclissold
Joined: 24 Oct 2003
Posts: 877
Location: Netherlands
|
| Posted: Fri Oct 30, 2009 1:09 pm Post subject: Re: Host Reboots - October 27th, 2009 - Analysis |
|
|
caker wrote: We have devised and enacted a more conservative approach to deploying system updates across our network. A key part of this approach is a gradual, more tightly controlled deployment method for such updates.
This is what we needed to hear. Now we can be pretty sure that our main and backup Linodes will not go down simultaneously.
caker wrote: Through customer feedback and our own internal discussions on this matter, we recognize the need for greatly improved customer communication. We are committed to developing and implementing reliable mechanisms for achieving this goal. We appreciate the feedback we have received on potential means of addressing these concerns, and are evaluating all options.
The Newark power loss thread seemed to work well, and locking the thread during the active phase to channel discussions to another thread was a good move. |
|
|
roopesh
Joined: 13 Jun 2009
Posts: 13
|
| Posted: Fri Oct 30, 2009 1:18 pm Post subject: |
|
|
| Thanks for taking our feedback and working on improved processes. We really appreciate the effort! |
|
|
dmuth
Joined: 19 Mar 2008
Posts: 32
Location: Ardmore, PA
|
| Posted: Fri Oct 30, 2009 4:29 pm Post subject: |
|
|
| Just to chime in with the others, thank you for sharing the details and striving to improve your procedures. |
|
|
sleddog
Joined: 31 Aug 2008
Posts: 101
|
| Posted: Fri Oct 30, 2009 4:54 pm Post subject: |
|
|
Caker: thank you for this post.
At risk of being labelled a fanboy, I'll say that Linode recovered extremely well from this incident. A mistake was made, they recognized it and went balls-out to fix it. If you want a provider that is infallible, that never makes mistakes, send me $50 and I'll set you up -- and include a nice piece of land in the deal. |
|
|
mwalling
Joined: 10 Dec 2007
Posts: 335
|
| Posted: Fri Oct 30, 2009 6:01 pm Post subject: Re: Host Reboots - October 27th, 2009 - Analysis |
|
|
pclissold wrote: locking the thread during the active phase to channel discussions to another thread was a good move.
+1 |
|
|
eld101
Joined: 01 Sep 2008
Posts: 63
|
| Posted: Fri Oct 30, 2009 7:46 pm Post subject: |
|
|
| Thanks for the continued efforts. You guys still rock. |
|
|
harmone
Joined: 21 Jun 2007
Posts: 100
|
| Posted: Fri Oct 30, 2009 9:53 pm Post subject: Thank you. |
|
|
| Thank you for being as open as you are about what happened. And thank you all for all the effort you gave when you realized (almost) everything broke down simultaneously. Your suggested changes will very likely mean that these kinds of scenarios will not happen again. We are all humans who learn from our mistakes, and you really seem to have learned from yours. I'll continue to recommend you whenever anyone asks me to recommend a good (the best even) Linux VPS provider. |
|
|
blackwd
Joined: 05 Jun 2009
Posts: 5
Location: Australia
|
| Posted: Mon Nov 02, 2009 8:39 pm Post subject: |
|
|
I REALLY appreciate that you heard the request for more communications, so I wanted to try and helpfully list what would have made my day easier during this outage.
I just wanted to say that as a sysadmin, I understand things go wrong and I'm 100% happy with the recovery action that was taken (great work and *hugs* all round!). Things like this happen; punishing people is not helpful, learning from it is.
Things like this are "another fsck'ing 'learning opportunity'" ;-) so let's try and be helpful and suggest what we'd like to see communications-wise next time round.
I know procedures suck, but I'd suggest a standard outage template (saves thinking when under strain) and a policy to update at some regular interval (sucks, I know, but will make users love you!).
My list of 'what I wanted to know' (in nice 'standard form we have to fill in' format):
* possible impact on my machine
    reboot
    possible file corruption
    down until further notice
* who is impacted
    e.g. only some Dallas systems, more details in next update
    systems that are currently down, all others will need reboot
    only dallas123
    unknown, many hosts
* who is NOT impacted
    dallas only, other datacenters ok
    only dallas systems unavailable as of date/time, all others ok
    only dallas123 is impacted
    unknown, may be all
* time frame
    rebooting all hosts, will take 2-4 hours
    unknown, at least 1 hour
    unknown timeframe, still investigating
* timestamps on the updates and when the next update will be (with a handy java countdown timer so I don't need to translate timezones or subtract time)
* some detail on the issue
    typed rm -rf / by accident on dallas123
    unknown crash on dallas123
    unknown host shutdowns in dallas
* any action I could take
    rebooting all hosts, please clean shutdown any host you can
I think the first 5-6 are critical, even if they are just 'best guesses' (and marked as such). Part of the pain in this downtime was not knowing if my un-rebooted hosts would be rebooted later (ie, were they safe? was my downtime still to come?) and how long that window of doooooooom was.
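To make the "standard form" idea concrete, here is a minimal sketch of such a template as a small data structure. It assumes the fields listed above plus a fixed update interval. The class name `OutageNotice`, the field names, and the 30-minute default are illustrative assumptions, not anything Linode uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class OutageNotice:
    impact: str               # e.g. "reboot", "down until further notice"
    who_is_impacted: str      # e.g. "only dallas123"
    who_is_not_impacted: str  # e.g. "other datacenters ok"
    time_frame: str           # best guess is fine, but say it is a guess
    details: str              # what is known about the cause so far
    customer_action: str      # anything the customer can do
    posted_at: datetime       # must be timezone-aware
    update_interval: timedelta = timedelta(minutes=30)  # assumed policy

    @property
    def next_update_at(self) -> datetime:
        """When the next update is promised, based on the fixed interval."""
        return self.posted_at + self.update_interval

    def render(self) -> str:
        """Return the notice as plain text with all times shown in UTC."""
        fmt = "%Y-%m-%d %H:%M UTC"
        rows = [
            ("Posted", self.posted_at.astimezone(timezone.utc).strftime(fmt)),
            ("Next update", self.next_update_at.astimezone(timezone.utc).strftime(fmt)),
            ("Impact", self.impact),
            ("Who is impacted", self.who_is_impacted),
            ("Who is NOT impacted", self.who_is_not_impacted),
            ("Time frame", self.time_frame),
            ("Details", self.details),
            ("Action you can take", self.customer_action),
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows)
```

Because every field is required, whoever posts the notice has to write at least "unknown" for each item, which is the point of having a template under strain.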
Also could we maybe clean up the communications channels? It's great you have so many, but.....
* forum
    announcement ("http://blog.linode.com/ supersedes the announcement forum"): had no info.
    sys/network status: at first the Linode site was down. Later I found the thread with the short post about a library update, but that didn't seem to be updated or really very helpful.
* blog: had no info (still has no info)
* twitter: had little info (just saying things coming up and an hour later another undetailed message saying the same)
* front page on Linode website (when it was up): had no info
None of these had any useful info and that just added to the painful gnashing of teeth.
The action was _all_ on the irc channel, but everyone who joined (me included) asked the same questions over and over again. There were helpful people answering the same questions over and over again, but it makes for a noisy communications channel and really the information would be better suited to a static page. Seems to me that you want people on the irc channel to have specific issues (or just to be on to monitor).
I think you need _one_ official place to post 'emergency' stuff that everyone knows to go to.
An ideal solution would be an external site (linodestatus.com so it does not need your dns either) with a front text status page, then a page for each datacenter with an xml/traffic light for each host (red = problem, orange = possible problem) or even a 'login and get a list of your host status'. Something that could be easily plugged into nagios etc.
Adding 'email me if anything isn't green' would just be the icing on the cake. Hmm. cake.
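A rough sketch of how the traffic-light idea could plug into Nagios, assuming each host reports one of three lights. The light values are chosen to match the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The names `Light`, `datacenter_light`, `nagios_exit_code`, and `hosts_to_alert` are made up for illustration, and no such Linode status feed is assumed to exist.

```python
from enum import IntEnum
from typing import Dict, List


class Light(IntEnum):
    GREEN = 0   # no known problem      -> Nagios OK
    ORANGE = 1  # possible problem      -> Nagios WARNING
    RED = 2     # problem               -> Nagios CRITICAL


def datacenter_light(host_lights: Dict[str, Light]) -> Light:
    """A datacenter shows the worst light of any of its hosts (green if empty)."""
    return max(host_lights.values(), default=Light.GREEN)


def nagios_exit_code(light: Light) -> int:
    """Exit code a Nagios check script would return for this light."""
    return int(light)


def hosts_to_alert(host_lights: Dict[str, Light]) -> List[str]:
    """Hosts for the 'email me if anything isn't green' option."""
    return sorted(h for h, light in host_lights.items() if light != Light.GREEN)
```

A check script built on this would fetch the lights for the customer's hosts, print a one-line summary, and exit with `nagios_exit_code(...)`.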
Anyway, I think if I had an update with the info at the top, I would have been a happy(ish/er) camper on the day.
Cheers and thanks again. |
|
|
Xan
Joined: 08 Feb 2004
Posts: 562
Location: Austin
|
| Posted: Tue Nov 03, 2009 12:02 am Post subject: |
|
|
blackwd has many great points, but the one that jumped out to me was the idea of the official communications channel for emergencies.
For example, I've never found it necessary to figure out how to get on the IRC channel, which it appears was the primary place to get information about this, although I had no way of knowing that during the outage. (I wasn't affected, thankfully, but didn't know for sure...)
It would be really helpful to designate one official communications medium. Whatever it is, I'll be ready to use it for next time. Ideally all the other channels would point to that main one. |
|