Linode Forum
Linode Community Forums
 Post subject: Migration: dallas165
Posted: Tue Jul 21, 2009 8:18 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
We're investigating a problem with dallas165. Updates in a bit.

-Chris


 Post subject:
Posted: Tue Jul 21, 2009 9:42 pm
Senior Member

Joined: Sat Mar 28, 2009 4:23 pm
Posts: 415
Website: http://jedsmith.org/
Location: Out of his depth and job-hopping without a clue about network security fundamentals
Early this afternoon, one hard drive in dallas165's RAID 10 array failed. Calls were made, tickets filed, and a plan of action put into place. Customers would never have been any the wiser had no other drive failed; however, at around 8 PM EDT, another drive did.

Not even RAID can prepare for double drive failure. Two drives failing within six hours of each other is unprecedented, and quite unlucky. This is an extremely rare situation for Linode, and one we regret immensely. After extensive triage and troubleshooting, we have determined that all customer data on dallas165 has been lost.

However, hardware does fail; this sort of situation is mostly outside of our control. Let me be the first, on behalf of Linode, to apologize if you are affected by this host failure.

Customers on this host have been moved to dallas166, and tickets opened to discuss specifics relating to their account. If you have any questions whatsoever, please don't hesitate to open a ticket or e-mail us. Check your e-mail for a ticket from us, if you were on dallas165.

Once again, I apologize.

_________________
Disclaimer: I am no longer employed by Linode; opinions are my own alone.


 Post subject:
Posted: Tue Jul 21, 2009 9:56 pm
Senior Member

Joined: Sun Feb 08, 2004 7:18 pm
Posts: 562
Location: Austin
Ouch... My condolences to the people affected. Once again, the lesson is: backups! Any valuable data must be backed up, whether it's on Linode or anywhere else. Events like this are a good reminder of that, because it's a matter of when, not if, data loss hits us all.

Tangentially, RAID certainly can protect against double drive failures (RAID 6, for example), but RAID 10 can't guarantee it: lose both drives of the same mirror pair and the array is gone. In any case, RAID is not a backup.

[edited to fix error pointed out by hybinet]


Last edited by Xan on Thu Jul 23, 2009 3:52 am, edited 1 time in total.

 Post subject:
Posted: Wed Jul 22, 2009 1:00 am
Senior Member

Joined: Fri Sep 12, 2008 3:17 am
Posts: 166
Website: http://independentchaos.com
jed wrote:
Not even RAID can prepare for double drive failure. Two drives failing within six hours of each other is unprecedented, and quite unlucky.


Someone should tell that to our SAN system: we had 17 drives out of our 32-drive array fail semi-simultaneously.

It is unfortunate that all customer data is lost, and unfortunate that Linode Backup isn't completely up and running yet either :P But we knew what un-managed meant.

_________________
If it ain't broke, you didn't tweak it enough. If it is broke, use more duct tape.
http://independentchaos.com


 Post subject:
Posted: Wed Jul 22, 2009 12:04 pm

Joined: Wed Jul 22, 2009 11:39 am
Posts: 1
We had our Linode hosted on dallas165.

It was a complete shock when I logged in and saw no disk images in the dashboard. Talk about scary... I hope that none of you ever sees something like that.

We had a local backup of everything. However, we also had a bunch of things set up on our Linode (mail, web, svn, mysql...) with a lot of optimizations and tweaks (custom-patched Apache...). So just restoring these would take days, I guess (even with the server log we update manually).

We had the "luck" to move to dallas165 at the beginning of July (this was our most important image, which was 2 years old). Tech support somehow managed to recover our two-week-old image file. This was very convenient, as in the end only emails and the two projects we are working on right now had to be recovered in addition to the old image file. Still, just those took us a full working day (8 hours) to restore. All of our live production projects were up and running within an hour (so we were "offline" for a very short period in the Central European time zone).

I now feel as lucky as I felt desperate this morning. Finally, we have put this behind us.

The moral of the story: when you do backups, don't think just about backing up. The reverse process and its speed are equally important.
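
To make that moral a bit more concrete, here is a minimal Python sketch of a "restore drill": unpack a backup into a scratch directory, time it, and check the restored files against digests recorded at backup time. It is only an illustration under assumptions that are not from this thread: the backup is a tar archive, the manifest is a plain dict of relative path to SHA-256 digest, and the names `restore_drill` and `sha256_of` are made up for the example.

```python
import hashlib
import tarfile
import tempfile
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(archive: Path, expected: dict[str, str]) -> tuple[float, list[str]]:
    """Extract `archive` into a scratch directory and verify its contents.

    `expected` maps relative file paths to the SHA-256 digests recorded at
    backup time (a hypothetical manifest format). Returns the elapsed
    seconds and the list of files that are missing or do not match.
    """
    started = time.monotonic()
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            # Use the safer "data" extraction filter where this Python has it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(scratch, **kwargs)
        bad = []
        for name, digest in expected.items():
            restored = Path(scratch) / name
            if not restored.is_file() or sha256_of(restored) != digest:
                bad.append(name)
    return time.monotonic() - started, bad
```

Running something like this on a schedule gives you both numbers that matter after a host failure: how long the restore takes, and whether what comes back is actually intact. It does not cover the days of reinstalling and re-tweaking services described above; that part needs an image or a configuration backup.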

I hope that Linode backup will be available soon.

My condolences to the rest of our "cohabitants" on dallas165.

_________________
Nikola Stojiljkovic
Essential Dots


 Post subject:
Posted: Wed Jul 22, 2009 1:55 pm
Senior Member

Joined: Tue May 26, 2009 3:29 pm
Posts: 1691
Location: Montreal, QC
There seem to have been a few double-drive failures of late (well, I think this is only the second recently).

Nevertheless, considering the huge impact when they do occur (and we've seen that they do occur on occasion), has Linode considered switching from RAID10 to RAID6 or switching to simple RAID1 with 3 larger drives instead of (presumably) the four smaller drives used in RAID10? Either of these solutions would provide for the ability to survive two-drive failures.
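
To put numbers on that comparison, here is a small Python sketch that enumerates every possible two-drive failure and counts how many each layout survives. The drive counts are assumptions (four drives for RAID 10 and RAID 6, three for the three-way mirror, matching the "presumably" above), and the helper names are made up for the example; it models only which drives are lost, not rebuild windows or controller behaviour.

```python
from itertools import combinations


def raid10_survives(failed: set[int], pairs: list[tuple[int, int]]) -> bool:
    """RAID 10 survives as long as no mirror pair has lost both drives."""
    return all(not (a in failed and b in failed) for a, b in pairs)


def raid6_survives(failed: set[int]) -> bool:
    """RAID 6 keeps two parity blocks per stripe, so any two losses are fine."""
    return len(failed) <= 2


def mirror_survives(failed: set[int], drives: int) -> bool:
    """An n-way RAID 1 mirror survives while at least one copy remains."""
    return len(failed) < drives


def survival_ratio(drives: int, survives) -> tuple[int, int]:
    """Return (survivable, total) over all possible two-drive failures."""
    cases = [set(c) for c in combinations(range(drives), 2)]
    return sum(1 for failed in cases if survives(failed)), len(cases)


# Assumed layout: drives 0+1 form one mirror pair, drives 2+3 the other.
PAIRS = [(0, 1), (2, 3)]
print("RAID 10:", survival_ratio(4, lambda f: raid10_survives(f, PAIRS)))  # (4, 6)
print("RAID 6: ", survival_ratio(4, raid6_survives))                        # (6, 6)
print("RAID 1: ", survival_ratio(3, lambda f: mirror_survives(f, 3)))       # (3, 3)
```

Under those assumptions, a four-drive RAID 10 survives four of the six possible two-drive failures; the other two are the cases where both halves of one mirror pair die. RAID 6 and the three-way mirror survive every two-drive combination.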


 Post subject:
Posted: Wed Jul 22, 2009 2:13 pm
Senior Member

Joined: Fri May 02, 2008 8:44 pm
Posts: 1121
jed wrote:
Two drives failing within six hours of each other is unprecedented,

AFAIK, drives purchased at the same time from the same vendor tend to do that. They most likely are from the same production line (same "batch"), which sort of explains why they might have similar potentials for premature failure.

Xan wrote:
it's a matter of if, not when, data loss hits us all.

I think you got it backwards. It's a matter of when, not if, data loss will occur.


 Post subject:
Posted: Thu Jul 23, 2009 3:52 am
Senior Member

Joined: Sun Feb 08, 2004 7:18 pm
Posts: 562
Location: Austin
pfft, what a doof, thanks!


 Post subject:
Posted: Fri Jul 24, 2009 6:24 pm
Senior Member

Joined: Wed Jan 24, 2007 12:04 am
Posts: 90
Website: http://www.smiffysplace.com
Location: Rural South Australia
Once again, I find myself wishing that Linux kernel licensing were different so that we could have ZFS. That CAN cope with multiple disc failures. There's even a video of a guy taking a sledgehammer to a pair of discs in a hot system, plugging new discs in and watching it all rebuild without a hitch.


 Post subject:
Posted: Mon Jul 27, 2009 10:09 am
Senior Member

Joined: Tue May 26, 2009 3:29 pm
Posts: 1691
Location: Montreal, QC
smiffy wrote:
Once again, I find myself wishing that Linux kernel licensing were different so that we could have ZFS. That CAN cope with multiple disc failures. There's even a video of a guy taking a sledgehammer to a pair of discs in a hot system, plugging new discs in and watching it all rebuild without a hitch.


ZFS by itself can't handle multiple disk failures; it has no inherent redundancy. RAID-Z2 can handle two disk failures (RAID-Z can handle one).

But RAID-5 can handle one disk failure, and RAID-6 can handle two.

The advantages of ZFS/RAID-Z are not in the number of disk failures they can handle; they're in other things.


 Post subject:
Posted: Mon Jul 27, 2009 1:03 pm
Linode Staff

Joined: Sat Jun 21, 2003 2:21 pm
Posts: 160
Location: Absecon, NJ
Most of you who were affected by the RAID crash have probably seen the ticket updates, but I just wanted to let everyone know that we finally managed to get the RAID to respond again this weekend. All customer data was copied off to a standby host and it's sitting there now in case anyone wants access to it.

If you haven't redeployed yet, we can put your Linode back the way it was. If you have redeployed and you'd just like access to the disks, let us know and we'll set it up for you.

-James


 Post subject:
Posted: Tue Jul 28, 2009 9:45 am
Junior Member

Joined: Tue Dec 09, 2008 2:33 pm
Posts: 49
Website: http://www.ragtop.org
Location: Gilbert, AZ
I wasn't affected by this, but it is good to know that Linode kept on working on getting the data back and didn't just give up. Good job Linode.


 Post subject:
Posted: Thu Jul 30, 2009 1:56 am
Senior Member

Joined: Fri Sep 12, 2008 3:17 am
Posts: 166
Website: http://independentchaos.com
jsr wrote:
I wasn't affected by this, but it is good to know that Linode kept on working on getting the data back and didn't just give up. Good job Linode.


This is why people who come to Linode stay with Linode :)

_________________
If it ain't broke, you didn't tweak it enough. If it is broke, use more duct tape.

http://independentchaos.com

