Linode Forum
Linode Community Forums
 Post subject: Host8 down for repair
Posted: Wed Dec 24, 2003 12:32 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
ThePlanet and I are currently working on an issue with host8. More details to follow.


Posted: Wed Dec 24, 2003 12:59 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
It looks like ThePlanet's power outage caused root filesystem corruption on host8. ThePlanet is currently de-racking the machine to run diagnostics on the drives. The extent of the data loss is not yet known. I'll keep at it until I know more and will post updates here.

-Chris


Posted: Wed Dec 24, 2003 1:27 am
Senior Newbie

Joined: Tue Nov 04, 2003 12:32 am
Posts: 12
Host 7 may have been hit as well. I've got a number of seemingly corrupted files on my node, unfortunately including scp and sftp.


Posted: Wed Dec 24, 2003 1:46 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
We've got the machine booting again and are now looking into the havoc this caused. There is still some corruption on the server. I've left one of the RAID drives in the state it was in when we brought the machine down. It looks as though the Linode filesystems will be recoverable, but I'll know more when the machine is re-racked.

-Chris


Posted: Wed Dec 24, 2003 2:33 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Host8 is back online, but I'm still checking everything -- I expect the Linodes to be back up and running shortly.


Posted: Wed Dec 24, 2003 3:51 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Linodes have been restarted.

Some file corruption may have occurred. Please check your /lost+found directories for orphaned files. This was about as severe as it gets: some Linodes lost files, possibly even entire filesystems.

If your Linode won't boot, there are a few things you can do. First, connect to the console and boot, perhaps in single-user mode (edit your configuration profile). You may well be able to repair your filesystem from there. If that fails, deploy a fresh distribution alongside it (Debian will be the smallest), resizing your existing filesystem first if you need to make room. Then add your old root filesystem to the new configuration profile and salvage what you can.
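A minimal sketch of the "salvage what you can" step, assuming the freshly deployed distro is running as root and the old root image has been added to the configuration profile as a second disk. The device name /dev/ubdb, the mount point /mnt/old and the salvage helper are all hypothetical; check your own profile for the real device.

```shell
#!/bin/sh
# Hypothetical preparation, run as root from the fresh distro:
#   e2fsck -f /dev/ubdb              # repair the old root first, while it is NOT mounted
#   mkdir -p /mnt/old
#   mount -o ro /dev/ubdb /mnt/old   # read-only, so nothing more is written to it

# salvage SRC DEST PATH...
# Copy each PATH (relative to SRC) to the same relative location under DEST.
# Returns 1 if any PATH was missing or only partially copied.
salvage() {
    src=$1
    dest=$2
    shift 2
    status=0
    for p in "$@"; do
        if [ ! -e "$src/$p" ]; then
            echo "missing: $p" >&2
            status=1
            continue
        fi
        parent=$(dirname "$p")
        mkdir -p "$dest/$parent"
        # cp -a keeps owners, modes and timestamps; it reports unreadable
        # files and carries on with the rest, but then exits non-zero.
        if ! cp -a "$src/$p" "$dest/$parent/"; then
            echo "partial copy: $p" >&2
            status=1
        fi
    done
    return $status
}

# Example (hypothetical paths):
#   salvage /mnt/old /root/salvaged etc home var/www
```

Copying into a separate directory rather than straight over the new root keeps a half-broken config file from clobbering a working one; you can diff and move things into place afterwards.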

I'll work up some kind of reimbursement or extra disk space for the trouble this has caused.

-Chris


Posted: Wed Dec 24, 2003 1:33 pm
Junior Member

Joined: Tue Sep 09, 2003 11:59 am
Posts: 47
Website: http://blog.griffinn.org/
Oopsie. :shock:

If your main goal is to keep the system up and running in a tolerable state rather than immediate eradication of all errors, here's my approach for your reference. I have a single ext3 filesystem on /dev/ubda running Debian GNU/Linux, so you might need to change the commands around a bit to adapt to your filesystem / device / distro.

When a running (but not very busy) Linux system dies catastrophically, the files most likely to be corrupted are:
  1. opened log files;
  2. perpetually active directories like /var/spool/* and /tmp;
  3. random inodes that are in the vicinity of inodes pointing to (1) and (2) above. I don't know why; that's just the way it is.

(1) and (2) are probably already fixed while your system rebooted. If they aren't, you can likely fix them by deleting and re-creating the offending files/directories, so they're no biggie.
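Deleting and re-creating a directory only helps if it comes back with the right permissions; /tmp, for instance, must be world-writable with the sticky bit set (mode 1777). A minimal sketch with a hypothetical recreate_dir helper. The owner and mode of each /var/spool/* directory depend on the distro and package, so copy them from a healthy system rather than guessing, and stop any daemon that uses the directory first.

```shell
#!/bin/sh
# recreate_dir DIR MODE [OWNER[:GROUP]]
# Delete DIR and everything in it, then re-create it empty with MODE
# (and OWNER if given; changing ownership needs root).
recreate_dir() {
    dir=$1
    mode=$2
    owner=${3:-}
    rm -rf "$dir" || return 1
    mkdir -p "$dir" || return 1
    chmod "$mode" "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
}

# Example: recreate_dir /tmp 1777
```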

As for (3), you'll need to rely on e2fsck (or whatever is the fsck your filesystem uses -- pardon the pun) to find them. Here's what I did.

(Execute all the following over and over until no directory-level errors are reported.)

Code:
# e2fsck -n /dev/ubda > errs
e2fsck 1.35-WIP (07-Dec-2003)
e2fsck: aborted
(The -n flag makes e2fsck perform a dry run without making changes. All the errors that e2fsck reports are now in the file "errs".)

Code:
# cat errs
[ ... Lots of lines omitted ...]
Pass 2: Checking directory structure
Directory inode 231197, block 0, offset 0: directory corrupted
Salvage? no
(Just pay attention to the last few lines; this is where directory-level errors are detected. In this example, e2fsck says inode 231197 is dead. So go and look for it.)
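When errs contains several of these, a small filter can list the inode numbers instead of you reading the report by eye. This sketch only matches the message format shown above ("Directory inode N, block B, offset O: directory corrupted"); e2fsck words other kinds of directory damage differently, so still skim the Pass 2 section yourself. The bad_dir_inodes name is hypothetical.

```shell
#!/bin/sh
# bad_dir_inodes [FILE]
# Print, one per line and numerically sorted without duplicates, the inode
# numbers of directories that e2fsck reported as "directory corrupted".
# Reads FILE, or standard input if no FILE is given.
bad_dir_inodes() {
    sed -n 's/^Directory inode \([0-9][0-9]*\),.*: directory corrupted.*$/\1/p' "${1:-/dev/stdin}" |
        sort -n -u
}

# Example: bad_dir_inodes errs
```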

Code:
# find / -inum 231197
/usr/X11R6/lib/X11/locale/koi8-u
(Let's see how screwed up this directory is.)

Code:
# ls -alF /usr/X11R6/lib/X11/locale/koi8-u
total 0
(Hmm... completely screwed up. Nonetheless it looks like this isn't a crucial directory, so we can just scrap it and re-create it. First, find out which package provides this directory. Then force a re-install of that package. -- Note: These are Debian-specific commands. Vary according to distro.)

Code:
# dpkg -S /usr/X11R6/lib/X11/locale/koi8-u
xlibs: /usr/X11R6/lib/X11/locale/koi8-u
# \rm -rf /usr/X11R6/lib/X11/locale/koi8-u
# apt-get --reinstall install xlibs
Reading Package Lists... Done
Building Dependency Tree... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 0 not upgraded.
Need to get 1377kB of archives.
After unpacking 0B of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://ftp.us.debian.org unstable/main xlibs 4.2.1-14 [1377kB]
Fetched 1377kB in 1s (1026kB/s)
(Reading database ... 21836 files and directories currently installed.)
Preparing to replace xlibs 4.2.1-14 (using .../xlibs_4.2.1-14_i386.deb) ...
Unpacking replacement xlibs ...
Setting up xlibs (4.2.1-14) ...
(Now we should be all set.)

Code:
# ls -alF /usr/X11R6/lib/X11/locale/koi8-u
total 6
drwxr-xr-x    2 root     root         1024 Dec 25 00:22 ./
drwxr-xr-x   49 root     root         2048 Dec 25 00:22 ../
-rw-r--r--    1 root     root          376 Nov 14 06:37 Compose
-rw-r--r--    1 root     root          338 Nov 14 06:37 XI18N_OBJS
-rw-r--r--    1 root     root          979 Nov 14 06:37 XLC_LOCALE
(Voila. One directory fixed. Repeat until e2fsck no longer reports directory-level errors.)
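The find and dpkg -S steps can be batched once you have the inode numbers from errs. A sketch, assuming a Debian system: for each inode it prints the path and the package that owns it, plus the rm/reinstall pair this post runs next, as a comment line. It only prints suggestions and deletes nothing, so a human still looks at every rm -rf. The -xdev flag matters because inode numbers are only unique within one filesystem. The plan_repairs name is hypothetical.

```shell
#!/bin/sh
# plan_repairs ROOT INODE...
# For each INODE on the filesystem mounted at ROOT, print
#   inode <TAB> path <TAB> owning package ("?" if dpkg doesn't know it)
# and, when a package is known, a suggested (not executed) repair command.
plan_repairs() {
    root=$1
    shift
    for ino in "$@"; do
        # -xdev: stay on ROOT's filesystem; the same number elsewhere
        # would be a different file.
        find "$root" -xdev -inum "$ino" 2>/dev/null | while IFS= read -r path; do
            # dpkg -S prints "pkg1, pkg2: path"; keep the package names only.
            pkg=$(dpkg -S "$path" 2>/dev/null | head -n 1 | cut -d: -f1 | tr -d ',')
            [ -n "$pkg" ] || pkg='?'
            printf '%s\t%s\t%s\n' "$ino" "$path" "$pkg"
            if [ "$pkg" != '?' ]; then
                printf "#   rm -rf '%s' && apt-get --reinstall install %s\n" "$path" "$pkg"
            fi
        done
    done
}

# Example: plan_repairs / 231197
```

Double-check each suggested command before running it, especially for directories you have customized yourself.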

Provided that all the corrupted directories are managed by your distro without your own customization, you can at least fix all directory-level errors that may cause programs to die whenever they access these directories.

This process does not fix everything. You'll likely end up with a bunch of orphaned inodes (directories you deleted and files therein), unattached inodes, and incorrect inode counts. But at least they're no longer showstoppers.

This process only buys you time to worry about a complete fix later. It's probably unsafe to ignore the remaining errors forever, and I do intend to bring down the filesystem and run a real e2fsck (without the -n flag) at some point. (Only not during Christmas.)

Hope this helps somehow. :)


Posted: Wed Dec 24, 2003 2:41 pm
Senior Member

Joined: Wed Oct 29, 2003 12:27 pm
Posts: 50
Thanks for the info.

Do you have any suggestions for files or directories that seem to have disappeared but are still there, in the sense that they cannot be overwritten? I have a directory that contains a file like that. I've renamed the directory to get it out of the way and then recreated it, including the missing file. But after deleting all visible files in the original directory, rmdir fails with 'Directory not empty', even though ls -la shows nothing.

Ross


Posted: Wed Dec 24, 2003 2:49 pm
Junior Member

Joined: Tue Sep 09, 2003 11:59 am
Posts: 47
Website: http://blog.griffinn.org/
First, try the direct "\rm -rf directory" approach.

If that fails, it's likely there are invisible files in that directory that rm refuses to traverse automatically. In such cases an I/O error is probably logged to /var/log/syslog, revealing the names of these files. Strangely enough, if you then rm an offending file directly by name, it works.

Or you can force an I/O error on each of these invisible files by forcing the system to traverse all directories. A command such as "find / -name some_non_existent_filename" will do this.
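The traversal trick can be aimed at just the suspicious directory and made to keep only the error messages. A sketch with a hypothetical surface_errors helper: -ls makes find look up every entry's inode (owner, size and so on), not just read its name, so damaged inodes are actually touched; the listing goes to /dev/null and only stderr remains. The same I/O errors should also show up in /var/log/syslog.

```shell
#!/bin/sh
# surface_errors [DIR]
# Walk DIR (default /) without leaving its filesystem and print only the
# errors hit along the way. The order of redirections matters: stderr is
# pointed at the original stdout first, then stdout is discarded.
surface_errors() {
    find "${1:-/}" -xdev -ls 2>&1 >/dev/null
}

# Example: surface_errors /path/to/renamed-directory
```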


Posted: Wed Dec 24, 2003 5:23 pm
Senior Newbie

Joined: Fri Aug 15, 2003 5:03 pm
Posts: 11
Spending xmas trying to rebuild my hosed Linode :cry:

(fs dead, can't even boot or create a new Linode)


Posted: Thu Dec 25, 2003 5:36 am
Junior Member

Joined: Tue Sep 09, 2003 11:59 am
Posts: 47
Website: http://blog.griffinn.org/
Yikes. That's unfortunate. :(

If you have a swap image larger than 80MB, you may consider removing it and deploying the 80MB Debian distro (the smallest on the list) from the distro wizard, then carrying out your rescue operation from there. Hopefully it's only a matter of specifying a backup superblock for e2fsck.
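If the primary superblock is what got damaged, e2fsck can be pointed at a backup with -b (and told the block size with -B). With the default of 8 × blocksize blocks per group, the first backup sits at the start of group 1: block 8193 for 1k blocks, 16384 for 2k and 32768 for 4k, the same values the e2fsck man page gives. A tiny sketch of that arithmetic; if your filesystem was created with non-default settings, running mke2fs -n with the original options lists the real locations without writing anything. The device name /dev/ubdb and the helper name are hypothetical.

```shell
#!/bin/sh
# first_backup_sb BLOCKSIZE
# Print the block number of the first backup superblock on an ext2/ext3
# filesystem created with default settings:
#   blocks per group = 8 * BLOCKSIZE (one bitmap block covers that many blocks)
#   first data block = 1 for 1k blocks, 0 otherwise
first_backup_sb() {
    bs=$1
    if [ "$bs" -eq 1024 ]; then first=1; else first=0; fi
    echo $((first + 8 * bs))
}

# Hypothetical use on an unmounted, damaged image:
#   e2fsck -b "$(first_backup_sb 1024)" -B 1024 /dev/ubdb
```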

Luckily my Debian image was healthy enough to boot, so I could do an online e2fsck to fix it as far as possible (as described above). Later I shrank the Debian image by 64MB, created a 64MB image to hold a rescue system (I took the root disk of RIP and brutally hacked /etc/inittab), rebooted into the rescue image, and ran a complete fsck of the Debian image to restore it to a healthy state. I guess it's all good now.

For the less fortunate (no spare disk space to set up a temporary rescue image, or a filesystem too corrupted to resize), I suggest that Chris help out by giving them a temporary (or permanent? :wink:) extra 80MB so they can at least deploy the Debian distro to carry out the rescue operation.

In the long run, I suggest that Linode make available a specialised rescue root image (this could be an adapted version of publicly available rescue floppies/CDs like RIP and tomsrtbt) with e2fsck, debugfs, reiserfsck, the works. Some of these rescue images are really tiny; the floppy versions are smaller than 8MB uncompressed. My idea is that the entire rescue image can be set up as an initrd that doesn't remount the real root but just drops into /bin/sh. This initrd image can be made read-only and offered to all customers as if it were one of their private disk images. Customers could then set up an emergency configuration on demand to boot into this "standard" initrd and perform their rescue operations, without having to give up 8MB of their own disk quota.
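Such an initrd could start from a directory tree like the one below. This is only a sketch, assuming a 2.4-era kernel that runs /linuxrc from the initrd, and assuming statically linked tools (for example taken from one of the rescue floppies mentioned) are copied into bin/ by hand; packing the tree into an image the host can offer read-only is left out. The make_rescue_tree name is hypothetical.

```shell
#!/bin/sh
# make_rescue_tree DIR
# Lay out a minimal initrd root in DIR whose /linuxrc never mounts the
# real root: it mounts /proc and then replaces itself with a shell.
make_rescue_tree() {
    dir=$1
    mkdir -p "$dir/bin" "$dir/dev" "$dir/proc" "$dir/mnt" || return 1
    cat > "$dir/linuxrc" <<'EOF'
#!/bin/sh
mount -t proc proc /proc
echo "Rescue shell. Your disks are not mounted; try e2fsck, debugfs, ..."
exec /bin/sh
EOF
    chmod 755 "$dir/linuxrc"
    # Still to do by hand (assumed): copy static sh, mount, e2fsck, debugfs
    # and reiserfsck into $dir/bin, create device nodes in $dir/dev, then
    # pack the tree into a filesystem image.
}

# Example: make_rescue_tree /tmp/rescue
```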

I will volunteer to create such an image if needed. (After all, it's Christmas.) I used to create custom ISOLINUX boot CDs for my customers, so I'm familiar enough with the boot process to work with initrd images.


Posted: Fri Dec 26, 2003 11:07 am
Senior Newbie

Joined: Fri Aug 15, 2003 5:03 pm
Posts: 11
All is not lost. Turns out I couldn't reboot or redeploy because Chris had to fix something on the server. I was able to deploy a new fs and recover files.


Posted: Fri Dec 26, 2003 11:10 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Strangely, /dev/zero was missing on host8's filesystem (go figure), which prevented new filesystem images from being created. I mknod'ed it back "the day after", and creating new filesystems has worked fine since.

-Chris
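A missing /dev/zero breaks anything that fills a new image with zeros, such as the common dd if=/dev/zero idiom (whether Linode's own tools work exactly this way is an assumption). A sketch of the check and the mknod fix: /dev/zero is the character device with major 1, minor 5, mode 666. Both helper names are hypothetical, and re-creating the node needs root.

```shell
#!/bin/sh
# ensure_dev_zero [NODE]
# Succeed if NODE (default /dev/zero) is a character device; otherwise
# try to re-create it (root only). Major 1, minor 5 is the zero device.
ensure_dev_zero() {
    node=${1:-/dev/zero}
    [ -c "$node" ] && return 0
    rm -f "$node" && mknod -m 666 "$node" c 1 5
}

# make_zeroed_image FILE SIZE_MB
# Write SIZE_MB megabytes of zeros to FILE, the usual first step before
# running mke2fs on a new image file.
make_zeroed_image() {
    ensure_dev_zero || return 1
    dd if=/dev/zero of="$1" bs=1024 count=$(($2 * 1024)) 2>/dev/null
}

# Example: make_zeroed_image /tmp/test.img 80
```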

