Linode Community Forums
Posted: Mon Dec 20, 2004 4:49 am
Senior Member

Joined: Sun Dec 19, 2004 6:46 pm
Posts: 58
OS: Debian Sarge
Kernel: Latest 2.4 Series (2.4.28-linode37-1um)
Host: host36
Plan: Linode96

This is the 2nd time this linode crashed in the past 24 hours.

The job queue doesn't show any shutdown or reboot request. It just shows the linode is powered off.

Last crash occurred around 2:20 am CST (10 minutes ago). Prior one happened during the day.

I'll try to reproduce the problem. Any advice in the meantime would be appreciated, particularly on what to look for in the logs produced by syslog-ng.

Also, is it possible to automatically power on the Linode when something like this happens again?

P.S. Sorry if I'm light on info; it's my turn to crash....zzz


Post subject: More info
Posted: Mon Dec 20, 2004 5:08 am
Senior Member

Joined: Sun Dec 19, 2004 6:46 pm
Posts: 58
I think I found the cause (or at least circumstantial evidence), but if this is reproducible, then in theory Linux should not be this fragile.

It seems the crash happened at exactly the same time as an hourly cron job that issues a "shorewall refresh" after adding dshield.org blacklist entries.

Perhaps this causes a crash if there is traffic (such as ssh tunnelling) at the time of the shorewall refresh. Since the cron job runs hourly and the Linode doesn't crash every hour, some other factor, such as network traffic, must be involved. At most, I have about 40 blacklist entries.
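
For anyone wanting to cut down on how often the refresh runs, here is a rough sketch (not the actual cron job from this Linode) that merges new addresses into the blacklist and reports whether anything changed, so that "shorewall refresh" is only issued when needed. The default blacklist path and the feed file name are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: merge new addresses into a blacklist file and report
# whether anything changed, so "shorewall refresh" runs only when needed.
# BLACKLIST defaults to the usual Shorewall blacklist path (an assumption
# about this setup); the feed file name below is hypothetical.
BLACKLIST="${BLACKLIST:-/etc/shorewall/blacklist}"

# merge_blacklist FEED: append addresses from FEED that are not yet listed.
# Returns 0 if at least one address was added, 1 otherwise.
merge_blacklist() {
        added=1
        while read -r ip; do
                # Accept only dotted-quad IPv4 addresses, skip everything else.
                echo "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || continue
                if ! grep -qxF "$ip" "$BLACKLIST" 2>/dev/null; then
                        echo "$ip" >> "$BLACKLIST"
                        added=0
                fi
        done < "$1"
        return "$added"
}

# Example hourly job (feed path is a placeholder):
#   merge_blacklist /var/tmp/dshield-ips.txt && shorewall refresh
```

This would not fix the crash itself; it only avoids refreshes when the blacklist has not changed.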

Anyway, it would be nice if the Linode automatically powered on when it crashes like this.


Post subject: Re: More info
Posted: Mon Dec 20, 2004 9:58 am
Senior Member

Joined: Thu Aug 28, 2003 12:57 am
Posts: 273
sarge wrote:
I think I found the cause (or at least circumstantial evidence), but if this is reproducible, then in theory Linux should not be this fragile.

It seems the crash happened at exactly the same time as an hourly cron job that issues a "shorewall refresh" after adding dshield.org blacklist entries.

Perhaps this causes a crash if there is traffic (such as ssh tunnelling) at the time of the shorewall refresh. Since the cron job runs hourly and the Linode doesn't crash every hour, some other factor, such as network traffic, must be involved. At most, I have about 40 blacklist entries.

Anyway, it would be nice if the Linode automatically powered on when it crashes like this.


I'll bet that with a little bit of lish scripting, and another host permanently connected to the network, you could accomplish this. You could write a script that, every minute, attempts to connect to your Linode's host via lish (ssh you@hostXX.linode.com), detects whether the Linode is down, and issues a boot/reboot command if it is.


Posted: Mon Dec 20, 2004 2:24 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
I've had a few reports of mysterious crashes, with no additional information in the console log, from Linodes on host machines running the 2.6.8.1-4 kernel (check with cat /proc/cpuinfo), which host36 is.

When this crash happens again, please check Lish's console log and send me that output if you can.

I'll have a new host online today with a kernel we've been testing for a while. I'd like to move your Linode over to it and see if that eliminates the problem. Up for that?

Thanks,
-Chris


Posted: Mon Dec 20, 2004 3:34 pm
Senior Member

Joined: Thu Oct 30, 2003 11:27 am
Posts: 52
Website: http://www.wasteland.org/
Location: Rochester, NY
I've been having similar problems, and have been working with Chris to try to solve them. In the meantime, here is a watchdog script that I use. I run it from a linux box at my house (via cable modem). It checks every 15 minutes to make sure it's booted. Even if it makes a mistake, the worst it will do is issue a boot command to an already running linode (which does nothing).

Replace <linode> with the name of your linode, <username> with your LPM username, and hostxx with the host your linode is on. This script depends on your having set up your ssh key in lish, and having the key loaded into an agent on the machine you're running the watchdog from. If you can "ssh username@hostxx.linode.com" and get the lish prompt without being prompted for a password, you should be set.

Dave

Code:
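#!/bin/bash
#
# Simple watchdog to make sure <linode> is running, and if not, boot it.
# Replace <linode>, <username>, hostxx and the email address before use.
#

while true; do
        echo "Checking <linode> at: $(date)"
        # The lish "version" command is expected to print a line containing
        # "OK Linux" while the Linode is running. Testing the pipeline directly
        # also treats an ssh failure or a grep error as "down".
        if ! ssh <username>@hostxx.linode.com version | grep "OK Linux"; then
                echo "<linode> down!"
                echo "<linode> is down, booting up" | mailx -s "<linode> down" youremail@hostname.com
                ssh <username>@hostxx.linode.com boot
        fi
        sleep 900       # wait 15 minutes before the next check
done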
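
If you would rather run the check from cron than keep a loop alive, something like the following one-shot variant might work. It is only a sketch: LISH_TARGET is a placeholder, and the BatchMode and ConnectTimeout ssh options are my additions (they are not in the script above) so that an unattended run neither prompts for a password nor hangs on an unreachable host.

```shell
#!/bin/bash
# One-shot variant of the watchdog, meant to be run from cron.
# LISH_TARGET is a placeholder; substitute your own LPM username and host.
LISH_TARGET="${LISH_TARGET:-username@hostxx.linode.com}"

# Run a lish command without ever prompting for a password and
# without waiting indefinitely if the host is unreachable.
lish() {
        ssh -o BatchMode=yes -o ConnectTimeout=15 "$LISH_TARGET" "$@"
}

# Succeeds if the text on stdin contains the "OK Linux" marker that the
# loop version greps for in the output of the lish "version" command.
lish_reports_up() {
        grep -q "OK Linux"
}

# Check once; issue "boot" only when the marker is missing.
check_once() {
        if lish version 2>/dev/null | lish_reports_up; then
                echo "up at $(date)"
        else
                echo "down at $(date), sending boot"
                lish boot
        fi
}

# Only act when called as "<script> check", e.g. from a crontab entry.
case "${1:-}" in
        check) check_once ;;
esac
```

A crontab line such as "*/15 * * * * /path/to/watchdog.sh check" (the path is hypothetical) would give the same 15-minute interval. Note that cron jobs do not inherit your ssh-agent, so this assumes a key that ssh can use unattended.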


Posted: Mon Dec 20, 2004 10:26 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
caker wrote:
I'll have a new host online today with a kernel we've been testing for a while. I'd like to move your Linode over to it and see if that eliminates the problem. Up for that?

The new server is online -- shoot me a support ticket and I'll set up the migration for you.

-Chris


Posted: Mon Dec 20, 2004 10:40 pm
Senior Member

Joined: Sun Dec 19, 2004 6:46 pm
Posts: 58
Chris,

I'll try to reproduce some of the same conditions that were present during the last crash and take a closer look at the existing logs.

I appreciate the offer and I'm up for moving to a new host (the sooner the better). The timing is great since we're not live yet.

Also, I prefer 2.4 kernels but am open to trying the 2.6 series as soon as you think it is ready for production use with Debian Sarge. I got interested when you mentioned the cryptoloop limitations in 2.4 kernels.

Dave,

Thanks for the script!!! Every bit helps as I'm working insane hours.

Everyone else,
So far, the Linode experience has been Chris (caker) proving that he really knows his stuff (unlike other hosting companies with clueless staff) and very helpful fellow customers in both IRC and forums. But the thing I love most so far is the ability to fix almost everything I can possibly mess up without waiting for support staff--even screwed up sshd configs or init.d scripts.


Posted: Tue Dec 21, 2004 12:00 am
Senior Member

Joined: Sun Dec 19, 2004 6:46 pm
Posts: 58
Hi,

I was able to reproduce the power-off crash on the new server after migrating.

I don't know if the crash is due to fragile code in iptables, shorewall, kernel 2.4, openssh, etc. or something else. Maybe these are red herrings, and the crash is caused by some subtle network error that generally doesn't cause problems except under special conditions like the following test.

1. I had another hosted test server run siege with 40 concurrent users against the linode server.

2. The test server and the linode had an ssh tunnel between them for these requests.

3. The linode had an ssh tunnel between it and a private server (adsl at home office) which generated the html content.

This test runs with a bunch of free RAM and idle cpu available on both the test server and linode server.

The crash occurs when I perform a 'shorewall refresh' command on the linode during the test, which adds a couple dozen blacklisted ip addresses using iptables. This command does not always bring down the server on the first try--it might have to be executed a few times to make the crash happen.
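
Since the refresh sometimes has to be run several times, a small helper like the one below could repeat it and record each attempt on disk before running it, so the attempt count should survive the sudden power-off. This is a sketch under assumptions: the log path, the pause length, and the function name are made up for illustration and are not part of my actual test setup.

```shell
#!/bin/bash
# Sketch: repeat a command during the load test and record each attempt
# on disk *before* running it, so the count survives a sudden power-off.
# ATTEMPT_LOG and the pause length are assumptions, not from the test above.
ATTEMPT_LOG="${ATTEMPT_LOG:-/var/tmp/refresh-attempts.log}"

# repeat_logged MAX PAUSE CMD...: run CMD up to MAX times, PAUSE seconds apart.
repeat_logged() {
        max=$1
        pause=$2
        shift 2
        i=1
        while [ "$i" -le "$max" ]; do
                echo "attempt $i at $(date)" >> "$ATTEMPT_LOG"
                sync    # flush the log entry to disk before the risky step
                "$@" > /dev/null 2>&1
                i=$((i + 1))
                sleep "$pause"
        done
}

# Example (run as root on the Linode while siege is active):
#   repeat_logged 10 30 shorewall refresh
```

After the Linode is booted again, the last line of the log shows which attempt was in progress when it went down.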


Posted: Tue Dec 21, 2004 12:10 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Ok. And looking at your console logs, I can see it just terminates without any debugging information.

Since you can reproduce the problem, would you mind trying 2.6.9-linode9? You will need to "mv /lib/tls /lib/tls-disabled" before booting into 2.6.9. My hope is that 2.6.9 will provide *some* console output relating to the problem.

-Chris


Posted: Tue Dec 21, 2004 12:29 am
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Actually, if you wouldn't mind popping into the IRC channel, I might have a way to get some more information out of the kernel (thanks to Jeff Dike).

http://www.linode.com/forums/viewtopic.php?t=588

-Chris


Posted: Sat Jan 08, 2005 6:52 am
Senior Member

Joined: Sun Dec 19, 2004 6:46 pm
Posts: 58
UPDATE:

The problem was reproduced using both 2.4 and 2.6 kernels.

Caker was kind enough to stay up late to monitor the host machine and help try to debug this problem. Ultimately, it is thought to be a bug in UML that was most likely introduced in a recent version.

Around December 21, at caker's request, I provided a crashkit as a standalone Linode disk image so that he could reproduce the problem.

He offered to pass the info along to the UML dev team after taking a look at the crashkit.

As-is, the crashkit requires two external machines to crash the linode. It might be possible to create a much simpler crashkit, but my schedule has become swamped (hence the message at 4:45am). I guess if another customer encounters this bug, they can attempt to create a simpler crashkit to help figure this out.

Given that the Linode completely powers off unexpectedly, I'm hoping this nasty UML bug gets fixed before it causes loss of valuable data.


Posted: Tue Jan 11, 2005 8:49 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Just an update:

Using sarge's crashkit as a jumping-off point, I've been able to reproduce this bug very easily:

http://www.theshore.net/~caker/uml/cras ... le-output/

Quote:
All you need is an UML guest assigned an IP on your network, iptables, and
the two files below in the same directory. Run the script, while ping-flooding
the UML's IP address from the host, or another machine on your network.

This script works best with a 2.6 based UML. I can recreate the crash easily using
2.6.9-linode9 (based on -bb4), also available on this website.

I've forwarded the info to Jeff, so hopefully we'll see a fix soon enough.

Thanks,
-Chris


Posted: Tue Jan 11, 2005 10:00 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Quote:
19:39 < caker> jdike: http://www.theshore.net/~caker/uml/cras ... le-output/
19:41 < caker> jdike: very easy to reproduce -- just a script, another file with IPs, and ping flood the UML's IP
19:51 < caker> time for food .. brb
20:14 < jdike> Blocking 222.64.190.133...
20:14 < jdike> Jan 11 20:06:28 usermode syslogd: recvfrom unix: Resource temporarily unavailable
20:14 < jdike> Segmentation fault
20:14 < jdike> Looking like a stack overflow
20:16 < jdike> and that doesn't surprise me too much
20:27 < jdike> When caker comes back, someone tell him that his problem smells heavily of a stack overflow
20:28 < jdike> Probably easily fixed, although I need to think about it some
20:35 < caker> jdike: thanks for taking a look at that
20:37 < jdike> caker: For some reason, you can't make the idle thread stack overflow, or at least not badly enough to cause a crash
20:37 < jdike> caker: however when its doing something else in the kernel, and I'm guessing iptables is generating deep stacks on its own, it does cause a crash
20:38 < jdike> caker: theory only, I haven't poked at it at all yet
20:38 < caker> jdike: ok -- glad it was easy enough for you to reproduce
20:38 < jdike> caker: yup
20:38 < jdike> caker: the right test case is a wondrous thing
20:40 * jdike disappears again


Post subject: patch
Posted: Wed Jan 12, 2005 1:33 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Jeff is QUICK:

http://marc.theaimsgroup.com/?l=user-mo ... 608577&w=2

Code:
This patch fixes a long-standing problem in skas mode process creation.  Chris
Aker has been seeing it at linode, and found a way of reproducing it.  Once
I spotted the bug, I found an easier way:

ping flood the UML from the host while running
   while true; do ls > /dev/null; done

In 10-15 seconds, UML will simply exit back to the shell with a segfault,
no panic, no output, no nothing.

When UML sets up the kernel stack for a new process, it sends itself a
SA_ONSTACK signal with the signal stack being the new kernel stack.  It
calls setjmp there to set up a context that it can longjmp to when the new
process is run for the first time.

The problem was that, while signals were blocked during this, they were
re-enabled before SA_ONSTACK was disabled.  Thus, a signal arriving at the
wrong time, between signals being turned on and SA_ONSTACK being disabled,
would cause the signal to be handled on the stack, destroying the context
that had been set up there.

When the new process ran, it would longjmp to this trashed stack, and UML
would die.

The fix is obvious:

Index: 2.6.10/arch/um/kernel/skas/process.c
===================================================================
--- 2.6.10.orig/arch/um/kernel/skas/process.c   2005-01-12 11:17:22.000000000 -0500
+++ 2.6.10/arch/um/kernel/skas/process.c   2005-01-12 11:18:03.000000000 -0500
@@ -323,9 +323,10 @@
    block_signals();
    if(sigsetjmp(fork_buf, 1) == 0)
       new_thread_proc(stack, handler);
-   set_signals(flags);
 
    remove_sigstack();
+
+   set_signals(flags);
 }
 
 void thread_wait(void *sw, void *fb)

            Jeff


I'll be releasing kernels later today that fix this and the recent local root exploit vulnerability.

-Chris


Posted: Thu Jan 13, 2005 2:12 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Update: I can still reproduce this bug. Jeff is having another look.

-Chris

