Linode Forum
Linode Community Forums
Posted: Sat Jun 05, 2010 1:02 pm
Senior Newbie

Joined: Fri Jan 29, 2010 7:34 am
Posts: 8
Hi,

Had a couple of odd server hangs recently and wonder if anyone had any good tips on tools to use to diagnose them.

The server's running Debian Lenny, Apache, MySQL, PHP and Postfix/Dovecot for mail handling.

The situation is that CPU usage hits 100% and then the server locks up - memory and disk seem to be running fine.

This was odd because the Linode monitoring was showing that the server was up, but the shell access through the Linode website got no further than Lish, rather than coming up with the login prompt it would normally give if the server were up.

I'm assuming that the server was responding to pings, but everything else was locked up.

In the past when I've had problems it's been PHP and I've been able to do the following:

1) ensure max execution time limit set in php.ini
2) check using ps aux / top to see what was stuck (httpd in previous cases)
3) having set ExtendedStatus On in httpd.conf, I could use lynx localhost/server-status through the shell on Linode to see detailed Apache diagnostics and find the culprit script

Obviously, without any kind of shell access this was not possible, and all I could do was restart.

What I'd like to know is:

1) what useful things could I be logging so that when I grep my logfiles after restarting I could see what had gone wrong

2) how could I force a hard %CPU execution limit on all processes from a single executable - i.e. ensure that if it is PHP then it can't take 100% CPU, so that I'll always be able to shell in.

I've googled various things (and, as I said above, the ExtendedStatus was useful), and my guess is that there are _many_ ways to do this. So I wonder if anyone could share their favourite quick tips / pointers to ways to limit damage and log excessive resource utilisation?

Best wishes

Peter


Posted: Mon Jun 07, 2010 12:21 pm
Senior Member

Joined: Tue May 26, 2009 3:29 pm
Posts: 1691
Location: Montreal, QC
One thing you can try is leaving 'top' running on your lish console. When your machine locks up, log in to lish and check which process is using the CPU.

I think it's more likely that you're OOMing rather than the CPU maxing out, because you've got four virtual cores; full CPU usage would show up as 400%, not 100%. It also means that you'd need four processes hammering the CPU to lock up the node, and even then merely maxing out a CPU doesn't cause the machine to lock up.


Posted: Mon Jun 07, 2010 12:43 pm
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
Check /var/log/syslog and /var/log/messages for kernel warnings. You can also install munin: http://library.linode.com/server-monito ... an-5-lenny

What kernel version are you running? I had a similar problem with random hangs: for some reason the clock stopped, which caused all kinds of hell, and switching kernel fixed it. If you have a lot of repeated timestamps at the end of your logs, it could be that you had the same issue.
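One rough way to check for that pattern is to count how many lines at the very end of the log share the same timestamp. This is only a minimal sketch: it assumes the classic syslog layout where the first 15 characters of each line are the timestamp (e.g. "Jun 10 07:24:01"), and the function name is made up for illustration.

```python
from itertools import takewhile

# Assumption: classic syslog lines such as "Jun 10 07:24:01 host kernel: ...",
# where the first 15 characters form the timestamp.
TIMESTAMP_WIDTH = 15


def trailing_timestamp_run(lines):
    """Return (timestamp, count) for the run of identical timestamps
    at the end of the log, or (None, 0) if there are no lines."""
    stamps = [line[:TIMESTAMP_WIDTH] for line in lines if line.strip()]
    if not stamps:
        return None, 0
    last = stamps[-1]
    # Walk backwards from the end while the timestamp stays the same.
    count = sum(1 for _ in takewhile(lambda s: s == last, reversed(stamps)))
    return last, count


# Example use (the path is the one mentioned above):
#   with open("/var/log/messages") as log:
#       print(trailing_timestamp_run(log.readlines()))
```

A large count is only a hint, not proof: a reboot or a burst of kernel messages can also write many lines within the same second.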


Post subject: Thanks for the replies
Posted: Fri Jun 11, 2010 6:16 am
Senior Newbie

Joined: Fri Jan 29, 2010 7:34 am
Posts: 8
Hi guys,

Thanks for the replies. It happened again yesterday so I'm really focusing on it again.

@Guspaz - sorry, I should have mentioned that it does max out at 400% (I meant 100% of the physical server, but you're right, that shows as 400% on the Linode web interface).

Leaving top running might not be practical because it happens about once a fortnight, and when I spot it I can't even log in through the lish console. Yesterday it at least came up with a login prompt, but hung when the password was typed in.

What I'd like is to be able to limit any apache/php/etc. processes to, say, 80% of total CPU, then at least I'd be able to shell in through lish or directly and see what's going on.
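For what it's worth, a standard rlimit cannot express "80% of total CPU". The closest simple tools are a lower scheduling priority (nice) and a cap on total CPU seconds (RLIMIT_CPU). The sketch below is not a percentage cap, and it only affects processes started through this wrapper, not Apache children that are already running. The function name and default values are assumptions for illustration; it is Unix-only.

```python
import os
import resource
import subprocess


def run_limited(command, cpu_seconds=60, niceness=10):
    """Run `command` at lower priority with a CPU-time budget.

    Returns the exit status; a negative value means the child was
    killed by a signal (for example SIGXCPU when the budget ran out).
    """
    def apply_limits():
        # Runs in the child process just before the command is exec'd.
        os.nice(niceness)
        # Soft limit: the kernel sends SIGXCPU after `cpu_seconds` of CPU time.
        # Hard limit: one second later the kernel sends SIGKILL.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds + 1))

    return subprocess.run(command, preexec_fn=apply_limits).returncode


# Example use (hypothetical script path):
#   run_limited(["php", "/path/to/script.php"], cpu_seconds=30)
```

A niced process can still use all idle CPU; it just yields to higher-priority processes such as a login shell when they want to run.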

@obs - Thanks for the suggestions - the link to linode's library with a section called 'server-monitoring' made me smile. I tried Google when I should have first tried Linode ;-)

As for kernel version, uname -a tells me:
2.6.18.8-linode22 #1 SMP Tue Nov 10 16:12:12 UTC 2009 i686 GNU/Linux

which I _think_ is Linode's latest debian build

Looking at /var/log/messages I can see a boatload of messages at the same timestamp, but they are all related (apparently) to the hard reboot we did. Just before that there are some telltale memory ones.

Every day there is the usual
kernel: imklog 3.18.6, log source = /proc/kmsg started.
rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="1112" x-info="http://www.rsyslog.com"] restart

But yesterday morning this was followed an hour or so later by a stream of kernel memory ones, e.g.:

<:0 all_unreclaimable? no
.....
oom-killer: gfp_mask=0x201d2, order=0
[<c014f979>] out_of_memory+0x1c9/0x200
[<c015172f>] __alloc_pages+0x28f/0x310
[<c0152af9>] __do_page_cache_readahead+0x139/0x2f0
.....
HighMem per-cpu: empty

So OOM seems to be the final thing that killed it, with the CPU maxing out because some process was hanging as it couldn't get any memory alloc-ed? Does that sound reasonable?
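To find these after a reboot without scrolling through the whole file, something like the following could pull out the OOM-related lines with their line numbers. It is a minimal sketch: the pattern only covers the two spellings visible in the excerpt above ("oom-killer" and "out_of_memory") plus the spaced form, and the function name is invented for this example.

```python
import re

# Matches "oom-killer", "out_of_memory" and "out of memory" in any letter case.
OOM_PATTERN = re.compile(r"oom-killer|out[ _]of[ _]memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return (line_number, text) for every line that mentions the OOM killer."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_PATTERN.search(line)
    ]


# Example use:
#   with open("/var/log/messages") as log:
#       for number, text in find_oom_lines(log):
#           print(number, text)
```

The line numbers make it easier to jump back to the file and read what was logged just before the first hit.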


Looking in /var/log/syslog at the time /var/log/messages complained (7:24 am) I can see something extremely suspicious... quite a few of these:

postfix/local[24786]: warning: maildir access problem for UID/GID=33/33: create maildir file /var/www/Maildir/tmp/1276153966.P24786.li62-252.members.linode.com: Permission denied
postfix/local[24786]: warning: perhaps you need to create the maildirs in advance

I'll go and check out the server-monitoring stuff you suggest, but my working hypothesis now is that I've royally screwed up the postfix configuration..... I'm sure that that would be quite capable of bringing the server to its knees....

Thanks again to both of you for replying to my post. Sorry about the long reply. What I really like about this forum is that you get a chance not just to fix your screw-ups, but learn something along the way :-)

Cheers

Peter


Posted: Fri Jun 11, 2010 8:27 am
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
Something chewed up your RAM, hence the OOM error. It could be postfix, though I'm not sure why postfix is trying to write to /var/www; mine doesn't do that.

If in doubt purge postfix (not just uninstall) then reinstall and re-create your config files. If you're only sending emails from postfix then it should only take around 10-15 mins.


Posted: Fri Jun 11, 2010 9:56 am
Senior Member

Joined: Tue May 26, 2009 3:29 pm
Posts: 1691
Location: Montreal, QC
You don't need to log in via LISH. Just run top and leave LISH logged in when you disconnect. You should then be able to reconnect (since LISH is not hosted by your linode) and see the results of top. If you're OOMing, I'd suggest sorting by RAM usage (capital M).

Another thing to do is log the output of 'ps aux' to a file, so when it goes down you can reboot and check out the log.


Post subject: Will try that
Posted: Fri Jun 11, 2010 12:02 pm
Senior Newbie

Joined: Fri Jan 29, 2010 7:34 am
Posts: 8
Thanks for the further replies....

@Guspaz,

Didn't know that Lish would stay logged in when I left the webpage - I assumed it would time out somehow. Useful to know.

@obs,
To get Postfix up and running I followed a pretty good howto - http://workaround.org/ispmail/lenny/

It recommends setting dovecot as follows:
mail_location = maildir:/var/vmail/%d/%n/Maildir

I saw an earlier FAQ version that had it pointing to /home/vmail which is a bit pants, but I'm not in a rush to move it....

You're right that it might not be postfix - just that that was the thing logging errors into the syslog closest to the crashtime.

I think I've fixed the postfix config problem now, but will leave lish running top.

re: logging ps aux to a file, I assume you're thinking of something like a 5 minute cronjob with a day or so of logrotate.
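If the snapshots were written as one file per run instead of going through logrotate, the "day or so" of retention could be handled by a small cleanup step run from the same cronjob. This is just a sketch of that alternative, not logrotate itself; the function name, the directory in the usage comment and the one-day default are all assumptions.

```python
import os
import time


def prune_old_snapshots(directory, max_age_seconds=86400, now=None):
    """Delete regular files in `directory` last modified more than
    `max_age_seconds` ago and return the names that were removed."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed


# Example use (hypothetical directory that holds only snapshot files):
#   prune_old_snapshots("/var/log/ps-snapshots")
```

Pointing this at a directory that holds anything other than snapshots would delete those files too, so it should have a directory of its own.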

Thanks again for the comments. I'm determined to get to the bottom of this :-)

Peter

