Hi guys,
Thanks for the replies. It happened again yesterday so I'm really focusing on it again.
@Guspaz - sorry, I should have mentioned that it does max out at 400% (I meant 100% of the physical server, but you're right, that shows as 400% on the Linode web interface).
Leaving top running might not be practical because it happens about once a fortnight, and when I spot it I can't even log in through the lish console. Yesterday it at least came up with a login prompt, but hung when the password was typed in.
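What might work instead of leaving top running is a small script, run from cron every minute, that appends the busiest processes to a log file, so there is something to read after the reboot. A rough sketch, assuming Python is installed and that ps accepts `-eo pid,pcpu,pmem,comm`; the function names and the log path are just placeholders I made up:

```python
import subprocess
import time


def top_processes(ps_output, count=5):
    """Return the `count` busiest processes from `ps -eo pid,pcpu,pmem,comm` output."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore anything that is not a full row
        pid, pcpu, pmem, comm = parts
        rows.append((float(pcpu), float(pmem), int(pid), comm.strip()))
    rows.sort(reverse=True)  # highest CPU first, memory as tie-breaker
    return rows[:count]


def log_snapshot(path="/var/tmp/proc-snapshot.log"):
    """Append a timestamped list of the busiest processes to `path` (hypothetical location)."""
    out = subprocess.check_output(["ps", "-eo", "pid,pcpu,pmem,comm"]).decode()
    with open(path, "a") as fh:
        fh.write(time.strftime("%Y-%m-%d %H:%M:%S") + "\n")
        for pcpu, pmem, pid, comm in top_processes(out):
            fh.write("  %5.1f%%cpu %5.1f%%mem pid=%d %s\n" % (pcpu, pmem, pid, comm))
```

Calling log_snapshot() at the bottom of the file and running it from cron should leave a trail of who was eating CPU and memory in the minutes before the box stops responding.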
What I'd like is to be able to limit any apache/php/etc. processes to, say, 80% of total CPU; then at least I'd be able to shell in through lish or directly and see what's going on.
@obs - Thanks for the suggestions - the link to Linode's library with a section called 'server-monitoring' made me smile. I tried Google when I should have tried Linode first.
As for kernel version, uname -a tells me:
2.6.18.8-linode22 #1 SMP Tue Nov 10 16:12:12 UTC 2009 i686 GNU/Linux
which I _think_ is Linode's latest Debian build.
Looking at /var/log/messages I can see a boatload of messages at the same timestamp, but they are all related (apparently) to the hard reboot we did. Just before that there are some telltale memory ones.
Every day there is the usual
kernel: imklog 3.18.6, log source = /proc/kmsg started.
rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="1112" x-info="http://www.rsyslog.com"] restart
But yesterday morning this was followed an hour or so later by a stream of kernel memory messages, e.g.:
<:0 all_unreclaimable? no
.....
oom-killer: gfp_mask=0x201d2, order=0
[<c014f979>] out_of_memory+0x1c9/0x200
[<c015172f>] __alloc_pages+0x28f/0x310
[<c0152af9>] __do_page_cache_readahead+0x139/0x2f0
.....
HighMem per-cpu: empty
So OOM seems to be the final thing that killed it, with the CPU maxing out because some process was hanging as it couldn't get any memory alloc-ed? Does that sound reasonable?
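To check whether the earlier lock-ups line up with OOM events too, I could pull the relevant lines out of the logs rather than scrolling through them. A minimal sketch, assuming the kernel messages contain either "oom-killer" (as in the excerpt above) or the phrase "out of memory"; the second pattern is my guess at the wording, and the function names are my own:

```python
import re

# "oom-killer" appears in the excerpt above; "out of memory" is an assumed
# variant, matched case-insensitively because the exact wording may differ.
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_events(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as fh:
        return find_oom_events(fh)
```

Running scan_log() over the current and rotated copies of /var/log/messages would show whether the timestamps match the previous hangs.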
Looking in /var/log/syslog at the time /var/log/messages complained (7:24 am) I can see something extremely suspicious... quite a few of these:
postfix/local[24786]: warning: maildir access problem for UID/GID=33/33: create maildir file /var/www/Maildir/tmp/1276153966.P24786.li62-252.members.linode.com: Permission denied
postfix/local[24786]: warning: perhaps you need to create the maildirs in advance
I'll go and check out the server-monitoring stuff you suggest, but my working hypothesis now is that I've royally screwed up the postfix configuration..... I'm sure that that would be quite capable of bringing the server to its knees....
Thanks again to both of you for replying to my post. Sorry about the long reply. What I really like about this forum is that you get a chance not just to fix your screw-ups, but to learn something along the way.
Cheers
Peter