sweh wrote:
Run this program from cron every 5 minutes:
Code:
#!/bin/ksh -p
LOG=/var/tmp/srvr_stat.$(date +%Y%m%d)
{
date
uptime
free
ps aux
echo
echo
} >> "$LOG"
This'll let you see some basics of what your machine is doing; in particular, free memory (are you swapping to death?) and processes using lots of CPU. After your machine crashes, you can review the log files to see what happened.
OK, I modified the program to write a new file every 5 minutes and to put these files in a new directory every day.
Code:
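#!/bin/ksh -p
# Take the timestamp once so the directory and file names always agree,
# even when the job runs right at midnight.
NOW=$(date +%Y-%m-%d-%H-%M)
# Strip the trailing -HH-MM to get the day part (YYYY-MM-DD).
DIR=/root/log_for_crash_detect/day_${NOW%-*-*}
mkdir -p "$DIR"
LOG=$DIR/log_$NOW
{
date
uptime
free
ps aux
echo
echo
} >> "$LOG"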
This way it will be easier to track down the problem.
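For reviewing the snapshots after a crash, something like the following might help. It is only a rough sketch: `top_rss` is a name I made up, and it assumes the usual Linux `ps aux` column layout (RSS in column 6, PID in column 2, command in column 11). It skips the date, uptime and free output and prints the biggest processes by resident memory from one snapshot file.

```shell
#!/bin/sh
# top_rss FILE [N]: print the N largest processes by RSS (in KiB) found in
# the "ps aux" part of one snapshot file. Hypothetical helper, assumes the
# standard Linux "ps aux" column order.
top_rss() {
    file=$1
    count=${2:-5}
    awk '
        # The ps aux header marks the start of the process list.
        $1 == "USER" && $2 == "PID" { in_ps = 1; next }
        # Print RSS, PID and the first word of the command.
        in_ps && NF >= 11 { print $6, $2, $11 }
    ' "$file" | sort -rn | head -n "$count"
}
```

Usage would be along the lines of `top_rss "$LOG" 3`, run against the last few files written before the crash, to see whether one process keeps growing.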
I really suspect that fail2ban is the killer.
This particular Linode does not run anything that resource-intensive: it runs a mail server, an SVN server and a proxy server, and I use it for tunneling.
I think the problem is in fail2ban because I know it has many problems analyzing big files.
I rotate my maillog every week, but it can grow to 300MB, and I think this may cause problems for fail2ban.
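To check whether the crashes line up with a large maillog, the log size could be recorded in each snapshot too. A minimal sketch, where `size_mb` is a made-up helper name and `/var/log/maillog` is only an assumed path (it differs between distributions):

```shell
#!/bin/sh
# size_mb FILE: print the size of FILE in whole megabytes (rounded down).
# Hypothetical helper; returns non-zero if FILE cannot be read.
size_mb() {
    bytes=$(wc -c < "$1") || return 1
    echo $(( bytes / 1048576 ))
}
```

Inside the `{ ... }` block of the snapshot script this could be used as `echo "maillog: $(size_mb /var/log/maillog) MB"`, with the path adjusted to wherever the mail log actually lives.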
In any case, I will keep you posted if I discover anything more.
Thanks for helping me track down the problem.