Ok, so for a few days now I've been getting strange freezes. This started immediately after I upgraded from Karmic to Lucid. The load average climbs way up, into the 30-70 range, and the system becomes totally unresponsive: network services, the console, everything.
It happens roughly once a day, but not at any regular time. Sometimes the system runs for 2 days before freezing, sometimes it freezes a few times in one day.
I can't find anything interesting in the logs. I only know the load average is high because I leave a terminal open with htop running, and the last screen displayed before the SSH session disconnects shows something like this (the output is mangled by a broadcast shutdown message overwriting it):
Code:
kiomava@h2:~$ age from root@h4 1.6%] Tasks: 382 total, 1 running
2 [ unknown) at 19:10 ... 0.0%] Load average: 38.85 37.81 33.63
Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||868/2020MB] Uptime: 2 days, 12:29:11
The system is going down for power off NOW! 42/719MB]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
32657 kiomava 20 0 2816 1472 940 R 1.0 0.1 1h15:41 htop
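Since that open htop session is the only thing catching the spike, one idea I'm considering is logging the load average to disk every few seconds, so the last lines written before a freeze survive for inspection after the reboot. A minimal sketch (the log path and interval are arbitrary choices, not anything official):

```shell
#!/bin/sh
# Hypothetical load-average logger: append a timestamped copy of
# /proc/loadavg every 5 seconds. After a forced reboot, the tail of
# the file shows how fast the load climbed before the freeze.
LOGFILE=/var/tmp/loadavg.log
while true; do
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(cat /proc/loadavg)" >> "$LOGFILE"
    sleep 5
done
```

Running it under nohup or from an @reboot cron entry would keep it alive across reboots.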
The only way to recover is to issue a reboot from the Linode dashboard; otherwise it stays unresponsive for hours.
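Another thought: enabling the magic SysRq key ahead of time, so that next time it hangs I can try a task dump from the out-of-band console instead of blindly rebooting. A sketch (the sysctl.conf edit is the stock Ubuntu way to persist it; whether the kernel is still alive enough to respond is anyone's guess):

```shell
# Enable the magic SysRq key now, and persist the setting across reboots.
sudo sysctl -w kernel.sysrq=1
echo 'kernel.sysrq = 1' | sudo tee -a /etc/sysctl.conf
# On the next hang, sending SysRq 't' from the console (however the
# console lets you send it) should, if interrupts are still being
# serviced, dump every task's stack to the kernel log, which might
# finally show what everything is blocked on.
```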
I've tried several kernels, pretty much all the recent 2.6 ones. The problem first appeared while I was running the latest 2.6 paravirt kernel.
This system has been in more or less the same configuration (same installed packages, config files, etc.) for a year. It all started exactly when I upgraded to Lucid, so it seems almost certain that some Lucid-specific problem is in play.
I have munin running, and I see nothing leading up to the points where it freezes. There is just a discontinuity in the various graphs from the point where it dies until it recovers. Even the load average doesn't spike in the graphs; it must climb too quickly for munin's polling interval to catch and log. Only the running htop session catches the spike. Munin shows nothing suspicious: no slowly increasing memory/CPU/load, no creeping process count, nothing. Just perfect normalness until it dies.
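Since munin's polling is too coarse to catch the spike, I could also sample much faster and snapshot the process table whenever the load crosses a threshold, which should reveal what is piling up (the STAT column would show processes stuck in uninterruptible D state, for instance). A rough sketch; the threshold, interval, and log path are all guesses to be tuned:

```shell
#!/bin/sh
# Hypothetical spike sampler: every 10 seconds, if the 1-minute load
# average is at or above THRESHOLD, append a timestamped snapshot of
# the top of the process table to a log for post-mortem inspection.
THRESHOLD=10
LOG=/var/tmp/spike-ps.log
while true; do
    load1=$(cut -d' ' -f1 /proc/loadavg)
    # Compare as an integer by stripping the fractional part.
    if [ "${load1%%.*}" -ge "$THRESHOLD" ]; then
        { date; ps axo pid,stat,pcpu,pmem,comm --sort=-pcpu | head -n 30; } >> "$LOG"
    fi
    sleep 10
done
```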
The logs also show nothing interesting, just a discontinuity when it dies. I've scoured the apache logs, Java appserver logs, etc., and found nothing notable around the time of the freezes.
There are no cron jobs scheduled near when these freezes happen. They occur at seemingly random times; I haven't seen any pattern such as it failing around the same time of day.
So...
Has anybody seen anything similar?
Anybody have suggestions how to better instrument this to see what's going on?
Any other suggested courses of action to fix this?
Thanks in advance for any help you all can offer.