Linode Forum
Linode Community Forums
Posted: Fri Jun 25, 2010 11:39 am
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
Hi stefantalpalaru,

Good to know. Mine is still OK at the moment. Hopefully we can try to track this down! Let's collect some basic information that might be helpful to those more knowledgeable about the clock / timer freezing issue. Here are some questions I thought might be important, as well as my personal answers.

stefantalpalaru and obs, and anyone else who has had the clock freeze issue, can you both respond to these questions so we can make sure we're on the same page?

1) Can you confirm that the nature of the clock freeze is identical to what I described in my first post? (System is still reachable and serves web pages, but cronjobs don't run, load average statistics don't update, problems with interactive ssh sessions, non-working lish console. Running "ssh me@mylinode uptime" shows some fields that do update (number of users, amount of uptime) and some fields that don't (current time, load averages). Running "ssh me@mylinode date" still updates with the correct time.)

2) What distribution are you running? (I'm on Debian lenny / stable)

3) What Linode plan? (I'm on the Linode 1024, which was 720 when I had my two clock freezes)

4) What kind of load was that kernel seeing? (Based on Linode graphs, I'm typically around 4-5% CPU, sometimes bursting to maybe 110% for some batch jobs. I use almost zero swap space and make a serious effort to keep everything in RAM. This includes vm.swappiness=0, use of memcached, and making strategic use of "tmpfs" RAM-backed filesystems for certain parts of my application.)

5) Are you running ntpd? (I have seen clock freezes on 2.6.33-linode24 both with and without ntpd, but just for the record...)

6) Have you had this issue with other kernels too? (So far, I've only experienced it with 2.6.33-linode24 -- not yet with 2.6.34-linode26.)

7) How much uptime did the box have before the clocks froze? (I had roughly 10-15 days uptime on both occurrences.)

8) Have you been able to make a correlation / guess as to whether the issue occurs with high CPU usage, high IO usage, high network usage, etc? Any unusual log messages from those incidents? (I have not been able to find anything that I thought might be related.)

9) Do you have any way to quickly / controllably reproduce the clock freeze? (Unfortunately I don't.)

10) What datacenter are you in? (I'm in Newark)
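
For anyone who wants to check for the symptom in question 1 without an interactive session, here is a minimal sketch. It assumes a procps-style `uptime` whose first field is HH:MM:SS, and it relies only on the observation above that `uptime` shows a stale time while `date` keeps updating. The helper names (hms_to_seconds, clock_gap, check_clock_freeze) and the 60-second default threshold are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from any package.

# Convert HH:MM:SS to seconds since midnight (awk reads "08" as decimal 8).
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Gap in seconds between two HH:MM:SS times, allowing for midnight wrap.
clock_gap() {
    a=$(hms_to_seconds "$1")
    b=$(hms_to_seconds "$2")
    d=$(( a - b ))
    if [ "$d" -lt 0 ]; then d=$(( 0 - d )); fi
    if [ "$d" -gt 43200 ]; then d=$(( 86400 - d )); fi
    echo "$d"
}

# Compare the time printed by `uptime` with the time printed by `date`.
# Assumes the first field of `uptime` output is HH:MM:SS.
check_clock_freeze() {
    from_uptime=$(uptime | awk '{ print $1 }')
    from_date=$(date +%H:%M:%S)
    gap=$(clock_gap "$from_uptime" "$from_date")
    if [ "$gap" -gt "${1:-60}" ]; then
        echo "possible clock freeze: uptime=$from_uptime date=$from_date"
        return 1
    fi
    echo "clock looks consistent (gap ${gap}s)"
}
```

Since interactive sessions misbehave during a freeze, the idea would be to save this on the box and call check_clock_freeze through a one-shot "ssh me@mylinode ..." command, the same way as the uptime and date checks above.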

Mike


Posted: Fri Jun 25, 2010 11:44 am
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
stefantalpalaru wrote:
2.6.34-linode26 has frozen the clock for me on 2 different linodes.

Hrmm, well that's not good. Maybe I didn't test for long enough. Have you raised a ticket with support?


Posted: Fri Jun 25, 2010 12:05 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
No need, we're already watching this thread.


Posted: Sat Jun 26, 2010 12:27 pm
Newbie

Joined: Fri Jun 25, 2010 6:08 am
Posts: 2
compumike, here's the info:

1. the system responds to ping and I can ssh into it, but can't input anything in the interactive session. The time I see in the shell prompt is way off. The web server doesn't work, CPU usage is at 100%.

2. Debian unstable and Gentoo ~x86

3. Linode 1024 and Linode 4096

4. very low CPU usage. see the munin graphs: http://munin.od-eon.com/com/od-eon.com/index.html

5. yes, ntpd runs on both linodes

6. yes, all the paravirt kernels I've tried, with varying periods of time between clock freezes (some of them lasted for more than a month). But I need the latest DRBD version so I keep trying to stabilize it. Most of the kernels had custom configs (booted with PV_GRUB) so I was pretty much on my own, but now I see the same problem with the official config.

7. last uptimes: 5 and 2 days

8. no, but I suspect it's all triggered by the clock as presented by Xen

9. no

10. Dallas and Newark


Last edited by stefantalpalaru on Sat Jun 26, 2010 2:11 pm, edited 1 time in total.

Posted: Sat Jun 26, 2010 12:35 pm
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
Here's my info

1) Yes, the web server still serves pages (but not for long, since the firewall detects a SYN flood and blocks connections). SSH, however, doesn't work: an already-connected session just locks up, and a new connection hangs while connecting.

2) Ubuntu 9.10 32 bit.

3) 512 which was 360 when the freeze happened.

4) Not sure since it was a long time ago, but if I was to hazard a guess probably around 5-10%.

5) Yes - logs didn't show NTPD trying to change the time if memory serves.

6) No (I'm currently using pv_grub with Ubuntu's ec2 kernel)

7) 12-24 hours at most, the server is in use almost 24/7 and people tended to yell at me as soon as it locked up

8) Absolutely nothing. I delved into my logs and could find diddly squat; it seemed completely random.

9) Again no sadly.

10) Dallas


Posted: Sat Jun 26, 2010 3:23 pm
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
Hi obs,

Two days ago you said, in reference to 2.6.34-linode26:

Quote:
It's been running for 2 days no issues! This kernel's a keeper!


Is that box still running -- now up to 4 days / 96 hours? If so, that seems like a significant departure from your

Quote:
12-24 hours at most


description about earlier kernels.

Mike


Posted: Sat Jun 26, 2010 6:06 pm
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
No, it's not still running. It was a clone, since I didn't want to risk the live server, and after 2 days I deleted it. But it did run fine for 2 days, which is an improvement on the 12-24 hours :)


Posted: Mon Jun 28, 2010 11:29 pm
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
For what it's worth, my Linode is now at 10 days, 21 hours uptime running 2.6.34-linode26 with no issues at all -- so far, so good! However, my last two timer freezes (both on the earlier kernel version 2.6.33-linode24) were with roughly 10 and 14 days uptime, so we're not out of the woods yet.

Can anyone suggest useful things to try to record in the event that it does freeze again (before rebooting)? Catting particular files within /proc perhaps?
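
As a starting point, here is a rough sketch of a snapshot script. The choice of files is a guess at what might be relevant to a timer problem, not a list confirmed by anyone in this thread. /proc/timer_list in particular may be missing or unreadable depending on the kernel config, so anything unreadable is skipped. The function name snapshot_clock_state and the default output path are made up.

```shell
#!/bin/sh
# Sketch only: snapshot_clock_state is a hypothetical helper name.
# Copies timer-related kernel state into a directory for later comparison.
snapshot_clock_state() {
    out=${1:-/root/clock-freeze-$(date +%s)}
    mkdir -p "$out" || return 1

    # `date` was reported to keep working during a freeze, so record it first.
    date > "$out/date.txt"
    uptime > "$out/uptime.txt" 2>&1 || true

    # Files that do not exist or cannot be read on this kernel are skipped.
    for f in /proc/uptime /proc/loadavg /proc/interrupts /proc/timer_list \
             /sys/devices/system/clocksource/clocksource0/current_clocksource \
             /sys/devices/system/clocksource/clocksource0/available_clocksource
    do
        if [ -r "$f" ]; then
            cat "$f" > "$out/$(echo "$f" | tr '/' '_').txt" 2>/dev/null || true
        fi
    done

    # Last kernel messages, in case the timer code logged anything.
    dmesg 2>/dev/null | tail -n 100 > "$out/dmesg-tail.txt"
    echo "$out"
}
```

Running it once on a healthy box and once during a freeze (through a one-shot ssh command rather than an interactive session) would give two directories to diff. Running it twice a few seconds apart during a freeze might also show which counters have stopped moving.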

With great speculation: this may be load triggered in some way, either as a cumulative load in some kernel variable that isn't getting reset properly, or as an instantaneous load that causes some virtual interrupt to get missed or something like that. Alternatively, it may be host-triggered.

However, the fact that Obs suggests that it happens rather consistently on a 12-24 hour time span is really interesting. Obs, does this mean you're still manually rebooting it every 12-24 hours at this point? When you made a clone to test with the new 2.6.34-linode26 kernel for two days, was that clone taking any of your client load, or was it unloaded?

Since mine only occurs on the time period of weeks on the 2.6.33-linode24 kernel, it's just about impossible for me to do any testing. But if I had a setup that I knew would lock up within a period of minutes or hours, then I think we could really get to the bottom of this.


Posted: Mon Jun 28, 2010 11:51 pm
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
I currently run the Ubuntu 2.6.31-307-ec2 kernel using pv_grub, so no, I don't reboot every 12-24 hours (I'd have no business if I did!)

The clone was under load, I asked my users to use the server as normal and they did.

One thing I'm pretty sure it's not related to is network load: my backups are sent to S3 storage every night around the same time and it never crashed during that. It really was quite random.

I can tell you what runs on the box.

Nginx takes the brunt of the web serving for static files and passes PHP back to Apache. MySQL is running as the database, and there's also an NSD DNS server running. The usual system utilities (logrotate, munin, etc.) are running.

Now, if memory serves, the last lockup was around 4:37pm. With it being such an odd time, no cron jobs or backups were running. I checked my nginx, mysql, munin etc. logs from when it happened and I couldn't spot anything unusual.


Posted: Tue Jun 29, 2010 9:14 am
Senior Newbie

Joined: Tue Feb 16, 2010 4:32 pm
Posts: 6
My linode froze up sometime in the middle of last night running the 2.6.34 kernel after 6 days of uptime. It's the first time I've ever had the problem, though; the previous kernels have all been flawless for me.

I'll add a few additional notes about my setup on the off chance it helps get to the bottom of this as there's nothing obvious in my logs to say what happened:

  • Webserver still serves pages for a short period. SSH lets me connect but doesn't allow me to issue commands.
  • OS is Ubuntu 10.04 32-bit running the 2.6.34 kernel
  • Linode is a 512 (formerly a 360)
  • Load on the server is very low
  • NTPD didn't try to correct the time as far as I can tell
  • The linode is located in Newark


Posted: Tue Jun 29, 2010 12:17 pm
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
Hi Dru, welcome to the unlucky club.

Did you have any cron jobs running? Can you provide a timestamp from when it was happening? Maybe the linode guys can check the host for any weirdness at that time.


Posted: Tue Jun 29, 2010 1:11 pm
Senior Newbie

Joined: Tue Feb 16, 2010 4:32 pm
Posts: 6
I did have a cron job running; it's the last thing logged, in fact. It's not a particularly complex job: all it does is ping a website every 30 minutes. The last time it ran was at 1:30am UTC, if the linode guys want to check the host.


Posted: Tue Jun 29, 2010 1:36 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Does switching clocksources have any effect?

Code:
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

-Chris


Posted: Tue Jun 29, 2010 1:38 pm
Senior Member

Joined: Sun Mar 07, 2010 7:47 pm
Posts: 1970
Website: http://www.rwky.net
Location: Earth
I'll knock up a node with a kernel that I have issues with and get back to you. I should have something in a few days.


Posted: Tue Jun 29, 2010 2:19 pm
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
Hi Chris,

I am at 11 days, 12 hours uptime (and still have not had any issue with 2.6.34-linode26).

I'm guessing that you have a good reason to believe that tsc is a winner, so I have now switched to the tsc clocksource (previously was xen) without rebooting. Ntpd appears to be maintaining time fine after the switch.
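
For anyone else making the same switch, here is a small sketch that checks the name against the kernel's available_clocksource list before writing it, and prints the before and after values. The function name switch_clocksource is made up, and the optional second argument (an alternate directory) is only there so it can be dry-run against scratch files. Writing to the real sysfs path needs root.

```shell
#!/bin/sh
# Sketch only: switch_clocksource is a hypothetical helper name.
# The second argument lets you point it at a scratch directory for a dry run.
CS_DIR=/sys/devices/system/clocksource/clocksource0

switch_clocksource() {
    want=$1
    dir=${2:-$CS_DIR}

    # Refuse names the kernel does not list as available.
    case " $(cat "$dir/available_clocksource") " in
        *" $want "*) ;;
        *) echo "clocksource '$want' not available" >&2; return 1 ;;
    esac

    echo "was: $(cat "$dir/current_clocksource")"
    echo "$want" > "$dir/current_clocksource" || return 1
    echo "now: $(cat "$dir/current_clocksource")"
}
```

Called as "switch_clocksource tsc", this should behave like the echo command posted above, with the extra availability check. The change does not survive a reboot.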

(Of course, I expect the more useful testing results to come from the other users who have had this issue with greater frequency -- looking forward to seeing your test results!)

Mike

