Linode Community Forums
Posted by db3l on Sun Nov 01, 2009 5:34 pm
I think I've researched this enough to believe I'm right, but was wondering if anyone knows for sure?

When running under a pv_ops (paravirt) kernel such as the current 2.6.31.5, the CPU usage stats appear to be properly contained to the 4 CPUs my Linode has visibility to, but the idle time appears to account for all 8 CPUs on the host?

I noticed this because on a newer Linode my munin graphs show an idle time of 800%, whereas the default graph tops out at 400% based on visible CPUs. The newer Linode was cloned from an older one that doesn't have the problem (but runs with the latest regular 2.6.18 kernel) so the software aside from the kernel (and possibly the xen host given the host is newer) is the same.

I've seen some earlier munin posts having to do with NaN values for idle, which I don't have a problem with, and some with a similar ~800%. Also some independent references that suggest it may be related to the pv_ops kernel and/or just the fact that the pv_ops kernels are late enough that they're configured to run tickless (CONFIG_NO_HZ).

I don't think it's a Munin problem per se, since monitoring /proc/stat over a time range on both hosts clearly shows the idle counter incrementing twice as fast on my new Linode as on my old one.
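A quick way to reproduce that observation is to sample the idle counter twice and compute its rate. This is a minimal sketch, not anything from Munin; the field layout follows proc(5), where idle is the fourth value on the `cpu` line:

```python
import time

def read_idle(path="/proc/stat"):
    """Return the aggregate idle jiffy counter from the first 'cpu' line."""
    with open(path) as f:
        fields = f.readline().split()
    # Per proc(5): cpu user nice system idle iowait ... -> idle is fields[4]
    return int(fields[4])

def idle_rate(interval=5.0):
    """Idle jiffies accumulated per second over the sampling interval."""
    before = read_idle()
    time.sleep(interval)
    after = read_idle()
    return (after - before) / interval
```

On an otherwise-quiet 4-CPU guest with a 100 Hz tick, roughly 400 idle jiffies per second would be expected; the pv_ops kernels described here accumulate about twice that.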

Naively, I was guessing that a tickless kernel computes or derives the idle value (I know the Xen pv_ops code doesn't actually execute idle ticks in the DomU). What's odd is that the per-CPU values for the four visible CPUs still sum to the total, and aren't identical, so they're accumulating per CPU, just seemingly twice as fast as on the non-pv_ops Linode.

So I was wondering if anyone knew for sure what was going on.

Thanks.

-- David


Posted by jed on Sun Jan 24, 2010 11:38 pm
Website: http://jedsmith.org/
I've seen this too. To work around it, I simply divided the idle value in the Munin cpu plugin (and it is only idle) by 2 when installing Munin on my Linodes.

Seems like a pv_ops bug, but a really minor one.
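The same halving workaround can be sketched in Python; this is illustrative only, applied while parsing /proc/stat rather than by editing Munin's actual plugin, and the factor of 2 is an assumption matching the 8-CPU-host / 4-CPU-guest ratio described above:

```python
PVOPS_IDLE_FACTOR = 2  # assumption: idle accrues at 2x on the 8-CPU host

def corrected_cpu_fields(line, factor=PVOPS_IDLE_FACTOR):
    """Given a 'cpu' line from /proc/stat, return (label, values) with the
    idle counter (4th value: user nice system idle ...) divided by factor."""
    parts = line.split()
    label, values = parts[0], [int(v) for v in parts[1:]]
    values[3] //= factor  # only idle is scaled; other counters are untouched
    return label, values
```

For example, `corrected_cpu_fields("cpu 10 0 5 800 0 0 0")` yields `("cpu", [10, 0, 5, 400, 0, 0, 0])`.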

_________________
Disclaimer: I am no longer employed by Linode; opinions are my own alone.


Posted by db3l on Mon Jan 25, 2010 1:31 am
jed wrote:
I've seen this too. To work around it, I simply divided the idle value in the Munin cpu plugin (and it is only idle) by 2 when installing Munin on my Linodes.

Seems like a pv_ops bug, but a really minor one.

Yeah, that's what I chose to do as well.

-- David


Posted by Guspaz on Mon Jan 25, 2010 12:38 pm
Location: Montreal, QC
I've not had that problem, but I've had something of a different one. My graphs are all scaled to 400%, but idle has always been nan. Not that it matters.


Posted by db3l on Mon Jan 25, 2010 3:45 pm
Guspaz wrote:
I've not had that problem, but I've had something of a different one. My graphs are all scaled to 400%, but idle has always been nan. Not that it matters.

The graphs scale to 400% because a Linode "sees" 4 processors, so maximum CPU figures (busy, idle) will be up to 400% given how Linux measures such things. But on the paravirt kernels, without fixing the idle time, it'll measure up to the full 800% (8-processor host) and extend beyond the rest of the processor metrics. Since the graph limits are set based on the visible CPUs, you pretty much can't see the top of the idle line vary, since it's above the graph.
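To make that scaling concrete, here is a sketch of the usual /proc/stat-to-percent conversion (not Munin's actual code; the 4-CPU default is an assumption matching the Linodes discussed). Each field's percentage is its share of the interval times the number of visible CPUs, so the fields sum to N*100; an idle counter running doubled pushes its line past the N*100 cap of the graph:

```python
def cpu_percentages(prev, curr, ncpus=4):
    """Convert two snapshots of the per-field jiffy counters on the 'cpu'
    line (user, nice, system, idle, ...) into percentages of the interval.
    The percentages sum to ncpus * 100; a doubled idle counter makes the
    idle figure alone approach 2 * ncpus * 100 and run off the graph."""
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)  # total jiffies elapsed across all fields
    return [100.0 * ncpus * d / total for d in deltas]
```

For instance, with deltas of (50, 0, 50, 300) over an interval, the result is 50% user, 0% nice, 50% system, and 300% idle on a 4-CPU guest.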

With that said, I've definitely had other problems with the idle metric in general with Munin, and in fact one of my paravirt kernels, after having perfectly fine stats for a few months, suddenly switched idle to NaN. More commonly, and with any kernel, I hit cases where the idle metric shows some enormous (like 80-digit) value. I suspect, but have yet to track down, that it's due to some calculation being performed along the way; once present it sort of infects all the graphs, since you've got that bad (measured or calculated) data point for all time. It's possible the NaN reading is a similar calculation overflow.

So far my only recourse has been to reset data collection and the rrd files. I've had no luck searching for help on this behavior, which, given how frequently it happens to me, I'd have thought would turn up more references.

Now this is with Ubuntu 8.04 LTS, where the repository Munin (and probably rrdtool) isn't the latest and greatest, so I figured I'd try a locally built version at some point. Other metrics all seem to be fine, so for now I just get a bit annoyed when viewing the CPU graphs on some nodes...

-- David


Posted by JshWright on Mon Jan 25, 2010 4:33 pm
Website: http://www.worshiproot.com
db3l wrote:
Guspaz wrote:
I've not had that problem, but I've had something of a different one. My graphs are all scaled to 400%, but idle has always been nan. Not that it matters.

The graphs scale to 400% because a Linode "sees" 4 processors, so maximum CPU figures (busy, idle) will be up to 400% given how Linux measures such things.


Pretty sure his concern was the NaN displayed for idle, not the 400% scale.


Posted by db3l on Mon Jan 25, 2010 5:35 pm
JshWright wrote:
Pretty sure his concern was the NaN displayed for idle, not the 400% scale.

Wasn't absolutely sure when reading it, so figured the first paragraph couldn't hurt (perhaps useful to other readers if not the OP). But the rest is related to strange idle readings.

-- David


Posted by jed on Mon Jan 25, 2010 8:05 pm
JshWright wrote:
db3l wrote:
Guspaz wrote:
I've not had that problem, but I've had something of a different one. My graphs are all scaled to 400%, but idle has always been nan. Not that it matters.

The graphs scale to 400% because a Linode "sees" 4 processors, so maximum CPU figures (busy, idle) will be up to 400% given how Linux measures such things.


Pretty sure his concern was the NaN displayed for idle, not the 400% scale.

They're related. Munin will get confused with the fast idle and report NaN for idle (or ~800%); I see them as symptoms of the same problem.



Posted by db3l on Mon Jan 25, 2010 10:50 pm
jed wrote:
They're related. Munin will get confused with the fast idle and report NaN for idle (or ~800%); I see them as symptoms of the same problem.

Note that on one of my systems I was fine for a while, and now have NaN, even though I have the "/2" modification in the plugin, so idle should never exceed 400%. I do agree that these issues are likely all related (along with my seemingly overflowed values in other cases), but am still not certain precisely how.

-- David


Powered by phpBB® Forum Software © phpBB Group