Linode Forum
Linode Community Forums
Posted: Mon Feb 18, 2013 11:44 am
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
For the last ~1.5 months I've been experiencing seemingly random kernel lock-ups, now at a rate of 1-2 per week! All services become unresponsive, and the lish console shows some messages but does not respond to input. Recovering requires a "reboot" and, to avoid waiting several minutes for the gentle sync/reboot to finish, a "destroy" operation.

The messages seem to vary, but they all seem to be related to memory page faults. Looks like a Xen bug. Anyone have recommendations on how I can alleviate this problem? Downgrade to a known-good kernel, etc? Many thanks.

Code:
INFO: rcu_sched self-detected stall on CPU
        3: (779947 ticks this GP) idle=1b9/140000000000001/0
         (t=780012 jiffies)
Pid: 24440, comm: gs Tainted: G    B D      3.7.5-linode48 #1
Call Trace:
 [<c0193e8f>] ? print_cpu_stall+0xdf/0x190
 [<c078bae1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 [<c016890f>] ? update_wall_time+0x18f/0x290
 [<c019435a>] ? rcu_check_callbacks+0x12a/0x230
 [<c013f965>] ? update_process_times+0x35/0x70
 [<c016f2fd>] ? tick_sched_timer+0x6d/0xc0
 [<c0151f65>] ? __remove_hrtimer+0x45/0xa0
 [<c016f290>] ? tick_nohz_handler+0xe0/0xe0
 [<c01520ed>] ? __run_hrtimer+0x4d/0xf0
 [<c0152569>] ? hrtimer_interrupt+0x119/0x2f0
 [<c01068f7>] ? xen_timer_interrupt+0x17/0x30
 [<c018d9ff>] ? handle_irq_event_percpu+0x3f/0x150
 [<c018fed5>] ? irq_get_irq_data+0x5/0x10
 [<c04f97a5>] ? info_for_irq+0x5/0x20
 [<c04f9e60>] ? evtchn_from_irq+0x10/0x40
 [<c0190191>] ? handle_percpu_irq+0x31/0x50
 [<c04f9664>] ? __xen_evtchn_do_upcall+0x164/0x210
 [<c04fa868>] ? xen_evtchn_do_upcall+0x18/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c018007b>] ? update_if_frozen+0x6b/0xd0
 [<c04f00d8>] ? irq_cpu_rmap_add+0x88/0x90
 [<c01013a7>] ? xen_hypercall_sched_op+0x7/0x20
 [<c04f9ed7>] ? xen_poll_irq_timeout+0x47/0x60
 [<c0108295>] ? xen_spin_lock_slow+0x65/0xd0
 [<c010835c>] ? xen_spin_lock_flags+0x5c/0x70
 [<c078ba97>] ? _raw_spin_lock_irqsave+0x27/0x40
 [<c01ab83d>] ? pagevec_lru_move_fn+0x5d/0xb0
 [<c01ab170>] ? pagevec_lookup+0x20/0x20

 [<c01c0bd7>] ? exit_mmap+0x37/0x110
 [<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c0101227>] ? xen_hypercall_xen_version+0x7/0x20
 [<c0106297>] ? xen_force_evtchn_callback+0x17/0x30
 [<c01308eb>] ? mmput+0x2b/0xa0
 [<c0136113>] ? exit_mm+0xd3/0x100
 [<c078bac0>] ? _raw_spin_lock_irq+0x10/0x20
 [<c0137b9d>] ? do_exit+0x11d/0x3a0
 [<c0131f17>] ? print_oops_end_marker+0x27/0x30
 [<c010c272>] ? oops_end+0x72/0xa0
 [<c012687e>] ? __bad_area_nosemaphore+0xae/0x140
 [<c018da40>] ? handle_irq_event_percpu+0x80/0x150
 [<c018fed5>] ? irq_get_irq_data+0x5/0x10
 [<c012696b>] ? bad_area+0x3b/0x50
 [<c0126f32>] ? __do_page_fault+0x402/0x410
 [<c04f96ce>] ? __xen_evtchn_do_upcall+0x1ce/0x210
 [<c01947b3>] ? rcu_irq_exit+0x53/0xb0
 [<c04fa86d>] ? xen_evtchn_do_upcall+0x1d/0x30
 [<c078cb3b>] ? xen_do_upcall+0x7/0xc
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c078c2fe>] ? error_code+0x5a/0x60
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c01a78f8>] ? get_page_from_freelist+0x118/0x3c0
 [<c0103138>] ? load_TLS_descriptor+0x58/0xa0
 [<c01a7e81>] ? __alloc_pages_nodemask+0x141/0x6d0
 [<c01aa98d>] ? __do_page_cache_readahead+0xdd/0x1a0
 [<c01aaa6e>] ? ra_submit+0x1e/0x30
 [<c01a3099>] ? filemap_fault+0x309/0x3e0
 [<c01ba5a5>] ? __do_fault+0x75/0x570
 [<c01bdbf0>] ? handle_pte_fault+0xa0/0x2f0
 [<c01bdf35>] ? handle_mm_fault+0xf5/0x1b0
 [<c0126c6a>] ? __do_page_fault+0x13a/0x410
 [<c01c36a5>] ? sys_mprotect+0x1b5/0x1f0
 [<c0126f40>] ? __do_page_fault+0x410/0x410
 [<c078c2fe>] ? error_code+0x5a/0x60
 [<c0126f40>] ? __do_page_fault+0x410/0x410


Posted: Mon Feb 18, 2013 11:52 am
Senior Member

Joined: Sat Sep 25, 2010 2:25 am
Posts: 75
Website: http://www.ruchirablog.com
Location: Sri Lanka
What OS are you on?

_________________
www.ruchirablog.com


Posted: Mon Feb 18, 2013 11:55 am
Senior Member

Joined: Tue May 03, 2011 11:55 am
Posts: 105
I recently got one of these myself - very similar to what you have there. The error I had referenced my webserver, Litespeed.

INFO: rcu_sched self-detected stall on CPU
1: (239698 ticks this GP) idle=98d/140000000000001/0
(t=240004 jiffies)
Pid: 2486, comm: litespeed Not tainted 3.7.5-linode48 #1
Call Trace:


Posted: Mon Feb 18, 2013 12:04 pm
Senior Member

Joined: Fri Feb 18, 2005 4:09 pm
Posts: 594
After a recent reboot, I see high CPU usage for a process named rcu_sched. I've never seen that in my years with Linode. Ubuntu 12.10 64-bit, latest kernel.

James


Posted: Mon Feb 18, 2013 12:10 pm
Junior Member

Joined: Wed Jul 04, 2012 11:08 am
Posts: 34
A 3.7 rcu_sched stall bug related to TCP was fixed upstream in 3.7.8. Linode might want to cherry-pick the fix or just roll out 3.7.9, which was released recently.

commit 09ea1383126d942a993b0895cec16e0961db5af9
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jan 10 07:06:10 2013 +0000

tcp: splice: fix an infinite loop in tcp_read_sock()

[ Upstream commit ff905b1e4aad8ccbbb0d42f7137f19482742ff07 ]

commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)
added a regression.

[ 83.843570] INFO: rcu_sched self-detected stall on CPU
[ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
[ 83.844582] Task dump for CPU 6:
[ 83.844584] netperf R running task 0 8966 8952 0x0000000c
[ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
[ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
[ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
[ 83.844594] Call Trace:
[ 83.844596] [<ffffffff81088679>] ? vprintk_emit+0x1c9/0x4c0
[ 83.844601] [<ffffffff815ad449>] ? schedule+0x29/0x70
[ 83.844606] [<ffffffff81537bd2>] ? tcp_splice_data_recv+0x42/0x50
[ 83.844610] [<ffffffff8153beaa>] ? tcp_read_sock+0xda/0x260
[ 83.844613] [<ffffffff81537b90>] ? tcp_prequeue_process+0xb0/0xb0
[ 83.844615] [<ffffffff8153c0f0>] ? tcp_splice_read+0xc0/0x250
[ 83.844618] [<ffffffff814dc0c2>] ? sock_splice_read+0x22/0x30
[ 83.844622] [<ffffffff811b820b>] ? do_splice_to+0x7b/0xa0
[ 83.844627] [<ffffffff811ba4bc>] ? sys_splice+0x59c/0x5d0
[ 83.844630] [<ffffffff8119745b>] ? putname+0x2b/0x40
[ 83.844633] [<ffffffff8118bcb4>] ? do_sys_open+0x174/0x1e0
[ 83.844636] [<ffffffff815b6202>] ? system_call_fastpath+0x16/0x1b

if recv_actor() returns 0, we should stop immediately,
because looping wont give a chance to drain the pipe.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
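
For anyone who wants to check whether a given Linode is already on a release at or past 3.7.8, here is a minimal sketch. It assumes GNU `sort -V` (version sort) is available, and the helper name `kernel_at_least` is made up for illustration. A version number alone does not prove the fix is present or absent, since a vendor kernel can carry backports, so treat the result as a hint only.

```shell
#!/bin/sh
# kernel_at_least RELEASE MINIMUM
# Succeeds when RELEASE (e.g. the output of `uname -r`) is >= MINIMUM.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
kernel_at_least() {
    have="${1%%-*}"   # drop suffixes such as "-linode48" or "-x86_64-linode30"
    want="$2"
    # With version sort, the smaller of the two versions comes out first.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

Usage would be something like `kernel_at_least "$(uname -r)" 3.7.8 && echo "at or past 3.7.8"`.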


Posted: Mon Feb 18, 2013 12:29 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
3.7.9 kernels are inbound!


Posted: Mon Feb 18, 2013 12:41 pm
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
Thanks for the heads up about the TCP bug. However, all of mine seem to be memory / page fault related.

Code:
reboot   system boot  3.7.5-linode48   Mon Feb 18 05:53
reboot   system boot  3.7.5-linode48   Mon Feb  4 12:58
reboot   system boot  3.7.5-linode48   Mon Feb  4 07:22
reboot   system boot  3.6.5-linode47   Mon Jan 28 21:39
reboot   system boot  3.6.5-linode47   Sun Jan 27 13:51
reboot   system boot  3.6.5-linode47   Mon Jan 14 07:50
reboot   system boot  3.6.5-linode47   Sun Jan  6 15:42
reboot   system boot  3.6.5-linode47   Sat Dec 22 13:09


It would be amazing/heroic if Linode could provide 1) some sort of very simple "external" monitoring service, i.e. does a particular URL respond to at least 1 of 5 retried requests over 60 seconds, 2) hook the automatic reboot capability into this, 3) notify me of the event. I realize this has its own dangers and complexities, but I'm fairly sure that kernel bugs like this will pop up from now to eternity, and the silent hard lockups are a real pain. (I'm using Server Density to do monitoring, which is how I discover these outages, but I'm grandfathered in under their old [sane!] pricing.) Charge me $5/month for this, or give it away free knowing that I will be more hesitant to leave Linode thanks to this extra automatic monitoring/reliability.
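
The check described in 1) could be sketched roughly as follows. This is only an illustration, not anything Linode provides: `probe`, `check_up`, and `PROBE_CMD` are hypothetical names, and the defaults (5 tries, 12 seconds apart, 2-second timeout per request) are assumptions chosen so the whole check fits in about 60 seconds. It reports up or down via its exit status only; hooking that into a reboot or a notification is left out because it depends on tooling this thread does not describe.

```shell
#!/bin/sh
# Probe a URL once. curl -f makes HTTP errors (status >= 400) count as failures.
probe() {
    curl -fsS -o /dev/null --max-time 2 "$1"
}

# check_up URL [TRIES] [INTERVAL_SECONDS]
# Succeeds as soon as one probe succeeds; fails only if every try fails.
# PROBE_CMD may name a replacement probe function (handy for dry runs).
check_up() {
    url="$1"
    tries="${2:-5}"
    interval="${3:-12}"
    i=1
    while [ "$i" -le "$tries" ]; do
        if ${PROBE_CMD:-probe} "$url"; then
            return 0
        fi
        # Wait between tries, but not after the last one.
        if [ "$i" -lt "$tries" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    return 1
}
```

Run from a machine outside the Linode (for example from cron), something like `check_up "http://www.example.com/" || echo "site appears down"` would be the starting point.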


Posted: Mon Feb 18, 2013 12:54 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
You can configure the Linux kernel to panic on an oops, meaning your kernel should exit and Lassie (the reboot watchdog) would then get your system back online.

echo 1 > /proc/sys/kernel/panic # reboot (in our case, exit) 1 second after a panic
echo 1 > /proc/sys/kernel/panic_on_oops # give up after OOPsing

-Chris
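
Values written under /proc/sys do not survive a reboot, so the two settings above would need to be reapplied at boot. A minimal sketch of one way to do that, assuming the distribution loads `/etc/sysctl.d/*.conf` at boot (older setups may use `/etc/sysctl.conf` instead); the function name and the file name `99-panic-on-oops.conf` are hypothetical:

```shell
#!/bin/sh
# write_panic_sysctl [DEST]
# Writes the two panic settings as a sysctl fragment.
# DEST defaults to a hypothetical file name; adjust it for your distribution.
write_panic_sysctl() {
    dest="${1:-/etc/sysctl.d/99-panic-on-oops.conf}"
    cat > "$dest" <<'EOF'
# Reboot (on a Linode: exit, so the watchdog restarts it) 1 second after a panic.
kernel.panic = 1
# Treat an oops as a panic instead of trying to carry on.
kernel.panic_on_oops = 1
EOF
}
```

As root, `write_panic_sysctl && sysctl -p /etc/sysctl.d/99-panic-on-oops.conf` would write the fragment and load it immediately.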


Posted: Mon Feb 18, 2013 1:10 pm
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
Hi Chris, learned something, thanks! Will try the /proc/sys/kernel/panic_on_oops & /proc/sys/kernel/panic settings and hope that Lassie comes to save me next time :) - Mike


Posted: Wed Feb 20, 2013 4:04 pm
Senior Newbie

Joined: Thu Jun 17, 2010 1:10 pm
Posts: 16
Website: http://www.nerdkits.com/
Just crashed again a few minutes ago. Fortunately, it did reboot itself as per caker's suggestion (thanks!). Unfortunately, there are no log mentions of the issue at all, making it just about impossible to debug further.


Posted: Thu Feb 21, 2013 4:59 pm
Senior Member

Joined: Fri Feb 18, 2005 4:09 pm
Posts: 594
caker wrote:
3.7.9 kernels are inbound!


Have you been able to estimate an arrival date?

James


Posted: Sat Feb 23, 2013 6:32 am
Senior Member

Joined: Fri Feb 18, 2005 4:09 pm
Posts: 594
caker wrote:
3.7.9 kernels are inbound!


Have you been able to estimate an arrival date?

James


Posted: Sun Feb 24, 2013 7:59 pm
Junior Member

Joined: Wed Jul 04, 2012 11:08 am
Posts: 34
FWIW, I've been running a custom 3.7.9 kernel on 3 Linodes for a few days with no issues. Then again, I didn't have any issues with earlier 3.7 releases either, but at least it hasn't broken anything else yet.


Posted: Mon Feb 25, 2013 8:56 pm
Senior Member

Joined: Fri Feb 18, 2005 4:09 pm
Posts: 594
caker wrote:
3.7.9 kernels are inbound!


Have you been able to estimate an arrival date?

James


Posted: Wed Feb 27, 2013 4:23 pm
Linode Staff

Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
3.7.10-linode49 and 3.7.10-x86_64-linode30 were released today. "Latest" now points to them.

http://www.linode.com/kernels/
http://www.linode.com/kernels/rss.xml

Enjoy!
-Chris

