Background: a two-Linode setup that basically follows the configuration in the Linode Library (http://library.linode.com/linux-ha/ip-failover-heartbeat-pacemaker-drbd-mysql-ubuntu-10.04). Both nodes are in Newark.
Twice in the last couple of months, heartbeat on one of the nodes has jumped to 100% CPU. This time it was the backup node that had the problem, but it caused Apache on the primary to go unresponsive as well. The first time, both nodes needed rebooting. This time, so far, only the failed backup node has needed rebooting. Once that was done, Apache on the primary resumed serving, but it still might be a little wacky; I haven't fully checked it out yet.
Searching around, it seems there are other reports of heartbeat taking up 100% CPU time and rendering the box useless without a reboot. From what I can tell, it's generally caused by a snowballing cycle of packets failing and being retransmitted.
While this problem may have been caused by today's DoS attack against Newark (which started ~3 hours before the status alert at http://status.linode.com/2011/10/network-issue-in-newark.html), I'd still like to try to prevent this situation: without the "high availability" setup, everything would have started working again by itself once the attack was mitigated.
I've snipped a couple of bits from the log file below. While I don't have any control over the network, does anyone have any suggestions on how to improve this situation on the server?
Thanks in advance.
Log from when the problem seemed to start (timezone is UTC):
Oct 8 17:03:27 ewrha01 heartbeat: [1027]: WARN: 3 lost packet(s) for [ewrha02] [256919:256923]
Oct 8 17:03:27 ewrha01 heartbeat: [1027]: WARN: Late heartbeat: Node ewrha02: interval 8010 ms
Oct 8 17:03:30 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:31 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256923:256925]
Oct 8 17:03:31 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:35 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256925:256927]
Oct 8 17:03:35 ewrha01 attrd: [1061]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_webfs:1 (1000)
Oct 8 17:03:35 ewrha01 attrd: [1061]: info: attrd_perform_update: Sent update 125: master-drbd_webfs:1=1000
Oct 8 17:03:36 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:37 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256927:256929]
Oct 8 17:03:37 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:45 ewrha01 heartbeat: [1027]: WARN: 3 lost packet(s) for [ewrha02] [256929:256933]
Oct 8 17:03:45 ewrha01 heartbeat: [1027]: WARN: Late heartbeat: Node ewrha02: interval 8000 ms
Oct 8 17:03:46 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:49 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256933:256935]
Oct 8 17:03:53 ewrha01 attrd: [1061]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_userfs:1 (1000)
Oct 8 17:03:53 ewrha01 attrd: [1061]: info: attrd_perform_update: Sent update 127: master-drbd_userfs:1=1000
Oct 8 17:03:54 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:55 ewrha01 heartbeat: [1027]: WARN: 2 lost packet(s) for [ewrha02] [256935:256938]
Oct 8 17:03:55 ewrha01 heartbeat: [1027]: WARN: Late heartbeat: Node ewrha02: interval 6010 ms
Oct 8 17:03:55 ewrha01 heartbeat: [1027]: info: No pkts missing from ewrha02!
Oct 8 17:03:57 ewrha01 heartbeat: [1027]: WARN: 1 lost packet(s) for [ewrha02] [256938:256940]
I now have 80 MB of this, with the dispatch delay growing as time went on:
Oct 8 17:39:26 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0aa48)
Oct 8 17:39:26 ewrha01 heartbeat: [1027]: info: Link ewrha02:eth0 dead.
Oct 8 17:39:26 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0bb58)
Oct 8 17:39:27 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0c378)
Oct 8 17:39:27 ewrha01 heartbeat: [1027]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8d0d910)
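With 80 MB of these messages, it may be easier to see how the problem ramps up by bucketing the warnings per minute rather than reading them line by line. Below is a rough Python sketch that does this. The function name `summarize` and the regular expressions are my own, written against the three message formats quoted above, so they are assumptions that may need adjusting if other heartbeat versions word these lines differently.

```python
import re
from collections import defaultdict

# Patterns modelled on the log lines quoted in this post (assumed formats).
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s")
LOST = re.compile(r"WARN: (\d+) lost packet\(s\) for")
LATE = re.compile(r"WARN: Late heartbeat: Node \S+ interval (\d+) ms")
SLOW = re.compile(r"took too long to execute: (\d+) ms")


def summarize(lines):
    """Group heartbeat warnings by minute.

    Returns a dict keyed by the "Mon D HH:MM" part of the timestamp, with
    the total lost packets, the longest late-heartbeat interval, and the
    longest dispatch delay seen in that minute.
    """
    stats = defaultdict(
        lambda: {"lost": 0, "max_late_ms": 0, "max_dispatch_ms": 0}
    )
    for line in lines:
        stamp = STAMP.match(line)
        if not stamp:
            continue  # not a syslog-style line; skip it
        minute = stamp.group(1)

        m = LOST.search(line)
        if m:
            stats[minute]["lost"] += int(m.group(1))
            continue
        m = LATE.search(line)
        if m:
            bucket = stats[minute]
            bucket["max_late_ms"] = max(bucket["max_late_ms"], int(m.group(1)))
            continue
        m = SLOW.search(line)
        if m:
            bucket = stats[minute]
            bucket["max_dispatch_ms"] = max(
                bucket["max_dispatch_ms"], int(m.group(1))
            )
    return dict(stats)
```

Passing an open file object works, since the function only iterates over lines, for example `summarize(open(path))` where `path` is wherever heartbeat logs on your system. Printing the result sorted by key should show at which minute the lost-packet count and dispatch delay start climbing.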