Linode Community Forums
Posted: Thu Nov 25, 2010 11:06 pm
Hi folks,

I followed along with the Linode Library HA article to run MySQL on two Linodes:
http://library.linode.com/linux-ha/high ... untu-10.04

Now, after the latest Fremont DDoS attack last night (I rebooted ha2-db because I could not reach it over SSH), both of my Linodes think they are the primary machine according to the "crm_mon" command, but ha1-db is actually doing the work. ha2-db was not able to mount the file system and consequently did not launch MySQL.

Also, running "cat /proc/drbd" on each node shows ha2-db as the Secondary storage device with Inconsistent data. I tried running "drbdadm invalidate all" on ha2-db, since that had fixed DRBD synchronization issues for me before, but not this time. ha2-db looks like:

Code:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@ha2-db, 2010-11-11 04:04:44
 0: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:6291228


ha1-db looks like:

Code:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@ha1-db, 2010-11-11 04:03:49
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:0 nr:0 dw:3615468 dr:4021959 al:4472 bm:4304 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:603364
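
From what I can tell, StandAlone on one node and WFConnection/Inconsistent on the other is the classic DRBD split-brain signature. The usual manual recovery would be something like the following; I'm assuming the resource is named r0 here (check /etc/drbd.conf for the real name), and I haven't tried this yet:

Code:
# On ha2-db, the node whose data gets discarded (assuming resource "r0"):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On ha1-db, the survivor, reconnect since it is sitting in StandAlone:
drbdadm connect r0

# Watch the resync progress from either node:
watch cat /proc/drbd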


The main problem is that ha1-db is running at over 100% CPU because the heartbeat process is pegged at 100% all the time. Database queries are still working, but I expect more traffic after the holiday and don't know what will happen then. I'm nervous about just rebooting ha1-db and crossing my fingers.

top on ha1-db shows:

Code:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  643 root      -2   0  105m 105m 5876 R  100 21.4 914:08.17 heartbeat


Any other suggestions to debug or fix this?
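
One thing I'm considering before a full reboot is restarting just the heartbeat daemon on ha1-db via its init script; I haven't tried it, and I realize it may stop or migrate the cluster resources, so this is only a sketch:

Code:
# Restart only the heartbeat daemon (may bounce the cluster resources):
sudo /etc/init.d/heartbeat restart

# Check whether it is still pegging the CPU afterwards:
top -b -n 1 | grep heartbeat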

crm_mon from ha1-db shows:

Code:
============
Last updated: Thu Nov 25 19:02:20 2010
Stack: Heartbeat
Current DC: ha1-db (c8658f6d-b186-4143-9b16-5eacf721cb7b) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 1 expected votes
2 Resources configured.
============

Online: [ ha1-db ]
OFFLINE: [ ha2-db ]

 Resource Group: HAServices
     ip1        (ocf::heartbeat:IPaddr2):       Started ha1-db
     ip1arp     (ocf::heartbeat:SendArp):       Started ha1-db
     fs_mysql   (ocf::heartbeat:Filesystem):    Started ha1-db
     mysql      (ocf::heartbeat:mysql): Started ha1-db
 Master/Slave Set: ms_drbd_mysql
     Masters: [ ha1-db ]
     Stopped: [ drbd_mysql:0 ]


crm_mon from ha2-db shows:

Code:
============
Last updated: Thu Nov 25 19:03:18 2010
Stack: Heartbeat
Current DC: ha2-db (a46a8fc8-2c6a-4f81-93a3-2dab6f9439c2) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 1 expected votes
2 Resources configured.
============

Online: [ ha2-db ]
OFFLINE: [ ha1-db ]

 Resource Group: HAServices
     ip1        (ocf::heartbeat:IPaddr2):       Started ha2-db
     ip1arp     (ocf::heartbeat:SendArp):       Started ha2-db
     fs_mysql   (ocf::heartbeat:Filesystem):    Started ha2-db FAILED
     mysql      (ocf::heartbeat:mysql): Stopped
 Master/Slave Set: ms_drbd_mysql
     Slaves: [ ha2-db ]
     Stopped: [ drbd_mysql:1 ]

Failed actions:
    fs_mysql_start_0 (node=ha2-db, call=14, rc=1, status=complete): unknown error
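
Once DRBD is consistent again, I assume I would also need to clear that failed fs_mysql action so Pacemaker retries the mount; something like this with the crm shell (untested on my setup):

Code:
# On ha2-db, clear the recorded failure so Pacemaker retries the mount:
sudo crm resource cleanup fs_mysql

# One-shot cluster status to confirm the resource gets re-probed:
sudo crm_mon -1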


I can also ping each node from the other.

More detail (tail of /var/log/syslog):

Code:
Nov 25 20:23:07 ha1-db heartbeat: [643]: ERROR: Message hist queue is filling up (500 messages in queue)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0x1011cd10)
Nov 25 20:23:07 ha1-db lrmd: [757]: info: RA output: (ip1:monitor:stderr) eth0:1: warning: name may be invalid
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0x1011cd78)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 80 ms (> 10 ms) (GSource: 0x1011cde0)
Nov 25 20:23:07 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 110 ms (> 10 ms) (GSource: 0x1011ce48)
Nov 25 20:23:08 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 100 ms (> 10 ms) (GSource: 0x1011ceb0)
Nov 25 20:23:08 ha1-db heartbeat: [643]: WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 120 ms (> 10 ms) (GSource: 0x1011cf18)


Thanks, Josh


Posted: Fri Nov 26, 2010 1:43 am
It looks like heartbeat can get into an infinite loop in some situations; this mailing list thread describes the same behavior:

http://www.gossamer-threads.com/lists/l ... sers/67922

I planned for the worst, did a Linode backup and a database backup, then rebooted ha1-db.
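
For anyone following along, the database backup was just a standard dump before the risky reboot; roughly this (file name and credentials are whatever fits your setup):

Code:
# Dump all databases before rebooting the primary (file name is arbitrary):
mysqldump -u root -p --all-databases --single-transaction > /root/pre-reboot.sql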

Everything is working as expected, and ha2-db is now syncing properly with DRBD.

Thanks, Josh

