Linode Forum
Linode Community Forums
Posted: Sun Mar 14, 2010 7:43 am
Junior Member

Joined: Sun Oct 11, 2009 12:44 pm
Posts: 29
Location: Northern Ireland
I logged on today to find my sites were down, and a stream of emails about my Linode exceeding its IO threshold (badly).

You can see the dashboard graphs here; something clearly went wrong overnight: http://img12.imageshack.us/img12/9017/graphsu.png

The first thing I did was reboot the Linode, and that seems to have been completely wrong. Now I can't even connect to it by SSH anymore :( (though the Linode Manager does say it's running)

Does anyone have any ideas what I should do? I'm really at a loss now :(


Posted: Sun Mar 14, 2010 7:57 am
Newbie

Joined: Thu Aug 06, 2009 7:24 pm
Posts: 4
Have you tried logging in via Lish? Check the "Console" tab in the Linode Manager. From there, you should be able to find out why SSH isn't running.


Posted: Sun Mar 14, 2010 8:26 am
Junior Member

Joined: Sun Oct 11, 2009 12:44 pm
Posts: 29
Location: Northern Ireland
Thanks ayman. Logging in via Lish and restarting SSH that way has let me connect again via SSH itself at least, thanks! :)

Restarting Apache now fails with the error below. I'm reading more about it online:

Quote:
(30)Read-only file system: apache2: could not open error log file /var/log/apache2/error.log.


The few results I've found seem to say it's something to do with filesystem errors, but I can't work out what could have changed to make this suddenly appear.


Posted: Sun Mar 14, 2010 9:06 am
Senior Member

Joined: Sat Aug 30, 2008 1:55 pm
Posts: 1739
Location: Rochester, New York
The "dmesg" command will show you the output from the kernel's log; that might help. Alternatively, detaching from the console (ctrl-A then d) and using the "logview" command at the lish prompt will show you the last bit of the previous run's log and anything that's happened on the console this time around.

But indeed, it does sound filesystem-related.
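
If the log is long, a rough sketch like the one below can pick out the interesting lines. It is Python, it works on text you have already saved (for example the output of dmesg written to a file), and the function name and the three patterns are assumptions of mine, not anything Lish or dmesg provides. Kernel message wording varies between versions, so treat the patterns as a starting point.

```python
import re

# Assumed patterns: typical wording for read-only remounts, OOM kills and
# disk errors. Kernel messages differ between versions, so adjust as needed.
SUSPECT_PATTERNS = [
    re.compile(r"read-only", re.IGNORECASE),
    re.compile(r"out of memory", re.IGNORECASE),
    re.compile(r"i/o error", re.IGNORECASE),
]


def suspicious_lines(log_text):
    """Return the log lines that match any of the suspect patterns."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS)
    ]


if __name__ == "__main__":
    # Illustrative input only, pieced together from messages quoted in this
    # thread; it is not real dmesg output.
    sample = (
        "(30)Read-only file system: apache2: could not open error log file\n"
        " * Starting web server apache2\n"
        "Out of memory: kill process 2873 (apache2) score 197820 or a child\n"
    )
    for line in suspicious_lines(sample):
        print(line)
```

Anything it prints is only a pointer to where to start reading; the surrounding lines in the full log usually carry the detail.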


Posted: Sun Mar 14, 2010 9:59 am
Junior Member

Joined: Sun Oct 11, 2009 12:44 pm
Posts: 29
Location: Northern Ireland
Hi hoopycat, thanks for that! dmesg sounds like a really useful command, I'll remember that for the future! :)


I *may* have things sorted temporarily.

The problem was that Ubuntu had remounted the filesystem read-only. The most common reason given for that seems to be that the system detected a disk or filesystem error. Given the crazy stats in the graphs I posted though, I'm guessing something on my server was the cause of it :(

Running the "fsck" command was enough to fix it though.

With that said, it's only been fixed for about half an hour now. Will be watching those graphs to see if the issue comes back.

Does anyone know how I could find out what is causing the massive load? (if it does come back)


Posted: Sun Mar 14, 2010 10:31 am
Senior Member

Joined: Tue Apr 13, 2004 6:54 pm
Posts: 833
What _could_ have happened is your machine went into swap hell.

Basically if you run out of memory then your system starts to swap. This is normal. But if you're REALLY short of memory then it can swap a lot. So much that the system spends nearly all of the time swapping pages in/out. Response is almost zero, I/O activity is through the roof... it almost looks like the machine has crashed.
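
One quick way to see how deep into swap a machine is: read the SwapTotal and SwapFree fields from /proc/meminfo. Below is a minimal Python sketch of that. The function name is made up and the sample numbers are invented for illustration, not taken from the Linode in this thread.

```python
def swap_usage_kb(meminfo_text):
    """Return (swap_used_kb, swap_total_kb) parsed from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # Values look like "262144 kB"; the first token is the number.
            fields[name.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total - free, total


if __name__ == "__main__":
    # Invented sample values. On a real machine you would pass in
    # open("/proc/meminfo").read() instead.
    sample = (
        "MemTotal:         368640 kB\n"
        "MemFree:            4096 kB\n"
        "SwapTotal:        262144 kB\n"
        "SwapFree:           1024 kB\n"
    )
    used, total = swap_usage_kb(sample)
    print(f"swap used: {used} kB of {total} kB")
```

A nearly full swap is a hint, not proof of swap hell. The rate of swapping matters more, and the si/so columns of vmstat show that.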

Now if, at this point, you told the control panel to reboot your machine it would attempt to do a graceful shutdown. BUT if your linode was in swap hell then it might not have been able to do it, so the control panel may have switched to a more aggressive reboot method and effectively pressed the "Reset" button.

This is an unclean shutdown and can result in filesystems needing fsck'ing afterwards; the machine doesn't fully reboot and the only way of accessing the machine is via lish.

If this is what happened then you need to look into why your machine started taking up so much memory. Are you running MySQL or similar? If so check the dozens of threads here on how to ensure MySQL never explodes like this. Similarly there are threads on how to tune Apache.

Basically, you just need to tune all your application processes so they can live happily in memory and not cause swap hell.
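
The arithmetic behind "live happily in memory" is simple enough to sketch. This Python snippet is only an illustration: the function name is hypothetical, and the 160 MB reserve and 25 MB per Apache process are assumed figures, so measure your own before trusting the result.

```python
def max_workers(total_mb, reserved_mb, per_worker_mb):
    """How many worker processes fit in RAM once everything else is reserved."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    available_mb = total_mb - reserved_mb
    # Never return a negative count if the reserve already exceeds RAM.
    return max(available_mb // per_worker_mb, 0)


if __name__ == "__main__":
    # Assumed figures: a 360 MB Linode, 160 MB set aside for MySQL and the
    # OS, and roughly 25 MB per Apache+PHP process.
    print(max_workers(total_mb=360, reserved_mb=160, per_worker_mb=25))  # prints 8
```

With the prefork MPM, the result is the sort of number you would compare against Apache's MaxClients setting. If MaxClients is far higher than what fits, a traffic spike can push the box into swap.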

_________________
Rgds
Stephen
(Linux user since kernel version 0.11)


Posted: Sun Mar 14, 2010 12:30 pm
Junior Member

Joined: Sun Oct 11, 2009 12:44 pm
Posts: 29
Location: Northern Ireland
Thanks Sweh! The issue you described there does line up with the symptoms my server had. When I came on this morning, the server hadn't fully crashed, it was just slow to a point of being useless. The restart was what caused the complete crash.

I'll take a look at the things you mentioned.

The strange part is that my sites are only running fairly standard scripts: WordPress, phpBB, and Coppermine Photo Gallery. I'll take a look at them all (and any mods/plugins especially) like you said though; hopefully I'll be able to avoid a repeat!

Thanks again for your detailed reply, really helps to get an understanding of what happened!


Posted: Sun Mar 14, 2010 12:45 pm
Senior Member

Joined: Tue Apr 13, 2004 6:54 pm
Posts: 833
WordPress and phpBB are typical culprits, especially with a MySQL backend and no tuning. Many of these programs assume a full-sized server, and on a smaller Linode (e.g. a Linode 360) they can quickly use up all resources.

You might want to look at http://library.linode.com/databases/mys ... l-centos-5 for some MySQL tuning hints.

_________________
Rgds
Stephen
(Linux user since kernel version 0.11)


Posted: Sun Mar 14, 2010 5:46 pm
Junior Member

Joined: Sun Oct 11, 2009 12:44 pm
Posts: 29
Location: Northern Ireland
Thanks Sweh, read that now. I'll have a look at adjusting my MySQL settings now.

The sites have hit the exact same issue again, but I haven't restarted the server this time.

What's the best way to handle this for now?

I have 2 active sites on this. If I disable one (Just using a2dissite), will that be enough to stop it from causing any more trouble if that site is the culprit? Or would I need a different way of detecting it?

Update, looks like you were right about MySQL being the issue! Lish shows this as soon as I load it:

Code:
fsck from util-linux-ng 2.16
/dev/xvda: clean, 62959/770048 files, 2408634/3072000 blocks
FATAL: Module nf_conntrack_ftp not found.
FATAL: Module nf_nat_ftp not found.
FATAL: Module nf_conntrack_irc not found.
FATAL: Module nf_nat_irc not found.
 * Setting preliminary keymap...
 * Setting up console font and keymap...
 * Stopping NTP server ntpd
 * Starting OpenBSD Secure Shell server sshd
 * Starting NTP server ntpd
 * Starting MySQL database server mysqld ...done.
 * Checking for corrupt, not cleanly closed and upgrade needing tables.
 * Starting Postfix Mail Transport Agent postfix
 * Starting NTP server ntpd
 * Starting web server apache2

Ubuntu 9.10 merlin hvc0

merlin login: Out of memory: kill process 2338 (mysqld_safe) score 224717 or a child
Killed process 2446 (mysqld)
Out of memory: kill process 2873 (apache2) score 197820 or a child
Killed process 6182 (apache2)
Out of memory: kill process 2873 (apache2) score 197787 or a child
Killed process 6464 (apache2)
Out of memory: kill process 2873 (apache2) score 93674 or a child
Killed process 6492 (apache2)
Out of memory: kill process 2338 (mysqld_safe) score 110602 or a child
Killed process 7010 (mysqld)
Out of memory: kill process 2873 (apache2) score 98892 or a child
Killed process 6649 (apache2)
Out of memory: kill process 2873 (apache2) score 98406 or a child
Killed process 8311 (apache2)


Posted: Sun Mar 14, 2010 9:41 pm
Senior Member

Joined: Sat Mar 28, 2009 4:23 pm
Posts: 415
Website: http://jedsmith.org/
Location: Out of his depth and job-hopping without a clue about network security fundamentals
Time to upgrade, or tune your stack to use less memory. You don't have enough memory to sustain your applications, so the kernel's out-of-memory (OOM) killer is ragekilling processes to free some up.

Option one is significantly easier than option two.

_________________
Disclaimer: I am no longer employed by Linode; opinions are my own alone.


Posted: Sun Mar 14, 2010 10:45 pm
Senior Member

Joined: Tue Apr 13, 2004 6:54 pm
Posts: 833
Michael-Martin wrote:
Update, looks like you were right about MySQL being the issue! Lish shows this as soon as I load it:


Well, not necessarily. The Out of memory (OOM) killer doesn't necessarily kill the program that's exploding.

But given past experience in these forums it probably _is_ a combination of Apache instances (too many?) and MySQL going mad (it normally is :-))
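
To make that concrete, here is a small Python sketch that tallies the "Killed process" lines from the console output quoted in this thread. The function name is invented and the regular expression only fits the message format shown in that output; other kernels word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as "Killed process 2446 (mysqld)".
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")

# The OOM lines from the Lish output posted earlier in this thread.
SAMPLE = """\
merlin login: Out of memory: kill process 2338 (mysqld_safe) score 224717 or a child
Killed process 2446 (mysqld)
Out of memory: kill process 2873 (apache2) score 197820 or a child
Killed process 6182 (apache2)
Out of memory: kill process 2873 (apache2) score 197787 or a child
Killed process 6464 (apache2)
Out of memory: kill process 2873 (apache2) score 93674 or a child
Killed process 6492 (apache2)
Out of memory: kill process 2338 (mysqld_safe) score 110602 or a child
Killed process 7010 (mysqld)
Out of memory: kill process 2873 (apache2) score 98892 or a child
Killed process 6649 (apache2)
Out of memory: kill process 2873 (apache2) score 98406 or a child
Killed process 8311 (apache2)
"""


def tally_kills(console_text):
    """Count how many times each program name appears in a 'Killed process' line."""
    return Counter(name for _pid, name in KILLED_RE.findall(console_text))


if __name__ == "__main__":
    for name, count in tally_kills(SAMPLE).most_common():
        print(name, count)
```

For that output the tally is five apache2 kills and two mysqld kills. That shows who got shot, not who caused the shortage, so it points to both Apache and MySQL as things to look at and doesn't prove either guilty.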

As for how to resolve the problem... depends on how urgent your needs are.

If you need "working web site now!" then upgrade to a bigger Linode (heck, even a 2880). Then work very hard on getting your footprint down to a reasonable size, and downgrade to the smallest Linode that meets your needs. Because of how Linode prorates usage, you'll get a credit back (not a refund) for the unused 2880 period, and this can be used to pay for the smaller Linode.

(Umm, I think I'm right; I'm sure Linode staff will correct me if I've misstated the billing/refund policies.)

If this is still in the "I don't care if it's down" stage, then work on reducing the footprint and accept the outages.

_________________
Rgds
Stephen
(Linux user since kernel version 0.11)


Posted: Sun Mar 14, 2010 10:52 pm
Senior Member

Joined: Sat Mar 28, 2009 4:23 pm
Posts: 415
Website: http://jedsmith.org/
Location: Out of his depth and job-hopping without a clue about network security fundamentals
sweh wrote:
Umm, I think I'm right

Yes.

_________________
Disclaimer: I am no longer employed by Linode; opinions are my own alone.


Posted: Mon Mar 15, 2010 2:24 pm
Junior Member

Joined: Sun Oct 11, 2009 12:44 pm
Posts: 29
Location: Northern Ireland
Thanks for the replies, guys. I've upgraded to the next Linode plan up and still have one of the sites disabled. The amount of traffic to the site hasn't changed, so maybe this will be enough.

I'm looking into optimizing things now while it's back up. Do any of you know of good resources (online, or even books!) I should start with?

Thanks again for the help! :)

