cloudry wrote:
My free -m was actually showing that I had free memory, and a lot of it.
Just a dumb question, but was this free output taken at exactly the moment your requests were hanging? Memory status can change very rapidly.
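If you can't catch a hang in real time, one option is to log memory snapshots continuously and correlate the timestamps with when requests hang afterwards. A minimal sketch (reading /proc/meminfo directly so it works even without procps; the log path and interval are just examples):

```shell
#!/bin/sh
# Append a timestamped MemFree/Buffers/Cached snapshot every 5 seconds.
# Later, grep the log around the times requests were hanging.
LOG=/var/log/memwatch.log
while true; do
    printf '%s ' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
    awk '/^(MemFree|Buffers|Cached):/ {printf "%s %d MB  ", $1, $2/1024}
         END {print ""}' /proc/meminfo >> "$LOG"
    sleep 5
done
```

Run it under nohup or in a screen session so it survives your login; a few hours of data is usually enough to see whether free memory actually dips when the hangs occur.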
If in fact you do have that much memory free, then it certainly doesn't seem like a system-wide OOM, but perhaps some component has its own limit that is being exceeded. I think I'd concur with the other suggestion about seeing about disabling apc for the time being (given that's who seems to be logging about the memory failures). I'm assuming those failures might result in something failing to execute properly, which could then hang a response.
I understand you've been running with apc for a long time, but something clearly changed recently, and at this point I think you're better off having as few components running as you can get away with until you can get a better handle on what is going on. Then you can add stuff back in.
In terms of your mysql question, certainly mysqld is going to need resources, but again if your free output above is representative, you're not hurting for memory right at the moment so I wouldn't worry about that too much. If you want you could make sure you have slow query logging enabled (the log-slow-queries and long_query_time parameters), so at least you'll know if mysql is contributing significantly to processing time for requests.
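For reference, a minimal my.cnf fragment that turns this on (the log path and 2-second threshold are just examples; newer MySQL versions spell these options slow_query_log and slow_query_log_file instead):

```
[mysqld]
# log any query taking longer than 2 seconds
log-slow-queries = /var/log/mysql/mysql-slow.log
long_query_time  = 2
```

After a restart, an empty slow log during a hang is itself useful information — it points the finger away from mysql.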
Depending on how hard it is to simplify your configuration, you might even consider using a second, temporary, Linode for testing. You could clone your current Linode to it (with a few tweaks for network configuration and host name), then use it for testing, which makes it easier to hack and slash the configuration until you've got stability.
Of course, if your primary Linode is essentially unusable at the moment, it's not clear how much worse experimenting directly on it could make things.
Bottom line: my best overall suggestion at this point is to simplify things as much as you can, down to as few components as possible, in an attempt to achieve basic stability, even if you have to do so at a cost to overall req/s throughput. You've got to reach a baseline that is at least stable, since right now I suspect you might have had more than one issue over time contributing to the behavior you have seen. For example, you were clearly OOMing at one point (given your console logs) but now seem to have enough memory, so something else is happening.
Once you have some sort of baseline, you can start tweaking parameters and adding back components while continuing to stress the system and watching stats (cpu, memory, I/O, log files, etc.), and hopefully once it becomes unstable again you'll have a decent chance at identifying the component or resource that is the root cause.
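While you're stress testing, it helps to capture those stats in one place so a regression shows up in context rather than having to piece it together afterwards. A rough sketch (the log path, interval, and which log file to tail are all assumptions to adjust for your setup):

```shell
#!/bin/sh
# Every 10 seconds, record load average, memory/swap, and the tail of the
# system log, all timestamped in one file for easy correlation.
LOG=/var/log/stresswatch.log
while true; do
    {
        echo "== $(date '+%Y-%m-%d %H:%M:%S') =="
        cat /proc/loadavg
        grep -E '^(MemFree|SwapFree):' /proc/meminfo
        tail -n 2 /var/log/messages 2>/dev/null
    } >> "$LOG"
    sleep 10
done
```

When things go sideways again, the last few stanzas before the instability usually tell you which resource moved first.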
-- David