mattryan29 wrote:
constant OOMing has me at a loss for the cause.
Well, the immediate cause is fairly easy to determine - you're running out of memory, to state the obvious. There's supporting evidence in your munin graphs, which show significant swap usage, and if you're using a default 256MB swap image I can certainly see how you'd run completely out of memory. The graphs may not directly show such a failure, but it's often fast enough to be missed by (or to preclude) the next munin polling cycle. And yes, if the OOM killer can't free memory quickly enough, it can lead to a kernel panic, which can max out CPU usage (though in my case it's always seemed to be a single core, probably for the kernel thread).
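As a quick check that the OOM killer really is firing, the kernel logs each kill, so a sketch of looking for those messages might be:

```shell
# Search the kernel ring buffer for OOM-killer activity; the fallback
# message just makes the result of the check explicit either way.
dmesg | grep -iE 'out of memory|oom-killer|killed process' \
    || echo "no OOM events found in the ring buffer"
```

The ring buffer is lost on a panic or reboot, so it's also worth grepping the persisted kernel log (/var/log/kern.log or /var/log/messages, depending on distro) for the same strings.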
Of course, the rub is determining why, but a reasonable first step is to focus on your web stack, since that tends to have the most variability in a lot of configurations.
I'd start by reviewing process status while the system is up (with top or ps, say) to estimate a per-process size for Apache, and how much memory is left once you account for the other standard processes on your system. Then see whether that size, multiplied by your MaxClients setting, could use more memory than you have.
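A rough sketch of that arithmetic (the process name "apache2" may be "httpd" on other distros, and the MaxClients value here is just an example):

```shell
# Average resident size (MB) across Apache worker processes;
# prints 0 if none are running.
avg_mb=$(ps -C apache2 -o rss= | awk '{s+=$1; n++} END {printf "%d", n ? s/n/1024 : 0}')

# Worst-case footprint if every one of MaxClients workers reaches that size.
worst_case_mb() {  # $1 = average worker size in MB, $2 = MaxClients
    echo $(( $1 * $2 ))
}

echo "average worker: ${avg_mb} MB"
echo "worst case at MaxClients=40: $(worst_case_mb "$avg_mb" 40) MB"
```

For instance, 45MB workers at the stock MaxClients of 40 would be 1800MB in the worst case - far more than a Linode 512 has.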
If I could, I'd also try stress testing the current configuration (such as with "ab", choosing a representative URL that exercises the full stack and database) to see if I could reproduce the problem. If so, that's a big advantage, since you can re-run the same test after each change.
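A sketch of such a run with ab (shipped in the apache2-utils package); the URL, concurrency, and request count are placeholders to adjust for your site:

```shell
# Wraps ab so the same load test is easy to repeat after each config change.
stress() {  # $1 = URL, $2 = concurrent clients, $3 = total requests
    ab -n "$3" -c "$2" "$1"
}

# Example: pick a page that exercises your app code and the database,
# not a static file, so the test reflects real per-request cost.
# stress "http://www.example.com/some-dynamic-page" 10 500
```

Watch memory and swap (free, vmstat, or your munin graphs) while it runs; if you can push the box into swapping or an OOM kill on demand, you've reproduced the failure.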
In either case, you're likely going to want to drop MaxClients. If your analysis shows the current value is too large, use that to pick a new one. But even if it appears fine, one troubleshooting approach is to drop MaxClients drastically - down to 1 or 2 - and MaxRequestsPerChild as well - to low double or even single digits - in case there's a per-request memory leak. Dropping MaxRequestsPerChild helps prevent any single process from growing unusually large, which you may not have been able to catch while watching the system. Performance may suffer, but at this point the goal is to stop the failures completely, and worry about performance second.
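For reference, a drastically lowered prefork configuration along those lines might look like this (directive names as in Apache 2.x's prefork MPM; the values are the troubleshooting extremes described above, not a recommendation for normal running):

```apache
<IfModule mpm_prefork_module>
    StartServers          1
    MinSpareServers       1
    MaxSpareServers       2
    MaxClients            2
    MaxRequestsPerChild  10
</IfModule>
```

MaxClients 2 caps total concurrent workers, and MaxRequestsPerChild 10 recycles each worker quickly so a leaking process can't grow for long before being replaced.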
If you can't afford to do the stress testing or configuration changes on the production Linode, clone it to another Linode (even if you just add one for a few days for the testing), and then perform your stress testing and test changes there.
If your application stack is large enough memory-wise for each request, it's not necessarily wrong to have to drop MaxClients into single digits on a Linode 512. Nor does doing so necessarily imply terrible performance, unless each request takes a really long time to satisfy. Though of course, delaying requests is still a far more graceful degradation than keeling over completely.
There are a number of topics here on tuning Apache (and associated application stacks) and MySQL with more detailed suggestions, ways to work up to a final configuration once you've stopped the pain, and ways to improve performance at whatever configuration your Linode can support, so I'd definitely do a few forum searches. In terms of first steps, though, most of them boil down to dropping the configuration extremely low to guarantee you have enough resources, then adjusting it slowly upwards. That may eventually lead to the conclusion that a larger Linode is needed for your purpose, but that's something you can only conclude with certainty after tuning everything for the current one.
-- David