So, I was adding some hi-res photos to my Zen Photo installation, and the Linode ground to a halt fairly quickly when creating thumbnails. I discovered that hi-res photos require a lot of memory (even when heavily compressed) when manipulated by ImageMagick.
Compressed storage formats are too inefficient for actual processing, so an image program is pretty certain to expand an image for processing, and sometimes even waste some memory in an internal format (e.g., using 4 bytes per pixel for alignment rather than 3 even if it's just an RGB image with no alpha channel) for efficiency.
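To put rough numbers on that, here is a back-of-the-envelope sketch. The 4000 x 3000 dimensions are a hypothetical 12-megapixel photo, not a figure from your setup, and the function name is just for illustration. The real per-pixel cost depends on how the image library was built, so treat `bytes_per_pixel` as something to vary.

```python
def decoded_size_bytes(width, height, bytes_per_pixel=4):
    """Approximate size of one uncompressed image buffer in memory."""
    return width * height * bytes_per_pixel


# Hypothetical 12-megapixel photo; substitute your own dimensions.
width, height = 4000, 3000

packed_rgb = decoded_size_bytes(width, height, 3)   # 3 bytes per pixel, no padding
padded_rgb = decoded_size_bytes(width, height, 4)   # padded to 4 bytes for alignment

print(f"packed RGB:        {packed_rgb / 2**20:.1f} MiB")
print(f"padded to 4 bytes: {padded_rgb / 2**20:.1f} MiB")
```

The size of the compressed file on disk doesn't enter into it at all, and a resize typically needs at least a source and a destination buffer at the same time.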
Some tools provide a way to use disk as their own backing storage (Gimp has its tile cache, and I think ImageMagick has some limit options). That comes at a significant cost in performance, but it controls their impact on the overall system working set. Of course, whether it's swap or just application I/O, it's all running into the same disk I/O bottleneck we have here, but it might help contain the impact to the image application.
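For ImageMagick, the option I had in mind is `-limit`, which takes a resource type such as `memory` or `map` plus a value. A minimal sketch follows, assuming thumbnails are produced by calling the `convert` command-line tool, which may not be how Zen Photo actually drives ImageMagick. The helper name, file names, and limit values are placeholders I made up, not recommendations.

```python
import shlex


def thumbnail_command(src, dst, size="200x200", memory="64MiB", mapped="128MiB"):
    """Build (but do not run) a convert command line with resource limits.

    The limit values are arbitrary placeholders; tune them for your system.
    """
    return [
        "convert",
        "-limit", "memory", memory,   # cap on heap used for the pixel cache
        "-limit", "map", mapped,      # cap on memory-mapped pixel cache
        src,
        "-thumbnail", size,
        dst,
    ]


# Print the command so it can be inspected before running it by hand.
print(shlex.join(thumbnail_command("photo.jpg", "thumb.jpg")))
```

Once those limits are exceeded, ImageMagick should fall back to caching pixels on disk, so the job gets slower but the rest of the box is less likely to be pushed into swap.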
Anyway, the strange thing was this: the actual apps memory usage stayed level, but the committed memory went from 333MB to 613MB. Swap usage was over 50MB and climbing and, obviously, things slowed severely. Restarting Apache sorted things, since the VIRT memory usage for some processes was about 125MB, even though RES was 90MB.
What reference were you using to judge "actual apps memory usage" versus the committed memory? Did they have different snapshot cycles? Depending on how the images were being processed, could there have been processing coming and going without being caught in one or both of the stats?
Here are my questions: Why did swap get thrashed when the commit went high, even though the apps appeared not to use it (according to munin)? Or was munin giving me false info, and the apps were indeed using the full 512MB plus swap?
Historically this would be a simpler answer - "committed" memory would be the system working set (actual current requirements) so if higher than physical memory you would have to swap.
I think Linux's representation of committed memory is a little different, in part because it doesn't actually commit memory upon request, but only on use, so there's an implicit unknown factor. As I understand it, the committed stat is more a probabilistic report as to how much memory you would need to ensure you can't OOM, but I don't know that it requires all of that memory be actively in use, or even have been touched yet.
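If you want to look at the raw numbers yourself rather than through a graph, the kernel exposes them in `/proc/meminfo` as `Committed_AS` (address space currently promised to processes) and `CommitLimit`. A small sketch for pulling them out; the function names are my own, and I'm only assuming the usual "Name: value kB" line layout.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])  # most entries are in kB
    return values


def commit_summary(path="/proc/meminfo"):
    """Return the fields most relevant to the commit vs. physical memory question."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    wanted = ("MemTotal", "Committed_AS", "CommitLimit", "SwapTotal", "SwapFree")
    return {name: info.get(name) for name in wanted}


if os.path.exists("/proc/meminfo"):  # Linux only
    for name, kib in commit_summary().items():
        print(f"{name:>13}: {kib} kB")
```

Comparing `Committed_AS` against `MemTotal` while the thumbnails are being built should tell you more than a 5 minute graph will.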
I think for most practical purposes the difference can be ignored though, when it comes to sizing the needs of a system. And given that you were, in fact, swapping, I think a classical definition is close enough for your purposes. It seems safe to assume your aggregate instantaneous memory usage was exceeding physical memory.
As for not accounting for it with apps, Munin by default (I think) uses a 5-minute cycle, and for process metrics it just takes snapshots at those points (unlike, say, network I/O, which is an increasing counter it can delta from the prior run). So there's a lot it can miss, including very large spikes in resource usage, especially with short-lived applications. It's just a guess, but perhaps you had a lot of ImageMagick processes starting and stopping as the thumbnails were being built, and on average you were overcommitted, but the instantaneous snapshots taken by Munin couldn't show it. In such a case I don't think I'd call the Munin information "false", just that the metric you were looking for was beyond its capabilities at its sampling rate.
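To illustrate how a snapshot can miss this, here is a toy simulation. Everything in it is invented for the example: a 150 MB job that lives for 20 seconds and starts once a minute, sampled every 300 seconds. It's deliberately a worst case where the samples never land on a running job; a real workload would get caught some of the time.

```python
def usage_at(t, jobs):
    """Total memory (MB) of the jobs alive at second t.

    Each job is a (start_second, duration_seconds, megabytes) tuple.
    """
    return sum(mb for start, duration, mb in jobs if start <= t < start + duration)


# Hypothetical workload: one 20 s, 150 MB job per minute, starting at 10 s past.
horizon = 1800
jobs = [(10 + 60 * i, 20, 150) for i in range(horizon // 60)]

# Snapshot monitoring: one reading every 300 seconds.
sampled = [usage_at(t, jobs) for t in range(0, horizon, 300)]

# What a per-second view would have seen.
true_peak = max(usage_at(t, jobs) for t in range(horizon))

print("peak seen by 5-minute snapshots:", max(sampled), "MB")
print("actual peak:", true_peak, "MB")
```

A counter-style metric doesn't have this problem because the work done between samples is still reflected in the next reading; a point-in-time gauge like process memory has nothing to carry it forward.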
You can see something similar if you get caught in a problem that constantly forks processes. A system can be brought to a dead crawl, but even attempting to constantly monitor with something like top might not show why, since the processes are being created and destroyed too quickly for the monitoring frequency.