dfd25 wrote:
And here's a link
http://explain.depesz.com/s/WWX to the depesz output. Should I focus on the exclusive measurements?
It depends - you sort of want to remain aware of both exclusive and inclusive. Wall clock time is ultimately driven by the inclusive values, so you could have a step with a small exclusive value that is looped a ton of times and yields a high inclusive value at the outer step. But clearly any large exclusive value (even if it's a step executed only once) is a direct contribution to the final time.
You can see changes in this query since you have the larger data set in that some of the nested loops have been replaced with merge or hash joins. And yeah, I'd probably be looking in the same directions as you note as things to try to attack.
For example, the date and state filters are being applied to an index scan of the respective table, and now there's a decent number of rows involved, so if an index would cut that down it could help. But it'll depend on the current distribution of values and how selective your filter is.
What appears to be happening with pulldate is that the planner chose to scan c first by its primary key index, filtering the tuples by the pulldate filter, and afterwards merge joining to a. (So filter, then join, rather than join, then filter - which avoids matching rows in c up to rows in a when the pulldate comparison would have eliminated them anyway.) So if your pulldate filter is reasonably selective (e.g., it's pulling out ~163K rows, but I don't know how big the table is to start with), an index on it should help, as the planner would use that index rather than the c_id_pkey index. But if your pulldate filter selects a large fraction of the records in c anyway, it may be a marginal win, or the planner may stay with its current scan. Same concept for state.
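To try that, something like this is all it takes (sketch only - the index names are mine, "c" follows the alias in your plan, and I'm guessing state lives on the same table; adjust to your actual schema):

```sql
-- Hypothetical index names; substitute your real table/column names.
CREATE INDEX idx_c_pulldate ON c (pulldate);
CREATE INDEX idx_c_state ON c (state);

-- Refresh statistics so the planner can judge the filter selectivity:
ANALYZE c;
```

Then re-run your query under EXPLAIN ANALYZE and see whether the planner switches from the c_id_pkey scan to one of the new indexes.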
The merge and hash join steps are largely a function of the row count, so if you need all the matched rows, there's not a lot you can do about them. However, one thing that may help: if you don't actually need every column of each joined row in your result set, listing only the columns you need cuts down the row size all along the way, so the join steps have less data to move around. It looks like c is your widest table (see the width= values), so if you trim columns from only one table, that would be the single point to start with - assuming you don't really need them all.
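A minimal sketch of the idea - the column names and join condition here are made up, since I don't have your actual query:

```sql
-- Instead of dragging every column of the wide table through the joins:
--   SELECT * FROM a JOIN c ON c.id = a.c_id WHERE ...
-- name only what you need, which shrinks the width= of every row the
-- join and sort steps have to carry:
SELECT a.id, c.pulldate, c.state
FROM a
JOIN c ON c.id = a.c_id  -- hypothetical join condition
WHERE c.pulldate >= '2013-01-01';  -- hypothetical filter
```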
I guess if it were me, I would go ahead and separately try indices on pulldate and state, plus try to tune my query to only retrieve the columns I really needed.
On the sort estimate, I think you read that backwards - at the outermost sort the planner was estimating 36k rows and got 10 (probably a mis-estimate on the selectivity of the T6 self-join selection). So the planner estimate was high not low (see the arrow by the "rows x" value on the web site). Large estimate errors (whether over or under) can certainly shift a plan selection, but in this case it's for your final sort which is always going to occur at that point, and the mis-estimate works out in your favor.
While mentioning sorts, you may also notice that the inner sort spilled to disk at about 1.5MB. It didn't take particularly long, but if it had been a slow step in an inner loop, that might have been another place to consider attacking. In such cases you can also consider increasing work_mem (assuming your system memory can support it) to let the sort take place in memory. Note that work_mem is allocated at least once, and sometimes multiple times, per client per query, so you want to be cautious about how high you set it globally. Or you can increase it just for a particular query that you know would benefit.
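The per-query approach looks like this (the 8MB value is just an example - size it to your sort):

```sql
-- Session-scoped: applies to subsequent queries on this connection only,
-- without touching the server-wide setting.
SET work_mem = '8MB';

-- Or scope it to a single transaction so it resets automatically:
BEGIN;
SET LOCAL work_mem = '8MB';
-- ... run the query that needs the larger sort memory ...
COMMIT;
```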
In terms of other parameters, shared_buffers of 24MB and effective_cache_size of 128MB are pretty conservative (as are most of PostgreSQL's default settings). And yes, the commented values are the defaults, so your work_mem will be 1MB. You can check live values with the "show" SQL command. shared_buffers is the actual memory used by PostgreSQL to hold data during query processing, though it's not as critical as in some other databases, since PostgreSQL assumes heavy use of the OS cache. effective_cache_size is just a hint as to the system memory available for caching, but it's important that it accurately reflect your system, since it can shift the planner's choices to better take advantage of the fact that the OS is most likely already caching the data PostgreSQL wants.
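For example:

```sql
SHOW work_mem;              -- should report 1MB if you're on the default
SHOW shared_buffers;
SHOW effective_cache_size;

-- or dump every setting at once:
SHOW ALL;
```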
How big is the system you are running this on? For recent PostgreSQL releases (8.4+) you can bump shared_buffers up quite a bit (it wasn't as effective in earlier versions), though you'll need to adjust the kernel shmmax parameter since Ubuntu defaults to 32MB. I think the current recommendation for systems with 1G+ available to PostgreSQL is to try around 25%, while there are diminishing returns above about 40%. Personally I've been using 128MB on a Linode 1024, but could probably go higher, and have 768MB as the effective_cache_size. This is on a dedicated node, so not much besides PostgreSQL is running.
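The mechanics on Ubuntu look roughly like this (paths assume a stock Ubuntu 12.04 / 9.1 install; the 512MB shmmax and the config values below are examples, not recommendations for your box):

```shell
# Raise the kernel shared memory limit (value in bytes; 512MB shown)
# so PostgreSQL can allocate a larger shared_buffers segment:
sudo sysctl -w kernel.shmmax=536870912

# Persist it across reboots:
echo "kernel.shmmax=536870912" | sudo tee -a /etc/sysctl.conf

# Then edit /etc/postgresql/9.1/main/postgresql.conf, e.g.:
#   shared_buffers = 256MB
#   effective_cache_size = 768MB
# and restart for shared_buffers to take effect:
sudo service postgresql restart
```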
I'm not sure any of these parameters are particularly hurting this query, so that may just be some items to try in the future.
-- David
PS: I'd recommend upgrading to 9.1.9 (latest version in Ubuntu 12.04 repository), especially if your server accepts network client connections - it has an important security fix. There's no need to do anything with your data as it's completely backwards compatible with your current 9.1.8 cluster, so you can just do an apt-get upgrade in-place.
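Assuming the stock Ubuntu packaging, the upgrade is just:

```shell
sudo apt-get update
sudo apt-get upgrade
# or, to touch only the server package:
sudo apt-get install --only-upgrade postgresql-9.1
```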