Re: [PERFORM] merge>hash>loop
On Wed, Apr 19, 2006 at 01:25:28AM -0400, Tom Lane wrote:
> Mark Kirkwood [EMAIL PROTECTED] writes:
> > Jim C. Nasby wrote:
> > > Good point. :/ I'm guessing there's no easy way to see how many
> > > blocks for a given relation are in shared memory, either...
> > contrib/pg_buffercache will tell you this -
> I think the key word in Jim's comment was "easy", ie, cheap. Grovelling
> through many thousands of buffers to count the matches to a given
> relation doesn't sound appetizing, especially not if it gets done over
> again several times during each query-planning cycle. Trying to keep
> centralized counts somewhere would be even worse (because of locking/
> contention issues).

Very true. OTOH, it might not be unreasonable to periodically slog
through the buffers and store that information, perhaps once a minute,
or every X number of transactions.

I think a bigger issue is that we currently have no way to really
measure the effectiveness of the planner. Without that it's impossible
to come up with any real data on whether cost formula A is better or
worse than cost formula B. The only means I can think of for doing this
would be to measure estimated cost vs. actual cost, but with the
overhead of EXPLAIN ANALYZE and other variables that might not prove
terribly practical. Maybe someone else has some ideas...
--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
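At the row-count level, a crude version of this comparison is already visible today, since EXPLAIN ANALYZE prints the planner's estimate next to the actual figure for every node. A hedged illustration (table and column names are made up, not from this thread):

```sql
-- Each plan node shows "rows=<estimate>" in the cost block and
-- "rows=<actual>" in the actual-time block; the gap between the two is a
-- per-node measure of how far off the planner was.
-- (Hypothetical table/column; substitute your own.)
EXPLAIN ANALYZE
SELECT * FROM orders WHERE order_date >= '2006-01-01';
-- e.g.  Seq Scan on orders  (cost=0.00..1234.00 rows=5000 width=32)
--                           (actual time=0.02..45.10 rows=4876 loops=1)
-- Here rows=5000 (estimated) vs. rows=4876 (actual) for this node.
```

Comparing estimated *cost units* against actual *milliseconds* is harder, since the two are not in the same units; that is part of why this remains an open problem.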
Re: [PERFORM] merge>hash>loop
Hi, Tom,

Tom Lane wrote:
> Well, the other thing that's going on here is that we know we are
> overestimating the cost of nestloop-with-inner-indexscan plans. The
> current estimation for that is basically "outer scan cost plus N times
> inner scan cost" where N is the estimated number of outer tuples; in
> other words the repeated indexscan probes are each assumed to happen
> from a cold start. In reality, caching of the upper levels of the index
> means that the later index probes are much cheaper than this model
> thinks. We've known about this for some time but no one's yet proposed
> a more reasonable cost model.

My spontaneous guess would be to use log(N)*inner instead of N*inner. I
don't have any backing for that, it's just what my intuition tells me
as a first shot.

> In my mind this is tied into another issue, which is that the planner
> always costs on the basis of each query starting from zero. In a real
> environment it's much cheaper to use heavily-used indexes than this
> cost model suggests, because they'll already be swapped in due to use
> by previous queries. But we haven't got any infrastructure to keep
> track of what's been heavily used, let alone a cost model that could
> make use of the info.

An easy first approach would be to add a user-tunable cache probability
value to each index (and possibly table) between 0 and 1. Then simply
multiply random_page_cost with (1 - that value) for each scan. Later,
this value could be automatically tuned by stats analysis or other
means.

> I think part of the reason that people commonly reduce random_page_cost
> to values much lower than physical reality would suggest is that it
> provides a crude way of partially compensating for this basic problem.

I totally agree with this, it's just what we did here from time to
time. :-)

Hmm, how does effective_cache_size correspond with it? Shouldn't a high
effective_cache_size have a similar effect?

Thanks,
Markus
--
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org
www.nosoftwarepatents.org
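For a sense of scale, the difference between the two models is dramatic. A quick back-of-the-envelope in SQL, using invented numbers (50000 outer tuples and an inner probe cost of 5.99, roughly matching the nestloop plan quoted later in this thread; purely illustrative, this is not what the planner computes):

```sql
-- Current model: charge N * inner_cost, as if every probe starts cold.
-- Proposed log-discounted model: charge log(N) * inner_cost.
SELECT 50000 * 5.99     AS cost_n_times_inner,      -- 299500
       ln(50000) * 5.99 AS cost_log_n_times_inner;  -- ~64.8
```

A pure log(N) discount is almost certainly too aggressive (it charges less for 50000 probes than for one cold scan of a large index), but it shows why the cold-start assumption can overstate nestloop cost by several orders of magnitude.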
Re: [PERFORM] merge>hash>loop
On Tue, Apr 18, 2006 at 12:51:59PM +0200, Markus Schaber wrote:
> > In my mind this is tied into another issue, which is that the planner
> > always costs on the basis of each query starting from zero. In a real
> > environment it's much cheaper to use heavily-used indexes than this
> > cost model suggests, because they'll already be swapped in due to use
> > by previous queries. But we haven't got any infrastructure to keep
> > track of what's been heavily used, let alone a cost model that could
> > make use of the info.
> An easy first approach would be to add a user-tunable cache probability
> value to each index (and possibly table) between 0 and 1. Then simply
> multiply random_page_cost with (1 - that value) for each scan. Later,
> this value could be automatically tuned by stats analysis or other
> means.

Actually, if you run with stats_block_level turned on you have a
first-order approximation of what is and isn't cached. Perhaps the
planner could make use of this information if it's available.

> > I think part of the reason that people commonly reduce
> > random_page_cost to values much lower than physical reality would
> > suggest is that it provides a crude way of partially compensating
> > for this basic problem.
> I totally agree with this, it's just what we did here from time to
> time. :-)
> Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> effective_cache_size have a similar effect?

Generally, effective_cache_size is used to determine the likelihood
that something will be in-cache. random_page_cost tells us how
expensive it will be to get that information if it isn't in cache.
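The block-level stats Jim mentions are exposed through the statistics views. A hedged example of pulling a rough per-table cache hit ratio (requires stats_block_level = on; and as Tom points out in his reply, these counters accumulate forever rather than decaying, so they describe history, not the current cache contents):

```sql
-- heap_blks_hit  = reads satisfied from shared_buffers;
-- heap_blks_read = reads that had to go to the kernel cache or disk.
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       heap_blks_hit::float8
           / nullif(heap_blks_hit + heap_blks_read, 0) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```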
Re: [PERFORM] merge>hash>loop
Jim C. Nasby [EMAIL PROTECTED] writes:
> Actually, if you run with stats_block_level turned on you have a
> first-order approximation of what is and isn't cached.

Only if those stats decayed (pretty fast) with time; which they don't.

			regards, tom lane
Re: [PERFORM] merge>hash>loop
Markus Schaber [EMAIL PROTECTED] writes:
> Hmm, how does effective_cache_size correspond with it? Shouldn't a high
> effective_cache_size have a similar effect?

It seems reasonable to suppose that effective_cache_size ought to be
used as a number indicating how much stuff would hang around from query
to query. Right now it's not used that way...

			regards, tom lane
Re: [PERFORM] merge>hash>loop
On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
> Jim C. Nasby [EMAIL PROTECTED] writes:
> > Actually, if you run with stats_block_level turned on you have a
> > first-order approximation of what is and isn't cached.
> Only if those stats decayed (pretty fast) with time; which they don't.

Good point. :/ I'm guessing there's no easy way to see how many blocks
for a given relation are in shared memory, either...
Re: [PERFORM] merge>hash>loop
On Tue, Apr 18, 2006 at 06:26:48PM -0400, Tom Lane wrote:
> Markus Schaber [EMAIL PROTECTED] writes:
> > Hmm, how does effective_cache_size correspond with it? Shouldn't a
> > high effective_cache_size have a similar effect?
> It seems reasonable to suppose that effective_cache_size ought to be
> used as a number indicating how much stuff would hang around from
> query to query. Right now it's not used that way...

Maybe it would be a reasonable first pass to have estimators calculate
the cost if a node found everything it wanted in cache, and then do a
linear interpolation between that and the costs we currently come up
with? Something like pg_class.relpages / sum(pg_class.relpages) would
give an idea of how much of a relation is likely to be cached, which
could be used for the linear interpolation.

Of course having *any* idea as to how much of a relation was actually
in shared_buffers (or better yet, the OS cache) would be a lot more
accurate, but this simple method might be a good-enough first pass.
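The relpages heuristic sketched above can be computed directly from the catalog. A hedged illustration (note relpages is only refreshed by VACUUM/ANALYZE, so the fraction is as stale as the last maintenance run, and it ignores indexes and toast tables):

```sql
-- Each table's share of all heap pages, as a crude proxy for how much
-- of it might plausibly stay cached relative to everything else.
SELECT relname,
       relpages,
       relpages::float8
           / (SELECT sum(relpages) FROM pg_class WHERE relkind = 'r')
           AS fraction_of_all_heap_pages
FROM pg_class
WHERE relkind = 'r'
ORDER BY relpages DESC
LIMIT 10;
```

A more refined version of the same idea would compare relpages against effective_cache_size (in pages) rather than against the sum of all tables, so that a table smaller than the cache is assumed mostly resident.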
Re: [PERFORM] merge>hash>loop
Markus Schaber [EMAIL PROTECTED] writes:
> An easy first approach would be to add a user-tunable cache probability
> value to each index (and possibly table) between 0 and 1. Then simply
> multiply random_page_cost with (1 - that value) for each scan.

That's not the way you'd need to use it. But on reflection I do think
there's some merit in a cache probability parameter, ranging from zero
(giving current planner behavior) to one (causing the planner to assume
everything is already in cache from prior queries). We'd have to look at
exactly how such an assumption should affect the cost equations ...

			regards, tom lane
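As a purely arithmetical illustration of Markus's original multiply-by-(1 - p) proposal (not Tom's refinement, and not anything the planner actually does; every number below is invented):

```sql
-- Hypothetical: a per-index cache probability of 0.9 discounting
-- random_page_cost under the proposed scheme.
SELECT 4.0               AS random_page_cost,   -- the default setting
       0.9               AS cache_probability,  -- made-up per-index tunable
       4.0 * (1.0 - 0.9) AS discounted_cost;    -- 0.4
```

Tom's objection is visible even in this toy form: at p = 1 the discounted cost drops to zero, i.e. a cached random fetch is treated as free rather than as costing a buffer access, so the parameter cannot simply scale random_page_cost; it has to feed the cost equations differently.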
Re: [PERFORM] merge>hash>loop
Jim C. Nasby wrote:
> On Tue, Apr 18, 2006 at 06:22:26PM -0400, Tom Lane wrote:
> > Jim C. Nasby [EMAIL PROTECTED] writes:
> > > Actually, if you run with stats_block_level turned on you have a
> > > first-order approximation of what is and isn't cached.
> > Only if those stats decayed (pretty fast) with time; which they
> > don't.
> Good point. :/ I'm guessing there's no easy way to see how many blocks
> for a given relation are in shared memory, either...

contrib/pg_buffercache will tell you this - what buffers from what
relation are in shared_buffers (if you want to interrogate the os file
buffer cache, that's a different story - tho I've been toying with doing
a utility for FreeBSD that would do this).

Cheers

Mark
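For reference, a typical way to count cached buffers per relation with contrib/pg_buffercache (a sketch; assumes the module is installed in the current database, and the relfilenode join only resolves relations belonging to that database):

```sql
-- Count shared_buffers pages held per relation. Note this walks the
-- entire buffer array, which is exactly the "not cheap" operation Tom
-- objects to doing during planning.
SELECT c.relname,
       count(*) AS cached_buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
GROUP BY c.relname
ORDER BY cached_buffers DESC
LIMIT 10;
```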
Re: [PERFORM] merge>hash>loop
On Wed, Apr 19, 2006 at 04:47:40PM +1200, Mark Kirkwood wrote:
> Jim C. Nasby wrote:
> > Good point. :/ I'm guessing there's no easy way to see how many
> > blocks for a given relation are in shared memory, either...
> contrib/pg_buffercache will tell you this - what buffers from what
> relation are in shared_buffers (if you want to interrogate the os file

So theoretically with that code we could make the cost estimator
functions more intelligent about actual query costs. Now, how you'd
actually see how those estimates improved...

> buffer cache, that's a different story - tho I've been toying with
> doing a utility for FreeBSD that would do this).

Well, the problem is that I doubt anything that OS-specific would be
accepted into core. What we really need is some method that's
OS-agnostic...
Re: [PERFORM] merge>hash>loop
Mark Kirkwood [EMAIL PROTECTED] writes:
> Jim C. Nasby wrote:
> > Good point. :/ I'm guessing there's no easy way to see how many
> > blocks for a given relation are in shared memory, either...
> contrib/pg_buffercache will tell you this -

I think the key word in Jim's comment was "easy", ie, cheap. Grovelling
through many thousands of buffers to count the matches to a given
relation doesn't sound appetizing, especially not if it gets done over
again several times during each query-planning cycle. Trying to keep
centralized counts somewhere would be even worse (because of locking/
contention issues).

			regards, tom lane
Re: [PERFORM] merge>hash>loop
Tom Lane wrote:
> I think the key word in Jim's comment was "easy", ie, cheap.
> Grovelling through many thousands of buffers to count the matches to a
> given relation doesn't sound appetizing, especially not if it gets
> done over again several times during each query-planning cycle. Trying
> to keep centralized counts somewhere would be even worse (because of
> locking/contention issues).

Yeah - not sensible for a transaction-oriented system - might be ok for
DSS tho.

Cheers

mark
[PERFORM] merge>hash>loop
I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
over Nested Loop. However, this order turns out to increase in
performance. I was hoping someone might be able to shed some light on
why PG chooses the plans in this order, and what I might do to
influence it otherwise. Thanks,

itvtrackdata=> explain analyze SELECT mva,blob.* FROM blobs33c3_c16010
AS blob NATURAL JOIN objects33c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=47170.44..47184.46 rows=5607 width=28) (actual time=2661.980..2663.642 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Merge Join  (cost=46547.88..46821.32 rows=5607 width=28) (actual time=2645.685..2657.621 rows=1687 loops=1)
         Merge Cond: ("outer".sva = "inner".sva)
         ->  Sort  (cost=18003.31..18045.36 rows=16821 width=20) (actual time=181.303..183.092 rows=1741 loops=1)
               Sort Key: blob.sva
               ->  Bitmap Heap Scan on blobs33c3_c16010 blob  (cost=535.77..16822.65 rows=16821 width=20) (actual time=10.827..177.671 rows=1741 loops=1)
                     Recheck Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
                     Filter: ((area((box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) / area(box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)))) > 0::double precision)
                     ->  Bitmap Index Scan on idx_blobs33c3_c16010_time  (cost=0.00..535.77 rows=50462 width=0) (actual time=8.565..8.565 rows=50673 loops=1)
                           Index Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
         ->  Sort  (cost=28544.56..28950.88 rows=162526 width=16) (actual time=2387.726..2437.429 rows=29969 loops=1)
               Sort Key: objects33c3.sva
               ->  Seq Scan on objects33c3  (cost=0.00..14477.68 rows=162526 width=16) (actual time=0.085..826.125 rows=207755 loops=1)
                     Filter: ((finish - start) >= 8738)
 Total runtime: 2675.037 ms
(16 rows)

itvtrackdata=> set enable_mergejoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM blobs33c3_c16010
AS blob NATURAL JOIN objects33c3 WHERE
AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1))#BOX('(50,10),(10,50)'))/AREA(BOX(POINT(bbox_x1,bbox_y0),POINT(bbox_x0,bbox_y1)))>0
AND time>=1263627787-32768 AND time<1273458187 AND finish-start>=8738
ORDER BY time ASC;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=65783.09..65797.10 rows=5607 width=28) (actual time=1211.228..1212.671 rows=1687 loops=1)
   Sort Key: blob."time"
   ->  Hash Join  (cost=15419.77..65433.97 rows=5607 width=28) (actual time=1067.514..1207.727 rows=1687 loops=1)
         Hash Cond: ("outer".sva = "inner".sva)
         ->  Bitmap Heap Scan on blobs33c3_c16010 blob  (cost=535.77..16822.65 rows=16821 width=20) (actual time=14.720..149.179 rows=1741 loops=1)
               Recheck Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
               Filter: ((area((box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)) # '(50,50),(10,10)'::box)) / area(box(point((bbox_x1)::double precision, (bbox_y0)::double precision), point((bbox_x0)::double precision, (bbox_y1)::double precision)))) > 0::double precision)
               ->  Bitmap Index Scan on idx_blobs33c3_c16010_time  (cost=0.00..535.77 rows=50462 width=0) (actual time=12.880..12.880 rows=50673 loops=1)
                     Index Cond: (("time" >= 1263595019) AND ("time" < 1273458187))
Re: [PERFORM] merge>hash>loop
Ian Westmacott [EMAIL PROTECTED] writes:
> I have this query, where PG (8.1.2) prefers Merge Join over Hash Join
> over Nested Loop. However, this order turns out to increase in
> performance. I was hoping someone might be able to shed some light on
> why PG chooses the plans in this order, and what I might do to
> influence it otherwise. Thanks,

Reducing random_page_cost would influence it to prefer the nestloop.
However, I doubt you're ever going to get really ideal results for this
query --- the estimated row counts are too far off, and the WHERE
conditions are sufficiently bizarre that there's not much hope that
they'd ever be much better.

			regards, tom lane
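A minimal sketch of the kind of experiment Tom suggests (the value 2.0 is arbitrary; tune per system, and try it per session before touching postgresql.conf):

```sql
-- Lower random_page_cost for this session only, then re-run the problem
-- query under EXPLAIN ANALYZE to see whether the nestloop plan wins.
SET random_page_cost = 2.0;   -- server default is 4.0
EXPLAIN ANALYZE SELECT ...;   -- substitute the query under test
RESET random_page_cost;       -- restore the server default
```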
Re: [PERFORM] merge>hash>loop
On Fri, 2006-04-14 at 12:13 -0400, Tom Lane wrote:
> Reducing random_page_cost would influence it to prefer the nestloop.
> However, I doubt you're ever going to get really ideal results for
> this query --- the estimated row counts are too far off, and the WHERE
> conditions are sufficiently bizarre that there's not much hope that
> they'd ever be much better.

That's what I feared, thanks. But then if I simplify things a bit, such
that the row counts are quite good, and PG still chooses a worse plan,
can I conclude anything about my configuration settings (like
random_page_cost)?

itvtrackdata=> explain analyze SELECT mva,blob.* FROM blobs33c3_c16010
AS blob NATURAL JOIN objects33c3 WHERE time>=1263627787 AND
time<1273458187 ORDER BY time ASC;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=75426.47..75552.21 rows=50295 width=28) (actual time=2083.624..2146.654 rows=50472 loops=1)
   Sort Key: blob."time"
   ->  Hash Join  (cost=16411.51..70749.84 rows=50295 width=28) (actual time=1790.746..1910.204 rows=50472 loops=1)
         Hash Cond: ("outer".sva = "inner".sva)
         ->  Bitmap Heap Scan on blobs33c3_c16010 blob  (cost=533.77..14421.19 rows=50295 width=20) (actual time=12.533..78.284 rows=50472 loops=1)
               Recheck Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
               ->  Bitmap Index Scan on idx_blobs33c3_c16010_time  (cost=0.00..533.77 rows=50295 width=0) (actual time=12.447..12.447 rows=50472 loops=1)
                     Index Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
         ->  Hash  (cost=12039.79..12039.79 rows=487579 width=16) (actual time=1618.128..1618.128 rows=487579 loops=1)
               ->  Seq Scan on objects33c3  (cost=0.00..12039.79 rows=487579 width=16) (actual time=0.019..833.931 rows=487579 loops=1)
 Total runtime: 2194.492 ms
(11 rows)

itvtrackdata=> set enable_hashjoin to false;
SET
itvtrackdata=> explain analyze SELECT mva,blob.* FROM blobs33c3_c16010
AS blob NATURAL JOIN objects33c3 WHERE time>=1263627787 AND
time<1273458187 ORDER BY time ASC;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=321096.91..321222.64 rows=50295 width=28) (actual time=929.893..992.316 rows=50472 loops=1)
   Sort Key: blob."time"
   ->  Nested Loop  (cost=533.77..316420.28 rows=50295 width=28) (actual time=16.780..719.140 rows=50472 loops=1)
         ->  Bitmap Heap Scan on blobs33c3_c16010 blob  (cost=533.77..14421.19 rows=50295 width=20) (actual time=16.693..84.299 rows=50472 loops=1)
               Recheck Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
               ->  Bitmap Index Scan on idx_blobs33c3_c16010_time  (cost=0.00..533.77 rows=50295 width=0) (actual time=16.546..16.546 rows=50472 loops=1)
                     Index Cond: (("time" >= 1263627787) AND ("time" < 1273458187))
         ->  Index Scan using idx_objects33c3_sva on objects33c3  (cost=0.00..5.99 rows=1 width=16) (actual time=0.006..0.008 rows=1 loops=50472)
               Index Cond: ("outer".sva = objects33c3.sva)
 Total runtime: 1039.725 ms
(10 rows)

--
Ian Westmacott [EMAIL PROTECTED]
IntelliVid Corp.
Re: [PERFORM] merge>hash>loop
Ian Westmacott [EMAIL PROTECTED] writes:
> That's what I feared, thanks. But then if I simplify things a bit,
> such that the row counts are quite good, and PG still chooses a worse
> plan, can I conclude anything about my configuration settings (like
> random_page_cost)?

Well, the other thing that's going on here is that we know we are
overestimating the cost of nestloop-with-inner-indexscan plans. The
current estimation for that is basically "outer scan cost plus N times
inner scan cost" where N is the estimated number of outer tuples; in
other words the repeated indexscan probes are each assumed to happen
from a cold start. In reality, caching of the upper levels of the index
means that the later index probes are much cheaper than this model
thinks. We've known about this for some time but no one's yet proposed
a more reasonable cost model.

In my mind this is tied into another issue, which is that the planner
always costs on the basis of each query starting from zero. In a real
environment it's much cheaper to use heavily-used indexes than this
cost model suggests, because they'll already be swapped in due to use
by previous queries. But we haven't got any infrastructure to keep
track of what's been heavily used, let alone a cost model that could
make use of the info.

I think part of the reason that people commonly reduce random_page_cost
to values much lower than physical reality would suggest is that it
provides a crude way of partially compensating for this basic problem.

			regards, tom lane