Re: [HACKERS] Hash Join cost estimates
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost writes:
> > I'm certainly curious about those, but I'm also very interested in the
> > possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> > depending on the work_mem setting.
>
> Not sure about that.  That would make the hash-bucket-header array
> larger without changing the size of the rest of the hash table, thus
> probably making the CPU cache situation worse not better (which would
> manifest as more time at the first of these two statements relative to
> the second).

In the testing that I've done, I've yet to find a case where a smaller
NTUP_PER_BUCKET makes things worse, and it can be a dramatic improvement.

Regarding the memory situation, I'm not sure that using buckets really
helps, though I'm no CPU architect.  Assuming the CPU only pulls in the
bits of memory it needs, I don't see how having to pull in parts of a
large array of pointers is worse than having to fetch randomly placed
entries in a bucket, especially since we have to step through an entire
bucket each time and the bucket tuples are allocated independently,
while the hash array is allocated once.  Indeed, it seems you'd be more
likely to get other things you need in a given pull into cache with the
hash table array than with the bucket tuple entries, which might well
include unrelated garbage.

> Can you add some instrumentation code to get us information about the
> average/max length of the bucket chains?  And maybe try to figure out
> how many distinct hash values per bucket, which would give us a clue
> whether your two-level-list idea is worth anything.

I ended up implementing the two-level system and doing some testing with
it.  It makes hash table building take quite a bit longer and only
improves the scan performance in very select cases.  The improvement
requires an individual bucket to have both lots of dups and a lot of
distinct values, because it only helps when it can skip a significant
number of tuples.  If we move to a system where there's rarely more than
one distinct value in a bucket, then the chances of a serious
improvement from the two-level system go down that much more.

As such, I've come up with this (trivial) patch which simply modifies
ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially
treating it as 1) when work_mem is large enough to fit the entire hash
table (which also implies that there is only one batch).  I'd love to
hear feedback from others on what this does under different conditions.
This also makes the "hash-the-small-table" case faster for the test case
which I provided earlier (http://snowman.net/~sfrost/test_case2.sql),
and it uses quite a bit less memory too.

Now that I've convinced myself that the two-level system isn't practical
and the "hash-the-small-table" case is actually faster than
"hash-the-big-table", I'll start looking at improving on your proposal
to change the costing to favor the smaller table being hashed more
often.

	Thanks,

		Stephen

colordiff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
new file mode 100644
index 6a2f236..a8a8168
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
*** ExecChooseHashTableSize(double ntuples,
*** 489,496 ****
  		/* We expect the hashtable to fit in memory */
  		double		dbuckets;
  
! 		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
! 		dbuckets = Min(dbuckets, max_pointers);
  		nbuckets = (int) dbuckets;
  
  		nbatch = 1;
--- 489,495 ----
  		/* We expect the hashtable to fit in memory */
  		double		dbuckets;
  
! 		dbuckets = Min(ntuples, max_pointers);
  		nbuckets = (int) dbuckets;
  
  		nbatch = 1;
Re: [HACKERS] Hash Join cost estimates
* Matthias (nitrogen...@gmail.com) wrote:
> >In this example, hashing the large table is actually 2 seconds *faster*
> >than hashing the small table (again, all on my laptop).
>
> Are you running the laptop on battery? When I've benchmarked pgsql
> last time I used my laptop as well and it only occurred to me after a
> lot of trying that laptops (even with all energy saving disabled in
> my case) don't always make for reliable benchmark machines. Things
> like your CPU clockspeed being dynamically adjusted can produce
> really strange results.

Those runs were with the laptop plugged in, but I've also run it without
being plugged in, and while the performance is certainly different
between those two cases, the relative speed of hashing vs. hash-lookup
has been consistent.

Also, that's why I provided the test case- feel free (and please do!) to
test it on any/all hardware you can find.  I'd love to hear reports from
others on their experiences.

The relative speeds on my laptop runs also matched the performance on
the big server where this is all originating (the laptop was slower, but
slower in both paths in a comparable way).

> Of course your test case might not be affected by this at all, but
> it's something to watch out for.

Certainly.

	Thanks,

		Stephen
Re: [HACKERS] Hash Join cost estimates
> In this example, hashing the large table is actually 2 seconds *faster*
> than hashing the small table (again, all on my laptop).

Are you running the laptop on battery? When I've benchmarked pgsql last
time I used my laptop as well, and it only occurred to me after a lot of
trying that laptops (even with all energy saving disabled in my case)
don't always make for reliable benchmark machines.  Things like your CPU
clockspeed being dynamically adjusted can produce really strange
results.

Also, when I was running on battery the performance numbers could not be
compared in any way to when I was running with the laptop connected
straight to a socket.  Things like the IO/CPU ratio were completely
different.  And the numbers on the final testing servers were different
again.

Of course your test case might not be affected by this at all, but it's
something to watch out for.

 -Matthias
Re: [HACKERS] Hash Join cost estimates
On Thu, Apr 04, 2013 at 04:16:12PM -0400, Stephen Frost wrote:
> * Stephen Frost (sfr...@snowman.net) wrote:
> > It does look like reducing bucket depth, as I outlined before through
> > the use of a 2-level hashing system, might help speed up
> > ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> > entries to consider instead of more.  Along those same lines, I really
> > wonder if we're being too generous wrt the bucket-depth goal of '10'
> > instead of, say, '1', especially when we've got plenty of work_mem
> > available.
>
> Rerunning using a minimally configured build (only --enable-openssl
> and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
> results in a couple of interesting things-
>
> First, the planner actually picks the plan to hash the small table and
> seqscan the big one.  That also, finally, turns out to be *faster* for
> this test case.
>
> ...
>
> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.  It's only used in
> ExecChooseHashTableSize, so while making it variable or depending on
> work_mem could slow planning down a bit, it's not a per-tuple cost item.
>

+1 for adjusting this based on work_mem value.

Ken
Re: [HACKERS] Hash Join cost estimates
Stephen Frost writes:
> Looking with opannotate, there's two main hotspots in
> ExecScanHashBucket:
> 12846    17.4001 :                hashTuple =
> hashtable->buckets[hjstate->hj_CurBucketNo];
> and
> 22100    29.9348 :                hashTuple = hashTuple->next;

Those are, of course, pretty trivial statements; so the way I read this
is that the fundamental slowdown comes from the hash table being large
compared to the CPU's cache, so that you're incurring lots of cache
misses at these particular fetches.  (You might be able to confirm that
if you can set oprofile to count cache misses rather than wall clock
time.)

> I'm certainly curious about those, but I'm also very interested in the
> possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
> depending on the work_mem setting.

Not sure about that.  That would make the hash-bucket-header array
larger without changing the size of the rest of the hash table, thus
probably making the CPU cache situation worse not better (which would
manifest as more time at the first of these two statements relative to
the second).

Can you add some instrumentation code to get us information about the
average/max length of the bucket chains?  And maybe try to figure out
how many distinct hash values per bucket, which would give us a clue
whether your two-level-list idea is worth anything.

			regards, tom lane
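For anyone wanting to gather the numbers being asked for, a minimal
sketch of that kind of instrumentation might look roughly like the
function below.  It only relies on the bucket array and per-tuple next
pointer already visible in the profiles above; the function itself, its
name, and the simplification of ignoring skew buckets and batching are
assumptions for illustration, not anything in the tree:

	/*
	 * Hypothetical debugging helper (would live in nodeHash.c): walk
	 * every bucket chain and report average and maximum chain length.
	 */
	static void
	ExecHashPrintChainStats(HashJoinTable hashtable)
	{
		long		total = 0;
		int			maxlen = 0;
		int			used = 0;
		int			i;

		for (i = 0; i < hashtable->nbuckets; i++)
		{
			HashJoinTuple hashTuple = hashtable->buckets[i];
			int			len = 0;

			while (hashTuple != NULL)
			{
				len++;
				hashTuple = hashTuple->next;
			}

			if (len > 0)
				used++;
			if (len > maxlen)
				maxlen = len;
			total += len;
		}

		elog(LOG, "%d buckets, %d used, avg chain %.2f, max chain %d",
			 hashtable->nbuckets, used,
			 used > 0 ? (double) total / used : 0.0, maxlen);
	}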
Re: [HACKERS] Hash Join cost estimates
* Stephen Frost (sfr...@snowman.net) wrote:
> It does look like reducing bucket depth, as I outlined before through
> the use of a 2-level hashing system, might help speed up
> ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
> entries to consider instead of more.  Along those same lines, I really
> wonder if we're being too generous wrt the bucket-depth goal of '10'
> instead of, say, '1', especially when we've got plenty of work_mem
> available.

Rerunning using a minimally configured build (only --enable-openssl and
--enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
results in a couple of interesting things-

First, the planner actually picks the plan to hash the small table and
seqscan the big one.  That also, finally, turns out to be *faster* for
this test case.

explain analyze results here:
Hash small table / seqscan big table: http://explain.depesz.com/s/nP1
Hash big table / seqscan small table: http://explain.depesz.com/s/AUv

Here's the oprofile reports:

Hash small table / seqscan big table:
samples  cum. samples  %        cum. %    linenr info           image name    symbol name
39023    39023         52.8574  52.8574   nodeHash.c:915        postgres      ExecScanHashBucket
3743     42766          5.0700  57.9273   xact.c:682            postgres      TransactionIdIsCurrentTransactionId
3110     45876          4.2126  62.1399   nodeHashjoin.c:63     postgres      ExecHashJoin
2561     48437          3.4689  65.6088   heapam.c:711          postgres      heapgettup_pagemode
2427     50864          3.2874  68.8962   heapam.c:300          postgres      heapgetpage
2395     53259          3.2441  72.1403   heaptuple.c:1028      postgres      slot_deform_tuple
2395     55654          3.2441  75.3843   heaptuple.c:1135      postgres      slot_getattr
2383     58037          3.2278  78.6122   nodeHash.c:786        postgres      ExecHashGetHashValue
1811     59848          2.4530  81.0652   tqual.c:1044          postgres      HeapTupleSatisfiesMVCC
1796     61644          2.4327  83.4979   execScan.c:111        postgres      ExecScan
1298     62942          1.7582  85.2561   hashfunc.c:517        postgres      hash_uint32
1274     64216          1.7257  86.9817   execProcnode.c:356    postgres      ExecProcNode
1011     65227          1.3694  88.3511   heapam.c:1453         postgres      heap_getnext
905      66132          1.2258  89.5770   execTuples.c:333      postgres      ExecStoreTuple
858      66990          1.1622  90.7392   fmgr.c:1291           postgres      FunctionCall1Coll
835      67825          1.1310  91.8702   execQual.c:668        postgres      ExecEvalScalarVarFast
834      68659          1.1297  92.9999   mcxt.c:126            postgres      MemoryContextReset
818      69477          1.1080  94.1078   nodeSeqscan.c:48      postgres      SeqNext

Hash big table / seqscan small table:
samples  cum. samples  %        cum. %    linenr info           image name    symbol name
38612    38612         41.2901  41.2901   nodeHash.c:709        postgres      ExecHashTableInsert
7435     46047          7.9507  49.2408   (no location information)  no-vmlinux  /no-vmlinux
4900     50947          5.2399  54.4806   aset.c:563            postgres      AllocSetAlloc
3803     54750          4.0668  58.5474   xact.c:682            postgres      TransactionIdIsCurrentTransactionId
3335     58085          3.5663  62.1137   heapam.c:711          postgres      heapgettup_pagemode
2532     60617          2.7076  64.8213   nodeHash.c:786        postgres      ExecHashGetHashValue
2523     63140          2.6980  67.5193   memcpy-ssse3-back.S:60  libc-2.15.so  __memcpy_ssse3_back
2518     65658          2.6926  70.2119   heaptuple.c:1028      postgres      slot_deform_tuple
2378     68036          2.5429  72.7549   heapam.c:300          postgres      heapgetpage
2374     70410          2.5387  75.2935   heaptuple.c:1135      postgres      slot_getattr
1852     72262          1.9805  77.2740   nodeHash.c:915        postgres      ExecScanHashBucket
1831     74093          1.9580  79.2320   tqual.c:1044          postgres      HeapTupleSatisfiesMVCC
1732     75825          1.8521  81.0841   heapam.c:1453         postgres      heap_getnext
1320     77145          1.4116  82.4957   nodeHash.c:76         postgres      MultiExecHash
1219     78364          1.3
Re: [HACKERS] Hash Join cost estimates
* Stephen Frost (sfr...@snowman.net) wrote:
> 85041    252415        24.3624  72.3115   mcount.c:60    libc-2.15.so   __mcount_internal
> 28370    280785         8.1274  80.4389   _mcount.S:33   libc-2.15.so   mcount
[...]

And as a side-note, I'm rebuilding w/o profiling, asserts, etc and will
run it again, though I don't really expect the top-hitters to change.

	Thanks,

		Stephen
Re: [HACKERS] Hash Join cost estimates
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> perf or oprofile reveal anything?

Here's what we get from oprofile (perhaps not too surprising):

Hash the small table / scan the big table:
samples  cum. samples  %        cum. %    linenr info           image name     symbol name
167374   167374        47.9491  47.9491   nodeHash.c:915        postgres       ExecScanHashBucket
85041    252415        24.3624  72.3115   mcount.c:60           libc-2.15.so   __mcount_internal
28370    280785         8.1274  80.4389   _mcount.S:33          libc-2.15.so   mcount
15856    296641         4.5424  84.9814   (no location information)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff)
6291     302932         1.8022  86.7836   xact.c:682            postgres       TransactionIdIsCurrentTransactionId
4555     307487         1.3049  88.0885   instrument.c:70       postgres       InstrStopNode
3849     311336         1.1027  89.1912   heapam.c:711          postgres       heapgettup_pagemode
3567     314903         1.0219  90.2130   nodeHashjoin.c:63     postgres       ExecHashJoin

Hash the big table / scan the small table:
samples  cum. samples  %        cum. %    linenr info           image name     symbol name
112060   112060        39.2123  39.2123   mcount.c:60           libc-2.15.so   __mcount_internal
36547    148607        12.7886  52.0009   nodeHash.c:709        postgres       ExecHashTableInsert
33570    182177        11.7469  63.7477   _mcount.S:33          libc-2.15.so   mcount
16383    198560         5.7328  69.4805   (no location information)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff)
13200    211760         4.6190  74.0995   (no location information)  no-vmlinux     /no-vmlinux
6345     218105         2.2203  76.3197   xact.c:682            postgres       TransactionIdIsCurrentTransactionId
5250     223355         1.8371  78.1568   nodeHash.c:915        postgres       ExecScanHashBucket
4797     228152         1.6786  79.8354   heapam.c:711          postgres       heapgettup_pagemode
4661     232813         1.6310  81.4664   aset.c:563            postgres       AllocSetAlloc
4588     237401         1.6054  83.0718   instrument.c:70       postgres       InstrStopNode
3550     240951         1.2422  84.3140   memcpy-ssse3-back.S:60  libc-2.15.so  __memcpy_ssse3_back
3013     243964         1.0543  85.3684   aset.c:1109           postgres       AllocSetCheck

Looking at the 'Hash the small table / scan the big table' case,
opannotate claims that this is by far the worst offender:

147588   42.2808 :                hashTuple = hashTuple->next;

While most of the time in the 'Hash the big table / scan the small
table' case is in:

34572    12.0975 :        hashTuple->next = hashtable->buckets[bucketno];

Neither of those strike me as terribly informative though.  To be
honest, I've not really played w/ oprofile all that much.  Now that I've
got things set up to support this, I'd be happy to provide more info if
anyone has suggestions on how to get something more useful.

It does look like reducing bucket depth, as I outlined before through
the use of a 2-level hashing system, might help speed up
ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
entries to consider instead of more.  Along those same lines, I really
wonder if we're being too generous wrt the bucket-depth goal of '10'
instead of, say, '1', especially when we've got plenty of work_mem
available.

	Thanks,

		Stephen
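For anyone not looking at nodeHash.c directly, the loop those two hot
statements live in is, very roughly, the following.  This is a
simplified paraphrase of the probe-side bucket scan, not verbatim code,
and the hashvalue field name is an assumption:

	/*
	 * Paraphrased sketch of ExecScanHashBucket's inner loop: walk the
	 * bucket's chain, skip entries whose stored hash code differs, and
	 * only evaluate the join quals on entries whose hash code matches.
	 */
	hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
	while (hashTuple != NULL)
	{
		if (hashTuple->hashvalue == hashvalue)
		{
			/* hash codes match: only now are the join quals evaluated */
		}
		hashTuple = hashTuple->next;	/* the pointer chase seen above */
	}

With NTUP_PER_BUCKET at 10, each probe is mostly just chasing next
pointers through cache-cold memory, which is consistent with the profile
above.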
Re: [HACKERS] Hash Join cost estimates
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> > What I'm trying to get at in this overall email is: why in the world is
> > it so expensive to do hash lookups?
>
> perf or oprofile reveal anything?

Working on a test case actually- I've got one now:

http://snowman.net/~sfrost/test_case2.sql

In this example, hashing the large table is actually 2 seconds *faster*
than hashing the small table (again, all on my laptop).

> Also, I assume that the cases you are looking at are large enough that
> even the "small" table doesn't fit in a single hash batch?

No, quite the opposite, sorry for not mentioning that before.  Either
side fits completely into memory w/ a single batch.  The explain
analyze's that I posted before show that, either way, there's only one
batch involved.

> (You never did mention what work_mem setting you're testing, anyway.)

With the test case above (where I got a 2s faster run time by hashing
the big table) I used a work_mem of 1GB.

	Thanks!

		Stephen
Re: [HACKERS] Hash Join cost estimates
Stephen Frost writes:
> I've been fiddling with this on the very much larger overall database
> where this test case came from and have found that hashing the large
> table can actually be *faster* and appears to cause a more consistent
> and constant amount of disk i/o (which is good).

Interesting.

> What I'm trying to get at in this overall email is: why in the world is
> it so expensive to do hash lookups?

perf or oprofile reveal anything?

Also, I assume that the cases you are looking at are large enough that
even the "small" table doesn't fit in a single hash batch?  It could
well be that the answer has to do with some bogus or at least
unintuitive behavior of the batching process, and it isn't really at all
a matter of individual hash lookups being slow.  (You never did mention
what work_mem setting you're testing, anyway.)

			regards, tom lane
Re: [HACKERS] Hash Join cost estimates
Tom, all,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> So that's at least going in the right direction.

I agree that this is going in the right direction; it certainly would
make the plan that I *expect* to be chosen more likely.  However, I've
been fiddling with this on the very much larger overall database where
this test case came from and have found that hashing the large table can
actually be *faster* and appears to cause a more consistent and constant
amount of disk i/o (which is good).

The test case exhibits a bit of why this is the case- the per-tuple cost
of a hash lookup is way closer to the per-tuple cost of building the
hash table than I'd expect:

per-tuple cost to build the hash table (41M tuples):    0.33us
per-tuple cost to scan/do hash lookups (41M tuples):    0.29us
  (with a small hash table of only 41K entries)

The total difference being: 1233.854 vs. 1428.424, or only 194.570ms in
favor of scanning the big table instead of hashing it.  These numbers
are just from those posted with my original email:

http://explain.depesz.com/s/FEq
http://explain.depesz.com/s/FOU

I've seen much worse though- I have one case where hash-the-big-table
took 5s and hash-the-small-table took almost 10s (total times).  I'm
trying to see if I can pull that out and isolate how it's different (and
see if it was just due to other load on the box).

What I'm trying to get at in this overall email is: why in the world is
it so expensive to do hash lookups?  I would have expected the cost of
building the hash table to be *much* more than the cost to do a hash
lookup, and that doing hash lookups against a small hash table would be
fast enough to put serious pressure on the i/o subsystem.  Instead,
building the hash table actually puts more pressure and can end up being
more efficient overall.

We have a process that basically does this a whole bunch, and the
"hash-the-big-table" operation takes about 4.7 hrs, while the
"hash-the-small-table" approach went well past 5 hours and was only
about 70% complete.

Thoughts?

	Thanks,

		Stephen
Re: [HACKERS] Hash Join cost estimates
I wrote:
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway.  The multiple
> would represent the rate of hash-code collisions we expect.

I tried the attached quick-hack patch on Stephen's example.  With
work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2574kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
 Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using (id_short);
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 9412kB
         ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
 Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.

I have not thought about how the calculation should be adjusted in the
semi/anti join case, nor about how we ought to repurpose the
bucket-size-variance calculations for checking whether work_mem will be
exceeded.  So this is a long way from being committable, but it seems
promising.

			regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8d24902..48b21cb 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2661,2685 ****
  	else
  	{
  		/*
- 		 * The number of tuple comparisons needed is the number of outer
- 		 * tuples times the typical number of tuples in a hash bucket, which
- 		 * is the inner relation size times its bucketsize fraction.  At each
- 		 * one, we need to evaluate the hashjoin quals.  But actually,
- 		 * charging the full qual eval cost at each tuple is pessimistic,
- 		 * since we don't evaluate the quals unless the hash values match
- 		 * exactly.  For lack of a better idea, halve the cost estimate to
- 		 * allow for that.
- 		 */
- 		startup_cost += hash_qual_cost.startup;
- 		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
- 			clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
- 
- 		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
  	}
  
  	/*
--- 2661,2692 ----
  	else
  	{
  		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
+ 
+ 		/*
+ 		 * We assume that the hash equality operators will be applied to twice
+ 		 * as many join pairs as actually pass the hashquals, that is there's
+ 		 * about one hash-value collision per tuple.  This is probably a
+ 		 * substantial overestimate, but it seems wise to be conservative.
+ 		 * Lacking any good model of the effectiveness of the hash functions,
+ 		 * it's hard to do much better than a constant ratio.  (XXX perhaps we
+ 		 * ought to charge more than this for very large relations, since the
+ 		 * hash values are only 32 bits wide and hence will suffer
+ 		 * proportionally more collisions when there are many rows.)
+ 		 *
+ 		 * Note that we don't charge anything for v
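(The attachment is cut off above.  Going by the new comment, the charge
that replaces the removed startup_cost/run_cost lines presumably boils
down to something like the two lines below; this is only a hedged
reconstruction of the missing tail, not the actual patch text:)

	startup_cost += hash_qual_cost.startup;
	run_cost += hash_qual_cost.per_tuple * hashjointuples * 2.0;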
Re: [HACKERS] Hash Join cost estimates
On Mon, Apr 1, 2013 at 2:35 AM, Jeff Davis wrote:
> On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
>> Really, when we're traipsing down a bucket
>> list, skipping over bucket entries with the wrong hash code is just
>> about free, or at least it's a whole lot cheaper than applying ExecQual.
>>
>> Perhaps what we should do is charge the hash_qual_cost only for some
>> small multiple of the number of tuples that we expect will *pass* the
>> hash quals, which is a number we have to compute anyway.  The multiple
>> would represent the rate of hash-code collisions we expect.
>
> +1.
>
>> I'd still be inclined to charge something per bucket entry, but it
>> should be really small, perhaps on the order of 0.01 times
>> cpu_operator_cost.
>
>> Or we could just drop that term entirely.
>
> FWIW, either of those are fine with me based on my limited experience.

FWIW, I have also seen this problem and the proposed fixes sound
reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Join cost estimates
On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
> Really, when we're traipsing down a bucket
> list, skipping over bucket entries with the wrong hash code is just
> about free, or at least it's a whole lot cheaper than applying ExecQual.
>
> Perhaps what we should do is charge the hash_qual_cost only for some
> small multiple of the number of tuples that we expect will *pass* the
> hash quals, which is a number we have to compute anyway.  The multiple
> would represent the rate of hash-code collisions we expect.

+1.

> I'd still be inclined to charge something per bucket entry, but it
> should be really small, perhaps on the order of 0.01 times
> cpu_operator_cost.

> Or we could just drop that term entirely.

FWIW, either of those are fine with me based on my limited experience.

> Maybe what we should be doing with the bucketsize numbers is estimating
> peak memory consumption to gate whether we'll accept the plan at all,
> rather than adding terms to the cost estimate.

Sounds reasonable.  Ideally, we'd have a way to continue executing even
in that case; but that's a project by itself, and would make it even
more difficult to cost accurately.

Regards,
	Jeff Davis
Re: [HACKERS] Hash Join cost estimates
Stephen Frost writes:
> * Jeff Davis (pg...@j-davis.com) wrote:
>> In Stephen's case the table was only 41KB, so something still seems off.
>> Maybe we should model the likelihood of a collision based on the
>> cardinalities (assuming a reasonably good hash function)?

> It's not really 'hash collisions' that we're trying to be wary of, per
> se, it's the risk of duplicates.

I spent some time looking at this.  I think the real issue is that the
code is trying to charge something close to cpu_operator_cost for each
time an outer tuple is compared to a hashtable entry.  That was actually
reasonable back when the costing code was last touched --- the current
scheme of storing and comparing the hash codes before testing the
equality operator proper only dates to commit
849074f9ae422c64501bb1d53ef840de870bf65c.  I punted on changing the cost
estimates at that time, and I think what this example is showing is that
that was a bad decision.  Really, when we're traipsing down a bucket
list, skipping over bucket entries with the wrong hash code is just
about free, or at least it's a whole lot cheaper than applying ExecQual.

Perhaps what we should do is charge the hash_qual_cost only for some
small multiple of the number of tuples that we expect will *pass* the
hash quals, which is a number we have to compute anyway.  The multiple
would represent the rate of hash-code collisions we expect.

I'd still be inclined to charge something per bucket entry, but it
should be really small, perhaps on the order of 0.01 times
cpu_operator_cost.  Or we could just drop that term entirely.

It strikes me that the reason to be worried about skewed distribution in
the inner relation is not really that it changes the run time much, but
rather that it risks blowing out work_mem if specific virtual buckets
have too many members (or at best, we're forced to increase the number
of batches more than we thought to stay under work_mem; which carries
runtime costs of its own).  Maybe what we should be doing with the
bucketsize numbers is estimating peak memory consumption to gate whether
we'll accept the plan at all, rather than adding terms to the cost
estimate.

			regards, tom lane
Re: [HACKERS] Hash Join cost estimates
* Jeff Davis (pg...@j-davis.com) wrote:
> In Stephen's case the table was only 41KB, so something still seems off.
> Maybe we should model the likelihood of a collision based on the
> cardinalities (assuming a reasonably good hash function)?

It's not really 'hash collisions' that we're trying to be wary of, per
se, it's the risk of duplicates.

To push this very far in the other direction- if you have 41K of the
*same value* in the small table, then it's currently faster to build the
hash table on the large table and then seq scan the small table (10s vs.
14s on my laptop running w/o being plugged in, so it's a bit slower).
Now, that's a pretty ridiculous case, but it seems like we're being
pretty dumb here- for every input row from the outer table, we're
looking through all 41K *duplicate* keys in that one hash bucket.

This goes back to the suggestion I just made- if the hash bucket list
contained only unique values (which are a result of actual hash
collisions), then we'd only be doing *one* comparison for each input row
of the outer table that doesn't match- and when we found one which
*did*, we would only need to step through the dup list for that one
entry and blast back all of those rows, forgetting about the rest of the
bucket which we know can't match.

> Also, I think I found an important assumption that seems dubious (in
> comment for estimate_hash_bucketsize()):

I've wondered about that also.  It certainly seems quite bogus that we
can end up with an 'estimated # of entries in a bucket' that's larger
than the # of entries we've found for the MCV in the table, especially
*double* that.

> Stephen, do you think this could explain your problem?

As I tried to articulate in my initial email- even if we had a *perfect*
answer to "how many comparisons will we need to do", the current costing
would cause us to pick the plan that, intuitively and empirically, takes
longer (hash the 41M row table), because that cost is multiplied by the
number of outer table rows and the cpu_tuple_cost (charged to build the
hash table) isn't high enough relative to the cpu_op_cost (charged to do
the comparisons in the buckets).

	Thanks,

		Stephen
Re: [HACKERS] Hash Join cost estimates
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I think the point is that there may *not* be few hash collisions ...

Right, but that's actually all entirely based on concerns over there
being duplicates (hence why we consider the MCVs and ndistinct), which
makes *some* sense given that we currently have a single linked-list in
each bucket into which any dups are placed.

It occurs to me that we could get away from this by having a 2-level
system.  We hash to a bucket which contains a linked list of linked
lists.  The bucket list would be for actual collisions (which we don't
consider at all in the current bucket estimating) and each of those
entries would be a linked list of duplicates for that entry in the
bucket.

This would add some small cost for building the hash, since we would
have to step through each entry in the bucket to determine if the entry
being added is a new entry or not, but not very much unless we're
worried about lots of collisions happening (which I certainly hope isn't
the case).  Hash tables are generally expected to take more effort to
build initially anyway, so I don't see a problem putting more logic
there.  Also, we could skip checking each entry in the bucket when the
input is known to be unique and instead just skip to the end of the
bucket, since the new entry can't match any existing one.

We could still work through the bucketing logic and add some costing to
that case for those situations where we are hashing on only one key of a
multi-key join and we expect a lot of duplicates to exist.  I'm not sure
how much that happens though- I would hope that we would use a composite
hash key most of the time that we have multi-key joins that use hash
tables.

Thoughts?

	Thanks,

		Stephen
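To make the idea concrete, the bucket layout being described would look
something like the sketch below.  All of the names here are invented for
illustration (only HashJoinTuple is an existing type), and this is a
sketch of the data structure, not a worked-out patch:

	/* Hypothetical two-level bucket entry: one per distinct hash code. */
	typedef struct HashBucketEntry
	{
		uint32		hashvalue;		/* hash code shared by all dups below */
		HashJoinTuple firstTuple;	/* chain of duplicates with this code */
		struct HashBucketEntry *next;	/* next distinct code (true collision) */
	} HashBucketEntry;

Each hash bucket then points to a short list of HashBucketEntry rather
than a flat list of tuples: a probe walks the entry list once, and on a
hashvalue match it evaluates the join quals and, if they pass, returns
the entire duplicate chain without ever visiting the rest of the bucket.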
Re: [HACKERS] Hash Join cost estimates
Jeff,

* Jeff Davis (pg...@j-davis.com) wrote:
> On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> > 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> > 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66
>
> I think those are backwards -- typo?

Yes, sorry, those are backwards.  The '4M hashed, seqscan 41K' entry
comes out with the lower cost and that's what we end up using.

	Thanks,

		Stephen
Re: [HACKERS] Hash Join cost estimates
On Fri, 2013-03-29 at 16:37 -0400, Tom Lane wrote:
> Jeff Davis writes:
> > Yes, I have run into this issue (or something very similar). I don't
> > understand why the bucketsize even matters much -- assuming few hash
> > collisions, we are not actually evaluating the quals any more times than
> > necessary. So why all of the hashjoin-specific logic in determining the
> > number of qual evaluations? The only reason I can think of is to model
> > the cost of comparing the hashes themselves.
>
> I think the point is that there may *not* be few hash collisions ...

In Stephen's case the table was only 41KB, so something still seems off.
Maybe we should model the likelihood of a collision based on the
cardinalities (assuming a reasonably good hash function)?

Also, I think I found an important assumption that seems dubious (in
comment for estimate_hash_bucketsize()):

"If the other relation in the join has a key distribution similar to
this one's, then the most-loaded buckets are exactly those that will be
probed most often.  Therefore, the "average" bucket size for costing
purposes should really be taken as something close to the "worst case"
bucket size.  We try to estimate this by adjusting the fraction if there
are too few distinct data values, and then scaling up by the ratio of
the most common value's frequency to the average frequency."

But the key distribution is not necessarily similar at all... the large
table might have many more distinct values.

Stephen, do you think this could explain your problem?

Regards,
	Jeff Davis
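To put a rough number on the "model the likelihood of a collision" idea:
assuming an ideal 32-bit hash and a single batch, the expected number of
inner tuples that share an outer tuple's hash code by accident (rather
than by being a genuine key match) is about inner_rows / 2^32 per probe.
A back-of-the-envelope helper, purely illustrative and not anything in
the tree, would be:

	/*
	 * Expected spurious hash-code matches per outer probe, assuming a
	 * uniform 32-bit hash and one batch: inner_rows / 2^32.  For a
	 * 41K-row inner table that is roughly 1e-5, i.e. essentially free.
	 */
	static double
	expected_hash_collisions_per_probe(double inner_rows)
	{
		return inner_rows / 4294967296.0;
	}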
Re: [HACKERS] Hash Join cost estimates
Jeff Davis writes:
> Yes, I have run into this issue (or something very similar). I don't
> understand why the bucketsize even matters much -- assuming few hash
> collisions, we are not actually evaluating the quals any more times than
> necessary. So why all of the hashjoin-specific logic in determining the
> number of qual evaluations? The only reason I can think of is to model
> the cost of comparing the hashes themselves.

I think the point is that there may *not* be few hash collisions ...

			regards, tom lane
Re: [HACKERS] Hash Join cost estimates
On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
> 41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
> 4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

I think those are backwards -- typo?

> In the end, I think the problem here is that we are charging far too
> much for these bucket costs (particularly when we're getting them so
> far wrong) and not nearly enough for the cost of building the hash
> table in the first place.
>
> Thoughts?  Ideas about how we can 'fix' this?  Have others run into
> similar issues?

Yes, I have run into this issue (or something very similar). I don't
understand why the bucketsize even matters much -- assuming few hash
collisions, we are not actually evaluating the quals any more times than
necessary. So why all of the hashjoin-specific logic in determining the
number of qual evaluations? The only reason I can think of is to model
the cost of comparing the hashes themselves.

Also, searching the archives turns up at least one other such report,
but I think I've seen more:

http://www.postgresql.org/message-id/a82128a6-4e3b-43bd-858d-21b129f7b...@richrelevance.com

Regards,
	Jeff Davis