Re: [HACKERS] Hash Join cost estimates

2013-04-12 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Stephen Frost sfr...@snowman.net writes:
  I'm certainly curious about those, but I'm also very interested in the
  possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
  depending on the work_mem setting.
 
 Not sure about that.  That would make the hash-bucket-header array
 larger without changing the size of the rest of the hash table, thus
 probably making the CPU cache situation worse not better (which would
 manifest as more time at the first of these two statements relative to
 the second).

In the testing that I've done, I've yet to find a case where having a
smaller NTUP_PER_BUCKET makes things worse and it can have a dramatic
improvement.  Regarding the memory situation, I'm not sure that using
buckets really helps, though I'm no CPU architect.  However, assuming
that the CPU is smart enough to pull in only the memory it needs, I'm
not sure how having to pull in parts of a large array of pointers is
worse than having to fetch randomly placed entries in a bucket.  We
have to step through an entire bucket each time, and the bucket tuples
are allocated independently, while the hash array is allocated once.
Indeed, it seems you'd be more likely to pull other things you need
into cache from the hash table array than from the bucket tuple
entries, whose neighboring memory may well contain unrelated garbage.

 Can you add some instrumentation code to get us information about the
 average/max length of the bucket chains?  And maybe try to figure out
 how many distinct hash values per bucket, which would give us a clue
 whether your two-level-list idea is worth anything.
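A minimal sketch of such chain-length instrumentation, using toy types
as stand-ins for the real HashJoinTable/HashJoinTuple structures (the
actual field names in src/include/executor/hashjoin.h differ):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the executor's hash table structures. */
typedef struct ToyTuple
{
    unsigned int     hashvalue;  /* tuple's hash code */
    struct ToyTuple *next;       /* next entry in this bucket's chain */
} ToyTuple;

typedef struct ToyHashTable
{
    int        nbuckets;
    ToyTuple **buckets;          /* array of bucket chain heads */
} ToyHashTable;

/* Walk every bucket chain once and report average and maximum length. */
static void
chain_stats(const ToyHashTable *ht, double *avg, int *maxlen)
{
    long total = 0;

    *maxlen = 0;
    for (int i = 0; i < ht->nbuckets; i++)
    {
        int len = 0;

        for (const ToyTuple *t = ht->buckets[i]; t != NULL; t = t->next)
            len++;
        total += len;
        if (len > *maxlen)
            *maxlen = len;
    }
    *avg = ht->nbuckets > 0 ? (double) total / ht->nbuckets : 0.0;
}
```

A distinct-hash-values-per-bucket count could be collected the same way
with one more counter in the inner loop.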

I ended up implementing the two-level system and doing some testing with
it.  It ends up making hash table building take quite a bit longer and
only improves the scan performance in very select cases.  The
improvement requires an individual bucket to have both lots of dups and
a lot of distinct values, because it only helps when it can skip a
significant number of tuples.  If we move to a system where there's
rarely more than one distinct value in a bucket, then the chances of a
serious improvement from the two-level system go down that much more.

As such, I've come up with this (trivial) patch which simply modifies
ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially
treating it as 1) when work_mem is large enough to fit the entire hash
table (which also implies that there is only one batch).  I'd love to
hear feedback from others on what this does under different conditions.
This also makes the hash-the-small-table case faster for the test case
which I provided earlier (http://snowman.net/~sfrost/test_case2.sql),
and it uses quite a bit less memory too.  Now that I've convinced myself
that the two-level system isn't practical and the hash-the-small-table
case is actually faster than hash-the-big-table, I'll start looking
at improving on your proposal to change the costing to favor the smaller
table being hashed more often.

Thanks,

Stephen
colordiff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
new file mode 100644
index 6a2f236..a8a8168
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
*** ExecChooseHashTableSize(double ntuples,
*** 489,496 
  		/* We expect the hashtable to fit in memory */
  		double		dbuckets;
  
! 		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
! 		dbuckets = Min(dbuckets, max_pointers);
  		nbuckets = (int) dbuckets;
  
  		nbatch = 1;
--- 489,495 
  		/* We expect the hashtable to fit in memory */
  		double		dbuckets;
  
! 		dbuckets = Min(ntuples, max_pointers);
  		nbuckets = (int) dbuckets;
  
  		nbatch = 1;




Re: [HACKERS] Hash Join cost estimates

2013-04-05 Thread Matthias

In this example, hashing the large table is actually 2 seconds *faster*
than hashing the small table (again, all on my laptop).


Are you running the laptop on battery? When I've benchmarked pgsql last  
time I used my laptop as well and it only occurred to me after a lot of  
trying that laptops (even with all energy saving disabled in my case)  
don't always make for reliable benchmark machines. Things like your CPU  
clockspeed being dynamically adjusted can produce really strange results.


Also when I was running on battery the performance numbers could not be  
compared in any way to when I was running with the laptop connected  
straight to a socket. Things like IO/CPU ratio were completely different.  
And numbers on the final testing servers were even different.


Of course your test case might not be affected by this at all, but it's  
something to watch out for.


-Matthias


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Join cost estimates

2013-04-05 Thread Stephen Frost
* Matthias (nitrogen...@gmail.com) wrote:
 In this example, hashing the large table is actually 2 seconds *faster*
 than hashing the small table (again, all on my laptop).
 
 Are you running the laptop on battery? When I've benchmarked pgsql
 last time I used my laptop as well and it only occurred to me after a
 lot of trying that laptops (even with all energy saving disabled in
 my case) don't always make for reliable benchmark machines. Things
 like your CPU clockspeed being dynamically adjusted can produce
 really strange results.

Those runs were with the laptop plugged in, but I've also run it w/o the
battery and while the performance is certainly different between those
two cases, the relative speed of hashing vs. hash-lookup has been
consistent.  Also, that's why I provided the test case- feel free (and
please do!) test it on any/all hardware you can find.  I'd love to hear
reports from others on their experiences.  Also, the relative speeds on
my laptop runs matched the performance (the laptop was slower, but
slower in both paths in a comparable way) on the big server where this
is all originating.

 Of course your test case might not be affected by this at all, but
 it's something to watch out for.

Certainly.

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Stephen Frost
Tom, all,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
 So that's at least going in the right direction.

I agree that this is going in the right direction; it certainly would
make the plan that I *expect* to be chosen more likely, however..

I've been fiddling with this on the very much larger overall database
where this test case came from and have found that hashing the large
table can actually be *faster* and appears to cause a more consistent
and constant amount of disk i/o (which is good).

The test case exhibits a bit of why this is the case- the per-tuple hash
lookup is way closer to the per-tuple cost of building the hash table
than I'd expect.

per-tuple cost to build the hash table (41M tuples): 0.33us
per-tuple cost to scan/do hash lookups (41M tuples): 0.29us
  (with a small hash table of only 41K entries)

The total difference being: 1233.854 vs. 1428.424, or only 194.570ms in
favor of scanning the big table instead of hashing it.

These numbers are just from those posted with my original email:

http://explain.depesz.com/s/FEq
http://explain.depesz.com/s/FOU

I've seen much worse though- I have one case where hash-the-big-table
took 5s and hash-the-small-table took almost 10s (total times).  I'm
trying to see if I can pull that out and isolate how it's different (and
see if it was just due to other load on the box).

What I'm trying to get at in this overall email is: why in the world is
it so expensive to do hash lookups?  I would have expected building
the hash table to cost *much* more than probing it, and that doing
hash lookups against a small hash table would be fast enough to put
serious pressure on the i/o subsystem.  Instead, the building of
the hash table actually puts more pressure and can end up being more
efficient overall.  We have a process that basically does this a whole
bunch and the hash-the-big-table operation takes about 4.7 hrs, while
the hash-the-small-table approach went well past 5 hours and was only
about 70% complete.

Thoughts?

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 I've been fiddling with this on the very much larger overall database
 where this test case came from and have found that hashing the large
 table can actually be *faster* and appears to cause a more consistent
 and constant amount of disk i/o (which is good).

Interesting.

 What I'm trying to get at in this overall email is: why in the world is
 it so expensive to do hash lookups?

perf or oprofile reveal anything?

Also, I assume that the cases you are looking at are large enough that
even the small table doesn't fit in a single hash batch?  It could
well be that the answer has to do with some bogus or at least
unintuitive behavior of the batching process, and it isn't really at all
a matter of individual hash lookups being slow.

(You never did mention what work_mem setting you're testing, anyway.)

regards, tom lane




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
  What I'm trying to get at in this overall email is: why in the world is
  it so expensive to do hash lookups?
 
 perf or oprofile reveal anything?

Working on a test case actually- I've got one now:
http://snowman.net/~sfrost/test_case2.sql

In this example, hashing the large table is actually 2 seconds *faster*
than hashing the small table (again, all on my laptop).

 Also, I assume that the cases you are looking at are large enough that
 even the small table doesn't fit in a single hash batch?  

No, quite the opposite, sorry for not mentioning that before.  Either
side fits completely into memory w/ a single batch.  The explain
analyze's that I posted before show that, either way, there's only one
batch involved.

 (You never did mention what work_mem setting you're testing, anyway.)

The test case above (where I got a 2s faster run time by hashing the
big table) used a work_mem of 1GB.

Thanks!

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 perf or oprofile reveal anything?

Here's what we get from oprofile (perhaps not too surprising):

Hash the small table / scan the big table:
samples   cum.samples   %        cum.%    linenr info         image name    symbol name
167374    167374        47.9491  47.9491  nodeHash.c:915      postgres      ExecScanHashBucket
85041     252415        24.3624  72.3115  mcount.c:60         libc-2.15.so  __mcount_internal
28370     280785         8.1274  80.4389  _mcount.S:33        libc-2.15.so  mcount
15856     296641         4.5424  84.9814  (no location information)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff)
6291      302932         1.8022  86.7836  xact.c:682          postgres      TransactionIdIsCurrentTransactionId
4555      307487         1.3049  88.0885  instrument.c:70     postgres      InstrStopNode
3849      311336         1.1027  89.1912  heapam.c:711        postgres      heapgettup_pagemode
3567      314903         1.0219  90.2130  nodeHashjoin.c:63   postgres      ExecHashJoin

Hash the big table / scan the small table:
samples   cum.samples   %        cum.%    linenr info         image name    symbol name
112060    112060        39.2123  39.2123  mcount.c:60         libc-2.15.so  __mcount_internal
36547     148607        12.7886  52.0009  nodeHash.c:709      postgres      ExecHashTableInsert
33570     182177        11.7469  63.7477  _mcount.S:33        libc-2.15.so  mcount
16383     198560         5.7328  69.4805  (no location information)  [vdso] (tgid:30643 range:0x7fffe6fff000-0x7fffe6ff)
13200     211760         4.6190  74.0995  (no location information)  no-vmlinux    /no-vmlinux
6345      218105         2.2203  76.3197  xact.c:682          postgres      TransactionIdIsCurrentTransactionId
5250      223355         1.8371  78.1568  nodeHash.c:915      postgres      ExecScanHashBucket
4797      228152         1.6786  79.8354  heapam.c:711        postgres      heapgettup_pagemode
4661      232813         1.6310  81.4664  aset.c:563          postgres      AllocSetAlloc
4588      237401         1.6054  83.0718  instrument.c:70     postgres      InstrStopNode
3550      240951         1.2422  84.3140  memcpy-ssse3-back.S:60  libc-2.15.so  __memcpy_ssse3_back
3013      243964         1.0543  85.3684  aset.c:1109         postgres      AllocSetCheck

Looking at the 'Hash the small table / scan the big table' case,
opannotate claims that this is by far the worst offender:

147588 42.2808 :hashTuple = hashTuple->next;

While most of the time in the 'Hash the big table / scan the small
table' is in:

 34572 12.0975 :hashTuple->next = hashtable->buckets[bucketno];

Neither of those strikes me as terribly informative though.  To be
honest, I've not really played w/ oprofile all that much.  Now that I've
got things set up to support this, I'd be happy to provide more info if
anyone has suggestions on how to get something more useful.

It does look like reducing bucket depth, as I outlined before through
the use of a 2-level hashing system, might help speed up
ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
entries to consider instead of more.  Along those same lines, I really
wonder if we're being too generous wrt the bucket-depth goal of '10'
instead of, say, '1', especially when we've got plenty of work_mem
available.

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
 85041   252415   24.3624  72.3115  mcount.c:60   libc-2.15.so  __mcount_internal
 28370   280785    8.1274  80.4389  _mcount.S:33  libc-2.15.so  mcount
[...]

And as a side-note, I'm rebuilding w/o profiling, asserts, etc and will
run it again, though I don't really expect the top-hitters to change.

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
 It does look like reducing bucket depth, as I outlined before through
 the use of a 2-level hashing system, might help speed up
 ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
 entries to consider instead of more.  Along those same lines, I really
 wonder if we're being too generous wrt the bucket-depth goal of '10'
 instead of, say, '1', especially when we've got plenty of work_mem
 available.

Rerunning using a minimally configured build (only --enable-openssl
and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
results in a couple of interesting things-

First, the planner actually picks the plan to hash the small table and
seqscan the big one.  That also, finally, turns out to be *faster* for
this test case.

explain analyze results here:
Hash small table / seqscan big table: http://explain.depesz.com/s/nP1
Hash big table / seqscan small table: http://explain.depesz.com/s/AUv

Here's the oprofile reports:

Hash small table / seqscan big table:
samples   cum.samples   %        cum.%    linenr info         image name  symbol name
39023     39023         52.8574  52.8574  nodeHash.c:915      postgres    ExecScanHashBucket
3743      42766          5.0700  57.9273  xact.c:682          postgres    TransactionIdIsCurrentTransactionId
3110      45876          4.2126  62.1399  nodeHashjoin.c:63   postgres    ExecHashJoin
2561      48437          3.4689  65.6088  heapam.c:711        postgres    heapgettup_pagemode
2427      50864          3.2874  68.8962  heapam.c:300        postgres    heapgetpage
2395      53259          3.2441  72.1403  heaptuple.c:1028    postgres    slot_deform_tuple
2395      55654          3.2441  75.3843  heaptuple.c:1135    postgres    slot_getattr
2383      58037          3.2278  78.6122  nodeHash.c:786      postgres    ExecHashGetHashValue
1811      59848          2.4530  81.0652  tqual.c:1044        postgres    HeapTupleSatisfiesMVCC
1796      61644          2.4327  83.4979  execScan.c:111      postgres    ExecScan
1298      62942          1.7582  85.2561  hashfunc.c:517      postgres    hash_uint32
1274      64216          1.7257  86.9817  execProcnode.c:356  postgres    ExecProcNode
1011      65227          1.3694  88.3511  heapam.c:1453       postgres    heap_getnext
905       66132          1.2258  89.5770  execTuples.c:333    postgres    ExecStoreTuple
858       66990          1.1622  90.7392  fmgr.c:1291         postgres    FunctionCall1Coll
835       67825          1.1310  91.8702  execQual.c:668      postgres    ExecEvalScalarVarFast
834       68659          1.1297  92.9999  mcxt.c:126          postgres    MemoryContextReset
818       69477          1.1080  94.1078  nodeSeqscan.c:48    postgres    SeqNext

Hash big table / seqscan small table:
samples   cum.samples   %        cum.%    linenr info         image name    symbol name
38612     38612         41.2901  41.2901  nodeHash.c:709      postgres      ExecHashTableInsert
7435      46047          7.9507  49.2408  (no location information)  no-vmlinux  /no-vmlinux
4900      50947          5.2399  54.4806  aset.c:563          postgres      AllocSetAlloc
3803      54750          4.0668  58.5474  xact.c:682          postgres      TransactionIdIsCurrentTransactionId
3335      58085          3.5663  62.1137  heapam.c:711        postgres      heapgettup_pagemode
2532      60617          2.7076  64.8213  nodeHash.c:786      postgres      ExecHashGetHashValue
2523      63140          2.6980  67.5193  memcpy-ssse3-back.S:60  libc-2.15.so  __memcpy_ssse3_back
2518      65658          2.6926  70.2119  heaptuple.c:1028    postgres      slot_deform_tuple
2378      68036          2.5429  72.7549  heapam.c:300        postgres      heapgetpage
2374      70410          2.5387  75.2935  heaptuple.c:1135    postgres      slot_getattr
1852      72262          1.9805  77.2740  nodeHash.c:915      postgres      ExecScanHashBucket
1831      74093          1.9580  79.2320  tqual.c:1044        postgres      HeapTupleSatisfiesMVCC
1732      75825          1.8521  81.0841  heapam.c:1453       postgres      heap_getnext
1320      77145          1.4116  82.4957  nodeHash.c:76       postgres      MultiExecHash
1219      78364          1.3035  

Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Looking with opannotate, there's two main hotspots in
 ExecScanHashBucket:

  12846 17.4001 :hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
 and
  22100 29.9348 :hashTuple = hashTuple->next;

Those are, of course, pretty trivial statements; so the way I read this
is that the fundamental slowdown comes from the hash table being large
compared to the CPU's cache, so that you're incurring lots of cache
misses at these particular fetches.  (You might be able to confirm that
if you can set oprofile to count cache misses rather than wall clock
time.)

 I'm certainly curious about those, but I'm also very interested in the
 possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
 depending on the work_mem setting.

Not sure about that.  That would make the hash-bucket-header array
larger without changing the size of the rest of the hash table, thus
probably making the CPU cache situation worse not better (which would
manifest as more time at the first of these two statements relative to
the second).

Can you add some instrumentation code to get us information about the
average/max length of the bucket chains?  And maybe try to figure out
how many distinct hash values per bucket, which would give us a clue
whether your two-level-list idea is worth anything.

regards, tom lane




Re: [HACKERS] Hash Join cost estimates

2013-04-04 Thread k...@rice.edu
On Thu, Apr 04, 2013 at 04:16:12PM -0400, Stephen Frost wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
  It does look like reducing bucket depth, as I outlined before through
  the use of a 2-level hashing system, might help speed up
  ExecScanHashBucket, as it would hopefully have very few (eg: 1-2)
  entries to consider instead of more.  Along those same lines, I really
  wonder if we're being too generous wrt the bucket-depth goal of '10'
  instead of, say, '1', especially when we've got plenty of work_mem
  available.
 
 Rerunning using a minimally configured build (only --enable-openssl
 and --enable-debug passed to configure) with NTUP_PER_BUCKET set to '1'
 results in a couple of interesting things-
 
 First, the planner actually picks the plan to hash the small table and
 seqscan the big one.  That also, finally, turns out to be *faster* for
 this test case.
 
 ...
 
 I'm certainly curious about those, but I'm also very interested in the
 possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
 depending on the work_mem setting.  It's only used in
 ExecChooseHashTableSize, so while making it variable or depending on
 work_mem could slow planning down a bit, it's not a per-tuple cost item.
 
+1 for adjusting this based on work_mem value.

Ken




Re: [HACKERS] Hash Join cost estimates

2013-04-01 Thread Jeff Davis
On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
 Really, when we're traipsing down a bucket
 list, skipping over bucket entries with the wrong hash code is just
 about free, or at least it's a whole lot cheaper than applying ExecQual.
 
 Perhaps what we should do is charge the hash_qual_cost only for some
 small multiple of the number of tuples that we expect will *pass* the
 hash quals, which is a number we have to compute anyway.  The multiple
 would represent the rate of hash-code collisions we expect.

+1.

 I'd still be inclined to charge something per bucket entry, but it
 should be really small, perhaps on the order of 0.01 times
 cpu_operator_cost.

 Or we could just drop that term entirely. 

FWIW, either of those are fine with me based on my limited experience.

 Maybe what we should be doing with the bucketsize numbers is estimating
 peak memory consumption to gate whether we'll accept the plan at all,
 rather than adding terms to the cost estimate.

Sounds reasonable.

Ideally, we'd have a way to continue executing even in that case; but
that's a project by itself, and would make it even more difficult to
cost accurately.

Regards,
Jeff Davis





Re: [HACKERS] Hash Join cost estimates

2013-04-01 Thread Robert Haas
On Mon, Apr 1, 2013 at 2:35 AM, Jeff Davis pg...@j-davis.com wrote:
 On Sun, 2013-03-31 at 15:45 -0400, Tom Lane wrote:
 Really, when we're traipsing down a bucket
 list, skipping over bucket entries with the wrong hash code is just
 about free, or at least it's a whole lot cheaper than applying ExecQual.

 Perhaps what we should do is charge the hash_qual_cost only for some
 small multiple of the number of tuples that we expect will *pass* the
 hash quals, which is a number we have to compute anyway.  The multiple
 would represent the rate of hash-code collisions we expect.

 +1.

 I'd still be inclined to charge something per bucket entry, but it
 should be really small, perhaps on the order of 0.01 times
 cpu_operator_cost.

 Or we could just drop that term entirely.

 FWIW, either of those are fine with me based on my limited experience.

FWIW, I have also seen this problem and the proposed fixes sound
reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Join cost estimates

2013-04-01 Thread Tom Lane
I wrote:
 Perhaps what we should do is charge the hash_qual_cost only for some
 small multiple of the number of tuples that we expect will *pass* the
 hash quals, which is a number we have to compute anyway.  The multiple
 would represent the rate of hash-code collisions we expect.

I tried the attached quick-hack patch on Stephen's example.  With
work_mem set to 16MB I get these results:

regression=# explain analyze select * from small_table join big_table using 
(id_short);
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1229.46..74154.49 rows=41176 width=24) (actual time=47.723..1845.869 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.045..506.212 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual time=24.944..24.944 rows=41176 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2574kB
         ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.007..11.608 rows=41176 loops=1)
 Total runtime: 1847.697 ms
(7 rows)

Forcing the other plan to be chosen, I get

regression=# explain analyze select * from small_table join big_table using 
(id_short);
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=131719.10..150327.44 rows=41176 width=24) (actual time=1922.942..2810.095 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24) (actual time=0.012..10.058 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual time=1921.962..1921.962 rows=4272271 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 9412kB
         ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4) (actual time=0.043..702.898 rows=4272271 loops=1)
 Total runtime: 2820.633 ms
(7 rows)

So that's at least going in the right direction.

I have not thought about how the calculation should be adjusted in the
semi/anti join case, nor about how we ought to repurpose the
bucket-size-variance calculations for checking whether work_mem will be
exceeded.  So this is a long way from being committable, but it seems
promising.

regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8d24902..48b21cb 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
*** final_cost_hashjoin(PlannerInfo *root, H
*** 2661,2685 
  	else
  	{
  		/*
- 		 * The number of tuple comparisons needed is the number of outer
- 		 * tuples times the typical number of tuples in a hash bucket, which
- 		 * is the inner relation size times its bucketsize fraction.  At each
- 		 * one, we need to evaluate the hashjoin quals.  But actually,
- 		 * charging the full qual eval cost at each tuple is pessimistic,
- 		 * since we don't evaluate the quals unless the hash values match
- 		 * exactly.  For lack of a better idea, halve the cost estimate to
- 		 * allow for that.
- 		 */
- 		startup_cost += hash_qual_cost.startup;
- 		run_cost += hash_qual_cost.per_tuple * outer_path_rows *
- 			clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
- 
- 		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, path->jpath, hashclauses);
  	}
  
  	/*
--- 2661,2692 
  	else
  	{
  		/*
  		 * Get approx # tuples passing the hashquals.  We use
  		 * approx_tuple_count here because we need an estimate done with
  		 * JOIN_INNER semantics.
  		 */
  		hashjointuples = approx_tuple_count(root, path->jpath, hashclauses);
+ 
+ 		/*
+ 		 * We assume that the hash equality operators will be applied to twice
+ 		 * as many join pairs as actually pass the hashquals, that is there's
+ 		 * about one hash-value collision per tuple.  This is probably a
+ 		 * substantial overestimate, but it seems wise to be conservative.
+ 		 * Lacking any good model of the effectiveness of the hash functions,
+ 		 * it's hard to do much better than a constant ratio.  (XXX perhaps we
+ 		 * ought to charge more than this for very large relations, since the
+ 		 * hash values are only 32 bits wide and hence will suffer
+ 		 * proportionally more collisions when there are many rows.)
+ 		 *
+ 		 * Note that we don't charge anything for visiting hash 

Re: [HACKERS] Hash Join cost estimates

2013-03-31 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Jeff Davis (pg...@j-davis.com) wrote:
 In Stephen's case the table was only 41KB, so something still seems off.
 Maybe we should model the likelihood of a collision based on the
 cardinalities (assuming a reasonably good hash function)?

 It's not really 'hash collisions' that we're trying to be wary of, per
 se, it's the risk of duplicates.

I spent some time looking at this.  I think the real issue is that the
code is trying to charge something close to cpu_operator_cost for each
time an outer tuple is compared to a hashtable entry.  That was actually
reasonable back when the costing code was last touched --- the current
scheme of storing and comparing the hash codes before testing the
equality operator proper only dates to commit
849074f9ae422c64501bb1d53ef840de870bf65c.  I punted on changing the cost
estimates at that time, and I think what this example is showing is that
that was a bad decision.  Really, when we're traipsing down a bucket
list, skipping over bucket entries with the wrong hash code is just
about free, or at least it's a whole lot cheaper than applying ExecQual.

Perhaps what we should do is charge the hash_qual_cost only for some
small multiple of the number of tuples that we expect will *pass* the
hash quals, which is a number we have to compute anyway.  The multiple
would represent the rate of hash-code collisions we expect.

I'd still be inclined to charge something per bucket entry, but it
should be really small, perhaps on the order of 0.01 times
cpu_operator_cost.

Or we could just drop that term entirely.  It strikes me that the reason
to be worried about skewed distribution in the inner relation is not
really that it changes the run time much, but rather that it risks
blowing out work_mem if specific virtual buckets have too many members
(or at best, we're forced to increase the number of batches more than we
thought to stay under work_mem; which carries runtime costs of its own).
Maybe what we should be doing with the bucketsize numbers is estimating
peak memory consumption to gate whether we'll accept the plan at all,
rather than adding terms to the cost estimate.
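
To put rough numbers on the proposal, here is a toy comparison of the
current charge (outer rows times expected bucket chain length, halved)
against the proposed one (a small collision multiple of the tuples
that pass the hash quals).  The inputs are hypothetical, loosely based
on the explain output from Stephen's test case (4,272,271 outer rows,
13,731 join results, 8,192 buckets):

```c
#include <assert.h>

/* Current charging rule: qual cost per outer tuple times the expected
 * bucket chain length (inner rows * bucketsize fraction), halved. */
static double
old_qual_charge(double per_tuple_cost, double outer_rows,
                double inner_rows, double innerbucketsize)
{
    return per_tuple_cost * outer_rows * (inner_rows * innerbucketsize) * 0.5;
}

/* Proposed rule: charge the quals only for a small multiple of the
 * tuples expected to pass them (here 2x, i.e. roughly one hash-code
 * collision per passing tuple). */
static double
new_qual_charge(double per_tuple_cost, double hashjointuples)
{
    return per_tuple_cost * hashjointuples * 2.0;
}
```

With a per-tuple qual cost at the default cpu_operator_cost of 0.0025
and a uniform bucketsize fraction of 1/8192, the old rule charges tens
of thousands of cost units against the hash-the-small-table plan while
the proposed one charges well under a hundred.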

regards, tom lane




Re: [HACKERS] Hash Join cost estimates

2013-03-30 Thread Stephen Frost
Jeff,

* Jeff Davis (pg...@j-davis.com) wrote:
 On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66
 
 I think those are backwards -- typo?

Yes, sorry, those are backwards.  The '4M hashed, seqscan 41K' entry
comes out with the lower cost and that's what we end up using.

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-03-30 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 I think the point is that there may *not* be few hash collisions ...

Right, but that's actually all entirely based on concerns over there
being duplicates (hence why we consider the MCVs and ndistinct), which
makes *some* sense given that we currently have a single linked-list in
each bucket into which any dups are placed.

It occurs to me that we could get away from this by having a 2-level
system.  We hash to a bucket which contains a linked list of linked
lists.  The bucket list would be for actual collisions (which we don't
consider at all in the current bucket estimating) and each of those
entries would be a linked list of duplicates for that entry in the
bucket.  This would add some small cost for building the hash, since we
would have to step through each entry in the bucket to determine if the
entry being added is a new entry or not, but not very much unless we're
worried about lots of collisions happening (which I certainly hope isn't
the case).  Hash tables are generally expected to take more effort to
build initially anyway, so I don't see a problem putting more logic
there.  Also, we could skip checking each entry in the bucket when the
input is known to be unique and instead just skip to the end of the
bucket since the new entry can't match any existing.
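A minimal sketch of that two-level layout (illustrative Python rather than the executor's C; all names here are hypothetical): each bucket holds one entry per distinct key, and the duplicates for a key hang off that entry instead of lengthening the bucket chain.

```python
class TwoLevelHashTable:
    """Each bucket holds one entry per distinct key; duplicates for a key
    hang off that entry rather than lengthening the bucket chain."""

    def __init__(self, nbuckets):
        self.nbuckets = nbuckets
        self.buckets = [[] for _ in range(nbuckets)]  # bucket: list of [key, dups]

    def insert(self, key, tup):
        bucket = self.buckets[hash(key) % self.nbuckets]
        for entry in bucket:              # step through true collisions only
            if entry[0] == key:
                entry[1].append(tup)      # existing key: extend its dup list
                return
        bucket.append([key, [tup]])       # new key: one more bucket entry

    def probe(self, key):
        bucket = self.buckets[hash(key) % self.nbuckets]
        for entry in bucket:              # one comparison per distinct key
            if entry[0] == key:
                return entry[1]           # emit all dups; skip rest of bucket
        return []
```

With thousands of copies of a single value inserted, a probe for a non-matching key costs one comparison per distinct key in the bucket instead of walking the whole duplicate chain.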

We could still work through the bucketing logic and add some costing to
that case for those situations where we are hashing on only one key of a
multi-key join and we expect a lot of duplicates to exist.  I'm not sure
how much that happens though- I would hope that we would use a composite
hash key most of the time that we have multi-key joins that use hash
tables.

Thoughts?

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-03-30 Thread Stephen Frost
* Jeff Davis (pg...@j-davis.com) wrote:
 In Stephen's case the table was only 41K rows, so something still seems off.
 Maybe we should model the likelihood of a collision based on the
 cardinalities (assuming a reasonably good hash function)?

It's not really 'hash collisions' that we're trying to be wary of, per
se, it's the risk of duplicates.  To push this very far in the other
direction- if you have 41k of the *same value* in the small table, then
it's currently faster to build the hash table on the large table and
then seq scan the small table (10s vs. 14s on my laptop running w/o
being plugged in, so it's a bit slower).

Now, that's a pretty ridiculous case, but it seems like we're being
pretty dumb here- for every input row from the outer table, we're
looking through all 41K *duplicate* keys in that one hash bucket.  This
goes back to the suggestion I just made- if the hash bucket list
contained only unique values (which are a result of actual hash
collisions), then we'd only be doing *one* comparison for each input row
of the outer table that doesn't match- and when we found one which
*did*, we would only need to step through the dup list for that one
entry and blast back all of those rows, forgetting about the rest of the
bucket which we know can't match.

 Also, I think I found an important assumption that seems dubious (in
 comment for estimate_hash_bucketsize()):

I've wondered about that also.  It certainly seems quite bogus that we
can end up with an 'estimated # of entries in a bucket' that's larger
than the # of entries we've found for the MCV in the table, especially
*double* that.

 Stephen, do you think this could explain your problem?

As I tried to articulate in my initial email- even if we had a *perfect*
answer for how many comparisons we will need to do, the current costing
would cause us to pick the plan that, intuitively and empirically, takes
longer (hash the 4M row table), because the bucket cost is multiplied by
the number of outer rows while the cpu_tuple_cost (charged to build the
hash table) isn't high enough relative to the cpu_op_cost (charged to do
the comparisons in the buckets).

Thanks,

Stephen




Re: [HACKERS] Hash Join cost estimates

2013-03-29 Thread Jeff Davis
On Thu, 2013-03-28 at 19:56 -0400, Stephen Frost wrote:
   41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
   4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

I think those are backwards -- typo?

   In the end, I think the problem here is that we are charging far too
   much for these bucket costs (particularly when we're getting them so
   far wrong) and not nearly enough for the cost of building the hash
   table in the first place.
 
   Thoughts?  Ideas about how we can 'fix' this?  Have others run into
   similar issues?

Yes, I have run into this issue (or something very similar). I don't
understand why the bucketsize even matters much -- assuming few hash
collisions, we are not actually evaluating the quals any more times than
necessary. So why all of the hashjoin-specific logic in determining the
number of qual evaluations? The only reason I can think of is to model
the cost of comparing the hashes themselves.

Also, searching the archives turns up at least one other, but I think
I've seen more:

http://www.postgresql.org/message-id/a82128a6-4e3b-43bd-858d-21b129f7b...@richrelevance.com

Regards,
Jeff Davis





Re: [HACKERS] Hash Join cost estimates

2013-03-29 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 Yes, I have run into this issue (or something very similar). I don't
 understand why the bucketsize even matters much -- assuming few hash
 collisions, we are not actually evaluating the quals any more times than
 necessary. So why all of the hashjoin-specific logic in determining the
 number of qual evaluations? The only reason I can think of is to model
 the cost of comparing the hashes themselves.

I think the point is that there may *not* be few hash collisions ...

regards, tom lane




Re: [HACKERS] Hash Join cost estimates

2013-03-29 Thread Jeff Davis
On Fri, 2013-03-29 at 16:37 -0400, Tom Lane wrote: 
 Jeff Davis pg...@j-davis.com writes:
  Yes, I have run into this issue (or something very similar). I don't
  understand why the bucketsize even matters much -- assuming few hash
  collisions, we are not actually evaluating the quals any more times than
  necessary. So why all of the hashjoin-specific logic in determining the
  number of qual evaluations? The only reason I can think of is to model
  the cost of comparing the hashes themselves.
 
 I think the point is that there may *not* be few hash collisions ...

In Stephen's case the table was only 41K rows, so something still seems off.
Maybe we should model the likelihood of a collision based on the
cardinalities (assuming a reasonably good hash function)?

Also, I think I found an important assumption that seems dubious (in
comment for estimate_hash_bucketsize()):

If the other relation in the join has a key distribution similar to
this one's, then the most-loaded buckets are exactly those that will be
probed most often.  Therefore, the average bucket size for costing
purposes should really be taken as something close to the worst case
bucket size.  We try to estimate this by adjusting the fraction if there
are too few distinct data values, and then scaling up by the ratio of
the most common value's frequency to the average frequency.

But the key distribution is not necessarily similar at all... the large
table might have many more distinct values.
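That mismatch can be illustrated with a toy simulation (assumed distributions, illustrative only): a skewed inner side makes the worst-case bucket deep, but a probe side that is near-uniform over many distinct values still sees roughly the *average* chain length.

```python
import random

random.seed(42)
NBUCKETS = 64

# Inner (hashed) side: skewed -- one heavily duplicated value plus distinct ones.
inner = [0] * 500 + list(range(1, 501))          # 1000 rows, key 0 is the MCV
buckets = [[] for _ in range(NBUCKETS)]
for key in inner:
    buckets[key % NBUCKETS].append(key)

avg_chain = len(inner) / NBUCKETS                # 15.625
worst_chain = max(len(b) for b in buckets)       # bucket 0 carries the MCV pile

# Outer (probe) side: near-uniform over many distinct values, so it seldom
# lands on the MCV's bucket -- violating the "similar distribution" assumption.
outer = [random.randrange(10_000) for _ in range(10_000)]
mean_probe = sum(len(buckets[k % NBUCKETS]) for k in outer) / len(outer)
```

Here the chain length actually scanned per probe tracks `avg_chain`, not `worst_chain`, so costing from "something close to the worst case bucket size" overcharges this shape of join badly.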

Stephen, do you think this could explain your problem?

Regards,
Jeff Davis







[HACKERS] Hash Join cost estimates

2013-03-28 Thread Stephen Frost
All,

  Marty's issue w/ NestLoop reminded me that I'd put together a test
  case which illustrates a HashJoin doing the wrong thing.

  The test case is here:

  http://snowman.net/~sfrost/test_case.sql

  Simply load that up and then check out this plan:

  explain select * from small_table join big_table using (id_short);

  We end up deciding to build a hash table of 4M records and then seq
  scan the table that only has 41K.  Much of the reason for this is
  that the statistics point out, correctly, that the column we're
  using from the small table to join isn't unique and therefore the
  buckets will be deeper- but it's not *nearly* as bad as we estimate.

  The bucket estimate for the small table comes back as 26, while the
  reality is that we only look through ~5.5 entries per bucket with
  the longest run being 21.  With the big table being hashed, the bucket
  estimate is 4 (though this seems to vary way more than I would expect,
  sometimes 4, sometimes 8, sometimes 2..) while the average number
  scanned through is actually ~3.5 and the longest scan ends up being
  20.

  Without the bucket question, we end up with pretty reasonable results
  (directly out of initial_cost_hashjoin):

  41K hashed, seqscan 4M:
  startup_cost = 1229.46
  run_cost = 72307.39

  4M hashed, 41K seqscan:
  startup_cost = 115030.10
  run_cost = 817.70

  When we get through dealing with the bucketing question in
  final_cost_hashjoin (costsize.c:2673), we have some pretty gross
  results for the 'good' plan's run_cost:

  run_cost = 72307.39 + 138848.81 = 211156.20

  While the 'bad' plan's run_cost is only bumped a tiny bit:

  run_cost = 817.7 + 411.76 = 1229.46

  Resulting in total costs that look all kinds of wacky:

  41K hashed, seqscan 4M: 115030.10 + 1229.46 = 116259.56
  4M hashed, seqscan 41K: 1229.46 + 211156.20 = 212385.66

  Or the 'good' plan being costed at *nearly double*.  Now, my laptop
  might not be the 'best' system CPU wise, but there's a pretty massive
  time difference between these plans:

  41K hashed, seqscan 4M: 2252.007 ms http://explain.depesz.com/s/FEq
  4M hashed, seqscan 41K: 2708.471 ms http://explain.depesz.com/s/FOU

  That's half a second and a good 15+% difference.

  Now, back to the bucket estimation- the ndistinct for the small table
  is -0.475058 (or 19561 tuples), which is about right.  There are 23
  distinct values, 18,795 duplicated, and another 841 with dup counts
  ranging from 3 to 10.  This leads to an avgfreq of 0.0000511;
  unfortunately, we're going for only 8192 buckets, so this gets moved
  up to 0.00012 and then the adjustment for MCV (which is 0.00027) bumps
  this all the way up to 0.00064, leading to our bucket depth estimate
  of 26 (bucket size estimate multiplied by the inner rows) and the
  resulting cost dominating the overall costing.
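That arithmetic can be reproduced directly from the numbers above (a sketch of the estimate_hash_bucketsize logic, not the exact C; row counts are the approximate values quoted in this message):

```python
ndistinct  = 19561      # from ndistinct = -0.475058 on the ~41K-row table
nbuckets   = 8192
mcv_freq   = 0.00027    # most common value's frequency
inner_rows = 41000      # approximate inner relation size

avgfreq  = 1.0 / ndistinct                 # ~0.0000511
estfract = max(avgfreq, 1.0 / nbuckets)    # clamped up to ~0.00012
estfract *= mcv_freq / avgfreq             # MCV adjustment -> ~0.00064
bucket_depth = estfract * inner_rows       # -> ~26 entries per bucket
```

versus the ~5.5 entries actually scanned per bucket.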

  If we consider the bucketing wrong and look at what the costs would
  have been with the actual number of average scans (and therefore
  excluding the 0.5 'adjustment'), we get:

  seqscan 41K cost: 360.28 (cpu_op_cost * 41K * 3.5)
  seqscan 4M cost: 58743.73 (cpu_op_cost * 4M * 5.5)

  which isn't exactly going in the 'right' direction for us.  Now, I'm
  sure that there's a cost to having to consider more buckets, but it
  seems to be far less, in comparison to the hash table creation cost,
  than what our model would suggest.
  
  In the end, I think the problem here is that we are charging far too
  much for these bucket costs (particularly when we're getting them so
  far wrong) and not nearly enough for the cost of building the hash
  table in the first place.

  Thoughts?  Ideas about how we can 'fix' this?  Have others run into
  similar issues?

Thanks,

Stephen

