Hi

Could someone please try to explain why the cost estimator for the hash join is implemented the way it is? (cost_hashjoin in costsize.c)
Especially these issues:


First, there are the estimates of the number of rows and their width. ExecChooseHashTableSize() apparently trusts neither and doubles both, so the function treats the input relation as four times larger than the rest of the optimizer thinks it is. Why is that?
And why is the doubling also applied to the size of HashJoinTupleData, yet not applied twice to the estimated number of bytes the hash table would use, the number that goes into the calculation of the number of batches? (See the sketch below.)
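
For reference, this is roughly the arithmetic I am looking at. It is my own paraphrase of ExecChooseHashTableSize() as I read the sources, not a verbatim quote: the MAXALIGN stand-in and the HashJoinTupleData layout are simplifications so the fragment compiles on its own, and all input numbers are invented.

#include <math.h>
#include <stdio.h>

#define FUDGE_FAC        2.0   /* the doubling in question */
#define NTUP_PER_BUCKET  10    /* guessed average tuples per bucket */

/* Stand-ins so the sketch compiles standalone; in the backend these
 * come from the executor headers. */
#define MAXALIGN(LEN)  (((LEN) + 7) & ~(size_t) 7)
typedef struct HashJoinTupleData { void *next; unsigned hashvalue; } HashJoinTupleData;

int main(void)
{
    double ntuples = 1000000.0;              /* made-up planner estimates */
    int    tupwidth = 40;
    long   hash_table_bytes = 1024 * 1024L;  /* roughly, SortMem in bytes */

    /* per-tuple footprint: data width plus the HashJoinTupleData
     * overhead, so FUDGE_FAC below doubles the overhead struct too */
    size_t tupsize = MAXALIGN(sizeof(HashJoinTupleData)) + MAXALIGN(tupwidth);

    /* bytes the inner relation is assumed to need -- one FUDGE_FAC only;
     * this is the number that later drives the number of batches */
    double inner_rel_bytes = ntuples * tupsize * FUDGE_FAC;

    /* but the bucket count doubles ntuples ... */
    double totalbuckets = ceil(ntuples * FUDGE_FAC / NTUP_PER_BUCKET);

    /* ... and the space reserved per physical bucket is doubled again,
     * so the table is effectively sized for 4x the estimated input */
    double bucketsize = NTUP_PER_BUCKET * (double) tupsize;
    int    nbuckets = (int) (hash_table_bytes / (bucketsize * FUDGE_FAC));

    printf("tupsize=%zu inner_rel_bytes=%g totalbuckets=%g nbuckets=%d\n",
           tupsize, inner_rel_bytes, totalbuckets, nbuckets);
    return 0;
}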


Second, why does the optimizer first guess that there will be 10 tuples in a bucket, and then afterwards spend a lot of effort carefully estimating that very number for use in another calculation, using a bucket count that was itself based on the 10-tuples-per-bucket guess? (Sketched below.)
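
To make the circularity concrete, here is the skeleton as I understand it: step 1 is from ExecChooseHashTableSize(), step 2 is the clamping inside estimate_hash_bucketsize() in selfuncs.c, which cost_hashjoin() calls with the bucket count from step 1. Only the skeleton of the real functions, and all numbers are made up.

#include <math.h>
#include <stdio.h>

#define FUDGE_FAC        2.0
#define NTUP_PER_BUCKET  10

int main(void)
{
    double ntuples = 1000000.0;   /* made-up inner relation size */
    double ndistinct = 50000.0;   /* made-up distinct-value estimate */

    /* step 1 (ExecChooseHashTableSize): bucket count from the flat
     * 10-tuples-per-bucket guess */
    double nbuckets = ceil(ntuples * FUDGE_FAC / NTUP_PER_BUCKET);

    /* step 2 (estimate_hash_bucketsize): the "careful" per-bucket
     * fraction, clamped by ... the bucket count from step 1 */
    double estfract = 1.0 / ndistinct;
    if (estfract < 1.0 / nbuckets)
        estfract = 1.0 / nbuckets;

    printf("nbuckets = %.0f, bucket fraction = %g (-> %g tuples/bucket)\n",
           nbuckets, estfract, estfract * ntuples);
    return 0;
}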

Third, why does the calculation assume that the most common value dominates by far, while the other common values count for nothing? (See below.)
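
The adjustment I am referring to looks, as far as I can tell, like the following (my paraphrase of the skew correction in estimate_hash_bucketsize(); mcvfreq is the frequency of the single most common value, i.e. the first entry of the MCV statistics list; the statistics values here are invented):

#include <stdio.h>

int main(void)
{
    /* invented statistics, for illustration only */
    double nullfrac  = 0.0;
    double ndistinct = 1000.0;
    double mcvfreq   = 0.10;    /* frequency of MCV #1 -- the only MCV used */

    double estfract = 1.0 / ndistinct;

    /* paraphrase of the skew adjustment: scale by how far the single
     * most common value exceeds the average frequency; the 2nd, 3rd, ...
     * most common values never enter the calculation */
    double avgfreq = (1.0 - nullfrac) / ndistinct;
    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= mcvfreq / avgfreq;

    printf("estfract = %g\n", estfract);
    return 0;
}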

Fourth, is the hash function assumed never to create collisions between non-identical values? And do multiple join qualifiers not affect this either? (See the paraphrase below.)
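
As far as I can tell, the comparison cost charged in cost_hashjoin() boils down to a single term like the one below, with no component for non-identical values that happen to hash into the same bucket; and with several hashclauses, innerbucketsize is (as I read it) just the smallest of the per-clause estimates. This is my rough paraphrase, and the row counts are made up.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double cpu_operator_cost = 0.0025;  /* the default setting */
    double outer_rows = 100000.0;       /* made-up row counts */
    double inner_rows = 50000.0;
    double innerbucketsize = 0.001;     /* estimated bucket fraction */

    /* comparisons charged: outer rows times the expected bucket load;
     * every tuple in the bucket is treated as a genuine candidate match,
     * i.e. non-identical values never land in the same bucket */
    double run_cost = cpu_operator_cost * outer_rows *
                      ceil(inner_rows * innerbucketsize);

    printf("hash qual comparison cost = %g\n", run_cost);
    return 0;
}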

Fifth, is a probe assumed most often to look in a chain of average length? I would assume that a lot more of the time is spent looking in the chains with the most tuples, since that is where most of the tuples are... (Toy example below.)
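
A toy example of what I mean; this one is entirely mine, not from the sources. If the probe keys are distributed like the build keys, the chain length a probe expects to see is sum(n_i^2)/sum(n_i), not the plain average sum(n_i)/nbuckets:

#include <stdio.h>

int main(void)
{
    /* two chains holding 1 and 19 tuples: 20 tuples in 2 buckets */
    double chain[] = {1.0, 19.0};
    int    nbuckets = 2;
    double total = 0.0;
    double sumsq = 0.0;

    for (int i = 0; i < nbuckets; i++)
    {
        total += chain[i];
        sumsq += chain[i] * chain[i];
    }

    printf("plain average chain length:     %g\n", total / nbuckets); /* 10 */
    printf("length seen by a typical probe: %g\n", sumsq / total);    /* 18.1 */
    return 0;
}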


-- Dennis

