[HACKERS] why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-09 Thread b8flowerfire
While reading the hash join source code, I was confused that PostgreSQL
defines the NTUP_PER_BUCKET value as 10.
Since this value is used to estimate the number of tuples in one bucket,
wouldn't it be better to have a smaller value?
I have not run any experiments, but it seems that we could achieve fewer
hash collisions and better performance if we decreased the value.
So could anyone explain why NTUP_PER_BUCKET is defined as 10?
Is there a specific situation in which we would get worse performance or
run into trouble if we set NTUP_PER_BUCKET to 1 or 2?
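
For reference, here is a self-contained simplification of the sizing
logic I am asking about. It paraphrases ExecChooseHashTableSize() in
src/backend/executor/nodeHash.c; estimate_nbuckets and the rounding
details are my own sketch, not the actual source:

#include <math.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10      /* the constant in question */

/*
 * Sketch of the planner's bucket-count estimate: size the bucket array
 * so that about NTUP_PER_BUCKET tuples hash to each bucket on average,
 * then round up to a power of two.
 */
static int
estimate_nbuckets(double ntuples)
{
    int     nbuckets = (int) ceil(ntuples / NTUP_PER_BUCKET);
    int     pow2 = 1;

    while (pow2 < nbuckets)
        pow2 <<= 1;
    return pow2;
}

int
main(void)
{
    /* 1 million inner tuples -> 100,000 buckets -> rounded to 131072 */
    printf("nbuckets = %d\n", estimate_nbuckets(1000000.0));
    return 0;
}

A smaller NTUP_PER_BUCKET makes estimate_nbuckets() return a larger
array, which is where my question about the trade-off comes from.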

Thanks very much. 







[HACKERS] Re: why postgresql define NTUP_PER_BUCKET as 10, not other numbers smaller

2014-06-09 Thread b8flowerfire
Robert Haas wrote:
 On Mon, Jun 9, 2014 at 4:06 AM, b8flowerfire <b8flowerfire@> wrote:
 
 This has come up before.  Basically, the problem is that if you reduce
 NTUP_PER_BUCKET, the bucket array gets larger, which might reduce the
 amount of space available for tuples to the point where the hash join
 overflows to multiple batches.  That will be more expensive than
 performing the hash join with a higher NTUP_PER_BUCKET.
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

Thanks for the explanation. But I don't find it very convincing.
Simply reducing the value of NTUP_PER_BUCKET will enlarge the pointer
array and reduce the number of tuples that fit in one batch. But is that
effect significant for performance?
The utilization of work_mem, I think, is determined by the ratio of the
pointer size to the tuple size.
Let's assume the size of a tuple is 28 bytes, which is quite reasonable
because it's the sum of HJTUPLE_OVERHEAD (at least 8 bytes),
MinimalTupleData (at least 10 bytes) and the tuple contents (assume 10
bytes). And the size of a pointer is 4 bytes.
So, if NTUP_PER_BUCKET is set to 10, about (28 * 10) / (28 * 10 + 4), or
98.6%, of work_mem is used to store tuples. If NTUP_PER_BUCKET is set to
1, about 28 / (28 + 4), or 87.5%, of work_mem is used to store tuples,
i.e. about 89% of the original.
As a result, changing NTUP_PER_BUCKET to 1 may increase the number of
batches by only about 13%. Is that enough to affect performance? Or
maybe the calculation cannot be done this simply.
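
To make the arithmetic concrete, here is a throwaway C program for the
calculation above (the 28-byte tuple and 4-byte pointer are my assumed
numbers from the previous paragraph, not measured values):

#include <stdio.h>

int
main(void)
{
    const double tuple = 28.0;  /* assumed: overhead + header + contents */
    const double pointer = 4.0; /* assumed 4-byte bucket-array slot */

    /* fraction of work_mem left for tuples:
     * (tuple * N) / (tuple * N + pointer) */
    double  f10 = (tuple * 10) / (tuple * 10 + pointer);  /* N = 10 */
    double  f1 = (tuple * 1) / (tuple * 1 + pointer);     /* N = 1 */

    printf("N=10: %.1f%% of work_mem holds tuples\n", f10 * 100); /* 98.6 */
    printf("N= 1: %.1f%% of work_mem holds tuples\n", f1 * 100);  /* 87.5 */
    printf("batch increase: %.0f%%\n", (f10 / f1 - 1) * 100);     /* 13 */
    return 0;
}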

Besides, we have larger main memory now. If we set work_mem larger, the
extra-batch effect introduced by a smaller NTUP_PER_BUCKET value would
be reduced, wouldn't it?

I have read earlier discussion about NTUP_PER_BUCKET. It seems that if
we changed NTUP_PER_BUCKET to 50 or even larger, performance wouldn't be
much worse, because every tuple in a bucket's chain stores its hash
value, so having more tuples in a bucket just adds some comparisons of
two integers (see the sketch below). So is the converse also true, that
making it smaller would not gain us much? Is that one of the reasons it
is defined as 10?
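
To illustrate, here is a simplified sketch of the per-bucket scan,
loosely modeled on ExecScanHashBucket() in nodeHash.c (the struct and
function names are my own simplification, not the actual source):

#include <stddef.h>

typedef struct HashTupleSketch
{
    struct HashTupleSketch *next;   /* next entry in the bucket's chain */
    unsigned int hashvalue;         /* full hash stored with the tuple */
    /* ... minimal tuple data would follow ... */
} HashTupleSketch;

/*
 * Walk one bucket's chain.  Each extra chain member mostly costs one
 * cheap integer comparison; only tuples whose stored hash equals the
 * probe hash go on to the expensive join-qual check.
 */
static HashTupleSketch *
scan_bucket(HashTupleSketch *chain, unsigned int probe_hash)
{
    for (HashTupleSketch *t = chain; t != NULL; t = t->next)
    {
        if (t->hashvalue == probe_hash)
            return t;       /* candidate; join quals still to be checked */
    }
    return NULL;
}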

By the way, this is my first time discussing in an open source
community. Thank you all for the replies and the discussion. Thanks.




