Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-23 Thread Aleksander Alekseev
> Thanks!  I can't think of anything else to worry about with regards to
> that version, so I have committed it.
> 

Thanks, Robert. And thanks to everyone who contributed to this patch.


-- 
Best regards,
Aleksander Alekseev
http://eax.me/




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-23 Thread Robert Haas
On Wed, Mar 23, 2016 at 5:49 AM, Aleksander Alekseev wrote:
> I still believe that blindly optimizing 1% of the code, without considering
> the possible side effects this optimization can bring (different data
> alignment, some additional machine instructions - just to name a few) and
> having no way to measure those side effects, is a bad idea. But I also
> admit the possibility that I am somehow wrong about this. So I rewrote this
> patch once again :'( the way you suggested (without that alignment-related
> hack I tried, it's just too ugly). I also attached the original
> hashhdr-rmh.patch so that both patches are in one message and are easier
> to find in this thread.

Thanks!  I can't think of anything else to worry about with regards to
that version, so I have committed it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-23 Thread Aleksander Alekseev
> > I have a strong feeling that we are just wasting our time here.
> 
> That is possible.  However, I would like it if you would give me the
> benefit of the doubt and assume that, if I seem to be more cautious
> than you would be were you a committer, there might possibly be some
> good reasons for that.  The fact is that, despite being more cautious
> than some people think I should be, I still manage to introduce quite
> a number of bugs via the patches I commit - see the thread 'Missing
> rows with index scan when collation is not "C"' on pgsql-bugs for just
> the very latest example of that.  Nobody thinks that will happen with
> *their* patch, of course, but it does all the same.

Oh, that explains a lot! You see, I couldn't figure out what was happening.
The patch was discussed and reviewed a long time ago, everyone seemed
reasonably happy with it, etc. But for some reason it was ignored for
weeks and no one said why. Now it makes sense.

I should probably mention that this patch was merged into the PostgresPro
build a few months ago. Several of our clients are already using this code
in production environments (guess who discovered this issue and wanted
it fixed). There have been no complaints so far.

> I'd still like an answer to the question of why this helps so much
> when there must be huge amounts of false sharing between the different
> mutexes.  Maybe it doesn't matter, though.

Well, the reason is that this is no longer the bottleneck. This code is
executed perhaps 1% of the time, and the other 99% is spent elsewhere.
There is false sharing, but it's not as expensive as the other 99% of the
code. Thus optimizing this 1% any further doesn't give a noticeable
performance improvement.

I still believe that blindly optimizing 1% of the code, without considering
the possible side effects this optimization can bring (different data
alignment, some additional machine instructions - just to name a few) and
having no way to measure those side effects, is a bad idea. But I also
admit the possibility that I am somehow wrong about this. So I rewrote this
patch once again :'( the way you suggested (without that alignment-related
hack I tried, it's just too ugly). I also attached the original
hashhdr-rmh.patch so that both patches are in one message and are easier
to find in this thread.

If there are any other questions or doubts regarding these patches,
please don't hesitate to ask.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..06c413c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,8 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/* Number of freelists to be used for a partitioned hash table. */
+#define NUM_FREELISTS			32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -128,12 +130,17 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * The freelist can become a point of contention on high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nentries count, instead of just a single one.
+	 *
+	 * If the hash table is not partitioned, only nentries[0] and freeList[0]
+	 * are used, and spinlocks are not used at all.
+	 */
+	slock_t		mutex[NUM_FREELISTS];		/* array of spinlocks */
+	long		nentries[NUM_FREELISTS];	/* number of entries */
+	HASHELEMENT *freeList[NUM_FREELISTS]; /* lists of free elements */
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +173,9 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define FREELIST_IDX(hctl, hashcode) \
+	(IS_PARTITIONED(hctl) ? hashcode % NUM_FREELISTS : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
  * has its own copy (OK since no fields change at runtime)
@@ -219,10 +229,10 @@ static long hash_accesses,
 

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-22 Thread Robert Haas
On Mon, Mar 21, 2016 at 4:48 AM, Aleksander Alekseev wrote:
>> This is the point where I think I am missing something about the patch.
>> As far as I can understand, it uses the same freelist index
>> (freelist_idx) for allocating and putting back the entry, so the chance
>> of an increment in one list and a decrement in another exists only when
>> the value of freelist_idx is calculated differently for the same input.
>> Is that so, or is there something else in the patch that I am missing?
>
> You are right, nentries _can't_ be negative unless we are using getpid()
> for calculating freelist_idx, since the same index of nentries[] is used
> when we add (increment) and remove (decrement) an element to/from the
> hash table. The fact that we also borrow elements from other freelists
> when there are no more elements in our freelist doesn't change anything.

Ah.  OK, I missed that.

> Once again I suggest we merge this patch already:
>
> http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com
>
> I have a strong feeling that we are just wasting our time here.

That is possible.  However, I would like it if you would give me the
benefit of the doubt and assume that, if I seem to be more cautious
than you would be were you a committer, there might possibly be some
good reasons for that.  The fact is that, despite being more cautious
than some people think I should be, I still manage to introduce quite
a number of bugs via the patches I commit - see the thread 'Missing
rows with index scan when collation is not "C"' on pgsql-bugs for just
the very latest example of that.  Nobody thinks that will happen with
*their* patch, of course, but it does all the same.

I'd still like an answer to the question of why this helps so much
when there must be huge amounts of false sharing between the different
mutexes.  Maybe it doesn't matter, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-21 Thread Aleksander Alekseev
> This is the point where I think I am missing something about the patch.
> As far as I can understand, it uses the same freelist index
> (freelist_idx) for allocating and putting back the entry, so the chance
> of an increment in one list and a decrement in another exists only when
> the value of freelist_idx is calculated differently for the same input.
> Is that so, or is there something else in the patch that I am missing?

You are right, nentries _can't_ be negative unless we are using getpid()
for calculating freelist_idx, since the same index of nentries[] is used
when we add (increment) and remove (decrement) an element to/from the
hash table. The fact that we also borrow elements from other freelists
when there are no more elements in our freelist doesn't change anything.
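To make the invariant concrete, a hypothetical sketch reusing the types from the C sketch earlier in this thread: both directions derive the index from the same hash value, so whatever list was charged on insert is credited on delete.

```
/* Insert and delete paths for one key. Because freelist_idx() is a pure
 * function of the hash value, both calls below land on the same slot,
 * so its nentries counter can never drop below zero. */
static void
account_enter(HashHeader *hdr, uint32_t hashcode)
{
	int idx = freelist_idx(hdr, hashcode);  /* e.g. hashcode % NUM_FREELISTS */

	hdr->freelists[idx].nentries++;         /* taken under freelists[idx].lock */
}

static void
account_remove(HashHeader *hdr, uint32_t hashcode)
{
	int idx = freelist_idx(hdr, hashcode);  /* identical input, identical idx */

	hdr->freelists[idx].nentries--;         /* decrements the same counter */
}
```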

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition.  So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

This is an interesting idea. Still, I strongly disagree that it should
be implemented in this particular patch or discussed in this particular
thread. Such a change deserves separate research and discussion, since it
has nothing to do with performance, and since we agreed that the "change
LOG2_NUM_LOCK_PARTITIONS value" and "change the calling convention for
ShmemInitHash()" patches should be implemented separately. I added it to
my TODO list.

Once again I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com

I have a strong feeling that we are just wasting our time here.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-20 Thread Amit Kapila
On Mon, Mar 21, 2016 at 5:12 AM, Robert Haas  wrote:
>
> On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila  wrote:
> > On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas  wrote:
> >> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila  wrote:
> >> > In theory, can't nentries overflow without the patch as well, after
> >> > running for a very long time?  I think with the patch it is more
> >> > prone to overflow because we start borrowing from other free lists
> >> > as well.
> >>
> >> Uh, I don't think so.  Without the patch, there is just one entries
> >> counter and it goes up and down.  How would it ever overflow?
> >
> > I thought it could overflow because we haven't kept any upper limit on
> > incrementing it unless memory runs out (of course, that is just a
> > theoretical assumption, as the decrements will keep the number in
> > control). So are you thinking about the risk of overflow with the patch
> > because we have to use the sum of all the nentries from all the arrays
> > for the total, or is there anything else that makes you think that
> > changing nentries into arrays of nentries makes it prone to overflow?
>
> Well, I mean, perhaps nentries could overflow if you had more than
> 2^32 elements, but I'm not even positive we support that.  If you
> assume a fixed table with a million entries, the nentries value can
> vary only between 0 and a million.  But now split that into a bunch of
> separate counters.  The increment when you allocate an entry and the
> decrement when you put one back don't have to hit the same bucket,
>

This is the point where I think I am missing something about the patch.
As far as I can understand, it uses the same freelist index (freelist_idx)
for allocating and putting back the entry, so the chance of an increment
in one list and a decrement in another exists only when the value of
freelist_idx is calculated differently for the same input. Is that so, or
is there something else in the patch that I am missing?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-20 Thread Robert Haas
On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila  wrote:
> On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas  wrote:
>> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila  wrote:
>> > In theory, can't nentries overflow without the patch as well, after
>> > running for a very long time?  I think with the patch it is more prone
>> > to overflow because we start borrowing from other free lists as well.
>>
>> Uh, I don't think so.  Without the patch, there is just one entries
>> counter and it goes up and down.  How would it ever overflow?
>
> I thought it could overflow because we haven't kept any upper limit on
> incrementing it unless memory runs out (of course, that is just a
> theoretical assumption, as the decrements will keep the number in control),
> so are you thinking about the risk of overflow with the patch because we
> have to use the sum of all the nentries from all the arrays for the total,
> or is there anything else that makes you think that changing nentries into
> arrays of nentries makes it prone to overflow?

Well, I mean, perhaps nentries could overflow if you had more than
2^32 elements, but I'm not even positive we support that.  If you
assume a fixed table with a million entries, the nentries value can
vary only between 0 and a million.  But now split that into a bunch of
separate counters.   The increment when you allocate an entry and the
decrement when you put one back don't have to hit the same bucket, so
I'm not sure there's anything that prevents the counter for one bucket
from getting arbitrarily large and the counter for another bucket
getting arbitrarily small while still summing to a value between 0 and
a million.
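A runnable toy demonstration of that drift (ordinary C, nothing from dynahash): one live entry is repeatedly "allocated" against one counter and "released" against another, so the individual counters run away while their sum stays correct.

```
#include <stdio.h>

int main(void)
{
	long nentries[2] = {0, 0};

	/* Each iteration allocates an entry accounted to list 0 and frees
	 * an entry accounted to list 1 -- the mismatch described above. */
	for (long i = 0; i < 1000000; i++)
	{
		nentries[0]++;
		nentries[1]--;
	}

	/* The individual counters have drifted arbitrarily far apart... */
	printf("list 0: %ld, list 1: %ld\n", nentries[0], nentries[1]);
	/* ...yet their sum is still the true entry count (0 here). */
	printf("sum: %ld\n", nentries[0] + nentries[1]);
	return 0;
}
```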

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-20 Thread Amit Kapila
On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas  wrote:
>
> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila  wrote:
> > In theory, can't nentries overflow without the patch as well, after
> > running for a very long time?  I think with the patch it is more prone
> > to overflow because we start borrowing from other free lists as well.
>
> Uh, I don't think so.  Without the patch, there is just one entries
> counter and it goes up and down.  How would it ever overflow?
>

I thought it could overflow because we haven't kept any upper limit on
incrementing it unless memory runs out (of course, that is just a
theoretical assumption, as the decrements will keep the number in control),
so are you thinking about the risk of overflow with the patch because we
have to use the sum of all the nentries from all the arrays for the total,
or is there anything else that makes you think that changing nentries into
arrays of nentries makes it prone to overflow?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-19 Thread Robert Haas
On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila  wrote:
> In theory, can't nentries overflow without the patch as well, after running
> for a very long time?  I think with the patch it is more prone to overflow
> because we start borrowing from other free lists as well.

Uh, I don't think so.  Without the patch, there is just one entries
counter and it goes up and down.  How would it ever overflow?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-18 Thread Amit Kapila
On Sat, Mar 19, 2016 at 1:41 AM, Robert Haas  wrote:
>
> On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev
>  wrote:
> >
> > So answering your question - it turned out that we _can't_ reduce
> > NUM_FREELISTS this way.
>
> That's perplexing.  I would have expected that with all of the mutexes
> packed in back-to-back like this, we would end up with a considerable
> amount of false sharing.  I don't know why it ever helps to have an
> array of bytes all in the same cache line of which each one is a
> heavily-trafficked mutex.   Anybody else have a theory?
>
> One other thing that concerns me mildly is the way we're handling
> nentries.  It should be true, with this patch, that the individual
> nentries sum to the right value modulo 2^32.  But I don't think
> there's any guarantee that the values are positive any more, and in
> theory after running long enough one of them could manage to overflow
> or underflow.
>

In theory, can't nentries overflow without the patch as well, after running
for a very long time?  I think with the patch it is more prone to overflow
because we start borrowing from other free lists as well.

> So at a very minimum I think we need to remove the
> Assert() that the value is not negative.  But really, I wonder if we
> shouldn't rework things a little more than that.
>
> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition.  So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).
>

To me, your idea sounds much better than the current code in terms of
understanding the free list concept as well.  So, +1 for changing the code
in this way.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-18 Thread Robert Haas
On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev wrote:
>> I am not sure if this is exactly what was suggested by Robert, so it is
>> not straightforward to see whether his suggestion would allow us to
>> use NUM_FREELISTS as 8 rather than 32.  I think instead of trying to
>> use FREELISTBUFF, why not do it as Robert suggested and try
>> different values of NUM_FREELISTS?
>
> Well, what I did is in fact a somewhat more general solution than Robert
> suggested. At first I just joined the freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see the FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup compared with the previous approach.
>
> Then I tried to manually control the size of every array item (see the
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%), but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.

That's perplexing.  I would have expected that with all of the mutexes
packed in back-to-back like this, we would end up with a considerable
amount of false sharing.  I don't know why it ever helps to have an
array of bytes all in the same cache line of which each one is a
heavily-trafficked mutex.   Anybody else have a theory?

One other thing that concerns me mildly is the way we're handling
nentries.  It should be true, with this patch, that the individual
nentries sum to the right value modulo 2^32.  But I don't think
there's any guarantee that the values are positive any more, and in
theory after running long enough one of them could manage to overflow
or underflow.  So at a very minimum I think we need to remove the
Assert() that the value is not negative.  But really, I wonder if we
shouldn't rework things a little more than that.

One idea is to jigger things so that we maintain a count of the total
number of entries that doesn't change except when we allocate, and
then for each freelist partition we maintain the number of entries in
that freelist partition.  So then the size of the hash table, instead
of being sum(nentries) is totalsize - sum(nfree).
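A minimal sketch of that accounting scheme (hypothetical names, not dynahash code): the allocation total only grows, each freelist counts the elements it currently holds, and the number of live entries is derived rather than maintained directly.

```
#define NUM_FREELISTS 32

typedef struct
{
	long total_allocated;       /* grows only, when new elements are carved out */
	long nfree[NUM_FREELISTS];  /* elements currently sitting on each freelist */
} HashCounts;

/* Live entries = everything ever allocated minus everything now free.
 * Each nfree[i] is guarded by its own freelist's spinlock, and it can
 * never go negative because a list cannot hand out more than it holds. */
static long
hash_entry_count(const HashCounts *c)
{
	long nfree_sum = 0;

	for (int i = 0; i < NUM_FREELISTS; i++)
		nfree_sum += c->nfree[i];
	return c->total_allocated - nfree_sum;
}
```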

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[ P.S. Sorry it's taken me a while to circle back around to this. ]




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-03 Thread Amit Kapila
On Thu, Mar 3, 2016 at 1:50 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:

> > Won't it always use the same freelist to remove and add the entry from
> > the freelist, since in both cases it will calculate freelist_idx in the
> > same way?
>
> No. If "our" freelist is empty when we try to remove an item from it, we
> borrow an item from another freelist.


I think the way the patch works is: if the indicated freelist is empty, then
it tries to allocate new elements in that list, and if it can't allocate,
then it tries to borrow from another freelist; in both cases the element to
be removed from the freelist is considered to be an element of the indicated
freelist (basically, it always increments nentries[freelist_idx]).


> This borrowed item will then be
> returned to "our" freelist instead of the original one. Without some
> sort of additional logic there is no way to figure out which freelist
> was the original.
>
>
Since the patch always operates on nentries[freelist_idx], it seems to me
that in both cases (remove, add) the element is considered to belong to the
same freelist.  Have you faced this problem of negative entries in any of
your tests? If so, please share the test, so that I can also understand the
scenario.



> Generally speaking, a negative nentries value is not something that
> couldn't be solved. But I would like to remind you that in this context
> we are discussing a quick and dirty solution created for benchmarking
> purposes in the first place.
>
>
I thought negative entries could be a problem for the patch, as indicated by
Robert [1], but maybe I am missing something.



> > If you are not convinced, then let's leave it to the committer unless
> > somebody else wants to try that suggestion.
>
> Agreed. Frankly, I'm tired of rewriting this patch over and over
> again. So I would like to avoid rewriting it once more unless there is
> a clear indication that we would gain something. Benchmarks show that
> this is not the case, so it's only a matter of taste and intuition.


I can understand your feeling here and agree with you that it is a matter
of taste.  However, quite often if we change the patch according to the
committer's taste, the chances of the patch getting accepted early are
better; but I am not sure if that applies here, so feel free to go the way
you think is better.

[1] -
http://www.postgresql.org/message-id/CA+TgmoacVsdcY=qn0do7nok7w2-ssqz3kr2y84bavifckqd...@mail.gmail.com


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-03 Thread Aleksander Alekseev
> Won't it always use the same freelist to remove and add the entry from
> the freelist, since in both cases it will calculate freelist_idx in the
> same way?

No. If "our" freelist is empty when we try to remove an item from it, we
borrow an item from another freelist. This borrowed item will then be
returned to "our" freelist instead of the original one. Without some sort
of additional logic there is no way to figure out which freelist was the
original.
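A hypothetical sketch of that borrowing path, reusing the types from the sketch near the top of this thread (real dynahash differs in details): the allocation side takes from whichever list has elements, but the release side always gives back to the caller's own list, so the donor is never repaid.

```
#include <stddef.h>

/* "my" is the freelist chosen for this backend. If it is empty we steal
 * from a neighbour, but on release the element always goes back to "my"
 * list -- nothing remembers which list donated it. */
static HashElement *
get_free_element(HashHeader *hdr, int my)
{
	for (int i = 0; i < NUM_FREELISTS; i++)
	{
		int           idx  = (my + i) % NUM_FREELISTS;  /* start with our own */
		FreeListSlot *slot = &hdr->freelists[idx];

		if (slot->freeList != NULL)         /* under slot->lock in reality */
		{
			HashElement *e = slot->freeList;

			slot->freeList = e->link;
			return e;                       /* donor index is forgotten here */
		}
	}
	return NULL;                            /* all lists are empty */
}

static void
put_free_element(HashHeader *hdr, int my, HashElement *e)
{
	FreeListSlot *slot = &hdr->freelists[my];   /* always our own list */

	e->link = slot->freeList;
	slot->freeList = e;
}
```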

Generally speaking, a negative nentries value is not something that
couldn't be solved. But I would like to remind you that in this context we
are discussing a quick and dirty solution created for benchmarking
purposes in the first place.

> If you are not convinced, then let's leave it to the committer unless
> somebody else wants to try that suggestion.

Agreed. Frankly, I'm tired of rewriting this patch over and over again. So
I would like to avoid rewriting it once more unless there is a clear
indication that we would gain something. Benchmarks show that this is not
the case, so it's only a matter of taste and intuition. We know from
experience that both are bad advisers in optimization matters.

I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com

... and get done with it. Clearly this patch is good enough ---
reviewed, tested, discussed, with proven performance benefits. When and
if dynahash becomes a bottleneck again, nothing will stop us from
optimizing it once more for the hardware we will have by then. Who knows
what that hardware will be like in 5-10 years? Don't we have better
things to do now than argue about imaginary benchmarks and taste?

Let's see what the committer's opinion on this is.




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-01 Thread Amit Kapila
On Tue, Mar 1, 2016 at 8:13 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:
>
> Hello, Amit
>
> > I am not sure if this is exactly what was suggested by Robert, so it is
> > not straightforward to see whether his suggestion would allow us to
> > use NUM_FREELISTS as 8 rather than 32.  I think instead of trying to
> > use FREELISTBUFF, why not do it as Robert suggested and try
> > different values of NUM_FREELISTS?
>
> Well, what I did is in fact a somewhat more general solution than Robert
> suggested. At first I just joined the freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see the FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup compared with the previous approach.
>

Okay, now I can understand the data better and I think your tests indicate
that it is better to keep NUM_FREELISTS as 32.

>
> Then I tried to manually control the size of every array item (see the
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%), but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.
>
> Also, I don't believe that +0.3% TPS in a synthetic benchmark that
> doesn't represent any realistic workload is worth the additional code
> complexity.
>

I think the HASHHDR data structure looks simpler, but the code looks more
complex, especially with the addition of FREELISTBUFF for cache-line
purposes.  I think you might want to try exactly what has been suggested
and see if the code looks better that way, but if you are not convinced,
then let's leave it to the committer unless somebody else wants to try
that suggestion.

> > In which case do you think entries can go negative?  I think
> > nentries is incremented and decremented in the same way as without the
> > patch, so I am not getting what can make it go negative.
>
> In this context we are discussing a quick and dirty fix: "what would
> happen if FREELIST_IDX were implemented as getpid() % NUM_FREELISTS
> instead of hash % NUM_FREELISTS". The code is:
>
> int freelist_idx = FREELIST_IDX(hctl, hashvalue);
>
> // ...
>
> switch (action)
> {
>
> // ...
>
>   case HASH_REMOVE:
> if (currBucket != NULL)
> {
>   if (IS_PARTITIONED(hctl))
> SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));
>
>   // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
>   FREELIST(hctl, freelist_idx).nentries--;
>
>   /* remove record from hash bucket's chain. */
>   *prevBucketPtr = currBucket->link;
>
> // ...
>
> No matter which freelist was used when the element was added to the hash
> table, we always try to return the free memory to freelist number
> getpid() % NUM_FREELISTS and decrease the number of elements in the hash
> table using the corresponding nentries field.
>

Won't it always use the same freelist to remove and add the entry from the
freelist, since in both cases it will calculate freelist_idx in the same way?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-01 Thread Aleksander Alekseev
Hello, Amit

> I am not sure if this is exactly what was suggested by Robert, so it
> is not straightforward to see whether his suggestion would allow us to
> use NUM_FREELISTS as 8 rather than 32.  I think instead of trying to
> use FREELISTBUFF, why not do it as Robert suggested and try
> different values of NUM_FREELISTS?

Well, what I did is in fact a somewhat more general solution than Robert
suggested. At first I just joined the freeList and mutex arrays into one
array and tried different NUM_FREELISTS values (see the FREELISTS column).
That was Robert's idea. Unfortunately it didn't give any considerable
speedup compared with the previous approach.

Then I tried to manually control the size of every array item (see the
different SIZE values). By making it larger we can put every array item
into a separate cache line. This approach helped a bit in terms of TPS
(before: +33.9%, after: +34.2%), but only if NUM_FREELISTS remains the
same (32).

So answering your question - it turned out that we _can't_ reduce
NUM_FREELISTS this way.

Also, I don't believe that +0.3% TPS in a synthetic benchmark that
doesn't represent any realistic workload is worth the additional code
complexity.

> In which case do you think entries can go negative?  I think
> nentries is incremented and decremented in the same way as without the
> patch, so I am not getting what can make it go negative.

In this context we are discussing a quick and dirty fix: "what would
happen if FREELIST_IDX were implemented as getpid() % NUM_FREELISTS
instead of hash % NUM_FREELISTS". The code is:

int freelist_idx = FREELIST_IDX(hctl, hashvalue); 

// ...

switch (action)
{

// ...

  case HASH_REMOVE:
if (currBucket != NULL)
{
  if (IS_PARTITIONED(hctl))
SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));

  // Assert(FREELIST(hctl, freelist_idx).nentries > 0); 
  FREELIST(hctl, freelist_idx).nentries--;

  /* remove record from hash bucket's chain. */
  *prevBucketPtr = currBucket->link;

// ...

No matter which freelist was used when the element was added to the hash
table, we always try to return the free memory to freelist number
getpid() % NUM_FREELISTS and decrease the number of elements in the hash
table using the corresponding nentries field. For this reason nentries
could be negative. Still, the sum of all nentries remains valid.
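A runnable toy example of that effect (hypothetical PIDs, NUM_FREELISTS shrunk to 4): the inserting and removing backends charge different counters, so one counter goes negative even though the total stays correct.

```
#include <assert.h>
#include <stdio.h>

#define NUM_FREELISTS 4

int main(void)
{
	long nentries[NUM_FREELISTS] = {0};

	/* Backend with pid 4001 inserts a key: 4001 % 4 == 1 */
	nentries[4001 % NUM_FREELISTS]++;

	/* Backend with pid 4002 removes the same key: 4002 % 4 == 2 */
	nentries[4002 % NUM_FREELISTS]--;

	printf("nentries[2] = %ld\n", nentries[2]);  /* -1: the Assert would fire */

	long sum = 0;
	for (int i = 0; i < NUM_FREELISTS; i++)
		sum += nentries[i];
	assert(sum == 0);                            /* the total is still correct */
	return 0;
}
```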




I realize that this thread is quite long already and that it's been a
while since the previous commitfest ended. So I would like to provide a
short summary of this thread from my perspective:

1. The latest version of the patch is here:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrtymq5hdqdarf5j7o+abdowqhe...@mail.gmail.com

It's reviewed, tested, well-commented, and apparently nobody has strong
objections to it.

2. Robert suggested a few possible improvements. Experiments showed
that in the best case the resulting patches are as good as the patch from
item 1, but the code becomes considerably more complex.

3. A number of further changes (changing the calling convention for
ShmemInitHash(), changing the LOG2_NUM_LOCK_PARTITIONS constant) will be
discussed separately when and if the patch from this thread is applied.

> 
> 
> 
> >
> > > I am however wondering if it would be better to set the freelist
> > > affinity based on something other than the hash value, like say the
> > > PID, so that the same process doesn't keep switching to a different
> > > freelist for every buffer eviction.
> >
> > Also I tried to use PID to determine freeList number:
> >
> > ```
> > #include <sys/types.h>
> > #include <unistd.h>
> >
> > ...
> >
> > #define FREELIST_IDX(hctl, hashcode) \
> >   (IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
> >
> >
> ...
> >
> > // now nentries could be negative in this case
> > // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
> >
> >
> In which case do you think entries can go negative?  I think
> nentries is incremented and decremented in the same way as without the
> patch, so I am not getting what can make it go negative.
> 
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com





Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-03-01 Thread Amit Kapila
On Fri, Feb 12, 2016 at 9:04 PM, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote:

> Hello, Robert
>
> > It also strikes me that it's probably quite likely that slock_t
> > mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> > For example, on a system where slock_t is just one byte, most likely
> > all of those mutexes are going to be in the same cache line, which
> > means you're going to get a LOT of false sharing.  It seems like it
> > would be sensible to define a new struct that contains an slock_t, a
> > long, and a HASHELEMENT *, and then make an array of those structs.
> > That wouldn't completely eliminate false sharing, but it would reduce
> > it quite a bit.  My guess is that if you did that, you could reduce
> > the number of freelists to 8 or less and get pretty much the same
> > performance benefit that you're getting right now with 32. And that,
> > too, seems likely to be good for single-client performance.
>
> I experimented for a while trying to fit every spinlock in a separate
> cache line. Indeed we can gain some speedup this way. Here are
> benchmark results on 12-core server for NUM_LOCK_PARTITIONS = 32 (in
> this case difference is more notable):
>
> | FREELISTS | SIZE =  32 | SIZE =  64 | SIZE = 128 | SIZE = 256 |
> |---|||||
> | 4 |   +25.4%   |   +28.7%   |   +28.4%   |   +28.3%   |
> | 8 |   +28.2%   |   +29.4%   |   +31.7%   |   +31.4%   |
> |16 |   +29.9%   |   +32.6%   |   +31.6%   |   +30.8%   |
> |32 |   +33.7%   |   +34.2%   |   +33.6%   |   +32.6%   |
>
> Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
> align the FREELIST structure, see the attached patch).


I am not sure if this is exactly what was suggested by Robert, so it
is not straightforward to see whether his suggestion would allow us to
use NUM_FREELISTS as 8 rather than 32.  I think instead of trying to
use FREELISTBUFF, why not do it as Robert suggested and try
different values of NUM_FREELISTS?



>
> > I am however wondering if it would be better to set the freelist
> > affinity based on something other than the hash value, like say the
> > PID, so that the same process doesn't keep switching to a different
> > freelist for every buffer eviction.
>
> Also I tried to use PID to determine freeList number:
>
> ```
> #include <sys/types.h>
> #include <unistd.h>
>
> ...
>
> #define FREELIST_IDX(hctl, hashcode) \
>   (IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
>
>
> ...
>
> // now nentries could be negative in this case
> // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
>
>
In which case do you think entries can go negative?  I think nentries
is incremented and decremented in the same way as without the patch, so
I am not getting what can make it go negative.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-12 Thread Aleksander Alekseev
Hello, Robert

> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing.  It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit.  My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.

I experimented for a while trying to fit every spinlock in a separate
cache line. Indeed we can gain some speedup this way. Here are
benchmark results on 12-core server for NUM_LOCK_PARTITIONS = 32 (in
this case difference is more notable):

| FREELISTS | SIZE =  32 | SIZE =  64 | SIZE = 128 | SIZE = 256 |
|---|||||
| 4 |   +25.4%   |   +28.7%   |   +28.4%   |   +28.3%   |
| 8 |   +28.2%   |   +29.4%   |   +31.7%   |   +31.4%   |
|16 |   +29.9%   |   +32.6%   |   +31.6%   |   +30.8%   |
|32 |   +33.7%   |   +34.2%   |   +33.6%   |   +32.6%   |

Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
align the FREELIST structure, see the attached patch). The cache line on
this CPU is 64 bytes according to /sys/devices/system/cpu/cpu0/cache/index0/*

I would like to recall that without these changes we had the following
picture (please note the "NLP = 32" column):

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

 DMN | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-|--|--|--|--
   8 |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%  
  16 |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%  
  32 |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%  
  64 |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%  
 128 |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%  

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

So it's +34.2% vs. +33.9%, and the number of freelists remains the same.

> I am however wondering if it would be better to set the freelist
> affinity based on something other than the hash value, like say the
> PID, so that the same process doesn't keep switching to a different
> freelist for every buffer eviction.

Also I tried to use PID to determine freeList number:

```
#include <sys/types.h>
#include <unistd.h>

...

#define FREELIST_IDX(hctl, hashcode) \
  (IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)

...

// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);

```

Unfortunately this approach gives +33.9% TPS in the best case. Also there
is a possibility that nentries will overflow. So far I don't have a
clear idea how to solve this issue effectively.

I'm not sure further improvements are worth the effort.

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..4d9ed3e 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,12 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/* Number of freelists to be used for a partitioned hash table. */
+#define NUM_FREELISTS			4
+
+// TODO: comment. Should be power of 2.
+// #define FREELIST_IDX_STEP 8
+#define FREELIST_BUFFER_SIZE 32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -118,6 +124,20 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+// TODO: re-check comments
+// TODO: pgindent
+
+// TODO: comment!
+typedef struct
+{
+	slock_t mutex;	/* a spinlock */
+	long nentries;	/* number of entries */
+	HASHELEMENT* freeList; /* lists of free elements */
+} FREELIST;
+
+// TODO: comment
+typedef struct { uint8 x[FREELIST_BUFFER_SIZE]; } FREELISTBUFF;
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -128,12 +148,16 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-12 Thread Robert Haas
On Thu, Feb 11, 2016 at 3:26 PM, Robert Haas  wrote:
> On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas  wrote:
>> The fact that InitLocks() doesn't do this has been discussed before
>> and there's no consensus on changing it.  It is, at any rate, a
>> separate issue.  I'll go through the rest of this patch again now.
>
> I did a little bit of cleanup on this patch and the results are
> attached.  I also did some benchmarking of this version.  I tested
> this on two systems, in each case using five-minute, read-only pgbench
> runs at scale factor 3000, varying the client count and taking the
> median of three runs.  First, I tested it on hydra, a 2-socket,
> 16-processor, 64-thread POWER box.  This is a community resource
> hosted at OSUOSL.  Second, I tested it on cthulhu, an 8-socket,
> 64-processor, 128-thread x86_64 box.  This is an EnterpriseDB
> resource.  Here are the results:
>
> hydra, master vs. patched
> 1 client: 8042.872109 vs. 7839.587491 (-2.5%)
> 64 clients: 213311.852043 vs. 214002.314071 (+0.3%)
> 96 clients: 219551.356907 vs. 221908.397489 (+1.1%)
> 128 clients: 210279.022760 vs. 217974.079171 (+3.7%)
>
> cthulhu, master vs. patched
> 1 client: 3407.705820 vs. 3645.129360 (+7.0%)
> 64 clients: 88667.681890 vs. 82636.914343 (-6.8%)
> 96 clients: 79303.750250 vs. 105442.993869 (+33.0%)
> 128 clients: 74684.510668 vs. 120984.938371 (+62.0%)
>
> Obviously, the high-client count results on cthulhu are stellar.  I'm
> *assuming* that the regressions are just random variation.  I am
> however wondering if it would be better to set the freelist affinity
> based on something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.  It also strikes me that it's probably quite likely
> that slock_t mutex[NUM_FREELISTS] is a poor way to lay out this data
> in memory.  For example, on a system where slock_t is just one byte,
> most likely all of those mutexes are going to be in the same cache
> line, which means you're going to get a LOT of false sharing.  It
> seems like it would be sensible to define a new struct that contains
> an slock_t, a long, and a HASHELEMENT *, and then make an array of
> those structs.  That wouldn't completely eliminate false sharing, but
> it would reduce it quite a bit.  My guess is that if you did that, you
> could reduce the number of freelists to 8 or less and get pretty much
> the same performance benefit that you're getting right now with 32.
> And that, too, seems likely to be good for single-client performance.

One other thing: I don't see what prevents the nentries count for a
particular freelist from going negative, which would subsequently
cause an Assertion failure.  I wonder if we should restructure the
code so that we keep track of the total number of elements allocated
and the number on each freelist, so that the count of elements =
elements allocated - sum(elements on each freelist).  This would
require a little more code churn but it seems that the result would be
more clearly correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-11 Thread Robert Haas
On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas  wrote:
> The fact that InitLocks() doesn't do this has been discussed before
> and there's no consensus on changing it.  It is, at any rate, a
> separate issue.  I'll go through the rest of this patch again now.

I did a little bit of cleanup on this patch and the results are
attached.  I also did some benchmarking of this version.  I tested
this on two systems, in each case using five-minute, read-only pgbench
runs at scale factor 3000, varying the client count and taking the
median of three runs.  First, I tested it on hydra, a 2-socket,
16-processor, 64-thread POWER box.  This is a community resource
hosted at OSUOSL.  Second, I tested it on cthulhu, an 8-socket,
64-processor, 128-thread x86_64 box.  This is an EnterpriseDB
resource.  Here are the results:

hydra, master vs. patched
1 client: 8042.872109 vs. 7839.587491 (-2.5%)
64 clients: 213311.852043 vs. 214002.314071 (+0.3%)
96 clients: 219551.356907 vs. 221908.397489 (+1.1%)
128 clients: 210279.022760 vs. 217974.079171 (+3.7%)

cthulhu, master vs. patched
1 client: 3407.705820 vs. 3645.129360 (+7.0%)
64 clients: 88667.681890 vs. 82636.914343 (-6.8%)
96 clients: 79303.750250 vs. 105442.993869 (+33.0%)
128 clients: 74684.510668 vs. 120984.938371 (+62.0%)

Obviously, the high-client count results on cthulhu are stellar.  I'm
*assuming* that the regressions are just random variation.  I am
however wondering if it would be better to set the freelist affinity
based on something other than the hash value, like say the PID, so that the
same process doesn't keep switching to a different freelist for every
buffer eviction.  It also strikes me that it's probably quite likely
that slock_t mutex[NUM_FREELISTS] is a poor way to lay out this data
in memory.  For example, on a system where slock_t is just one byte,
most likely all of those mutexes are going to be in the same cache
line, which means you're going to get a LOT of false sharing.  It
seems like it would be sensible to define a new struct that contains
an slock_t, a long, and a HASHELEMENT *, and then make an array of
those structs.  That wouldn't completely eliminate false sharing, but
it would reduce it quite a bit.  My guess is that if you did that, you
could reduce the number of freelists to 8 or less and get pretty much
the same performance benefit that you're getting right now with 32.
And that, too, seems likely to be good for single-client performance.
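A sketch of the struct layout being suggested (hypothetical; a char stands in for slock_t, and the 64-byte line size is an assumption that varies by platform): grouping each list's mutex, count, and head together and padding to a full cache line keeps neighbouring mutexes from sharing one.

```
#define CACHE_LINE_SIZE 64          /* assumed; check the target CPU */

/* One freelist's state: the mutex, the counter it guards, and the list
 * head live together, so a backend working on this list touches only
 * this struct's cache line. */
typedef struct
{
	char  mutex;                    /* stand-in for slock_t */
	long  nentries;                 /* entries accounted to this freelist */
	void *freeList;                 /* chain of free HASHELEMENTs */
} FreeListCell;

/* Pad each cell to a full line so adjacent cells never share one --
 * that sharing is exactly the false sharing described above. */
typedef union
{
	FreeListCell cell;
	char         pad[CACHE_LINE_SIZE];
} PaddedFreeListCell;

/* Would replace "slock_t mutex[NUM_FREELISTS]" plus the parallel arrays. */
static PaddedFreeListCell freelists[8];
```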

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..06c413c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,8 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/* Number of freelists to be used for a partitioned hash table. */
+#define NUM_FREELISTS			32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -128,12 +130,17 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * The freelist can become a point of contention on high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nentries count, instead of just a single one.
+	 *
+	 * If the hash table is not partitioned, only nentries[0] and freeList[0]
+	 * are used, and spinlocks are not used at all.
+	 */
+	slock_t		mutex[NUM_FREELISTS];		/* array of spinlocks */
+	long		nentries[NUM_FREELISTS];	/* number of entries */
+	HASHELEMENT *freeList[NUM_FREELISTS]; /* lists of free elements */
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +173,9 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define FREELIST_IDX(hctl, hashcode) \
+	(IS_PARTITIONED(hctl) ? hashcode % NUM_FREELISTS : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
 has its own copy (OK since no fields change at runtime)

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-11 Thread Robert Haas
On Thu, Feb 11, 2016 at 9:11 AM, Aleksander Alekseev wrote:
>> Thanks, this looks way better.  Some more comments:
>>
>> - I don't think there's any good reason for this patch to change the
>> calling convention for ShmemInitHash().  Maybe that's a good idea and
>> maybe it isn't, but it's a separate issue from what this patch is
>> doing, and if we're going to do it at all, it should be discussed
>> separately.  Let's leave it out of this patch.
>>
>> - I would not provide an option to change the number of freelist
>> mutexes.  Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have
>> FREELIST_MUTEXES_NUM.  The value of 32 which you selected is fine with
>> me.
>>
>> - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an
>> independent change.  Again, leave it out of this patch and submit it
>> separately with its own benchmarking data if you want to argue for it.
>>
>> I think if you make these changes this patch will be quite a bit
>> smaller and in pretty good shape.
>>
>> Thanks for working on this.
>
> Here is a corrected patch. I decided to keep a small change in the
> InitLocks procedure (lock.c) since the ShmemInitHash description clearly
> states "For a table whose maximum size is certain, [init_size] should
> be equal to max_size". Also, I added two items to my TODO list, to send
> more patches as soon as (and if) this patch is applied.
>
> Thanks for your contribution to this patch.

The fact that InitLocks() doesn't do this has been discussed before
and there's no consensus on changing it.  It is, at any rate, a
separate issue.  I'll go through the rest of this patch again now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-11 Thread Aleksander Alekseev
Hello, Robert

> Thanks, this looks way better.  Some more comments:
> 
> - I don't think there's any good reason for this patch to change the
> calling convention for ShmemInitHash().  Maybe that's a good idea and
> maybe it isn't, but it's a separate issue from what this patch is
> doing, and if we're going to do it at all, it should be discussed
> separately.  Let's leave it out of this patch.
> 
> - I would not provide an option to change the number of freelist
> mutexes.  Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have
> FREELIST_MUTEXES_NUM.  The value of 32 which you selected is fine with
> me.
> 
> - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an
> independent change.  Again, leave it out of this patch and submit it
> separately with its own benchmarking data if you want to argue for it.
> 
> I think if you make these changes this patch will be quite a bit
> smaller and in pretty good shape.
> 
> Thanks for working on this.
> 

Here is a corrected patch. I decided to keep a small change in the
InitLocks procedure (lock.c) since the ShmemInitHash description clearly
states "For a table whose maximum size is certain, [init_size] should
be equal to max_size". Also, I added two items to my TODO list, to send
more patches as soon as (and if) this patch is applied.

Thanks for your contribution to this patch.

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e3e9599..5296e77 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,18 +374,10 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
-	 * Compute init/max size to request for lock hashtables.  Note these
-	 * calculations must agree with LockShmemSize!
-	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
-
-	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
 	 * information.
 	 */
@@ -393,16 +385,16 @@ InitLocks(void)
 	info.keysize = sizeof(LOCKTAG);
 	info.entrysize = sizeof(LOCK);
 	info.num_partitions = NUM_LOCK_PARTITIONS;
+	max_table_size = NLOCKENTS();
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
+	   max_table_size,
 	   max_table_size,
 	   &info,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
 	max_table_size *= 2;
-	init_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -414,7 +406,7 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
+		   max_table_size,
 		   max_table_size,
 		   &info,
  HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..1a3244a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,11 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/*
+ * Number of mutexes and respectively number of nentries and freeLists
+ * (see below). Should be power of 2.
+ */
+#define MUTEXES_NUM			32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -128,12 +133,23 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * There are two fields declared below: nentries and freeList. nentries
+	 * stores current number of entries in a hash table. freeList is a linked
+	 * list of free elements.
+	 *
+	 * To keep these fields consistent in a partitioned table we need to
+	 * synchronize access to them using a spinlock. But it turned out that a
+	 * single spinlock can create a bottleneck. To prevent lock contention an
+	 * array of MUTEXES_NUM spinlocks is used. Each spinlock protects one
+	 * element of nentries and freeList arrays.
+	 *
+	 * If hash table is not partitioned only nentries[0] and freeList[0]
+	 * are used.

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-10 Thread Robert Haas
On Wed, Feb 10, 2016 at 3:24 AM, Aleksander Alekseev wrote:
>> Basically, the burden for you to impose a new coding rule on everybody
>> who uses shared hash tables in the future is very high.
>
> I fixed the issue you described. The number of spinlocks doesn't depend
> on NUM_LOCK_PARTITIONS anymore and can be specified for each hash table
> on the calling side.

Thanks, this looks way better.  Some more comments:

- I don't think there's any good reason for this patch to change the
calling convention for ShmemInitHash().  Maybe that's a good idea and
maybe it isn't, but it's a separate issue from what this patch is
doing, and if we're going to do it at all, it should be discussed
separately.  Let's leave it out of this patch.

- I would not provide an option to change the number of freelist
mutexes.  Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have
FREELIST_MUTEXES_NUM.  The value of 32 which you selected is fine with
me.

- The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an
independent change.  Again, leave it out of this patch and submit it
separately with its own benchmarking data if you want to argue for it.

I think if you make these changes this patch will be quite a bit
smaller and in pretty good shape.

Thanks for working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-10 Thread Aleksander Alekseev
Hello, Robert

> Basically, the burden for you to impose a new coding rule on everybody
> who uses shared hash tables in the future is very high.

I fixed the issue you described. The number of spinlocks doesn't depend
on NUM_LOCK_PARTITIONS anymore and can be specified for each hash table
on the calling side.

I ran the benchmark described in the first message of this thread again.
Currently I don't have access to the same 60-core server, so I used a more
common 12-core server (24 with HT). According to this benchmark, the TPS
increase depends on NUM_LOCK_PARTITIONS and the default number of
spinlocks as follows:

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

 DMN | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
   8 |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%  
  16 |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%  
  32 |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%  
  64 |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%  
 128 |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%  

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

I realize this benchmark doesn't represent any real-world workload, so
for the attached patch I chose NUM_LOCK_PARTITIONS = DEFAULT_MUTEXES_NUM
= 32. It seems to be a reasonable compromise between the speedup according
to this "synthetic and meaningless in practice" benchmark and the number
of used locks, which could mean quite a lot in practice. Still, these
values can easily be changed at any moment.

Here are before/after benchmark results for this particular patch.


BEFORE

pgbench (default):

tps = 1295.798531 (including connections establishing)
tps = 1295.858295 (excluding connections establishing)

pgbench -f pgbench.sql:

tps = 1020.072172 (including connections establishing)
tps = 1020.116888 (excluding connections establishing)


AFTER

pgbench (default):

tps = 1299.369807 (including connections establishing)
tps = 1299.429764 (excluding connections establishing)

pgbench -f pgbench.sql:

tps = 1365.749333 (including connections establishing)
tps = 1365.814384 (excluding connections establishing)


So as I understand it, this patch solves the lock contention problem and
doesn't make anything else worse.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index dffc477..9a15a2a 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  &info,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   &info,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   &info, hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index e3e9599..fc20a67 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -374,18 +374,10 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-08 Thread Alvaro Herrera
I've closed this as returned-with-feedback.  Please resubmit once you
have found answers to the questions posed; from the submitted benchmark
numbers this looks very exciting, but it needs a bit more work.  You
don't necessarily have to agree with everything Robert says, but you
need to have well reasoned explanations for the points where you
disagree.

Thanks,

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-08 Thread Robert Haas
On Mon, Feb 8, 2016 at 5:39 AM, Aleksander Alekseev wrote:
>> So: do we have clear evidence that we need 128 partitions here, or
>> might, say, 16 work just fine?
>
> Yes, this number of partitions was chosen based on this benchmark (see
> "spinlock array" column):
>
> http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

It's very hard to understand what that email is trying to say.  It
uses terms like PN and AN without defining them.  There's certainly
not enough detail for somebody else to replicate what you did.

> In fact we can choose 16, 32, 64 or 128 partitions - any number works
> better than current implementation with a single spinlock. But
> according to this benchmark 128 partitions give twice as much TPS then
> 16 partitions, almost as if we didn't have any locks at all (see "no
> locks" column).

OK.

> Still I'm a bit concerned that 128 partitions require (128-16)*
> ( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional
> memory per hash table which is:
>
> (gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*))
> $1 = 1904
>
> ... bytes of shared memory on amd64.
>
> I'm not sure if this is considered OK. If it is, I suggest leaving the
> number 128. If it's not, perhaps we could agree on, say, 64 partitions
> as a compromise. The described benchmark doesn't represent any
> real-world workload anyway. I personally believe that we don't have so
> many small hash tables that 1.8 KB of additional shared memory per hash
> table would be an issue.

I don't particularly care about using slightly more shared memory per
hash table.

>> I don't really want to increase the number of lock manager partition
>> locks to 128 just for the convenience of this patch, and even less do
>> I want to impose the requirement that every future shared hash table
>> must use an equal number of partitions.
>
> May I ask why? Does it create a real problem?

It's hard to be 100% sure about the future, but suppose that in
another 10 years we discover that the buffer mapping hash table now
needs to have 1024 partitions, but the lock manager partition hash
table still only needs 16.  Do we really want to drag the lock manager
and any future shared hash tables up to 1024 just because the buffer
manager needs it?  That sounds like a bad idea from here.  For
example, one advantage of having fewer partitions is that it takes
less time to lock them all.  That matters to - for example - the cost
of running the deadlock detector.  Is the cost enough to matter in
that particular case with the code as it stands?  I don't know.  Could
having more partitions ever be worse than having fewer?  I'm pretty
sure that the answer is yes.

>> I don't see any reason why the number of free list partitions needs
>> to be a constant, or why it needs to be equal to the number of hash
>> bucket partitions.  That just seems like a completely unnecessary
>> constraint. You can take the hash value, or whatever else you use to
>> pick the partition, modulo any number you want.
>
> True, the number of spinlocks / freeLists could be a variable. I believe
> it's a good old simplicity vs. flexibility question.
>
> Let's say we allow the user (i.e. the calling side) to choose the number
> of spinlocks and free lists. So now we should use the formula
> freeList[hash % items_number], which means a division instead of a binary
> shift. Do we really need such an expensive operation just to calculate
> the index of a free list? Probably not. Let's limit possible items_number
> values to powers of two. But then there are only four values a user would
> be likely to choose (16, 32, 64, 128), which is not as flexible anymore.

Sure, I don't see anything wrong with requiring the number of
partitions to be a power of two.  On the other hand, I also suspect
that on modern hardware the cost of a shift versus a modulo operator
is almost irrelevant.  What's really going to make this fast or slow
is the degree to which it has, or does not have, cache line
contention.
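
As an aside, the power-of-two trick is easy to illustrate. In the sketch
below, NUM_FREELISTS and freelist_idx are assumed names for this example
only; when the count is a power of two, the mask form picks the same slot
as the modulo form:

```c
#include "postgres.h"			/* uint32, Assert */

#define NUM_FREELISTS 32		/* illustrative; must be a power of two */

static int
freelist_idx(uint32 hashcode)
{
	/* power-of-two check: exactly one bit set */
	Assert((NUM_FREELISTS & (NUM_FREELISTS - 1)) == 0);

	/* equivalent to hashcode % NUM_FREELISTS, but a single AND */
	return hashcode & (NUM_FREELISTS - 1);
}
```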

> What about memory? If we allocate mutex, nentries and freeList
> dynamically depending on the items_number value, HASHHDR should store
> pointers to the allocated memory, which means an additional indirection
> for almost every access to the mutex, nentries and freeList fields.
> Probably a bad idea. Otherwise we still waste shared memory, probably
> even more memory, since we don't want the maximum items_number value to
> be too low.

These seem like problems that admit of more than one good
solution, and which you are certainly capable of solving.  For
example, you could always include space for 128 freelists but not tie
the number of freelists to the number of partition locks, or you could
always include space for 128 freelists and have the system only use
them all if there are 128 or more partitions.  If there are fewer than
128 partitions, it uses only as many freelists as there are
partitions.  Neither of these things sounds very hard to implement.
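
A hedged sketch of that arrangement (FreeListData, NUM_FREELISTS and
FREELIST_IDX are assumed names for this example, not the patch's API):
the number of freelists is a fixed compile-time constant, and a hash
value is mapped onto a freelist with a plain modulo, independently of the
table's partition count.

```c
#define NUM_FREELISTS 32		/* fixed, regardless of num_partitions */

typedef struct FreeListData
{
	slock_t		mutex;			/* spinlock protecting the fields below */
	long		nentries;		/* number of entries drawn from this list */
	HASHELEMENT *freeList;		/* chain of free elements */
} FreeListData;

/* any hash value maps to some freelist; non-partitioned tables use slot 0 */
#define FREELIST_IDX(hctl, hashcode) \
	(IS_PARTITIONED(hctl) ? (hashcode) % NUM_FREELISTS : 0)
```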

Basically, the burden for you to impose a new coding rule on everybody
who uses shared hash tables in the future is very high.

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-08 Thread Aleksander Alekseev
Hello, Robert.

> So: do we have clear evidence that we need 128 partitions here, or
> might, say, 16 work just fine?

Yes, this number of partitions was chosen based on this benchmark (see
"spinlock array" column):

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

In fact we can choose 16, 32, 64 or 128 partitions - any number works
better than the current implementation with a single spinlock. But
according to this benchmark 128 partitions give twice as much TPS as
16 partitions, almost as if we didn't have any locks at all (see "no
locks" column).

Still I'm a bit concerned that 128 partitions require (128-16)*
( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional
memory per hash table which is:

(gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*))
$1 = 1904

... bytes of shared memory on amd64.

I'm not sure if this is considered OK. If it is, I suggest leaving the
number 128. If it's not, perhaps we could agree on, say, 64 partitions as
a compromise. The described benchmark doesn't represent any real-world
workload anyway. I personally believe that we don't have so many small
hash tables that 1.8 KB of additional shared memory per hash table would
be an issue.

> I don't really want to increase the number of lock manager partition
> locks to 128 just for the convenience of this patch, and even less do
> I want to impose the requirement that every future shared hash table
> must use an equal number of partitions.

May I ask why? Does it create a real problem?

> I don't see any reason why the number of free list partitions needs
> to be a constant, or why it needs to be equal to the number of hash
> bucket partitions.  That just seems like a completely unnecessary
> constraint. You can take the hash value, or whatever else you use to
> pick the partition, modulo any number you want. 

True, the number of spinlocks / freeLists could be a variable. I believe
it's a good old simplicity vs. flexibility question.

Let's say we allow the user (i.e. the calling side) to choose the number
of spinlocks and free lists. So now we should use the formula
freeList[hash % items_number], which means a division instead of a binary
shift. Do we really need such an expensive operation just to calculate
the index of a free list? Probably not. Let's limit possible items_number
values to powers of two. But then there are only four values a user would
be likely to choose (16, 32, 64, 128), which is not as flexible anymore.

What about memory? If we allocate mutex, nentries and freeList
dynamically depending on the items_number value, HASHHDR should store
pointers to the allocated memory, which means an additional indirection
for almost every access to the mutex, nentries and freeList fields.
Probably a bad idea. Otherwise we still waste shared memory, probably
even more memory, since we don't want the maximum items_number value to
be too low.

Etc.

I believe such a change will cause only additional code complexity
(code is already far from trivial in dynahash.c) so next time it will
be much harder to change anything, probably some bugs we will miss and
will not solve any real problem. Why just not to choose a constant which
seems to be good enough instead?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-06 Thread Robert Haas
On Thu, Jan 21, 2016 at 9:03 AM, Anastasia Lubennikova wrote:
> First of all, why not merge both patches into one? They aren't too big
> anyway.

So looking over this patch, it institutes a new coding rule that all
shared hash tables must have the same number of partitions, and that
number is hard-coded at 128.  The number of freelist partitions is
also hardcoded at 128.  I find this design surprising.  First, I would
not have expected that we'd need 128 freelist partitions.  We get by
just fine with only 16 lock manager partitions, at least AFAIK, and
there's no reason some future hashtable might not have little enough
contention that 16 partitions works just fine.  Even for the buffer
manager, which does have 128 partitions, I wouldn't assume that we
need 128 partitions for the free list just because we need 128
partitions for the locks themselves.  It's possible we do, but I'd
expect that existing buffer mapping locking might dissipate some of
that contention - e.g. if somebody's doing a buffer lookup on a
particular partition, they have a shared lock on that partition, so
nobody has an exclusive lock to be able to hit the freelist.  So: do
we have clear evidence that we need 128 partitions here, or might,
say, 16 work just fine?

Second, I don't see any reason why the number of free list partitions
needs to be a constant, or why it needs to be equal to the number of
hash bucket partitions.  That just seems like a completely unnecessary
constraint.  You can take the hash value, or whatever else you use to
pick the partition, modulo any number you want.  I don't really want
to increase the number of lock manager partition locks to 128 just for
the convenience of this patch, and even less do I want to impose the
requirement that every future shared hash table must use an equal
number of partitions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-27 Thread Dilip Kumar
On Wed, Jan 27, 2016 at 5:15 PM, Aleksander Alekseev <
a.aleks...@postgrespro.ru> wrote:

> Most likely HASHHDR.mutex is not the only bottleneck in your case, so
> my patch doesn't help much. Unfortunately I don't have access to any
> POWER8 server, so I can't investigate this issue. I suggest using the
> gettimeofday trick I described in the first message of this thread. It's
> time consuming, but it gives a clear understanding of which code is
> holding a lock.
>

I have also run the pgbench read-only test with data that doesn't fit
into shared buffers, because in this case HASHHDR.mutex access is quite
frequent. And in this case I do see a very good improvement on the POWER8
server.

Test Result:

Scale Factor: 300
Shared Buffer: 512MB
pgbench -c$ -j$ -S -M Prepared postgres

Client    base      patch
64        222173    318318
128       195805    262290

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-27 Thread Aleksander Alekseev
> This comment certainly requires some changes.

Fixed.

> BTW, could you explain why init_table_size was half of
> max_table_size?

I have no clue. My best guess is that it was a reasonable thing to do in
the past. Then somebody changed the code, and now there is little reason
to use init_table_size for partitioned tables.

> Why did you delete these two lines? I wonder if you should rewrite
> them instead?

```
MemSet(hctl, 0, sizeof(HASHHDR));
 
-   hctl->nentries = 0;
-   hctl->freeList = NULL;
```

These fields were initialized with zero values twice. It makes little
sense to me. 

> As far as I understood, this number was obtained experimentally?
> Maybe you should note that in the comment.

These numbers are very platform specific and will be outdated very
soon. I recall that my code was criticized for including exact numbers
not long ago. So I suggest keeping this part as is.

> For example, if you have nelem=25 and partitions_number=6.
> 25 / 6 = 4. And then you allocate 24 nelems, while 1 is lost.

Agree. Fixed.

> Apart from the notes mentioned above, I suppose the patch is good enough

I guess I will mark this patch as "Ready for Committer" then.
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  &info,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   &info,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   &info, hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8585a76 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,18 +373,10 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
-	 * Compute init/max size to request for lock hashtables.  Note these
-	 * calculations must agree with LockShmemSize!
-	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
-
-	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
 	 * information.
 	 */
@@ -392,16 +384,15 @@ InitLocks(void)
 	info.keysize = sizeof(LOCKTAG);
 	info.entrysize = sizeof(LOCK);
 	info.num_partitions = NUM_LOCK_PARTITIONS;
+	max_table_size = NLOCKENTS();
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
 	   max_table_size,
 	   &info,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
 	max_table_size *= 2;
-	init_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -413,7 +404,6 @@ InitLocks(void)
	info.num_partitions = NUM_LOCK_PARTITIONS;

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-27 Thread Anastasia Lubennikova



22.01.2016 13:48, Aleksander Alekseev:

Then, this thread became too tangled. I think it's worth writing a
new message with the patch, the test script, some results and a brief
overview of how it really works. It will make further review
much easier.

Sure.

HASHHDR represents a hash table. It can be regular or partitioned. A
partitioned table is stored in shared memory and accessed by multiple
processes simultaneously. To prevent data corruption the hash table is
partitioned and each process has to acquire a lock for the corresponding
partition before accessing data in it. The partition number is determined
by the lower bits of the key's hash value. The trickiest part is that
dynahash knows nothing about these locks; all locking is done on the
calling side.

Since shared memory is pre-allocated and can't grow, we use a freeList to
allocate memory in a hash table. We also use the nentries field to count
the current number of elements in the hash table. Since the hash table is
used by multiple processes, these fields are protected by a spinlock,
which, as it turned out, can cause lock contention and create a
bottleneck.

After trying a few approaches I discovered that using one spinlock per
partition works quite well. Here are the latest benchmark results:

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

Note that the "no locks" solution can't be used because it doesn't
guarantee that all available memory will be used in some corner cases.

You can find a few more details and a test script in the first message
of this thread. If you have any other questions regarding this patch
please don't hesitate to ask.

InitLocks
/*
 * Compute init/max size to request for lock hashtables.  Note these
 * calculations must agree with LockShmemSize!
 */

This comment certainly requires some changes.
BTW, could you explain why init_table_size was half of
max_table_size?

Although, I don't see any problems with your changes.


-   hctl->nentries = 0;
-   hctl->freeList = NULL;

Why did you delete these two lines? I wonder if you should rewrite them 
instead?


+ * This particular number of partitions significantly reduces lock contention
+ * in partitioned hash tables, almost as if partitioned tables didn't use any
+ * locking at all

As far as I understood, this number was obtained experimentally? Maybe 
you should note that in the comment.



And the last, but not least.

+	nelem_alloc = ((int) nelem) / partitions_number;
+	if (nelem_alloc == 0)
+		nelem_alloc = 1;
+
+	for (i = 0; i < partitions_number; i++)
+		if (!element_alloc(hashp, nelem_alloc, i))
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));


It looks like you forgot to round the result of the division up.
For example, if you have nelem=25 and partitions_number=6, then
25 / 6 = 4, so you allocate only 6 * 4 = 24 elements, while 1 is lost.
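
A sketch of one possible fix (illustrative code, not the patch itself):
round the division up so that the partitions together cover at least
nelem elements.

```c
/* Illustrative only: ceiling division avoids losing the remainder. */
static long
nelem_per_partition(long nelem, int partitions_number)
{
	long		nelem_alloc = (nelem + partitions_number - 1) / partitions_number;

	return (nelem_alloc > 0) ? nelem_alloc : 1;
}

/* nelem = 25, partitions_number = 6: (25 + 5) / 6 = 5, and 6 * 5 = 30 >= 25,
 * so no element is lost to truncation. */
```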

As for performance, I did a test on my notebook. The patched version
shows a small performance improvement.

pgbench -j 1 -c 1 -f pgbench.sql -T 300 postgres
base      patched
127 tps   145 tps

pgbench -j 8 -c 8 -f pgbench.sql -T 300 postgres
base      patched
247 tps   270 tps

But I don't have any big machine on which to perform the tests where the
patch is supposed to make a significant difference.

So I would rely on your results and those of the other reviewers.

Apart from the notes mentioned above, I suppose the patch is good enough
to pass it to a reviewer.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-27 Thread Aleksander Alekseev
> > This patch affects header files. Did you by any chance forget to
> > run `make clean` after applying it? As we discussed above, when you
> > change .h files autotools doesn't rebuild dependent .c files:
> >
> 
> Yes, actually I always compile using "make clean; make -j20; make
> install". If you want, I will run it again, maybe today or tomorrow,
> and post the result.
> 
> 

Most likely HASHHDR.mutex is not the only bottleneck in your case, so
my patch doesn't help much. Unfortunately I don't have access to any
POWER8 server, so I can't investigate this issue. I suggest using the
gettimeofday trick I described in the first message of this thread. It's
time consuming, but it gives a clear understanding of which code is
holding a lock.
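
For readers who don't want to dig up that first message, the trick is
roughly the following; this is a sketch with assumed names and an
arbitrary threshold, not the exact code from that message:

```c
#include <sys/time.h>

#include "postgres.h"
#include "storage/spin.h"

/* Acquire a spinlock, but log the call site when the wait was long. */
static void
timed_spinlock_acquire(volatile slock_t *lock, const char *where)
{
	struct timeval before;
	struct timeval after;
	long		wait_us;

	gettimeofday(&before, NULL);
	SpinLockAcquire(lock);
	gettimeofday(&after, NULL);

	wait_us = (after.tv_sec - before.tv_sec) * 1000000L +
			  (after.tv_usec - before.tv_usec);

	if (wait_us > 100)			/* arbitrary threshold */
		elog(LOG, "%s: waited %ld us for spinlock", where, wait_us);
}
```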


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-24 Thread Dilip Kumar
On Fri, Jan 22, 2016 at 3:44 PM, Aleksander Alekseev <
a.aleks...@postgrespro.ru> wrote:

> This patch affects header files. Did you by any chance forget to run
> `make clean` after applying it? As we discussed above, when you
> change .h files autotools doesn't rebuild dependent .c files:
>

Yes, actually I always compile using "make clean; make -j20; make
install". If you want, I will run it again, maybe today or tomorrow, and
post the result.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-22 Thread Aleksander Alekseev
Hi,

> First of all, why not merge both patches into one? They aren't too
> big anyway.

Agree.

> I think comments should be changed, to be more informative here.

> Add a comment here too.

> Maybe you should explain this magic number 7 in the comment above?

Done.

> Then, this thread became too tangled. I think it's worth writing a
> new message with the patch, the test script, some results and a brief
> overview of how it really works. It will make further review
> much easier.

Sure.

HASHHDR represents a hash table. It can be regular or partitioned. A
partitioned table is stored in shared memory and accessed by multiple
processes simultaneously. To prevent data corruption the hash table is
partitioned and each process has to acquire a lock for the corresponding
partition before accessing data in it. The partition number is determined
by the lower bits of the key's hash value. The trickiest part is that
dynahash knows nothing about these locks; all locking is done on the
calling side.

Since shared memory is pre-allocated and can't grow, we use a freeList to
allocate memory in a hash table. We also use the nentries field to count
the current number of elements in the hash table. Since the hash table is
used by multiple processes, these fields are protected by a spinlock,
which, as it turned out, can cause lock contention and create a
bottleneck.

After trying a few approaches I discovered that using one spinlock per
partition works quite well. Here are the latest benchmark results:

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

Note that the "no locks" solution can't be used because it doesn't
guarantee that all available memory will be used in some corner cases.
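
For illustration, a minimal sketch of the one-spinlock-per-partition idea
follows; the struct name, function name and PARTITIONS constant are
assumptions made for this example, not the actual patch:

```c
#include "postgres.h"
#include "storage/spin.h"		/* slock_t, SpinLockAcquire/Release */
#include "utils/hsearch.h"		/* HASHELEMENT */

#define PARTITIONS 32			/* assumed partition count (power of two) */

typedef struct SketchHashHdr
{
	slock_t		mutex[PARTITIONS];		/* one spinlock per partition */
	long		nentries[PARTITIONS];	/* per-partition entry counts */
	HASHELEMENT *freeList[PARTITIONS];	/* per-partition free lists */
} SketchHashHdr;

/* Take a free element while holding only one partition's lock. */
static HASHELEMENT *
get_free_element(SketchHashHdr *hctl, uint32 hashcode)
{
	int			idx = hashcode % PARTITIONS;	/* lower bits pick a slot */
	HASHELEMENT *elem;

	SpinLockAcquire(&hctl->mutex[idx]);
	elem = hctl->freeList[idx];
	if (elem != NULL)
	{
		hctl->freeList[idx] = elem->link;
		hctl->nentries[idx]++;
	}
	SpinLockRelease(&hctl->mutex[idx]);

	return elem;
}
```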

You can find a few more details and a test script in the first message
of this thread. If you have any other questions regarding this patch
please don't hesitate to ask.

Best regards,
Aleksander

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-			  pgss_max, pgss_max,
+			  pgss_max,
 			  &info,
 			  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-  size, size,
+  size,
   &info,
   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-			   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+			   SHMEM_INDEX_SIZE,
 			   &info, hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8d9b36a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,8 +373,7 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
@@ -382,7 +381,6 @@ InitLocks(void)
 	 * calculations must agree with LockShmemSize!
 	 */
 	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
 
 	/*
 	 * Allocate hash table for LOCK structs

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-22 Thread Aleksander Alekseev
This patch affects header files. Did you by any chance forget to run
`make clean` after applying it? As we discussed above, when you
change .h files autotools doesn't rebuild dependent .c files:

http://www.postgresql.org/message-id/caf4au4y4mtapp2u3ugtbqd_z0chesjwfnavrgk4umfcwa4e...@mail.gmail.com

On Thu, 21 Jan 2016 16:49:12 +0530, Dilip Kumar wrote:

> On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <
> a.aleks...@postgrespro.ru> wrote:
> 
> >
> > increasing number of lock partitions (see columns "no locks",
> > "lwlock" and "spinlock array"). Previously it couldn't help us (see
> > "master" column) because of a bottleneck.
> >
> > If you have any other questions regarding this patch please don't
> > hesitate to ask.
> >
> 
> I have done some performance benchmarking for this
> patch (dynahash-optimization-v10-step1.patch).
> 
> Machine Detail:
> cpu   : POWER8
> cores: 24 (192 with HT)
> 
> Non Default Parameter:
> 
> Shared Buffer= 30GB
> max_wal_size= 10GB
> max_connections=500
> 
> Test1:
> *schema.sql* and *pgbench.sql* are the same files which Aleksander
> attached in the first mail of the thread.
> 
> psql -d postgres -f schema.sql
> pgbench -c$ -j$ -f pgbench.sql  postgres
> 
> client    base    patch
> 1         145     145
> 2         262     258
> 4         453     472
> 8         749     732
> 16        1114    1128
> 32        1727    1789
> 64        2729    2793
> 128       3534    3520
> 
> With this test I did not see any improvement, though I think it should
> improve performance. Do you have any suggestions for reproducing
> results like yours?
> 
> I have also dumped stacks using a script and observed many stack dumps
> like those you mentioned in the initial thread. And after the patch, I
> found that the number of lock-waiting stacks is reduced.
> 
> Stack Dump:
> ---
> #0  0x7f25842899a7 in semop () from /lib64/libc.so.6
> #1  0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at
> pg_sema.c:387
> #2  0x0078955f in LWLockAcquire (lock=0x7f247698a980,
> mode=LW_EXCLUSIVE) at lwlock.c:1028
> #3  0x007804c4 in LockAcquireExtended
> #4  0x0077fe91 in LockAcquire
> #5  0x0077e862 in LockRelationOid
> #6  0x0053bc7b in find_inheritance_children
> #7  0x0053bd4f in find_all_inheritors
> #8  0x006fc0a2 in expand_inherited_rtentry
> #9  0x006fbf91 in expand_inherited_tables
> 
> I have also tried to analyze with perf; I can see that the amount of
> time taken in hash_search_with_hash_value is reduced from 3.86% (master)
> to 1.72% (patch).
> 
> I plan to do further investigation, in different scenarios of
> dynahash.
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-21 Thread Anastasia Lubennikova

30.12.2015 16:01, Aleksander Alekseev:

Here is a clean version of the patch. Step 1 is an optimization. Step 2
refactors this:

  HTAB *
  ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size,   /* initial table size */
+ long init_size,   /* initial table size XXX ignored, refactor! */


Hi, I found that the patch is still not reviewed, so I've looked it over.
I haven't dived into this subject before, so my comments will be more
general than specific to your investigation.
Sorry if some things seem like nitpicks; I just suppose that a clean patch
will be more convenient for reviewers.


First of all, why not merge both patches into one? They aren't too big
anyway.
Then, this thread became too tangled. I think it's worth writing a new
message with the patch, the test script, some results and a brief overview
of how it really works. It will make further review much easier.


+   /* number of entries in hash table */
+   long    nentries[NUM_LOCK_PARTITIONS];
+   /* linked list of free elements */
+   HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];

I think comments should be changed, to be more informative here.

+   if (IS_PARTITIONED(hashp->hctl))
+       partitions_number = NUM_LOCK_PARTITIONS;
+   else
+       partitions_number = 1;
+
+   nelem_alloc = ((int) nelem) / partitions_number;
+   if (nelem_alloc == 0)
+       nelem_alloc = 1;

Add a comment here too.

-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  128
-
 /* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS  4
+#define LOG2_NUM_LOCK_PARTITIONS  7
 #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+

I'm sure it was discussed above, but these definitions do not look
obvious at all.

Maybe you should explain this magic number 7 in the comment above?
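
One possible wording, just as an illustration of what such a comment
could say (the numbers come from the benchmarks posted upthread):

```c
/*
 * 1 << 7 == 128 partitions.  Chosen experimentally: in the benchmarks
 * posted upthread, 128 partitions gave roughly twice the TPS of 16
 * partitions, almost as if no locks were taken at all.
 */
#define LOG2_NUM_LOCK_PARTITIONS  7
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
```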

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-21 Thread Dilip Kumar
On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <
a.aleks...@postgrespro.ru> wrote:

>
> increasing number of lock partitions (see columns "no locks", "lwlock"
> and "spinlock array"). Previously it couldn't help us (see "master"
> column) because of a bottleneck.
>
> If you have any other questions regarding this patch please don't
> hesitate to ask.
>

I have done some performance benchmarking for this
patch (dynahash-optimization-v10-step1.patch).

Machine Detail:
cpu   : POWER8
cores: 24 (192 with HT)

Non Default Parameter:

Shared Buffer= 30GB
max_wal_size= 10GB
max_connections=500

Test1:
*schema.sql* and *pgbench.sql* are the same files which Aleksander
attached in the first mail of the thread.

psql -d postgres -f schema.sql
pgbench -c$ -j$ -f pgbench.sql  postgres

client    base    patch
1         145     145
2         262     258
4         453     472
8         749     732
16        1114    1128
32        1727    1789
64        2729    2793
128       3534    3520

With this test I did not see any improvement, though I think it should
improve performance. Do you have any suggestions for reproducing results
like yours?

I have also dumped stacks using a script and observed many stack dumps
like those you mentioned in the initial thread. And after the patch, I
found that the number of lock-waiting stacks is reduced.

Stack Dump:
---
#0  0x7f25842899a7 in semop () from /lib64/libc.so.6
#1  0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at
pg_sema.c:387
#2  0x0078955f in LWLockAcquire (lock=0x7f247698a980,
mode=LW_EXCLUSIVE) at lwlock.c:1028
#3  0x007804c4 in LockAcquireExtended
#4  0x0077fe91 in LockAcquire
#5  0x0077e862 in LockRelationOid
#6  0x0053bc7b in find_inheritance_children
#7  0x0053bd4f in find_all_inheritors
#8  0x006fc0a2 in expand_inherited_rtentry
#9  0x006fbf91 in expand_inherited_tables

I have also tried to analyze with perf; I can see that the amount of time
taken in hash_search_with_hash_value is reduced from 3.86% (master) to
1.72% (patch).

I plan to do further investigation, in different scenarios of dynahash.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-12 Thread Alvaro Herrera
Andres Freund wrote:
> On 2016-01-11 21:16:42 -0300, Alvaro Herrera wrote:
> > Andres, Robert, are you still reviewing this patch?
> 
> I don't think I've signed up to review - or actually did review - this patch.

Well, maybe not as such, but you replied to the thread several times,
which is why I asked.  No problem, you don't *have* to review all
patches.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-12 Thread Andres Freund
On 2016-01-11 21:16:42 -0300, Alvaro Herrera wrote:
> Andres, Robert, are you still reviewing this patch?

I don't think I've signed up to review - or actually did review - this patch.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-12 Thread Aleksander Alekseev
Hello, Álvaro

> So you were saying in some posts upthread that the new hash tables would
> "borrow" AN elements from the freelist then put AN elements back when
> too many got unused.  Is that idea embodied in this patch?

Right. It didn't satisfy the "use all memory" requirement anyway. It's
better to sacrifice a few TPS (according only to my benchmark, which
doesn't represent all possible use cases) than to accidentally break
someone's system after upgrading to the next version of PostgreSQL. See
the example with pg_dump given by Robert Haas above.
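
For context, the dropped "borrow" scheme looked roughly like this; the
sketch below is my reconstruction with illustrative names (AN,
LocalCache), not code from any posted patch. A backend takes a batch of
elements from the shared freelist under one lock acquisition and caches
them locally, which reduces contention but can leave elements stranded
in idle backends' caches, hence the failed "use all memory" requirement:

```c
#define AN 8					/* illustrative batch size */

typedef struct LocalCache
{
	HASHELEMENT *head;			/* privately cached free elements */
	int			count;
} LocalCache;

/* Refill a backend-local cache with up to AN elements in one lock trip. */
static void
borrow_batch(HASHHDR *hctl, LocalCache *cache)
{
	SpinLockAcquire(&hctl->mutex);
	while (cache->count < AN && hctl->freeList != NULL)
	{
		HASHELEMENT *elem = hctl->freeList;

		hctl->freeList = elem->link;
		elem->link = cache->head;
		cache->head = elem;
		cache->count++;
	}
	SpinLockRelease(&hctl->mutex);
}
```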

>> +/* Number of partitions of the shared buffer mapping hashtable */
>> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
>> +
>>  /* Number of partitions the shared predicate lock tables are divided
>> into */ #define LOG2_NUM_PREDICATELOCK_PARTITIONS  4
>>  #define NUM_PREDICATELOCK_PARTITIONS  (1 <<
>> LOG2_NUM_PREDICATELOCK_PARTITIONS)  
>
> I find this odd.  Why is it a good idea to define the bufMapping
> partitions like this? What's the explanation for the performance
> numbers you saw?

My mistake. I thought that the last table was self-explanatory.

We see that a single spinlock for accessing the freeList (the current
implementation) is obviously a bottleneck. Replacing it with, say, an
array of spinlocks reduces lock contention and therefore gives more TPS
(see the first row). We can also increase the concurrency level even
further by increasing the number of lock partitions (see the "no locks",
"lwlock" and "spinlock array" columns). Previously that couldn't help us
(see the "master" column) because of the bottleneck.

If you have any other questions regarding this patch please don't
hesitate to ask.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-11 Thread Alvaro Herrera
Andres, Robert, are you still reviewing this patch?

Aleksander Alekseev wrote:
> Here is a clean version of the patch. Step 1 is an optimization. Step 2
> refactors this:
> 
>  HTAB *
>  ShmemInitHash(const char *name, /* table string name for shmem index */
> - long init_size,   /* initial table size */
> + long init_size,   /* initial table size XXX ignored, refactor! */

So you were saying in some posts upthread that the new hash tables would
"borrow" AN elements from the freelist then put AN elements back when
too many got unused.  Is that idea embodied in this patch?  I'm nervous
about continuing a line of development that does away with the
"init_size" concept completely, but if this patch keeps the "borrow"
concept then maybe it doesn't matter.

> -/* Number of partitions of the shared buffer mapping hashtable */
> -#define NUM_BUFFER_PARTITIONS  128
> -
>  /* Number of partitions the shared lock tables are divided into */
> -#define LOG2_NUM_LOCK_PARTITIONS  4
> +#define LOG2_NUM_LOCK_PARTITIONS  7
>  #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
>  
> +/* Number of partitions of the shared buffer mapping hashtable */
> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
> +
>  /* Number of partitions the shared predicate lock tables are divided into */
>  #define LOG2_NUM_PREDICATELOCK_PARTITIONS  4
>  #define NUM_PREDICATELOCK_PARTITIONS  (1 << 
> LOG2_NUM_PREDICATELOCK_PARTITIONS)

I find this odd.  Why is it a good idea to define the bufMapping
partitions like this?  What's the explanation for the performance
numbers you saw?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-01-02 Thread Fabrízio de Royes Mello
On Wed, Dec 30, 2015 at 1:10 PM, Oleg Bartunov  wrote:
>
>
>
> On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund  wrote:
>>
>> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
>> > Aleksander Alekseev wrote:
>> >
>> > > Here is a funny thing - benchmark results I shared 22.12.2015 are
wrong
>> > > because I forgot to run `make clean` after changing lwlock.h
(autotools
>> > > doesn't rebuild project properly after changing .h files).
>> >
>> > Running configure with --enable-depend should avoid this problem.
>>
>> I still maintain that --enable-depend should be on by default. We're
>
>
> +1
>

+1, I always use it.

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: --enable-depend by default (was Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex)

2015-12-30 Thread Andres Freund
On 2015-12-30 10:49:27 -0500, Tom Lane wrote:
> > On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund  wrote:
> >> I still maintain that --enable-depend should be on by default. We're
> >> absurdly optimizing towards saving a handful of cycles in scenarios
> >> which are usually bottlenecked by other things (build boxes spend more
> >> time on tests and such)
>
> Nonsense.  The buildfarm will simply get slower, and at least for my
> animals it's too damn slow already.

I think you're overstating the performance impact of
--enable-depend. It's just an additional output file for the compiler,
given the way we're doing it (-MMD -MP -MF $outfile). Let's make a quick test.

I've built both trees once before, to prime ccache. The test is:
for i in $(seq 1 5); do git clean -dfxq && ./configure --(dis|en)able-depend 
--quiet && time make -j4 -s;done

(coffee break)

no ccache, --disable-depend: 0m55.810s, 0m58.361s, 0m58.517s, 0m58.674s, 0m58.466s
ccache, --disable-depend:    0m5.248s,  0m5.279s,  0m5.244s,  0m5.771s,  0m5.296s
no ccache, --enable-depend:  0m56.443s, 0m58.507s, 0m58.587s, 0m58.866s, 0m58.429s
ccache, --enable-depend:     0m5.538s,  0m5.518s,  0m5.572s,  0m5.555s,  0m5.528s


Yes, this is on a much faster machine (my laptop) than what some
buildfarm animals are running. But given that it's really just some
additional, *small*, files that are written (by the compiler) and read
(by make), I don't see why the impact would be significantly different
on slower machines.


> We could fix that by having the
> buildfarm script automatically add --disable-depend; but if we're going
> to change this, we should roll out that script update BEFORE we change
> it, not after.

Sounds reasonable, despite the above. This isn't an urgent change, just
something to make newer developers waste less time.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


--enable-depend by default (was Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex)

2015-12-30 Thread Tom Lane
[ A change as significant as this should not be debated in a thread with
a title that suggests it's of interest to only one or two people ]

> On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund  wrote:
>> I still maintain that --enable-depend should be on by default. We're
>> absurdly optimizing towards saving a handful of cycles in scenarios
>> which are usually bottlenecked by other things (build boxes spend more
>> time on tests and such)

Nonsense.  The buildfarm will simply get slower, and at least for my
animals it's too damn slow already.  We could fix that by having the
buildfarm script automatically add --disable-depend; but if we're going
to change this, we should roll out that script update BEFORE we change
it, not after.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Oleg Bartunov
On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund  wrote:

> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> > Aleksander Alekseev wrote:
> >
> > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > > because I forgot to run `make clean` after changing lwlock.h (autotools
> > > doesn't rebuild project properly after changing .h files).
> >
> > Running configure with --enable-depend should avoid this problem.
>
> I still maintain that --enable-depend should be on by default. We're
>

+1


> absurdly optimizing towards saving a handful of cycles in scenarios
> which are usually bottlenecked by other things (build boxes spend more
> time on tests and such), rather than optimizing for developer time. I
> don't know how many people have failed to set --enable-depend by now,
> but it definitely goes into several-hundreds-of-wasted-hours territory.
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Aleksander Alekseev
Agreed. --enable-depend turned on by default could have saved me a lot
of time. Now I'm aware of --enable-depend and `make clean`. Still, other
people will definitely make the same mistake over and over again.

On Wed, 30 Dec 2015 11:49:13 -0300, Alvaro Herrera wrote:

> Andres Freund wrote:
> > On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> > > Aleksander Alekseev wrote:
> > > 
> > > > Here is a funny thing - benchmark results I shared 22.12.2015
> > > > are wrong because I forgot to run `make clean` after changing
> > > > lwlock.h (autotools doesn't rebuild project properly after
> > > > changing .h files).
> > > 
> > > Running configure with --enable-depend should avoid this problem.
> > 
> > I still maintain that --enable-depend should be on by default.
> 
> +1  Let's do it.
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> > Aleksander Alekseev wrote:
> > 
> > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > > because I forgot to run `make clean` after changing lwlock.h (autotools
> > > doesn't rebuild project properly after changing .h files).
> > 
> > Running configure with --enable-depend should avoid this problem.
> 
> I still maintain that --enable-depend should be on by default.

+1  Let's do it.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Andres Freund
On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> Aleksander Alekseev wrote:
> 
> > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > because I forgot to run `make clean` after changing lwlock.h (autotools
> > doesn't rebuild project properly after changing .h files).
> 
> Running configure with --enable-depend should avoid this problem.

I still maintain that --enable-depend should be on by default. We're
absurdly optimizing towards saving a handful of cycles in scenarios
which are usually bottlenecked by other things (build boxes spend more
time on tests and such), rather than optimizing for developer time. I
don't know how many people have failed to set --enable-depend by now,
but it definitely goes into several-hundreds-of-wasted-hours territory.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Alvaro Herrera
Aleksander Alekseev wrote:

> Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> because I forgot to run `make clean` after changing lwlock.h (autotools
> doesn't rebuild project properly after changing .h files).

Running configure with --enable-depend should avoid this problem.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-30 Thread Aleksander Alekseev
Here is a clean version of the patch. Step 1 is an optimization. Step 2
refactors this:

 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size,   /* initial table size */
+ long init_size,   /* initial table size XXX ignored, refactor! */


As far as I understand, there is no performance degradation:

Before
==

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14315
latency average: 27.945 ms
latency stddev: 44.920 ms
tps = 286.097043 (including connections establishing)
tps = 286.127642 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 176127
latency average: 18.162 ms
latency stddev: 23.861 ms
tps = 3521.069672 (including connections establishing)
tps = 3522.166706 (excluding connections establishing)
```


After
=

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14212
latency average: 27.984 ms
latency stddev: 43.250 ms
tps = 284.038588 (including connections establishing)
tps = 285.112772 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 172135
latency average: 17.767 ms
latency stddev: 22.692 ms
tps = 3590.410302 (including connections establishing)
tps = 3594.607165 (excluding connections establishing)
```
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..d361dbb 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
+			  long init_size,	/* initial table size XXX ignored, refactor! */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..a50a2d3 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -87,6 +87,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -128,12 +129,21 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
+	/*
+	 * In a partitioned table, take these locks to touch nentries or freeList.
+	 * If the hash table is not partitioned, the mutexes are not used.
+	 */
+	slock_t		mutex[NUM_LOCK_PARTITIONS];
+
+	/*
+	 * These fields change during entry addition/deletion. If the hash table
+	 * is not partitioned, only nentries[0] and freeList[0] are used.
+	 */
 
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/* number of entries in hash table */
+	long		nentries[NUM_LOCK_PARTITIONS];
+	/* linked list of free elements */
+	HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +176,8 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
  * has its own copy (OK since no fields change at runtime)
@@ -219,10 +231,10 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
 static bool d
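
For illustration, here is a minimal standalone sketch (not the actual
dynahash code) of the idea above: one freelist plus one lock per
partition, with the partition chosen from the entry's hash code.
NUM_PARTITIONS, HashHeader and the C11 atomic_flag "spinlock" are
illustrative stand-ins for NUM_LOCK_PARTITIONS, HASHHDR and slock_t.

#include <stdatomic.h>
#include <stddef.h>

#define NUM_PARTITIONS 16           /* stands in for NUM_LOCK_PARTITIONS */

typedef struct Element { struct Element *next; } Element;

typedef struct
{
	atomic_flag mutex[NUM_PARTITIONS];    /* one "spinlock" per freelist */
	long        nentries[NUM_PARTITIONS];
	Element    *freeList[NUM_PARTITIONS];
} HashHeader;

static void spin_lock(atomic_flag *f)   { while (atomic_flag_test_and_set(f)) ; }
static void spin_unlock(atomic_flag *f) { atomic_flag_clear(f); }

/*
 * Pop a free element from the freelist of the partition this hash code
 * maps to; cf. PARTITION_IDX() in the patch.  Only this partition's
 * spinlock is taken, so other partitions proceed in parallel.
 */
static Element *
get_free_element(HashHeader *hctl, unsigned int hashcode)
{
	int      idx = hashcode % NUM_PARTITIONS;
	Element *e;

	spin_lock(&hctl->mutex[idx]);
	e = hctl->freeList[idx];
	if (e != NULL)
	{
		hctl->freeList[idx] = e->next;
		hctl->nentries[idx]++;
	}
	spin_unlock(&hctl->mutex[idx]);
	return e;               /* NULL means this partition's list is empty */
}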

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-29 Thread Aleksander Alekseev
Good news, everyone!

I discovered that NUM_LOCK_PARTITIONS is a bottleneck for the last
patch. Long story short - NUM_LOCK_PARTITIONS = (1 << 7) gives the best
performance:

PN = 16, AN =  8, NUM_LOCK_PARTITIONS = (1 << 7): 4782.9
PN = 16, AN =  4, NUM_LOCK_PARTITIONS = (1 << 7): 4089.9
PN = 16, AN =  2, NUM_LOCK_PARTITIONS = (1 << 7): 2822.5

PN =  8, AN =  8, NUM_LOCK_PARTITIONS = (1 << 7): 4391.7
PN =  8, AN =  4, NUM_LOCK_PARTITIONS = (1 << 7): 3665.0
PN =  8, AN =  2, NUM_LOCK_PARTITIONS = (1 << 7): 2205.7

PN =  4, AN =  8, NUM_LOCK_PARTITIONS = (1 << 7): 4031.9
PN =  4, AN =  4, NUM_LOCK_PARTITIONS = (1 << 7): 3378.2
PN =  4, AN =  2, NUM_LOCK_PARTITIONS = (1 << 7): 2142.4

Also I tried to optimize the global_free_list_* procedures as I planned.
The optimized version is asymptotically faster but works about 5% slower
in practice because of some additional operations we need to perform.
The patch is attached to this message in case you would like to examine
the code.

Here is a funny thing - the benchmark results I shared on 22.12.2015 are
wrong because I forgot to run `make clean` after changing lwlock.h
(autotools doesn't rebuild the project properly after changing .h files).
Here are the correct results:

60-core server,
pgbench -j 64 -c 64 -f pgbench.sql -T 50 -P 1 my_database

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
 1 << 4    |  660.4   |  2296.3  |  1441.8  |  454.4   |  1597.8
-----------|----------|----------|----------|----------|----------
 1 << 5    |  536.6   |  3006.5  |  1632.3  |  381.8   |  1896.8
-----------|----------|----------|----------|----------|----------
 1 << 6    |  469.9   |  4083.5  |  1693.4  |  427.7   |  2271.5
-----------|----------|----------|----------|----------|----------
 1 << 7    |  430.8   |  4653.0  |  2016.3  |  361.1   |  3277.1

As you may see, the "spinlock array" version works quite well. It is
almost 5 times faster than the current dynahash implementation and only
30% slower than the "no locks" version. It also guarantees that we will
use all available memory.

I experimented with various improvements of "spinlock array" as well.
E.g. I tried to borrow elements not one at a time but several, though it
didn't bring any performance improvement.

In case you believe that 5 times faster is not good enough, we can also
use sharded-global-free-list.patch with appropriate AN and PN values.
Still, this patch doesn't guarantee that all available memory will be
used (the "no lock" patch doesn't guarantee it either).

Considering that the benchmark above is not for all possible cases but
for a very specific one, that the "spinlock array" patch has much better
guarantees than the "no lock" patch, and that lock-free algorithms are
pretty complicated and error-prone, I suggest we call the "spinlock
array" solution good enough, at least until someone complains again that
dynahash is a bottleneck. Naturally, first I will clean up the code,
write more comments, then re-check once again that there is no
performance degradation according to the usual pgbench, etc.
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
+			  long init_size,	/* initial table size */ // AALEKSEEV: is ignored, refactor!
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..07a46db 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -118,6 +120,13 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+// AALEKSEEV: TODO: comment, should be power of two
+#define GLOBAL_FREE_LIST_PARTITIONS_NUM 4
+#define GLOBAL_FREE_LIST_PARTITIONS_MASK (GLOBAL_FREE_LIST_PARTITIONS_NUM-1)
+
+// AALEKSEEV: TODO: comment
+#define GLOBAL_FREE_LIST_ALLOC_NUM 8
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -129,11 +138,24 @@ typedef HASHBUCKET *HASHSEGMENT;
 struct HASHHDR
 {
 	/* In a pa

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-28 Thread Aleksander Alekseev
Here is another preliminary result I would like to share. As always, you
will find the corresponding patch in the attachment. It is of
work-in-progress quality.

The idea was suggested by a colleague of mine, Aleksander Lebedev.

freeList is partitioned like in the "no lock" patch. When there are not
enough free items in a freeList, we borrow AN items from a global list.
When a freeList becomes too large, we return AN items back to the global
list. This global list is also partitioned into PN partitions. Each
partition is protected by a spinlock.

This way we have less lock contention than in the "lwlock" or "spinlock
array" versions, since we borrow multiple free elements, not one at a
time. Also, in the worst case only AN*(NUM_LOCK_PARTITIONS-1) free items
are not used, instead of (Total/NUM_LOCK_PARTITIONS)*(NUM_LOCK_PARTITIONS-1).

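For illustration, here is a minimal standalone sketch of the borrow side
of this scheme. The global_free_list_borrow name mirrors the patch's
global_free_list_* procedures; pthread mutexes standing in for spinlocks
and the exact list layout are assumptions.

#include <pthread.h>
#include <stddef.h>

#define PN 4    /* GLOBAL_FREE_LIST_PARTITIONS_NUM */
#define AN 8    /* GLOBAL_FREE_LIST_ALLOC_NUM: items moved per borrow */

typedef struct Item { struct Item *next; } Item;

static pthread_mutex_t global_lock[PN] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static Item *global_free_list[PN];

/*
 * Move up to AN items from global partition p into a local freelist.
 * Returns the number of items actually moved; 0 means the partition
 * was empty and the caller may try the other partitions.
 */
static int
global_free_list_borrow(int p, Item **local)
{
	int moved = 0;

	pthread_mutex_lock(&global_lock[p]);
	while (moved < AN && global_free_list[p] != NULL)
	{
		Item *it = global_free_list[p];

		global_free_list[p] = it->next;
		it->next = *local;
		*local = it;
		moved++;
	}
	pthread_mutex_unlock(&global_lock[p]);
	return moved;
}
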
On a 60-core server the amount of TPS depends on AN and PN like this:

       | AN = 1 | AN = 2 | AN = 4 | AN = 8 | AN =16 | AN =32
-------|--------|--------|--------|--------|--------|--------
       |  733.0 | 1120.6 | 1605.5 | 1842.5 | 1545.5 | 1237.0
PN = 1 |  740.3 | 1127.0 | 1634.2 | 1800.8 | 1573.5 | 1245.1
       |  742.9 | 1102.1 | 1647.2 | 1853.6 | 1533.4 | 1251.9
-------|--------|--------|--------|--------|--------|--------
       | 1052.0 | 1438.1 | 1755.6 | 1981.0 | 2022.0 | 1816.8
PN = 2 | 1044.8 | 1453.1 | 1784.0 | 1958.3 | 2033.2 | 1819.2
       | 1028.7 | 1419.8 | 1809.2 | 1981.2 | 2028.2 | 1790.2
-------|--------|--------|--------|--------|--------|--------
       | 1182.0 | 1521.5 | 1813.2 | 1932.6 | 2035.2 | 1948.4
PN = 4 | 1212.4 | 1535.4 | 1816.8 | 1927.0 | 2018.7 | 2014.6
       | 1189.4 | 1528.9 | 1816.9 | 1942.6 | 2011.9 | 2018.3
-------|--------|--------|--------|--------|--------|--------
       | 1148.1 | 1522.2 | 1795.4 | 1926.6 | 2031.7 | 2015.6
PN = 8 | 1175.6 | 1529.4 | 1807.6 | 1913.5 | 2007.3 | 2062.0
       | 1169.9 | 1528.0 | 1796.3 | 1926.0 | 2011.1 | 2042.8
-------|--------|--------|--------|--------|--------|--------
       | 1117.7 | 1491.0 | 1803.9 | 1925.3 | 2029.4 | 2056.2
PN =16 | 1132.8 | 1481.0 | 1809.6 | 1968.1 | 2033.8 | 2068.5
       | 1131.4 | 1481.8 | 1819.4 | 1946.2 | 2071.1 | 2073.8

AN = GLOBAL_FREE_LIST_ALLOC_NUMBER
PN = GLOBAL_FREE_LIST_PARTITIONS_NUM

There is no performance degradation on Core i7. Increasing PN or AN any
further doesn't gain any more TPS.

As you may see, this version is about 30% faster than "lwlock" or
"spinlock array" and 3.1 times faster than master. Still, it's about 2.5
times slower than the "no locks" version, which I find frustrating.

Next I will try to speed up this version by modifying the
global_free_list_* procedures. The current implementations are not the
most efficient ones. Also I'm planning to explore approaches which
involve lock-free algorithms.

I would like to know your opinion about such an approach. For instance,
can we spare AN*(NUM_LOCK_PARTITIONS-1) items in the worst case, or
can't we, for the same reason we can't do it for
(Total / NUM_LOCK_PARTITIONS) * (NUM_LOCK_PARTITIONS-1)?
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
+			  long init_size,	/* initial table size */ // AALEKSEEV: is ignored, refactor!
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..8375c3b 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -118,6 +120,13 @@ typedef HASHELEMENT *HASHBUCKET;
 /* A hash segment is an array of bucket headers */
 typedef HASHBUCKET *HASHSEGMENT;
 
+// AALEKSEEV: TODO: comment, should be power of two
+#define GLOBAL_FREE_LIST_PARTITIONS_NUM 16
+#define GLOBAL_FREE_LIST_PARTITIONS_MASK (GLOBAL_FREE_LIST_PARTITIONS_NUM-1)
+
+// AALEKSEEV: TODO: comment
+#define GLOBAL_FREE_LIST_ALLOC_NUMBER 16
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -129,11 +138,20 @@ typedef HASHBUCKET *HASHSEGMENT;
 struct HASHHDR
 {
 	/* In a partitioned table, take this lock to touch nentries or freeList

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 10:39 AM, Aleksander Alekseev wrote:
> Obviously after splitting a freelist into NUM_LOCK_PARTITIONS
> partitions (and assuming that all necessary locking/unlocking is done
> on the calling side) tables can't have more than NUM_LOCK_PARTITIONS
> partitions because it would cause race conditions. For this reason I
> had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
> the behaviour of PostgreSQL depending on different values of
> NUM_LOCK_PARTITIONS.

This is not at all obvious.  There is no obvious reason why the number
of ways that the freelist is partitioned needs to have any relation at
all to the number of ways that the hash table itself is partitioned.
Probably, each process should pick a freelist affinity based on
MyBackendId % number_of_partitions or something like that.  If it
doesn't find one there, it can search the others, but always starting
with the one for which it has an affinity.  That way, the number of
processes contending on a single spinlock will be limited to
max_connections/number_of_partitions except when the table is very
nearly full.  That may happen a lot, though, for the buffer mapping
table, so we might need to over-allocate to compensate.  We want it to
be the case that a backend will almost always find a free entry on its
own freelist when it needs one, and only occasionally need to visit
any other freelist.

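A minimal sketch of that affinity scheme (FREELIST_NUM and
try_alloc_from() are hypothetical names; MyBackendId is the backend's
id, as in PostgreSQL):

#define FREELIST_NUM 16

extern int   MyBackendId;                      /* backend id */
extern void *try_alloc_from(int freelist_idx); /* hypothetical helper */

static void *
alloc_with_affinity(void)
{
	int   home = MyBackendId % FREELIST_NUM;
	void *e;
	int   i;

	/* Common case: our own freelist has a free entry. */
	if ((e = try_alloc_from(home)) != NULL)
		return e;

	/* Rarely: scan the other freelists, starting after our own. */
	for (i = 1; i < FREELIST_NUM; i++)
		if ((e = try_alloc_from((home + i) % FREELIST_NUM)) != NULL)
			return e;

	return NULL;                               /* table (nearly) full */
}
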
> Now regarding 60-core server:
>
> - One spinlock per hash table doesn't scale. I personally was expecting
>   this;
> - LWLock's and array of spinlocks do scale on NUMA up to a certain
>   point;
> - Best results are shown by "no locks";
>
> I believe that the "no locks" implementation is the best one since it
> is at least 3 times faster on NUMA than any other implementation. Also
> it is simpler and doesn't have the stealing-from-other-freelists logic
> that executes rarely and therefore is a likely source of bugs.
> Regarding the ~16 elements of freelists which in some corner cases
> could but wouldn't be used --- as I mentioned before, I believe it's
> not such a big problem. Also it's a small price to pay for 3 times
> more TPS.

I doubt it.  Having the system become fundamentally less reliable is a
big cost.  If the system tries to find a lock table entry and fails,
the user's query is going to error out.  They are not at that point
going to say, well, it's good that my system runs fast even though I
can't run pg_dump.  They are going to say, running pg_dump used to
work and now it fails.  The consequences of failing to make an entry
in the buffer mapping table are at least as bad.

And there's a second point, too, which is that you haven't proven that
accepting the costs of your proposed model is even necessary.  You've
tried a few experiments with other approaches, not even all of the
ones that were proposed in the earlier thread, and you don't seem to
have done any investigation of why they didn't perform as well, or if
you did, it's not discussed in this email.  So maybe those problems
can be fixed, after all.

> Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some
> doubts. For sure Robert had a good reason for committing 3acc10c9.
> Unfortunately I'm not familiar with the story behind this commit. What do
> you think?

See the thread "Scaling shared buffer eviction".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-22 Thread Aleksander Alekseev
> > Actually, I'd like to improve all partitioned hashes instead of
> > improving only one case.
> 
> Yeah.  I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.  

I would like to share some preliminary results. I tested four
implementations:

- no locks and no element stealing from other partitions;
- single LWLock per partitioned table;
- single spinlock per partitioned table;
- NUM_LOCK_PARTITIONS spinlocks per partitioned table;

Interestingly "Shared Buffer Lookup Table" (see buf_table.c) has 128
partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16 to
128 in commit 3acc10c9: 

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a

Obviously after splitting a freelist into NUM_LOCK_PARTITIONS
partitions (and assuming that all necessary locking/unlocking is done
on the calling side) tables can't have more than NUM_LOCK_PARTITIONS
partitions because it would cause race conditions. For this reason I
had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
the behaviour of PostgreSQL depending on different values of
NUM_LOCK_PARTITIONS.

So here are results:

Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  295.4   |  297.4   |  299.4   |  285.6   |  302.7
(1 << 4)   |  286.1   |  300.5   |  283.4   |  300.9   |  300.4
           |  300.0   |  300.0   |  302.1   |  300.7   |  300.3
-----------|----------|----------|----------|----------|----------
           |          |  296.7   |  299.9   |  298.8   |  298.3
(1 << 5)   |          |  301.9   |  302.2   |  305.7   |  306.3
           |          |  287.7   |  301.0   |  303.0   |  304.5
-----------|----------|----------|----------|----------|----------
           |          |  296.4   |  300.5   |  302.9   |  304.6
(1 << 6)   |          |  301.7   |  305.6   |  306.4   |  302.3
           |          |  299.6   |  304.5   |  306.6   |  300.4
-----------|----------|----------|----------|----------|----------
           |          |  295.9   |  298.7   |  295.3   |  305.0
(1 << 7)   |          |  299.5   |  300.5   |  299.0   |  310.2
           |          |  287.8   |  285.9   |  300.2   |  302.2

Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  505.1   |  521.3   |  511.1   |  524.4   |  501.6
(1 << 4)   |  452.4   |  467.4   |  509.2   |  472.3   |  453.7
           |  435.2   |  462.4   |  445.8   |  467.9   |  467.0
-----------|----------|----------|----------|----------|----------
           |          |  514.8   |  476.3   |  507.9   |  510.6
(1 << 5)   |          |  457.5   |  491.2   |  464.6   |  431.7
           |          |  442.2   |  457.0   |  495.5   |  448.2
-----------|----------|----------|----------|----------|----------
           |          |  516.4   |  502.5   |  468.0   |  521.3
(1 << 6)   |          |  463.6   |  438.7   |  488.8   |  455.4
           |          |  434.2   |  468.1   |  484.7   |  433.5
-----------|----------|----------|----------|----------|----------
           |          |  513.6   |  459.4   |  519.6   |  510.3
(1 << 7)   |          |  470.1   |  454.6   |  445.5   |  415.9
           |          |  459.4   |  489.7   |  457.1   |  452.8

60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  |  master  | no locks |  lwlock  | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          |  array
-----------|----------|----------|----------|----------|----------
           |  3156.2  |  3157.9  |  3542.0  |  3444.3  |  3472.4
(1 << 4)   |  3268.5  |  3444.7  |  3485.7  |  3486.0  |  3500.5
           |  3251.2  |  3482.3  |  3398.7  |  3587.1  |  3557.7
-----------|----------|----------|----------|----------|----------
           |          |  3352.7  |  3556.0  |  3543.3  |  3526.8
(1 << 5)   |          |  3465.0  |  3475.2  |  3486.9  |  3528.4
           |          |  3410.0  |  3482.0  |  3493.7  |  3444.9
-----------|----------|----------|----------|----------|----------
           |          |  3437.8  |  3413.1  |  3445.8  |  3481.6
(1 << 6)   |          |  3470.1  |  3478.4  |  3538.5  |  3579.9
           |          |  3450.8  |  3431.1  |  3509.0  |  3512.5
-----------|----------|----------|----------|----------|----------
           |

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Robert Haas
On Fri, Dec 18, 2015 at 8:46 AM, Teodor Sigaev  wrote:
>> Oh, that's an interesting idea.  I guess the problem is that if the
>> freelist is unshared, then users might get an error that the lock
>> table is full when some other partition still has elements remaining.
>
> Could we split one freelist in a hash into NUM_LOCK_PARTITIONS
> freelists? Each partition would have its own freelist, and if its
> freelist is empty the partition would search for an entry in the
> freelists of other partitions. To prevent concurrent access we need to
> add one LWLock to the hash: each partition locks the LWLock in share
> mode to work with its own freelist and in exclusive mode to work with
> other freelists.
>
> Actually, I'd like to improve all partitioned hashes instead of
> improving only one case.

Yeah.  I'm not sure that should be an LWLock rather than a spinlock,
but we can benchmark it both ways.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev
> This idea can improve the situation with ProcLock hash table, but I
> think IIUC what Andres is suggesting would reduce the contention
> around dynahash freelist and can be helpful in many more situations
> including BufMapping locks.

I agree. But as I understand it, the PostgreSQL community doesn't
generally approve big changes that affect the whole system, especially
if the original problem was only in one particular place. Therefore for
now I suggest only a small change. Naturally, if it is accepted there is
no reason not to apply the same changes to BufMapping or even dynahash
itself, with corresponding PROCLOCK hash refactoring.

BTW could you (or anyone?) please help me find this thread regarding
BufMapping or perhaps provide a benchmark? I would like to reproduce
this issue but I can't find anything relevant in the mailing list. Also
it seems to be a good idea to compare the alternative approaches that
were mentioned (atomic ops, group leader). Are there any discussions,
benchmarks or patches regarding this topic?

Frankly, I have serious doubts regarding atomic ops since they will most
likely create the same contention that a spinlock does. But perhaps
there is a patch that works differently from the way I imagine.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev
> Could we split one freelist in a hash into NUM_LOCK_PARTITIONS
> freelists? Each partition would have its own freelist, and if its
> freelist is empty the partition would search for an entry in the
> freelists of other partitions. To prevent concurrent access we need to
> add one LWLock to the hash: each partition locks the LWLock in share
> mode to work with its own freelist and in exclusive mode to work with
> other freelists.
> 
> Actually, I'd like to improve all partitioned hashes instead of
> improving only one case.

It seems to be the most promising direction of research for now. I will
send a patch and benchmark results soon.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Teodor Sigaev

Oh, that's an interesting idea.  I guess the problem is that if the
freelist is unshared, then users might get an error that the lock
table is full when some other partition still has elements remaining.


Could we split one freelist in a hash into NUM_LOCK_PARTITIONS
freelists? Each partition would have its own freelist, and if its
freelist is empty the partition would search for an entry in the
freelists of other partitions. To prevent concurrent access we need to
add one LWLock to the hash: each partition locks the LWLock in share
mode to work with its own freelist and in exclusive mode to work with
other freelists.

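A rough standalone sketch of that locking protocol, with a pthread
rwlock standing in for the LWLock. It assumes, as for dynahash's
partitioned tables, that the caller already holds the partition lock,
so no two backends touch the same partition's own freelist concurrently:

#include <pthread.h>
#include <stddef.h>

#define NPART 16

typedef struct Elem { struct Elem *next; } Elem;

static pthread_rwlock_t freelist_lock = PTHREAD_RWLOCK_INITIALIZER;
static Elem *free_list[NPART];

static Elem *
pop(Elem **head)
{
	Elem *e = *head;

	if (e)
		*head = e->next;
	return e;
}

static Elem *
get_entry(int my_part)
{
	Elem *e;
	int   i;

	/* Own freelist: share mode is enough; other partitions can work
	 * with their own freelists in parallel. */
	pthread_rwlock_rdlock(&freelist_lock);
	e = pop(&free_list[my_part]);
	pthread_rwlock_unlock(&freelist_lock);
	if (e)
		return e;

	/* Stealing from other partitions: take the lock exclusively. */
	pthread_rwlock_wrlock(&freelist_lock);
	for (i = 0; e == NULL && i < NPART; i++)
		e = pop(&free_list[i]);
	pthread_rwlock_unlock(&freelist_lock);
	return e;
}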

Actually, I'd like to improve all partitioned hashes instead of
improving only one case.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Amit Kapila
On Fri, Dec 18, 2015 at 2:50 PM, Aleksander Alekseev
<a.aleks...@postgrespro.ru> wrote:
>
> > This idea can improve the situation with ProcLock hash table, but I
> > think IIUC what Andres is suggesting would reduce the contention
> > around dynahash freelist and can be helpful in many more situations
> > including BufMapping locks.
>
> I agree. But as I understand it, the PostgreSQL community doesn't
> generally approve big changes that affect the whole system, especially
> if the original problem was only in one particular place. Therefore for
> now I suggest only a small change. Naturally, if it is accepted there is
> no reason not to apply the same changes to BufMapping or even dynahash
> itself, with corresponding PROCLOCK hash refactoring.
>
> BTW could you (or anyone?) please help me find this thread regarding
> BufMapping or perhaps provide a benchmark?
>

You can find that in below thread:
http://www.postgresql.org/message-id/CAA4eK1+U+GQDc2sio4adRk+ux6obFYRPxkY=ch5bknabtoo...@mail.gmail.com

This even contains the original idea and the initial patch for replacing
spinlocks with atomic ops.  I mentioned the A-B-A problem a few mails
down in that thread and gave a link to a paper suggesting how it can be
solved.  It needs more work, but it is doable.

> I would like to reproduce
> this issue but I can't find anything relevant in the mailing list. Also
> it seems to be a good idea to compare the alternative approaches that
> were mentioned (atomic ops, group leader). Are there any discussions,
> benchmarks or patches regarding this topic?
>

You can find the discussion and patch related to Group leader approach
in the thread:
http://www.postgresql.org/message-id/caa4ek1jbx4fzphignt0jsaz30a85bpjv+ewhk+wg_o-t6xu...@mail.gmail.com
This patch is already committed.

> Frankly, I have serious doubts regarding atomic ops since they will
> most likely create the same contention that a spinlock does. But
> perhaps there is a patch that works differently from the way I imagine.
>

I think it is difficult to say without implementing it.  If we want to
evaluate multiple approaches, then what we can do here is: I can help
with writing a patch using LWLocks, you can evaluate the atomic ops
approach, and we already have your current patch; then we can see what
works out best.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev
> What's in pgbench.sql?

It's from first message of this thread:

http://www.postgresql.org/message-id/20151211170001.78ded9d7@fujitsu


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Andres Freund
On 2015-12-18 11:40:58 +0300, Aleksander Alekseev wrote:
> $ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database

What's in pgbench.sql?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Amit Kapila
On Thu, Dec 17, 2015 at 9:33 PM, Aleksander Alekseev
<a.aleks...@postgrespro.ru> wrote:
>
> > I'd really like to see it being replaced by a queuing lock
> > (i.e. lwlock) before we go there. And then maybe partition the
> > freelist, and make nentries an atomic.
>
> I believe I just implemented something like this (see attachment). The
> idea is to partition the PROCLOCK hash table manually into
> NUM_LOCK_PARTITIONS smaller, non-partitioned hash tables. Since these
> tables are non-partitioned, the spinlock is not used and there is no
> lock contention.
>

This idea can improve the situation with ProcLock hash table, but I
think IIUC what Andres is suggesting would reduce the contention
around dynahash freelist and can be helpful in many more situations
including BufMapping locks.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Aleksander Alekseev
> Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.

In fact these numbers are for a similar but slightly different benchmark
(the same schema without checks on child tables). Here are the exact
numbers for the benchmark described above.



Before:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 20433
latency average: 93.966 ms
tps = 679.698439 (including connections establishing)
tps = 680.353897 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19111
latency average: 100.466 ms
tps = 635.763523 (including connections establishing)
tps = 636.112682 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19218
latency average: 99.906 ms
tps = 639.506848 (including connections establishing)
tps = 639.838757 (excluding connections establishing)


After:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 95900
latency average: 20.021 ms
tps = 3194.142762 (including connections establishing)
tps = 3196.091843 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96837
latency average: 19.827 ms
tps = 3225.822355 (including connections establishing)
tps = 3227.762847 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96143
latency average: 19.970 ms
tps = 3202.637126 (including connections establishing)
tps = 3204.070466 (excluding connections establishing)


Ratio:

$ python
>>> min(3194.0, 3225.0, 3202.0) / max(679.0, 635.0, 639.0)
4.703976435935199


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-18 Thread Amit Kapila
On Thu, Dec 17, 2015 at 8:44 PM, Andres Freund  wrote:
>
> On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> > On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund wrote:
> > > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > > contended in a bunch of situations, and the queued behaviour, combined
> > > with directed wakeups on the OS level, ought to improve the worst case
> > > behaviour measurably.
> >
> > Amit had the idea a while back of trying to replace the HASHHDR mutex
> > with something based on atomic ops.  It seems hard to avoid the
> > attendant A-B-A problems but maybe there's a way.
>
> I'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the freelist,
> and make nentries an atomic.  Just doing those might already be good
> enough and should be a lot easier.
>

Makes sense to me, but I think we should also try the Group leader idea
used for the ProcArrayLock optimisation, as during those tests I found
that it gives better results as compared to partitioning.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Aleksander Alekseev
> Oh, that's an interesting idea.  I guess the problem is that if the
> freelist is unshared, then users might get an error that the lock
> table is full when some other partition still has elements remaining.

True, but I don't believe it should be a real problem assuming we have
a strong hash function. If a user gets such an error it means that the
database is under heavy load and there are not many free elements left
in the other partitions either. This particular user didn't get lucky;
some other will. Anyway, the administrator should do something about
it - fight a DDoS attack, tune database parameters, etc.

> 3.5-4 more TPS, or 3.5 times more TPS?  Can you share the actual
> numbers?

Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 11:03 AM, Aleksander Alekseev wrote:
>> I'd really like to see it being replaced by a queuing lock
>> (i.e. lwlock) before we go there. And then maybe partition the
>> freelist, and make nentries an atomic.
>
> I believe I just implemented something like this (see attachment). The
> idea is to partition the PROCLOCK hash table manually into
> NUM_LOCK_PARTITIONS smaller, non-partitioned hash tables. Since these
> tables are non-partitioned, the spinlock is not used and there is no
> lock contention.

Oh, that's an interesting idea.  I guess the problem is that if the
freelist is unshared, then users might get an error that the lock
table is full when some other partition still has elements remaining.

> On a 60-core server we gain 3.5-4 more TPS according to the benchmark
> described above. As I understand it, there is no performance
> degradation in other cases (different CPU, traditional pgbench, etc).

3.5-4 more TPS, or 3.5 times more TPS?  Can you share the actual numbers?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Aleksander Alekseev
> I'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the
> freelist, and make nentries an atomic.

I believe I just implemented something like this (see attachment). The
idea is to partition the PROCLOCK hash table manually into
NUM_LOCK_PARTITIONS smaller, non-partitioned hash tables. Since these
tables are non-partitioned, the spinlock is not used and there is no
lock contention.

On a 60-core server we gain 3.5-4 more TPS according to the benchmark
described above. As I understand it, there is no performance degradation
in other cases (different CPU, traditional pgbench, etc).

If this patch seems to be OK, I believe we could consider applying the
same change not only to the PROCLOCK hash table.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 76fc615..1a86f86 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -249,15 +249,56 @@ static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
  * shared memory; LockMethodLocalHash is local to each backend.
  */
 static HTAB *LockMethodLockHash;
-static HTAB *LockMethodProcLockHash;
+static HTAB *LockMethodProcLockHashPartitions[NUM_LOCK_PARTITIONS];
 static HTAB *LockMethodLocalHash;
 
+/*
+ * LockMethodProcLockHash hash table is partitioned into NUM_LOCK_PARTITIONS
+ * usual (non-partitioned, i.e. without HASH_PARTITION flag) hash tables. The
+ * reason for partitioning LockMethodProcLockHash manually instead of just
+ * using a single hash table with the HASH_PARTITION flag is the following.
+ * If a hash table has the HASH_PARTITION flag, it uses a single spinlock to
+ * safely access some of its fields. It turned out that in the case of this
+ * particular hash table this causes considerable performance degradation
+ * because of lock contention on servers with tens of cores.
+ */
+#define LockMethodProcLockHash(hashcode) \
+	(LockMethodProcLockHashPartitions[LockHashPartition(hashcode)])
+
+/*
+ * Each partition of the LockMethodProcLockHash hash table has a unique name
+ * generated from this template using snprintf.
+ */
+static const char ProcLockHashNameTemplate[] = "PROCLOCK hash, partition %d";
+
+/*
+ * Size of buffer required to store LockMethodProcLockHash partition name.
+ *
+ * 10 is number of digits in 2^32. 2 is length of "%d" string.
+ */
+#define PROC_LOCK_HASH_NAME_BUFF_SIZE (sizeof(ProcLockHashNameTemplate)+10-2)
 
 /* private state for error cleanup */
 static LOCALLOCK *StrongLockInProgress;
 static LOCALLOCK *awaitedLock;
 static ResourceOwner awaitedOwner;
 
+/*
+ * Calculate total number of entries in LockMethodProcLockHash hash table.
+ *
+ * Caller is responsible for having acquired appropriate LWLocks.
+ */
+static long
+GetProcLockHashNumEntries(void)
+{
+	int			i;
+	long		sum = 0;
+
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+		sum += hash_get_num_entries(LockMethodProcLockHashPartitions[i]);
+
+	return sum;
+}
 
 #ifdef LOCK_DEBUG
 
@@ -373,16 +414,16 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		shmem_table_size;
 	bool		found;
+	int			i;
+	char		proclock_hash_name[PROC_LOCK_HASH_NAME_BUFF_SIZE];
 
 	/*
 	 * Compute init/max size to request for lock hashtables.  Note these
 	 * calculations must agree with LockShmemSize!
 	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
+	shmem_table_size = NLOCKENTS();
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -394,14 +435,13 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
-	   max_table_size,
+	   shmem_table_size,
+	   shmem_table_size,
 	   &info,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
-	max_table_size *= 2;
-	init_table_size *= 2;
+	shmem_table_size = (shmem_table_size * 2) / NUM_LOCK_PARTITIONS;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -412,11 +452,17 @@ InitLocks(void)
 	info.hash = proclock_hash;
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
-	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
-		   max_table_size,
-		   &info,
- HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+	{
+		snprintf(proclock_hash_name, sizeof(proclock_hash_name),
+ ProcLockHashNameTemplate, i + 1);
+
+		LockMethodProcLockHashPartitions[i] = ShmemInitHash(proclock_hash_name,
+			shmem_table_size,
+			shmem_table_size,
+			&info,
+  HASH_ELEM | HASH_FUNCTION);
+	}
 
 	/*
 	 * Allocate fast-path structures.
@@ -943,7 +989,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
 proclock_hashcode = ProcLockHashCode(&proclock->tag, hashcode);
 SHMQueueDelete(&proclock->lockLink);
 SHMQueueDelete(&proclock->procLink);

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Andres Freund
On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund  wrote:
> > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > contended in a bunch of situations, and the queued behaviour, combined
> > with directed wakeups on the OS level, ought to improve the worst case
> > behaviour measurably.
> 
> Amit had the idea a while back of trying to replace the HASHHDR mutex
> with something based on atomic ops.  It seems hard to avoid the
> attendant A-B-A problems but maybe there's a way.

I'd really like to see it being replaced by a queuing lock
(i.e. lwlock) before we go there. And then maybe partition the freelist,
and make nentries an atomic.  Just doing those might already be good
enough and should be a lot easier.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Robert Haas
On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund  wrote:
> On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote:
>> The problem is that code between LWLockAcquire (lock.c:881) and
>> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
>> the good old gettimeofday-and-logging method I managed to find a
>> bottleneck:
>>
>> -- proclock = SetupLockInTable [lock.c:892]
>>  `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
>>    `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
>>      `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]
>>
>> If my measurements are not wrong (I didn't place gettimeofday between
>> SpinLockAcquire/SpinLockRelease, etc) we sometimes spend about 3 ms
>> here waiting for a spinlock, which doesn't seem right.
>
> Well, it's quite possible that your process was scheduled out, while
> waiting for that spinlock. Which'd make 3ms pretty normal.
>
> I'd consider using a LWLock instead of a spinlock here. I've seen this
> contended in a bunch of situations, and the queued behaviour, combined
> with directed wakeups on the OS level, ought to improve the worst case
> behaviour measurably.

Amit had the idea a while back of trying to replace the HASHHDR mutex
with something based on atomic ops.  It seems hard to avoid the
attendant A-B-A problems but maybe there's a way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-15 Thread Andres Freund
On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote:
> The problem is that code between LWLockAcquire (lock.c:881) and
> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
> the good old gettimeofday-and-logging method I managed to find a
> bottleneck:
> 
> -- proclock = SetupLockInTable [lock.c:892]
>  `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
>    `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
>      `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]
> 
> If my measurements are not wrong (I didn't place gettimeofday between
> SpinLockAcquire/SpinLockRelease, etc) we sometimes spend about 3 ms
> here waiting for a spinlock, which doesn't seem right.

Well, it's quite possible that your process was scheduled out, while
waiting for that spinlock. Which'd make 3ms pretty normal.

I'd consider using a LWLock instead of a spinlock here. I've seen this
contended in a bunch of situations, and the queued behaviour, combined
with directed wakeups on the OS level, ought to improve the worst case
behaviour measurably.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-15 Thread Aleksander Alekseev
Hello, Tom.

I was exploring this issue further and discovered something strange.

"PROCLOCK hash" and "LOCK hash" are hash tables in shared memory. All
memory for these tables is in fact pre-allocated. But for some reason
these two tables are created (lock.c:394) with init_size =/= max_size.
It causes small overhead on calling memory allocator after hash table
creation and additional locking/unlocking.

I checked all other hash tables created via ShmemInitHash. All of these
tables have init_size == max_size. I suggest to create "PROCLOCK hash"
and "LOCK hash" with init_size == max_size too (see attached patch).
Granted this change doesn't cause any noticeable performance
improvements according to pgbench. Nevertheless it looks like a very
right thing to do which at least doesn't make anything worse.

If this patch seems to be OK next logical step I believe would be to
remove init_size parameter in ShmemInitHash procedure since in practice
it always equals max_size.

On Fri, 11 Dec 2015 10:30:30 -0500
Tom Lane  wrote:

> Aleksander Alekseev  writes:
> > Turns out PostgreSQL can spend a lot of time waiting for a lock in
> > this particular place, especially if you are running PostgreSQL on a
> > 60-core server. Which obviously is a pretty bad sign.
> > ...
> > I managed to fix this behaviour by modifying the choose_nelem_alloc
> > procedure in dynahash.c (see attached patch).
> 
> TBH, this is just voodoo.  I don't know why this change would have
> made any impact on lock acquisition performance, and neither do you,
> and the odds are good that it's pure chance that it changed
> anything.  One likely theory is that you managed to shift around
> memory allocations so that something aligned on a cacheline boundary
> when it hadn't before.  But, of course, the next patch that changes
> allocations anywhere in shared memory could change that back.  There
> are lots of effects like this that appear or disappear based on
> seemingly unrelated code changes when you're measuring edge-case
> performance.
> 
> The patch is not necessarily bad in itself.  As time goes by and
> machines get bigger, it can make sense to allocate more memory at a
> time to reduce memory management overhead.  But arguing for it on the
> basis that it fixes lock allocation behavior with 60 cores is just
> horsepucky.  What you were measuring there was steady-state hash
> table behavior, not the speed of the allocate-some-more-memory code
> path.
> 
>   regards, tom lane
> 
> 

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 76fc615..86d2f88 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,16 +373,14 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		shmem_table_size;
 	bool		found;
 
 	/*
 	 * Compute init/max size to request for lock hashtables.  Note these
 	 * calculations must agree with LockShmemSize!
 	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
+	shmem_table_size = NLOCKENTS();
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -394,14 +392,13 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
-	   max_table_size,
+	   shmem_table_size,
+	   shmem_table_size,
 	   &info,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
-	max_table_size *= 2;
-	init_table_size *= 2;
+	shmem_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -413,8 +410,8 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
-		   max_table_size,
+		   shmem_table_size,
+		   shmem_table_size,
 		   &info,
  HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-13 Thread Simon Riggs
On 11 December 2015 at 16:14, Aleksander Alekseev  wrote:


> I see your point, but I would like to clarify a few things.
>
> 1. Do we consider the described measurement method good enough to
> conclude that sometimes PostgreSQL really spends 3 ms in a spinlock
> (like an RTT between two Internet hosts in the same city)? If not, what
> method should be used to approve or disapprove this?
>

Timing things seems fine. You may have located an important issue.


> 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
> do we consider this a problem?
>

Probably, but then Tom didn't question 1 or 2. What he questioned was your
fix.


> 3. If we consider this a problem, what method is considered appropriate
> to find a real reason of such behaviour so we could fix it?
>

The problem you identify is in only one place, yet your fix changes many
parts of the code. Why is that the best fix out of the many possible ones?
Why would such a change be acceptable?

I personally don't know the answers to those questions. It would be
wonderful if somebody else would structure our lives such that all we had
to do was find simple answers, but that isn't the way life is. You ask for
a chain of logical thought, but it is for you to create one, somehow:
patches are default-reject, not committer-explain-why-reject.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Aleksander Alekseev
Oops. s/approve or disapprove/confirm or deny/

On Fri, 11 Dec 2015 19:14:41 +0300
Aleksander Alekseev  wrote:

> Hello, Tom
> 
> I see your point, but I would like to clarify a few things.
> 
> 1. Do we consider the described measurement method good enough to
> conclude that sometimes PostgreSQL really spends 3 ms in a spinlock
> (like an RTT between two Internet hosts in the same city)? If not, what
> method should be used to approve or disapprove this?
> 
> 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
> do we consider this a problem?
> 
> 3. If we consider this a problem, what method is considered
> appropriate to find a real reason of such behaviour so we could fix
> it?
> 
> Best regards,
> Aleksander
> 
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Aleksander Alekseev
Hello, Tom

I see your point, but I would like to clarify a few things.

1. Do we consider the described measurement method good enough to
conclude that sometimes PostgreSQL really spends 3 ms in a spinlock
(like an RTT between two Internet hosts in the same city)? If not, what
method should be used to approve or disapprove this?

2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
do we consider this a problem?

3. If we consider this a problem, what method is considered appropriate
to find a real reason of such behaviour so we could fix it?

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Tom Lane
Aleksander Alekseev  writes:
> Turns out PostgreSQL can spend a lot of time waiting for a lock in this
> particular place, especially if you are running PostgreSQL on a 60-core
> server. Which obviously is a pretty bad sign.
> ...
> I managed to fix this behaviour by modifying the choose_nelem_alloc
> procedure in dynahash.c (see attached patch).

TBH, this is just voodoo.  I don't know why this change would have made
any impact on lock acquisition performance, and neither do you, and the
odds are good that it's pure chance that it changed anything.  One likely
theory is that you managed to shift around memory allocations so that
something aligned on a cacheline boundary when it hadn't before.  But, of
course, the next patch that changes allocations anywhere in shared memory
could change that back.  There are lots of effects like this that appear
or disappear based on seemingly unrelated code changes when you're
measuring edge-case performance.

The patch is not necessarily bad in itself.  As time goes by and machines
get bigger, it can make sense to allocate more memory at a time to reduce
memory management overhead.  But arguing for it on the basis that it fixes
lock allocation behavior with 60 cores is just horsepucky.  What you were
measuring there was steady-state hash table behavior, not the speed of the
allocate-some-more-memory code path.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-11 Thread Aleksander Alekseev
Hello all,

Consider the following stacktrace:

(gdb) bt
#0  0x7f77c81fae87 in semop () syscall-template.S:81
#1  0x0063b721 in PGSemaphoreLock pg_sema.c:387
#2  0x00692e1b in LWLockAcquire lwlock.c:1026
#3  0x0068d4c5 in LockAcquireExtended lock.c:881
#4  0x0068dcc1 in LockAcquire lock.c:672
#5  0x0068b7a8 in LockRelationOid lmgr.c:112
#6  0x00501d18 in find_inheritance_children pg_inherits.c:120
#7  0x00501d80 in find_all_inheritors pg_inherits.c:182
#8  0x0062db8d in expand_inherited_rtentry prepunion.c:1285
#9  expand_inherited_tables prepunion.c:1212
#10 0x00622705 in subquery_planner planner.c:501
#11 0x00622d31 in standard_planner planner.c:285
#12 0x0069ef0c in pg_plan_query postgres.c:809
#13 0x0069f004 in pg_plan_queries postgres.c:868
#14 0x006a0fc2 in exec_simple_query postgres.c:1033
#15 PostgresMain postgres.c:4032
#16 0x00467479 in BackendRun postmaster.c:4237
#17 BackendStartup postmaster.c:3913
#18 ServerLoop () postmaster.c:1684
#19 0x0064c828 in PostmasterMain postmaster.c:1292
#20 0x00467f3e in main main.c:223

Turns out PostgreSQL can spend a lot of time waiting for a lock in this
particular place, especially if you are running PostgreSQL on a 60-core
server. Which obviously is a pretty bad sign.

You can easily reproduce this issue on a regular Core i7 server. Use
the attached schema.sql file to create a database and run:

pgbench -j 8 -c 8 -f pgbench.sql -T 300 my_database 2>/dev/null &

While this example is running, connect to some PostgreSQL process with
gdb and run bt/c from time to time. You will see that PostgreSQL waits
for this particular lock quite often.

The problem is that code between LWLockAcquire (lock.c:881) and
LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
the good old gettimeofday-and-logging method I managed to find a
bottleneck:

-- proclock = SetupLockInTable [lock.c:892]
 `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
   `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
     `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]

If my measurements are not wrong (I didn't place gettimeofday between
SpinLockAcquire/SpinLockRelease, etc) we sometimes spend about 3 ms here
waiting for a spinlock, which doesn't seem right.

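The measurement itself can be done roughly like this (a sketch, not the
code actually used; the helper name and the 1 ms threshold are made up):

#include <stdio.h>
#include <sys/time.h>

static double
elapsed_ms(const struct timeval *a, const struct timeval *b)
{
	return (b->tv_sec - a->tv_sec) * 1000.0 +
		   (b->tv_usec - a->tv_usec) / 1000.0;
}

/* Wrap the suspected section with gettimeofday() and log slow runs. */
void
timed_section(void (*section)(void))
{
	struct timeval t0, t1;

	gettimeofday(&t0, NULL);
	section();                  /* e.g. the SpinLockAcquire() path */
	gettimeofday(&t1, NULL);

	if (elapsed_ms(&t0, &t1) > 1.0)
		fprintf(stderr, "slow section: %.3f ms\n", elapsed_ms(&t0, &t1));
}
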
I managed to fix this behaviour by modifying the choose_nelem_alloc
procedure in dynahash.c (see attached patch). The idea is to double the
number of items we allocate when there are no more free items in the
hash table. This way we need half as many allocations, which reduces
contention.

This patch doesn't cause any performance degradation according to
pgbench, `make check` passes, etc.

Best regards,
Aleksanderdiff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..48def5e 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -544,19 +544,19 @@ choose_nelem_alloc(Size entrysize)
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(entrysize);
 
 	/*
-	 * The idea here is to choose nelem_alloc at least 32, but round up so
+	 * The idea here is to choose nelem_alloc at least 64, but round up so
 	 * that the allocation request will be a power of 2 or just less. This
 	 * makes little difference for hash tables in shared memory, but for hash
 	 * tables managed by palloc, the allocation request will be rounded up to
 	 * a power of 2 anyway.  If we fail to take this into account, we'll waste
 	 * as much as half the allocated space.
 	 */
-	allocSize = 32 * 4;			/* assume elementSize at least 8 */
+	allocSize = 64 * 4;			/* assume elementSize at least 8 */
 	do
 	{
 		allocSize <<= 1;
 		nelem_alloc = allocSize / elementSize;
-	} while (nelem_alloc < 32);
+	} while (nelem_alloc < 64);
 
 	return nelem_alloc;
 }


schema.sql
Description: application/sql


pgbench.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers