Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-11-06 Thread Sokolov Yura

On 2017-10-20 11:54, Sokolov Yura wrote:

Hello,

On 2017-10-19 19:46, Andres Freund wrote:

On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:

> > + init_local_spin_delay(&delayStatus);
>
> The way you moved this around has the disadvantage that we now do this -
> a number of writes - even in the very common case where the lwlock can
> be acquired directly.

Excuse me, I don't quite understand.
Are you objecting to init_local_spin_delay being placed here?


Yes.


I could place it before perform_spin_delay under `if (!spin_inited)` if you
think it is absolutely necessary.


I looked at the assembly and remembered that the last commit simplifies
`init_local_spin_delay` to just two or three writes of zeroes (it looks
like the compiler combines two 4-byte writes into one 8-byte write).
Compared to the surrounding code (especially in LWLockAcquire itself),
this overhead is negligible.
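
For illustration, here is a rough sketch (hypothetical names, not the actual
s_lock.h definitions) of what that initialization boils down to:

/* Sketch only: the counters that the simplified init_local_spin_delay()
 * has to zero out before the first perform_spin_delay() call. */
typedef struct ToySpinDelayStatus
{
    int spins;      /* iterations since the last sleep */
    int delays;     /* how many times we have already slept (pg_usleep) */
    int cur_delay;  /* current sleep length; stays 0 until the first delay */
} ToySpinDelayStatus;

#define toy_init_local_spin_delay(s) \
    ((s)->spins = 0, (s)->delays = 0, (s)->cur_delay = 0)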

Though, I found that there is a benefit in calling LWLockAttemptLockOnce
before entering the loop of LWLockAttemptLockOrQueue calls in
LWLockAcquire (when there is not much contention). And that way, the
`inline` decorator on LWLockAttemptLockOrQueue could be omitted. Given
that clang doesn't want to inline this function, it could be the best way.

Should I add such a commit to the patch?

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-20 Thread Sokolov Yura

Hello,

On 2017-10-19 19:46, Andres Freund wrote:

On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:

> > + init_local_spin_delay(&delayStatus);
>
> The way you moved this around has the disadvantage that we now do this -
> a number of writes - even in the very common case where the lwlock can
> be acquired directly.

Excuse me, I don't quite understand.
Are you objecting to init_local_spin_delay being placed here?


Yes.


I could place it before perform_spin_delay under `if (!spin_inited)` if you
think it is absolutely necessary.





Placing it anywhere else would complicate the code.




Or are you objecting to setting `mask` and `add`?


That seems right.


In both cases, I think the simpler version should be accepted first. It acts
as the algorithm definition, and it already gives a measurable improvement.


Well, in scalability. I'm less sure about uncontended performance.




> > +  * We intentionally do not call finish_spin_delay here, because the loop
> > +  * above usually finished by queuing into the wait list on contention, and
> > +  * doesn't reach spins_per_delay thereby doesn't sleep inside of
> > +  * perform_spin_delay. Also, different LWLocks has very different
> > +  * contention pattern, and it is wrong to update spin-lock statistic based
> > +  * on LWLock contention.
> > +  */
>
> Huh? This seems entirely unconvincing. Without adjusting this here we'll
> just spin the same way every iteration. Especially for the case where
> somebody else holds LW_FLAG_LOCKED that's quite bad.

LWLocks are very different. Some of them are always short-term
(BufferLock), others are always locked for a long time.


That seems not particularly relevant. The same is true for
spinlocks. The relevant question isn't how long the lwlock is held, it's
how long LW_FLAG_LOCKED is held - and that should only depend on
contention (i.e. bus speed, amount of times put into sleep while holding
lock, etc), not on how long the lock is held.


I've tried to place this delay into the lock itself (it has 2 free bytes),
but this attempt performed worse.


That seems unsurprising - there's a *lot* of locks, and we'd have to
tune all of them. Additionally there's a bunch of platforms where we do
*not* have free bytes (consider the embedding in BufferTag).



Now I understand that delays should be stored in an array indexed by
tranche. But I have no time to test this idea, and I doubt it will give
dramatically better results (i.e. > 5%), so I think it is better to accept
the patch as it is, and then experiment with per-tranche delays.


I don't think tranches have any decent predictive power here.


Look after "Make acquiring LWLock to look more like spinlock".
First `skip_wait_list` iterations there is no attempt to queue self into
wait list. It gives most of improvement. (without it there is just no
degradation on high number of clients, but a little decline on low
clients, because current algorithm is also a little `spinny` (because
it attempts to acquire lock again after queueing into wait list)).
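
To make that concrete, here is a minimal, self-contained sketch of the idea
in plain C11 atomics. It only illustrates the scheme described above; the
state layout, the placement of the `skip_wait_list` check, and all names
are simplified assumptions, not the patch itself:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HELD        (1u << 0)   /* stands in for the LW_VAL_* lock bits */
#define TOY_LIST_LOCKED (1u << 1)   /* stands in for LW_FLAG_LOCKED */
#define TOY_HAS_WAITERS (1u << 2)   /* stands in for LW_FLAG_HAS_WAITERS */

/* Returns true if acquired, false if we ended up queued on the wait list. */
static bool
toy_acquire_or_queue(_Atomic uint32_t *state, int skip_wait_list)
{
    uint32_t old = atomic_load(state);
    int      spins = 0;

    for (;;)
    {
        if ((old & TOY_HELD) == 0)
        {
            /* lock looks free: try to take it */
            if (atomic_compare_exchange_weak(state, &old, old | TOY_HELD))
                return true;
        }
        else if (spins++ < skip_wait_list)
        {
            /* behave like a plain spinlock for the first iterations */
            old = atomic_load(state);
        }
        else if ((old & TOY_LIST_LOCKED) == 0)
        {
            /* take the wait-list lock and announce waiters in one CAS;
             * success proves the lock was still held when we queued */
            if (atomic_compare_exchange_weak(state, &old,
                                             old | TOY_LIST_LOCKED | TOY_HAS_WAITERS))
            {
                /* enqueue self, then release TOY_LIST_LOCKED (not modelled) */
                return false;
            }
        }
        else
        {
            /* somebody else holds the wait-list lock: back off and reread */
            old = atomic_load(state);
        }
        /* on a failed CAS 'old' was refreshed by the CAS itself */
    }
}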

`skip_wait_list` depends on the tranche very much. A per-tranche
`skip_wait_list` should be calculated based on the ability to acquire the
lock without any sleep, i.e. without either queuing ourselves into the wait
list or falling into `pg_usleep` inside `perform_spin_delay`. I suppose it
is obvious that `spins_per_delay` has to be proportional to `skip_wait_list`
(it should be at least greater than `skip_wait_list`). Therefore
`spins_per_delay` should also be runtime-dependent on the lock's tranche.

Am I wrong?

Without that per-tranche auto-tuning, it is better not to touch
`spins_per_delay` inside `LWLockAttemptLockOrQueue`, I think.
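
If it helps to picture it, the per-tranche idea could look roughly like the
following. This is purely hypothetical - nothing like it exists in the patch
or in core, and all the names are made up:

/* Hypothetical per-tranche tuning slots, indexed by the lock's tranche id. */
typedef struct TrancheDelayTuning
{
    int skip_wait_list;     /* spins before we first try to queue */
    int spins_per_delay;    /* spins before the first pg_usleep() */
} TrancheDelayTuning;

static TrancheDelayTuning tranche_tuning[128];  /* sized >= number of tranches */

/* Inside the acquire loop one would then consult
 *     tranche_tuning[lock->tranche].skip_wait_list
 * instead of a single global value, and feed finish_spin_delay() results
 * back into the same per-tranche slot. */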

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Andres Freund
On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> > > + init_local_spin_delay(&delayStatus);
> > 
> > The way you moved this around has the disadvantage that we now do this -
> > a number of writes - even in the very common case where the lwlock can
> > be acquired directly.
> 
> Excuse me, I don't quite understand.
> Are you objecting to init_local_spin_delay being placed here?

Yes.


> Placing it anywhere else would complicate the code.


> Or are you objecting to setting `mask` and `add`?

That seems right.


> In both cases, I think the simpler version should be accepted first. It acts
> as the algorithm definition, and it already gives a measurable improvement.

Well, in scalability. I'm less sure about uncontended performance.



> > > +  * We intentionally do not call finish_spin_delay here, because the loop
> > > +  * above usually finished by queuing into the wait list on contention, and
> > > +  * doesn't reach spins_per_delay thereby doesn't sleep inside of
> > > +  * perform_spin_delay. Also, different LWLocks has very different
> > > +  * contention pattern, and it is wrong to update spin-lock statistic based
> > > +  * on LWLock contention.
> > > +  */
> > 
> > Huh? This seems entirely unconvincing. Without adjusting this here we'll
> > just spin the same way every iteration. Especially for the case where
> > somebody else holds LW_FLAG_LOCKED that's quite bad.
> 
> LWLocks are very different. Some of them are always short-term
> (BufferLock), others are always locked for a long time.

That seems not particularly relevant. The same is true for
spinlocks. The relevant question isn't how long the lwlock is held, it's
how long LW_FLAG_LOCKED is held - and that should only depend on
contention (i.e. bus speed, amount of times put into sleep while holding
lock, etc), not on how long the lock is held.

> I've tried to place this delay into the lock itself (it has 2 free bytes),
> but this attempt performed worse.

That seems unsurprising - there's a *lot* of locks, and we'd have to
tune all of them. Additionally there's a bunch of platforms where we do
*not* have free bytes (consider the embedding in BufferTag).


> Now I understand that delays should be stored in an array indexed by
> tranche. But I have no time to test this idea, and I doubt it will give
> dramatically better results (i.e. > 5%), so I think it is better to accept
> the patch as it is, and then experiment with per-tranche delays.

I don't think tranches have any decent predictive power here.


Greetings,

Andres Freund




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Sokolov Yura

On 2017-10-19 02:28, Andres Freund wrote:

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:

Algorithm for LWLockWaitForVar is also refactored. New version is:
1. If lock is not held by anyone, it exits immediately.
2. Otherwise it is checked for the ability to take the WaitList lock, because
  the variable is protected with it. If so, a CAS is performed, and if it is
  successful, the loop breaks to step 4.
3. Otherwise spin_delay is performed, and the loop returns to step 1.
4. var is checked for change.
  If it was changed, we unlock the wait list lock and exit.
  Note: it could not change in the following steps because we are holding
  the wait list lock.
5. Otherwise a CAS setting the necessary flags is performed. If it succeeds,
  then queue Proc to the wait list and exit.
6. If the CAS failed, then there is a possibility for the LWLock to be already
  released - if so then we should unlock the wait list and exit.
7. Otherwise the loop returns to step 5.

So the invariant is preserved:
- either we found the LWLock free,
- or we found the variable changed,
- or we set the flags and queued ourselves while the LWLock was held.

Spin_delay is not performed at step 7, because we should release the wait
list lock as soon as possible.


That seems unconvincing - by not delaying you're more likely to
*increase* the time till the current locker that holds the lock can
release the lock.


But why? If our CAS wasn't satisfied, then somebody else's was, so there
is always overall progress. And if we sleep in this loop, then other
waiters will spin in the first loop of this function. But given the
concrete usage of LWLockWaitForVar, it is probably not too bad to hold
other waiters in the first loop of this function (they loop until we
release the WaitList lock or the lock is released altogether).

I'm in doubt about which is better. May I keep it unchanged for now?


In fact, if the waiter sleeps here, the lock holder will not be able to set
the variable we are waiting for, and therefore will not release the lock.
It is stated in the comment for the loop:

+ * Note: value could not change again cause we are holding WaitList lock.


So by delaying here we would certainly degrade.



BTW, I found a small mistake in this place: I forgot to set
LW_FLAG_LOCKED in the expected state before this CAS. It looks like it
wasn't a real error, because the CAS always failed at the first loop
iteration (the real `lock->state` already had LW_FLAG_LOCKED set), and
after the failed CAS the expected state absorbs the value from `lock->state`.
I'll fix it.


The attachment contains a version with the fix.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

lwlock_v4.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Sokolov Yura

Hi,

On 2017-10-19 03:03, Andres Freund wrote:

Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:

 /*
  * Internal function that tries to atomically acquire the lwlock in the passed
- * in mode.
+ * in mode. If it could not grab the lock, it doesn't puts proc into wait
+ * queue.
  *
- * This function will not block waiting for a lock to become free - that's the
- * callers job.
+ * It is used only in LWLockConditionalAcquire.
  *
- * Returns true if the lock isn't free and we need to wait.
+ * Returns true if the lock isn't free.
  */
 static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)


This now has become a fairly special cased function, I'm not convinced
it makes much sense with the current naming and functionality.


It just doesn't need to be as complex as LWLockAttemptLockOrQueueSelf.
Their functionality is too different.


+/*
+ * Internal function that tries to atomically acquire the lwlock in the passed
+ * in mode or put it self into waiting queue with waitmode.
+ * This function will not block waiting for a lock to become free - that's the
+ * callers job.
+ *
+ * Returns true if the lock isn't free and we are in a wait queue.
+ */
+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
+{
+   uint32  old_state;
+   SpinDelayStatus delayStatus;
+   bool    lock_free;
+   uint32  mask,
+           add;
+
+   AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+   if (mode == LW_EXCLUSIVE)
+   {
+       mask = LW_LOCK_MASK;
+       add = LW_VAL_EXCLUSIVE;
+   }
+   else
+   {
+       mask = LW_VAL_EXCLUSIVE;
+       add = LW_VAL_SHARED;
+   }
+
+   init_local_spin_delay(&delayStatus);


The way you moved this around has the disadvantage that we now do this -
a number of writes - even in the very common case where the lwlock can
be acquired directly.


Excuse me, I don't quite understand.
Are you objecting to init_local_spin_delay being placed here?
Placing it anywhere else would complicate the code.

Or are you objecting to setting `mask` and `add`?
At least it simplifies the loop body. For example, it is simpler to write
specialized POWER assembly using the already filled `mask`+`add` registers
(I have a working draft with POWER assembly).

In both cases, I think the simpler version should be accepted first. It acts
as the algorithm definition, and it already gives a measurable improvement.
After some testing, an attempt to improve it may be made (assuming the
release will be next autumn, there is enough time for testing and
improvements).
I may be mistaken.


+   /*
+    * We intentionally do not call finish_spin_delay here, because the loop
+    * above usually finished by queuing into the wait list on contention, and
+    * doesn't reach spins_per_delay thereby doesn't sleep inside of
+    * perform_spin_delay. Also, different LWLocks has very different
+    * contention pattern, and it is wrong to update spin-lock statistic based
+    * on LWLock contention.
+    */


Huh? This seems entirely unconvincing. Without adjusting this here we'll
just spin the same way every iteration. Especially for the case where
somebody else holds LW_FLAG_LOCKED that's quite bad.


LWLocks are very different. Some of them are always short-term
(BufferLock), others are always locked for a long time. Some are sparse,
and some are crowded. It is quite bad to tune a single variable
`spins_per_delay` by such different loads.

And it is already tuned by spin-locks. And spin-locks tune it quite well.


I've tried to place this delay into the lock itself (it has 2 free bytes),
but this attempt performed worse.
Now I understand that delays should be stored in an array indexed by
tranche. But I have no time to test this idea, and I doubt it will give
dramatically better results (i.e. > 5%), so I think it is better to accept
the patch as it is, and then experiment with per-tranche delays.




From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Tue, 30 May 2017 18:54:25 +0300
Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .

---
 src/backend/storage/lmgr/lwlock.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 334c2a2d96..0a41c2c4e2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -62,16 +62,15 @@
  * notice that we have to wait. Unfortunately by the time we have finished
  * queuing, the former locker very well might have already finished it's
  * work. That's problematic because we're now stuck waiting inside the OS.
-
- * To mitigate those races we use a two phased attempt at locking:
- *  Phase 1: Try to do it atomically, if we succeed, nice
- *  

Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-18 Thread Andres Freund
Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:
> From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura 
> Date: Mon, 29 May 2017 09:25:41 +
> Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.
> 
> SpinDelayStatus should be function global to count iterations correctly,
> and produce more correct delays.
> 
> Also if spin didn't spin, do not count it in spins_per_delay correction.
> It wasn't necessary before cause delayStatus were used only in contended
> cases.

I'm not convinced this is beneficial. Adds a bit of overhead to the
case where LW_FLAG_LOCKED can be set immediately. OTOH, it's not super
likely to make a large difference if the lock has to be taken anyway...


> +
> +/* just shortcut to not declare lwlock_stats variable at the end of function */
> +static void
> +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)
> +{
> + lwlock_stats *lwstats;
> +
> + lwstats = get_lwlock_stats_entry(lock);
> + lwstats->spin_delay_count += delayStatus->delays;
> +}
>  #endif   /* LWLOCK_STATS */

This seems like a pretty independent change.


>  /*
>   * Internal function that tries to atomically acquire the lwlock in the passed
> - * in mode.
> + * in mode. If it could not grab the lock, it doesn't puts proc into wait
> + * queue.
>   *
> - * This function will not block waiting for a lock to become free - that's the
> - * callers job.
> + * It is used only in LWLockConditionalAcquire.
>   *
> - * Returns true if the lock isn't free and we need to wait.
> + * Returns true if the lock isn't free.
>   */
>  static bool
> -LWLockAttemptLock(LWLock *lock, LWLockMode mode)
> +LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)

This now has become a fairly special cased function, I'm not convinced
it makes much sense with the current naming and functionality.


> +/*
> + * Internal function that tries to atomically acquire the lwlock in the passed
> + * in mode or put it self into waiting queue with waitmode.
> + * This function will not block waiting for a lock to become free - that's the
> + * callers job.
> + *
> + * Returns true if the lock isn't free and we are in a wait queue.
> + */
> +static inline bool
> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
> +{
> + uint32  old_state;
> + SpinDelayStatus delayStatus;
> + bool    lock_free;
> + uint32  mask,
> +         add;
> +
> + AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
> +
> + if (mode == LW_EXCLUSIVE)
> + {
> +     mask = LW_LOCK_MASK;
> +     add = LW_VAL_EXCLUSIVE;
> + }
> + else
> + {
> +     mask = LW_VAL_EXCLUSIVE;
> +     add = LW_VAL_SHARED;
> + }
> +
> + init_local_spin_delay(&delayStatus);

The way you moved this around has the disadvantage that we now do this -
a number of writes - even in the very common case where the lwlock can
be acquired directly.


> + /*
> +  * Read once outside the loop. Later iterations will get the newer value
> +  * either via compare & exchange or with read after perform_spin_delay.
> +  */
> + old_state = pg_atomic_read_u32(&lock->state);
> + /* loop until we've determined whether we could acquire the lock or not */
> + while (true)
> + {
> +     uint32  desired_state;
> +
> +     desired_state = old_state;
> +
> +     lock_free = (old_state & mask) == 0;
> +     if (lock_free)
>   {
> -         if (lock_free)
> +         desired_state += add;
> +         if (pg_atomic_compare_exchange_u32(&lock->state,
> +                                            &old_state, desired_state))
>   {
>   /* Great! Got the lock. */
>  #ifdef LOCK_DEBUG
>   if (mode == LW_EXCLUSIVE)
>   lock->owner = MyProc;
>  #endif
> -             return false;
> +             break;
> +         }
> +     }
> +     else if ((old_state & LW_FLAG_LOCKED) == 0)
> +     {
> +         desired_state |= LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS;
> +         if (pg_atomic_compare_exchange_u32(&lock->state,
> +                                            &old_state, desired_state))
> +         {
> +             LWLockQueueSelfLocked(lock, waitmode);
> +             break;
>   }
> -         else
> -             return true;    /* somebody else has the lock */
> +     }
> +     else
> +     {
> +  

Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-18 Thread Andres Freund
Hi,

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:
> Patch changes the way LWLock is acquired.
> 
> Before the patch, the lock is acquired using the following procedure:
> 1. First LWLock->state is checked for the ability to take the lock, and a CAS
>   is performed (either the lock could be acquired or not). It is retried if
>   the status has changed.
> 2. If the lock is not acquired at step 1, the Proc is put into the wait list,
>   using the LW_FLAG_LOCKED bit of LWLock->state as a spin-lock for the wait list.
>   In fact, a profile shows that LWLockWaitListLock spends a lot of CPU on a
>   contended LWLock (up to 20%).
>   An additional CAS loop is inside pg_atomic_fetch_or_u32 for setting
>   LW_FLAG_HAS_WAITERS. And releasing the wait list lock is another CAS loop
>   on the same LWLock->state.
> 3. After that the state is checked again, because the lock could have been
>   released before the Proc was queued into the wait list.
> 4. If the lock was acquired at step 3, the Proc should be dequeued from the wait
>   list (another spin lock and CAS loop). And this operation could be
>   quite heavy, because the whole list has to be traversed.
> 
> Patch makes lock acquiring in single CAS loop:
> 1. LWLock->state is read, and ability for lock acquiring is detected.
>   If there is a possibility to take the lock, a CAS is tried.
>   If the CAS was successful, the lock is acquired (same as in the original version).
> 2. But if the lock is currently held by another backend, we check the ability
>   to take the WaitList lock. If the wait list lock is not held by anyone, a CAS
>   is performed to take the WaitList lock and set LW_FLAG_HAS_WAITERS at once.
>   If that CAS was successful, then the LWLock was still held at the moment the
>   wait list lock was taken - this proves the correctness of the new algorithm.
>   The Proc is then queued to the wait list.
> 3. Otherwise spin_delay is performed, and the loop returns to step 1.

Right, something like this didn't use to be possible because we had both
the LWLock->state and LWLock->mutex. But after 008608b9d5106 that's not
a concern anymore.

> Algorithm for LWLockWaitForVar is also refactored. New version is:
> 1. If lock is not held by anyone, it exits immediately.
> 2. Otherwise it is checked for the ability to take the WaitList lock, because
>   the variable is protected with it. If so, a CAS is performed, and if it is
>   successful, the loop breaks to step 4.
> 3. Otherwise spin_delay is performed, and the loop returns to step 1.
> 4. var is checked for change.
>   If it was changed, we unlock the wait list lock and exit.
>   Note: it could not change in the following steps because we are holding
>   the wait list lock.
> 5. Otherwise a CAS setting the necessary flags is performed. If it succeeds,
>   then queue Proc to the wait list and exit.
> 6. If the CAS failed, then there is a possibility for the LWLock to be already
>   released - if so then we should unlock the wait list and exit.
> 7. Otherwise the loop returns to step 5.
> 
> So the invariant is preserved:
> - either we found the LWLock free,
> - or we found the variable changed,
> - or we set the flags and queued ourselves while the LWLock was held.
> 
> Spin_delay is not performed at step 7, because we should release the wait
> list lock as soon as possible.

That seems unconvincing - by not delaying you're more likely to
*increase* the time till the current locker that holds the lock can
release the lock.

> Additionally, taking wait list lock is skipped in both algorithms if
> SpinDelayStatus.spins is less than part of spins_per_delay loop
> iterations (so, it is initial iterations, and some iterations after
> pg_usleep, because SpinDelayStatus.spins is reset after sleep).

I quite strongly think it's a good idea to change this at the same time
as the other changes you're proposing here.  I think it's a worthwhile
thing to pursue, but that change also has quite the potential for
regressing in some cases.

- Andres




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-09-14 Thread Jesper Pedersen

On 09/11/2017 11:01 AM, Jesper Pedersen wrote:

Thanks for working on this !



Moved to "Ready for Committer".

Best regards,
 Jesper





Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-09-11 Thread Jesper Pedersen

Hi,

On 09/08/2017 03:35 PM, Sokolov Yura wrote:

I'm seeing

-M prepared: Up to 11% improvement
-M prepared -S: No improvement, no regression ("noise")
-M prepared -N: Up to 12% improvement

for all runs the improvement shows up the closer you get to the number
of CPU threads, or above. Although I'm not seeing the same
improvements as you on very large client counts there are definitely
improvements :)


It is expected:
- patch "fixes NUMA": for example, it doesn't give improvement on 1 socket
   at all (I've tested it using numactl to bind to 1 socket)
- and certainly it gives less improvement on 2 sockets than on 4 sockets
   (and 28 cores vs 72 cores also gives difference),
- one of the hot points was CLogControlLock, and it was fixed with
   "Group mode for CLOG updates" [1]



I'm planning to re-test that patch.



+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode,
LWLockMode waitmode)

I'll leave it to the Committer to decide if this method is too big to
be "inline".


GCC 4.9 doesn't want to inline it without the directive, and the function
call is then noticeable in the profile.

The attachment contains a version with all suggestions applied except the
removal of "inline".



Yes, ideally the method will be kept at "inline".


Open questions:
---
* spins_per_delay as extern
* Calculation of skip_wait_list


Currently the calculation of skip_wait_list is mostly empirical (i.e. the
best I measured).



Ok, good to know.


I strongly think that instead of spins_per_delay something dependent
on the concrete lock should be used. I tried to store it in the LWLock
itself, but it was worse.


Yes, LWLock should be kept as small as possible, and cache line aligned 
due to the cache storms, as shown by perf c2c.
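
(For reference, the in-core definition already pads each lock to a fixed
size so that arrays of locks stay cache-line friendly; roughly, paraphrasing
lwlock.h from memory:

typedef union LWLockPadded
{
    LWLock  lock;
    char    pad[LWLOCK_PADDED_SIZE];    /* keeps each lock in its own slot */
} LWLockPadded;

so any extra per-lock tuning field would also eat into that fixed budget.)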



Recently I understood it should be stored in an array indexed by tranche,
but I didn't implement it yet, and therefore didn't measure.



Different constants for the LWLock could have an impact, but the 
constants would also be dependent on machine setup, and work load.


Thanks for working on this !

Best regards,
 Jesper




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-09-08 Thread Sokolov Yura

Hi, Jesper

Thank you for reviewing.

On 2017-09-08 18:33, Jesper Pedersen wrote:

Hi,

On 07/18/2017 01:20 PM, Sokolov Yura wrote:

I'm sending rebased version with couple of one-line tweaks.
(less skip_wait_list on shared lock, and don't update spin-stat on acquiring)




I have been running this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD
setup (1 to 375 clients on logged tables).

I'm seeing

-M prepared: Up to 11% improvement
-M prepared -S: No improvement, no regression ("noise")
-M prepared -N: Up to 12% improvement

for all runs the improvement shows up the closer you get to the number
of CPU threads, or above. Although I'm not seeing the same
improvements as you on very large client counts there are definitely
improvements :)


It is expected:
- the patch "fixes NUMA": for example, it doesn't give any improvement on 1 socket
  at all (I've tested it using numactl to bind to 1 socket)
- and certainly it gives less improvement on 2 sockets than on 4 sockets
  (and 28 cores vs 72 cores also makes a difference),
- one of the hot points was CLogControlLock, and it was fixed with
  "Group mode for CLOG updates" [1]



Some comments:
==

lwlock.c:
-

...

+static void
+add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)

Add a method description.


The neighboring functions above also have no description; that is why I
didn't add one.
But OK, I've added it now.



+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode,
LWLockMode waitmode)

I'll leave it to the Committer to decide if this method is too big to
be "inline".


GCC 4.9 doesn't want to inline it without the directive, and the function
call is then noticeable in the profile.

The attachment contains a version with all suggestions applied except the
removal of "inline".


Open questions:
---
* spins_per_delay as extern
* Calculation of skip_wait_list


Currently the calculation of skip_wait_list is mostly empirical (i.e. the
best I measured).

I strongly think that instead of spins_per_delay something dependent
on the concrete lock should be used. I tried to store it in the LWLock
itself, but it was worse.
Recently I understood it should be stored in an array indexed by tranche,
but I didn't implement it yet, and therefore didn't measure.

[1]: https://commitfest.postgresql.org/14/358/

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Mon, 29 May 2017 09:25:41 +
Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.

SpinDelayStatus should be function global to count iterations correctly,
and produce more correct delays.

Also if spin didn't spin, do not count it in spins_per_delay correction.
It wasn't necessary before cause delayStatus were used only in contended
cases.
---
 src/backend/storage/lmgr/lwlock.c | 37 +
 src/backend/storage/lmgr/s_lock.c |  5 +
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82a1cf5150..7714085663 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -323,6 +323,16 @@ get_lwlock_stats_entry(LWLock *lock)
 	}
 	return lwstats;
 }
+
+/* just shortcut to not declare lwlock_stats variable at the end of function */
+static void
+add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)
+{
+	lwlock_stats *lwstats;
+
+	lwstats = get_lwlock_stats_entry(lock);
+	lwstats->spin_delay_count += delayStatus->delays;
+}
 #endif			/* LWLOCK_STATS */
 
 
@@ -800,13 +810,9 @@ static void
 LWLockWaitListLock(LWLock *lock)
 {
 	uint32		old_state;
-#ifdef LWLOCK_STATS
-	lwlock_stats *lwstats;
-	uint32		delays = 0;
-
-	lwstats = get_lwlock_stats_entry(lock);
-#endif
+	SpinDelayStatus delayStatus;
 
+	init_local_spin_delay(&delayStatus);
 	while (true)
 	{
 		/* always try once to acquire lock directly */
@@ -815,20 +821,10 @@ LWLockWaitListLock(LWLock *lock)
 			break;/* got lock */
 
 		/* and then spin without atomic operations until lock is released */
+		while (old_state & LW_FLAG_LOCKED)
 		{
-			SpinDelayStatus delayStatus;
-
-			init_local_spin_delay(&delayStatus);
-
-			while (old_state & LW_FLAG_LOCKED)
-			{
-				perform_spin_delay(&delayStatus);
-				old_state = pg_atomic_read_u32(&lock->state);
-			}
-#ifdef LWLOCK_STATS
-			delays += delayStatus.delays;
-#endif
-			finish_spin_delay(&delayStatus);
+			perform_spin_delay(&delayStatus);
+			old_state = pg_atomic_read_u32(&lock->state);
 		}
 
 		/*
@@ -836,9 +832,10 @@ LWLockWaitListLock(LWLock *lock)
 		 * we're attempting to get it again.
 		 */
 	}
+	finish_spin_delay(&delayStatus);
 
 #ifdef LWLOCK_STATS
-	lwstats->spin_delay_count += delays;
+	add_lwlock_stats_spin_stat(lock, &delayStatus);
 #endif
 }
 
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 40d8156331..b75c432773 100644
--- 

Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-09-08 Thread Jesper Pedersen

Hi,

On 07/18/2017 01:20 PM, Sokolov Yura wrote:

I'm sending rebased version with couple of one-line tweaks.
(less skip_wait_list on shared lock, and don't update spin-stat on acquiring)




I have been running this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD 
setup (1 to 375 clients on logged tables).


I'm seeing

-M prepared: Up to 11% improvement
-M prepared -S: No improvement, no regression ("noise")
-M prepared -N: Up to 12% improvement

for all runs the improvement shows up the closer you get to the number 
of CPU threads, or above. Although I'm not seeing the same improvements 
as you on very large client counts there are definitely improvements :)


Some comments:
==

lwlock.c:
-
+ * This race is avoiding by taking lock for wait list using CAS with a old
+ * state value, so it could succeed only if lock is still held. Necessary flag
+ * is set with same CAS.
+ *
+ * Inside LWLockConflictsWithVar wait list lock is reused to protect variable
+ * value. So first it is acquired to check variable value, then flags are
+ * set with second CAS before queueing to wait list to ensure lock were not
+ * released yet.

 * This race is avoided by taking a lock for the wait list using CAS with the old
 * state value, so it would only succeed if lock is still held. Necessary flag
 * is set using the same CAS.
 *
 * Inside LWLockConflictsWithVar the wait list lock is reused to protect the variable
 * value. First the lock is acquired to check the variable value, then flags are
 * set with a second CAS before queuing to the wait list in order to ensure that the lock was not
 * released yet.


+static void
+add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)

Add a method description.

+   /*
+    * This barrier is never needed for correctness, and it is no-op
+    * on x86. But probably first iteration of cas loop in
+    * ProcArrayGroupClearXid will succeed oftener with it.
+    */

* "more often"

+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode 
waitmode)


I'll leave it to the Committer to decide if this method is too big to be 
"inline".


+   /*
+    * We intentionally do not call finish_spin_delay here, cause loop above
+    * usually finished in queueing into wait list on contention, and doesn't
+    * reach spins_per_delay (and so, doesn't sleep inside of
+    * perform_spin_delay). Also, different LWLocks has very different
+    * contention pattern, and it is wrong to update spin-lock statistic based
+    * on LWLock contention.
+    */

/*
 * We intentionally do not call finish_spin_delay here, because the loop above
 * usually finished by queuing into the wait list on contention, and doesn't
 * reach spins_per_delay thereby doesn't sleep inside of
 * perform_spin_delay. Also, different LWLocks has very different
 * contention pattern, and it is wrong to update spin-lock statistic based
 * on LWLock contention.
 */


s_lock.c:
-
+   if (status->spins == 0)
+   /* but we didn't spin either, so ignore */
+   return;

Use { } for the if, or move the comment out of the nesting for readability.

Open questions:
---
* spins_per_delay as extern
* Calculation of skip_wait_list


You could run the patch through pgindent too.

Passes make check-world.

Status: Waiting on Author

Best regards,
 Jesper




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-08-10 Thread Sokolov Yura

On 2017-07-18 20:20, Sokolov Yura wrote:

On 2017-06-05 16:22, Sokolov Yura wrote:

Good day, everyone.

This patch improves performance of contended LWLock.
Patch makes lock acquiring in single CAS loop:
1. LWLock->state is read, and ability for lock acquiring is detected.
  If there is a possibility to take the lock, a CAS is tried.
  If the CAS was successful, the lock is acquired (same as in the original version).
2. But if the lock is currently held by another backend, we check the ability
  to take the WaitList lock. If the wait list lock is not held by anyone, a CAS
  is performed to take the WaitList lock and set LW_FLAG_HAS_WAITERS at once.
  If that CAS was successful, then the LWLock was still held at the moment the
  wait list lock was taken - this proves the correctness of the new algorithm.
  The Proc is then queued to the wait list.
3. Otherwise spin_delay is performed, and the loop returns to step 1.



I'm sending rebased version with couple of one-line tweaks.
(less skip_wait_list on shared lock, and don't update spin-stat on acquiring)


With regards,


Here are results for zipfian distribution (50/50 r/w) in conjunction
with "lazy hash table for XidInMVCCSnapshot":
(https://www.postgresql.org/message-id/642da34694800dab801f04c62950ce8a%40postgrespro.ru)

clients | master | hashsnap2 | hashsnap2_lwlock
--------+--------+-----------+-----------------
 10 | 203384 |203813 |   204852
 20 | 334344 |334268 |   363510
 40 | 228496 |231777 |   383820
 70 | 146892 |148173 |   221326
110 |  99741 |111580 |   157327
160 |  65257 | 81230 |   112028
230 |  38344 | 56790 |77514
310 |  22355 | 39249 |55907
400 |  13402 | 26899 |39742
500 |   8382 | 17855 |28362
650 |   5313 | 11450 |17497
800 |   3352 |  7816 |11030

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-07-18 Thread Sokolov Yura

On 2017-06-05 16:22, Sokolov Yura wrote:

Good day, everyone.

This patch improves performance of contended LWLock.
Patch makes lock acquiring in single CAS loop:
1. LWLock->state is read, and ability for lock acquiring is detected.
  If there is a possibility to take the lock, a CAS is tried.
  If the CAS was successful, the lock is acquired (same as in the original version).
2. But if the lock is currently held by another backend, we check the ability
  to take the WaitList lock. If the wait list lock is not held by anyone, a CAS
  is performed to take the WaitList lock and set LW_FLAG_HAS_WAITERS at once.
  If that CAS was successful, then the LWLock was still held at the moment the
  wait list lock was taken - this proves the correctness of the new algorithm.
  The Proc is then queued to the wait list.
3. Otherwise spin_delay is performed, and the loop returns to step 1.



I'm sending rebased version with couple of one-line tweaks.
(less skip_wait_list on shared lock, and don't update spin-stat on acquiring)


With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

From dc8d9489074973ba589adf9b0239e1467cbba44b Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Mon, 29 May 2017 09:25:41 +
Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.

SpinDelayStatus should be function global to count iterations correctly,
and produce more correct delays.

Also if spin didn't spin, do not count it in spins_per_delay correction.
It wasn't necessary before cause delayStatus were used only in contended
cases.
---
 src/backend/storage/lmgr/lwlock.c | 36 
 src/backend/storage/lmgr/s_lock.c |  3 +++
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82a1cf5150..86966b804e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -323,6 +323,15 @@ get_lwlock_stats_entry(LWLock *lock)
 	}
 	return lwstats;
 }
+
+static void
+add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)
+{
+	lwlock_stats *lwstats;
+
+	lwstats = get_lwlock_stats_entry(lock);
+	lwstats->spin_delay_count += delayStatus->delays;
+}
 #endif			/* LWLOCK_STATS */
 
 
@@ -800,13 +809,9 @@ static void
 LWLockWaitListLock(LWLock *lock)
 {
 	uint32		old_state;
-#ifdef LWLOCK_STATS
-	lwlock_stats *lwstats;
-	uint32		delays = 0;
-
-	lwstats = get_lwlock_stats_entry(lock);
-#endif
+	SpinDelayStatus delayStatus;
 
+	init_local_spin_delay(&delayStatus);
 	while (true)
 	{
 		/* always try once to acquire lock directly */
@@ -815,20 +820,10 @@ LWLockWaitListLock(LWLock *lock)
 			break;/* got lock */
 
 		/* and then spin without atomic operations until lock is released */
+		while (old_state & LW_FLAG_LOCKED)
 		{
-			SpinDelayStatus delayStatus;
-
-			init_local_spin_delay(&delayStatus);
-
-			while (old_state & LW_FLAG_LOCKED)
-			{
-				perform_spin_delay(&delayStatus);
-				old_state = pg_atomic_read_u32(&lock->state);
-			}
-#ifdef LWLOCK_STATS
-			delays += delayStatus.delays;
-#endif
-			finish_spin_delay(&delayStatus);
+			perform_spin_delay(&delayStatus);
+			old_state = pg_atomic_read_u32(&lock->state);
 		}
 
 		/*
@@ -836,9 +831,10 @@ LWLockWaitListLock(LWLock *lock)
 		 * we're attempting to get it again.
 		 */
 	}
+	finish_spin_delay(&delayStatus);
 
 #ifdef LWLOCK_STATS
-	lwstats->spin_delay_count += delays;
+	add_lwlock_stats_spin_stat(lock, &delayStatus);
 #endif
 }
 
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 40d8156331..f3e0fbc602 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -177,6 +177,9 @@ finish_spin_delay(SpinDelayStatus *status)
 	if (status->cur_delay == 0)
 	{
 		/* we never had to delay */
+		if (status->spins == 0)
+			/* but we didn't spin either, so ignore */
+			return;
 		if (spins_per_delay < MAX_SPINS_PER_DELAY)
 			spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
 	}
-- 
2.11.0


From 067282cce98c77b9f1731cf7d302c37fcc70d59e Mon Sep 17 00:00:00 2001
From: Sokolov Yura 
Date: Tue, 30 May 2017 14:05:26 +0300
Subject: [PATCH 2/6] Perform atomic LWLock acquirement or putting into wait
 list.

Before this change, an lwlock was acquired in several steps:
- first try to acquire it lock-free (one CAS spin loop)
- if that was not successful, queue self into the wait list
  (two CAS spin loops (in fact, 3 loops, because of atomic_fetch_and_or))
- try again to acquire the lock (one more CAS loop)
- if the second attempt was successful, dequeue ourselves from the wait queue
  (two CAS spin loops).

With this change, the lock, or the wait list lock, is acquired using a single
CAS loop:
- if it is likely that we can acquire the desired lock mode, we do a CAS for that
- otherwise we try a CAS to lock the wait list and set the necessary
  flag (LW_FLAG_HAS_WAITERS) at once.
  (LWLockWaitListUnlock still performs a second CAS loop for releasing
   the wait list lock).

Correctness provided by the fact we are trying to 

Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-06-06 Thread Robert Haas
On Mon, Jun 5, 2017 at 9:22 AM, Sokolov Yura wrote:
> Good day, everyone.
>
> This patch improves performance of contended LWLock.
> It was tested on 4 socket 72 core x86 server (144 HT) Centos 7.1
> gcc 4.8.5
>
> Patch makes lock acquiring in single CAS loop:
> 1. LWLock->state is read, and ability for lock acquiring is detected.
>   If there is a possibility to take the lock, a CAS is tried.
>   If the CAS was successful, the lock is acquired (same as in the original version).
> 2. But if the lock is currently held by another backend, we check the ability
>   to take the WaitList lock. If the wait list lock is not held by anyone, a CAS
>   is performed to take the WaitList lock and set LW_FLAG_HAS_WAITERS at once.
>   If that CAS was successful, then the LWLock was still held at the moment the
>   wait list lock was taken - this proves the correctness of the new algorithm.
>   The Proc is then queued to the wait list.
> 3. Otherwise spin_delay is performed, and the loop returns to step 1.

Interesting work.  Thanks for posting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-06-05 Thread Sokolov Yura

Good day, everyone.

This patch improves performance of contended LWLock.
It was tested on 4 socket 72 core x86 server (144 HT) Centos 7.1
gcc 4.8.5

Results:

pgbench -i -s 300 + pgbench --skip-some-updates
Clients |  master | patched
--------+---------+--------
     30 |     32k |     32k
     50 |     53k |     53k
    100 |    102k |    107k
    150 |    120k |    131k
    200 |    129k |    148k
    252 |    121k |    159k
    304 |    101k |    163k
    356 |     79k |    166k
    395 |     70k |    166k
    434 |     62k |    166k

pgbench -i -s 900 + pgbench
Clients |  master | patched
--------+---------+--------
     30 |     31k |     31k
     50 |     50k |     49k
    100 |     75k |     78k
    150 |     92k |     96k
    200 |     87k |    106k
    252 |     69k |    108k
    304 |     55k |    110k
    356 |     48k |    110k
    395 |     44k |    108k
    434 |     40k |    109k

I cannot claim read-only benchmarks are affected in any way: results
are unstable between postgresql restarts and vary in the same interval
for both versions, although some micro-optimizations were made to ensure
this parity.
Since there are investigations into changing the shared buffers structure
(by removing the LWLock from the buffer hash partition lock [1]), I stopped
attempts to gather more reliable statistics for read-only benchmarks.

[1] Takashi Horikawa "Don't stop the world"
https://www.pgcon.org/2017/schedule/events/1078.en.html

Also, it looks like the patch doesn't affect single NUMA node performance
significantly.

postgresql.conf is tuned with following parameters:
shared_buffers = 32GB
max_connections = 500
fsync = on
synchronous_commit = on
full_page_writes = off
wal_buffers = 16MB
wal_writer_flush_after = 16MB
commit_delay = 2
max_wal_size = 16GB

Patch changes the way LWLock is acquired.

Before the patch, the lock is acquired using the following procedure:
1. First LWLock->state is checked for the ability to take the lock, and a CAS
  is performed (either the lock could be acquired or not). It is retried if
  the status has changed.
2. If the lock is not acquired at step 1, the Proc is put into the wait list,
  using the LW_FLAG_LOCKED bit of LWLock->state as a spin-lock for the wait list.
  In fact, a profile shows that LWLockWaitListLock spends a lot of CPU on a
  contended LWLock (up to 20%).
  An additional CAS loop is inside pg_atomic_fetch_or_u32 for setting
  LW_FLAG_HAS_WAITERS. And releasing the wait list lock is another CAS loop
  on the same LWLock->state.
3. After that the state is checked again, because the lock could have been
  released before the Proc was queued into the wait list.
4. If the lock was acquired at step 3, the Proc should be dequeued from the wait
  list (another spin lock and CAS loop). And this operation could be
  quite heavy, because the whole list has to be traversed.

Patch makes lock acquiring in single CAS loop:
1. LWLock->state is read, and ability for lock acquiring is detected.
  If there is a possibility to take the lock, a CAS is tried.
  If the CAS was successful, the lock is acquired (same as in the original version).
2. But if the lock is currently held by another backend, we check the ability
  to take the WaitList lock. If the wait list lock is not held by anyone, a CAS
  is performed to take the WaitList lock and set LW_FLAG_HAS_WAITERS at once.
  If that CAS was successful, then the LWLock was still held at the moment the
  wait list lock was taken - this proves the correctness of the new algorithm.
  The Proc is then queued to the wait list.
3. Otherwise spin_delay is performed, and the loop returns to step 1.

So the invariant is preserved:
- either we grab the lock,
- or we set HAS_WAITERS while the lock was held by someone, and queue
  ourselves into the wait list.
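
For readers who don't have lwlock.c open, the state bits being manipulated
here are roughly the following (paraphrased from memory of lwlock.c; check
the source for the authoritative values):

#define LW_FLAG_HAS_WAITERS     ((uint32) 1 << 30)  /* wait list is non-empty */
#define LW_FLAG_RELEASE_OK      ((uint32) 1 << 29)
#define LW_FLAG_LOCKED          ((uint32) 1 << 28)  /* wait-list spinlock */

#define LW_VAL_EXCLUSIVE        ((uint32) 1 << 24)
#define LW_VAL_SHARED           1

#define LW_LOCK_MASK            ((uint32) ((1 << 25) - 1))  /* all lock-count bits */

A successful CAS that sets LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS while the
LW_LOCK_MASK bits are still non-zero is exactly what makes queueing safe here.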

Algorithm for LWLockWaitForVar is also refactored. New version is:
1. If lock is not held by anyone, it exits immediately.
2. Otherwise it is checked for the ability to take the WaitList lock, because
  the variable is protected with it. If so, a CAS is performed, and if it is
  successful, the loop breaks to step 4.
3. Otherwise spin_delay is performed, and the loop returns to step 1.
4. var is checked for change.
  If it was changed, we unlock the wait list lock and exit.
  Note: it could not change in the following steps because we are holding
  the wait list lock.
5. Otherwise a CAS setting the necessary flags is performed. If it succeeds,
  then queue Proc to the wait list and exit.
6. If the CAS failed, then there is a possibility for the LWLock to be already
  released - if so then we should unlock the wait list and exit.
7. Otherwise the loop returns to step 5.

So the invariant is preserved:
- either we found the LWLock free,
- or we found the variable changed,
- or we set the flags and queued ourselves while the LWLock was held.

Spin_delay is not performed at step 7, because we should release the wait
list lock as soon as possible.

Additionally, taking the wait list lock is skipped in both algorithms if
SpinDelayStatus.spins is less than a fraction of the spins_per_delay loop
iterations (i.e. the initial iterations, and some iterations after
pg_usleep, because SpinDelayStatus.spins is reset after sleep). It
effectively turns the LWLock into a spinlock in the low-contention case. It
was made