Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 2017-10-20 11:54, Sokolov Yura wrote:

Hello,

On 2017-10-19 19:46, Andres Freund wrote:

On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> > + init_local_spin_delay(&delayStatus);
>
> The way you moved this around has the disadvantage that we now do this -
> a number of writes - even in the very common case where the lwlock can
> be acquired directly.

Excuse me, I don't understand fine.
Do you complain against init_local_spin_delay placed here?

Yes.

I could place it before perform_spin_delay under `if (!spin_inited)` if you think it is absolutely necessary.

I looked at the assembly and remembered that the last commit simplifies `init_local_spin_delay` to just two or three writes of zeroes (it looks like the compiler combines two 4-byte writes into one 8-byte write). Compared to the surrounding code (especially in LWLockAcquire itself), this overhead is negligible.

Though, I found that there is a benefit in calling LWLockAttemptLockOnce before entering the loop of LWLockAttemptLockOrQueue calls in LWLockAcquire (when there is not much contention). And this way, the `inline` decorator for LWLockAttemptLockOrQueue could be omitted. Given that clang doesn't want to inline this function, it could be the best way. Should I add such a commit to the patch?

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
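[To make the fast-path point above more concrete, here is a simplified, self-contained sketch using plain C11 atomics rather than PostgreSQL's atomics and s_lock machinery. All type and function names (fake_lock_t, try_acquire_once, ...) are illustrative assumptions, not the patch's actual code: one uncontended CAS is attempted first, and the spin-delay bookkeeping is initialized only once contention is actually observed.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	atomic_uint_fast32_t state;
} fake_lock_t;

typedef struct
{
	int			spins;
	int			delays;
} spin_delay_t;

#define LOCK_BIT 1u

/* One plain attempt: succeeds only if the lock looks free right now. */
static bool
try_acquire_once(fake_lock_t *lock)
{
	uint_fast32_t old = atomic_load_explicit(&lock->state, memory_order_relaxed);

	return (old & LOCK_BIT) == 0 &&
		atomic_compare_exchange_strong(&lock->state, &old, old | LOCK_BIT);
}

static void
acquire(fake_lock_t *lock)
{
	spin_delay_t delay;
	bool		spin_inited = false;

	/* Fast path: the common uncontended case pays for a single CAS only. */
	if (try_acquire_once(lock))
		return;

	for (;;)
	{
		if (try_acquire_once(lock))
			return;

		/* Initialize spin-delay bookkeeping lazily, on first contention. */
		if (!spin_inited)
		{
			delay.spins = 0;		/* init_local_spin_delay() equivalent */
			delay.delays = 0;
			spin_inited = true;
		}
		delay.spins++;				/* stand-in for perform_spin_delay() */
	}
}

[In the patch the fast path corresponds to calling LWLockAttemptLockOnce before the LWLockAttemptLockOrQueue loop; the sketch models only the control flow, not the wait-list handling.]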
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hello,

On 2017-10-19 19:46, Andres Freund wrote:

On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> > + init_local_spin_delay(&delayStatus);
>
> The way you moved this around has the disadvantage that we now do this -
> a number of writes - even in the very common case where the lwlock can
> be acquired directly.

Excuse me, I don't understand fine.
Do you complain against init_local_spin_delay placed here?

Yes.

I could place it before perform_spin_delay under `if (!spin_inited)` if you think it is absolutely necessary.

Placing it in other place will complicate code.
Or you complain against setting `mask` and `add`?

That seems right.

In both cases, I think simpler version should be accepted first. It acts as algorithm definition. And it already gives measurable improvement.

Well, in scalability. I'm less sure about uncontended performance.

> > + * We intentionally do not call finish_spin_delay here, because
> > the loop
> > + * above usually finished by queuing into the wait list on
> > contention, and
> > + * doesn't reach spins_per_delay thereby doesn't sleep inside of
> > + * perform_spin_delay. Also, different LWLocks has very different
> > + * contention pattern, and it is wrong to update spin-lock
> > statistic based
> > + * on LWLock contention.
> > + */
>
> Huh? This seems entirely unconvincing. Without adjusting this here we'll
> just spin the same way every iteration. Especially for the case where
> somebody else holds LW_FLAG_LOCKED that's quite bad.

LWLock's are very different. Some of them are always short-term (BufferLock), others are always locked for a long time.

That seems not particularly relevant. The same is true for spinlocks. The relevant question isn't how long the lwlock is held, it's how long LW_FLAG_LOCKED is held - and that should only depend on contention (i.e. bus speed, amount of times put into sleep while holding lock, etc), not on how long the lock is held.

I've tried to place this delay into lock itself (it has 2 free bytes), but this attempt performed worse.

That seems unsurprising - there's a *lot* of locks, and we'd have to tune all of them. Additionally there's a bunch of platforms where we do *not* have free bytes (consider the embedding in BufferTag).

Now I understand, that delays should be stored in array indexed by tranche. But I have no time to test this idea. And I doubt it will give cardinally better results (ie > 5%), so I think it is better to accept patch in this way, and then experiment with per-tranche delay.

I don't think tranches have any decent predictive power here.

Look at "Make acquiring LWLock to look more like spinlock". For the first `skip_wait_list` iterations there is no attempt to queue self into the wait list. That gives most of the improvement. (Without it there is just no degradation at a high number of clients, but a little decline at low client counts, because the current algorithm is also a little "spinny": it attempts to acquire the lock again after queueing into the wait list.)

`skip_wait_list` depends on the tranche very much. And a per-tranche `skip_wait_list` should be calculated based on the ability to acquire the lock without any sleep, i.e. without both queuing into the wait list and falling into `pg_usleep` inside `perform_spin_delay`.

I suppose it is obvious that `spins_per_delay` has to be proportional to `skip_wait_list` (it should be at least greater than `skip_wait_list`). Therefore `spins_per_delay` should also be runtime-dependent on the lock's tranche. Am I wrong?
Without that per-tranche auto-tuning, I think it is better not to touch `spins_per_delay` inside `LWLockAttemptLockOrQueue`.

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
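[As an aside, the per-tranche bookkeeping discussed in this exchange could look roughly like the following hypothetical, self-contained illustration. It is not part of the patch: the array, the constants and the adjustment rule are assumptions, only loosely modeled on the existing spins_per_delay heuristic in s_lock.c.]

#include <stdbool.h>

#define MAX_TRANCHES		128
#define MIN_SPINS_PER_DELAY	10
#define MAX_SPINS_PER_DELAY	1000

/* One adaptive spin budget per tranche (per backend in a real implementation). */
static int	tranche_spins[MAX_TRANCHES];

static int
tranche_spins_per_delay(int tranche)
{
	if (tranche_spins[tranche] == 0)
		tranche_spins[tranche] = 100;	/* arbitrary starting point */
	return tranche_spins[tranche];
}

/*
 * Called after an acquisition attempt: "spins" is how many times we spun,
 * "slept" says whether we ever had to fall into pg_usleep-style sleeping.
 * The adjustment constants are illustrative, not tuned values.
 */
static void
tranche_finish_spin_delay(int tranche, int spins, bool slept)
{
	int			cur = tranche_spins_per_delay(tranche);

	if (spins == 0)
		return;					/* didn't spin at all: nothing learned */

	if (!slept)
		cur += 100;				/* spinning alone sufficed: allow more spins */
	else
		cur -= 1;				/* we had to sleep: shrink the budget slowly */

	if (cur < MIN_SPINS_PER_DELAY)
		cur = MIN_SPINS_PER_DELAY;
	if (cur > MAX_SPINS_PER_DELAY)
		cur = MAX_SPINS_PER_DELAY;
	tranche_spins[tranche] = cur;
}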
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> > > + init_local_spin_delay(&delayStatus);
> >
> > The way you moved this around has the disadvantage that we now do this -
> > a number of writes - even in the very common case where the lwlock can
> > be acquired directly.
>
> Excuse me, I don't understand fine.
> Do you complain against init_local_spin_delay placed here?

Yes.

> Placing it in other place will complicate code.
> Or you complain against setting `mask` and `add`?

That seems right.

> In both cases, I think simpler version should be accepted first. It acts
> as algorithm definition. And it already gives measurable improvement.

Well, in scalability. I'm less sure about uncontended performance.

> > > + * We intentionally do not call finish_spin_delay here, because
> > > the loop
> > > + * above usually finished by queuing into the wait list on
> > > contention, and
> > > + * doesn't reach spins_per_delay thereby doesn't sleep inside of
> > > + * perform_spin_delay. Also, different LWLocks has very different
> > > + * contention pattern, and it is wrong to update spin-lock
> > > statistic based
> > > + * on LWLock contention.
> > > + */
> >
> > Huh? This seems entirely unconvincing. Without adjusting this here we'll
> > just spin the same way every iteration. Especially for the case where
> > somebody else holds LW_FLAG_LOCKED that's quite bad.
>
> LWLock's are very different. Some of them are always short-term
> (BufferLock), others are always locked for a long time.

That seems not particularly relevant. The same is true for spinlocks. The relevant question isn't how long the lwlock is held, it's how long LW_FLAG_LOCKED is held - and that should only depend on contention (i.e. bus speed, amount of times put into sleep while holding lock, etc), not on how long the lock is held.

> I've tried to place this delay into lock itself (it has 2 free bytes),
> but this attempt performed worse.

That seems unsurprising - there's a *lot* of locks, and we'd have to tune all of them. Additionally there's a bunch of platforms where we do *not* have free bytes (consider the embedding in BufferTag).

> Now I understand, that delays should be stored in array indexed by
> tranche. But I have no time to test this idea. And I doubt it will give
> cardinally better results (ie > 5%), so I think it is better to accept
> patch in this way, and then experiment with per-tranche delay.

I don't think tranches have any decent predictive power here.

Greetings,

Andres Freund
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 2017-10-19 02:28, Andres Freund wrote:

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:

Algorithm for LWLockWaitForVar is also refactored. New version is:
1. If lock is not held by anyone, it immediately exit.
2. Otherwise it is checked for ability to take WaitList lock, because variable is protected with it. If so, CAS is performed, and if it is successful, loop breaked to step 4.
3. Otherwise spin_delay perfomed, and loop returns to step 1.
4. var is checked for change. If it were changed, we unlock wait list lock and exit. Note: it could not change in following steps because we are holding wait list lock.
5. Otherwise CAS on setting necessary flags is performed. If it succeed, then queue Proc to wait list and exit.
6. If CAS failed, then there is possibility for LWLock to be already released - if so then we should unlock wait list and exit.
7. Otherwise loop returns to step 5.

So invariant is preserved:
- either we found LWLock free,
- or we found changed variable,
- or we set flags and queue self while LWLock were held.

Spin_delay is not performed at step 7, because we should release wait list lock as soon as possible.

That seems unconvincing - by not delaying you're more likely to *increase* the time till the current locker that holds the lock can release the lock.

But why? If our CAS wasn't satisfied, then somebody else's was, so there is always overall progress. And if we slept in this loop, then other waiters would spin in the first loop of this function. But given the concrete usage of LWLockWaitForVar, it is probably not too bad to hold other waiters in the first loop of this function (they loop until we release the WaitList lock or the lock is released entirely). I'm in doubt about which is better. May I keep it unchanged for now?

In fact, if a waiter sleeps here, the lock holder will not be able to set the variable we are waiting for, and therefore will not release the lock. It is stated in the comment for the loop:

+ * Note: value could not change again cause we are holding WaitList lock.

So delaying here would certainly degrade performance.

BTW, I found a small mistake in this place: I forgot to set LW_FLAG_LOCKED in the state before this CAS. It looks like it wasn't a real error, because the CAS always failed at the first loop iteration (since the real `lock->state` already had LW_FLAG_LOCKED set), and after the failed CAS the state absorbs the value from `lock->state`. I'll fix it. The attachment contains a version with the fix.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Attachment: lwlock_v4.patch.gz (GNU Zip compressed data)
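[For readers following the discussion, here is a simplified, self-contained model of the seven-step LWLockWaitForVar loop quoted above. It uses plain C11 atomics, an illustrative bit layout, and stub helpers (queue_self, wait_list_unlock) instead of the real lwlock.c machinery, so it demonstrates only the control flow and the invariant, not the actual implementation.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VAL_EXCLUSIVE		0x1u	/* lock held exclusively */
#define FLAG_LOCKED		0x2u	/* wait-list (and variable) spin lock */
#define FLAG_HAS_WAITERS	0x4u

typedef struct
{
	atomic_uint_fast32_t state;
	uint64_t	var;			/* protected by FLAG_LOCKED */
} fake_lwlock_t;

static void
wait_list_unlock(fake_lwlock_t *lock)
{
	atomic_fetch_and(&lock->state, ~(uint_fast32_t) FLAG_LOCKED);
}

static void
queue_self(fake_lwlock_t *lock)
{
	/* enqueue the current process (elided), then release the wait-list lock */
	wait_list_unlock(lock);
}

/* Returns true if the caller queued itself and must sleep; false otherwise. */
static bool
wait_for_var(fake_lwlock_t *lock, uint64_t oldval, uint64_t *newval)
{
	for (;;)
	{
		uint_fast32_t old = atomic_load(&lock->state);

		if ((old & VAL_EXCLUSIVE) == 0)
			return false;		/* 1. lock is not held: exit immediately */

		if ((old & FLAG_LOCKED) == 0 &&
			atomic_compare_exchange_weak(&lock->state, &old, old | FLAG_LOCKED))
			break;				/* 2. took the wait-list lock */

		/* 3. a perform_spin_delay() equivalent would go here, then retry */
	}

	if (lock->var != oldval)	/* 4. safe to read: we hold the wait-list lock */
	{
		*newval = lock->var;
		wait_list_unlock(lock);
		return false;
	}

	for (;;)
	{
		uint_fast32_t old = atomic_load(&lock->state);

		if ((old & VAL_EXCLUSIVE) == 0)
		{
			wait_list_unlock(lock);		/* 6. lock was released meanwhile */
			return false;
		}
		/* 5. set the flags; the expected value already carries FLAG_LOCKED */
		if (atomic_compare_exchange_weak(&lock->state, &old,
										 old | FLAG_LOCKED | FLAG_HAS_WAITERS))
		{
			queue_self(lock);
			return true;
		}
		/* 7. CAS failed: retry immediately, without a spin delay */
	}
}

[Note how step 7 retries without delaying, and how the second CAS's expected value already carries FLAG_LOCKED, which is the detail the fix mentioned above is about.]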
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi,

On 2017-10-19 03:03, Andres Freund wrote:

Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:

 /*
  * Internal function that tries to atomically acquire the lwlock in the passed
- * in mode.
+ * in mode. If it could not grab the lock, it doesn't puts proc into wait
+ * queue.
  *
- * This function will not block waiting for a lock to become free - that's the
- * callers job.
+ * It is used only in LWLockConditionalAcquire.
  *
- * Returns true if the lock isn't free and we need to wait.
+ * Returns true if the lock isn't free.
  */
 static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)

This now has become a fairly special cased function, I'm not convinced it makes much sense with the current naming and functionality.

It just doesn't need to be as complex as LWLockAttemptLockOrQueueSelf. Their functionality is too different.

+/*
+ * Internal function that tries to atomically acquire the lwlock in the passed
+ * in mode or put it self into waiting queue with waitmode.
+ * This function will not block waiting for a lock to become free - that's the
+ * callers job.
+ *
+ * Returns true if the lock isn't free and we are in a wait queue.
+ */
+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
+{
+	uint32		old_state;
+	SpinDelayStatus delayStatus;
+	bool		lock_free;
+	uint32		mask,
+				add;
+
+	AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		mask = LW_LOCK_MASK;
+		add = LW_VAL_EXCLUSIVE;
+	}
+	else
+	{
+		mask = LW_VAL_EXCLUSIVE;
+		add = LW_VAL_SHARED;
+	}
+
+	init_local_spin_delay(&delayStatus);

The way you moved this around has the disadvantage that we now do this - a number of writes - even in the very common case where the lwlock can be acquired directly.

Excuse me, I don't understand fine.
Do you complain against init_local_spin_delay placed here? Placing it in other place will complicate code.
Or you complain against setting `mask` and `add`? At least it simplifies the loop body. For example, it is simpler to write specialized Power assembly using the already-filled `mask`+`add` registers (I have a working draft of the Power assembly).

In both cases, I think simpler version should be accepted first. It acts as algorithm definition. And it already gives measurable improvement. After some testing, an attempt to improve it may be made (assuming the release will be next autumn, there is enough time for testing and improvements). I could be mistaken.

+	/*
+	 * We intentionally do not call finish_spin_delay here, because the loop
+	 * above usually finished by queuing into the wait list on contention, and
+	 * doesn't reach spins_per_delay thereby doesn't sleep inside of
+	 * perform_spin_delay. Also, different LWLocks has very different
+	 * contention pattern, and it is wrong to update spin-lock statistic based
+	 * on LWLock contention.
+	 */

Huh? This seems entirely unconvincing. Without adjusting this here we'll just spin the same way every iteration. Especially for the case where somebody else holds LW_FLAG_LOCKED that's quite bad.

LWLock's are very different. Some of them are always short-term (BufferLock), others are always locked for a long time. Some are sparse, and some are crowded. It is quite bad to tune the single variable `spins_per_delay` for such different loads. And it is already tuned by spinlocks, and spinlocks tune it quite well.

I've tried to place this delay into lock itself (it has 2 free bytes), but this attempt performed worse.
Now I understand, that delays should be stored in array indexed by tranche. But I have no time to test this idea. And I doubt it will give cardinally better results (ie > 5%), so I think it is better to accept patch in this way, and then experiment with per-tranche delay. From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001 From: Sokolov Yura Date: Tue, 30 May 2017 18:54:25 +0300 Subject: [PATCH 5/6] Fix implementation description in a lwlock.c . --- src/backend/storage/lmgr/lwlock.c | 17 - 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 334c2a2d96..0a41c2c4e2 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -62,16 +62,15 @@ * notice that we have to wait. Unfortunately by the time we have finished * queuing, the former locker very well might have already finished it's * work. That's problematic because we're now stuck waiting inside the OS. - - * To mitigate those races we use a two phased attempt at locking: - * Phase 1: Try to do it atomically, if we succeed, nice - * Phase 2: Add ourse
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi, On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote: > From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001 > From: Sokolov Yura > Date: Mon, 29 May 2017 09:25:41 + > Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock. > > SpinDelayStatus should be function global to count iterations correctly, > and produce more correct delays. > > Also if spin didn't spin, do not count it in spins_per_delay correction. > It wasn't necessary before cause delayStatus were used only in contented > cases. I'm not convinced this is benefial. Adds a bit of overhead to the casewhere LW_FLAG_LOCKED can be set immediately. OTOH, it's not super likely to make a large difference if the lock has to be taken anyway... > + > +/* just shortcut to not declare lwlock_stats variable at the end of function > */ > +static void > +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus) > +{ > + lwlock_stats *lwstats; > + > + lwstats = get_lwlock_stats_entry(lock); > + lwstats->spin_delay_count += delayStatus->delays; > +} > #endif /* LWLOCK_STATS > */ This seems like a pretty independent change. > /* > * Internal function that tries to atomically acquire the lwlock in the > passed > - * in mode. > + * in mode. If it could not grab the lock, it doesn't puts proc into wait > + * queue. > * > - * This function will not block waiting for a lock to become free - that's > the > - * callers job. > + * It is used only in LWLockConditionalAcquire. > * > - * Returns true if the lock isn't free and we need to wait. > + * Returns true if the lock isn't free. > */ > static bool > -LWLockAttemptLock(LWLock *lock, LWLockMode mode) > +LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode) This now has become a fairly special cased function, I'm not convinced it makes much sense with the current naming and functionality. > +/* > + * Internal function that tries to atomically acquire the lwlock in the > passed > + * in mode or put it self into waiting queue with waitmode. > + * This function will not block waiting for a lock to become free - that's > the > + * callers job. > + * > + * Returns true if the lock isn't free and we are in a wait queue. > + */ > +static inline bool > +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode > waitmode) > +{ > + uint32 old_state; > + SpinDelayStatus delayStatus; > + boollock_free; > + uint32 mask, > + add; > + > + AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED); > + > + if (mode == LW_EXCLUSIVE) > + { > + mask = LW_LOCK_MASK; > + add = LW_VAL_EXCLUSIVE; > + } > + else > + { > + mask = LW_VAL_EXCLUSIVE; > + add = LW_VAL_SHARED; > + } > + > + init_local_spin_delay(&delayStatus); The way you moved this around has the disadvantage that we now do this - a number of writes - even in the very common case where the lwlock can be acquired directly. > + /* > + * Read once outside the loop. Later iterations will get the newer value > + * either via compare & exchange or with read after perform_spin_delay. > + */ > + old_state = pg_atomic_read_u32(&lock->state); > + /* loop until we've determined whether we could acquire the lock or not > */ > + while (true) > + { > + uint32 desired_state; > + > + desired_state = old_state; > + > + lock_free = (old_state & mask) == 0; > + if (lock_free) > { > - if (lock_free) > + desired_state += add; > + if (pg_atomic_compare_exchange_u32(&lock->state, > + >&old_state, desired_state)) > { > /* Great! Got the lock. 
*/ > #ifdef LOCK_DEBUG > if (mode == LW_EXCLUSIVE) > lock->owner = MyProc; > #endif > - return false; > + break; > + } > + } > + else if ((old_state & LW_FLAG_LOCKED) == 0) > + { > + desired_state |= LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS; > + if (pg_atomic_compare_exchange_u32(&lock->state, > + >&old_state, desired_state)) > + { > + LWLockQueueSelfLocked(lock, waitmode); > + break; > } > - else > - return true;/* somebody else has the lock */ > + } > + else > + { > +
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi, On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote: > Patch changes the way LWLock is acquired. > > Before patch, lock is acquired using following procedure: > 1. first LWLock->state is checked for ability to take a lock, and CAS > performed (either lock could be acquired or not). And it is retried if > status were changed. > 2. if lock is not acquired at first 1, Proc is put into wait list, using > LW_FLAG_LOCKED bit of LWLock->state as a spin-lock for wait list. > In fact, profile shows that LWLockWaitListLock spends a lot of CPU on > contendend LWLock (upto 20%). > Additional CAS loop is inside pg_atomic_fetch_or_u32 for setting > LW_FLAG_HAS_WAITERS. And releasing wait list lock is another CAS loop > on the same LWLock->state. > 3. after that state is checked again, because lock could be released > before Proc were queued into wait list. > 4. if lock were acquired at step 3, Proc should be dequeued from wait > list (another spin lock and CAS loop). And this operation could be > quite heavy, because whole list should be traversed. > > Patch makes lock acquiring in single CAS loop: > 1. LWLock->state is read, and ability for lock acquiring is detected. > If there is possibility to take a lock, CAS tried. > If CAS were successful, lock is aquired (same to original version). > 2. but if lock is currently held by other backend, we check ability for > taking WaitList lock. If wait list lock is not help by anyone, CAS > perfomed for taking WaitList lock and set LW_FLAG_HAS_WAITERS at once. > If CAS were successful, then LWLock were still held at the moment wait > list lock were held - this proves correctness of new algorithm. And > Proc is queued to wait list then. > 3. Otherwise spin_delay is performed, and loop returns to step 1. Right, something like this didn't use to be possible because we'd both the LWLock->state and LWLock->mutex. But after 008608b9d5106 that's not a concern anymore. > Algorithm for LWLockWaitForVar is also refactored. New version is: > 1. If lock is not held by anyone, it immediately exit. > 2. Otherwise it is checked for ability to take WaitList lock, because > variable is protected with it. If so, CAS is performed, and if it is > successful, loop breaked to step 4. > 3. Otherwise spin_delay perfomed, and loop returns to step 1. > 4. var is checked for change. > If it were changed, we unlock wait list lock and exit. > Note: it could not change in following steps because we are holding > wait list lock. > 5. Otherwise CAS on setting necessary flags is performed. If it succeed, > then queue Proc to wait list and exit. > 6. If Case failed, then there is possibility for LWLock to be already > released - if so then we should unlock wait list and exit. > 7. Otherwise loop returns to step 5. > > So invariant is preserved: > - either we found LWLock free, > - or we found changed variable, > - or we set flags and queue self while LWLock were held. > > Spin_delay is not performed at step 7, because we should release wait > list lock as soon as possible. That seems unconvincing - by not delaying you're more likely to *increase* the time till the current locker that holds the lock can release the lock. > Additionally, taking wait list lock is skipped in both algorithms if > SpinDelayStatus.spins is less than part of spins_per_delay loop > iterations (so, it is initial iterations, and some iterations after > pg_usleep, because SpinDelayStatus.spins is reset after sleep). 
I quite strongly think it's a good idea to change this at the same time as the other changes you're proposing here. I think it's a worthwhile thing to pursue, but that change also has quite the potential for regressing in some cases.

- Andres
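[For reference, the single-CAS acquire-or-queue loop summarized in the quoted patch description can be modeled by the following self-contained sketch. It uses plain C11 atomics; the bit layout, helper stubs and names are illustrative assumptions, not the actual lwlock.c code.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout, not the real lwlock.h values. */
#define VAL_SHARED			1u			/* one shared holder */
#define VAL_EXCLUSIVE		(1u << 24)		/* exclusive holder */
#define LOCK_MASK			((1u << 25) - 1)	/* all holder bits */
#define FLAG_HAS_WAITERS	(1u << 29)
#define FLAG_LOCKED			(1u << 30)		/* wait-list spin lock */

typedef struct
{
	atomic_uint_fast32_t state;
} fake_lwlock_t;

static void
queue_self_locked(fake_lwlock_t *lock)
{
	/* enqueue the current process (elided), then clear FLAG_LOCKED */
	atomic_fetch_and(&lock->state, ~(uint_fast32_t) FLAG_LOCKED);
}

static void
spin_delay(void)
{
	/* cpu_relax()/pg_usleep() stand-in */
}

/* Returns false if the lock was acquired, true if we queued and must sleep. */
static bool
attempt_lock_or_queue(fake_lwlock_t *lock, bool exclusive)
{
	uint_fast32_t mask = exclusive ? LOCK_MASK : VAL_EXCLUSIVE;
	uint_fast32_t add = exclusive ? VAL_EXCLUSIVE : VAL_SHARED;
	uint_fast32_t old = atomic_load(&lock->state);

	for (;;)
	{
		if ((old & mask) == 0)
		{
			/* Lock looks free in the requested mode: try to take it. */
			if (atomic_compare_exchange_weak(&lock->state, &old, old + add))
				return false;
		}
		else if ((old & FLAG_LOCKED) == 0)
		{
			/*
			 * Lock is busy but the wait list is unlocked: one CAS both takes
			 * the wait-list lock and advertises a waiter.
			 */
			if (atomic_compare_exchange_weak(&lock->state, &old,
											 old | FLAG_LOCKED | FLAG_HAS_WAITERS))
			{
				queue_self_locked(lock);
				return true;
			}
		}
		else
		{
			/* Both the lock and its wait list are busy: back off, reread. */
			spin_delay();
			old = atomic_load(&lock->state);
		}
		/* On CAS failure "old" already contains the fresh state; just loop. */
	}
}

[The key property is that the CAS that takes the wait-list lock and sets FLAG_HAS_WAITERS can only succeed while the lock itself is still held, which is what makes queueing inside the same loop safe.]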
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 09/11/2017 11:01 AM, Jesper Pedersen wrote:
Thanks for working on this !

Moved to "Ready for Committer".

Best regards,
Jesper
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi,

On 09/08/2017 03:35 PM, Sokolov Yura wrote:

I'm seeing

-M prepared: Up to 11% improvement
-M prepared -S: No improvement, no regression ("noise")
-M prepared -N: Up to 12% improvement

for all runs the improvement shows up the closer you get to the number of CPU threads, or above. Although I'm not seeing the same improvements as you on very large client counts there are definitely improvements :)

It is expected:
- the patch "fixes NUMA": for example, it doesn't give any improvement on 1 socket at all (I've tested it using numactl to bind to 1 socket),
- and certainly it gives less improvement on 2 sockets than on 4 sockets (28 cores vs 72 cores also makes a difference),
- one of the hot points was CLogControlLock, and it was fixed with "Group mode for CLOG updates" [1]

I'm planning to re-test that patch.

+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)

I'll leave it to the Committer to decide if this method is too big to be "inline".

GCC 4.9 doesn't want to inline it without the directive, and the function call is then noticeable in the profile. The attachment contains a version with all suggestions applied, except for the removal of "inline".

Yes, ideally the method will be kept at "inline".

Open questions:
---
* spins_per_delay as extern
* Calculation of skip_wait_list

Currently the calculation of skip_wait_list is mostly empirical (i.e. the best I measured).

Ok, good to know.

I strongly think that instead of spins_per_delay, something dependent on the concrete lock should be used. I tried to store it in the LWLock itself, but that was worse.

Yes, LWLock should be kept as small as possible, and cache line aligned due to the cache storms, as shown by perf c2c.

Recently I realized it should be stored in an array indexed by tranche, but I haven't implemented that yet, and therefore haven't measured it.

Different constants for the LWLock could have an impact, but the constants would also be dependent on machine setup and workload.

Thanks for working on this !

Best regards,
Jesper
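[To illustrate the point above about keeping locks small and cache-line aligned, here is a generic false-sharing sketch. CACHE_LINE_SIZE and the type names are illustrative assumptions; PostgreSQL's own LWLockPadded union follows a similar padding pattern, but this is not its actual layout.]

#include <stdatomic.h>

#define CACHE_LINE_SIZE 64		/* assumption: typical x86 cache-line size */

typedef struct
{
	atomic_uint_fast32_t state;	/* the lock word itself stays small */
} small_lock_t;

/* Pad every lock to a full cache line so neighbours don't share lines. */
typedef union
{
	small_lock_t lock;
	char		pad[CACHE_LINE_SIZE];
} padded_lock_t;

/* An array of locks where each element sits on its own cache line. */
static _Alignas(CACHE_LINE_SIZE) padded_lock_t lock_array[128];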
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi, Jesper

Thank you for reviewing.

On 2017-09-08 18:33, Jesper Pedersen wrote:

Hi,

On 07/18/2017 01:20 PM, Sokolov Yura wrote:
I'm sending rebased version with couple of one-line tweaks. (less skip_wait_list on shared lock, and don't update spin-stat on aquiring)

I have been running this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD setup (1 to 375 clients on logged tables).

I'm seeing

-M prepared: Up to 11% improvement
-M prepared -S: No improvement, no regression ("noise")
-M prepared -N: Up to 12% improvement

for all runs the improvement shows up the closer you get to the number of CPU threads, or above. Although I'm not seeing the same improvements as you on very large client counts there are definitely improvements :)

It is expected:
- the patch "fixes NUMA": for example, it doesn't give any improvement on 1 socket at all (I've tested it using numactl to bind to 1 socket),
- and certainly it gives less improvement on 2 sockets than on 4 sockets (28 cores vs 72 cores also makes a difference),
- one of the hot points was CLogControlLock, and it was fixed with "Group mode for CLOG updates" [1]

Some comments:
==

lwlock.c: - ...

+static void
+add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)

Add a method description.

The neighbouring functions above also have no description; that is why I didn't add one. But OK, I've added it now.

+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)

I'll leave it to the Committer to decide if this method is too big to be "inline".

GCC 4.9 doesn't want to inline it without the directive, and the function call is then noticeable in the profile. The attachment contains a version with all suggestions applied, except for the removal of "inline".

Open questions:
---
* spins_per_delay as extern
* Calculation of skip_wait_list

Currently the calculation of skip_wait_list is mostly empirical (i.e. the best I measured).

I strongly think that instead of spins_per_delay, something dependent on the concrete lock should be used. I tried to store it in the LWLock itself, but that was worse.

Recently I realized it should be stored in an array indexed by tranche, but I haven't implemented that yet, and therefore haven't measured it.

[1]: https://commitfest.postgresql.org/14/358/

--
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001
From: Sokolov Yura
Date: Mon, 29 May 2017 09:25:41 +
Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.

SpinDelayStatus should be function global to count iterations correctly,
and produce more correct delays.

Also if spin didn't spin, do not count it in spins_per_delay correction.
It wasn't necessary before cause delayStatus were used only in contented
cases.
--- src/backend/storage/lmgr/lwlock.c | 37 + src/backend/storage/lmgr/s_lock.c | 5 + 2 files changed, 22 insertions(+), 20 deletions(-) diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 82a1cf5150..7714085663 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -323,6 +323,16 @@ get_lwlock_stats_entry(LWLock *lock) } return lwstats; } + +/* just shortcut to not declare lwlock_stats variable at the end of function */ +static void +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus) +{ + lwlock_stats *lwstats; + + lwstats = get_lwlock_stats_entry(lock); + lwstats->spin_delay_count += delayStatus->delays; +} #endif /* LWLOCK_STATS */ @@ -800,13 +810,9 @@ static void LWLockWaitListLock(LWLock *lock) { uint32 old_state; -#ifdef LWLOCK_STATS - lwlock_stats *lwstats; - uint32 delays = 0; - - lwstats = get_lwlock_stats_entry(lock); -#endif + SpinDelayStatus delayStatus; + init_local_spin_delay(&delayStatus); while (true) { /* always try once to acquire lock directly */ @@ -815,20 +821,10 @@ LWLockWaitListLock(LWLock *lock) break;/* got lock */ /* and then spin without atomic operations until lock is released */ + while (old_state & LW_FLAG_LOCKED) { - SpinDelayStatus delayStatus; - - init_local_spin_delay(&delayStatus); - - while (old_state & LW_FLAG_LOCKED) - { -perform_spin_delay(&delayStatus); -old_state = pg_atomic_read_u32(&lock->state); - } -#ifdef LWLOCK_STATS - delays += delayStatus.delays; -#endif - finish_spin_delay(&delayStatus); + perform_spin_delay(&delayStatus); + old_state = pg_atomic_read_u32(&lock->state); } /* @@ -836,9 +832,10 @@ LWLockWaitListLock(LWLock *lock) * we're attempting to get it again. */ } + finish_spin_delay(&delayStatus); #ifdef LWLOCK_STATS - lwstats->spin_delay_count += delays; + add_lwlock_stats_spin_stat(lock, &delayStatus); #endif } diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lo
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Hi,

On 07/18/2017 01:20 PM, Sokolov Yura wrote:
I'm sending rebased version with couple of one-line tweaks. (less skip_wait_list on shared lock, and don't update spin-stat on aquiring)

I have been running this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD setup (1 to 375 clients on logged tables).

I'm seeing

-M prepared: Up to 11% improvement
-M prepared -S: No improvement, no regression ("noise")
-M prepared -N: Up to 12% improvement

for all runs the improvement shows up the closer you get to the number of CPU threads, or above. Although I'm not seeing the same improvements as you on very large client counts there are definitely improvements :)

Some comments:
==

lwlock.c:
-

+ * This race is avoiding by taking lock for wait list using CAS with a old
+ * state value, so it could succeed only if lock is still held. Necessary flag
+ * is set with same CAS.
+ *
+ * Inside LWLockConflictsWithVar wait list lock is reused to protect variable
+ * value. So first it is acquired to check variable value, then flags are
+ * set with second CAS before queueing to wait list to ensure lock were not
+ * released yet.

* This race is avoided by taking a lock for the wait list using CAS with the old
* state value, so it would only succeed if lock is still held. Necessary flag
* is set using the same CAS.
*
* Inside LWLockConflictsWithVar the wait list lock is reused to protect the variable
* value. First the lock is acquired to check the variable value, then flags are
* set with a second CAS before queuing to the wait list in order to ensure that the lock was not
* released yet.

+static void
+add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)

Add a method description.

+ /*
+ * This barrier is never needed for correctness, and it is no-op
+ * on x86. But probably first iteration of cas loop in
+ * ProcArrayGroupClearXid will succeed oftener with it.
+ */

* "more often"

+static inline bool
+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)

I'll leave it to the Committer to decide if this method is too big to be "inline".

+ /*
+ * We intentionally do not call finish_spin_delay here, cause loop above
+ * usually finished in queueing into wait list on contention, and doesn't
+ * reach spins_per_delay (and so, doesn't sleep inside of
+ * perform_spin_delay). Also, different LWLocks has very different
+ * contention pattern, and it is wrong to update spin-lock statistic based
+ * on LWLock contention.
+ */

/*
 * We intentionally do not call finish_spin_delay here, because the loop above
 * usually finished by queuing into the wait list on contention, and doesn't
 * reach spins_per_delay thereby doesn't sleep inside of
 * perform_spin_delay. Also, different LWLocks has very different
 * contention pattern, and it is wrong to update spin-lock statistic based
 * on LWLock contention.
 */

s_lock.c:
-

+ if (status->spins == 0)
+ /* but we didn't spin either, so ignore */
+ return;

Use { } for the if, or move the comment out of the nesting for readability.

Open questions:
---
* spins_per_delay as extern
* Calculation of skip_wait_list

You could run the patch through pgindent too.

Passes make check-world.

Status: Waiting on Author

Best regards,
Jesper
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 2017-07-18 20:20, Sokolov Yura wrote:

On 2017-06-05 16:22, Sokolov Yura wrote:

Good day, everyone.

This patch improves performance of contended LWLock.

Patch makes lock acquiring in single CAS loop:
1. LWLock->state is read, and ability for lock acquiring is detected. If there is possibility to take a lock, CAS tried. If CAS were successful, lock is aquired (same to original version).
2. but if lock is currently held by other backend, we check ability for taking WaitList lock. If wait list lock is not help by anyone, CAS perfomed for taking WaitList lock and set LW_FLAG_HAS_WAITERS at once. If CAS were successful, then LWLock were still held at the moment wait list lock were held - this proves correctness of new algorithm. And Proc is queued to wait list then.
3. Otherwise spin_delay is performed, and loop returns to step 1.

I'm sending rebased version with couple of one-line tweaks. (less skip_wait_list on shared lock, and don't update spin-stat on aquiring)

With regards,

Here are results for zipfian distribution (50/50 r/w) in conjunction with "lazy hash table for XidInMVCCSnapshot":
(https://www.postgresql.org/message-id/642da34694800dab801f04c62950ce8a%40postgrespro.ru)

 clients | master | hashsnap2 | hashsnap2_lwlock
---------+--------+-----------+------------------
      10 | 203384 |    203813 |           204852
      20 | 334344 |    334268 |           363510
      40 | 228496 |    231777 |           383820
      70 | 146892 |    148173 |           221326
     110 |  99741 |    111580 |           157327
     160 |  65257 |     81230 |           112028
     230 |  38344 |     56790 |            77514
     310 |  22355 |     39249 |            55907
     400 |  13402 |     26899 |            39742
     500 |   8382 |     17855 |            28362
     650 |   5313 |     11450 |            17497
     800 |   3352 |      7816 |            11030

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On 2017-06-05 16:22, Sokolov Yura wrote: Good day, everyone. This patch improves performance of contended LWLock. Patch makes lock acquiring in single CAS loop: 1. LWLock->state is read, and ability for lock acquiring is detected. If there is possibility to take a lock, CAS tried. If CAS were successful, lock is aquired (same to original version). 2. but if lock is currently held by other backend, we check ability for taking WaitList lock. If wait list lock is not help by anyone, CAS perfomed for taking WaitList lock and set LW_FLAG_HAS_WAITERS at once. If CAS were successful, then LWLock were still held at the moment wait list lock were held - this proves correctness of new algorithm. And Proc is queued to wait list then. 3. Otherwise spin_delay is performed, and loop returns to step 1. I'm sending rebased version with couple of one-line tweaks. (less skip_wait_list on shared lock, and don't update spin-stat on aquiring) With regards, -- Sokolov Yura aka funny_falcon Postgres Professional: https://postgrespro.ru The Russian Postgres CompanyFrom dc8d9489074973ba589adf9b0239e1467cbba44b Mon Sep 17 00:00:00 2001 From: Sokolov Yura Date: Mon, 29 May 2017 09:25:41 + Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock. SpinDelayStatus should be function global to count iterations correctly, and produce more correct delays. Also if spin didn't spin, do not count it in spins_per_delay correction. It wasn't necessary before cause delayStatus were used only in contented cases. --- src/backend/storage/lmgr/lwlock.c | 36 src/backend/storage/lmgr/s_lock.c | 3 +++ 2 files changed, 19 insertions(+), 20 deletions(-) diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 82a1cf5150..86966b804e 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -323,6 +323,15 @@ get_lwlock_stats_entry(LWLock *lock) } return lwstats; } + +static void +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus) +{ + lwlock_stats *lwstats; + + lwstats = get_lwlock_stats_entry(lock); + lwstats->spin_delay_count += delayStatus->delays; +} #endif /* LWLOCK_STATS */ @@ -800,13 +809,9 @@ static void LWLockWaitListLock(LWLock *lock) { uint32 old_state; -#ifdef LWLOCK_STATS - lwlock_stats *lwstats; - uint32 delays = 0; - - lwstats = get_lwlock_stats_entry(lock); -#endif + SpinDelayStatus delayStatus; + init_local_spin_delay(&delayStatus); while (true) { /* always try once to acquire lock directly */ @@ -815,20 +820,10 @@ LWLockWaitListLock(LWLock *lock) break;/* got lock */ /* and then spin without atomic operations until lock is released */ + while (old_state & LW_FLAG_LOCKED) { - SpinDelayStatus delayStatus; - - init_local_spin_delay(&delayStatus); - - while (old_state & LW_FLAG_LOCKED) - { -perform_spin_delay(&delayStatus); -old_state = pg_atomic_read_u32(&lock->state); - } -#ifdef LWLOCK_STATS - delays += delayStatus.delays; -#endif - finish_spin_delay(&delayStatus); + perform_spin_delay(&delayStatus); + old_state = pg_atomic_read_u32(&lock->state); } /* @@ -836,9 +831,10 @@ LWLockWaitListLock(LWLock *lock) * we're attempting to get it again. 
*/ } + finish_spin_delay(&delayStatus); #ifdef LWLOCK_STATS - lwstats->spin_delay_count += delays; + add_lwlock_stats_spin_stat(lock, &delayStatus); #endif } diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c index 40d8156331..f3e0fbc602 100644 --- a/src/backend/storage/lmgr/s_lock.c +++ b/src/backend/storage/lmgr/s_lock.c @@ -177,6 +177,9 @@ finish_spin_delay(SpinDelayStatus *status) if (status->cur_delay == 0) { /* we never had to delay */ + if (status->spins == 0) + /* but we didn't spin either, so ignore */ + return; if (spins_per_delay < MAX_SPINS_PER_DELAY) spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY); } -- 2.11.0 From 067282cce98c77b9f1731cf7d302c37fcc70d59e Mon Sep 17 00:00:00 2001 From: Sokolov Yura Date: Tue, 30 May 2017 14:05:26 +0300 Subject: [PATCH 2/6] Perform atomic LWLock acquirement or putting into wait list. Before this change, lwlock were acquired in several steps: - first try acquire lockfree (one CAS-spin-loop) - if it were not successful, queue self into wait list (two CAS-spin-loops (in fact, 3 loop, cause of atomic_fetch_and_or)) - try again to acquire lock (another one CAS loop) - if second attempt were successful, deque our self from wait queue (two CAS spin loops). With this change, lock acquired, or wait list lock acquired using single CAS loop: - if it is likely to acquire desired lock mode, we do CAS for that - otherwise we are trying for CAS to lock wait list and set necessary flag (LG_FLAG_HAS_WAITERS) at once. (LWLockWaitListUnlock still performs second CAS loop for releasing wait list lock). Correctness pr
Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
On Mon, Jun 5, 2017 at 9:22 AM, Sokolov Yura wrote:
> Good day, everyone.
>
> This patch improves performance of contended LWLock.
> It was tested on 4 socket 72 core x86 server (144 HT) Centos 7.1
> gcc 4.8.5
>
> Patch makes lock acquiring in single CAS loop:
> 1. LWLock->state is read, and ability for lock acquiring is detected.
> If there is possibility to take a lock, CAS tried.
> If CAS were successful, lock is aquired (same to original version).
> 2. but if lock is currently held by other backend, we check ability for
> taking WaitList lock. If wait list lock is not help by anyone, CAS
> perfomed for taking WaitList lock and set LW_FLAG_HAS_WAITERS at once.
> If CAS were successful, then LWLock were still held at the moment wait
> list lock were held - this proves correctness of new algorithm. And
> Proc is queued to wait list then.
> 3. Otherwise spin_delay is performed, and loop returns to step 1.

Interesting work. Thanks for posting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company