Re: Lockless queue of waiters in LWLock

2024-01-26 Thread vignesh C
On Sat, 20 Jan 2024 at 07:28, vignesh C  wrote:
>
> On Sat, 26 Nov 2022 at 00:24, Pavel Borisov  wrote:
> >
> > Hi, hackers!
> > In the measurements above in the thread, I've been using a LIFO wake
> > queue in the primary lockless patch (attached as the previous versions
> > of the patch) and an "inverted wake queue" (in fact FIFO) as the
> > alternative benchmarking option. I think using the latter is fairer and
> > more natural, and provided they show no difference in speed, I'd make
> > the main patch use it (attached as v6). No other changes from v5,
> > though.
>
> There has not been much interest in this, as the thread has been idle
> for more than a year now. I'm not sure whether we should take it forward
> or not. I would prefer to close this in the current commitfest unless
> someone wants to take it further.

I have returned this patch in the commitfest, as nobody has shown any
interest in pursuing it. Feel free to add a new entry when someone
wants to work on this more actively.

Regards,
Vignesh




Re: Lockless queue of waiters in LWLock

2024-01-19 Thread vignesh C
On Sat, 26 Nov 2022 at 00:24, Pavel Borisov  wrote:
>
> Hi, hackers!
> In the measurements above in the thread, I've been using a LIFO wake
> queue in the primary lockless patch (attached as the previous versions
> of the patch) and an "inverted wake queue" (in fact FIFO) as the
> alternative benchmarking option. I think using the latter is fairer and
> more natural, and provided they show no difference in speed, I'd make
> the main patch use it (attached as v6). No other changes from v5,
> though.

There has not been much interest in this, as the thread has been idle
for more than a year now. I'm not sure whether we should take it forward
or not. I would prefer to close this in the current commitfest unless
someone wants to take it further.

Regards,
Vignesh




Re: Lockless queue of waiters in LWLock

2022-11-25 Thread Pavel Borisov
Hi, hackers!
In the measurements above in the thread, I've been using a LIFO wake
queue in the primary lockless patch (attached as the previous versions
of the patch) and an "inverted wake queue" (in fact FIFO) as the
alternative benchmarking option. I think using the latter is fairer and
more natural, and provided they show no difference in speed, I'd make
the main patch use it (attached as v6). No other changes from v5,
though.

Regards,
Pavel.


v6-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-24 Thread Pavel Borisov
Hi, hackers!
Andres Freund recently committed his nice LWLock optimization
a4adc31f6902f6f. So I've rebased the patch on top of the current
master (PFA v5).

Regards,
Pavel Borisov,
Supabase.


v5-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-17 Thread Pavel Borisov
Hi, hackers!

I've noticed that the alignment requirements for using
pg_atomic_init_u64() for the LWLock state (there's an assertion that it
is aligned to 8 bytes) are not met by the code in SimpleLruInit()
on 32-bit architectures where MAXIMUM_ALIGNOF = 4.
Fixed this in the v4 patch (PFA).
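
For illustration, a minimal sketch of the constraint involved (not code from
the patch; it over-allocates and rounds the pointer up by hand, and assumes
"storage/shmem.h" and "port/atomics.h"):

char *raw;
pg_atomic_uint64 *lock_state;

/* Over-allocate so the atomic state can be rounded up to an 8-byte boundary,
 * even when the surrounding layout only guarantees MAXIMUM_ALIGNOF (4 on
 * some 32-bit builds). */
raw = ShmemAlloc(sizeof(pg_atomic_uint64) + 8);
lock_state = (pg_atomic_uint64 *) TYPEALIGN(8, raw);

Assert((uintptr_t) lock_state % 8 == 0);   /* the alignment pg_atomic_init_u64() asserts */
pg_atomic_init_u64(lock_state, 0);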

Regards,
Pavel.


v4-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-11 Thread Alexander Korotkov
Hi, Pavel!

On Fri, Nov 11, 2022 at 2:40 PM Pavel Borisov  wrote:
> I've done some more measurements to check the hypotheses regarding the
> performance of a previous patch v2, and explain the results of tests
> in [1].
>
> The results below are the same (tps vs connections) plots as in [1],
> and the test is identical to the insert test in this thread [2].
> Additionally, in each case, there is a plot with results relative to
> Andres Freund's patch [3]. Log plots are good for seeing details in
> the range of 20-30 connections, but they somewhat hide the fact that
> the effect in the range of 500+ connections is much more significant
> overall, so I'd recommend looking at the linear plots as well.

Thank you for doing all the experiments!

BTW, sometimes it's hard to distinguish so many lines on a jpg
picture.  Could I ask you to post the same graphs in png and also post
raw data in csv format?

> I'm also planning to do the same tests on an ARM server when the free
> one comes available to me.
> Thoughts?

ARM tests would be great.  We definitely need to check this on more
than just one architecture.  Please check with and without LSE
instructions.  They could lead to a dramatic speedup [1].  However,
most precompiled binaries are distributed without them, so both
cases seem important to me so far.

From what we have so far, I think we could try combining multiple
strategies to achieve the best result.  2x1ms is one of the leaders
before ~200 connections, and 1x1ms is one of the leaders after.  We
could implement simple heuristics to switch between 1 and 2 retries,
similar to what we have for spin delays.  But let's have the ARM
results first.
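
Purely as a sketch of the kind of heuristic meant here (names and thresholds
are invented, not existing code):

/*
 * Adaptive choice between 1 and 2 lock attempts before sleeping, in the same
 * spirit as the spins_per_delay tuning for spinlocks.
 */
static int  lw_attempts_before_sleep = 2;   /* start with the current behavior */

static inline int
lw_choose_attempts(void)
{
    return lw_attempts_before_sleep;
}

static inline void
lw_report_acquire(int attempts_used, bool slept)
{
    if (!slept && attempts_used == 1 && lw_attempts_before_sleep > 1)
        lw_attempts_before_sleep--;     /* the extra attempt keeps being unnecessary */
    else if (slept && lw_attempts_before_sleep < 2)
        lw_attempts_before_sleep++;     /* we slept anyway; the extra attempt may help */
}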


Links
1. https://akorotkov.github.io/blog/2021/04/30/arm/

--
Regards,
Alexander Korotkov




Re: Lockless queue of waiters in LWLock

2022-11-11 Thread Pavel Borisov
The CFbot isn't happy because of the additional patches in the previous
message, so here I attach the v3 patch alone.

Regards,
Pavel


v3-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-05 12:05:43 +0300, Alexander Korotkov wrote:
> On Fri, Nov 4, 2022 at 10:07 PM Andres Freund  wrote:
> > The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
> > lwlocks are slower than a spinlock (but obviously are better under 
> > contention
> > nonetheless).
> >
> >
> > I have a benchmark program that starts a thread for each physical core and
> > just increments a counter on an atomic value.
> 
> Thank you for this insight!  I didn't know xadd is much cheaper than
> cmpxchg unless there are retries.

The magnitude of the effect is somewhat surprising, I agree. Some difference
makes sense to me, but...


> I also wonder how cmpxchg becomes faster with higher concurrency.

If you're referring to the leading 32/64, that's not concurrency; those
are 32-bit/64-bit values. Sorry for not being clearer on that.

Greetings,

Andres Freund




Re: Lockless queue of waiters in LWLock

2022-11-05 Thread Alexander Korotkov
Hi, Andres!

On Fri, Nov 4, 2022 at 10:07 PM Andres Freund  wrote:
> The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
> lwlocks are slower than a spinlock (but obviously are better under contention
> nonetheless).
>
>
> I have a benchmark program that starts a thread for each physical core and
> just increments a counter on an atomic value.

Thank you for this insight!  I didn't know xadd is much cheaper than
cmpxchg unless there are retries.  I also wonder how cmpxchg becomes
faster with higher concurrency.

--
Regards,
Alexander Korotkov




Re: Lockless queue of waiters in LWLock

2022-11-04 Thread Andres Freund
Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for the small performance
> difference around 20 connections described in [0]?
> Thoughts?

Using xadd is quite a bit cheaper than cmpxchg, and now every lock release
uses a compare-exchange, I think.

In the past I had a more complicated version of LWLockAcquire which tried to
use an xadd to acquire locks. IIRC (and this is long enough ago that I might
not remember correctly) that proved to be a benefit, but I was worried about
the complexity. And just getting in the version that didn't always use a
spinlock was the higher priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
lwlocks are slower than a spinlock (but obviously are better under contention
nonetheless).


I have a benchmark program that starts a thread for each physical core and
just increments a counter on an atomic value.

On my dual Xeon Gold 5215 workstation:

cmpxchg:
32: throughput per thread: 0.55M/s, total: 11.02M/s
64: throughput per thread: 0.63M/s, total: 12.68M/s

lock add:
32: throughput per thread: 2.10M/s, total: 41.98M/s
64: throughput per thread: 2.12M/s, total: 42.40M/s

xadd:
32: throughput per thread: 2.10M/s, total: 41.91M/s
64: throughput per thread: 2.04M/s, total: 40.71M/s


and even when there's no contention, every thread just updating its own
cacheline:

cmpxchg:
32: throughput per thread: 88.83M/s, total: 1776.51M/s
64: throughput per thread: 96.46M/s, total: 1929.11M/s

lock add:
32: throughput per thread: 166.07M/s, total: 3321.31M/s
64: throughput per thread: 165.86M/s, total: 3317.22M/s

add (no lock):
32: throughput per thread: 530.78M/s, total: 10615.62M/s
64: throughput per thread: 531.22M/s, total: 10624.35M/s

xadd:
32: throughput per thread: 165.88M/s, total: 3317.51M/s
64: throughput per thread: 165.93M/s, total: 3318.53M/s
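
For reference, a minimal standalone sketch of this kind of microbenchmark
(not the actual program used for the numbers above; thread and iteration
counts are arbitrary):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NTHREADS 32          /* arbitrary; pick the number of physical cores */
#define ITERS    10000000L

static _Atomic uint64_t counter;

static void *
worker_xadd(void *arg)
{
    (void) arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&counter, 1);     /* compiles to lock xadd on x86 */
    return NULL;
}

static void *
worker_cas(void *arg)
{
    (void) arg;
    for (long i = 0; i < ITERS; i++)
    {
        uint64_t old = atomic_load(&counter);

        /* retry until our compare-exchange wins, like a CAS-based lock update */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    bool        use_cas = (argc > 1 && strcmp(argv[1], "cas") == 0);
    pthread_t   threads[NTHREADS];
    struct timespec start, end;
    double      secs;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, use_cas ? worker_cas : worker_xadd, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%s total: %.2fM ops/s\n", use_cas ? "cmpxchg" : "xadd",
           (double) NTHREADS * ITERS / secs / 1e6);
    return 0;
}

Something like "cc -O2 -pthread atomic_bench.c" builds it; pass "cas" to use
the compare-exchange loop instead of fetch-add.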


Greetings,

Andres Freund




Re: Lockless queue of waiters in LWLock

2022-11-04 Thread Alexander Korotkov
Hi, Pavel!

On Thu, Nov 3, 2022 at 1:51 PM Pavel Borisov  wrote:
> > I'm planning to gather more detailed statistics on different
> > LWLockAcquire calls soon to understand prospects for further
> > optimizations.
>
> So, I've made more measurements.
>
> 1. Applied measuring patch 0001 to the patch with the lockless queue
> optimization (v2) from [0] in this thread and ran the same concurrent
> insert test described in [1] on 20 pgbench connections.
> The new results for ProcArray lwlock are as follows:
> exacq 45132                    // Overall number of exclusive locks taken
> ex_attempt[0] 20755            // Exclusive locks taken immediately
> ex_attempt[1] 18800            // Exclusive locks taken after one wait on the semaphore
> ex_attempt[2] 5577             // Exclusive locks taken after two or more waits on the semaphore
> shacq 494871                   // .. same stats for shared locks
> sh_attempt[0] 463211           // ..
> sh_attempt[1] 29767            // ..
> sh_attempt[2] 1893             // .. same stats for shared locks
> sh_wake_calls 31070            // Number of calls to wake up shared waiters
> sh_wakes 36190                 // Number of shared waiters woken up
> GroupClearXid 55300            // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal: 236193 // Number of calls of ProcArrayEndTransactionInternal
>
> 2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2]
> and ran the same concurrent insert test described in [1] on 20 pgbench
> connections.
> The results for ProcArray lwlock are as follows:
> exacq 49300           // Overall number of exclusive locks taken
> ex_attempt1[0] 18353  // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt2[0] 18144  // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
> ex_attempt1[1] 9985   // Exclusive locks taken after one wait on the semaphore by the first call of LWLockAttemptLock
> ex_attempt2[1] 1838   // Exclusive locks taken after one wait on the semaphore by the second call of LWLockAttemptLock
> ex_attempt1[2] 823    // Exclusive locks taken after two or more waits on the semaphore by the first call of LWLockAttemptLock
> ex_attempt2[2] 157    // Exclusive locks taken after two or more waits on the semaphore by the second call of LWLockAttemptLock
> shacq 508131          // .. same stats for shared locks
> sh_attempt1[0] 469410 // ..
> sh_attempt2[0] 27858  // ..
> sh_attempt1[1] 10309  // ..
> sh_attempt2[1] 460    // ..
> sh_attempt1[2] 90     // ..
> sh_attempt2[2] 4      // .. same stats for shared locks
> dequeue self 48461    // Number of dequeue_self calls
> sh_wake_calls 27560   // Number of calls to wake up shared waiters
> sh_wakes 19408        // Number of shared waiters woken up
> GroupClearXid 65021   // Number of calls of ProcArrayGroupClearXid
> EndTransactionInternal: 249003 // Number of calls of ProcArrayEndTransactionInternal
>
> It seems that the two calls in each loop in Andres's (and master's) code
> help avoid semaphore-waiting loops that may be relatively expensive.
> The probable reason for this is that the small delay between these two
> calls is sometimes enough for concurrent takers to free the spinlock for
> the queue modification. Could we get even more performance if we make
> three or more attempts to take the lock in the queue? Would this degrade
> performance in some other cases?

Thank you for gathering the statistics. Let me do some relative
analysis of that.

*Lockless queue patch*

1. Shared lockers
1.1. 93.60% of them acquire lock without sleeping on semaphore
1.2. 6.02% of them acquire lock after sleeping on semaphore 1 time
1.3. 0.38% of them acquire lock after sleeping on semaphore 2 or more times
2. Exclusive lockers
2.1. 45.99% of them acquire lock without sleeping on semaphore
2.2. 41.66% of them acquire lock after sleeping on semaphore 1 time
2.3. 12.36% of them acquire lock after sleeping on semaphore 2 or more times

In general, about 10% of lockers sleep on the semaphore.
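(That figure follows from the raw numbers above: (18800 + 5577 + 29767 +
1893) sleepers out of (45132 + 494871) acquisitions ≈ 10.4%.)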

*Andres's patch*

1. Shared lockers
1.1. 97.86% of them acquire lock without sleeping on the semaphore
(92.38% do this immediately and 5.48% after queuing)
1.2. 2.12% of them acquire lock after sleeping on semaphore 1 time
(2.03% do this immediately and 0.09% after queuing)
1.3. 0.02% of them acquire lock after sleeping on semaphore 2 or more
times (0.02% do this immediately and 0.00% after queuing)
2. Exclusive lockers
2.1. 74.03% of them acquire lock without sleeping on the semaphore
(37.23% do this immediately and 36.80% after queuing)
2.2. 23.98% of them acquire lock after sleeping on semaphore 1 time
(20.25% do this immediately and 3.73% after queuing)
2.3. 1.99% of them acquire lock after sleeping on semaphore 2 or more
times (1.67% do this immediately and 0.32% after queuing)

In general, about 4% of lockers sleep on the semaphore.

Re: Lockless queue of waiters in LWLock

2022-11-03 Thread Pavel Borisov
Hi, hackers!
> I'm planning to gather more detailed statistics on different
> LWLockAcquire calls soon to understand prospects for further
> optimizations.

So, I've made more measurements.

1. Applied measuring patch 0001 to the patch with the lockless queue
optimization (v2) from [0] in this thread and ran the same concurrent
insert test described in [1] on 20 pgbench connections.
The new results for ProcArray lwlock are as follows:
exacq 45132                    // Overall number of exclusive locks taken
ex_attempt[0] 20755            // Exclusive locks taken immediately
ex_attempt[1] 18800            // Exclusive locks taken after one wait on the semaphore
ex_attempt[2] 5577             // Exclusive locks taken after two or more waits on the semaphore
shacq 494871                   // .. same stats for shared locks
sh_attempt[0] 463211           // ..
sh_attempt[1] 29767            // ..
sh_attempt[2] 1893             // .. same stats for shared locks
sh_wake_calls 31070            // Number of calls to wake up shared waiters
sh_wakes 36190                 // Number of shared waiters woken up
GroupClearXid 55300            // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal: 236193 // Number of calls of ProcArrayEndTransactionInternal

2. Applied measuring patch 0002 to Andres Freund's patch v3 from [2]
and ran the same concurrent insert test described in [1] on 20 pgbench
connections.
The results for ProcArray lwlock are as follows:
exacq 49300           // Overall number of exclusive locks taken
ex_attempt1[0] 18353  // Exclusive locks taken immediately by the first call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt2[0] 18144  // Exclusive locks taken immediately by the second call of LWLockAttemptLock in the LWLockAcquire loop
ex_attempt1[1] 9985   // Exclusive locks taken after one wait on the semaphore by the first call of LWLockAttemptLock
ex_attempt2[1] 1838   // Exclusive locks taken after one wait on the semaphore by the second call of LWLockAttemptLock
ex_attempt1[2] 823    // Exclusive locks taken after two or more waits on the semaphore by the first call of LWLockAttemptLock
ex_attempt2[2] 157    // Exclusive locks taken after two or more waits on the semaphore by the second call of LWLockAttemptLock
shacq 508131          // .. same stats for shared locks
sh_attempt1[0] 469410 // ..
sh_attempt2[0] 27858  // ..
sh_attempt1[1] 10309  // ..
sh_attempt2[1] 460    // ..
sh_attempt1[2] 90     // ..
sh_attempt2[2] 4      // .. same stats for shared locks
dequeue self 48461    // Number of dequeue_self calls
sh_wake_calls 27560   // Number of calls to wake up shared waiters
sh_wakes 19408        // Number of shared waiters woken up
GroupClearXid 65021   // Number of calls of ProcArrayGroupClearXid
EndTransactionInternal: 249003 // Number of calls of ProcArrayEndTransactionInternal

It seems that the two calls in each loop in Andres's (and master's) code
help avoid semaphore-waiting loops that may be relatively expensive.
The probable reason for this is that the small delay between these two
calls is sometimes enough for concurrent takers to free the spinlock for
the queue modification. Could we get even more performance if we make
three or more attempts to take the lock in the queue? Would this degrade
performance in some other cases?

Or maybe there is another explanation for the small performance
difference around 20 connections described in [0]?
Thoughts?

Regards,
Pavel Borisov

[0] 
https://www.postgresql.org/message-id/CALT9ZEF7q%2BSarz1MjrX-fM7OsoU7CK16%3DONoGCOkY3Efj%2BGrnw%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/CALT9ZEEz%2B%3DNepc5eti6x531q64Z6%2BDxtP3h-h_8O5HDdtkJcPw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20221031235114.ftjkife57zil7ryw%40awork3.anarazel.de


0001-Print-extended-lwlock_stats-and-proc_stats-on-CAS-re.patch
Description: Binary data


0002-Print-extended-lwlock_stats-and-proc_stats.patch
Description: Binary data


Re: Lockless queue of waiters in LWLock

2022-11-01 Thread Alexander Korotkov
Hi Andres,

Thank you for your feedback.

On Tue, Nov 1, 2022 at 2:15 AM Andres Freund  wrote:
> On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> > If the lock state contains references to the queue head and tail, we can
> > implement a lockless queue of waiters for the LWLock.  Adding new items to
> > the queue head or tail can be done with a single CAS operation (adding to
> > the tail will also require further setting the reference from the previous
> > tail).  Given that there could be only one lock releaser, which wakes up
> > waiters in the queue, we can handle all the concurrency issues with
> > reasonable complexity.
>
> Right now waiter wakeups happen *after* the lock is released.

That makes sense. The patch makes the locker hold the lock while it's
processing the queue. So, the lock is held for a longer time.

> I suspect that is
> at least part of the reason for the regression you're seeing. It also looks
> like you're going to need a substantially higher number of atomic operations -
> right now the queue processing needs O(1) atomic ops, but your approach ends
> up with O(waiters) atomic ops.

Hmm... In the patch, queue processing calls CAS once after processing
the queue. There could be retries to process queue parts that were
added concurrently, but I doubt it ends up with O(waiters) atomic
ops. Pavel, I think we could gather some statistics to check how many
retries we have on average.

> I suspect it might be worth going halfway, i.e. put the list head/tail in the
> atomic variable, but process the list with a lock held, after the lock is
> released.

Good idea. In any case, we only allow one locker at a time to process the queue.

--
Regards,
Alexander Korotkov




Re: Lockless queue of waiters in LWLock

2022-10-31 Thread Andres Freund
Hi,

Thanks for working on this - I think it's something we need to improve.


On 2022-10-31 14:38:23 +0400, Pavel Borisov wrote:
> If the lock state contains references to the queue head and tail, we can
> implement a lockless queue of waiters for the LWLock.  Adding new items to
> the queue head or tail can be done with a single CAS operation (adding to
> the tail will also require further setting the reference from the previous
> tail).  Given that there could be only one lock releaser, which wakes up
> waiters in the queue, we can handle all the concurrency issues with
> reasonable complexity.

Right now waiter wakeups happen *after* the lock is released. I suspect that is
at least part of the reason for the regression you're seeing. It also looks
like you're going to need a substantially higher number of atomic operations -
right now the queue processing needs O(1) atomic ops, but your approach ends
up with O(waiters) atomic ops.

I suspect it might be worth going halfway, i.e. put the list head/tail in the
atomic variable, but process the list with a lock held, after the lock is
released.
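
A rough sketch of that halfway idea, just to make it concrete (the names for
the state layout and queue links are invented, shared-lock counts are glossed
over, and it is not an implementation):

/*
 * Waiters still push themselves onto a head/tail kept in the atomic state,
 * but the releaser detaches the whole list with one atomic exchange and only
 * then wakes the detached waiters, after the LWLock itself is already free.
 * A real version would still take the wait-list lock around the wakeups if
 * waiters may need to be re-queued rather than woken.
 */
static void
lw_release_and_wake(pg_atomic_uint64 *state)
{
    uint64  old;
    uint64  head;

    /* release the lock and detach the waiter list in one atomic exchange */
    old = pg_atomic_exchange_u64(state, LW_STATE_FREE_AND_EMPTY);  /* hypothetical constant */
    head = old & LW_QUEUE_INDEX_MASK;                              /* hypothetical mask */

    while (head != LW_QUEUE_EMPTY)                                 /* hypothetical marker */
    {
        PGPROC *proc = GetPGProcByNumber(head);

        head = lw_next_waiter(proc);    /* hypothetical: follow the queue link */
        PGSemaphoreUnlock(proc->sem);   /* the lock is already free at this point */
    }
}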


> Removing the queue spinlock bit and the corresponding contention should
> give a performance gain in highly concurrent LWLock usage scenarios.

> Currently, the maximum size of procarray is 2^18 (as stated in
> buf_internals.h), so with the use of the 64-bit LWLock state variable, we
> can store the procarray indexes for both head and tail of the queue, all
> the existing machinery, and even have some spare bits for future flags.

Another advantage: It shrinks the LWLock struct, which makes it cheaper to
make locks more granular...


> I cannot understand why the performance of the lockless queue patch has a
> minor regression in the region of around 20 connections, even when compared
> to the current master branch.

I suspect that's the point where the additional atomic operations hurt the
most, while not yet providing a substantial gain.

Greetings,

Andres Freund




Lockless queue of waiters in LWLock

2022-10-31 Thread Pavel Borisov
Hi, hackers!

When we take an LWLock, we already use an atomic CAS operation to
atomically modify the lock state even in the presence of concurrent
lock-takers. But if we cannot take the lock immediately, we need to put
the waiter on a wait list, and currently this is done non-atomically and
in a complicated way:
- Parts are done under LWLockWaitListLock, which includes a spin delay to
take.
- Also, we need a two-step procedure to immediately dequeue if, after
adding the current process to the wait queue, it turns out that we don't
need to (and cannot!) sleep because the lock is already free (see the
simplified sketch below).
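
For reference, this is roughly what the current slow path in master's
LWLockAcquire() looks like, heavily simplified (error handling, memory
barriers and instrumentation omitted):

for (;;)
{
    /* first attempt: CAS on the lock state */
    if (!LWLockAttemptLock(lock, mode))
        break;                          /* got the lock immediately */

    /* add ourselves to the wait list; this takes the wait-list spinlock */
    LWLockQueueSelf(lock, mode);

    /* second attempt, in case the lock was released in the meantime */
    if (!LWLockAttemptLock(lock, mode))
    {
        /* got the lock after all: undo the queuing (spinlock again) */
        LWLockDequeueSelf(lock);
        break;
    }

    /* really have to sleep until a releaser wakes us */
    PGSemaphoreLock(MyProc->sem);
}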

If the lock state contains references to the queue head and tail, we can
implement a lockless queue of waiters for the LWLock.  Adding new items to
the queue head or tail can be done with a single CAS operation (adding to
the tail will also require further setting the reference from the previous
tail).  Given that there could be only one lock releaser, which wakes up
waiters in the queue, we can handle all the concurrency issues with
reasonable complexity.

Removing the queue spinlock bit and the corresponding contention should
give a performance gain in highly concurrent LWLock usage scenarios.

Currently, the maximum size of procarray is 2^18 (as stated in
buf_internals.h), so with the use of the 64-bit LWLock state variable, we
can store the procarray indexes for both head and tail of the queue, all
the existing machinery, and even have some spare bits for future flags.
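
As an illustration of the packing (a sketch only; the bit layout and names
here are invented, not the patch's actual definitions, and it assumes
"port/atomics.h" and "storage/proc.h"):

#define LW_QUEUE_INDEX_BITS 18                           /* enough for 2^18 procs */
#define LW_QUEUE_INDEX_MASK ((UINT64CONST(1) << LW_QUEUE_INDEX_BITS) - 1)
#define LW_QUEUE_EMPTY      LW_QUEUE_INDEX_MASK          /* "no waiter" marker */
#define LW_QUEUE_HEAD_SHIFT 0
#define LW_QUEUE_TAIL_SHIFT LW_QUEUE_INDEX_BITS
/* bits 36..63 stay free for the lock mode, shared-locker count and flags */

/* Push the current backend onto the queue head with a single CAS. */
static void
lw_queue_push_head(pg_atomic_uint64 *state, uint64 myidx)
{
    uint64  old = pg_atomic_read_u64(state);

    for (;;)
    {
        uint64  newstate;

        /* link ourselves to the previous head before publishing ourselves */
        MyProc->lwQueueNext =
            (old >> LW_QUEUE_HEAD_SHIFT) & LW_QUEUE_INDEX_MASK;   /* hypothetical PGPROC field */

        newstate = (old & ~(LW_QUEUE_INDEX_MASK << LW_QUEUE_HEAD_SHIFT)) |
                   (myidx << LW_QUEUE_HEAD_SHIFT);
        if (pg_atomic_compare_exchange_u64(state, &old, newstate))
            break;              /* we are the new head */
        /* on failure, "old" now holds the current state; just retry */
    }
}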

Thus we almost entirely avoid spinlocks, replacing them with repeated
attempts to change the lock state when a CAS operation fails due to
concurrent state modification.

The attached patch implements the described approach.  It is based on
Alexander Korotkov's developments in the OrioleDB engine; I further
adapted those ideas with guidance from him.

We did some preliminary performance checks and saw that the patch
significantly increases performance in concurrent insert-only and
txid_current pgbench tests with ~1000 connections. The test scripts and
results are attached. I ran the tests on a 32-vCPU x86-64 AWS machine
using tmpfs as storage for the database to avoid random delays related
to disk I/O.

Recently, though, Andres proposed a different patch, which merely avoids
O(N) removal from the queue of waiters but improves performance even
more [1]. The results of a comparison of the master branch, the lockless
queue (current) patch, and Andres's patch are below. Please take into
account that the horizontal axis uses a log scale.

---
cat insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 10 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid,
:delta);

pgbench -d postgres -i -s 1 --unlogged-tables
echo -e "max_connections=2500\nmax_wal_senders=0\nwal_level=minimal\nmax_wal_size = 10GB\nshared_buffers = 2MB\nautovacuum = off\nfsync = off\nfull_page_writes = off\nmax_worker_processes = 1024\nmax_parallel_workers = 1024\n" > ./pgdata$v/postgresql.auto.conf
psql -dpostgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;"

for conns in 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 20 22 24 27 30 33 36 39
43 47 51 56 62 68 75 82 91 100 110 120 130 150 160 180 200 220 240 270 300
330 360 390 430 470 510 560 620 680 750 820 910 1000 1100 1200 1300 1500
1600 1800 2000

do
psql -dpostgres -c "truncate pgbench_accounts;"
psql -dpostgres -c "vacuum full pgbench_accounts;"
pgbench postgres -f insert.sql -s 1 -P20 -M prepared -T 10 -j 5 -c $conns
--no-vacuum | grep tps
done


--

cat txid.sql
select txid_current();

for conns in 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 20 22 24 27 30 33 36 39
43 47 51 56 62 68 75 82 91 100 110 120 130 150 160 180 200 220 240 270 300
330 360 390 430 470 510 560 620 680 750 820 910 1000 1100 1200 1300 1500
1600 1800 2000

do
pgbench postgres -f txid.sql -s 1 -P20 -M prepared -T 10 -j 5 -c $conns
--no-vacuum | grep tps
done


-
I cannot understand why the performance of the lockless queue patch has a
minor regression in the region of around 20 connections, even when compared
to the current master branch.


Are there scenarios where the lockless queue approach is superior? I
expected there should be, at least in theory. Probably there is a way to
improve the attached patch further to achieve that superiority.


Best regards,
Pavel Borisov,
Supabase.

[1]
https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6...@awork3.anarazel.de


v1-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data