Re: MultiXact\SLRU buffers configuration

2024-04-07 Thread Andrey M. Borodin



> On 7 Apr 2024, at 21:41, Alvaro Herrera  wrote:
> 
> Well, it would be nice to have *some* test, but as you say it is way
> more complex than the thing being tested, and it zooms in on the
> functioning of the multixact creation in insane quantum-physics ways ...
> to the point that you can no longer trust that multixact works the same
> way with the test than without it.  So what I did is manually run other
> tests (pgbench) to verify that the corner case in multixact creation is
> being handled correctly, and pushed the patch after a few corrections,
> in particular so that it would follow the CV sleeping protocol a little
> more closely to what the CV documentation suggests, which is not at all
> what the patch did.  I also fixed a few comments that had been neglected
> and changed the name and description of the CV in the docs.

That's excellent! Thank you!

> Now, maybe we can still add the test later, but it needs a rebase.

Sure. 'wait' injection points are there already, so I'll produce a patch with 
"prepared" injection points and re-implement the test on top of that. I'll put it 
in the July CF.


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2024-04-07 Thread Alvaro Herrera
On 2024-Feb-03, Andrey M. Borodin wrote:

> Here's the test draft. This test reliably reproduces sleep on CV when waiting 
> for the next multixact to be filled into the "members" SLRU.
> Cost of having this test:
> 1. We need a new injection point type "wait" (in addition to "error" and 
> "notice"). It cannot be avoided, because we need to sync at least 3 processes 
> to observe the condition we want.
> 2. We need a new way to declare an injection point that can happen inside a 
> critical section. I've called it a "prepared injection point".
> 
> The complexity of having this test is higher than the complexity of the CV-sleep 
> patch itself. Do we want it? If so I can produce a cleaner version; currently all 
> multixact tests are in the injection_points test module.

Well, it would be nice to have *some* test, but as you say it is way
more complex than the thing being tested, and it zooms in on the
functioning of the multixact creation in insane quantum-physics ways ...
to the point that you can no longer trust that multixact works the same
way with the test than without it.  So what I did is manually run other
tests (pgbench) to verify that the corner case in multixact creation is
being handled correctly, and pushed the patch after a few corrections,
in particular so that it would follow the CV sleeping protocol a little
more closely to what the CV documentation suggests, which is not at all
what the patch did.  I also fixed a few comments that had been neglected
and changed the name and description of the CV in the docs.
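
For reference, the retest-around-sleep shape that the condition variable
documentation prescribes looks roughly like this (a minimal sketch only: the
CV field, the predicate, and the wait event name below are placeholders, not
the committed multixact code):

/* Waiter side -- replaces the old pg_usleep(1000L) retry loop. */
ConditionVariablePrepareToSleep(&MultiXactState->nextOffsetCV);	/* hypothetical CV field */
while (!NextMultiXactOffsetIsVisible(multi))	/* hypothetical predicate */
	ConditionVariableSleep(&MultiXactState->nextOffsetCV,
						   WAIT_EVENT_MULTIXACT_CREATION);	/* placeholder wait event */
ConditionVariableCancelSleep();

/* Creator side -- only after RecordNewMultiXact() has made the offset visible. */
ConditionVariableBroadcast(&MultiXactState->nextOffsetCV);

The point is that the waiter re-tests its condition after every wakeup, and
the creator broadcasts only once the shared state is actually published.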

Now, maybe we can still add the test later, but it needs a rebase.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: MultiXact\SLRU buffers configuration

2024-04-07 Thread Andrey M. Borodin



> On 6 Apr 2024, at 14:24, Andrey M. Borodin  wrote:
> 
> What do you think?

OK, I'll follow this plan.
Since most parts of this thread have been committed, I'll mark the CF item as 
"committed".
Thanks to everyone involved!
See you in a followup thread about sleeping on CV.


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2024-04-06 Thread Andrey M. Borodin



> On 29 Feb 2024, at 06:59, Kyotaro Horiguchi  wrote:
> 
> At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin"  
> wrote in 
>> Here's the test draft. This test reliably reproduces sleep on CV when 
>> waiting for the next multixact to be filled into the "members" SLRU.
> 
> By the way, I raised a question about using multiple CVs
> simultaneously [1]. That is, I suspect that the current CV
> implementation doesn't allow us to use multiple condition variables at
> the same time, because all CVs use the same PGPROC member cvWaitLink
> to accommodate different waiter sets. If this assumption is correct,
> we should resolve the issue before spreading more uses of CVs.

Alvaro, Kyotaro, what's our plan for this?
It seems too late to deal with this pg_usleep(1000L) for PG17.
I propose the following course of action:
1. Close this long-standing CF item.
2. Start a new thread with the CV-sleep patch aimed at PG18.
3. Create a new entry in the July CF.

What do you think?


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2024-02-28 Thread Kyotaro Horiguchi
At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin"  
wrote in 
> Here's the test draft. This test reliably reproduces sleep on CV when waiting 
> for the next multixact to be filled into the "members" SLRU.

By the way, I raised a question about using multiple CVs
simultaneously [1]. That is, I suspect that the current CV
implementation doesn't allow us to use multiple condition variables at
the same time, because all CVs use the same PGPROC member cvWaitLink
to accommodate different waiter sets. If this assumption is correct,
we should resolve the issue before spreading more uses of CVs.

[1] 
https://www.postgresql.org/message-id/20240227.150709.1766217736683815840.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MultiXact\SLRU buffers configuration

2024-02-03 Thread Andrey M. Borodin


> On 28 Jan 2024, at 23:17, Andrey M. Borodin  wrote:
> 
> 
>> Perhaps a test to make the code reach the usleep(1000) can be written
>> using injection points (49cd2b93d7db)?
> 
> I've tried to prototype something like that. But the interesting point, between 
> GetNewMultiXactId() and RecordNewMultiXact(), is inside a critical section, and we 
> cannot have injection points in critical sections...
> Also, to implement such a test we need a "wait" type of injection point; see 
> step 2 in the attachment. With this type of injection point I can stop a backend 
> in the middle of entering information about a new MultiXact.

Here's the test draft. This test reliably reproduces sleep on CV when waiting 
for the next multixact to be filled into the "members" SLRU.
Cost of having this test:
1. We need a new injection point type "wait" (in addition to "error" and 
"notice"). It cannot be avoided, because we need to sync at least 3 processes 
to observe the condition we want.
2. We need a new way to declare an injection point that can happen inside a 
critical section. I've called it a "prepared injection point" (roughly sketched 
below).

The complexity of having this test is higher than the complexity of the CV-sleep 
patch itself. Do we want it? If so I can produce a cleaner version; currently all 
multixact tests are in the injection_points test module.
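
Roughly, the shape I have in mind for point 2 is the following (a sketch only;
the macro names and the injection point name are placeholders for the proposed
mechanism, not a final API). Everything that needs to allocate or look
anything up happens before the critical section, so firing the point inside it
is safe:

/* Load/prepare the point while allocation is still allowed. */
INJECTION_POINT_PREPARE("multixact-create");		/* hypothetical macro and name */

START_CRIT_SECTION();
/* ... GetNewMultiXactId() has handed out the new ID ... */
INJECTION_POINT_RUN_PREPARED("multixact-create");	/* hypothetical macro; the "wait" action can stall here */
/* ... RecordNewMultiXact() then fills pg_multixact/offsets and members ... */
END_CRIT_SECTION();

A second session attached to the point with the "wait" action can then hold a
backend right between assigning the multixact and recording it, which is
exactly the window where another backend ends up sleeping on the CV.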


Best regards, Andrey Borodin.


v2-0003-Test-multixact-CV-sleep.patch
Description: Binary data


v2-0002-Add-wait-type-for-injection-points.patch
Description: Binary data


v2-0001-Add-conditional-variable-to-wait-for-next-MultXac.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2024-01-28 Thread Andrey M. Borodin


> On 28 Jan 2024, at 17:49, Alvaro Herrera  wrote:
> 
> I'd appreciate it if you or Horiguchi-san can update his patch to remove
> use of usleep in favor of a CV in multixact, and keep this CF entry to
> cover it.

Sure! Sounds great!

> Perhaps a test to make the code reach the usleep(1000) can be written
> using injection points (49cd2b93d7db)?

I've tried to prototype something like that. But the interesting point, between 
GetNewMultiXactId() and RecordNewMultiXact(), is inside a critical section, and we 
cannot have injection points in critical sections...
Also, to implement such a test we need a "wait" type of injection point; see 
step 2 in the attachment. With this type of injection point I can stop a backend 
in the middle of entering information about a new MultiXact.


Best regards, Andrey Borodin.



0001-Add-conditional-variable-to-wait-for-next-MultXact-o.patch
Description: Binary data


0002-Add-wait-type-for-injection-points.patch
Description: Binary data


0003-Try-to-test-multixact-CV-sleep.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2024-01-28 Thread Alvaro Herrera
On 2024-Jan-27, Andrey Borodin wrote:

> thanks for the ping! The most important parts of this patch set are discussed in 
> [0]. If that patchset is committed, I'll withdraw the entry for this thread 
> from the commitfest.
> There's a version of Multixact-specific optimizations [1], but I hope they 
> will not be necessary with the effective caches developed in [0]. It seems to me 
> that the most important part of those optimizations is removing sleeps under the 
> SLRU lock on standby [2] by Kyotaro Horiguchi. But given that the cache 
> optimizations took 4 years to get closer to commit, I'm not sure we will get this 
> optimization any time soon...

I'd appreciate it if you or Horiguchi-san can update his patch to remove
use of usleep in favor of a CV in multixact, and keep this CF entry to
cover it.

Perhaps a test to make the code reach the usleep(1000) can be written
using injection points (49cd2b93d7db)?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)




Re: MultiXact\SLRU buffers configuration

2024-01-26 Thread Andrey Borodin



> On 20 Jan 2024, at 08:31, vignesh C  wrote:
> 
> On Mon, 9 Jan 2023 at 09:49, Andrey Borodin  wrote:
>> 
>> On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
>>> does not apply on top of HEAD as in [1], please post a rebased patch:
>>> 
>> Thanks! Here's the rebase.
> 
> I'm seeing that there has been no activity in this thread for more
> than a year now, so I'm planning to close this in the current commitfest
> unless someone is planning to take it forward.

Hi Vignesh,

thanks for the ping! The most important parts of this patch set are discussed in 
[0]. If that patchset is committed, I'll withdraw the entry for this thread 
from the commitfest.
There's a version of Multixact-specific optimizations [1], but I hope they will 
not be necessary with the effective caches developed in [0]. It seems to me that 
the most important part of those optimizations is removing sleeps under the SLRU 
lock on standby [2] by Kyotaro Horiguchi. But given that the cache optimizations 
took 4 years to get closer to commit, I'm not sure we will get this optimization 
any time soon...

Thanks!

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/flat/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/2ECE132B-C042-4489-930E-DBC5D0DAB84A%40yandex-team.ru#5f7e7022647be9eeecfc2ae75d765500
[2] 
https://www.postgresql.org/message-id/flat/20200515.090333.24867479329066911.horikyota.ntt%40gmail.com#855f8bb7205890579a363d2344b4484d



Re: MultiXact\SLRU buffers configuration

2024-01-19 Thread vignesh C
On Mon, 9 Jan 2023 at 09:49, Andrey Borodin  wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
> >
> Thanks! Here's the rebase.

I'm seeing that there has been no activity in this thread for more
than a year now, so I'm planning to close this in the current commitfest
unless someone is planning to take it forward.

Regards,
Vignesh




Re: MultiXact\SLRU buffers configuration

2023-01-11 Thread Andrey Borodin
Hi Dilip! Thank you for the review!

On Tue, Jan 10, 2023 at 9:58 PM Dilip Kumar  wrote:
>
> On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin  wrote:
> >
> > On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> > > does not apply on top of HEAD as in [1], please post a rebased patch:
> > >
> > Thanks! Here's the rebase.
>
> I was looking into this patch; it seems like three different
> optimizations are squeezed into a single patch:
> 1) dividing the buffer space into banks to reduce the sequential search cost,
> 2) a GUC parameter for the buffer size scale, and 3) some of the buffer size
> values are modified compared to what they are on HEAD.  I think these are 3
> patches which should be independently committable.
There's no point in dividing the SLRU buffers into parts unless the buffer
size is configurable.
And it's only possible to enlarge the default buffer size if the SLRU buffers
are divided into banks.
So the features can be viewed as independent commits, but they make no
sense separately.

But, probably, it's a good idea to split the patch back anyway, for
easier review.

>
> While looking into the first idea of dividing the buffer space into
> banks, I see that it will speed up finding the buffers, but OTOH while
> searching for the victim buffer it can actually hurt performance if
> the SLRU pages which are frequently accessed are not evenly
> distributed across the banks.  So imagine the cases where we have some
> banks with a lot of empty slots and other banks from which we
> frequently have to evict pages in order to get the new pages
> in.
>

Yes. Despite the extremely low probability of such a case, this
pattern, where a user accesses pages assigned to only one bank, may
happen.
That case is equivalent to having just one bank, which means small
buffers, just as we have now.

Thank you!

Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

2023-01-10 Thread Dilip Kumar
On Mon, Jan 9, 2023 at 9:49 AM Andrey Borodin  wrote:
>
> On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> > does not apply on top of HEAD as in [1], please post a rebased patch:
> >
> Thanks! Here's the rebase.

I was looking into this patch; it seems like three different
optimizations are squeezed into a single patch:
1) dividing the buffer space into banks to reduce the sequential search cost,
2) a GUC parameter for the buffer size scale, and 3) some of the buffer size
values are modified compared to what they are on HEAD.  I think these are 3
patches which should be independently committable.

While looking into the first idea of dividing the buffer space into
banks, I see that it will speed up finding the buffers, but OTOH while
searching for the victim buffer it can actually hurt performance if
the SLRU pages which are frequently accessed are not evenly
distributed across the banks.  So imagine the cases where we have some
banks with a lot of empty slots and other banks from which we
frequently have to evict pages in order to get the new pages
in.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: MultiXact\SLRU buffers configuration

2023-01-08 Thread Andrey Borodin
On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> does not apply on top of HEAD as in [1], please post a rebased patch:
>
Thanks! Here's the rebase.

Best regards, Andrey Borodin.


v24-0001-Divide-SLRU-buffers-into-8-associative-banks.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2023-01-03 Thread vignesh C
On Fri, 19 Aug 2022 at 21:18,  wrote:
>
> Andrey Borodin wrote 2022-08-18 06:35:
> >
> > I like the idea of one knob instead of one per each SLRU. Maybe we
> > even could deduce sane value from NBuffers? That would effectively
> > lead to 0 knobs :)
> >
> > Your patch have a prefix "v22-0006", does it mean there are 5 previous
> > steps of the patchset?
> >
> > Thank you!
> >
> >
> > Best regards, Andrey Borodin.
>
> Not sure it's possible to deduce from NBuffers only.
> slru_buffers_scale_shift looks like relief valve for systems with ultra
> scaled NBuffers.
>
> Regarding v22-0006 I just tried to choose index unique for this thread
> so now it's fixed to 0001 indexing.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
325bc54eed4ea0836a0bb715bb18342f0c1c668a ===
=== applying patch ./v23-0001-bucketed-SLRUs-simplified.patch
patching file src/include/miscadmin.h
...
patching file src/backend/utils/misc/guc.c
Hunk #1 FAILED at 32.
Hunk #2 FAILED at 2375.
2 out of 2 hunks FAILED -- saving rejects to file
src/backend/utils/misc/guc.c.rej

[1] - http://cfbot.cputube.org/patch_41_2627.log

Regards,
Vignesh




Re: MultiXact\SLRU buffers configuration

2022-12-20 Thread Andrey Borodin
On Sat, Jul 23, 2022 at 1:48 AM Thomas Munro  wrote:
>
> On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin  wrote:
> > Thomas, do you still have any doubts? Or is it certain that SLRU will be 
> > replaced by any better subsystem in 16?
>
> Hi Andrey,
>
> Sorry for my lack of replies on this and the other SLRU thread -- I'm
> thinking and experimenting.  More soon.
>

Hi Thomas,

PostgreSQL 16 feature freeze is approaching again. Let's choose
something from the possible solutions, even if the chosen one is
temporary.
Thank you!

Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

2022-08-19 Thread i . lazarev

Andrey Borodin wrote 2022-08-18 06:35:


I like the idea of one knob instead of one per each SLRU. Maybe we
even could deduce sane value from NBuffers? That would effectively
lead to 0 knobs :)

Your patch have a prefix "v22-0006", does it mean there are 5 previous
steps of the patchset?

Thank you!


Best regards, Andrey Borodin.


Not sure it's possible to deduce it from NBuffers only. 
slru_buffers_scale_shift looks like a relief valve for systems with 
ultra-scaled NBuffers.


Regarding v22-0006, I just tried to choose an index unique for this thread, 
so now it's fixed to 0001 indexing.

Index: src/include/miscadmin.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
--- a/src/include/miscadmin.h	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/include/miscadmin.h	(date 1660678108730)
@@ -176,6 +176,7 @@
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
 extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int slru_buffers_size_scale;
 
 extern PGDLLIMPORT int MyProcPid;
 extern PGDLLIMPORT pg_time_t MyStartTime;
Index: src/include/access/subtrans.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
--- a/src/include/access/subtrans.h	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/include/access/subtrans.h	(date 1660678108698)
@@ -12,7 +12,7 @@
 #define SUBTRANS_H
 
 /* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS	32
+#define NUM_SUBTRANS_BUFFERS	(32 << slru_buffers_size_scale)
 
 extern void SubTransSetParent(TransactionId xid, TransactionId parent);
 extern TransactionId SubTransGetParent(TransactionId xid);
Index: src/backend/utils/init/globals.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
--- a/src/backend/utils/init/globals.c	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/backend/utils/init/globals.c	(date 1660678064048)
@@ -150,3 +150,5 @@
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
+
+int			slru_buffers_size_scale = 2;	/* power 2 scale for SLRU buffers */
Index: src/backend/access/transam/slru.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
--- a/src/backend/access/transam/slru.c	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/backend/access/transam/slru.c	(date 1660678063980)
@@ -59,6 +59,7 @@
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
+#include "port/pg_bitutils.h"
 
 #define SlruFileName(ctl, path, seg) \
 	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -71,6 +72,17 @@
  */
 #define MAX_WRITEALL_BUFFERS	16
 
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
+
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_BANK_SIZE 8
+
 typedef struct SlruWriteAllData
 {
 	int			num_files;		/* # files actually open */
@@ -134,7 +146,7 @@
 static SlruErrorCause slru_errcause;
 static int	slru_errno;
 
-
+static void SlruAdjustNSlots(int *nslots, int *bankmask);
 static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
 static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
 static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -148,6 +160,25 @@
 	  int segpage, void *data);
 static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
 
+/*
+ * Pick number of slots and bank size optimal for hashed associative SLRU buffers.
+ * We declare SLRU nslots is always power of 2.
+ * We split SLRU to 8-sized hash banks, after some performance benchmarks.
+ * We hash pageno to banks by pageno masked by 3 upper bits.
+ */
+static void
+SlruAdjustNSlots(int *nslots, int *bankmask)
+{
+	Assert(*nslots > 0);
+	Assert(*nslots <= SLRU_MAX_ALLOWED_BUFFERS);
+
+	*nslots = (int) pg_nextpower2_32(Max(SLRU_BANK_SIZE, Min(*nslots, NBuffers / 256)));
+
+	*bankmask = *nslots / SLRU_BANK_SIZE - 1;
+
+	elog(DEBUG5, "nslots %d banksize %d nbanks %d bankmask %x", *nslots, SLRU_BANK_SIZE, *nslots / SLRU_BANK_SIZE, *bankmask);
+}
+
 /*
  * Initialization of shared memory
  */
@@ -156,6 +187,9 @@
 SimpleLruShmemSize(int nslots, int nlsns)
 {
 	Size		sz;
+	int			bankmask_ignore;
+
+	

Re: MultiXact\SLRU buffers configuration

2022-08-17 Thread Andrey Borodin



> On 17 Aug 2022, at 00:36, i.laza...@postgrespro.ru wrote:
> 
>> Andrey Borodin wrote 2022-07-23 11:39:
>> Yura, thank you for your benchmarks!
>> We already knew that patch can save the day on pathological workloads,
>> now we have a proof of this.
>> Also there's the evidence that user can blindly increase size of SLRU
>> if they want (with the second patch). So there's no need for hard
>> explanations on how to tune the buffers size.
> 
> Hi @Andrey.Borodin, With some considerations and performance checks from 
> @Yura.Sokolov we simplified your approach by the following:
> 
> 1. Preamble. We feel free to increase any SLRU's, since there's no 
> performance degradation on large Buffers count using your SLRU buckets 
> solution.
> 2. `slru_buffers_size_scale` is only one config param introduced for all 
> SLRUs. It scales SLRUs upper cap by power 2.
> 3. All SLRU buffers count are capped by both `MBuffers (shared_buffers)` and 
> `slru_buffers_size_scale`. see
> 4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` are 
> applied for every SLRU.
> 5. All SLRU buffers are always sized as power of 2, their hash bucket size is 
> always 8.
> 
> There's attached patch for your consideration. It does gather and simplify 
> both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` and 
> `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` to much simpler 
> approach.

I like the idea of one knob instead of one per SLRU. Maybe we could even 
deduce a sane value from NBuffers? That would effectively lead to 0 knobs :)

Your patch has a prefix "v22-0006"; does it mean there are 5 previous steps of 
the patchset?

Thank you!


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2022-08-16 Thread i . lazarev

Andrey Borodin wrote 2022-07-23 11:39:

Yura, thank you for your benchmarks!
We already knew that patch can save the day on pathological workloads,
now we have a proof of this.
Also there's the evidence that user can blindly increase size of SLRU
if they want (with the second patch). So there's no need for hard
explanations on how to tune the buffers size.


Hi @Andrey.Borodin, with some considerations and performance checks from 
@Yura.Sokolov we simplified your approach as follows:


1. Preamble. We feel free to increase any SLRU, since there's no 
performance degradation on large buffer counts when using your SLRU bucket 
solution.
2. `slru_buffers_size_scale` is the only config param introduced, for all 
SLRUs. It scales the SLRUs' upper cap by a power of 2.
3. All SLRU buffer counts are capped by both `NBuffers (shared_buffers)` 
and `slru_buffers_size_scale`.
4. Magic initial constants `NUM_*_BUFFERS << slru_buffers_size_scale` 
are applied for every SLRU.
5. All SLRU buffers are always sized as a power of 2; their hash bucket 
size is always 8 (a worked example of the resulting sizes is below).


There's a patch attached for your consideration. It gathers and 
simplifies both `v21-0001-Make-all-SLRU-buffer-sizes-configurable.patch` 
and `v21-0002-Divide-SLRU-buffers-into-8-associative-banks.patch` into a 
much simpler approach.
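
To make the sizing concrete, here is a small standalone sketch of how the
numbers work out under the attached patch's logic, assuming shared_buffers =
1GB (NBuffers = 131072 with 8kB blocks) and the default
slru_buffers_size_scale = 2; next_pow2() below stands in for
pg_nextpower2_32():

#include <stdio.h>

#define SLRU_BANK_SIZE 8

/* stand-in for pg_nextpower2_32() */
static unsigned int
next_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

int
main(void)
{
	int			NBuffers = 131072;			/* shared_buffers = 1GB / 8kB blocks */
	int			scale = 2;					/* slru_buffers_size_scale default */
	int			requested = 32 << scale;	/* NUM_SUBTRANS_BUFFERS -> 128 */
	int			capped = requested < NBuffers / 256 ? requested : NBuffers / 256;
	int			nslots = (int) next_pow2(capped < SLRU_BANK_SIZE ? SLRU_BANK_SIZE : capped);
	int			bankmask = nslots / SLRU_BANK_SIZE - 1;

	/* prints: subtrans: nslots=128 nbanks=16 bankmask=0xf */
	printf("subtrans: nslots=%d nbanks=%d bankmask=0x%x\n",
		   nslots, nslots / SLRU_BANK_SIZE, bankmask);
	return 0;
}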


Thank you, Yours,
- Ivan
Index: src/include/miscadmin.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
--- a/src/include/miscadmin.h	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/include/miscadmin.h	(date 1660678108730)
@@ -176,6 +176,7 @@
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
 extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int slru_buffers_size_scale;
 
 extern PGDLLIMPORT int MyProcPid;
 extern PGDLLIMPORT pg_time_t MyStartTime;
Index: src/include/access/subtrans.h
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h
--- a/src/include/access/subtrans.h	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/include/access/subtrans.h	(date 1660678108698)
@@ -12,7 +12,7 @@
 #define SUBTRANS_H
 
 /* Number of SLRU buffers to use for subtrans */
-#define NUM_SUBTRANS_BUFFERS	32
+#define NUM_SUBTRANS_BUFFERS	(32 << slru_buffers_size_scale)
 
 extern void SubTransSetParent(TransactionId xid, TransactionId parent);
 extern TransactionId SubTransGetParent(TransactionId xid);
Index: src/backend/utils/init/globals.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
--- a/src/backend/utils/init/globals.c	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/backend/utils/init/globals.c	(date 1660678064048)
@@ -150,3 +150,5 @@
 
 int			VacuumCostBalance = 0;	/* working state for vacuum */
 bool		VacuumCostActive = false;
+
+int			slru_buffers_size_scale = 2;	/* power 2 scale for SLRU buffers */
Index: src/backend/access/transam/slru.c
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
--- a/src/backend/access/transam/slru.c	(revision 020258fbd30d37ddd03d0ec68264d1544f8d2838)
+++ b/src/backend/access/transam/slru.c	(date 1660678063980)
@@ -59,6 +59,7 @@
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
+#include "port/pg_bitutils.h"
 
 #define SlruFileName(ctl, path, seg) \
 	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -71,6 +72,17 @@
  */
 #define MAX_WRITEALL_BUFFERS	16
 
+/*
+ * To avoid overflowing internal arithmetic and the size_t data type, the
+ * number of buffers should not exceed this number.
+ */
+#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
+
+/*
+ * SLRU bank size for slotno hash banks
+ */
+#define SLRU_BANK_SIZE 8
+
 typedef struct SlruWriteAllData
 {
 	int			num_files;		/* # files actually open */
@@ -134,7 +146,7 @@
 static SlruErrorCause slru_errcause;
 static int	slru_errno;
 
-
+static void SlruAdjustNSlots(int *nslots, int *bankmask);
 static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
 static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
 static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -148,6 +160,25 @@
 	  int segpage, void *data);
 static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
 
+/*
+ * Pick number of slots and bank size optimal for hashed associative SLRU buffers.
+ * We declare SLRU 

Re: MultiXact\SLRU buffers configuration

2022-07-23 Thread Thomas Munro
On Sat, Jul 23, 2022 at 8:41 PM Andrey Borodin  wrote:
> Thomas, do you still have any doubts? Or is it certain that SLRU will be 
> replaced by any better subsystem in 16?

Hi Andrey,

Sorry for my lack of replies on this and the other SLRU thread -- I'm
thinking and experimenting.  More soon.




Re: MultiXact\SLRU buffers configuration

2022-07-23 Thread Andrey Borodin



> On 21 Jul 2022, at 18:00, Yura Sokolov  wrote:
> 
> In this case a simple buffer increase does help. But "buckets"
> increase the performance gain.
Yura, thank you for your benchmarks!
We already knew that the patch can save the day on pathological workloads; now we 
have proof of this.
Also there's evidence that a user can blindly increase the size of an SLRU if they 
want (with the second patch). So there's no need for hard explanations on how 
to tune the buffer size.

Thomas, do you still have any doubts? Or is it certain that SLRU will be 
replaced by a better subsystem in 16?


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2022-07-21 Thread Yura Sokolov
Good day, all.

I did a benchmark of the patch on a 2-socket Xeon 5220 CPU @ 2.20GHz.
I used a "benchmark" we use to reproduce problems with SLRU on our
customers' setups.
In contrast to Shawn's tests, I concentrated on the bad case: a lot
of contention.

slru-funcs.sql - function definitions
  - the functions create a lot of subtransactions to stress subtrans
  - and select random rows for share to stress multixacts

slru-call.sql - the function call for the benchmark

slru-ballast.sql - randomly selects 1000 consecutive rows
"for update skip locked" to stress multixacts

patch1 - make SLRU buffers configurable
patch2 - make "8-associative banks"

Benchmark done by pgbench.
Inited with scale 1 to induce contention.
pgbench -i -s 1 testdb

Benchmark 1:
- low number of connections (50), 60% slru-call, 40% slru-ballast
pgbench -f slru-call.sql@60 -f slru-ballast.sql@40 -c 50 -j 75 -P 1 -T 30 
testdb

version | subtrans | multixact | tps
| buffers  | offs/memb | func+ballast
+--+---+--
master  | 32   | 8/16  | 184+119
patch1  | 32   | 8/16  | 184+119
patch1  | 1024 | 8/16  | 121+77
patch1  | 1024 | 512/1024  | 118+75
patch2  | 32   | 8/16  | 190+122
patch2  | 1024 | 8/16  | 190+125
patch2  | 1024 | 512/1024  | 190+127

As you can see, this test case degrades with a dumb increase of
SLRU buffers. But the use of a "hash table" in the form of "associative
buckets" keeps performance stable.

Benchmark 2:
- high connection number (600), 98% slru-call, 2% slru-ballast
pgbench -f slru-call.sql@98 -f slru-ballast.sql@2 -c 600 -j 75 -P 1 -T 30 
testdb

I don't paste "ballast" tps here since 2% make them too small,
and they're very noisy.

version | subtrans | multixact | tps
| buffers  | offs/memb | func
+--+---+--
master  | 32   | 8/16  | 13
patch1  | 32   | 8/16  | 13
patch1  | 1024 | 8/16  | 31
patch1  | 1024 | 512/1024  | 53
patch2  | 32   | 8/16  | 12
patch2  | 1024 | 8/16  | 34
patch2  | 1024 | 512/1024  | 67

In this case a simple buffer increase does help. But "buckets"
increase the performance gain.

I didn't paste here the results for the third part of the patch ("Pack SLRU...")
because I didn't see any major performance gain from it, and
it makes up a large part of the patch diff.

Rebased versions of the first two patch parts are attached.

regards,

Yura Sokolov


slru-ballast.sql
Description: application/sql


slru-call.sql
Description: application/sql


slru-func.sql
Description: application/sql
From 41ec9d1c54184c515d53ecc8021c4a998813f2a9 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sun, 11 Apr 2021 21:18:10 +0300
Subject: [PATCH v21 2/2] Divide SLRU buffers into 8-associative banks

We want to eliminate linear search within SLRU buffers.
To do so we divide SLRU buffers into banks. Each bank holds
approximately 8 buffers. Each SLRU pageno may reside only in one bank.
Adjacent pagenos reside in different banks.
---
 src/backend/access/transam/slru.c | 43 ---
 src/include/access/slru.h |  2 ++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index b65cb49d7ff..abc534bbd06 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -134,7 +134,7 @@ typedef enum
 static SlruErrorCause slru_errcause;
 static int	slru_errno;
 
-
+static void SlruAdjustNSlots(int* nslots, int* banksize, int* bankoffset);
 static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
 static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
 static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -148,6 +148,30 @@ static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 	  int segpage, void *data);
 static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
 
+/*
+ * Pick bank size optimal for N-associative SLRU buffers.
+ * We expect bank number to be picked from lowest bits of requested pageno.
+ * Thus we want number of banks to be power of 2. This routine computes number
+ * of banks aiming to make each bank of size 8. So we can pack page number and
+ * statuses of each bank on one cacheline.
+ */
+static void SlruAdjustNSlots(int* nslots, int* banksize, int* bankoffset)
+{
+	int nbanks = 1;
+	*banksize = *nslots;
+	*bankoffset = 0;
+	while (*banksize > 15)
+	{
+		if ((*banksize & 1) != 0)
+			*banksize +=1;
+		*banksize /= 2;
+		nbanks *= 2;
+		*bankoffset += 1;
+	}
+	elog(DEBUG5, "nslots %d banksize %d nbanks %d ", *nslots, *banksize, nbanks);
+	*nslots = *banksize * nbanks;
+}
+
 /*
  * Initialization of shared memory
  */
@@ -156,6 +180,8 @@ Size
 SimpleLruShmemSize(int nslots, int nlsns)
 {
 	Size		sz;
+	int bankoffset, banksize;
+	SlruAdjustNSlots(&nslots, &banksize, &bankoffset);
 
 	/* we assume nslots isn't so large as to risk overflow */
 	sz = MAXALIGN(sizeof(SlruSharedData));
@@ -190,6 +216,8 @@ SimpleLruInit(SlruCtl ctl, const char 

Re: MultiXact\SLRU buffers configuration

2022-03-18 Thread Andrey Borodin



> On 20 Feb 2022, at 02:38, Thomas Munro  wrote:
> 
> Back to this patch: assuming we can settle on a good-enough-for-now
> replacement algorithm, do we want to add this set of 7 GUCs?  Does
> anyone else want to weigh in on that?

Hi Thomas!

It seems we don't have any other input besides reviews and Andres's -0.2.
Is there a chance to proceed?

Best regards, Andrey Borodin.




Re: MultiXact\SLRU buffers configuration

2022-02-19 Thread Andrey Borodin



> On 20 Feb 2022, at 02:42, Andres Freund  wrote:
> 
> On 2022-02-20 10:38:53 +1300, Thomas Munro wrote:
>> Back to this patch: assuming we can settle on a good-enough-for-now
>> replacement algorithm, do we want to add this set of 7 GUCs?  Does
>> anyone else want to weigh in on that?
> 
> I'm -0.2 on it, given that we have a better path forward.
That’s a really good path forward, but it has been discussed for at least 3.5 
years [0], and is guaranteed not to be there until 2023. Gilles, Shawn, and Dmitry 
expressed opinions along the lines that the patch “is a must-have”, referring 
to real pathological performance degradation inflicted by SLRU cache 
starvation. And I can remember a dozen other incidents that would not have happened 
if the patch had been applied; e.g. this post refers to the patch as a cure 
[1].

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com
[1] 
https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/#what-can-we-do-about-getting-rid-of-nessie
 



Re: MultiXact\SLRU buffers configuration

2022-02-19 Thread Andres Freund
On 2022-02-20 10:38:53 +1300, Thomas Munro wrote:
> Back to this patch: assuming we can settle on a good-enough-for-now
> replacement algorithm, do we want to add this set of 7 GUCs?  Does
> anyone else want to weigh in on that?

I'm -0.2 on it, given that we have a better path forward.




Re: MultiXact\SLRU buffers configuration

2022-02-19 Thread Thomas Munro
On Sat, Feb 19, 2022 at 6:34 PM Andrey Borodin  wrote:
> There's feature freeze approaching again. I see that you are working on 
> moving SLRUs to buffer pools, but it is not clear to which PG version it will 
> land. And there is 100% consensus that first patch is useful and helps to 
> prevent big issues. Maybe let's commit 1'st step without lifting default 
> xact_buffers limit? Or 1st patch as-is with any simple technique that 
> prevents linear search in SLRU buffers.

Hi Andrey,

Yeah, the SLRU/buffer pool thing would be complete enough to propose
for 16 at the earliest.  I posted the early prototype to see what sort
of reaction the concept would get before doing more work; I know
others have investigated this topic too... maybe it can encourage more
patches, experimental results, ideas to be shared... but this is not
relevant for 15.

Back to this patch: assuming we can settle on a good-enough-for-now
replacement algorithm, do we want to add this set of 7 GUCs?  Does
anyone else want to weigh in on that?  Concretely, this patch adds:

multixact_offsets_buffers
multixact_members_buffers
subtrans_buffers
notify_buffers
serial_buffers
xact_buffers
commit_ts_buffers

I guess the people at
https://ottertune.com/blog/postgresql-knobs-list/ would be happy if we
did.  Hopefully we'd drop the settings in a future release once we
figure out the main buffer pool thing (or some other scheme to
automate sizing).




Re: MultiXact\SLRU buffers configuration

2022-02-18 Thread Andrey Borodin



> On 8 Apr 2021, at 17:22, Thomas Munro  wrote:
> 
> Thanks!  I chickened out of committing a buffer replacement algorithm
> patch written 11 hours before the feature freeze, but I also didn't
> really want to commit the GUC patch without that.  Ahh, if only we'd
> latched onto the real problems here just a little sooner, but there is
> always PostgreSQL 15, I heard it's going to be amazing.  Moved to next
> CF.

Hi Thomas!

Feature freeze is approaching again. I see that you are working on moving 
SLRUs to the buffer pool, but it is not clear in which PG version that will land. 
And there is 100% consensus that the first patch is useful and helps to prevent big 
issues. Maybe let's commit the 1st step without lifting the default xact_buffers 
limit? Or the 1st patch as-is, with any simple technique that prevents linear search 
in SLRU buffers.

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2022-01-20 Thread Andrey Borodin



> On 21 Jan 2022, at 05:19, Shawn Debnath  wrote:
> 
> On Thu, Jan 20, 2022 at 09:21:24PM +0500, Andrey Borodin wrote:
>> 
>>> On 20 Jan 2022, at 20:44, Shawn Debnath  wrote:
>> Can you please also test the 2nd patch against large multixact SLRUs?
>> The 2nd patch is not intended to make things better at default buffer sizes. 
>> It must preserve performance in the case of really huge SLRU buffers.
> 
> Test was performed on 128/256 for multixact offset/members cache as 
> stated in my previous email.  Sure I can test it for higher values - but 
> what's a real world value that would make sense? We have been using this 
> configuration successfully for a few of our customers that ran into 
> MultiXact contention.

Sorry, it seems I misinterpreted the results yesterday.
I had one concern about the 1st patch step: it makes the CLOG buffer size dependent 
on shared_buffers. But in your tests you seem to have already exercised 
xact_buffers = 24576 without noticeable degradation. Is that correct? I doubt a 
little bit that a linear search among 24K elements on each CLOG access does not 
incur a performance impact, but your tests seem to prove it.

IMV splitting the SLRU buffers into banks would make sense for values greater than 
1<<10. But you are right that 256 seems enough to cope with most of the problems of 
multixacts so far. I just thought about stressing the SLRU buffers with multixacts 
to be sure that the CLOG buffers will not suffer degradation. But yes, it's too 
indirect a test.

Maybe, just to be sure, let's repeat the tests with autovacuum turned off to stress 
xact_buffers?

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2022-01-20 Thread Andrey Borodin



> On 20 Jan 2022, at 20:44, Shawn Debnath  wrote:
> 
> If my test workload can be made better, please let me know. Happy to 
> re-run tests as needed.

Shawn, thanks for the benchmarking!

Can you please also test the 2nd patch against large multixact SLRUs?
The 2nd patch is not intended to make things better at default buffer sizes. It 
must preserve performance in the case of really huge SLRU buffers.

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact/SLRU buffers configuration

2022-01-15 Thread Andrey Borodin


> On 15 Jan 2022, at 20:46, Justin Pryzby  wrote:
>>> 
>>> I was planning on running a set of stress tests on these patches. Could 
>>> we confirm which ones we plan to include in the commitfest?
>> 
>> Many thanks for your interest. Here's the latest version.
> 
> This is failing to compile under linux and windows due to bitfield syntax.
> http://cfbot.cputube.org/andrey-borodin.html

Uh, sorry, I formatted the patch from the wrong branch.

Just tested Cirrus. It's wonderful, thanks! Really faster than doing stuff on 
my machines...

Best regards, Andrey Borodin.


v20-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data


v20-0002-Divide-SLRU-buffers-into-8-associative-banks.patch
Description: Binary data


v20-0003-Pack-SLRU-page_number-page_status-and-page_dirty.patch
Description: Binary data


Re: MultiXact/SLRU buffers configuration

2022-01-15 Thread Justin Pryzby
On Sat, Jan 15, 2022 at 12:16:59PM +0500, Andrey Borodin wrote:
> > On 15 Jan 2022, at 03:20, Shawn Debnath  wrote:
> > On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
> >>> PFA rebase of the patchset. Also I've added a patch to combine 
> >>> page_number, page_status, and page_dirty together to touch less 
> >>> cachelines.
> >> 
> >> The cfbot reports some errors on the latest version of the patch:
> >> 
> >> https://cirrus-ci.com/task/6121317215764480
> >> [...]
> >> Could you send a new version?  In the meantime I will switch the patch 
> >> status to Waiting on Author.
> > 
> > I was planning on running a set of stress tests on these patches. Could 
> > we confirm which ones we plan to include in the commitfest?
> 
> Many thanks for your interest. Here's the  latest version.

This is failing to compile under linux and windows due to bitfield syntax.
http://cfbot.cputube.org/andrey-borodin.html

And compile warnings:

slru.c: In function ‘SlruAdjustNSlots’:
slru.c:161:2: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
  161 |  int nbanks = 1;
  |  ^~~
slru.c: In function ‘SimpleLruReadPage_ReadOnly’:
slru.c:530:2: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
  530 |  int bankstart = (pageno & shared->bank_mask) * shared->bank_size;
  |  ^~~

Note that you can test the CI result using any github account without waiting
for the cfbot.  See ./src/tools/ci/README.

-- 
Justin




Re: MultiXact\SLRU buffers configuration

2022-01-14 Thread Andrey Borodin


> On 15 Jan 2022, at 03:20, Shawn Debnath  wrote:
> 
> On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
>>> PFA rebase of the patchset. Also I've added a patch to combine 
>>> page_number, page_status, and page_dirty together to touch less 
>>> cachelines.
>> 
>> The cfbot reports some errors on the latest version of the patch:
>> 
>> https://cirrus-ci.com/task/6121317215764480
>> [...]
>> Could you send a new version?  In the meantime I will switch the patch 
>> status
>> to Waiting on Author.
>> 
> 
> I was planning on running a set of stress tests on these patches. Could 
> we confirm which ones we plan to include in the commitfest?

Many thanks for your interest. Here's the latest version.

Best regards, Andrey Borodin.


v19-0003-Pack-SLRU-page_number-page_status-and-page_dirty.patch
Description: Binary data


v19-0002-Divide-SLRU-buffers-into-8-associative-banks.patch
Description: Binary data


v19-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2022-01-14 Thread Julien Rouhaud
Hi,
On Sun, Dec 26, 2021 at 03:09:59PM +0500, Andrey Borodin wrote:
> 
> 
> PFA rebase of the patchset. Also I've added a patch to combine page_number, 
> page_status, and page_dirty together to touch less cachelines.

The cfbot reports some errors on the latest version of the patch:

https://cirrus-ci.com/task/6121317215764480
[04:56:38.432] su postgres -c "make -s -j${BUILD_JOBS} world-bin"
[04:56:48.270] In file included from async.c:134:
[04:56:48.270] ../../../src/include/access/slru.h:47:28: error: expected 
identifier or ‘(’ before ‘:’ token
[04:56:48.270]47 | typedef enum SlruPageStatus:int16_t
[04:56:48.270]   |^
[04:56:48.270] ../../../src/include/access/slru.h:53:3: warning: data 
definition has no type or storage class
[04:56:48.270]53 | } SlruPageStatus;
[04:56:48.270]   |   ^~
[04:56:48.270] ../../../src/include/access/slru.h:53:3: warning: type defaults 
to ‘int’ in declaration of ‘SlruPageStatus’ [-Wimplicit-int]
[04:56:48.270] ../../../src/include/access/slru.h:58:2: error: expected 
specifier-qualifier-list before ‘SlruPageStatus’
[04:56:48.270]58 |  SlruPageStatus page_status;
[04:56:48.270]   |  ^~
[04:56:48.270] async.c: In function ‘asyncQueueAddEntries’:
[04:56:48.270] async.c:1448:41: error: ‘SlruPageEntry’ has no member named 
‘page_dirty’
[04:56:48.270]  1448 |  NotifyCtl->shared->page_entries[slotno].page_dirty = 
true;
[04:56:48.270]   | ^
[04:56:48.271] make[3]: *** [: async.o] Error 1
[04:56:48.271] make[3]: *** Waiting for unfinished jobs
[04:56:48.297] make[2]: *** [common.mk:39: commands-recursive] Error 2
[04:56:48.297] make[2]: *** Waiting for unfinished jobs
[04:56:54.554] In file included from clog.c:36:
[04:56:54.554] ../../../../src/include/access/slru.h:47:28: error: expected 
identifier or ‘(’ before ‘:’ token
[04:56:54.554]47 | typedef enum SlruPageStatus:int16_t
[04:56:54.554]   |^
[04:56:54.554] ../../../../src/include/access/slru.h:53:3: warning: data 
definition has no type or storage class
[04:56:54.554]53 | } SlruPageStatus;
[04:56:54.554]   |   ^~
[04:56:54.554] ../../../../src/include/access/slru.h:53:3: warning: type 
defaults to ‘int’ in declaration of ‘SlruPageStatus’ [-Wimplicit-int]
[04:56:54.554] ../../../../src/include/access/slru.h:58:2: error: expected 
specifier-qualifier-list before ‘SlruPageStatus’
[04:56:54.554]58 |  SlruPageStatus page_status;
[04:56:54.554]   |  ^~
[04:56:54.554] clog.c: In function ‘TransactionIdSetPageStatusInternal’:
[04:56:54.554] clog.c:396:39: error: ‘SlruPageEntry’ has no member named 
‘page_dirty’
[04:56:54.554]   396 |  XactCtl->shared->page_entries[slotno].page_dirty = true;
[04:56:54.554]   |   ^
[04:56:54.554] In file included from ../../../../src/include/postgres.h:46,
[04:56:54.554]  from clog.c:33:
[04:56:54.554] clog.c: In function ‘BootStrapCLOG’:
[04:56:54.554] clog.c:716:47: error: ‘SlruPageEntry’ has no member named 
‘page_dirty’
[04:56:54.554]   716 |  
Assert(!XactCtl->shared->page_entries[slotno].page_dirty);
[04:56:54.554]   |   ^
[04:56:54.554] ../../../../src/include/c.h:848:9: note: in definition of macro 
‘Assert’
[04:56:54.554]   848 |   if (!(condition)) \
[04:56:54.554]   | ^
[04:56:54.554] clog.c: In function ‘TrimCLOG’:
[04:56:54.554] clog.c:801:40: error: ‘SlruPageEntry’ has no member named 
‘page_dirty’
[04:56:54.554]   801 |   XactCtl->shared->page_entries[slotno].page_dirty = 
true;
[04:56:54.554]   |^
[04:56:54.554] In file included from ../../../../src/include/postgres.h:46,
[04:56:54.554]  from clog.c:33:
[04:56:54.554] clog.c: In function ‘clog_redo’:
[04:56:54.554] clog.c:997:48: error: ‘SlruPageEntry’ has no member named 
‘page_dirty’
[04:56:54.554]   997 |   
Assert(!XactCtl->shared->page_entries[slotno].page_dirty);
[04:56:54.554]   |^
[04:56:54.554] ../../../../src/include/c.h:848:9: note: in definition of macro 
‘Assert’
[04:56:54.554]   848 |   if (!(condition)) \
[04:56:54.554]   | ^
[04:56:54.555] make[4]: *** [: clog.o] Error 1
[04:56:54.555] make[3]: *** [../../../src/backend/common.mk:39: 
transam-recursive] Error 2
[04:56:54.555] make[3]: *** Waiting for unfinished jobs
[04:56:56.405] make[2]: *** [common.mk:39: access-recursive] Error 2
[04:56:56.405] make[1]: *** [Makefile:42: all-backend-recurse] Error 2
[04:56:56.405] make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2
[04:56:56.407]
[04:56:56.407] Exit status: 2

Could you send a new version?  In the meantime I will switch the patch status
to Waiting on Author.




Re: MultiXact\SLRU buffers configuration

2021-12-26 Thread Andrey Borodin


>> On 8 Apr 2021, at 15:22, Thomas Munro  wrote:
>>
> I have one more idea inspired by CPU caches.
> Let's make SLRU n-associative, where n ~ 8.
> We can divide buffers into "banks", number of banks must be power of 2.
> All banks are of equal size. We choose bank size to approximately satisfy 
> user's configured buffer size.
> Each page can live only within one bank. We use same search and eviction 
> algorithms as we used in SLRU, but we only need to search\evict over 8 
> elements.
> All SLRU data of a single bank will be colocated within at most 2 cache line.
> 
> I did not come up with idea how to avoid multiplication of bank_number * 
> bank_size in case when user configured 31337 buffers (any number that is 
> radically not a power of 2).

We can avoid this multiplication by using gapped memory under the SLRU 
page_statuses, but from my POV the complexity here is not worth the possible 
performance gain.

PFA a rebase of the patchset. Also I've added a patch to combine page_number, 
page_status, and page_dirty together to touch fewer cachelines.

Best regards, Andrey Borodin.

From ea59d2ebde818ddc2a9111858b3d956cbcc7bff2 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v=18 1/3] Make all SLRU buffer sizes configurable.

Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.

Remove the limits on xact_buffers and commit_ts_buffers.  The default
sizes for those caches are ~0.2% and ~0.1% of shared_buffers, as before,
but now there is no cap at 128 and 16 buffers respectively (unless
track_commit_timestamp is disabled, in the latter case, then we might as
well keep it tiny).  Sizes much larger than the old limits have been
shown to be useful on modern systems, and an earlier commit replaced a
linear search with a hash table to avoid problems with extreme cases.

Author: Andrey M. Borodin 
Reviewed-by: Anastasia Lubennikova 
Reviewed-by: Tomas Vondra 
Reviewed-by: Alexander Korotkov 
Reviewed-by: Gilles Darold 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru
---
 doc/src/sgml/config.sgml  | 135 ++
 src/backend/access/transam/clog.c |  23 ++-
 src/backend/access/transam/commit_ts.c|   5 +
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 ++
 src/backend/utils/misc/guc.c  |  99 +
 src/backend/utils/misc/postgresql.conf.sample |   9 ++
 src/include/access/clog.h |  10 ++
 src/include/access/commit_ts.h|   1 -
 src/include/access/multixact.h|   4 -
 src/include/access/slru.h |   5 +
 src/include/access/subtrans.h |   2 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   7 +
 src/include/storage/predicate.h   |   4 -
 18 files changed, 299 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index afbb6c35e30..57d9696abe8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1952,6 +1952,141 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_buffers (integer)
+  
+   multixact_offsets_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to use to cache the contents
+of pg_multixact/offsets (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 8.
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_buffers (integer)
+  
+   multixact_members_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to use to cache the contents
+of pg_multixact/members (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 16.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  subtrans_buffers (integer)
+  
+   subtrans_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to use to cache the contents
+of pg_subtrans (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 8.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  

Re: MultiXact\SLRU buffers configuration

2021-06-14 Thread Васильев Дмитрий
On Mon, 14 Jun 2021 at 15:07, Andrey Borodin wrote:

> PFA patch implementing this idea.
>

I benchmarked the v17 patches.
Testing was done on a 96-core machine, with PGDATA completely placed in
tmpfs.
PostgreSQL was built with CFLAGS -O2.

for-update PgBench script:
\set aid random_zipfian(1, 100, 2)
begin;
select :aid from pgbench_accounts where aid = :aid for update;
update pgbench_accounts set abalance = abalance + 1 where aid = :aid;
update pgbench_accounts set abalance = abalance * 2 where aid = :aid;
update pgbench_accounts set abalance = abalance - 2 where aid = :aid;
end;

Before each test, sample data was filled with "pgbench -i -s 100"; testing
was performed 3 times, for 1 hour per test.
The benchmark results are presented with varying
multixact_members_buffers and multixact_offsets_buffers (1:2 respectively):
settings  tps
multixact_members_buffers_64Kb   693.2
multixact_members_buffers_128Kb  691.4
multixact_members_buffers_192Kb  696.3
multixact_members_buffers_256Kb  694.4
multixact_members_buffers_320Kb  692.3
multixact_members_buffers_448Kb  693.7
multixact_members_buffers_512Kb  693.3
vanilla  676.1

Best regards, Dmitry Vasiliev.


Re: MultiXact\SLRU buffers configuration

2021-04-11 Thread Andrey Borodin


> On 8 Apr 2021, at 15:22, Thomas Munro  wrote:
> 
> On Thu, Apr 8, 2021 at 7:24 PM Andrey Borodin  wrote:
>> I agree that this version of eviction seems much more effective and less 
>> intrusive than RR. And it's still LRU, which is important for subsystem that 
>> is called SLRU.
>> shared->search_slotno is initialized implicitly with memset(). But this 
>> seems like a common practice.
>> Also comment above "max_search = Min(shared->num_slots, 
>> MAX_REPLACEMENT_SEARCH);" does not reflect changes.
>> 
>> Besides this patch looks good to me.
> 
> Thanks!  I chickened out of committing a buffer replacement algorithm
> patch written 11 hours before the feature freeze, but I also didn't
> really want to commit the GUC patch without that.  Ahh, if only we'd
> latched onto the real problems here just a little sooner, but there is
> always PostgreSQL 15, I heard it's going to be amazing.  Moved to next
> CF.

I have one more idea inspired by CPU caches.
Let's make SLRU n-associative, where n ~ 8.
We can divide the buffers into "banks"; the number of banks must be a power of 2.
All banks are of equal size. We choose the bank size to approximately satisfy the 
user's configured buffer size.
Each page can live only within one bank. We use the same search and eviction 
algorithms as we used in SLRU, but we only need to search/evict over 8 elements.
All SLRU data of a single bank will be colocated within at most 2 cache lines.

I did not come up with an idea of how to avoid the multiplication bank_number * 
bank_size in the case when the user configured 31337 buffers (any number that is 
radically not a power of 2).
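
To illustrate the lookup path, a sketch (field names follow the attached
patch, but this is an illustration, not the exact code; SlruPickBankVictim()
is a hypothetical helper standing in for the usual LRU victim choice):

static int
SlruSelectSlot(SlruShared shared, int pageno)
{
	/* the low bits of the page number select the bank */
	int			bankstart = (pageno & shared->bank_mask) * shared->bank_size;
	int			bankend = bankstart + shared->bank_size;
	int			slotno;

	/* look for the page among this bank's ~8 slots only */
	for (slotno = bankstart; slotno < bankend; slotno++)
	{
		if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
			shared->page_number[slotno] == pageno)
			return slotno;
	}

	/* not found: evict, again only within this bank */
	return SlruPickBankVictim(shared, bankstart, bankend);	/* hypothetical helper */
}

Since a bank's page numbers and statuses fit in a cache line or two, both the
hit path and the eviction path stay cheap regardless of the total number of
buffers.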

PFA patch implementing this idea.

Best regards, Andrey Borodin.




v17-0002-Divide-SLRU-buffers-into-n-associative-banks.patch
Description: Binary data


v17-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2021-04-08 Thread Thomas Munro
On Thu, Apr 8, 2021 at 7:24 PM Andrey Borodin  wrote:
> I agree that this version of eviction seems much more effective and less 
> intrusive than RR. And it's still LRU, which is important for subsystem that 
> is called SLRU.
> shared->search_slotno is initialized implicitly with memset(). But this seems 
> like a common practice.
> Also comment above "max_search = Min(shared->num_slots, 
> MAX_REPLACEMENT_SEARCH);" does not reflect changes.
>
> Besides this patch looks good to me.

Thanks!  I chickened out of committing a buffer replacement algorithm
patch written 11 hours before the feature freeze, but I also didn't
really want to commit the GUC patch without that.  Ahh, if only we'd
latched onto the real problems here just a little sooner, but there is
always PostgreSQL 15, I heard it's going to be amazing.  Moved to next
CF.




Re: MultiXact\SLRU buffers configuration

2021-04-08 Thread Andrey Borodin



> 8 апр. 2021 г., в 03:30, Thomas Munro  написал(а):
> 
> Here's another approach that is a little less exciting than
> "tournament RR" (or whatever that should be called; I couldn't find an
> established name for it).  This version is just our traditional linear
> search, except that it stops at 128, and remembers where to start from
> next time (like a sort of Fisher-Price GCLOCK hand).  This feels more
> committable to me.  You can argue that all buffers above 128 are bonus
> buffers that PostgreSQL 13 didn't have, so the fact that we can no
> longer find the globally least recently used page when you set
> xact_buffers > 128 doesn't seem too bad to me, as an incremental step
> (but to be clear, of course we can do better than this with more work
> in later releases).
I agree that this version of eviction seems much more effective and less
intrusive than RR. And it's still LRU, which is important for a subsystem that is
called SLRU.
shared->search_slotno is initialized implicitly with memset(), but this seems
like a common practice.
Also, the comment above "max_search = Min(shared->num_slots,
MAX_REPLACEMENT_SEARCH);" does not reflect the changes.

Besides this, the patch looks good to me.

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2021-04-07 Thread Thomas Munro
On Thu, Apr 8, 2021 at 12:13 AM Andrey Borodin  wrote:
> > 7 апр. 2021 г., в 14:44, Andrey Borodin  написал(а):
> > Maybe instead of fully associative cache with random replacement we could 
> > use 1-associative cache?
> > i.e. each page can reside only in one spcific buffer slot. If there's 
> > something else - evict it.
> > I think this would be as efficient as RR cache. And it's s fast.
>
> I thought a bit more and understood that RR is protected from two competing 
> pages in working set, while 1-associative cache is not. So, discard that idea.

It's an interesting idea.  I know that at least one proprietary fork
just puts the whole CLOG in memory for direct indexing, which is what
we'd have here if we said "oh, your xact_buffers setting is so large
I'm just going to use slotno = pageno & mask".

Here's another approach that is a little less exciting than
"tournament RR" (or whatever that should be called; I couldn't find an
established name for it).  This version is just our traditional linear
search, except that it stops at 128, and remembers where to start from
next time (like a sort of Fisher-Price GCLOCK hand).  This feels more
committable to me.  You can argue that all buffers above 128 are bonus
buffers that PostgreSQL 13 didn't have, so the fact that we can no
longer find the globally least recently used page when you set
xact_buffers > 128 doesn't seem too bad to me, as an incremental step
(but to be clear, of course we can do better than this with more work
in later releases).
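
To make that concrete, here is a minimal sketch of the bounded search (a
hypothetical function, not taken from the attached patches; it ignores the
clean-vs-dirty and I/O-in-progress preferences that the real replacement logic
has to care about):

#define MAX_REPLACEMENT_SEARCH 128

/*
 * Sketch: scan at most MAX_REPLACEMENT_SEARCH slots, starting where the
 * previous search stopped (shared->search_slotno), and pick the least
 * recently used valid page seen along the way.
 */
static int
SlruPickVictimSketch(SlruShared shared)
{
	int			max_search = Min(shared->num_slots, MAX_REPLACEMENT_SEARCH);
	int			slotno = shared->search_slotno;
	int			best_slot = -1;
	int			best_delta = -1;

	for (int i = 0; i < max_search; i++)
	{
		int			delta = shared->cur_lru_count - shared->page_lru_count[slotno];

		if (shared->page_status[slotno] == SLRU_PAGE_VALID && delta > best_delta)
		{
			best_slot = slotno;
			best_delta = delta;
		}
		if (++slotno >= shared->num_slots)
			slotno = 0;
	}
	shared->search_slotno = slotno;		/* remember where to resume next time */
	return best_slot;
}
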
From 72ebd5052851aa4b9aa281df30b9bf42c0ad5de4 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 25 Mar 2021 10:11:31 +1300
Subject: [PATCH v16 1/3] Add a buffer mapping table for SLRUs.

Instead of doing a linear search for the buffer holding a given page
number, use a hash table.  This will allow us to increase the size of
these caches.

Reviewed-by: Andrey M. Borodin 
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru
---
 src/backend/access/transam/slru.c | 121 +-
 src/include/access/slru.h |   2 +
 2 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 82149ad782..82c61c475b 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -58,6 +58,7 @@
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
+#include "utils/hsearch.h"
 
 #define SlruFileName(ctl, path, seg) \
 	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -79,6 +80,12 @@ typedef struct SlruWriteAllData
 
 typedef struct SlruWriteAllData *SlruWriteAll;
 
+typedef struct SlruMappingTableEntry
+{
+	int			pageno;
+	int			slotno;
+} SlruMappingTableEntry;
+
 /*
  * Populate a file tag describing a segment file.  We only use the segment
  * number, since we can derive everything else we need by having separate
@@ -146,13 +153,16 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 	  int segpage, void *data);
 static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static void	SlruMappingAdd(SlruCtl ctl, int pageno, int slotno);
+static void	SlruMappingRemove(SlruCtl ctl, int pageno);
+static int	SlruMappingFind(SlruCtl ctl, int pageno);
 
 /*
  * Initialization of shared memory
  */
 
-Size
-SimpleLruShmemSize(int nslots, int nlsns)
+static Size
+SimpleLruStructSize(int nslots, int nlsns)
 {
 	Size		sz;
 
@@ -167,10 +177,16 @@ SimpleLruShmemSize(int nslots, int nlsns)
 
 	if (nlsns > 0)
 		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */
-
 	return BUFFERALIGN(sz) + BLCKSZ * nslots;
 }
 
+Size
+SimpleLruShmemSize(int nslots, int nlsns)
+{
+	return SimpleLruStructSize(nslots, nlsns) +
+		hash_estimate_size(nslots, sizeof(SlruMappingTableEntry));
+}
+
 /*
  * Initialize, or attach to, a simple LRU cache in shared memory.
  *
@@ -187,11 +203,14 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 			  LWLock *ctllock, const char *subdir, int tranche_id,
 			  SyncRequestHandler sync_handler)
 {
+	char		mapping_table_name[SHMEM_INDEX_KEYSIZE];
+	HASHCTL		mapping_table_info;
+	HTAB	   *mapping_table;
 	SlruShared	shared;
 	bool		found;
 
 	shared = (SlruShared) ShmemInitStruct(name,
-		  SimpleLruShmemSize(nslots, nlsns),
+		  SimpleLruStructSize(nslots, nlsns),
 		  );
 
 	if (!IsUnderPostmaster)
@@ -258,11 +277,21 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 	else
 		Assert(found);
 
+	/* Create or find the buffer mapping table. */
+	memset(_table_info, 0, sizeof(mapping_table_info));
+	mapping_table_info.keysize = sizeof(int);
+	mapping_table_info.entrysize = sizeof(SlruMappingTableEntry);
+	snprintf(mapping_table_name, sizeof(mapping_table_name),
+			 "%s Lookup Table", name);
+	mapping_table = ShmemInitHash(mapping_table_name, nslots, nslots,
+  

Re: MultiXact\SLRU buffers configuration

2021-04-07 Thread Andrey Borodin



> 7 апр. 2021 г., в 14:44, Andrey Borodin  написал(а):
> 
> Maybe instead of fully associative cache with random replacement we could use 
> 1-associative cache?
> i.e. each page can reside only in one spcific buffer slot. If there's 
> something else - evict it.
> I think this would be as efficient as RR cache. And it's s fast.

I thought a bit more and understood that RR is protected from two competing
pages in the working set, while a 1-associative cache is not. So, discard that idea.

Best regards, Andrey Borodin.






Re: MultiXact\SLRU buffers configuration

2021-04-07 Thread Andrey Borodin



> 7 апр. 2021 г., в 08:59, Thomas Munro  написал(а):
> 
> The remaining thing that bothers me about this patch set is that there is 
> still a linear search in the replacement algorithm, and it runs with an 
> exclusive lock.  That creates a serious problem for large caches that still 
> aren't large enough.  I wonder if we can do something to improve that 
> situation in the time we have.  I considered a bunch of ideas but could only 
> find one that fits with slru.c's simplistic locking while tracking recency.  
> What do you think about a hybrid of SLRU and random replacement, that retains 
> some characteristics of both?  You could think of it as being a bit like the 
> tournament selection of the genetic algorithm, with a tournament size of 
> (say) 8 or 16.  Any ideas on how to evaluate this and choose the number?  See 
> attached.
> 

Maybe instead of a fully associative cache with random replacement we could use a
1-associative cache?
I.e. each page can reside only in one specific buffer slot. If there's something
else there - evict it.
I think this would be as efficient as an RR cache. And it's so fast.

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2021-04-07 Thread Thomas Munro
On Sun, Apr 4, 2021 at 7:57 AM Andrey Borodin  wrote:
> I was toying around with big values. For example if we set different big
xact_buffers we can get something like
> FATAL:  not enough shared memory for data structure "Notify" (72768 bytes
requested)
> FATAL:  not enough shared memory for data structure "Async Queue Control"
(2492 bytes requested)
> FATAL:  not enough shared memory for data structure "Checkpointer Data"
(393280 bytes requested)
>
> But never anything about xact_buffers. I don't think it's important,
though.

I had added the hash table size in SimpleLruShmemSize(), but then
SimpleLruInit() passed that same value in when allocating the struct, so
the struct was oversized.  Oops.  Fixed.

> > Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
> > only if you have track_commit_timestamp enabled.

> Is there a reason to leave 16 pages if commit_ts is disabled? They might
be useful for some artefacts of previously enabled commit_ts?

Alvaro, do you have an opinion on that?

The remaining thing that bothers me about this patch set is that there is
still a linear search in the replacement algorithm, and it runs with an
exclusive lock.  That creates a serious problem for large caches that still
aren't large enough.  I wonder if we can do something to improve that
situation in the time we have.  I considered a bunch of ideas but could
only find one that fits with slru.c's simplistic locking while tracking
recency.  What do you think about a hybrid of SLRU and random replacement,
that retains some characteristics of both?  You could think of it as being
a bit like the tournament selection of the genetic algorithm, with a
tournament size of (say) 8 or 16.  Any ideas on how to evaluate this and
choose the number?  See attached.
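
Roughly, the idea is this (a sketch only, assuming a tournament size of 8 and
using random() for slot sampling; the attached patch may differ in details):

#define TOURNAMENT_SIZE 8

/*
 * "Tournament" replacement sketch: sample TOURNAMENT_SIZE random slots and
 * evict the least recently used of that sample, instead of scanning every
 * buffer under the lock.
 */
static int
SlruTournamentVictim(SlruShared shared)
{
	int			best_slot = -1;
	int			best_delta = -1;

	for (int i = 0; i < TOURNAMENT_SIZE; i++)
	{
		int			slotno = random() % shared->num_slots;
		int			delta = shared->cur_lru_count - shared->page_lru_count[slotno];

		if (shared->page_status[slotno] == SLRU_PAGE_VALID && delta > best_delta)
		{
			best_slot = slotno;
			best_delta = delta;
		}
	}
	return best_slot;			/* may be -1 if every sampled slot was busy */
}
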
From 5f61bd7d227077f86649a890be603eae01c828f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 25 Mar 2021 10:11:31 +1300
Subject: [PATCH v15 1/3] Add a buffer mapping table for SLRUs.

Instead of doing a linear search for the buffer holding a given page
number, use a hash table.  This will allow us to increase the size of
these caches.

Reviewed-by: Andrey M. Borodin 
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru
---
 src/backend/access/transam/slru.c | 121 +-
 src/include/access/slru.h |   2 +
 2 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 82149ad782..82c61c475b 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -58,6 +58,7 @@
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
+#include "utils/hsearch.h"
 
 #define SlruFileName(ctl, path, seg) \
 	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
@@ -79,6 +80,12 @@ typedef struct SlruWriteAllData
 
 typedef struct SlruWriteAllData *SlruWriteAll;
 
+typedef struct SlruMappingTableEntry
+{
+	int			pageno;
+	int			slotno;
+} SlruMappingTableEntry;
+
 /*
  * Populate a file tag describing a segment file.  We only use the segment
  * number, since we can derive everything else we need by having separate
@@ -146,13 +153,16 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 	  int segpage, void *data);
 static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
+static void	SlruMappingAdd(SlruCtl ctl, int pageno, int slotno);
+static void	SlruMappingRemove(SlruCtl ctl, int pageno);
+static int	SlruMappingFind(SlruCtl ctl, int pageno);
 
 /*
  * Initialization of shared memory
  */
 
-Size
-SimpleLruShmemSize(int nslots, int nlsns)
+static Size
+SimpleLruStructSize(int nslots, int nlsns)
 {
 	Size		sz;
 
@@ -167,10 +177,16 @@ SimpleLruShmemSize(int nslots, int nlsns)
 
 	if (nlsns > 0)
 		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */
-
 	return BUFFERALIGN(sz) + BLCKSZ * nslots;
 }
 
+Size
+SimpleLruShmemSize(int nslots, int nlsns)
+{
+	return SimpleLruStructSize(nslots, nlsns) +
+		hash_estimate_size(nslots, sizeof(SlruMappingTableEntry));
+}
+
 /*
  * Initialize, or attach to, a simple LRU cache in shared memory.
  *
@@ -187,11 +203,14 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 			  LWLock *ctllock, const char *subdir, int tranche_id,
 			  SyncRequestHandler sync_handler)
 {
+	char		mapping_table_name[SHMEM_INDEX_KEYSIZE];
+	HASHCTL		mapping_table_info;
+	HTAB	   *mapping_table;
 	SlruShared	shared;
 	bool		found;
 
 	shared = (SlruShared) ShmemInitStruct(name,
-		  SimpleLruShmemSize(nslots, nlsns),
+		  SimpleLruStructSize(nslots, nlsns),
 		  );
 
 	if (!IsUnderPostmaster)
@@ -258,11 +277,21 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 	else
 		Assert(found);
 
+	/* Create or find the buffer mapping table. */
+	memset(_table_info, 0, sizeof(mapping_table_info));
+	

Re: MultiXact\SLRU buffers configuration

2021-04-03 Thread Andrey Borodin



> 1 апр. 2021 г., в 06:40, Thomas Munro  написал(а):
> 
> 2.  Remove the cap of 128 buffers for xact_buffers as agreed.  We
> still need a cap though, to avoid a couple of kinds of overflow inside
> slru.c, both when computing the default value and accepting a
> user-provided number.  I introduced SLRU_MAX_ALLOWED_BUFFERS to keep
> it <= 1GB, and tested this on a 32 bit build with extreme block sizes.
BTW we do not document maximum values right now.
I was toying around with big values. For example, with various large xact_buffers
settings we can get something like
FATAL:  not enough shared memory for data structure "Notify" (72768 bytes 
requested)
FATAL:  not enough shared memory for data structure "Async Queue Control" (2492 
bytes requested)
FATAL:  not enough shared memory for data structure "Checkpointer Data" (393280 
bytes requested)

But never anything about xact_buffers. I don't think it's important, though.

> 
> Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
> only if you have track_commit_timestamp enabled.  
Is there a reason to leave 16 pages if commit_ts is disabled? They might be 
useful for some artefacts of previously enabled commit_ts?

> 4.  Change the default for commit_ts_buffers back to shared_buffers /
> 1024 (with a minimum of 4), because I think you might have changed it
> by a copy and paste error -- or did you intend to make the default
> higher?
I changed the default due to some experiments with 
https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql
In fact, the most important part of that thread was removing the cap, which the
patchset now does.

Thanks!

Best regards, Andrey Borodin.







Re: MultiXact\SLRU buffers configuration

2021-03-31 Thread Thomas Munro
On Thu, Apr 1, 2021 at 10:09 AM Andrey Borodin  wrote:
> > 29 марта 2021 г., в 11:26, Andrey Borodin  написал(а):
> > My TODO list:
> > 1. Try to break patch set v13-[0001-0004]
> > 2. Think how to measure performance of linear search versus hash search in 
> > SLRU buffer mapping.
>
> Hi Thomas!
> I'm still doing my homework. And to this moment all my catch is that 
> "utils/dynahash.h" is not necessary.

Thanks.  Here's a new patch with that fixed, and also:

1.  New names ("... Mapping Table" -> "... Lookup Table" in
pg_shmem_allocations view, "clog_buffers" -> "xact_buffers") and a
couple of typos fixed here and there.

2.  Remove the cap of 128 buffers for xact_buffers as agreed.  We
still need a cap though, to avoid a couple of kinds of overflow inside
slru.c, both when computing the default value and accepting a
user-provided number.  I introduced SLRU_MAX_ALLOWED_BUFFERS to keep
it <= 1GB, and tested this on a 32 bit build with extreme block sizes.

Likewise, I removed the cap of 16 buffers for commit_ts_buffers, but
only if you have track_commit_timestamp enabled.  It seems silly to
waste 1MB per 1GB of shared_buffers on a feature you're not using.  So
the default is capped at 16 in that case to preserve existing
behaviour, but otherwise can be set very large if you want.

I think it's plausible that we'll want to make the multixact sizes
adaptive too, but I think that might have to be a job for later.  Likewise,
I am sure that subtransaction-heavy workloads might be slower than
they need to be due to the current default, but I have not done the
research.  With these new GUCs, people are free to experiment and
develop theories about what the defaults should be in v15.

3.  In the special case of xact_buffers, there is a lower cap of
512MB, because there's no point trying to cache more xids than there
can be in existence, and that is computed by working backwards from
CLOG_XACTS_PER_PAGE etc.  It's not possible to do the same sort of
thing for the other SLRUs without overflow problems, and it doesn't
seem worth trying to fix that right now (1GB of cached commit
timestamps ought to be enough for anyone™, while the theoretical
maximum is 10 bytes for 2b xids = 20GB).

To make this more explicit for people not following our discussion in
detail: with shared_buffers=0...512MB, the behaviour doesn't change.
But for shared_buffers=1GB you'll get twice as many xact_buffers as
today (2MB instead of being capped at 1MB), and it keeps scaling
linearly from there at 0.2%.  In other words, all real world databases
will get a boost in this department.

4.  Change the default for commit_ts_buffers back to shared_buffers /
1024 (with a minimum of 4), because I think you might have changed it
by a copy and paste error -- or did you intend to make the default
higher?

5.  Improve descriptions for the GUCs, visible in pg_settings view, to
match the documentation for related wait events.  So "for commit log
SLRU" -> "for the transaction status SLRU cache", and similar
corrections elsewhere.  (I am tempted to try to find a better word
than "SLRU", which doesn't seem relevant to users, but for now
consistency is good.)

6.  Added a callback so that SHOW xact_buffers and SHOW
commit_ts_buffers display the real size in effect (instead of "0" for
default).
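
For point 6, a minimal sketch of such a show hook (hypothetical names;
CLOGShmemBuffers() stands in here for whatever computes the effective size, and
the hook would be wired up via the GUC's show_hook field):

static char xact_buffers_str[32];

/* Report the effective number of buffers instead of the raw GUC value. */
static const char *
show_xact_buffers(void)
{
	snprintf(xact_buffers_str, sizeof(xact_buffers_str),
			 "%d", CLOGShmemBuffers());
	return xact_buffers_str;
}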

I tried running with xact_buffers=1 and soon saw why you changed it to
interpret 1 the same as 0; with 1 you hit buffer starvation and get
stuck.  I wish there were some way to say "the default for this GUC is
0, but if it's not zero then it's got to be at least 4".  I didn't
study the theoretical basis for the previous minimum value of 4, so I
think we should keep it that way, so that if you say 3 you get 4.  I
thought it was better to express that like so:

/* Use configured value if provided. */
if (xact_buffers > 0)
return Max(4, xact_buffers);
return Min(CLOG_MAX_ALLOWED_BUFFERS, Max(4, NBuffers / 512));

> I'm thinking about hashtables and measuring performance near optimum of 
> linear search does not seem a good idea now.
> It's impossible to prove that difference is statistically significant on all 
> platforms. But even on one platform measurements are just too noisy.
>
> Shared buffers lookup table is indeed very similar to this SLRU lookup table. 
> And it does not try to use more memory than needed. I could not find 
> pgbench-visible impact of growing shared buffer lookup table. Obviously, 
> because it's not a bottleneck on regular workload. And it's hard to guess 
> representative pathological workload.

Thanks for testing.  I agree that it's a good idea to follow the main
buffer pool's approach for our first version of this.  One small
difference is that the SLRU patch performs the hash computation while
it holds the lock.  If we computed the hash first and used
hash_search_with_hash_value(), we could compute it before we obtain
the lock, like the main buffer pool.
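
A rough sketch of that variant, reusing the SlruMappingTableEntry struct from
the patch (the surrounding names are illustrative):

/* Sketch: compute the hash before taking the SLRU control lock, the way the
 * main buffer pool does, then look up with the precomputed value. */
static int
SlruMappingFindPrehashed(HTAB *mapping_table, SlruShared shared, int pageno)
{
	SlruMappingTableEntry *entry;
	uint32		hashvalue = get_hash_value(mapping_table, &pageno);
	int			slotno = -1;

	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
	entry = hash_search_with_hash_value(mapping_table, &pageno, hashvalue,
										HASH_FIND, NULL);
	if (entry)
		slotno = entry->slotno;
	LWLockRelease(shared->ControlLock);

	return slotno;
}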

If we computed the hash value first, we could also ignore the rule in

Re: MultiXact\SLRU buffers configuration

2021-03-31 Thread Andrey Borodin


> 29 марта 2021 г., в 11:26, Andrey Borodin  написал(а):
> 
> My TODO list:
> 1. Try to break patch set v13-[0001-0004]
> 2. Think how to measure performance of linear search versus hash search in 
> SLRU buffer mapping.

Hi Thomas!
I'm still doing my homework. So far, all I have caught is that the
"utils/dynahash.h" include is not necessary.

I'm thinking about hashtables, and measuring performance near the optimum of the
linear search does not seem like a good idea now.
It's impossible to prove that the difference is statistically significant on all
platforms. And even on one platform the measurements are just too noisy.

The shared buffers lookup table is indeed very similar to this SLRU lookup table,
and it does not try to use more memory than needed. I could not find a
pgbench-visible impact of growing the shared buffer lookup table, obviously
because it's not a bottleneck on a regular workload. And it's hard to guess a
representative pathological workload.

In fact, this thread started with a proposal to use a reader-writer lock for multis
(instead of an exclusive lock), and that proposal ran into the same problem. It's
very hard to create a stable reproduction of a pathological workload where this lock
is heavily contended. Many people have observed the problem, but still there is no
open repro.

I bet it will be hard to prove that simplehash is any better than HTAB. But if
it is really better, shared buffers could benefit from the same technique.

I think it's just fine to use HTAB with the normal size, as long as shared buffers
do so. But there we allocate slightly more space: InitBufTable(NBuffers +
NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we
always do SlruMappingRemove() before SlruMappingAdd(), so it is not necessary.

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2021-03-29 Thread Andrey Borodin



> 29 марта 2021 г., в 02:15, Thomas Munro  написал(а):
> 
> On Sat, Mar 27, 2021 at 6:31 PM Andrey Borodin  wrote:
>>> 27 марта 2021 г., в 01:26, Thomas Munro  написал(а):
>>> , and murmurhash which is inlineable and
>>> branch-free.
> 
>> I think pageno is a hash already. Why hash any further? And pages accessed 
>> together will have smaller access time due to colocation.
> 
> Yeah, if clog_buffers is large enough then it's already a "perfect
> hash", but if it's not then you might get some weird "harmonic"
> effects (not sure if that's the right word), basically higher or lower
> collision rate depending on coincidences in the data.  If you apply a
> hash, the collisions should be evenly spread out so at least it'll be
> somewhat consistent.  Does that make sense?
As far as I understand, "harmonic" effects only make sense if the distribution
is unknown. A hash protects against "periodic" data whose period is equal to the
hash table size. I don't think we need to protect against this case; SLRU data is
expected to be localised...
The cost of this protection is the need to calculate a murmur hash on each SLRU
lookup. Probably 10-100ns. Seems like not a big deal.

> (At some point I figured out that the syscaches have lower collision
> rates and perform better if you use oids directly instead of hashing
> them... but then it's easy to create a pathological pattern of DDL
> that turns your hash table into a linked list.  Not sure what to think
> about that.)
> 
>>> I had to tweak it to support "in-place" creation and
>>> fixed size (in other words, no allocators, for use in shared memory).
> 
>> We really need to have a test to know what happens when this structure goes 
>> out of memory, as you mentioned below. What would be apropriate place for 
>> simplehash tests?
> 
> Good questions.  This has to be based on being guaranteed to have
> enough space for all of the entries, so the question is really just
> "how bad can performance get with different load factors".  FWIW there
> were some interesting cases with clustering when simplehash was first
> used in the executor (see commits ab9f2c42 and parent) which required
> some work on hashing quality to fix.
Interesting read. I didn't know much about simplehash, but it seems like there are
still many cases where it can be used for good. I always wondered why Postgres
uses only Larson's linear hash.

> 
>>> Then I was annoyed that I had to add a "status" member to our struct,
>>> so I tried to fix that.
> 
>> Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash 
>> align it well?
> 
> With that "intrusive status" patch, the size is back to 8.  But I
> think I made a mistake: I made it steal some key space to indicate
> presence, but I think the presence test should really get access to
> the whole entry so that you can encode it in more ways.  For example,
> with slotno == -1.
> 
> Alright, considering the date, if we want to get this into PostgreSQL
> 14 it's time to make some decisions.
> 
> 1.  Do we want customisable SLRU sizes in PG14?
> 
> +1 from me, we have multiple reports of performance gains from
> increasing various different SLRUs, and it's easy to find workloads
> that go faster.
Yes, this is the main point of this discussion. So +1 from me too.

> 
> One thought: it'd be nice if the user could *see* the current size,
> when using the default.  SHOW clog_buffers -> 0 isn't very helpful if
> you want to increase it, but don't know what it's currently set to.
> Not sure off the top of my head how best to do that.
Don't we expect that the SHOW command indicates exactly the same value as in the
config or the SET command? If this convention does not exist, then showing the
effective value is probably a good idea.


> 2.  What names do we want the GUCs to have?  Here's what we have:
> 
> Proposed GUC  DirectorySystem views
> clog_buffers  pg_xact  Xact
> multixact_offsets_buffers pg_multixact/offsets MultiXactOffset
> multixact_members_buffers pg_multixact/members MultiXactMember
> notify_bufferspg_notifyNotify
> serial_bufferspg_serialSerial
> subtrans_buffers  pg_subtrans  Subtrans
> commit_ts_buffers pg_commit_ts CommitTs
> 
> By system views, I mean pg_stat_slru, pg_shmem_allocations and
> pg_stat_activity (lock names add "SLRU" on the end).
> 
> Observations:
> 
> It seems obvious that "clog_buffers" should be renamed to "xact_buffers".
+1
> It's not clear whether the multixact GUCs should have the extra "s"
> like the directories, or not, like the system views.
I think we should break the tie by a native English speaker's ear or typing
habits. I'm not a native speaker.

> It see that we have "Shared Buffer Lookup Table" in
> pg_shmem_allocations, so where I generated names like "Subtrans
> Mapping Table" I should change that to "Lookup" to match.
> 
> 3.  What recommendations should we make about how to set it?
> 
> I think the answer 

Re: MultiXact\SLRU buffers configuration

2021-03-28 Thread Thomas Munro
On Sat, Mar 27, 2021 at 6:31 PM Andrey Borodin  wrote:
> > 27 марта 2021 г., в 01:26, Thomas Munro  написал(а):
> > , and murmurhash which is inlineable and
> > branch-free.

> I think pageno is a hash already. Why hash any further? And pages accessed 
> together will have smaller access time due to colocation.

Yeah, if clog_buffers is large enough then it's already a "perfect
hash", but if it's not then you might get some weird "harmonic"
effects (not sure if that's the right word), basically higher or lower
collision rate depending on coincidences in the data.  If you apply a
hash, the collisions should be evenly spread out so at least it'll be
somewhat consistent.  Does that make sense?
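
A tiny standalone illustration of that point (the stride-16 access pattern and
the murmur3 finalizer used as the mixing function are just examples):

#include <stdint.h>
#include <stdio.h>

/* murmur3 32-bit finalizer: a cheap, branch-free mixing step */
static uint32_t
mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6b;
	x ^= x >> 13;
	x *= 0xc2b2ae35;
	x ^= x >> 16;
	return x;
}

int
main(void)
{
	int			nslots = 16;

	/* Pages visited with stride 16 all collide in bucket 0 when mapped
	 * directly, while hashing first spreads them across the buckets. */
	for (uint32_t pageno = 0; pageno < 8 * 16; pageno += 16)
		printf("page %3u -> direct bucket %2u, hashed bucket %2u\n",
			   pageno, pageno % nslots, mix32(pageno) % nslots);
	return 0;
}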

(At some point I figured out that the syscaches have lower collision
rates and perform better if you use oids directly instead of hashing
them... but then it's easy to create a pathological pattern of DDL
that turns your hash table into a linked list.  Not sure what to think
about that.)

> >  I had to tweak it to support "in-place" creation and
> > fixed size (in other words, no allocators, for use in shared memory).

> We really need to have a test to know what happens when this structure goes 
> out of memory, as you mentioned below. What would be apropriate place for 
> simplehash tests?

Good questions.  This has to be based on being guaranteed to have
enough space for all of the entries, so the question is really just
"how bad can performance get with different load factors".  FWIW there
were some interesting cases with clustering when simplehash was first
used in the executor (see commits ab9f2c42 and parent) which required
some work on hashing quality to fix.

> > Then I was annoyed that I had to add a "status" member to our struct,
> > so I tried to fix that.

> Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash 
> align it well?

With that "intrusive status" patch, the size is back to 8.  But I
think I made a mistake: I made it steal some key space to indicate
presence, but I think the presence test should really get access to
the whole entry so that you can encode it in more ways.  For example,
with slotno == -1.

Alright, considering the date, if we want to get this into PostgreSQL
14 it's time to make some decisions.

1.  Do we want customisable SLRU sizes in PG14?

+1 from me, we have multiple reports of performance gains from
increasing various different SLRUs, and it's easy to find workloads
that go faster.

One thought: it'd be nice if the user could *see* the current size,
when using the default.  SHOW clog_buffers -> 0 isn't very helpful if
you want to increase it, but don't know what it's currently set to.
Not sure off the top of my head how best to do that.

2.  What names do we want the GUCs to have?  Here's what we have:

Proposed GUC  DirectorySystem views
clog_buffers  pg_xact  Xact
multixact_offsets_buffers pg_multixact/offsets MultiXactOffset
multixact_members_buffers pg_multixact/members MultiXactMember
notify_bufferspg_notifyNotify
serial_bufferspg_serialSerial
subtrans_buffers  pg_subtrans  Subtrans
commit_ts_buffers pg_commit_ts CommitTs

By system views, I mean pg_stat_slru, pg_shmem_allocations and
pg_stat_activity (lock names add "SLRU" on the end).

Observations:

It seems obvious that "clog_buffers" should be renamed to "xact_buffers".
It's not clear whether the multixact GUCs should have the extra "s"
like the directories, or not, like the system views.
It see that we have "Shared Buffer Lookup Table" in
pg_shmem_allocations, so where I generated names like "Subtrans
Mapping Table" I should change that to "Lookup" to match.

3.  What recommendations should we make about how to set it?

I think the answer depends partially on the next questions!  I think
we should probably at least say something short about the pg_stat_slru
view (cache miss rate) and pg_stat_actitity view (waits on locks), and
how to tell if you might need to increase it.  I think this probably
needs a new paragraph, separate from the docs for the individual GUC.

4.  Do we want to ship the dynahash patch?

+0.9.  The slight hesitation is that it's new code written very late
in the cycle, so it may still have bugs or unintended consequences,
and as you said, at small sizes the linear search must be faster than
the hash computation.  Could you help test it, and try to break it?
Can we quantify the scaling effect for some interesting workloads, to
see at what size the dynahash beats the linear search, so that we can
make an informed decision?  Of course, without a hash table, large
sizes will surely work badly, so it'd be tempting to restrict the size
you can set the GUC to.

If we do include the dynahash patch, then I think it would also be
reasonable to change the formula for the default, to make it higher on
large systems.  The restriction to 128 buffers (= 1MB) 

Re: MultiXact\SLRU buffers configuration

2021-03-26 Thread Andrey Borodin



> 27 марта 2021 г., в 01:26, Thomas Munro  написал(а):
> 
> On Sat, Mar 27, 2021 at 4:52 AM Andrey Borodin  wrote:
>> Some thoughts on HashTable patch:
>> 1. Can we allocate bigger hashtable to reduce probability of collisions?
> 
> Yeah, good idea, might require some study.
In the long run we always have this table filled with nslots entries. But the keys
will usually be consecutive numbers (the current working set of CLOG\Multis\etc). So
in a happy hashing scenario collisions will only appear for some random backward
jumps. I think just size = nslots * 2 will produce results that cannot be
improved significantly.
And this reflects the original growth strategy SH_GROW(tb, tb->size * 2).

>> 2. Can we use specialised hashtable for this case? I'm afraid hash_search() 
>> does comparable number of CPU cycles as simple cycle from 0 to 128. We could 
>> inline everything and avoid hashp->hash(keyPtr, hashp->keysize) call. I'm 
>> not insisting on special hash though, just an idea.
> 
> I tried really hard to not fall into this rabbit h [hack hack
> hack], OK, here's a first attempt to use simplehash,

> Andres's
> steampunk macro-based robinhood template
Sounds magnificent.

> that we're already using for
> several other things
I could not find many tests to be sure that we do not break something...

> , and murmurhash which is inlineable and
> branch-free.
I think pageno is a hash already. Why hash any further? And pages accessed 
together will have smaller access time due to colocation.

>  I had to tweak it to support "in-place" creation and
> fixed size (in other words, no allocators, for use in shared memory).
We really need to have a test to know what happens when this structure runs out
of memory, as you mentioned below. What would be an appropriate place for
simplehash tests?

> Then I was annoyed that I had to add a "status" member to our struct,
> so I tried to fix that.
Indeed, sizeof(SlruMappingTableEntry) == 9 seems strange. Will simplehash align 
it well?


Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2021-03-26 Thread Thomas Munro
On Sat, Mar 27, 2021 at 4:52 AM Andrey Borodin  wrote:
> Some thoughts on HashTable patch:
> 1. Can we allocate bigger hashtable to reduce probability of collisions?

Yeah, good idea, might require some study.

> 2. Can we use specialised hashtable for this case? I'm afraid hash_search() 
> does comparable number of CPU cycles as simple cycle from 0 to 128. We could 
> inline everything and avoid hashp->hash(keyPtr, hashp->keysize) call. I'm not 
> insisting on special hash though, just an idea.

I tried really hard to not fall into this rabbit h [hack hack
hack], OK, here's a first attempt to use simplehash, Andres's
steampunk macro-based robinhood template that we're already using for
several other things, and murmurhash which is inlineable and
branch-free.  I had to tweak it to support "in-place" creation and
fixed size (in other words, no allocators, for use in shared memory).
Then I was annoyed that I had to add a "status" member to our struct,
so I tried to fix that.  Definitely needs more work to think about
failure modes when running out of memory, how much spare space you
need, etc.

I have not experimented with this much beyond hacking until the tests
pass, but it *should* be more efficient...

> 3. pageno in SlruMappingTableEntry seems to be unused.

It's the key (dynahash uses the first N bytes of your struct as the
key, but in this new simplehash version it's more explicit).
From 5f5d4ed8ae2808766ac1fd48f68602ef530e3833 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v13 1/5] Make all SLRU buffer sizes configurable.

Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.

Author: Andrey M. Borodin 
Reviewed-by: Anastasia Lubennikova 
Reviewed-by: Tomas Vondra 
Reviewed-by: Alexander Korotkov 
Reviewed-by: Gilles Darold 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru
---
 doc/src/sgml/config.sgml  | 137 ++
 src/backend/access/transam/clog.c |   6 +
 src/backend/access/transam/commit_ts.c|   5 +-
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 +
 src/backend/utils/misc/guc.c  |  77 ++
 src/backend/utils/misc/postgresql.conf.sample |   9 ++
 src/include/access/multixact.h|   4 -
 src/include/access/subtrans.h |   3 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   7 +
 src/include/storage/predicate.h   |   4 -
 15 files changed, 261 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..f1112bfa9c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1886,6 +1886,143 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_buffers (integer)
+  
+   multixact_offsets_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to used to cache the contents
+of pg_multixact/offsets (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 8.
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_buffers (integer)
+  
+   multixact_members_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to used to cache the contents
+of pg_multixact/members (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 16.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  subtrans_buffers (integer)
+  
+   subtrans_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to used to cache the contents
+of pg_subtrans (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 8.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  notify_buffers (integer)
+  
+   notify_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to used to cache the contents
+of pg_notify (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+ 

Re: MultiXact\SLRU buffers configuration

2021-03-26 Thread Andrey Borodin



> 26 марта 2021 г., в 11:00, Andrey Borodin  написал(а):
> 
>> I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.
> I think the idea of speeding up linear search is really really good for 
> scaling SLRUs. It's not even about improving normal performance of the 
> cluster, but it's important from preventing pathological degradation under 
> certain circumstances. Bigger cache really saves SLAs :) I'll look into the 
> patch more closely this weekend. Thank you!


Some thoughts on the HashTable patch:
1. Can we allocate a bigger hashtable to reduce the probability of collisions?
2. Can we use a specialised hashtable for this case? I'm afraid hash_search()
spends a comparable number of CPU cycles to a simple loop from 0 to 128. We could
inline everything and avoid the hashp->hash(keyPtr, hashp->keysize) call. I'm not
insisting on a special hash though, just an idea.
3. pageno in SlruMappingTableEntry seems to be unused.

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2021-03-26 Thread Andrey Borodin
Hi Thomas, Gilles, all!

Thanks for reviewing this!

> 25 марта 2021 г., в 02:31, Thomas Munro  написал(а):
> 
> I don't think we should put "slru" (the name of the buffer replacement
> algorithm, implementation detail) in the GUC names.
+1


> +It defaults to 0, in this case CLOG size is taken as
> shared_buffers / 512.
> 
> We already know that increasing the number of CLOG buffers above the
> current number hurts as the linear search begins to dominate
Uh, my intent was to copy the original approach to the CLOG SLRU size; I just missed
that Min(,) thing in the shared_buffers logic.


> 26 марта 2021 г., в 08:46, Thomas Munro  написал(а):
> 
> Hi Andrey, all,
> 
> I propose some changes, and I'm attaching a new version:
> 
> I renamed the GUCs as clog_buffers etc (no "_slru_").  I fixed some
> copy/paste mistakes where the different GUCs were mixed up.  I made
> some changes to the .conf.sample.  I rewrote the documentation so that
> it states the correct unit and defaults, and refers to the
> subdirectories that are cached by these buffers instead of trying to
> give a new definition of each of the SLRUs.
> 
> Do you like those changes?
Yes!

> Some things I thought about but didn't change:
> 
> I'm not entirely sure if we should use the internal and historical
> names well known to hackers (CLOG), or the visible directory names (I
> mean, we could use pg_xact_buffers instead of clog_buffers).
While it is a good idea to make notes about the directory name, I think the real
naming criterion is to help find this GUC when a user encounters wait events in
pg_stat_activity. I think there are no mentions of CLOG in the docs [0], only
XactBuffer, XactSLRU etc.

> I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.
I think the idea of speeding up the linear search is really, really good for scaling
SLRUs. It's not even about improving normal performance of the cluster;
it's important for preventing pathological degradation under certain
circumstances. A bigger cache really saves SLAs :) I'll look into the patch more
closely this weekend. Thank you!

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW



Re: MultiXact\SLRU buffers configuration

2021-03-25 Thread Thomas Munro
Hi Andrey, all,

I propose some changes, and I'm attaching a new version:

I renamed the GUCs as clog_buffers etc (no "_slru_").  I fixed some
copy/paste mistakes where the different GUCs were mixed up.  I made
some changes to the .conf.sample.  I rewrote the documentation so that
it states the correct unit and defaults, and refers to the
subdirectories that are cached by these buffers instead of trying to
give a new definition of each of the SLRUs.

Do you like those changes?

Some things I thought about but didn't change:

I'm not entirely sure if we should use the internal and historical
names well known to hackers (CLOG), or the visible directory names (I
mean, we could use pg_xact_buffers instead of clog_buffers).  I am not
sure why these GUCs need to be PGDLLIMPORT, but I see that NBuffers is
like that.

I wanted to do some very simple smoke testing of CLOG sizes on my
local development machine:

  pgbench -i -s1000 postgres
  pgbench -t400 -c8 -j8 -Mprepared postgres

I disabled autovacuum after running that just to be sure it wouldn't
interfere with my experiment:

  alter table pgbench_accounts set (autovacuum_enabled = off);

Then I shut the cluster down and made a copy, so I could do some
repeated experiments from the same initial conditions each time.  At
this point I had 30 files -001E under pg_xact, holding 256kB = ~1
million transactions each.  It'd take ~960 buffers to cache it all.
So how long does VACUUM FREEZE pgbench_accounts take?

I tested with just the 0001 patch, and also with the 0002 patch
(improved version, attached):

clog_buffers=128:  0001=2:28.499, 0002=2:17:891
clog_buffers=1024: 0001=1:38.485, 0002=1:29.701

I'm sure the speedup of the 0002 patch can be amplified by increasing
the number of transactions referenced in the table OR number of
clog_buffers, considering that the linear search produces
O(transactions * clog_buffers) work.  That was 32M transactions and
8MB of CLOG, but I bet if you double both of those numbers once or
twice things start to get hot.  I don't see why you shouldn't be able
to opt to cache literally all of CLOG if you want (something like 50MB
assuming default autovacuum_freeze_max_age, scale to taste, up to
512MB for the theoretical maximum useful value).

I'm not saying the 0002 patch is bug-free yet though, it's a bit finickity.
From e7a8b9a0e27de80bab13f004b4d7e3fa1da677c4 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v12 1/2] Make all SLRU buffer sizes configurable.

Provide new GUCs to set the number of buffers, instead of using hard
coded defaults.

Author: Andrey M. Borodin 
Reviewed-by: Anastasia Lubennikova 
Reviewed-by: Tomas Vondra 
Reviewed-by: Alexander Korotkov 
Reviewed-by: Gilles Darold 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86%40yandex-team.ru
---
 doc/src/sgml/config.sgml  | 137 ++
 src/backend/access/transam/clog.c |   6 +
 src/backend/access/transam/commit_ts.c|   5 +-
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 +
 src/backend/utils/misc/guc.c  |  77 ++
 src/backend/utils/misc/postgresql.conf.sample |   9 ++
 src/include/access/multixact.h|   4 -
 src/include/access/subtrans.h |   3 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   7 +
 src/include/storage/predicate.h   |   4 -
 15 files changed, 261 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..f1112bfa9c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1886,6 +1886,143 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_buffers (integer)
+  
+   multixact_offsets_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to used to cache the contents
+of pg_multixact/offsets (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 8.
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_buffers (integer)
+  
+   multixact_members_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of shared memory to used to cache the contents
+of pg_multixact/members (see
+).
+If this value is specified without units, it is taken as blocks,
+that is BLCKSZ bytes, typically 8kB.
+The default value is 16.
+This 

Re: MultiXact\SLRU buffers configuration

2021-03-24 Thread Thomas Munro
On Thu, Mar 25, 2021 at 10:31 AM Thomas Munro  wrote:
> We already know that increasing the number of CLOG buffers above the
> current number hurts as the linear search begins to dominate
> (according to the commit message for 5364b357), and it doesn't seem
> great to ship a new feature that melts your CPU when you turn it up.
> Perhaps, to ship this, we need to introduce a buffer mapping table?  I
> have attached a "one coffee" attempt at that, on top of your v10 patch
> (unmodified), for discussion.  It survives basic testing but I don't
> know how it performs.

Hrrr... Cfbot showed an assertion failure.  Here's the two coffee
version with a couple of silly mistakes fixed.
From 4817d16cfb6704d43a7bef12648e753d239c809c Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v11 1/2] Make all SLRU buffer sizes configurable

---
 doc/src/sgml/config.sgml  | 108 ++
 src/backend/access/transam/clog.c |   6 +
 src/backend/access/transam/commit_ts.c|   5 +-
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 ++
 src/backend/utils/misc/guc.c  |  77 +
 src/backend/utils/misc/postgresql.conf.sample |  16 +++
 src/include/access/multixact.h|   4 -
 src/include/access/subtrans.h |   3 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   8 ++
 src/include/storage/predicate.h   |   4 -
 15 files changed, 240 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..0adcf0efaf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1886,6 +1886,114 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_slru_buffers (integer)
+  
+   multixact_offsets_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact offsets. MultiXact offsets
+are used to store information about offsets of multiple row lockers (caused by SELECT FOR UPDATE and others).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_slru_buffers (integer)
+  
+   multixact_members_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact members. MultiXact members
+are used to store information about XIDs of multiple row lockers. Typically multixact_members_slru_buffers
+is twice more than multixact_offsets_slru_buffers.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  subtrans_slru_buffers (integer)
+  
+   subtrans_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for subtransactions.
+It defaults to 256 kilobytes (256KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  notify_slru_buffers (integer)
+  
+   notify_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for asyncronous notifications (NOTIFY, LISTEN).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  serial_slru_buffers (integer)
+  
+   serial_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for predicate locks.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  clog_slru_buffers (integer)
+  
+   clog_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for CLOG.
+It defaults to 0, in this case CLOG size is taken as shared_buffers / 512.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  commit_ts_slru_buffers (integer)
+  
+   commit_ts_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for commit timestamps.
+It defaults to 0, in this case CLOG size is taken as shared_buffers / 512.
+This parameter can only be set at server start.
+   
+  
+ 
 
  
   max_stack_depth (integer)
diff --git a/src/backend/access/transam/clog.c 

Re: MultiXact\SLRU buffers configuration

2021-03-24 Thread Thomas Munro
Hi Andrey,

On Sat, Mar 13, 2021 at 1:44 AM Andrey Borodin  wrote:
> [v10]

+intmultixact_offsets_slru_buffers = 8;
+intmultixact_members_slru_buffers = 16;
+intsubtrans_slru_buffers = 32;
+intnotify_slru_buffers = 8;
+intserial_slru_buffers = 16;
+intclog_slru_buffers = 0;
+intcommit_ts_slru_buffers = 0;

I don't think we should put "slru" (the name of the buffer replacement
algorithm, implementation detail) in the GUC names.

+It defaults to 0, in this case CLOG size is taken as
shared_buffers / 512.

We already know that increasing the number of CLOG buffers above the
current number hurts as the linear search begins to dominate
(according to the commit message for 5364b357), and it doesn't seem
great to ship a new feature that melts your CPU when you turn it up.
Perhaps, to ship this, we need to introduce a buffer mapping table?  I
have attached a "one coffee" attempt at that, on top of your v10 patch
(unmodified), for discussion.  It survives basic testing but I don't
know how it performs.
From 4817d16cfb6704d43a7bef12648e753d239c809c Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v10 1/2] Make all SLRU buffer sizes configurable

---
 doc/src/sgml/config.sgml  | 108 ++
 src/backend/access/transam/clog.c |   6 +
 src/backend/access/transam/commit_ts.c|   5 +-
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 ++
 src/backend/utils/misc/guc.c  |  77 +
 src/backend/utils/misc/postgresql.conf.sample |  16 +++
 src/include/access/multixact.h|   4 -
 src/include/access/subtrans.h |   3 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   8 ++
 src/include/storage/predicate.h   |   4 -
 15 files changed, 240 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..0adcf0efaf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1886,6 +1886,114 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_slru_buffers (integer)
+  
+   multixact_offsets_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact offsets. MultiXact offsets
+are used to store information about offsets of multiple row lockers (caused by SELECT FOR UPDATE and others).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_slru_buffers (integer)
+  
+   multixact_members_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact members. MultiXact members
+are used to store information about XIDs of multiple row lockers. Typically multixact_members_slru_buffers
+is twice more than multixact_offsets_slru_buffers.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  subtrans_slru_buffers (integer)
+  
+   subtrans_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for subtransactions.
+It defaults to 256 kilobytes (256KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  notify_slru_buffers (integer)
+  
+   notify_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for asyncronous notifications (NOTIFY, LISTEN).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  serial_slru_buffers (integer)
+  
+   serial_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for predicate locks.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  clog_slru_buffers (integer)
+  
+   clog_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for CLOG.
+It defaults to 0, in this case CLOG size is taken as shared_buffers / 512.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  commit_ts_slru_buffers (integer)
+  
+

Re: MultiXact\SLRU buffers configuration

2021-03-15 Thread Gilles Darold
Le 12/03/2021 à 13:44, Andrey Borodin a écrit :
>
>> 11 марта 2021 г., в 20:50, Gilles Darold  написал(а):
>>
>>
>> The patch doesn't apply anymore in master cause of error: patch failed: 
>> src/backend/utils/init/globals.c:150
>>
>>
>>
>> An other remark about this patch is that it should be mentionned in the 
>> documentation (doc/src/sgml/config.sgml) that the new configuration 
>> variables need a server restart, for example by adding "This parameter can 
>> only be set at server start." like for shared_buffers. Patch on 
>> postgresql.conf mention it.
>>
>> And some typo to be fixed:
>>
>>
>>
>> s/Tipically/Typically/
>>
>> s/asincronous/asyncronous/
>>
>> s/confugured/configured/
>>
>> s/substrnsactions/substransactions/
>>
>>
> Thanks, Gilles! Fixed.
>
> Best regards, Andrey Borodin.
>

Hi Andrey,

I found two problems in this patch. First, in src/include/miscadmin.h,
multixact_members_slru_buffers is declared twice:


 extern PGDLLIMPORT int max_parallel_workers;
+extern PGDLLIMPORT int multixact_offsets_slru_buffers;
+extern PGDLLIMPORT int multixact_members_slru_buffers;
+extern PGDLLIMPORT int multixact_members_slru_buffers;  <-
+extern PGDLLIMPORT int subtrans_slru_buffers;


In the file src/backend/access/transam/multixact.c the second variable
should be multixact_members_slru_buffers and not
multixact_offsets_slru_buffers.


@@ -1848,13 +1848,13 @@ MultiXactShmemInit(void)
    MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

    SimpleLruInit(MultiXactOffsetCtl,
- "MultiXactOffset",
NUM_MULTIXACTOFFSET_BUFFERS, 0,
+ "MultiXactOffset",
multixact_offsets_slru_buffers, 0,
  MultiXactOffsetSLRULock,
"pg_multixact/offsets",
  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
  SYNC_HANDLER_MULTIXACT_OFFSET);
    SlruPagePrecedesUnitTests(MultiXactOffsetCtl,
MULTIXACT_OFFSETS_PER_PAGE);
    SimpleLruInit(MultiXactMemberCtl,
- "MultiXactMember",
NUM_MULTIXACTMEMBER_BUFFERS, 0,
+ "MultiXactMember",
multixact_offsets_slru_buffers, 0,    <--
  MultiXactMemberSLRULock,
"pg_multixact/members",
  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
  SYNC_HANDLER_MULTIXACT_MEMBER);



Please fix them so that I can finish the review.


-- 
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/





Re: MultiXact\SLRU buffers configuration

2021-03-12 Thread Andrey Borodin


> On 11 Mar 2021, at 20:50, Gilles Darold wrote:
> 
> 
> The patch doesn't apply anymore on master because of an error: patch failed: 
> src/backend/utils/init/globals.c:150
> 
> 
> 
> Another remark about this patch is that it should be mentioned in the 
> documentation (doc/src/sgml/config.sgml) that the new configuration variables 
> need a server restart, for example by adding "This parameter can only be set 
> at server start." like for shared_buffers. The patch on postgresql.conf 
> mentions it.
> 
> And some typos to be fixed:
> 
> 
> 
> s/Tipically/Typically/
> 
> s/asincronous/asyncronous/
> 
> s/confugured/configured/
> 
> s/substrnsactions/substransactions/
> 
> 

Thanks, Gilles! Fixed.

Best regards, Andrey Borodin.


v10-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data




Re: MultiXact\SLRU buffers configuration

2021-03-11 Thread Gilles Darold

On 15/02/2021 at 18:17, Andrey Borodin wrote:



On 23 Dec 2020, at 21:31, Gilles Darold wrote:

Sorry for the response delay, we have run several other tests trying to figure 
out the performance gain per patch, but unfortunately we have very erratic 
results. With the same parameters and patches the test doesn't return the same 
results depending on the day or the hour of the day. This is very frustrating and 
I suppose that this is related to the Azure architecture. The only thing I am 
sure of is that we had the best performance results with all patches and

multixact_offsets_slru_buffers = 256
multixact_members_slru_buffers = 512
multixact_local_cache_entries = 4096


but I cannot say whether all or only part of the patches are improving performance. 
My feeling is that the performance gain related to patches 1 (shared lock) and 3 
(conditional variable) contributes little compared to the gain from 
just tuning the multixact buffers. This is when the multixact 
contention is observed, but perhaps they only delay the contention. It's all 
the more frustrating that we had a test case to reproduce the contention but 
apparently not the architecture.

Hi! Thanks for the input.
I think we have a consensus here that configuring SLRU size is beneficial for 
MultiXacts.
There is a proposal in a nearby thread [0] on changing the default value of the commit_ts 
SLRU buffers.
In my experience, from time to time there can be problems with subtransactions 
that are cured by extending the subtrans SLRU.

Let's make all SLRUs configurable?
PFA patch with draft of these changes.

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql



The patch doesn't apply anymore on master because of an error: patch failed: 
src/backend/utils/init/globals.c:150



Another remark about this patch is that it should be mentioned in the 
documentation (doc/src/sgml/config.sgml) that the new configuration 
variables need a server restart, for example by adding "This parameter 
can only be set at server start." like for shared_buffers. The patch on 
postgresql.conf mentions it.


And some typos to be fixed:


   s/Tipically/Typically/

   s/asincronous/asyncronous/

   s/confugured/configured/

   s/substrnsactions/substransactions/


--
Gilles Darold
LzLabs GmbH



Re: MultiXact\SLRU buffers configuration

2021-02-15 Thread Andrey Borodin


> On 23 Dec 2020, at 21:31, Gilles Darold wrote:
> 
> Sorry for the response delay, we have run several other tests trying to 
> figure out the performance gain per patch, but unfortunately we have very 
> erratic results. With the same parameters and patches the test doesn't 
> return the same results depending on the day or the hour of the day. This is 
> very frustrating and I suppose that this is related to the Azure 
> architecture. The only thing I am sure of is that we had the best 
> performance results with all patches and
> 
> multixact_offsets_slru_buffers = 256
> multixact_members_slru_buffers = 512
> multixact_local_cache_entries = 4096
> 
> 
> but I cannot say whether all or only part of the patches are improving 
> performance. My feeling is that the performance gain related to patches 1 
> (shared lock) and 3 (conditional variable) contributes little compared to the 
> gain from just tuning the multixact buffers. This is when 
> the multixact contention is observed, but perhaps they only delay the 
> contention. It's all the more frustrating that we had a test case to 
> reproduce the contention but apparently not the architecture.

Hi! Thanks for the input.
I think we have a consensus here that configuring SLRU size is beneficial for 
MultiXacts.
There is a proposal in a nearby thread [0] on changing the default value of the commit_ts 
SLRU buffers.
In my experience, from time to time there can be problems with subtransactions 
that are cured by extending the subtrans SLRU.

Let's make all SLRUs configurable?
PFA patch with draft of these changes.

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql



v9-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-12-23 Thread Gilles Darold

On 13/12/2020 at 18:24, Andrey Borodin wrote:



On 13 Dec 2020, at 14:17, Gilles Darold wrote:

I've done more review on these patches.

Thanks, Gilles! I'll incorporate all your fixes to patchset.
Can you also benchmark conditional variable sleep? The patch "Add conditional 
variable to wait for next MultXact offset in corner case"?
The problem manifests on Standby when Primary is heavily loaded with 
MultiXactOffsetControlLock.

Thanks!

Best regards, Andrey Borodin.


Hi Andrey,


Sorry for the response delay, we have run several other tests trying to 
figure out the performance gain per patch, but unfortunately we have 
very erratic results. With the same parameters and patches the test 
doesn't return the same results depending on the day or the hour of the 
day. This is very frustrating and I suppose that this is related to the 
Azure architecture. The only thing I am sure of is that we had the 
best performance results with all patches and


   multixact_offsets_slru_buffers = 256
   multixact_members_slru_buffers = 512
   multixact_local_cache_entries = 4096

but I cannot say whether all or only part of the patches are improving 
performance. My feeling is that the performance gain related to patches 1 
(shared lock) and 3 (conditional variable) contributes little compared to 
the gain from just tuning the multixact buffers. 
This is when the multixact contention is observed, but perhaps they only 
delay the contention. It's all the more frustrating that we had a 
test case to reproduce the contention but apparently not the architecture.



Can't do much more at this point.


Best regards,

--
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/



Re: MultiXact\SLRU buffers configuration

2020-12-13 Thread Andrey Borodin


> On 13 Dec 2020, at 22:24, Andrey Borodin wrote:
> 
> 
> 
>> On 13 Dec 2020, at 14:17, Gilles Darold wrote:
>> 
>> I've done more review on these patches.
> 
> Thanks, Gilles! I'll incorporate all your fixes to patchset.

PFA patches.
Also, I've noted that the patch "Add conditional variable to wait for next MultXact 
offset in corner case" removes a CHECK_FOR_INTERRUPTS(); I'm not sure that's 
correct.

Thanks!

Best regards, Andrey Borodin.


v8-0004-Add-GUCs-to-tune-MultiXact-SLRUs.patch
Description: Binary data


v8-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch
Description: Binary data


v8-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v8-0001-Use-shared-lock-in-GetMultiXactIdMembers-for.patch
Description: Binary data





Re: MultiXact\SLRU buffers configuration

2020-12-13 Thread Andrey Borodin



> On 13 Dec 2020, at 14:17, Gilles Darold wrote:
> 
> I've done more review on these patches.

Thanks, Gilles! I'll incorporate all your fixes to patchset.
Can you also benchmark conditional variable sleep? The patch "Add conditional 
variable to wait for next MultXact offset in corner case"?
The problem manifests on Standby when Primary is heavily loaded with 
MultiXactOffsetControlLock.

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-12-13 Thread Gilles Darold
On 11/12/2020 at 18:50, Gilles Darold wrote:
> On 10/12/2020 at 15:45, Gilles Darold wrote:
>> On 08/12/2020 at 18:52, Andrey Borodin wrote:
>>> Hi Gilles!
>>>
>>> Many thanks for your message!
>>>
 On 8 Dec 2020, at 21:05, Gilles Darold wrote:

 I know that this report is not really helpful 
>>> Quite the contrary - these benchmarks prove that a controllable reproduction 
>>> exists. I've rebased patches for PG11. Can you please benchmark them 
>>> (without extending SLRU)?
>>>
>>> Best regards, Andrey Borodin.
>>>
>> Hi,
>>
>>
>> Running tests yesterday with the patches has reported a lot of failures
>> with errors on INSERT and UPDATE statements:
>>
>>
>> ERROR:  lock MultiXactOffsetControlLock is not held
>>
>>
>> After a patch review this morning I think I have found what's going
>> wrong. In patch
>> v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I
>> think there is a missing reinitialisation of the lockmode variable to
>> LW_NONE inside the retry loop after the call to LWLockRelease() in
>> src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers().
>> I've attached a new version of the patch for master that includes the
>> fix I'm using now with PG11 and with which everything works very well
>> now.
>>
>>
>> I'm running more tests to see the impact on the performances to play
>> with multixact_offsets_slru_buffers, multixact_members_slru_buffers
>> and multixact_local_cache_entries. I will reports the results later
>> today.
>>
>
> Hi,
>
> Sorry for the delay, I have done some further tests to try to reach
> the limit without bottlenecks on multixact or shared buffers. The
> tests were done on a Microsoft Azure machine with 2TB of RAM and 4
> sockets Intel Xeon Platinum 8280M (128 CPUs). PG configuration:
>
>     max_connections = 4096
>     shared_buffers = 64GB
>     max_prepared_transactions = 2048
>     work_mem = 256MB
>     maintenance_work_mem = 2GB
>     wal_level = minimal
>     synchronous_commit = off
>     commit_delay = 1000
>     commit_siblings = 10
>     checkpoint_timeout = 1h
>     max_wal_size = 32GB
>     checkpoint_completion_target = 0.9
>
> I have tested with several values for the different buffer variables
> starting from:
>
>     multixact_offsets_slru_buffers = 64
>     multixact_members_slru_buffers = 128
>     multixact_local_cache_entries = 256
>
> to the values with the best performance we achieved with this test to
> avoid MultiXactOffsetControlLock or MultiXactMemberControlLock:
>
>     multixact_offsets_slru_buffers = 128
>     multixact_members_slru_buffers = 512
>     multixact_local_cache_entries = 1024
>
> Also shared_buffers has been increased up to 256GB to avoid
> buffer_mapping contention.
>
> Our last best test reports the following wait events:
>
>      event_type |   event    |    sum
>     ++---
>      Client | ClientRead | 321690211
>      LWLock | buffer_content |   2970016
>      IPC    | ProcArrayGroupUpdate   |   2317388
>      LWLock | ProcArrayLock  |   1445828
>      LWLock | WALWriteLock   |   1187606
>      LWLock | SubtransControlLock    |    972889
>      Lock   | transactionid  |    840560
>      Lock   | relation   |    587600
>      Activity   | LogicalLauncherMain    |    529599
>      Activity   | AutoVacuumMain |    528097
>
> At this stage I don't think we can get better performance by tuning
> these buffers, at least with PG11.
>
> About the performance gain related to the patch for shared lock in
> GetMultiXactIdMembers, unfortunately I cannot see a difference with or
> without this patch; it could be related to our particular benchmark.
> But clearly the patch on multixact buffers should be committed, as this
> is really helpful to be able to tune PG when multixact bottlenecks
> are found.


I've done more review on these patches.


1) As reported in my previous message, patch 0001 looks useless as it
doesn't provide a measurable performance gain.


2) In patch 0004 there are two typos: s/informaion/information/ will fix them.


3) The GUCs are missing in the postgresql.conf.sample file; see the patch in
the attachment for a proposal.


Best regards,

-- 
Gilles Darold
LzLabs GmbH
https://www.lzlabs.com/

diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b7fb2ec1fe..1fb33809ee 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -189,6 +189,14 @@
 	# (change requires restart)
 #backend_flush_after = 0		# measured in pages, 0 disables
 
+# - MultiXact Buffers -
+
+#multixact_offsets_slru_buffers = 8	# memory used for MultiXact offsets
+	# (change requires restart)
+#multixact_members_slru_buffers = 16	# memory used for MultiXact members
+	# (change requires restart)

Re: MultiXact\SLRU buffers configuration

2020-12-11 Thread Gilles Darold

On 10/12/2020 at 15:45, Gilles Darold wrote:

On 08/12/2020 at 18:52, Andrey Borodin wrote:

Hi Gilles!

Many thanks for your message!


On 8 Dec 2020, at 21:05, Gilles Darold wrote:

I know that this report is not really helpful

Quite the contrary - these benchmarks prove that a controllable reproduction exists. 
I've rebased patches for PG11. Can you please benchmark them (without extending 
SLRU)?

Best regards, Andrey Borodin.


Hi,


Running tests yesterday with the patches has reported a lot of failures 
with errors on INSERT and UPDATE statements:



ERROR:  lock MultiXactOffsetControlLock is not held


After a patch review this morning I think I have found what's going 
wrong. In patch 
v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I 
think there is a missing reinitialisation of the lockmode variable to 
LW_NONE inside the retry loop after the call to LWLockRelease() in 
src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). 
I've attached a new version of the patch for master that includes the 
fix I'm using now with PG11 and with which everything works very well now.



I'm running more tests to see the impact on the performances to play 
with multixact_offsets_slru_buffers, multixact_members_slru_buffers 
and multixact_local_cache_entries. I will reports the results later today.




Hi,

Sorry for the delay, I have done some further tests to try to reach the 
limit without bottlenecks on multixact or shared buffers. The tests were 
done on a Microsoft Azure machine with 2TB of RAM and 4 sockets Intel 
Xeon Platinum 8280M (128 CPUs). PG configuration:


    max_connections = 4096
    shared_buffers = 64GB
    max_prepared_transactions = 2048
    work_mem = 256MB
    maintenance_work_mem = 2GB
    wal_level = minimal
    synchronous_commit = off
    commit_delay = 1000
    commit_siblings = 10
    checkpoint_timeout = 1h
    max_wal_size = 32GB
    checkpoint_completion_target = 0.9

I have tested with several values for the different buffer variables 
starting from:


    multixact_offsets_slru_buffers = 64
    multixact_members_slru_buffers = 128
    multixact_local_cache_entries = 256

to the values with the best performance we achieved with this test to 
avoid MultiXactOffsetControlLock or MultiXactMemberControlLock:


    multixact_offsets_slru_buffers = 128
    multixact_members_slru_buffers = 512
    multixact_local_cache_entries = 1024

Also shared_buffers has been increased up to 256GB to avoid 
buffer_mapping contention.


Our last best test reports the following wait events:

     event_type |   event    |    sum
    ++---
     Client | ClientRead | 321690211
     LWLock | buffer_content |   2970016
     IPC    | ProcArrayGroupUpdate   |   2317388
     LWLock | ProcArrayLock  |   1445828
     LWLock | WALWriteLock   |   1187606
     LWLock | SubtransControlLock    |    972889
     Lock   | transactionid  |    840560
     Lock   | relation   |    587600
     Activity   | LogicalLauncherMain    |    529599
     Activity   | AutoVacuumMain |    528097

At this stage I don't think we can get better performance by tuning 
these buffers, at least with PG11.


About the performance gain related to the patch for shared lock in 
GetMultiXactIdMembers, unfortunately I cannot see a difference with or 
without this patch; it could be related to our particular benchmark. But 
clearly the patch on multixact buffers should be committed, as this is 
really helpful to be able to tune PG when multixact bottlenecks are found.



Best regards,

--
Gilles Darold
LzLabs GmbH
https://www.lzlabs.com/




Re: MultiXact\SLRU buffers configuration

2020-12-10 Thread Gilles Darold

On 08/12/2020 at 18:52, Andrey Borodin wrote:

Hi Gilles!

Many thanks for your message!


On 8 Dec 2020, at 21:05, Gilles Darold wrote:

I know that this report is not really helpful

Quite the contrary - these benchmarks prove that a controllable reproduction exists. 
I've rebased patches for PG11. Can you please benchmark them (without extending 
SLRU)?

Best regards, Andrey Borodin.


Hi,


Running tests yesterday with the patches has reported a lot of failures 
with errors on INSERT and UPDATE statements:



   ERROR:  lock MultiXactOffsetControlLock is not held


After a patch review this morning I think I have found what's going 
wrong. In patch 
v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch I think 
there is a missing reinitialisation of the lockmode variable to LW_NONE 
inside the retry loop after the call to LWLockRelease() in 
src/backend/access/transam/multixact.c:1392:GetMultiXactIdMembers(). 
I've attached a new version of the patch for master that includes the fix 
I'm using now with PG11 and with which everything works very well now.



I'm running more tests to see the impact on the performances to play 
with multixact_offsets_slru_buffers, multixact_members_slru_buffers and 
multixact_local_cache_entries. I will reports the results later today.


--
Gilles Darold
http://www.darold.net/

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 034349aa7b..4c372065de 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -640,10 +640,11 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 	int			lsnindex;
 	char	   *byteptr;
 	XidStatus	status;
+	LWLockMode	lockmode = LW_NONE;
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 
-	slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+	slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, &lockmode);
 	byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
 
 	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
@@ -651,6 +652,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 	lsnindex = GetLSNIndex(slotno, xid);
 	*lsn = XactCtl->shared->group_lsn[lsnindex];
 
+	Assert(lockmode != LW_NONE);
 	LWLockRelease(XactSLRULock);
 
 	return status;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 2fe551f17e..2699de033d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -288,6 +288,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	CommitTimestampEntry entry;
 	TransactionId oldestCommitTsXid;
 	TransactionId newestCommitTsXid;
+	LWLockMode	lockmode = LW_NONE;
 
 	if (!TransactionIdIsValid(xid))
 		ereport(ERROR,
@@ -342,7 +343,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	}
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
-	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, &lockmode);
 	memcpy(&entry,
 		   CommitTsCtl->shared->page_buffer[slotno] +
 		   SizeOfCommitTimestampEntry * entryno,
@@ -352,6 +353,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	if (nodeid)
 		*nodeid = entry.nodeid;
 
+	Assert(lockmode != LW_NONE);
 	LWLockRelease(CommitTsSLRULock);
 	return *ts != 0;
 }
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index eb8de7cf32..56bdd04364 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1237,6 +1237,7 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactId tmpMXact;
 	MultiXactOffset nextOffset;
 	MultiXactMember *ptr;
+	LWLockMode	lockmode = LW_NONE;
 
 	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
 
@@ -1340,12 +1341,13 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * time on every multixact creation.
 	 */
 retry:
-	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+		multi, &lockmode);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
@@ -1377,7 +1379,8 @@ retry:
 		entryno = MultiXactIdToOffsetEntry(tmpMXact);
 
 		if (pageno != prev_pageno)
-			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+			slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
+			tmpMXact, &lockmode);
 
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
@@ -1387,6 +1390,7 @@ retry:
 		{
 			/* Corner case 2: next multixact is still being filled in */
 			LWLockRelease(MultiXactOffsetSLRULock);
+			lockmode = LW_NONE;

Re: MultiXact\SLRU buffers configuration

2020-12-09 Thread Gilles Darold

On 09/12/2020 at 11:51, Gilles Darold wrote:

Also PG regression tests are failing on several parts.


Forget this, I had not run the regression tests in the right repository:

...

===
 All 189 tests passed.
===


I'm looking into why the application is failing.


--
Gilles Darold
http://www.darold.net/





Re: MultiXact\SLRU buffers configuration

2020-12-09 Thread Gilles Darold

Hi Andrey,

Thanks for the backport. I have an issue with the first patch "Use shared 
lock in GetMultiXactIdMembers for offsets and members" 
(v1106-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-o.patch): the 
applications are not working anymore when I apply it. Also PG 
regression tests are failing on several parts.



   test insert_conflict  ... ok
   test create_function_1    ... FAILED
   test create_type  ... FAILED
   test create_table ... FAILED
   test create_function_2    ... FAILED
   test copy ... FAILED
   test copyselect   ... ok
   test copydml  ... ok
   test create_misc  ... FAILED
   test create_operator  ... FAILED
   test create_procedure ... ok
   test create_index ... FAILED
   test index_including  ... ok
   test create_view  ... FAILED
   test create_aggregate ... ok
   test create_function_3    ... ok
   test create_cast  ... ok
   test constraints  ... FAILED
   test triggers ... FAILED
   test inherit  ...
   ^C



This is also where my last attempt to backport to PG11 stopped; I will try 
to fix it again, but it could take time to get it working.


Best regards,

--
Gilles Darold


On 08/12/2020 at 18:52, Andrey Borodin wrote:

Hi Gilles!

Many thanks for your message!


> On 8 Dec 2020, at 21:05, Gilles Darold wrote:

I know that this report is not really helpful

Quite the contrary - these benchmarks prove that a controllable reproduction exists. 
I've rebased patches for PG11. Can you please benchmark them (without extending 
SLRU)?

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-12-08 Thread Andrey Borodin
Hi Gilles!

Many thanks for your message!

> On 8 Dec 2020, at 21:05, Gilles Darold wrote:
> 
> I know that this report is not really helpful 

Quite the contrary - these benchmarks prove that a controllable reproduction exists. 
I've rebased patches for PG11. Can you please benchmark them (without extending 
SLRU)?

Best regards, Andrey Borodin.



v1106-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-o.patch
Description: Binary data


v1106-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v1106-0003-Add-conditional-variable-to-wait-for-next-Mult.patch
Description: Binary data


v1106-0004-Add-GUCs-to-tune-MultiXact-SLRUs.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-12-08 Thread Gilles Darold

On 13/11/2020 at 12:49, Andrey Borodin wrote:



On 10 Nov 2020, at 23:07, Tomas Vondra wrote:

On 11/10/20 7:16 AM, Andrey Borodin wrote:


but this picture was not stable.


Seems we haven't made much progress in reproducing the issue :-( I guess
we'll need to know more about the machine where this happens. Is there
anything special about the hardware/config? Are you monitoring size of
the pg_multixact directory?

It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit

No, unfortunately we do not have signals for SLRU sizes.
3.5Tb mdadm raid10 over 28 SSD drives, 82% full.

First incident triggering investigation was on 2020-04-19, at that time cluster 
was running on PG 10.11. But I think it was happening before.

I'd say nothing special...


How do you collect wait events for aggregation? just insert into some table 
with cron?


No, I have a simple shell script (attached) sampling data from
pg_stat_activity regularly. Then I load it into a table and aggregate to
get a summary.

Thanks!

Best regards, Andrey Borodin.



Hi,


Some time ago I encountered contention on 
MultiXactOffsetControlLock with a performance benchmark. Here are the 
wait event monitoring results with polling every 10 seconds and a 30 
minute run for the benchmark:



 event_type |   event    |   sum
++--
 Client | ClientRead | 44722952
 LWLock | MultiXactOffsetControlLock | 30343060
 LWLock | multixact_offset   | 16735250
 LWLock | MultiXactMemberControlLock |  1601470
 LWLock | buffer_content |   991344
 LWLock | multixact_member   |   805624
 Lock   | transactionid  |   204997
 Activity   | LogicalLauncherMain    |   198834
 Activity   | CheckpointerMain   |   198834
 Activity   | AutoVacuumMain |   198469
 Activity   | BgWriterMain   |   184066
 Activity   | WalWriterMain  |   171571
 LWLock | WALWriteLock   |    72428
 IO | DataFileRead   |    35708
 Activity   | BgWriterHibernate  |    12741
 IO | SLRURead   | 9121
 Lock   | relation   | 8858
 LWLock | ProcArrayLock  | 7309
 LWLock | lock_manager   | 6677
 LWLock | pg_stat_statements | 4194
 LWLock | buffer_mapping | 3222


After reading this thread I changed the values of the buffer sizes to 32 
and 64 and obtained the following results:



 event_type |   event    |    sum
++---
 Client | ClientRead | 268297572
 LWLock | MultiXactMemberControlLock |  65162906
 LWLock | multixact_member   |  33397714
 LWLock | buffer_content |   4737065
 Lock   | transactionid  |   2143750
 LWLock | SubtransControlLock    |   1318230
 LWLock | WALWriteLock   |   1038999
 Activity   | LogicalLauncherMain    |    940598
 Activity   | AutoVacuumMain |    938566
 Activity   | CheckpointerMain   |    799052
 Activity   | WalWriterMain  |    749069
 LWLock | subtrans   |    710163
 Activity   | BgWriterHibernate  |    536763
 Lock   | object |    514225
 Activity   | BgWriterMain   |    394206
 LWLock | lock_manager   |    295616
 IO | DataFileRead   |    274236
 LWLock | ProcArrayLock  | 77099
 Lock   | tuple  | 59043
 IO | CopyFileWrite  | 45611
 Lock   | relation   | 42714

There was still contention on multixact, but less than in the first run. I 
increased the buffers to 128 and 512 and obtained the best results 
for this bench:


 event_type |   event    |    sum
++---
 Client | ClientRead | 160463037
 LWLock | MultiXactMemberControlLock |   5334188
 LWLock | buffer_content |   5228256
 LWLock | buffer_mapping |   2368505
 LWLock | SubtransControlLock    |   2289977
 IPC    | ProcArrayGroupUpdate   |   1560875
 LWLock | ProcArrayLock  |   1437750
 Lock   | transactionid  |    825561
 LWLock | subtrans   |    772701
 LWLock | WALWriteLock   |    666138
 Activity   | LogicalLauncherMain    |    492585
 Activity   | CheckpointerMain   |    492458
 Activity   | AutoVacuumMain |    491548
 LWLock | lock_manager   |    426531
 Lock   | object |    403581
 Activity   | WalWriterMain  |    394668
 

Re: MultiXact\SLRU buffers configuration

2020-11-13 Thread Andrey Borodin



> On 10 Nov 2020, at 23:07, Tomas Vondra wrote:
> 
> On 11/10/20 7:16 AM, Andrey Borodin wrote:
>> 
>> 
>> but this picture was not stable.
>> 
> 
> Seems we haven't made much progress in reproducing the issue :-( I guess
> we'll need to know more about the machine where this happens. Is there
> anything special about the hardware/config? Are you monitoring size of
> the pg_multixact directory?

It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit

No, unfortunately we do not have signals for SLRU sizes.
3.5Tb mdadm raid10 over 28 SSD drives, 82% full.

First incident triggering investigation was on 2020-04-19, at that time cluster 
was running on PG 10.11. But I think it was happening before.

I'd say nothing special...

> 
>> How do you collect wait events for aggregation? just insert into some table 
>> with cron?
>> 
> 
> No, I have a simple shell script (attached) sampling data from
> pg_stat_activity regularly. Then I load it into a table and aggregate to
> get a summary.

Thanks!

Best regards, Andrey Borodin.





Re: MultiXact\SLRU buffers configuration

2020-11-10 Thread Thomas Munro
On Wed, Nov 11, 2020 at 7:07 AM Tomas Vondra
 wrote:
> Seems we haven't made much progress in reproducing the issue :-( I guess
> we'll need to know more about the machine where this happens. Is there
> anything special about the hardware/config? Are you monitoring size of
> the pg_multixact directory?

Which release was the original problem seen on?




Re: MultiXact\SLRU buffers configuration

2020-11-10 Thread Tomas Vondra



On 11/10/20 7:16 AM, Andrey Borodin wrote:
> 
> 
>> On 10 Nov 2020, at 05:13, Tomas Vondra wrote:
>> After the issue reported in [1] got fixed, I've restarted the multi-xact
>> stress test, hoping to reproduce the issue. But so far no luck :-(
> 
> 
> Tomas, many thanks for looking into this. I figured out that to make 
> multixact sets bigger, transactions must hang for a while and lock a large set 
> of tuples, but not a continuous range, to avoid contention on buffer_content.
> I did not manage to implement this via pgbench, which is why I was trying to 
> hack on a separate Go program. But, essentially, no luck either.
> I was observing something similar, though
> 
> Friday, 8 May 2020, 15:08:37 (every 1s)
> 
>   pid  | wait_event | wait_event_type | state  |  
>  query
> ---++-++
>  41344 | ClientRead | Client  | idle   | insert into 
> t1 select generate_series(1,100,1)
>  41375 | MultiXactOffsetControlLock | LWLock  | active | select * 
> from t1 where i = ANY ($1) for share
>  41377 | MultiXactOffsetControlLock | LWLock  | active | select * 
> from t1 where i = ANY ($1) for share
>  41378 || | active | select * 
> from t1 where i = ANY ($1) for share
>  41379 | MultiXactOffsetControlLock | LWLock  | active | select * 
> from t1 where i = ANY ($1) for share
>  41381 || | active | select * 
> from t1 where i = ANY ($1) for share
>  41383 | MultiXactOffsetControlLock | LWLock  | active | select * 
> from t1 where i = ANY ($1) for share
>  41385 | MultiXactOffsetControlLock | LWLock  | active | select * 
> from t1 where i = ANY ($1) for share
> (8 rows)
> 
> but this picture was not stable.
> 

Seems we haven't made much progress in reproducing the issue :-( I guess
we'll need to know more about the machine where this happens. Is there
anything special about the hardware/config? Are you monitoring size of
the pg_multixact directory?

> How do you collect wait events for aggregation? just insert into some table 
> with cron?
> 

No, I have a simple shell script (attached) sampling data from
pg_stat_activity regularly. Then I load it into a table and aggregate to
get a summary.
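
(For reference, the core of such sampling can be expressed directly in SQL --
a minimal sketch using only pg_stat_activity; run it periodically, store the
rows somewhere, and aggregate later:)

    SELECT now() AS sample_time, wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE state = 'active' AND wait_event IS NOT NULL
    GROUP BY 1, 2, 3;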


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


collect-wait-events.sh
Description: application/shellscript


Re: MultiXact\SLRU buffers configuration

2020-11-09 Thread Andrey Borodin



> On 10 Nov 2020, at 05:13, Tomas Vondra wrote:
> After the issue reported in [1] got fixed, I've restarted the multi-xact
> stress test, hoping to reproduce the issue. But so far no luck :-(


Tomas, many thanks for looking into this. I figured out that to make multixact 
sets bigger, transactions must hang for a while and lock a large set of tuples, 
but not a continuous range, to avoid contention on buffer_content.
I did not manage to implement this via pgbench, which is why I was trying to hack 
on a separate Go program. But, essentially, no luck either.
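
(Purely as an illustration of the workload shape described above -- a made-up
psql sketch, not the actual Go program: each session locks a scattered,
non-contiguous set of rows and then holds the locks for a while so the
multixacts keep growing; table and column names follow the pg_stat_activity
sample below:)

    BEGIN;
    -- lock a scattered subset of rows, not a contiguous range
    SELECT * FROM t1 WHERE i % 2 = 0 AND i < 10000 FOR SHARE;
    SELECT pg_sleep(10);   -- hold the locks while other sessions add lockers
    COMMIT;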
I was observing something similar, though

Friday, 8 May 2020, 15:08:37 (every 1s)

  pid  | wait_event | wait_event_type | state  |
   query
---++-++
 41344 | ClientRead | Client  | idle   | insert into t1 
select generate_series(1,100,1)
 41375 | MultiXactOffsetControlLock | LWLock  | active | select * from 
t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock  | active | select * from 
t1 where i = ANY ($1) for share
 41378 || | active | select * from 
t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock  | active | select * from 
t1 where i = ANY ($1) for share
 41381 || | active | select * from 
t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock  | active | select * from 
t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock  | active | select * from 
t1 where i = ANY ($1) for share
(8 rows)

but this picture was not stable.

How do you collect wait events for aggregation? just insert into some table 
with cron?

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-11-09 Thread Tomas Vondra
Hi,

After the issue reported in [1] got fixed, I've restarted the multi-xact
stress test, hoping to reproduce the issue. But so far no luck :-(

I've started slightly different tests on two machines - on one machine
I've done this:

  a) init.sql

  create table t (a int);
  insert into t select i from generate_series(1,1) s(i);
  alter table t add primary key (a);

  b) select.sql

  SELECT * FROM t
   WHERE a = (1+mod(abs(hashint4(extract(epoch from now())::int)),
1)) FOR KEY SHARE;

  c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

The idea is to have large table and many clients hitting a small random
subset of the rows. A sample of wait events from ~24h run looks like this:

  e_type  |e_name|   sum
--+--+--
 LWLock   | BufferContent| 13913863
  |  |  7194679
 LWLock   | WALWrite |  1710507
 Activity | LogicalLauncherMain  |   726599
 Activity | AutoVacuumMain   |   726127
 Activity | WalWriterMain|   725183
 Activity | CheckpointerMain |   604694
 Client   | ClientRead   |   599900
 IO   | WALSync  |   502904
 Activity | BgWriterMain |   378110
 Activity | BgWriterHibernate|   348464
 IO   | WALWrite |   129412
 LWLock   | ProcArray| 6633
 LWLock   | WALInsert| 5714
 IO   | SLRUWrite| 2580
 IPC  | ProcArrayGroupUpdate | 2216
 LWLock   | XactSLRU | 2196
 Timeout  | VacuumDelay  | 1078
 IPC  | XactGroupUpdate  |  737
 LWLock   | LockManager  |  503
 LWLock   | WALBufMapping|  295
 LWLock   | MultiXactMemberSLRU  |  267
 IO   | DataFileWrite|   68
 LWLock   | BufferIO |   59
 IO   | DataFileRead |   27
 IO   | DataFileFlush|   14
 LWLock   | MultiXactGen |7
 LWLock   | BufferMapping|1

So, nothing particularly interesting - there certainly are not many wait
events related to SLRU.

On the other machine I did this:

  a) init.sql
  create table t (a int primary key);
  insert into t select i from generate_series(1,1000) s(i);

  b) select.sql
  select * from t for key share;

  c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

and the wait events (24h run too) look like this:

  e_type   |e_name |   sum
---+---+--
 LWLock| BufferContent | 20804925
   |   |  2575369
 Activity  | LogicalLauncherMain   |   745780
 Activity  | AutoVacuumMain|   745292
 Activity  | WalWriterMain |   740507
 Activity  | CheckpointerMain  |   737691
 Activity  | BgWriterHibernate |   731123
 LWLock| WALWrite  |   570107
 IO| WALSync   |   452603
 Client| ClientRead|   151438
 BufferPin | BufferPin |23466
 LWLock| WALInsert |21631
 IO| WALWrite  |19050
 LWLock| ProcArray |15082
 Activity  | BgWriterMain  |14655
 IPC   | ProcArrayGroupUpdate  | 7772
 LWLock| WALBufMapping | 3555
 IO| SLRUWrite | 1951
 LWLock| MultiXactGen  | 1661
 LWLock| MultiXactMemberSLRU   |  359
 LWLock| MultiXactOffsetSLRU   |  242
 LWLock| XactSLRU  |  141
 IPC   | XactGroupUpdate   |  104
 LWLock| LockManager   |   28
 IO| DataFileRead  |4
 IO| ControlFileSyncUpdate |1
 Timeout   | VacuumDelay   |1
 IO| WALInitWrite  |1

Also nothing particularly interesting - few SLRU wait events.

So unfortunately this does not really reproduce the SLRU locking issues
you're observing - clearly, there has to be something else triggering
it. Perhaps this workload is too simplistic, or maybe we need to run
different queries. Or maybe the hw needs to be somewhat different (more
CPUs? different storage?)


[1]
https://www.postgresql.org/message-id/20201104013205.icogbi773przyny5@development

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: MultiXact\SLRU buffers configuration

2020-11-02 Thread Andrey Borodin



> On 29 Oct 2020, at 18:49, Tomas Vondra wrote:
> 
> On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:
>> 
>> 
>>> On 29 Oct 2020, at 04:32, Tomas Vondra wrote:
>>> 
>>> It's not my intention to be mean or anything like that, but to me this
>>> means we don't really understand the problem we're trying to solve. Had
>>> we understood it, we should be able to construct a workload reproducing
>>> the issue ...
>>> 
>>> I understand what the individual patches are doing, and maybe those
>>> changes are desirable in general. But without any benchmarks from a
>>> plausible workload I find it hard to convince myself that:
>>> 
>>> (a) it actually will help with the issue you're observing on production
>>> 
>>> and
>>> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>>> 
>>> 
>>> I'm willing to invest some of my time into reviewing/testing this, but I
>>> think we badly need better insight into the issue, so that we can build
>>> a workload reproducing it. Perhaps collecting some perf profiles and a
>>> sample of the queries might help, but I assume you already tried that.
>> 
>> Thanks, Tomas! This totally makes sense.
>> 
>> Indeed, collecting queries did not help yet. We have a load-test environment 
>> equivalent to production (but with 10x fewer shards) and a copy of the production 
>> workload queries. But the problem does not manifest there.
>> Why do I think the problem is in MultiXacts?
>> Here is a chart with number of wait events on each host
>> 
>> 
>> During the problem MultiXactOffsetControlLock and SLRURead dominate all 
>> other lock types. After primary switchover to another node SLRURead 
>> continued for a bit there, then disappeared.
> 
> OK, so most of this seems to be due to SLRURead and
> MultiXactOffsetControlLock. Could it be that there were too many
> multixact members, triggering autovacuum to prevent multixact
> wraparound? That might generate a lot of IO on the SLRU. Are you
> monitoring the size of the pg_multixact directory?

Yes, we had some problems with 'multixact "members" limit exceeded' a long time 
ago.
We tuned autovacuum_multixact_freeze_max_age = 200000000 and 
vacuum_multixact_freeze_table_age = 75000000 (half of the defaults) and since then 
have not encountered this problem (~5 months).
But the MultiXactOffsetControlLock problem persists. The problem was partially 
solved by adding more shards. But when one of the shards encounters a problem it's 
either MultiXacts or vacuum causing relation truncation (unrelated to this 
thread).
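
(As an aside, how close a database is to those multixact freeze thresholds can
be checked with the stock mxid_age() function -- a minimal sketch:)

    SELECT datname, mxid_age(datminmxid) AS multixact_age
    FROM pg_database
    ORDER BY 2 DESC;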

Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-10-29 Thread Tomas Vondra

On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:




On 29 Oct 2020, at 04:32, Tomas Vondra wrote:

It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing on production

and
(b) it's actually worth the extra complexity (e.g. the lwlock changes)


I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.


Thanks, Tomas! This totally makes sense.

Indeed, collecting queries did not help yet. We have a load-test environment 
equivalent to production (but with 10x fewer shards) and a copy of the production 
workload queries. But the problem does not manifest there.
Why do I think the problem is in MultiXacts?
Here is a chart with number of wait events on each host


During the problem MultiXactOffsetControlLock and SLRURead dominate all other 
lock types. After primary switchover to another node SLRURead continued for a 
bit there, then disappeared.


OK, so most of this seems to be due to SLRURead and
MultiXactOffsetControlLock. Could it be that there were too many
multixact members, triggering autovacuum to prevent multixact
wraparound? That might generate a lot of IO on the SLRU. Are you
monitoring the size of the pg_multixact directory?
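
(One way to watch that, as a sketch using stock admin functions -- superuser
privileges or explicit EXECUTE grants assumed:)

    SELECT 'offsets' AS slru,
           sum((pg_stat_file('pg_multixact/offsets/' || f)).size) AS bytes
    FROM pg_ls_dir('pg_multixact/offsets') AS f
    UNION ALL
    SELECT 'members',
           sum((pg_stat_file('pg_multixact/members/' || f)).size)
    FROM pg_ls_dir('pg_multixact/members') AS f;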


Backtraces on standbys during the problem show that most backends are 
sleeping in pg_usleep(1000L) and are not included in the wait stats on these 
charts.

Currently I'm considering writing a test that directly calls MultiXactIdExpand(), 
MultiXactIdCreate(), and GetMultiXactIdMembers() from an extension. What do you 
think, would benchmarks in such tests be meaningful?



I don't know. I'd much rather have a SQL-level benchmark than an
extension doing this kind of stuff.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: MultiXact\SLRU buffers configuration

2020-10-28 Thread Tomas Vondra

Hi,

On Wed, Oct 28, 2020 at 12:34:58PM +0500, Andrey Borodin wrote:

Tomas, thanks for looking into this!


On 28 Oct 2020, at 06:36, Tomas Vondra wrote:


This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?

If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?


All patches in this thread aim at the same goal: improve performance in the 
presence of MultiXact lock contention.
I could not build a synthetic reproduction of the problem; however, I did some 
MultiXact stressing here [0]. It's a clumsy test program, because it is still 
not clear to me which workload parameters trigger MultiXact lock 
contention. In the generic case I was encountering other locks like *GenLock: 
XidGenLock, MultixactGenLock, etc. Yet our production system has encountered this 
problem approximately once a month throughout this year.

The test program locks for share a different set of tuples in the presence of concurrent 
full scans.
To produce a set of locks we choose one of 14 bits. If a row number has this 
bit set to 0 we lock it.
I've been measuring the time to lock all rows 3 times for each of the 14 bits, observing 
the total time to set all locks.
During the test I was observing locks in pg_stat_activity; if they did not 
contain enough MultiXact locks I tuned parameters further (number of 
concurrent clients, number of bits, select queries, etc.).

Why is it so complicated? It seems that other reproductions of a problem were 
encountering other locks.



It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing on production

and 


(b) it's actually worth the extra complexity (e.g. the lwlock changes)


I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: MultiXact\SLRU buffers configuration

2020-10-28 Thread Tomas Vondra

On Wed, Oct 28, 2020 at 10:36:39PM +0300, Alexander Korotkov wrote:

Hi, Tomas!

Thank you for your review.

On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra
 wrote:

I did a quick review on this patch series. A couple comments:


0001


This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.


Yes, but this is not merely to allow callers to do an Assert().
Sometimes in multixacts it could save us some relocks.  So, we can
skip relocking to exclusive mode if the lock is already exclusive.
Adding Assert() to every caller is probably overkill.



Hmm, OK. That can only happen in GetMultiXactIdMembers, which is the
only place where we do retry, right? Do we actually know this makes any
measurable difference? It seems we're mostly imagining that it might
help, but we don't have any actual proof of that (e.g. a workload which
we might benchmark). Or am I missing something?

For me, the extra conditions make it way harder to reason about the
behavior of the code, and I can't convince myself it's worth it.



IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or omission.)


Having just the flag is exactly what the original version by Andrey
did.  But if we have to read two multixact offsets pages or multiple
members pages in one GetMultiXactIdMembers(), then it relocks
from exclusive mode to exclusive mode.  I decided that once we decide
to optimize these locks, this situation is worth avoiding.


In any case, it'd make the lwlock.c changes unnecessary, I think.


I agree that it would be better to not touch lwlock.c.  But I didn't
find a way to avoid relocking from exclusive mode to exclusive mode without
touching lwlock.c or making the code cumbersome in other places.



Hmm. OK.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: MultiXact\SLRU buffers configuration

2020-10-28 Thread Alexander Korotkov
Hi, Tomas!

Thank you for your review.

On Wed, Oct 28, 2020 at 4:36 AM Tomas Vondra
 wrote:
> I did a quick review on this patch series. A couple comments:
>
>
> 0001
> 
>
> This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
> changed to return information about what lock was used, merely to allow
> the callers to do an Assert() that the value is not LW_NONE.

Yes, but this is not merely to allow callers to do an Assert().
Sometimes in multixacts it could save us some relocks.  So, we can
skip relocking to exclusive mode if the lock is already exclusive.
Adding Assert() to every caller is probably overkill.

> IMO we could achieve exactly the same thing by passing a simple flag
> that would say 'make sure we got a lock' or something like that. In
> fact, aren't all callers doing the assert? That'd mean we can just do
> the check always, without the flag. (I see GetMultiXactIdMembers does
> two calls and only checks the second result, but I wonder if that's
> intended or omission.)

Having just the flag is exactly what the original version by Andrey
did.  But if we have to read two multixact offsets pages or multiple
members pages in one GetMultiXactIdMembers(), then it relocks
from exclusive mode to exclusive mode.  I decided that once we decide
to optimize these locks, this situation is worth avoiding.

> In any case, it'd make the lwlock.c changes unnecessary, I think.

I agree that it would be better to not touch lwlock.c.  But I didn't
find a way to avoid relocking from exclusive mode to exclusive mode without
touching lwlock.c or making the code cumbersome in other places.

> 0002
> 
>
> Specifies the number cached MultiXact by backend. Any SLRU lookup ...
>
> should be 'number of cached ...'

Sounds reasonable.

> 0003
> 
>
>  * Conditional variable for waiting till the filling of the next multixact
>  * will be finished.  See GetMultiXactIdMembers() and RecordNewMultiXact()
>  * for details.
>
> Perhaps 'till the next multixact is filled' or 'gets full' would be
> better. Not sure.

Sounds reasonable as well.

--
Regards,
Alexander Korotkov




Re: MultiXact\SLRU buffers configuration

2020-10-28 Thread Andrey Borodin
Tomas, thanks for looking into this!

> On 28 Oct 2020, at 06:36, Tomas Vondra wrote:
> 
> 
> This thread started with a discussion about making the SLRU sizes
> configurable, but this patch version only adds a local cache. Does this
> achieve the same goal, or would we still gain something by having GUCs
> for the SLRUs?
> 
> If we're claiming this improves performance, it'd be good to have some
> workload demonstrating that and measurements. I don't see anything like
> that in this thread, so it's a bit hand-wavy. Can someone share details
> of such workload (even synthetic one) and some basic measurements?

All patches in this thread aim at the same goal: improve performance in the 
presence of MultiXact lock contention.
I could not build a synthetic reproduction of the problem; however, I did some 
MultiXact stressing here [0]. It's a clumsy test program, because it is still 
not clear to me which workload parameters trigger MultiXact lock 
contention. In the generic case I was encountering other locks like *GenLock: 
XidGenLock, MultixactGenLock, etc. Yet our production system has encountered this 
problem approximately once a month throughout this year.

The test program locks for share a different set of tuples in the presence of concurrent 
full scans.
To produce a set of locks we choose one of 14 bits. If a row number has this 
bit set to 0 we lock it.
I've been measuring the time to lock all rows 3 times for each of the 14 bits, observing 
the total time to set all locks.
During the test I was observing locks in pg_stat_activity; if they did not 
contain enough MultiXact locks I tuned parameters further (number of 
concurrent clients, number of bits, select queries, etc.).
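
(As an illustration of that selection rule -- a made-up sketch, not the actual
test program; with bit number 3 chosen, it locks every row whose number has
that bit set to 0, i.e. an interleaved, non-contiguous set of rows:)

    SELECT * FROM t1 WHERE (i >> 3) & 1 = 0 FOR SHARE;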

Why is it so complicated? It seems that other reproductions of a problem were 
encountering other locks.

Let's describe the patches in this thread from the POV of these tests.

*** Configurable SLRU buffers for MultiXact members and offsets.
From the tests it is clear that high and low values for these buffers affect the 
test time. Here are times for one test run with different offsets and members 
sizes [1].
Our production currently runs with (numbers are pages of buffers)
+#define NUM_MXACTOFFSET_BUFFERS 32
+#define NUM_MXACTMEMBER_BUFFERS 64
And, looking back at the incidents in summer and fall 2020, it seems like it 
mostly helped.

But it's hard to give tuning advice based on the test results. Values (32, 64) 
produce a 10% better result than the current hardcoded values (8, 16). In the 
generic case this is not what someone should tune first.

*** Configurable caches of MultiXacts.
The tests were specifically designed to beat caches. So, according to the test, the 
bigger the cache is, the more time it takes to accomplish the test [2].
Anyway, the cache is local to the backend and its purpose is deduplication of written 
MultiXacts, not enhancing reads.

*** Taking advantage of SimpleLruReadPage_ReadOnly() in MultiXacts.
This simply aligns MultiXact with subtransactions and CLOG. Other SLRUs already 
take advantage of reading the SLRU with a shared lock.
On synthetic tests without background selects this patch adds another ~4.7% 
of performance [3] against [4]. This improvement seems consistent across 
different parameter values, yet within measurement deviation (see the difference 
between the warmup run [5] and the closing run [6]).
All in all, these attempts to measure the impact are hand-wavy too. But it makes 
sense to use a consistent approach among similar subsystems (MultiXacts, 
Subtrans, CLOG, etc.).

*** Reduce sleep in GetMultiXactIdMembers() on standby.
The problem with pg_usleep(1000L) within GetMultiXactIdMembers() manifests on 
standbys during contention on MultiXactOffsetControlLock. It's even harder to 
reproduce.
Yet it seems obvious that reducing the sleep to a shorter time frame will make 
the number of sleeping backends smaller.

For consistency I've returned the patch with SLRU buffer configs to the patchset (other 
patches are intact). But I'm mostly concerned with patches 1 and 3.

Thanks!

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress
[1] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L22-L39
[2] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L83-L99
[3] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L9
[4] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L29
[5] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L3
[6] https://github.com/x4m/multixact_stress/blob/master/testresults.txt#L19



v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data


v6-0004-Add-GUCs-to-tune-MultiXact-SLRUs.patch
Description: Binary data


v6-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch
Description: Binary data


v6-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-10-27 Thread Tomas Vondra

On Tue, Oct 27, 2020 at 08:23:26PM +0300, Alexander Korotkov wrote:

On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov  wrote:

On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin  wrote:
> Thanks for your review, Alexander!
> +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> Other changes seem correct to me too.
>
>
> I've tried to find optimal value for cache size and it seems to me that it 
affects multixact scalability much less than sizes of offsets\members buffers. I 
concur that patch 2 of the patchset does not seem documented enough.

Thank you.  I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.


I see that patchset v5 doesn't pass the tests due to a typo in an assert.
The fixed version is attached.



I did a quick review on this patch series. A couple comments:


0001


This looks quite suspicious to me - SimpleLruReadPage_ReadOnly is
changed to return information about what lock was used, merely to allow
the callers to do an Assert() that the value is not LW_NONE.

IMO we could achieve exactly the same thing by passing a simple flag
that would say 'make sure we got a lock' or something like that. In
fact, aren't all callers doing the assert? That'd mean we can just do
the check always, without the flag. (I see GetMultiXactIdMembers does
two calls and only checks the second result, but I wonder if that's
intended or omission.)

In any case, it'd make the lwlock.c changes unnecessary, I think.


0002


Specifies the number cached MultiXact by backend. Any SLRU lookup ...

should be 'number of cached ...'


0003


* Conditional variable for waiting till the filling of the next multixact
* will be finished.  See GetMultiXactIdMembers() and RecordNewMultiXact()
* for details.

Perhaps 'till the next multixact is filled' or 'gets full' would be
better. Not sure.


This thread started with a discussion about making the SLRU sizes
configurable, but this patch version only adds a local cache. Does this
achieve the same goal, or would we still gain something by having GUCs
for the SLRUs?

If we're claiming this improves performance, it'd be good to have some
workload demonstrating that and measurements. I don't see anything like
that in this thread, so it's a bit hand-wavy. Can someone share details
of such workload (even synthetic one) and some basic measurements?


regards


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: MultiXact\SLRU buffers configuration

2020-10-27 Thread Alexander Korotkov
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov  wrote:
> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin  wrote:
> > Thanks for your review, Alexander!
> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> > Other changes seem correct to me too.
> >
> >
> > I've tried to find optimal value for cache size and it seems to me that it 
> > affects multixact scalability much less than sizes of offsets\members 
> > buffers. I concur that patch 2 of the patchset does not seem documented 
> > enough.
>
> Thank you.  I've made a few more minor adjustments to the patchset.
> I'm going to push 0001 and 0003 if there are no objections.

I see that patchset v5 doesn't pass the tests due to a typo in an assert.
The fixed version is attached.

--
Regards,
Alexander Korotkov


v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data


v6-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v6-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-10-27 Thread Alexander Korotkov
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin  wrote:
> Thanks for your review, Alexander!
> +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> Other changes seem correct to me too.
>
>
> I've tried to find optimal value for cache size and it seems to me that it 
> affects multixact scalability much less than sizes of offsets\members 
> buffers. I concur that patch 2 of the patchset does not seem documented 
> enough.

Thank you.  I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.

--
Regards,
Alexander Korotkov


v5-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data


v5-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v5-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-10-26 Thread Andrey Borodin



> On 26 Oct 2020, at 06:05, Alexander Korotkov  wrote:
> 
> Hi!
> 
> On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin  
> wrote:
>>> On 28 Aug 2020, at 23:08, Anastasia Lubennikova  wrote:
>>> 
>>> 1) The first patch is sensible and harmless, so I think it is ready for 
>>> committer. I haven't tested the performance impact, though.
>>> 
>>> 2) I like the initial proposal to make various SLRU buffers configurable, 
>>> however, I doubt if it really solves the problem, or just moves it to 
>>> another place?
>>> 
>>> The previous patch you sent was based on some version that contained 
>>> changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' 
>>> and 'multixact_members_slru_buffers'. Have you just forgot to attach them? 
>>> The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
>>> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes 
>>> are trivial.
>>> 
>>> 2.1) I think that both min and max values for this parameter are too 
>>> extreme. Have you tested them?
>>> 
>>> +   _local_cache_entries,
>>> +   256, 2, INT_MAX / 2,
>>> 
>>> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>>> 
>>> 3) No changes for third patch. I just renamed it for consistency.
>> 
>> Thank you for your review.
>> 
>> Indeed, I had 4th patch with tests, but these tests didn't work well: I 
>> still did not manage to stress SLRUs to reproduce problem from production...
>> 
>> You are absolutely correct in point 2: I did only tests with sane values. 
>> And observed extreme performance degradation with values ~ 64 megabytes. I 
>> do not know which highest values should we pick? 1Gb? Or highest possible 
>> functioning value?
>> 
>> I greatly appreciate your review, sorry for so long delay. Thanks!
> 
> I took a look at this patchset.
> 
> The 1st and 3rd patches look good to me.  I made just minor improvements.
> 1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
> SLRU lock, which is already taken in exclusive mode.  I've evaded this
> by passing the lock mode as a parameter to
> SimpleLruReadPage_ReadOnly().
> 3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
> inside ConditionVariableSleep() if needed.  Also, no current wait
> events use slashes, and I don't think we should introduce slashes
> here.  Even if we should, then we should also rename existing wait
> events to be consistent with a new one.  So, I've renamed the new wait
> event to remove the slash.
> 
> Regarding the patch 2.  I see the current documentation in the patch
> doesn't explain to the user how to set the new parameter.  I think we
> should give users an idea what workloads need high values of
> multixact_local_cache_entries parameter and what doesn't.  Also, we
> should explain the negative aspects of high values
> multixact_local_cache_entries.  Ideally, we should get the advantage
> without overloading users with new nontrivial parameters, but I don't
> have a particular idea here.
> 
> I'd like to propose committing 1 and 3, but leave 2 for further review.

Thanks for your review, Alexander! 
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.


I've tried to find an optimal value for the cache size and it seems to me that it 
affects multixact scalability much less than the sizes of the offsets\members buffers. 
I concur that patch 2 of the patchset does not seem documented enough.

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-10-25 Thread Alexander Korotkov
Hi!

On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin  wrote:
> > On 28 Aug 2020, at 23:08, Anastasia Lubennikova  wrote:
> >
> > 1) The first patch is sensible and harmless, so I think it is ready for 
> > committer. I haven't tested the performance impact, though.
> >
> > 2) I like the initial proposal to make various SLRU buffers configurable, 
> > however, I doubt if it really solves the problem, or just moves it to 
> > another place?
> >
> > The previous patch you sent was based on some version that contained 
> > changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' 
> > and 'multixact_members_slru_buffers'. Have you just forgot to attach them? 
> > The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> > Meanwhile, I attach the rebased patch to calm down the CFbot. The changes 
> > are trivial.
> >
> > 2.1) I think that both min and max values for this parameter are too 
> > extreme. Have you tested them?
> >
> > +   _local_cache_entries,
> > +   256, 2, INT_MAX / 2,
> >
> > 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
> >
> > 3) No changes for third patch. I just renamed it for consistency.
>
> Thank you for your review.
>
> Indeed, I had 4th patch with tests, but these tests didn't work well: I still 
> did not manage to stress SLRUs to reproduce problem from production...
>
> You are absolutely correct in point 2: I did only tests with sane values. And 
> observed extreme performance degradation with values ~ 64 megabytes. I do not 
> know which highest values should we pick? 1Gb? Or highest possible 
> functioning value?
>
> I greatly appreciate your review, sorry for so long delay. Thanks!

I took a look at this patchset.

The 1st and 3rd patches look good to me.  I made just minor improvements.
1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
SLRU lock, which is already taken in exclusive mode.  I've evaded this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly().
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed.  Also, no current wait
events use slashes, and I don't think we should introduce slashes
here.  Even if we should, then we should also rename existing wait
events to be consistent with a new one.  So, I've renamed the new wait
event to remove the slash.

Regarding the patch 2.  I see the current documentation in the patch
doesn't explain to the user how to set the new parameter.  I think we
should give users an idea what workloads need high values of
multixact_local_cache_entries parameter and what doesn't.  Also, we
should explain the negative aspects of high values
multixact_local_cache_entries.  Ideally, we should get the advantage
without overloading users with new nontrivial parameters, but I don't
have a particular idea here.

I'd like to propose committing 1 and 3, but leave 2 for further review.

--
Regards,
Alexander Korotkov


v4-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offsets.patch
Description: Binary data


v4-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v4-0003-Add-conditional-variable-to-wait-for-next-MultXact-o.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-10-07 Thread Anastasia Lubennikova

On 28.09.2020 17:41, Andrey M. Borodin wrote:

Hi, Anastasia!


On 28 Aug 2020, at 23:08, Anastasia Lubennikova  wrote:

1) The first patch is sensible and harmless, so I think it is ready for 
committer. I haven't tested the performance impact, though.

2) I like the initial proposal to make various SLRU buffers configurable, 
however, I doubt if it really solves the problem, or just moves it to another 
place?

The previous patch you sent was based on some version that contained changes for other 
slru buffers numbers: 'multixact_offsets_slru_buffers' and 
'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message 
"[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are 
trivial.

2.1) I think that both min and max values for this parameter are too extreme. 
Have you tested them?

+   _local_cache_entries,
+   256, 2, INT_MAX / 2,

2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.

3) No changes for third patch. I just renamed it for consistency.

Thank you for your review.

Indeed, I had 4th patch with tests, but these tests didn't work well: I still 
did not manage to stress SLRUs to reproduce problem from production...

You are absolutely correct in point 2: I did only tests with sane values. And 
observed extreme performance degradation with values ~ 64 megabytes. I do not 
know which highest values should we pick? 1Gb? Or highest possible functioning 
value?


I would go with the values that we consider adequate for this setting. 
As I see, there is no strict rule about it in guc.c and many variables 
have large border values. Anyway, we need to explain it at least in the 
documentation and code comments.


It seems that the default was conservative enough, so it could also serve as the 
minimal value. As for the maximum, can you provide any benchmark 
results? If we see a peak and a noticeable performance degradation 
after it, we can use that to calculate the preferable max value.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: MultiXact\SLRU buffers configuration

2020-09-28 Thread Andrey M. Borodin
Hi, Anastasia!

> On 28 Aug 2020, at 23:08, Anastasia Lubennikova  wrote:
> 
> 1) The first patch is sensible and harmless, so I think it is ready for 
> committer. I haven't tested the performance impact, though.
> 
> 2) I like the initial proposal to make various SLRU buffers configurable, 
> however, I doubt if it really solves the problem, or just moves it to another 
> place?
> 
> The previous patch you sent was based on some version that contained changes 
> for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 
> 'multixact_members_slru_buffers'. Have you just forgot to attach them? The 
> patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are 
> trivial.
> 
> 2.1) I think that both min and max values for this parameter are too extreme. 
> Have you tested them?
> 
> +   _local_cache_entries,
> +   256, 2, INT_MAX / 2,
> 
> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
> 
> 3) No changes for third patch. I just renamed it for consistency.

Thank you for your review. 

Indeed, I had a 4th patch with tests, but these tests didn't work well: I still 
did not manage to stress SLRUs enough to reproduce the problem from production...

You are absolutely correct in point 2: I only ran tests with sane values, and 
observed extreme performance degradation with values of ~64 megabytes. I do not 
know what the highest value should be: 1 GB? Or the highest possible functioning 
value?

I greatly appreciate your review, sorry for such a long delay. Thanks!

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-08-28 Thread Anastasia Lubennikova

On 08.07.2020 10:03, Andrey M. Borodin wrote:



On 2 Jul 2020, at 17:02, Daniel Gustafsson  wrote:


On 15 May 2020, at 02:03, Kyotaro Horiguchi  wrote:
Generally in such cases, condition variables would work.  In the
attached PoC, the reader side gets no penalty in the "likely" code
path.  The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases.  But I couldn't cause the
situation where the sleep 1000u is reached.

The submitted patch no longer applies, can you please submit an updated
version?  I'm marking the patch Waiting on Author in the meantime.

Thanks, Daniel! PFA V2

Best regards, Andrey Borodin.


1) The first patch is sensible and harmless, so I think it is ready for 
committer. I haven't tested the performance impact, though.


2) I like the initial proposal to make various SLRU buffers 
configurable, however, I doubt if it really solves the problem, or just 
moves it to another place?


The previous patch you sent was based on some version that contained 
changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' 
and 'multixact_members_slru_buffers'. Have you just forgot to attach 
them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The 
changes are trivial.


2.1) I think that both min and max values for this parameter are too 
extreme. Have you tested them?


+   _local_cache_entries,
+   256, 2, INT_MAX / 2,

2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.

3) No changes for third patch. I just renamed it for consistency.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 264b475fbf6ee519dc3b4eb2cca25145d2aaa569 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sat, 9 May 2020 21:19:18 +0500
Subject: [PATCH v2 1/4] Use shared lock in GetMultiXactIdMembers for offsets
 and members

Previously, reading a multixact required exclusive control locks on the
offsets and members SLRUs. This could lead to contention.
In this commit we take advantage of SimpleLruReadPage_ReadOnly,
similar to CLOG usage.
---
 src/backend/access/transam/clog.c  |  2 +-
 src/backend/access/transam/commit_ts.c |  2 +-
 src/backend/access/transam/multixact.c | 12 ++--
 src/backend/access/transam/slru.c  |  8 +---
 src/backend/access/transam/subtrans.c  |  2 +-
 src/backend/commands/async.c   |  2 +-
 src/backend/storage/lmgr/predicate.c   |  2 +-
 src/include/access/slru.h  |  2 +-
 8 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..3614d0c774 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -643,7 +643,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 
-	slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
+	slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid, false);
 	byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
 
 	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 9cdb136435..5fbbf446e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -342,7 +342,7 @@ TransactionIdGetCommitTsData(TransactionId xid, TimestampTz *ts,
 	}
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
-	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid);
+	slotno = SimpleLruReadPage_ReadOnly(CommitTsCtl, pageno, xid, false);
 	memcpy(&entry,
 		   CommitTsCtl->shared->page_buffer[slotno] +
 		   SizeOfCommitTimestampEntry * entryno,
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 79ec21c75a..241d9dd78d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1321,12 +1321,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * time on every multixact creation.
 	 */
 retry:
-	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
-	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+	/* lock is acquired by SimpleLruReadPage_ReadOnly */
+	slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi, false);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 	offset = *offptr;
@@ -1358,7 +1358,7 @@ retry:
 		entryno = MultiXactIdToOffsetEntry(tmpMXact);
 
 		if (pageno != prev_pageno)
-			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
+			slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, tmpMXact, true);
 
 		offptr = (MultiXactOffset *) 

Re: MultiXact\SLRU buffers configuration

2020-08-02 Thread Daniel Gustafsson



> On 8 Jul 2020, at 09:03, Andrey M. Borodin  wrote:
> 
> 
> 
>> On 2 Jul 2020, at 17:02, Daniel Gustafsson  wrote:
>> 
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi  wrote:
>> 
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>> 
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2

This version too has stopped applying according to the CFbot.  I've moved it to
the next commitfest as we're out of time in this one and it was only pointed
out now, but kept it Waiting on Author.

cheers ./daniel



Re: MultiXact\SLRU buffers configuration

2020-07-08 Thread Andrey M. Borodin


> On 2 Jul 2020, at 17:02, Daniel Gustafsson  wrote:
> 
>> On 15 May 2020, at 02:03, Kyotaro Horiguchi  wrote:
> 
>> Generally in such cases, condition variables would work.  In the
>> attached PoC, the reader side gets no penalty in the "likely" code
>> path.  The writer side always calls ConditionVariableBroadcast but the
>> waiter list is empty in almost all cases.  But I couldn't cause the
>> situation where the sleep 1000u is reached.
> 
> The submitted patch no longer applies, can you please submit an updated
> version?  I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2

Best regards, Andrey Borodin.


v2-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data


v2-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v2-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch
Description: Binary data




Re: MultiXact\SLRU buffers configuration

2020-07-02 Thread Daniel Gustafsson
> On 15 May 2020, at 02:03, Kyotaro Horiguchi  wrote:

> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.

The submitted patch no longer applies, can you please submit an updated
version?  I'm marking the patch Waiting on Author in the meantime.

cheers ./daniel



Re: MultiXact\SLRU buffers configuration

2020-05-19 Thread Kyotaro Horiguchi
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin"  
wrote in 
> 
> 
> > On 15 May 2020, at 05:03, Kyotaro Horiguchi  wrote:
> > 
> > At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" 
> >  wrote in 
> >>> GetMultiXactIdMembers believes that 4 is successfully done if 2
> >>> returned valid offset, but actually that is not obvious.
> >>> 
> >>> If we add a single giant lock just to isolate ,say,
> >>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> >>> unnecessarily.  Perhaps we need finer-grained locking-key for standby
> >>> that works similary to buffer lock on primary, that doesn't cause
> >>> confilicts between irrelevant mxids.
> >>> 
> >> We can just replay members before offsets. If offset is already there - 
> >> members are there too.
> >> But I'd be happy if we could mitigate those 1000us too - with a hint about 
> >> last maixd state in a shared MX state, for example.
> > 
> > Generally in such cases, condition variables would work.  In the
> > attached PoC, the reader side gets no penalty in the "likely" code
> > path.  The writer side always calls ConditionVariableBroadcast but the
> > waiter list is empty in almost all cases.  But I couldn't cause the
> > situation where the sleep 1000u is reached.
> Thanks! That really looks like a good solution without magic timeouts. 
> Beautiful!
> I think I can create temporary extension which calls MultiXact API and tests 
> edge-cases like this 1000us wait.
> This extension will also be also useful for me to assess impact of bigger 
> buffers, reduced read locking (as in my 2nd patch) and other tweaks.

Happy to hear that. It would need to use a timeout just in case, though.

> >> Actually, if we read empty mxid array instead of something that is 
> >> replayed just yet - it's not a problem of inconsistency, because 
> >> transaction in this mxid could not commit before we started. ISTM.
> >> So instead of fix, we, probably, can just add a comment. If this reasoning 
> >> is correct.
> > 
> > The step 4 of the reader side reads the members of the target mxid. It
> > is already written if the offset of the *next* mxid is filled-in.
> Most often - yes, but members are not guaranteed to be filled in order. Those 
> who win MXMemberControlLock will write first.
> But nobody can read members of MXID before it is returned. And its members 
> will be written before returning MXID.

Yeah, right.  Otherwise assertion failure happens.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MultiXact\SLRU buffers configuration

2020-05-15 Thread Andrey M. Borodin



> On 15 May 2020, at 05:03, Kyotaro Horiguchi  wrote:
> 
> At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" 
>  wrote in 
>>> GetMultiXactIdMembers believes that 4 is successfully done if 2
>>> returned valid offset, but actually that is not obvious.
>>> 
>>> If we add a single giant lock just to isolate ,say,
>>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
>>> unnecessarily.  Perhaps we need finer-grained locking-key for standby
>>> that works similary to buffer lock on primary, that doesn't cause
>>> confilicts between irrelevant mxids.
>>> 
>> We can just replay members before offsets. If offset is already there - 
>> members are there too.
>> But I'd be happy if we could mitigate those 1000us too - with a hint about 
>> last maixd state in a shared MX state, for example.
> 
> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.
Thanks! That really looks like a good solution without magic timeouts. 
Beautiful!
I think I can create a temporary extension which calls the MultiXact API and tests 
edge cases like this 1000us wait.
This extension will also be useful for me to assess the impact of bigger 
buffers, reduced read locking (as in my 2nd patch), and other tweaks.

>> Actually, if we read empty mxid array instead of something that is replayed 
>> just yet - it's not a problem of inconsistency, because transaction in this 
>> mxid could not commit before we started. ISTM.
>> So instead of fix, we, probably, can just add a comment. If this reasoning 
>> is correct.
> 
> The step 4 of the reader side reads the members of the target mxid. It
> is already written if the offset of the *next* mxid is filled-in.
Most often, yes, but members are not guaranteed to be filled in order: those 
who win MXMemberControlLock will write first.
But nobody can read the members of an MXID before it is returned, and its members 
will be written before the MXID is returned.

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-05-14 Thread Kyotaro Horiguchi
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin"  
wrote in 
> > GetMultiXactIdMembers believes that 4 is successfully done if 2
> > returned valid offset, but actually that is not obvious.
> > 
> > If we add a single giant lock just to isolate ,say,
> > GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> > unnecessarily.  Perhaps we need finer-grained locking-key for standby
> > that works similary to buffer lock on primary, that doesn't cause
> > confilicts between irrelevant mxids.
> > 
> We can just replay members before offsets. If offset is already there - 
> members are there too.
> But I'd be happy if we could mitigate those 1000us too - with a hint about 
> last maixd state in a shared MX state, for example.

Generally in such cases, condition variables would work.  In the
attached PoC, the reader side gets no penalty in the "likely" code
path.  The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases.  But I couldn't cause the
situation where the sleep 1000u is reached.

> Actually, if we read empty mxid array instead of something that is replayed 
> just yet - it's not a problem of inconsistency, because transaction in this 
> mxid could not commit before we started. ISTM.
> So instead of fix, we, probably, can just add a comment. If this reasoning is 
> correct.

The step 4 of the reader side reads the members of the target mxid. It
is already written if the offset of the *next* mxid is filled-in.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..9db8f6cddd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
+	ConditionVariable nextoff_cv;
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	/* Exchange our lock */
 	LWLockRelease(MultiXactOffsetControlLock);
 
+	/*
+	 *  Let everybody know the offset of this mxid is recorded now. The waiters
+	 *  are waiting for the offset of the mxid next of the target to know the
+	 *  number of members of the target mxid, so we don't need to wait for
+	 *  members of this mxid are recorded.
+	 */
+	ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
 
 	prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
 		if (nextMXOffset == 0)
 		{
 			/* Corner case 2: next multixact is still being filled in */
+
+			/*
+			 * The recorder of the next mxid is just before writing the offset.
+			 * Wait for the offset to be written.
+			 */
+			ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
 			LWLockRelease(MultiXactOffsetControlLock);
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			ConditionVariableSleep(&MultiXactState->nextoff_cv,
+   WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+			ConditionVariableCancelSleep();
 			goto retry;
 		}
 
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
 
 		/* Make sure we zero out the per-backend state */
 		MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+		ConditionVariableInit(&MultiXactState->nextoff_cv);
 	}
 	else
 		Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0ecd29a1d9..1ac6b37188 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3879,6 +3879,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+			event_name = "Mact/WaitNextXactMembers";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..e79bba0bef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -886,7 +886,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_WAIT_NEXT_MXMEMBERS
 } WaitEventIPC;
 
 /* --


Re: MultiXact\SLRU buffers configuration

2020-05-14 Thread Andrey M. Borodin



> On 14 May 2020, at 11:16, Kyotaro Horiguchi  wrote:
> 
> At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" 
>  wrote in 
 I'm looking more at MultiXact and it seems to me that we have a race 
 condition there.
 
 When we create a new MultiXact we do:
 1. Generate new MultiXactId under MultiXactGenLock
 2. Record new mxid with members and offset to WAL
 3. Write offset to SLRU under MultiXactOffsetControlLock
 4. Write members to SLRU under MultiXactMemberControlLock
>>> 
>>> But, don't we hold exclusive lock on the buffer through all the steps
>>> above?
>> Yes...Unless MultiXact is observed on StandBy. This could lead to observing 
>> inconsistent snapshot: one of lockers committed tuple delete, but standby 
>> sees it as alive.
> 
> Ah, right. I looked from GetNewMultiXactId. Actually
> XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
> the creating mxact id. And GetMultiXactIdMembers is considering that
> case.
> 
 When we read MultiXact we do:
 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we 
 sleep and goto 1
 3. Retrieve members from SLRU under MultiXactMemberControlLock
 4. . what we do if there are just zeroes because step 4 is not 
 executed yet? Nothing, return empty members list.
>>> 
>>> So transactions never see such incomplete mxids, I believe.
>> I've observed sleep in step 2. I believe it's possible to observe special 
>> effects of step 4 too.
>> Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it 
>> hits hard on Standbys: if someone is locking whole table on primary - all 
>> seq scans on standbys follow him with MultiXactOffsetControlLock contention.
> 
> GetMultiXactIdMembers believes that 4 is successfully done if 2
> returned valid offset, but actually that is not obvious.
> 
> If we add a single giant lock just to isolate ,say,
> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> unnecessarily.  Perhaps we need finer-grained locking-key for standby
> that works similary to buffer lock on primary, that doesn't cause
> confilicts between irrelevant mxids.
> 
We can just replay members before offsets. If the offset is already there, the members 
are there too.
But I'd be happy if we could mitigate those 1000us too, with a hint about the last 
mxid state in the shared MultiXact state, for example.

Actually, if we read an empty mxid array instead of something that is being replayed 
just now, it's not a problem of inconsistency, because a transaction in this 
mxid could not have committed before we started. ISTM.
So instead of a fix we can probably just add a comment, if this reasoning is 
correct.

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-05-14 Thread Kyotaro Horiguchi
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin"  
wrote in 
> >> I'm looking more at MultiXact and it seems to me that we have a race 
> >> condition there.
> >> 
> >> When we create a new MultiXact we do:
> >> 1. Generate new MultiXactId under MultiXactGenLock
> >> 2. Record new mxid with members and offset to WAL
> >> 3. Write offset to SLRU under MultiXactOffsetControlLock
> >> 4. Write members to SLRU under MultiXactMemberControlLock
> > 
> > But, don't we hold exclusive lock on the buffer through all the steps
> > above?
> Yes...Unless MultiXact is observed on StandBy. This could lead to observing 
> inconsistent snapshot: one of lockers committed tuple delete, but standby 
> sees it as alive.

Ah, right. I was looking at this from the GetNewMultiXactId side. Actually,
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent references to
the mxact id being created, and GetMultiXactIdMembers considers that
case.

> >> When we read MultiXact we do:
> >> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> >> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we 
> >> sleep and goto 1
> >> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> >> 4. . what we do if there are just zeroes because step 4 is not 
> >> executed yet? Nothing, return empty members list.
> > 
> > So transactions never see such incomplete mxids, I believe.
> I've observed sleep in step 2. I believe it's possible to observe special 
> effects of step 4 too.
> Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it 
> hits hard on Standbys: if someone is locking whole table on primary - all seq 
> scans on standbys follow him with MultiXactOffsetControlLock contention.

GetMultiXactIdMembers believes that 4 is successfully done if 2
returned a valid offset, but actually that is not obvious.

If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily.  Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, one that doesn't cause
conflicts between irrelevant mxids.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MultiXact\SLRU buffers configuration

2020-05-13 Thread Andrey M. Borodin



> On 14 May 2020, at 06:25, Kyotaro Horiguchi  wrote:
> 
> At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" 
>  wrote in 
>> 
>> 
>>> On 11 May 2020, at 16:17, Andrey M. Borodin  wrote:
>>> 
>>> I've went ahead and created 3 patches:
>>> 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
>>> 2. Reduce locking level to shared on read of MultiXactId members
>>> 3. Configurable cache size
>> 
>> I'm looking more at MultiXact and it seems to me that we have a race 
>> condition there.
>> 
>> When we create a new MultiXact we do:
>> 1. Generate new MultiXactId under MultiXactGenLock
>> 2. Record new mxid with members and offset to WAL
>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>> 4. Write members to SLRU under MultiXactMemberControlLock
> 
> But, don't we hold exclusive lock on the buffer through all the steps
> above?
Yes... unless the MultiXact is observed on a standby. This could lead to observing an 
inconsistent snapshot: one of the lockers committed a tuple delete, but the standby 
sees the tuple as alive.

>> When we read MultiXact we do:
>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we 
>> sleep and goto 1
>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>> 4. . what we do if there are just zeroes because step 4 is not executed 
>> yet? Nothing, return empty members list.
> 
> So transactions never see such incomplete mxids, I believe.
I've observed the sleep in step 2. I believe it's possible to observe the special 
effects of step 4 too.
Maybe we could add a lock on the standby to dismiss this 1000us wait? Sometimes it 
hits standbys hard: if someone is locking a whole table on the primary, all seq 
scans on standbys follow it with MultiXactOffsetControlLock contention.

It looks like this:
0x7fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#0  0x7fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x56186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2  0x56186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=, members=0x7ffd83377080, multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370
#3  GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4  0x56186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=, t_infomask=) at ./build/../src/backend/access/heap/heapam.c:7039
#5  0x56186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at ./build/../src/backend/access/heap/heapam.c:7080
#6  0x56186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=, snapshot=0x56186f44a058, buffer=230684) at ./build/../src/backend/utils/time/tqual.c:1091
#7  0x56186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at ./build/../src/backend/access/heap/heapam.c:439
#8  0x56186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at ./build/../src/backend/access/heap/heapam.c:1034
#9  heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at ./build/../src/backend/access/heap/heapam.c:1801
#10 0x56186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x56186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 , accessMtd=0x56186de84f20 , node=0x56186f4a4f78) at ./build/../src/backend/executor/execScan.c:97
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 , recheckMtd=0x56186de84ef0 ) at ./build/../src/backend/executor/execScan.c:164


Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-05-13 Thread Kyotaro Horiguchi
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin"  
wrote in 
> 
> 
> > On 11 May 2020, at 16:17, Andrey M. Borodin  wrote:
> > 
> > I've went ahead and created 3 patches:
> > 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
> > 2. Reduce locking level to shared on read of MultiXactId members
> > 3. Configurable cache size
> 
> I'm looking more at MultiXact and it seems to me that we have a race 
> condition there.
> 
> When we create a new MultiXact we do:
> 1. Generate new MultiXactId under MultiXactGenLock
> 2. Record new mxid with members and offset to WAL
> 3. Write offset to SLRU under MultiXactOffsetControlLock
> 4. Write members to SLRU under MultiXactMemberControlLock

But, don't we hold exclusive lock on the buffer through all the steps
above?

> When we read MultiXact we do:
> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we 
> sleep and goto 1
> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> 4. . what we do if there are just zeroes because step 4 is not executed 
> yet? Nothing, return empty members list.

So transactions never see such incomplete mxids, I believe.

> What am I missing?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MultiXact\SLRU buffers configuration

2020-05-13 Thread Andrey M. Borodin



> On 11 May 2020, at 16:17, Andrey M. Borodin  wrote:
> 
> I've went ahead and created 3 patches:
> 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
> 2. Reduce locking level to shared on read of MultiXactId members
> 3. Configurable cache size

I'm looking more at MultiXact and it seems to me that we have a race condition 
there.

When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock

When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we 
sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ... what do we do if there are just zeroes because step 4 is not executed 
yet? Nothing, return an empty members list.

What am I missing?

Best regards, Andrey Borodin.



Re: MultiXact\SLRU buffers configuration

2020-05-11 Thread Andrey M. Borodin


> On 8 May 2020, at 21:36, Andrey M. Borodin  wrote:
> 
> *** The problem ***
> I'm investigating some cases of reduced database performance due to 
> MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation. 
> Both being effectively full table scans.
> The database workload contains a lot of select for share\select for update 
> queries. I've tried to construct synthetic world generator and could not 
> achieve similar lock configuration: I see a lot of different locks in wait 
> events, particularly a lot more MultiXactMemberLocks. But from my experiments 
> with synthetic workload, contention of MultiXactOffsetLock can be reduced by 
> increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
> 
> *** Question 1 ***
> Is it safe to increase number of buffers of MultiXact\All SLRUs, recompile 
> and run database as usual?
> I cannot experiment much with production. But I'm mostly sure that bigger 
> buffers will solve the problem.
> 
> *** Question 2 ***
> Probably, we could do GUCs for SLRU sizes? Are there any reasons not to do 
> them configurable? I think multis, clog, subtransactions and others will 
> benefit from bigger buffer. But, probably, too much of knobs can be confusing.
> 
> *** Question 3 ***
> MultiXact offset lock is always taken as exclusive lock. It turns MultiXact 
> Offset subsystem to single threaded. If someone have good idea how to make it 
> more concurrency-friendly, I'm willing to put some efforts into this.
> Probably, I could just add LWlocks for each offset buffer page. Is it 
> something worth doing? Or are there any hidden cavers and difficulties?

I've created a benchmark [0] imitating MultiXact pressure on my laptop: 7 clients 
are concurrently running "select * from table where primary_key = ANY 
($1) for share", where $1 is an array of identifiers, so that each tuple in the table 
is locked by a different set of XIDs. During this benchmark I observe contention 
on MultiXactOffsetControlLock in pg_stat_activity:

Friday, 8 May 2020 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                      query
-------+----------------------------+-----------------+--------+-------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,100,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark measures the time needed to execute select for update 42 times.
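
For reference, a rough standalone libpq sketch of what one such client does is below 
(this is not the actual multixact_stress code from [0]; the table, column and batch 
size are illustrative):

/* Rough standalone sketch of one benchmark client (not the actual
 * multixact_stress code from [0]; table, column and batch are illustrative).
 * Each client repeatedly locks the same rows FOR SHARE, so tuples accumulate
 * lockers from several backends and get MultiXact xmax values. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* connection parameters from environment */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	for (int i = 0; i < 1000; i++)
	{
		PGresult   *res;

		res = PQexec(conn, "BEGIN");
		PQclear(res);

		/* concurrent clients locking the same batch produce MultiXacts */
		res = PQexec(conn,
					 "SELECT * FROM t1 WHERE i = ANY (ARRAY[1,2,3,4,5,6,7,8]) FOR SHARE");
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
			fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);

		res = PQexec(conn, "COMMIT");
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}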

I've gone ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark times 
grow very high. Yet very big buffers slow the benchmark down too. For this 
benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members 
(defaults are 8 and 16 respectively).
2. The lock optimisation increases performance by 5% on default SLRU sizes. 
Actually, the benchmark does not explicitly read MultiXactId members, but when it 
replaces one set with another it has to read the previous set. I understand that we 
can construct a benchmark to demonstrate dominance of any algorithm, and 5% on a 
synthetic workload is not a very big number. But it just makes sense to try to 
take a shared lock for reading.
3. Manipulating the cache size does not affect the benchmark in any way. It's somewhat 
expected: the benchmark is designed to defeat the cache, and either way 
OffsetControlLock would not be stressed.

For our workload, I think we will just increase the SLRU sizes. But the 
patchset may be useful for tuning and as a performance optimisation of 
MultiXact.

Also, MultiXacts seem to be not a very good fit for the SLRU design. I think it 
would be better to use a B-tree as a container, or at least make MultiXact 
members extendable in place (reserve some size when a multixact is created).
When we want to extend the number of locks for a tuple, currently we will:
1. Iterate through all SLRU buffers for offsets to read 

MultiXact\SLRU buffers configuration

2020-05-08 Thread Andrey M. Borodin
Hi, hackers!

*** The problem ***
I'm investigating some cases of reduced database performance due to 
MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation, 
both being effectively full table scans.
The database workload contains a lot of select for share\select for update 
queries. I've tried to construct a synthetic workload generator and could not 
achieve a similar lock configuration: I see a lot of different locks in wait 
events, particularly a lot more MultiXactMemberLocks. But from my experiments 
with the synthetic workload, contention on MultiXactOffsetLock can be reduced by 
increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.

*** Question 1 ***
Is it safe to increase the number of buffers of the MultiXact\all SLRUs, recompile and 
run the database as usual?
I cannot experiment much with production, but I'm mostly sure that bigger 
buffers will solve the problem.

*** Question 2 ***
Could we perhaps add GUCs for the SLRU sizes? Are there any reasons not to make them 
configurable? I think multis, clog, subtransactions and others would benefit 
from a bigger buffer. But, probably, too many knobs can be confusing.

*** Question 3 ***
The MultiXact offset lock is always taken as an exclusive lock. This turns the 
MultiXact offset subsystem single-threaded. If someone has a good idea how to make it 
more concurrency-friendly, I'm willing to put some effort into this.
Probably I could just add LWLocks for each offset buffer page. Is it something 
worth doing? Or are there any hidden caveats and difficulties?
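
To sketch what I mean (a hypothetical standalone model, not how slru.c is actually 
structured, and ignoring buffer replacement entirely):

/* Standalone model of the per-page-lock idea (hypothetical; this is not how
 * slru.c is structured). If each cached page maps to its own read-write lock,
 * readers of different pages stop contending on one exclusive control lock. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PAGE_LOCKS 64

static pthread_rwlock_t page_locks[NUM_PAGE_LOCKS];

static pthread_rwlock_t *
lock_for_page(int64_t pageno)
{
	/* pages hash onto a fixed pool of locks; unrelated pages rarely collide */
	return &page_locks[pageno % NUM_PAGE_LOCKS];
}

int
main(void)
{
	for (int i = 0; i < NUM_PAGE_LOCKS; i++)
		pthread_rwlock_init(&page_locks[i], NULL);

	/* two readers of different pages take different locks and do not block */
	pthread_rwlock_rdlock(lock_for_page(10));
	pthread_rwlock_rdlock(lock_for_page(11));
	printf("page 10 -> lock %ld, page 11 -> lock %ld\n",
		   (long) (lock_for_page(10) - page_locks),
		   (long) (lock_for_page(11) - page_locks));
	pthread_rwlock_unlock(lock_for_page(11));
	pthread_rwlock_unlock(lock_for_page(10));
	return 0;
}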

Thanks!

Best regards, Andrey Borodin.