Re: Slow standby snapshot

2022-11-29 Thread Simon Riggs
On Tue, 29 Nov 2022 at 20:46, Tom Lane  wrote:
>
> I wrote:
> > That seems like a fairly bad idea: it will add extra contention
> > on ProcArrayLock, and I see no real strong argument that the path
> > can't get traversed often enough for that to matter.  It would
> > likely be better for KnownAssignedXidsCompress to obtain the lock
> > for itself, only after it knows there is something worth doing.
>
> Since we're running out of time in the current commitfest,
> I went ahead and changed that, and made the cosmetic fixes
> I wanted, and pushed.

That is a complete patch from multiple angles; very happy here.

Thanks for a great job.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Slow standby snapshot

2022-11-29 Thread Tom Lane
Michail Nikolaev  writes:
> The small thing I was thinking of adding in KnownAssignedXidsCompress is
> an assertion like

> Assert(MyBackendType == B_STARTUP);

Mmm ... given where the call sites are, we have got lots more problems
than this if some non-startup process reaches them.  I'm not sure this
is worth the trouble, but if it is, I'd put it in the callers.

>> I'd be willing to
>> make the memory barrier change anyway, because that seems like
>> a simple change that can't hurt.

> I'm going to create a separate commitfest entry for it, ok?

Right, since I closed this one already.

regards, tom lane




Re: Slow standby snapshot

2022-11-29 Thread Michail Nikolaev
Hello, Tom.

> Since we're running out of time in the current commitfest,
> I went ahead and changed that, and made the cosmetic fixes
> I wanted, and pushed.

Great, thanks!

The small thing I was thinking of adding in KnownAssignedXidsCompress is
an assertion like

Assert(MyBackendType == B_STARTUP);

Just to make it clearer that locking is not the only requirement for
the call.

>  I'd be willing to
> make the memory barrier change anyway, because that seems like
> a simple change that can't hurt.

I'm going to create a separate commitfest entry for it, ok?

Best regards,
Michail.




Re: Slow standby snapshot

2022-11-29 Thread Tom Lane
I wrote:
> That seems like a fairly bad idea: it will add extra contention
> on ProcArrayLock, and I see no real strong argument that the path
> can't get traversed often enough for that to matter.  It would
> likely be better for KnownAssignedXidsCompress to obtain the lock
> for itself, only after it knows there is something worth doing.

Since we're running out of time in the current commitfest,
I went ahead and changed that, and made the cosmetic fixes
I wanted, and pushed.

regards, tom lane




Re: Slow standby snapshot

2022-11-28 Thread Tom Lane
Michail Nikolaev  writes:
>> * when idle - since we have time to do it when that happens, which
>> happens often since most workloads are bursty

> I have added acquisition of ProcArrayLock for this case.

That seems like a fairly bad idea: it will add extra contention
on ProcArrayLock, and I see no real strong argument that the path
can't get traversed often enough for that to matter.  It would
likely be better for KnownAssignedXidsCompress to obtain the lock
for itself, only after it knows there is something worth doing.
(This ties back to previous discussion: the comment claiming it's
safe to read head/tail because we have the lock is misguided.
It's safe because we're the only process that changes them.
So we can make the heuristic decision before acquiring lock.)

While you could use the "reason" code to decide whether you need
to take the lock, it might be better to add a separate boolean
argument specifying whether the caller already has the lock.
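For illustration, roughly the shape being suggested (a sketch only; the
"reason" enum values and the haveLock flag are invented names here, not
committed code):

static void
KnownAssignedXidsCompress(KAXCompressReason reason, bool haveLock)
{
    ProcArrayStruct *pArray = procArray;
    int         head = pArray->headKnownAssignedXids;
    int         tail = pArray->tailKnownAssignedXids;

    /*
     * Heuristic decision first; reading head/tail unlocked is safe because
     * the startup process is the only writer of these fields.
     */
    if (reason != KAX_NO_SPACE &&
        (head - tail) == pArray->numKnownAssignedXids)
        return;                 /* no invalid entries, nothing to do */

    if (!haveLock)
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

    /* ... squeeze out invalid entries, then reset tail/head ... */

    if (!haveLock)
        LWLockRelease(ProcArrayLock);
}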

Beyond that, I don't see any issues except cosmetic ones.

> Also, I think while we are still in this context, it would be good to add:
> * Simon's optimization (1) for KnownAssignedXidsRemoveTree (it is
> simple and effective for some situations without any observed drawbacks)
> * Maybe my patch (2) for replacing known_assigned_xids_lck with a memory barrier?

Doesn't seem like we have any hard evidence in favor of either of
those being worth doing.  We especially haven't any evidence that
they'd still be worth doing after this patch.  I'd be willing to
make the memory barrier change anyway, because that seems like
a simple change that can't hurt.  I'm less enthused about the
patch at (1) because of the complexity it adds.

regards, tom lane




Re: Slow standby snapshot

2022-11-22 Thread Michail Nikolaev
Hello, everyone.

I have tried to put it all together.

> In the absence of that approach, falling back to a counter that
> compresses every N xids would be best, in addition to the two new
> forced compression events.

Done.

> Also, if we add more forced compressions, it seems like we should have
> a short-circuit for a forced compression where there's nothing to do.

Done.

> I'm also wondering why there's not an
>
> Assert(compress_index == pArray->numKnownAssignedXids);
>
> after the loop, to make sure our numKnownAssignedXids tracking
> is sane.

Done.

> * when idle - since we have time to do it when that happens, which
> happens often since most workloads are bursty

I have added acquisition of ProcArrayLock for this case.
Also, I have capped the frequency at once per second, to avoid
contention with a heavy read load in the case of small, episodic but
regular WAL traffic (WakeupRecovery() every 100ms, for example). Or is
it useless?
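A minimal sketch of that throttle, assuming the lastCompressTs variable
added by the attached patch (the exact placement is illustrative):

    TimestampTz now = GetCurrentTimestamp();

    /* Skip if we already compressed less than one second ago. */
    if (!TimestampDifferenceExceeds(lastCompressTs, now, 1000 /* ms */))
        return;
    lastCompressTs = now;
    /* ... fall through to the normal compression path ... */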

> It'd be more reliable
> to use a static counter to skip all but every N'th compress attempt
> (something we could do inside KnownAssignedXidsCompress itself, instead
> of adding warts at the call sites).

Done. I have added a “reason” enum for calling KnownAssignedXidsCompress
to keep it as clean as possible, but I am not sure I was successful here.

Also, I think while we are still in this context, it would be good to add:
* Simon's optimization (1) for KnownAssignedXidsRemoveTree (it is
simple and effective for some situations without any observed drawbacks)
* Maybe my patch (2) for replacing known_assigned_xids_lck with a memory barrier?

New version attached; running benchmarks now.
Preliminary results in the attachments (16 CPUs, 5000 max_connections,
now 64 active connections instead of 16).
Also, an interesting detail: with 64 connections, the vanilla version is
unable to recover its performance after a 30-second transaction on the primary.

[1]: 
https://www.postgresql.org/message-id/flat/CANbhV-Ey8HRYPvnvQnsZAteCfzN3VHVhZVKfWMYcnjMnSzs4dQ%40mail.gmail.com#05993cf2bc87e35e0dff38fec26b9805
[2]: 
https://www.postgresql.org/message-id/flat/CANtu0oiPoSdQsjRd6Red5WMHi1E83d2%2B-bM9J6dtWR3c5Tap9g%40mail.gmail.com#cc4827dee902978f93278732435e8521

--
Michail Nikolaev
From 926403598e9506edfa32c9534be9a4e48f756002 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Wed, 23 Nov 2022 00:13:32 +0300
Subject: [PATCH vnext] Currently, KnownAssignedXidsGetAndSetXmin requires an
 iterative loop through the KnownAssignedXids array, including xids marked as
 invalid. The performance impact is especially noticeable in the presence of
 long (a few seconds) transactions on the primary, a high value (a few
 thousand) of max_connections, and a high read workload on the standby. Most
 of the CPU is spent looping through KnownAssignedXids even though almost all
 xids are invalid. KnownAssignedXidsCompress removes invalid xids from time to
 time, but performance is still affected.

To increase performance, the frequency of running KnownAssignedXidsCompress is increased: it is now called for each xid where xid % 64 == 0 (the number was selected by running benchmarks). Also, the minimum bound on the number of elements to compress (4 * PROCARRAY_MAXPROCS) is removed.

Additionally, compression is called on RecoveryInfo and when the startup process is idle.

Simon Riggs and Michail Nikolaev, with help from Tom Lane and
Andres Freund.
---
 src/backend/access/transam/xlogrecovery.c |   3 +
 src/backend/storage/ipc/procarray.c   | 103 +-
 src/include/storage/standby.h |   1 +
 3 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..2cc9581cb6 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3836,6 +3836,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		streaming_reply_sent = true;
 	}
 
+	/* Do any background tasks that might benefit us later */
+	StartupProcessIdleMaintenance();
+
 	/* Update pg_stat_recovery_prefetch before sleeping. */
 	XLogPrefetcherComputeStats(xlogprefetcher);
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 283517d956..fde748519e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -274,6 +274,10 @@ static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
+/* Counters for KnownAssignedXids compression heuristic */
+static int transactionEndsCounter;
+static TimestampTz lastCompressTs;
+
 /*
  * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
  * the highest xid that might still be running that we don't have in
@@ -335,8 +339,16 @@ static void DisplayXidCache(void);
 #define xc_slow_answer_inc()		((void) 0)
 #endif			/* XIDCACHE_DEBUG */
 
+typedef enum KnownAssignedXidsCompressReason
+{
+	NO_SPACE,
+	RECOVERY_INFO,
+	TRANSACT

Re: Slow standby snapshot

2022-11-22 Thread Simon Riggs
On Tue, 22 Nov 2022 at 16:53, Tom Lane  wrote:
>
> Simon Riggs  writes:
> > On Tue, 22 Nov 2022 at 16:28, Tom Lane  wrote:
> >> If we do those things, do we need a wasted-work counter at all?
>
> > The wasted work counter works well to respond to heavy read-only
> > traffic and also avoids wasted compressions for write-heavy workloads.
> > So I still like it the best.
>
> This argument presumes that maintenance of the counter is free,
> which it surely is not.  I don't know how bad contention on that
> atomically-updated variable could get, but it seems like it could
> be an issue when lots of processes are acquiring snapshots.

I understand. I was assuming that you and Andres liked that approach.

In the absence of that approach, falling back to a counter that
compresses every N xids would be best, in addition to the two new
forced compression events.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Slow standby snapshot

2022-11-22 Thread Tom Lane
Simon Riggs  writes:
> On Tue, 22 Nov 2022 at 16:28, Tom Lane  wrote:
>> If we do those things, do we need a wasted-work counter at all?

> The wasted work counter works well to respond to heavy read-only
> traffic and also avoids wasted compressions for write-heavy workloads.
> So I still like it the best.

This argument presumes that maintenance of the counter is free,
which it surely is not.  I don't know how bad contention on that
atomically-updated variable could get, but it seems like it could
be an issue when lots of processes are acquiring snapshots.

regards, tom lane




Re: Slow standby snapshot

2022-11-22 Thread Simon Riggs
On Tue, 22 Nov 2022 at 16:28, Tom Lane  wrote:
>
> Simon Riggs  writes:
> > We seem to have replaced one magic constant with another, so not sure
> > if this is autotuning, but I like it much better than what we had
> > before (i.e. better than my prev patch).
>
> Yeah, the magic constant is still magic, even if it looks like it's
> not terribly sensitive to the exact value.
>
> > 1. I was surprised that you removed the limits on size and just had
> > the wasted work limit. If there is no read traffic that will mean we
> > hardly ever compress, which means the removal of xids at commit will
> > get slower over time.  I would prefer that we forced compression on a
> > regular basis, such as every time we process an XLOG_RUNNING_XACTS
> > message (every 15s), as well as when we hit certain size limits.
>
> > 2. If there is lots of read traffic but no changes flowing, it would
> > also make sense to force compression when the startup process goes
> > idle rather than wait for the work to be wasted first.
>
> If we do those things, do we need a wasted-work counter at all?
>
> I still suspect that 90% of the problem is the max_connections
> dependency in the existing heuristic, because of the fact that
> you have to push max_connections to the moon before it becomes
> a measurable problem.  If we do
>
> -        if (nelements < 4 * PROCARRAY_MAXPROCS ||
> -            nelements < 2 * pArray->numKnownAssignedXids)
> +        if (nelements < 2 * pArray->numKnownAssignedXids)
>
> and then add the forced compressions you suggest, where
> does that put us?

The forced compressions I propose happen
* when idle - since we have time to do it when that happens, which
happens often since most workloads are bursty
* every 15s - since we already have lock
which is overall much less often than every 64 commits, as benchmarked
by Michail.
I didn't mean to imply that superceded the wasted work approach, it
was meant to be in addition to.

The wasted work counter works well to respond to heavy read-only
traffic and also avoids wasted compressions for write-heavy workloads.
So I still like it the best.

> Also, if we add more forced compressions, it seems like we should have
> a short-circuit for a forced compression where there's nothing to do.
> So more or less like
>
> nelements = head - tail;
> if (!force)
> {
>     if (nelements < 2 * pArray->numKnownAssignedXids)
>         return;
> }
> else
> {
>     if (nelements == pArray->numKnownAssignedXids)
>         return;
> }

+1

> I'm also wondering why there's not an
>
> Assert(compress_index == pArray->numKnownAssignedXids);
>
> after the loop, to make sure our numKnownAssignedXids tracking
> is sane.

+1

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Slow standby snapshot

2022-11-22 Thread Tom Lane
Simon Riggs  writes:
> We seem to have replaced one magic constant with another, so not sure
> if this is autotuning, but I like it much better than what we had
> before (i.e. better than my prev patch).

Yeah, the magic constant is still magic, even if it looks like it's
not terribly sensitive to the exact value.

> 1. I was surprised that you removed the limits on size and just had
> the wasted work limit. If there is no read traffic that will mean we
> hardly ever compress, which means the removal of xids at commit will
> get slower over time.  I would prefer that we forced compression on a
> regular basis, such as every time we process an XLOG_RUNNING_XACTS
> message (every 15s), as well as when we hit certain size limits.

> 2. If there is lots of read traffic but no changes flowing, it would
> also make sense to force compression when the startup process goes
> idle rather than wait for the work to be wasted first.

If we do those things, do we need a wasted-work counter at all?

I still suspect that 90% of the problem is the max_connections
dependency in the existing heuristic, because of the fact that
you have to push max_connections to the moon before it becomes
a measurable problem.  If we do

-        if (nelements < 4 * PROCARRAY_MAXPROCS ||
-            nelements < 2 * pArray->numKnownAssignedXids)
+        if (nelements < 2 * pArray->numKnownAssignedXids)

and then add the forced compressions you suggest, where
does that put us?

Also, if we add more forced compressions, it seems like we should have
a short-circuit for a forced compression where there's nothing to do.
So more or less like

nelements = head - tail;
if (!force)
{
    if (nelements < 2 * pArray->numKnownAssignedXids)
        return;
}
else
{
    if (nelements == pArray->numKnownAssignedXids)
        return;
}

I'm also wondering why there's not an

Assert(compress_index == pArray->numKnownAssignedXids);

after the loop, to make sure our numKnownAssignedXids tracking
is sane.

regards, tom lane




Re: Slow standby snapshot

2022-11-20 Thread Simon Riggs
On Sun, 20 Nov 2022 at 13:45, Michail Nikolaev
 wrote:

> If such an approach looks committable, I could do more careful
> performance testing to find the best value for
> WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS.

Nice patch.

We seem to have replaced one magic constant with another, so not sure
if this is autotuning, but I like it much better than what we had
before (i.e. better than my prev patch).

Few thoughts

1. I was surprised that you removed the limits on size and just had
the wasted work limit. If there is no read traffic that will mean we
hardly ever compress, which means the removal of xids at commit will
get slower over time.  I would prefer that we forced compression on a
regular basis, such as every time we process an XLOG_RUNNING_XACTS
message (every 15s), as well as when we hit certain size limits.

2. If there is lots of read traffic but no changes flowing, it would
also make sense to force compression when the startup process goes
idle rather than wait for the work to be wasted first.

Quick patch to add those two compression events also.

That should favour the smaller wasted work limits.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


events_that_force_compression.v1.patch
Description: Binary data


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Oh, seems like it is not my day :) The image is fixed again.


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Oops, wrong image; this is the correct one. But these are 1-run tests,
so they show only a basic correlation.


Re: Slow standby snapshot

2022-11-20 Thread Michail Nikolaev
Hello.

On Wed, Nov 16, 2022 at 3:44 AM Andres Freund  wrote:
> Approach 1:

> We could have an atomic variable in ProcArrayStruct that counts the amount of
> wasted effort and have processes update it whenever they've wasted a
> meaningful amount of effort.  Something like counting the skipped elements in
> KnownAssignedXidsGetAndSetXmin in a function local static variable and
> updating the shared counter whenever that reaches

I made a WIP patch for that approach and ran some initial tests. It
seems to work pretty well; at least it is better than the previous
approaches for standbys without a high read-only load.

Both patch and graph are attached. The odd-looking number is the limit
on wasted work before compression is performed.
I have used the same testing script (1) and configuration as before
(two 16-CPU machines, a long transaction on the primary at the 60th
second, simple-update and select-only for pgbench).

If such an approach looks committable, I could do more careful
performance testing to find the best value for
WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS.

[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
--
Michail Nikolaev
Index: src/backend/storage/ipc/procarray.c
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
--- a/src/backend/storage/ipc/procarray.c	(revision fb32748e32e2b6b2fcb32220980b93d5436f855e)
+++ b/src/backend/storage/ipc/procarray.c	(date 1668950653116)
@@ -272,6 +272,7 @@
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsWastedSnapshotWork;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -451,6 +452,10 @@
 			ShmemInitStruct("KnownAssignedXidsValid",
 			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+		KnownAssignedXidsWastedSnapshotWork = (pg_atomic_uint32 *)
+			ShmemInitStruct("KnownAssignedXidsWastedSnapshotWork",
+			sizeof(pg_atomic_uint32), &found);
+		pg_atomic_init_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 	}
 }
 
@@ -4616,20 +4621,9 @@
 
 	if (!force)
 	{
-		/*
-		 * If we can choose how much to compress, use a heuristic to avoid
-		 * compressing too often or not often enough.
-		 *
-		 * Heuristic is if we have a large enough current spread and less than
-		 * 50% of the elements are currently in use, then compress. This
-		 * should ensure we compress fairly infrequently. We could compress
-		 * less often though the virtual array would spread out more and
-		 * snapshots would become more expensive.
-		 */
-		int			nelements = head - tail;
-
-		if (nelements < 4 * PROCARRAY_MAXPROCS ||
-			nelements < 2 * pArray->numKnownAssignedXids)
+#define WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS 111
+		if (pg_atomic_read_u32(KnownAssignedXidsWastedSnapshotWork)
+	< WASTED_SNAPSHOT_WORK_LIMIT_TO_COMPRESS)
 			return;
 	}
 
@@ -4650,6 +4644,8 @@
 
 	pArray->tailKnownAssignedXids = 0;
 	pArray->headKnownAssignedXids = compress_index;
+	/* Reset wasted work counter */
+	pg_atomic_write_u32(KnownAssignedXidsWastedSnapshotWork, 0);
 }
 
 /*
@@ -5031,6 +5027,7 @@
 KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 			   TransactionId xmax)
 {
+	ProcArrayStruct *pArray = procArray;
 	int			count = 0;
 	int			head,
 tail;
@@ -5078,6 +5075,10 @@
 		}
 	}
 
+	/* Add number of invalid items scanned to wasted work counter */
+	pg_atomic_add_fetch_u32(KnownAssignedXidsWastedSnapshotWork,
+			(head - tail) - pArray->numKnownAssignedXids);
+
 	return count;
 }
 


Re: Slow standby snapshot

2022-11-16 Thread Michail Nikolaev
Hello everyone.

> However ... I tried to reproduce the original complaint, and
> failed entirely.  I do see KnownAssignedXidsGetAndSetXmin
> eating a bit of time in the standby backends, but it's under 1%
> and doesn't seem to be rising over time.  Perhaps we've already
> applied some optimization that ameliorates the problem?  But
> I tested v13 as well as HEAD, and got the same results.

> Hmm.  I wonder if my inability to detect a problem is because the startup
> process does keep ahead of the workload on my machine, while it fails
> to do so on the OP's machine.  I've only got a 16-CPU machine at hand,
> which probably limits the ability of the primary to saturate the standby's
> startup process.

Yes, the optimization by Andres Freund made things much better, but the
impact is still noticeable.

I was also using 16-CPU machines - two of them (primary and standby).

Here are the scripts (1) I was using for the benchmark - maybe they could help.


> Nowadays we've *got* those primitives.  Can we get rid of
> known_assigned_xids_lck, and if so would it make a meaningful
> difference in this scenario?

I already tried it, but was unable to find real benefits.
WIP patch attached.

Hm, I see I sent it to the list, but it is absent from the archives... Just
quoting from it:

> The first potential positive effect I could see is the
> (TransactionIdIsInProgress -> KnownAssignedXidsSearch) locking, but it
> seems like it is not on the standby hot path.

> The second one is locking for KnownAssignedXidsGetAndSetXmin (building a
> snapshot). But I was unable to measure any impact; it wasn’t visible
> separately in test (3).

> Maybe someone knows a scenario that causes known_assigned_xids_lck or
> TransactionIdIsInProgress to become a bottleneck on standby?

The last question is still relevant :)

> I think it might be a bigger effect than one might immediately think. Because
> the spinlock will typically be on the same cacheline as head/tail, and because
> every spinlock acquisition requires the cacheline to be modified (and thus
> owned mexlusively) by the current core, uses of head/tail will very commonly
> be cache misses even in workloads without a lot of KAX activity.

I tried to find some way to demonstrate a noticeable impact here in
practice, but without success.

>> But yeah, it does feel like the proposed
>> approach is only going to be optimal over a small range of conditions.

> In particular, it doesn't adapt at all to workloads that don't replay all that
> much, but do compute a lot of snapshots.

Approach (2) was optimized so that no one except non-startup processes
does any additional work (the approach with offsets to skip gaps while
building a snapshot).


[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
[2]: 
https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382

--
Michail Nikolaev
From 9759ea174db5e140415e4ce80a62f63075e91fe3 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sun, 21 Nov 2021 21:23:02 +0300
Subject: [PATCH] memory barrier instead of spinlock

---
 src/backend/storage/ipc/procarray.c | 42 -
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4da53a5b3f..bca336a80d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -60,7 +60,6 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -81,7 +80,6 @@ typedef struct ProcArrayStruct
     int         numKnownAssignedXids;   /* current # of valid entries */
     int         tailKnownAssignedXids;  /* index of oldest valid element */
     int         headKnownAssignedXids;  /* index of newest element, + 1 */
-    slock_t     known_assigned_xids_lck;    /* protects head/tail pointers */
 
     /*
      * Highest subxid that has been removed from KnownAssignedXids array to
@@ -424,7 +422,6 @@ CreateSharedProcArray(void)
     procArray->numKnownAssignedXids = 0;
     procArray->tailKnownAssignedXids = 0;
     procArray->headKnownAssignedXids = 0;
-    SpinLockInit(&procArray->known_assigned_xids_lck);
     procArray->lastOverflowedXid = InvalidTransactionId;
     procArray->replication_slot_xmin = InvalidTransactionId;
     procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4552,10 +4549,9 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * pointer.  This wouldn't require any lock at all, except that on machines
  * with weak memory ordering we need to be careful that other processors
  * see the array element changes before they see the head pointer change.
- * We handle this by usi

Re: Slow standby snapshot

2022-11-15 Thread Simon Riggs
On Wed, 16 Nov 2022 at 00:15, Tom Lane  wrote:
>
> Andres Freund  writes:
> > On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> >> Hence more frequent compression is effective at reducing the overhead.
> >> But too frequent compression slows down the startup process, which
> >> can't then keep up.
> >> So we're just looking for an optimal frequency of compression for any
> >> given workload.
>
> > What about making the behaviour adaptive based on the amount of wasted effort
> > during those two operations, rather than just a hardcoded "emptiness" factor?
>
> Not quite sure how we could do that, given that those things aren't even
> happening in the same process.  But yeah, it does feel like the proposed
> approach is only going to be optimal over a small range of conditions.

I have not been able to think of a simple way to autotune it.

> > I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea -
> > if you have a workload with plenty of subxids you might end up never compressing
> > because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid
> > most/all of the time.
>
> Yeah, I didn't think that was too safe either.

> It'd be more reliable
> to use a static counter to skip all but every N'th compress attempt
> (something we could do inside KnownAssignedXidsCompress itself, instead
> of adding warts at the call sites).

I was thinking exactly that myself, for the reason of keeping it all
inside KnownAssignedXidsCompress().

--
Simon Riggs                http://www.EnterpriseDB.com/




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 19:46:26 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > To me it sounds like known_assigned_xids_lck is pointless and the talk about
> > memory barriers a red herring, since all modifications have to happen with
> > ProcArrayLock held exclusively and all reads with ProcArrayLock held in share
> > mode. It can't be legal to modify head/tail or the contents of the array
> > outside of that. And lwlocks provide sufficient barrier semantics.
> 
> No ... RecordKnownAssignedTransactionIds calls KnownAssignedXidsAdd
> with exclusive_lock = false, and in the typical case that will not
> acquire ProcArrayLock at all.  Since there's only one writer, that
> seems safe enough, and I believe the commentary's claim that we
> really just need to be sure the head-pointer update is seen
> after the array updates.

Oh, right. I focussed too much on the part of the comment quoted in your email.

I still think it's misleading for the comment to say that the tail can be
modified with the spinlock - I don't see how that'd ever be correct. Nor could
head be safely decreased with just the spinlock.


Too bad, that seems to make the idea of compressing in other backends a
non-starter unfortunately :(. Although - are we really gaining that much by
avoiding ProcArrayLock in the RecordKnownAssignedTransactionIds() case? It
only happens when latestObservedXid is increased, and we'll remove them at
commit with the exclusive lock held. Afaict that's the only KAX access that
doesn't also require ProcArrayLock?

Greetings,

Andres Freund




Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Andres Freund  writes:
> To me it sounds like known_assigned_xids_lck is pointless and the talk about
> memory barriers a red herring, since all modifications have to happen with
> ProcArrayLock held exclusively and all reads with ProcArrayLock held in share
> mode. It can't be legal to modify head/tail or the contents of the array
> outside of that. And lwlocks provide sufficient barrier semantics.

No ... RecordKnownAssignedTransactionIds calls KnownAssignedXidsAdd
with exclusive_lock = false, and in the typical case that will not
acquire ProcArrayLock at all.  Since there's only one writer, that
seems safe enough, and I believe the commentary's claim that we
really just need to be sure the head-pointer update is seen
after the array updates.

regards, tom lane




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 19:15:15 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> >> Hence more frequent compression is effective at reducing the overhead.
> >> But too frequent compression slows down the startup process, which
> >> can't then keep up.
> >> So we're just looking for an optimal frequency of compression for any
> >> given workload.
> 
> > What about making the behaviour adaptive based on the amount of wasted effort
> > during those two operations, rather than just a hardcoded "emptiness" factor?
> 
> Not quite sure how we could do that, given that those things aren't even
> happening in the same process.

I'm not certain what the best approach is, but I don't think the
not-the-same-process part is a blocker.


Approach 1:

We could have an atomic variable in ProcArrayStruct that counts the amount of
wasted effort and have processes update it whenever they've wasted a
meaningful amount of effort.  Something like counting the skipped elements in
KnownAssignedXidsGetAndSetXmin in a function local static variable and
updating the shared counter whenever that reaches



Approach 2:

Perform conditional cleanup in non-startup processes - I think that'd actually
be ok, as long as ProcArrayLock is held exclusively.  We could count the number
of skipped elements in KnownAssignedXidsGetAndSetXmin() in a local variable,
and whenever that gets too high, conditionally acquire ProcArrayLock
exclusively at the end of GetSnapshotData() and compress KAX. Reset the local
variable regardless of whether we get the lock, to avoid causing a lot of
contention.

The nice part is that this would work even without the startup process making
progress. The not-so-nice part is that it'd require a bit of code study to
figure out whether it's safe to modify KAX from outside the startup process.
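A rough sketch of that shape (the function name, counter placement and the
threshold are all invented here, just to make the idea concrete):

static int  kax_skipped = 0;    /* per-backend wasted-effort counter */

static void
MaybeCompressKAX(int skipped_this_snapshot)
{
    kax_skipped += skipped_this_snapshot;

    if (kax_skipped > 1000)     /* arbitrary threshold */
    {
        /* Reset whether or not we get the lock, to limit contention. */
        kax_skipped = 0;

        if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
        {
            KnownAssignedXidsCompress(true);    /* force */
            LWLockRelease(ProcArrayLock);
        }
    }
}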



> But yeah, it does feel like the proposed
> approach is only going to be optimal over a small range of conditions.

In particular, it doesn't adapt at all to workloads that don't replay all that
much, but do compute a lot of snapshots.

Greetings,

Andres Freund




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> On Tue, 15 Nov 2022 at 23:06, Tom Lane  wrote:
> >
> > BTW, while nosing around this code I came across this statement
> > (procarray.c, about line 4550 in HEAD):
> >
> >  * ... To add XIDs to the array, we just insert
> >  * them into slots to the right of the head pointer and then advance the 
> > head
> >  * pointer.  This wouldn't require any lock at all, except that on machines
> >  * with weak memory ordering we need to be careful that other processors
> >  * see the array element changes before they see the head pointer change.
> >  * We handle this by using a spinlock to protect reads and writes of the
> >  * head/tail pointers.  (We could dispense with the spinlock if we were to
> >  * create suitable memory access barrier primitives and use those instead.)
> >  * The spinlock must be taken to read or write the head/tail pointers unless
> >  * the caller holds ProcArrayLock exclusively.
> >
> > Nowadays we've *got* those primitives.  Can we get rid of
> > known_assigned_xids_lck, and if so would it make a meaningful
> > difference in this scenario?

Forgot to reply to this part:

I'm confused by the explanation of the semantics of the spinlock:

  "The spinlock must be taken to read or write the head/tail pointers
  unless the caller holds ProcArrayLock exclusively."

makes it sound like it'd be valid to modify the KnownAssignedXids without
holding ProcArrayLock exclusively. Doesn't that contradict only needing the
spinlock because of memory ordering?

And when would it be valid to do any modifications of KnownAssignedXids
without holding ProcArrayLock exclusively? Concurrent readers of
KnownAssignedXids will operate on a snapshot of head/tail (the spinlock is
only ever held to query them). Afaict any such modification would be racy,
because subsequent modifications of KnownAssignedXids could overwrite contents
of KnownAssignedXids that another backend is in the process of reading, based
on the stale snapshot of head/tail.


To me it sounds like known_assigned_xids_lck is pointless and the talk about
memory barriers a red herring, since all modifications have to happen with
ProcArrayLock held exclusively and all reads with ProcArrayLock held in share
mode. It can't be legal to modify head/tail or the contents of the array
outside of that. And lwlocks provide sufficient barrier semantics.



> I think you could do that *as well*, since it does act as an overhead
> but that is not related to the main issues

I think it might be a bigger effect than one might immediately think. Because
the spinlock will typically be on the same cacheline as head/tail, and because
every spinlock acquisition requires the cacheline to be modified (and thus
owned exclusively) by the current core, uses of head/tail will very commonly
be cache misses even in workloads without a lot of KAX activity.

Greetings,

Andres Freund




Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Simon Riggs  writes:
> but that is not related to the main issues:

> * COMMITs: xids are removed from the array by performing a binary
> search - this gets more and more expensive as the array gets wider
> * SNAPSHOTs: scanning the array for snapshots becomes more expensive
> as the array gets wider

Right.  The case complained of in this thread is SNAPSHOT cost,
since that's what KnownAssignedXidsGetAndSetXmin is used for.

> Hence more frequent compression is effective at reducing the overhead.
> But too frequent compression slows down the startup process, which
> can't then keep up.
> So we're just looking for an optimal frequency of compression for any
> given workload.

Hmm.  I wonder if my inability to detect a problem is because the startup
process does keep ahead of the workload on my machine, while it fails
to do so on the OP's machine.  I've only got a 16-CPU machine at hand,
which probably limits the ability of the primary to saturate the standby's
startup process.  If that's accurate, reducing the frequency of
compression attempts could be counterproductive in my workload range.
It would help the startup process when that is the bottleneck --- but
that wasn't what the OP complained of, so I'm not sure it helps him
either.

It seems like maybe what we should do is just drop the "nelements < 4 *
PROCARRAY_MAXPROCS" part of the existing heuristic, which is clearly
dangerous with large max_connections settings, and in any case doesn't
have a clear connection to either the cost of scanning or the cost
of compressing.  Or we could replace that with a hardwired constant,
like "nelements < 400".

regards, tom lane




Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-15 23:14:42 +, Simon Riggs wrote:
>> Hence more frequent compression is effective at reducing the overhead.
>> But too frequent compression slows down the startup process, which
>> can't then keep up.
>> So we're just looking for an optimal frequency of compression for any
>> given workload.

> What about making the behaviour adaptive based on the amount of wasted effort
> during those two operations, rather than just a hardcoded "emptiness" factor?

Not quite sure how we could do that, given that those things aren't even
happening in the same process.  But yeah, it does feel like the proposed
approach is only going to be optimal over a small range of conditions.

> I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea -
> if you have a workload with plenty of subxids you might end up never compressing
> because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid
> most/all of the time.

Yeah, I didn't think that was too safe either.  It'd be more reliable
to use a static counter to skip all but every N'th compress attempt
(something we could do inside KnownAssignedXidsCompress itself, instead
of adding warts at the call sites).
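That is, roughly (a sketch only; the constant and its value are arbitrary
here):

#define KAX_COMPRESS_FREQUENCY 128  /* illustrative only */

static void
KnownAssignedXidsCompress(bool force)
{
    static int  attempts = 0;   /* only the startup process calls this */

    if (!force && ++attempts < KAX_COMPRESS_FREQUENCY)
        return;
    attempts = 0;

    /* ... existing compression logic ... */
}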

regards, tom lane




Re: Slow standby snapshot

2022-11-15 Thread Andres Freund
Hi,

On 2022-11-15 23:14:42 +, Simon Riggs wrote:
> * COMMITs: xids are removed from the array by performing a binary
> search - this gets more and more expensive as the array gets wider
> * SNAPSHOTs: scanning the array for snapshots becomes more expensive
> as the array gets wider
>
> Hence more frequent compression is effective at reducing the overhead.
> But too frequent compression slows down the startup process, which
> can't then keep up.

> So we're just looking for an optimal frequency of compression for any
> given workload.

What about making the behaviour adaptive based on the amount of wasted effort
during those two operations, rather than just a hardcoded "emptiness" factor?
It's common that basically no snapshots are taken, and constantly compressing
in that case is likely going to be wasted effort.


The heuristic the patch adds to KnownAssignedXidsRemoveTree() seems somewhat
misplaced to me. When called from ProcArrayApplyXidAssignment() we probably
should always compress - it'll only be issued when a substantial amount of
subxids have been assigned, so there'll be a bunch of cleanup work.  It makes
more sense from ExpireTreeKnownAssignedTransactionIds(), since it will very
commonly be called for individual xids - but even then, afaict we should take
into account how many xids we've just expired.

I don't think the xids % KAX_COMPRESS_FREQUENCY == 0 filter is a good idea -
if you have a workload with plenty of subxids you might end up never compressing
because xids divisible by KAX_COMPRESS_FREQUENCY will end up as a subxid
most/all of the time.


Re cost of processing at COMMITs: We do a fresh binary search for each subxid
right now. There's a lot of locality in the xids that can be expired. Perhaps
we could have a cache for the position of the latest value in
KnownAssignedXidsSearch() and search linearly if the distance from the last
looked up value isn't large?
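A sketch of that idea (the helper name, probe window and cache placement are
invented here):

static int  last_found = 0;     /* only used by the startup process */

static int
KAXSearchCached(TransactionId xid, int tail, int head)
{
    int     start = Max(last_found, tail);

    /* Probe a few entries after the previous hit before binary search. */
    for (int i = start; i < Min(start + 8, head); i++)
    {
        if (KnownAssignedXidsValid[i] && KnownAssignedXids[i] == xid)
        {
            last_found = i;
            return i;
        }
    }
    /* Fall back to the existing binary search over [tail, head). */
    return KAXBinarySearch(xid, tail, head);    /* hypothetical wrapper */
}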


Greetings,

Andres




Re: Slow standby snapshot

2022-11-15 Thread Simon Riggs
On Tue, 15 Nov 2022 at 23:06, Tom Lane  wrote:
>
> BTW, while nosing around this code I came across this statement
> (procarray.c, about line 4550 in HEAD):
>
>  * ... To add XIDs to the array, we just insert
>  * them into slots to the right of the head pointer and then advance the head
>  * pointer.  This wouldn't require any lock at all, except that on machines
>  * with weak memory ordering we need to be careful that other processors
>  * see the array element changes before they see the head pointer change.
>  * We handle this by using a spinlock to protect reads and writes of the
>  * head/tail pointers.  (We could dispense with the spinlock if we were to
>  * create suitable memory access barrier primitives and use those instead.)
>  * The spinlock must be taken to read or write the head/tail pointers unless
>  * the caller holds ProcArrayLock exclusively.
>
> Nowadays we've *got* those primitives.  Can we get rid of
> known_assigned_xids_lck, and if so would it make a meaningful
> difference in this scenario?

I think you could do that *as well*, since it does act as an overhead
but that is not related to the main issues:

* COMMITs: xids are removed from the array by performing a binary
search - this gets more and more expensive as the array gets wider
* SNAPSHOTs: scanning the array for snapshots becomes more expensive
as the array gets wider

Hence more frequent compression is effective at reducing the overhead.
But too frequent compression slows down the startup process, which
can't then keep up.

So we're just looking for an optimal frequency of compression for any
given workload.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
BTW, while nosing around this code I came across this statement
(procarray.c, about line 4550 in HEAD):

 * ... To add XIDs to the array, we just insert
 * them into slots to the right of the head pointer and then advance the head
 * pointer.  This wouldn't require any lock at all, except that on machines
 * with weak memory ordering we need to be careful that other processors
 * see the array element changes before they see the head pointer change.
 * We handle this by using a spinlock to protect reads and writes of the
 * head/tail pointers.  (We could dispense with the spinlock if we were to
 * create suitable memory access barrier primitives and use those instead.)
 * The spinlock must be taken to read or write the head/tail pointers unless
 * the caller holds ProcArrayLock exclusively.

Nowadays we've *got* those primitives.  Can we get rid of
known_assigned_xids_lck, and if so would it make a meaningful
difference in this scenario?
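The pattern would presumably look something like this (a sketch only; the
real change would need to audit every head/tail access):

    /* Writer side (startup process), appending an xid: */
    KnownAssignedXids[head] = xid;
    KnownAssignedXidsValid[head] = true;
    pg_write_barrier();         /* publish entries before moving head */
    pArray->headKnownAssignedXids = head + 1;

    /* Reader side (backend building a snapshot): */
    head = pArray->headKnownAssignedXids;
    pg_read_barrier();          /* read head before reading entries */
    /* ... scan entries in [tail, head) ... */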

regards, tom lane




Re: Slow standby snapshot

2022-11-15 Thread Tom Lane
Simon Riggs  writes:
> I've cleaned up the comments and used a #define for the constant.
> IMHO this is ready for commit. I will add it to the next CF.

I looked at this a little.  It's a simple enough patch, and if it
solves the problem then I sure like it better than the previous
ideas in this thread.

However ... I tried to reproduce the original complaint, and
failed entirely.  I do see KnownAssignedXidsGetAndSetXmin
eating a bit of time in the standby backends, but it's under 1%
and doesn't seem to be rising over time.  Perhaps we've already
applied some optimization that ameliorates the problem?  But
I tested v13 as well as HEAD, and got the same results.

regards, tom lane




Re: Slow standby snapshot

2022-11-08 Thread Thomas Munro
On Wed, Nov 9, 2022 at 1:54 PM Andres Freund  wrote:
> On 2022-11-09 11:42:36 +1300, Thomas Munro wrote:
> > On Sat, Sep 17, 2022 at 6:27 PM Simon Riggs
> >  wrote:
> > > I've cleaned up the comments and used a #define for the constant.
> > >
> > > IMHO this is ready for commit. I will add it to the next CF.
> >
> > FYI This had many successful cfbot runs but today it blew up on
> > Windows when the assertion TransactionIdPrecedesOrEquals(safeXid,
> > snap->xmin) failed:
> >
> > https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt
>
> I don't immediately see how that could be connected to this patch - afaict
> that crash wasn't during recovery, and the modified functions should only be
> active during hot standby.

Indeed, sorry for the noise.




Re: Slow standby snapshot

2022-11-08 Thread Andres Freund
Hi,

On 2022-11-09 11:42:36 +1300, Thomas Munro wrote:
> On Sat, Sep 17, 2022 at 6:27 PM Simon Riggs
>  wrote:
> > I've cleaned up the comments and used a #define for the constant.
> >
> > IMHO this is ready for commit. I will add it to the next CF.
> 
> FYI This had many successful cfbot runs but today it blew up on
> Windows when the assertion TransactionIdPrecedesOrEquals(safeXid,
> snap->xmin) failed:
> 
> https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt

I don't immediately see how that could be connected to this patch - afaict
that crash wasn't during recovery, and the modified functions should only be
active during hot standby.

Greetings,

Andres Freund




Re: Slow standby snapshot

2022-11-08 Thread Thomas Munro
On Sat, Sep 17, 2022 at 6:27 PM Simon Riggs
 wrote:
> I've cleaned up the comments and used a #define for the constant.
>
> IMHO this is ready for commit. I will add it to the next CF.

FYI This had many successful cfbot runs but today it blew up on
Windows when the assertion TransactionIdPrecedesOrEquals(safeXid,
snap->xmin) failed:

https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt




Re: Slow standby snapshot

2022-09-16 Thread Simon Riggs
On Fri, 16 Sept 2022 at 17:08, Michail Nikolaev
 wrote:
>
> Hello everyone.
>
> To find the best frequency for calling KnownAssignedXidsCompress in
> Simon's patch, I made a set of benchmarks. It looks like every 8th xid
> is a little too often.
>
> Setup and method are the same as before (1): 16-core machines,
> max_connections = 5000. Tests ran for about a day, 220 runs
> in total (each version 20 times, evenly distributed throughout the
> day).
>
> Starting from the 60th second, a 30-second-long transaction was started on the primary.
>
> Graphs attached. So it looks like 64 is the best value here; it
> even gives a little more TPS than smaller values.
>
> Yes, such a benchmark does not cover all possible cases, but it is
> better to measure at least something when selecting constants :)

This is very good and clear, thank you.


> If someone has ideas for different benchmark scenarios, please share them.

> So, the updated version (with 64 and a commit message) is attached too.
>
> [1]: 
> https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6

I've cleaned up the comments and used a #define for the constant.

IMHO this is ready for commit. I will add it to the next CF.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


v8c-new-heuristic-to-compress-KnownAssignedXids.patch
Description: Binary data


Re: Slow standby snapshot

2022-08-02 Thread Andrey Borodin



> On 2 Aug 2022, at 20:18, Simon Riggs  wrote:
> 
> On Tue, 2 Aug 2022 at 12:32, Andrey Borodin  wrote:
> 
>> KnownAssignedXidsRemoveTree() only compresses with probability 1/8, but it is
>> still O(N*N).
> 
> Currently it is O(NlogS), not quite as bad as O(N^2).
Consider a workload with a big number of simultaneously active xids. The
number of calls to KnownAssignedXidsRemoveTree() is proportional to the number
of these xids, and the complexity of each call is proportional to it as well,
because every call to KnownAssignedXidsRemoveTree() may well run a compression
(which will not compress much).

Compression is not an answer to the performance problems, because it can be a
burden itself. Instead, we can make compression unnecessary for building a
snapshot's xids-in-progress list.


> Since each xid in the tree is always stored to the right, it should be
> possible to make that significantly better by starting each binary
> search from the next element, rather than the start of the array.
> Something like the attached might help, but we can probably make that
> cache conscious to improve things even more.

As Kyotaro-san correctly mentioned, the performance degradation happened in
KnownAssignedXidsGetAndSetXmin(), which does not do a binary search.



> On 3 Aug 2022, at 06:04, Kyotaro Horiguchi  wrote:
> 
> The original complaint is KnownAssignedXidsGetAndSetXmin can get very
> slow for large max_connections. I'm not sure what was happening on the
> KAXidsArray at the time precisely, but if the array starts with a
> large number of invalid entries (I guess it is likely), and the
> variable "start" were available to the function (that is, it were
> placed in procArray), that strategy seems to work for this case. With
> this strategy we can avoid compression if only the relatively narrow
> range in the array is occupied.

This applies to only one workload, where all transactions are very short. If we
have even a tiny fraction of mid-length or long transactions, this heuristic
does not help anymore.


Thank you!

Best regards, Andrey Borodin. 







Re: Slow standby snapshot

2022-08-02 Thread Kyotaro Horiguchi
At Tue, 2 Aug 2022 16:18:44 +0100, Simon Riggs  wrote in
> On Tue, 2 Aug 2022 at 12:32, Andrey Borodin  wrote:
> 
> > KnownAssignedXidsRemoveTree() only compresses with probability 1/8, but it is
> > still O(N*N).
> 
> Currently it is O(NlogS), not quite as bad as O(N^2).
> 
> Since each xid in the tree is always stored to the right, it should be
> possible to make that significantly better by starting each binary
> search from the next element, rather than the start of the array.
> Something like the attached might help, but we can probably make that
> cache conscious to improve things even more.

The original complaint is KnownAssignedXidsGetAndSetXmin can get very
slow for large max_connections. I'm not sure what was happening on the
KAXidsArray at the time precisely, but if the array starts with a
large number of invalid entries (I guess it is likely), and the
variable "start" were available to the function (that is, it were
placed in procArray), that strategy seems to work for this case.  With
this strategy we can avoid compression if only the relatively narrow
range in the array is occupied.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Slow standby snapshot

2022-08-02 Thread Simon Riggs
On Tue, 2 Aug 2022 at 12:32, Andrey Borodin  wrote:

> KnownAssignedXidsRemoveTree() only compresses with probability 1/8, but it is
> still O(N*N).

Currently it is O(NlogS), not quite as bad as O(N^2).

Since each xid in the tree is always stored to the right, it should be
possible to make that significantly better by starting each binary
search from the next element, rather than the start of the array.
Something like the attached might help, but we can probably make that
cache conscious to improve things even more.
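In outline, something like this (a sketch; the range-limited search helper is
hypothetical, and bookkeeping such as numKnownAssignedXids is elided):

static void
KAXRemoveSortedXids(TransactionId *xids, int nxids, int tail, int head)
{
    int         start = tail;

    for (int i = 0; i < nxids; i++)
    {
        /* Binary search restricted to [start, head). */
        int         pos = KAXBinarySearchRange(xids[i], start, head);

        if (pos >= 0)
        {
            KnownAssignedXidsValid[pos] = false;
            start = pos + 1;    /* the next xid must be to the right */
        }
    }
}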

-- 
Simon Riggs                http://www.EnterpriseDB.com/


subx_optimize_KnownAssignedXidsRemoveTree.v2.patch
Description: Binary data


Re: Slow standby snapshot

2022-08-02 Thread Andrey Borodin



> On 29 Jul 2022, at 20:08, Simon Riggs  wrote:
> 
> A simple patch like this seems to hit the main concern, aiming to keep
> the array from spreading out and impacting snapshot performance for
> SELECTs, yet not doing it so often that the startup process has a
> higher burden of work.

The idea of compressing more often seems viable. But this might make some
other workloads pathological: some KnownAssignedXids routines can now become
quadratic in the presence of lots of subtransactions.

KnownAssignedXidsRemoveTree() only compresses with probability 1/8, but it is
still O(N*N).

IMO the original patch (with the next pointer) is much safer in terms of
unexpected performance degradation.

Thanks!

Best regards, Andrey Borodin.



Re: Slow standby snapshot

2022-08-02 Thread Simon Riggs
On Fri, 29 Jul 2022 at 18:24, Michail Nikolaev
 wrote:

> > A simple patch like this seems to hit the main concern, aiming to keep
> > the array from spreading out and impacting snapshot performance for
> > SELECTs, yet not doing it so often that the startup process has a
> > higher burden of work.
>
> Nice, I'll do performance testing for both versions and master branch
> as baseline.

The objective of all the patches is to touch the smallest number of
cachelines when accessing the KnownAssignedXids array.

The trade-off is to keep the array small with the minimum number of
compressions, so that normal snapshots don't feel the contention and
so that the Startup process doesn't slow down because of the extra
compression work. The values I've chosen in the recent patch are just
guesses at what we'll need to reduce it to, so there may be some
benefit in varying those numbers to see the effects.

It did also occur to me that we might need a separate process to
perform the compressions, which we might be able to give to WALWriter.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Slow standby snapshot

2022-07-29 Thread Michail Nikolaev
Hello.

Thanks to everyone for the review.

> It seems to me that storing the index itself is simpler, and maybe faster by
> the cycles needed to perform the addition.

Yes, the first version used a 1-byte offset with a maximum value of 255.
Agreed, it looks like there is no sense in storing offsets now.

> A simple patch like this seems to hit the main concern, aiming to keep
> the array from spreading out and impacting snapshot performance for
> SELECTs, yet not doing it so often that the startup process has a
> higher burden of work.

Nice, I'll do performance testing for both versions and master branch
as baseline.

Thanks,
Michail.




Re: Slow standby snapshot

2022-07-29 Thread Simon Riggs
On Wed, 27 Jul 2022 at 08:08, Kyotaro Horiguchi  wrote:
>
> At Tue, 26 Jul 2022 23:09:16 +0500, Andrey Borodin  
> wrote in
> >
> >
> > > On 20 Jul 2022, at 02:12, Michail Nikolaev  
> > > wrote:
> > >
> > >> I've looked into v5.
> > > Thanks!
> > >
> > > Patch is updated accordingly your remarks.
> >
> > The patch seems Ready for Committer from my POV.
>
+ * is updated while taking the snapshot. The KnownAssignedXidsNextOffset
+ * contains not an offset to the next valid xid but a number which tends to
+ * the offset to next valid xid. KnownAssignedXidsNextOffset[] values are read
>
> Is there still a reason why the array stores the distance to the next
> valid element instead of the index of that element?  It
> seems to me that the intention was to reduce the size of the
> offset array, but it is int32[], which is far wider than
> TOTAL_MAX_CACHED_SUBXIDS requires.
>
> It seems to me that storing the index itself is simpler, and maybe faster by
> the cycles needed to perform the addition.

I'm inclined to think this is all too much. All of this optimization
makes sense when the array spreads out horribly, but we should be just
avoiding that in the first place by compressing more often.

The original coded frequency of compression was just a guess and was
never tuned later.

A simple patch like this seems to hit the main concern, aiming to keep
the array from spreading out and impacting snapshot performance for
SELECTs, yet not doing it so often that the startup process has a
higher burden of work.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


subx_compress_knownassignedxids_more_often.v2.patch
Description: Binary data


Re: Slow standby snapshot

2022-07-27 Thread Kyotaro Horiguchi
At Tue, 26 Jul 2022 23:09:16 +0500, Andrey Borodin  wrote in
> 
> 
> > On 20 Jul 2022, at 02:12, Michail Nikolaev  
> > wrote:
> > 
> >> I've looked into v5.
> > Thanks!
> > 
> > Patch is updated accordingly your remarks.
> 
> The patch seems Ready for Committer from my POV.

+ * is updated while taking the snapshot. The KnownAssignedXidsNextOffset
+ * contains not an offset to the next valid xid but a number which tends to
+ * the offset to next valid xid. KnownAssignedXidsNextOffset[] values are read

Is there still a reason why the array stores the distance to the next
valid element instead of the index of that element?  It
seems to me that the intention was to reduce the size of the
offset array, but it is int32[], which is far wider than
TOTAL_MAX_CACHED_SUBXIDS requires.

It seems to me that storing the index itself is simpler, and maybe faster by
the cycles needed to perform the addition.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Slow standby snapshot

2022-07-26 Thread Andrey Borodin



> On 20 Jul 2022, at 02:12, Michail Nikolaev  wrote:
> 
>> I've looked into v5.
> Thanks!
> 
> Patch is updated accordingly your remarks.

The patch seems Ready for Committer from my POV.

Thanks!

Best regards, Andrey Borodin.



Re: Slow standby snapshot

2022-07-19 Thread Michail Nikolaev
Hello, Andrey.

> I've looked into v5.
Thanks!

Patch is updated accordingly your remarks.

Best regards,
Michail.
From 1301a262dea7f541c11092851e82f10932150ee3 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Tue, 19 Jul 2022 23:50:53 +0300
Subject: [PATCH v6] Currently KnownAssignedXidsGetAndSetXmin requires an
 iterative loop through KnownAssignedXids array, including xids marked as
 invalid. Performance impact is especially noticeable in the presence of long
 (few seconds) transactions on primary, big number (few thousands) of
 max_connections and high read workload on standby. Most of the CPU is spent on
 looping through KnownAssignedXids while almost all xids are invalid anyway.
 KnownAssignedXidsCompress removes invalid xids from time to time, but
 performance is still affected.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

To increase performance we lazily maintain an offset in the KnownAssignedXidsNextOffset array to skip xids
known to be invalid. KnownAssignedXidsNextOffset does not always point to the next valid xid; it is just some
offset that is safe to skip (known to cover only invalid xids).
---
 src/backend/storage/ipc/procarray.c | 57 -
 1 file changed, 48 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index dadaa958a8..f613ae2f09 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -271,6 +271,7 @@ static TransactionId cachedXidIsNotInProgress = InvalidTransactionId;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static int32 *KnownAssignedXidsNextOffset;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -450,6 +451,12 @@ CreateSharedProcArray(void)
 			ShmemInitStruct("KnownAssignedXidsValid",
 			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+		KnownAssignedXidsNextOffset = (int32 *)
+ShmemInitStruct("KnownAssignedXidsNextOffset",
+mul_size(sizeof(int32), TOTAL_MAX_CACHED_SUBXIDS),
+&found);
+		for (int i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+			KnownAssignedXidsNextOffset[i] = 1;
 	}
 }
 
@@ -4539,7 +4546,15 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * XID entry itself.  This preserves the property that the XID entries are
  * sorted, so we can do binary searches easily.  Periodically we compress
  * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
+ * array immediately on every deletion. Also, we lazily maintain an offset
+ * in the KnownAssignedXidsNextOffset[] array to skip xids known to be invalid.
+ * It helps to skip the gaps; it could significantly increase performance in
+ * the case of long transactions on the primary. KnownAssignedXidsNextOffset
+ * is updated during taking the snapshot. The KnownAssignedXidsNextOffset
+ * contains not an offset to the next valid xid but a number that tends toward
+ * the offset to the next valid xid. KnownAssignedXidsNextOffset[] values are
+ * read and updated without additional locking because four-byte reads/writes
+ * are assumed to be atomic.
  *
  * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
  * are those with indexes tail <= i < head; items outside this subscript range
@@ -4577,7 +4592,7 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  *		must happen)
  *	* Compressing the array is O(S) and requires exclusive lock
  *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
+ *	* Taking a snapshot is O(S), amortized O(N) on next call; requires shared lock
  *	* Checking for an XID is O(logS) and requires shared lock
  *
  * In comparison, using a hash table for KnownAssignedXids would mean that
@@ -4637,12 +4652,13 @@ KnownAssignedXidsCompress(bool force)
 	 * re-aligning data to 0th element.
 	 */
 	compress_index = 0;
-	for (i = tail; i < head; i++)
+	for (i = tail; i < head; i += KnownAssignedXidsNextOffset[i])
 	{
 		if (KnownAssignedXidsValid[i])
 		{
 			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
 			KnownAssignedXidsValid[compress_index] = true;
+			KnownAssignedXidsNextOffset[compress_index] = 1;
 			compress_index++;
 		}
 	}
@@ -4745,6 +4761,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	{
 		KnownAssignedXids[head] = next_xid;
 		KnownAssignedXidsValid[head] = true;
+		KnownAssignedXidsNextOffset[head] = 1;
 		TransactionIdAdvance(next_xid);
 		head++;
 	}
@@ -4960,7 +4977,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)
 	tail = pArray->tailKnownAssignedXids;
 	head = pArray->headKnownAssignedXids;
 
-	for (i = tail; i < head; i++)
+	for (i = tail; i < head; i += KnownAssignedXidsNextOffset[i])
 	{
 		if (KnownAssignedXidsValid[i])
 		{
@@ -4983,7 +5000,7 @@ KnownAssignedXidsRemovePreceding(TransactionId removeXid)

Re: Slow standby snapshot

2022-07-03 Thread Andrey Borodin



> On 1 Apr 2022, at 04:18, Michail Nikolaev  wrote:
> 
> Hello.
> 
> Just an updated commit message.

I've looked into v5.

IMO the purpose of KnownAssignedXidsNext would be slightly more obvious if it 
was named KnownAssignedXidsNextOffset.
Also please consider some editorialisation:
s/high value/big number/g
KnownAssignedXidsNext[] is updating while taking the snapshot. -> 
KnownAssignedXidsNext[] is updated during taking the snapshot.
O(N) next call -> amortized O(N) on next call

Is it safe on all platforms to do "KnownAssignedXidsNext[prev] = n;" while only 
holding shared lock? I think it is, per Alexander's comment, but maybe let's 
document it?

Thank you!

Thanks! Best regards, Andrey Borodin.



Re: Slow standby snapshot

2022-07-02 Thread Michail Nikolaev
Hello, Simon.

Sorry for addressing you directly, but you know the subject better than
anyone else. It is related to your work from 2010 (1) - replacing
KnownAssignedXidsHash with the KnownAssignedXids array.

I have added an additional optimization to the data structure you
implemented. Initially, it was motivated by huge CPU usage (70%+)
in KnownAssignedXidsGetAndSetXmin in the case of long (a few seconds)
transactions on the primary and high (a few thousand) max_connections in
Postgres 11.

After the snapshot scalability optimization by Andres Freund (2), it is
not so crucial, but it still provides a significant performance impact
(+10% TPS) for a typical workload; see the benchmark (3).

The latest patch version is here (4).

Does such an optimisation look worth committing?

Thanks in advance,
Michail.

[1]: 
https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179
[2]: https://commitfest.postgresql.org/29/2500/
[3]: 
https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6
[4]: 
https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382




Re: Slow standby snapshot

2022-03-31 Thread Michail Nikolaev
Hello.

Just an updated commit message.

Thanks,
Michail.
From 934d649a18c66f8b448463e57375865b33ce53e7 Mon Sep 17 00:00:00 2001
From: nkey 
Date: Fri, 1 Apr 2022 02:17:55 +0300
Subject: [PATCH v5] Optimize KnownAssignedXidsGetAndSetXmin by maintaining
 offset to next valid xid.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently KnownAssignedXidsGetAndSetXmin requires an iterative loop through 
KnownAssignedXids array,
including xids marked as invalid. Performance impact is especially noticeable 
in the presence of
long (few seconds) transactions on primary, high value (few thousands) of 
max_connections and high
read workload on standby. Most of the CPU is spent on looping through
KnownAssignedXids while almost all
xids are invalid anyway. KnownAssignedXidsCompress removes invalid xids from
time to time,
but performance is still affected.

To increase performance we lazily maintain an offset in the 
KnownAssignedXidsNext array to skip known
to be invalid xids. KnownAssignedXidsNext does not always point to “next” valid 
xid, it is just some
offset safe to skip (known to be filled by only invalid xids).

It helps to skip the gaps and could significantly increase performance - up to 
10% more TPS on standby.

Thread: 
https://www.postgresql.org/message-id/flat/CALdSSPgahNUD_%3DpB_j%3D1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw%40mail.gmail.com

Benchmark:
https://www.postgresql.org/message-id/flat/CANtu0ohzBFTYwdLtcanWo4%2B794WWUi7LY2rnbHyorJdE8_ZnGg%40mail.gmail.com#379c1be7b8134ada5a574078d51b64c6
---
 src/backend/storage/ipc/procarray.c | 56 -
 1 file changed, 47 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c 
b/src/backend/storage/ipc/procarray.c
index 13d192ec2b..b5cb3423fb 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static int32 *KnownAssignedXidsNext;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -455,6 +456,12 @@ CreateSharedProcArray(void)
ShmemInitStruct("KnownAssignedXidsValid",
mul_size(sizeof(bool), 
TOTAL_MAX_CACHED_SUBXIDS),
&found);
+   KnownAssignedXidsNext = (int32 *)
+   ShmemInitStruct("KnownAssignedXidsNext",
+   
mul_size(sizeof(int32), TOTAL_MAX_CACHED_SUBXIDS),
+   &found);
+   for (int i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+   KnownAssignedXidsNext[i] = 1;
}
 }
 
@@ -4544,7 +4551,13 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * XID entry itself.  This preserves the property that the XID entries are
  * sorted, so we can do binary searches easily.  Periodically we compress
  * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
+ * array immediately on every deletion. Also, we lazily maintain an offset
+ * in KnownAssignedXidsNext[] array to skip known to be invalid xids. It
+ * helps to skip the gaps; it could significantly increase performance in
+ * the case of long transactions on the primary. KnownAssignedXidsNext[] is
+ * updating while taking the snapshot. In general case KnownAssignedXidsNext
+ * contains not an offset to the next valid xid but a number which tends to
+ * the offset to next valid xid.
  *
  * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
  * are those with indexes tail <= i < head; items outside this subscript range
@@ -4582,7 +4595,7 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * must happen)
  * * Compressing the array is O(S) and requires exclusive lock
  * * Removing an XID is O(logS) and requires exclusive lock
- * * Taking a snapshot is O(S) and requires shared lock
+ * * Taking a snapshot is O(S), O(N) next call; requires shared lock
  * * Checking for an XID is O(logS) and requires shared lock
  *
  * In comparison, using a hash table for KnownAssignedXids would mean that
@@ -4642,12 +4655,13 @@ KnownAssignedXidsCompress(bool force)
 * re-aligning data to 0th element.
 */
compress_index = 0;
-   for (i = tail; i < head; i++)
+   for (i = tail; i < head; i += KnownAssignedXidsNext[i])
{
if (KnownAssignedXidsValid[i])
{
KnownAssignedXids[compress_index] = 
KnownAssignedXids[i];
KnownAssignedXidsValid[compress_index] = true;
+   KnownAssignedXidsNext[compress_index] = 1;
compress_index++;
}
 

Re: Slow standby snapshot

2022-02-20 Thread Michail Nikolaev
Hello, Andrey.

Thanks for your efforts.

> The patch on the barrier seems too complicated to me right now. I’d propose to
> focus on the KnownAssignedXidsNext patch: it’s clean, simple and effective.
I'll extract it to a separate patch later.

> I’ve rebased the patch so that it does not depend on the previous step. Please
> check out its current state; if you are OK with it, let’s mark the patch
> Ready for Committer. Maybe just a slightly better commit message would make the
> patch easier to understand.
Everything seems to be correct.

Best regards,
Michail.




Re: Slow standby snapshot

2022-02-20 Thread Andrey Borodin


> On 22 Nov 2021, at 14:05, Michail Nikolaev  wrote:
> 
>> Write barrier must be issued after write, not before.
>> Don't we need to issue read barrier too?
> 
> The write barrier is issued after the changes to KnownAssignedXidsNext
> and KnownAssignedXidsValid arrays and before the update of
> headKnownAssignedXids.
> So, it seems to be correct. We make sure that once the CPU sees a change of
> headKnownAssignedXids, the underlying arrays contain all the required
> data.

The patch on the barrier seems too complicated to me right now. I’d propose to
focus on the KnownAssignedXidsNext patch: it’s clean, simple and effective.

I’ve rebased the patch so that it does not depend on the previous step. Please
check out its current state; if you are OK with it, let’s mark the patch
Ready for Committer. Maybe just a slightly better commit message would make the
patch easier to understand.


Thanks! Best regards, Andrey Borodin.


v4-0001-Use-linked-list-to-improve-KnownAssignedXids-perf.patch
Description: Binary data


Re: Slow standby snapshot

2021-11-22 Thread Michail Nikolaev
Hello, Andrey.

> Write barrier must be issued after write, not before.
> Don't we need to issue read barrier too?

The write barrier is issued after the changes to KnownAssignedXidsNext
and KnownAssignedXidsValid arrays and before the update of
headKnownAssignedXids.
So, it seems to be correct. We make sure that once the CPU sees a change of
headKnownAssignedXids, the underlying arrays contain all the required
data.
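
The intended pairing can be illustrated with a stand-alone toy using C11
atomics as a stand-in for pg_write_barrier()/pg_read_barrier(); all names
here are invented, and this is only a sketch of the ordering argument:

#include <stdatomic.h>

#define N_SLOTS 64

static int	slots[N_SLOTS];
static _Atomic int published_head;

/* Writer: fill the entry, then fence, then advance the head. */
static void
publish(int value)
{
	int			h = atomic_load_explicit(&published_head, memory_order_relaxed);

	slots[h] = value;
	atomic_thread_fence(memory_order_release);	/* ~ pg_write_barrier() */
	atomic_store_explicit(&published_head, h + 1, memory_order_relaxed);
}

/* Reader: fetch the head, then fence, then read entries below it. */
static int
sum_published(void)
{
	int			h = atomic_load_explicit(&published_head, memory_order_relaxed);
	int			sum = 0;

	atomic_thread_fence(memory_order_acquire);	/* ~ pg_read_barrier() */
	for (int i = 0; i < h; i++)
		sum += slots[i];
	return sum;
}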

Thanks,
Michail.




Re: Slow standby snapshot

2021-11-22 Thread Andrey Borodin



> On 21 Nov 2021, at 23:58, Michail Nikolaev wrote:
> 
> 

Write barrier must be issued after write, not before.
Don't we need to issue read barrier too?

Best regards, Andrey Borodin.



Re: Slow standby snapshot

2021-11-21 Thread Michail Nikolaev
Hello, Alexander.

Thanks for your review.

> It looks like KnownAssignedXidsNext doesn't have to be
> pg_atomic_uint32.  I see it only gets read with pg_atomic_read_u32()
> and written with pg_atomic_write_u32().  Existing code believes that
> read/write of 32-bit values is atomic.  So, you can use just uint32
> instead of pg_atomic_uint32.

Fixed. Looks better now, yeah.

Also, I added an additional commit (not depending on the KnownAssignedXidsNext
feature) replacing the spinlock with a memory barrier. It goes back to
Simon Riggs and Tom Lane in 2010:

> (We could dispense with the spinlock if we were to
> create suitable memory access barrier primitives and use those instead.)

Now it is possible to avoid an additional spinlock on each
KnownAssignedXidsGetAndSetXmin. I have not measured the performance
impact of this particular change yet (and it is probably not easy to
reliably measure an impact of less than 0.5%), but I think it is worth
adding. We need to protect only the head pointer because the tail is
updated only under exclusive ProcArrayLock. BTW, should I provide a
separate patch for this?

So, now we have a pretty successful benchmark for the typical use case,
and some additional investigation has been done. So I think I’ll
re-add the patch to the commitfest app.

Thanks,
Michail
From 94cd2fbf37b5f0b824e0f9a9bc23f762a8bb19b5 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sun, 21 Nov 2021 21:23:02 +0300
Subject: [PATCH v3 1/2] memory barrier instead of spinlock

---
 src/backend/storage/ipc/procarray.c | 42 +++--
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a9945c80eb..da0c4eaa00 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -60,7 +60,6 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -81,7 +80,6 @@ typedef struct ProcArrayStruct
 	int			numKnownAssignedXids;	/* current # of valid entries */
 	int			tailKnownAssignedXids;	/* index of oldest valid element */
 	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-	slock_t		known_assigned_xids_lck;	/* protects head/tail pointers */
 
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
@@ -425,7 +423,6 @@ CreateSharedProcArray(void)
 		procArray->numKnownAssignedXids = 0;
 		procArray->tailKnownAssignedXids = 0;
 		procArray->headKnownAssignedXids = 0;
-		SpinLockInit(&procArray->known_assigned_xids_lck);
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4553,10 +4550,9 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * pointer.  This wouldn't require any lock at all, except that on machines
  * with weak memory ordering we need to be careful that other processors
  * see the array element changes before they see the head pointer change.
- * We handle this by using a spinlock to protect reads and writes of the
- * head/tail pointers.  (We could dispense with the spinlock if we were to
- * create suitable memory access barrier primitives and use those instead.)
- * The spinlock must be taken to read or write the head/tail pointers unless
+ * We handle this by using a memory barrier to protect writes of the
+ * head pointer.
+ * The memory barrier is issued before writing the head pointer unless
  * the caller holds ProcArrayLock exclusively.
  *
  * Algorithmic analysis:
@@ -4600,7 +4596,6 @@ KnownAssignedXidsCompress(bool force)
 	int			compress_index;
 	int			i;
 
-	/* no spinlock required since we hold ProcArrayLock exclusively */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
 
@@ -4686,7 +4681,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 
 	/*
 	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
+	 * are safe to read them here.
 	 */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
@@ -4744,21 +4739,20 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	pArray->numKnownAssignedXids += nxids;
 
 	/*
-	 * Now update the head pointer.  We use a spinlock to protect this
+	 * Now update the head pointer.  We use a memory barrier to protect this
 	 * pointer, not because the update is likely to be non-atomic, but to
 	 * ensure that other processors see the above array updates before they
 	 * see the head pointer change.
 	 *
 	 * If we're holding ProcArrayLock exclusively, there's no need to take the
-	 * spinlock.
+	 * barrier.
 	 */
 	if (exclusive_lock)
 		pArray->headKnownAssignedXids = head;
 	else
 	{
-		SpinLockAcquire(&pArray->known_assigned_xids_lck);
+		pg_write_barrier();
		pArray->headKnownAssignedXids = head;

Re: Slow standby snapshot

2021-11-15 Thread Alexander Korotkov
On Wed, Nov 10, 2021 at 12:16 AM Michail Nikolaev
 wrote:
> I updated the patch a little. KnownAssignedXidsGetAndSetXmin now
> causes fewer cache misses because some values are stored in variables
> (registers). I think it is better to not lean on the compiler here
> because of `volatile` args.
> Also, I have added some comments.

It looks like KnownAssignedXidsNext doesn't have to be
pg_atomic_uint32.  I see it only gets read with pg_atomic_read_u32()
and written with pg_atomic_write_u32().  Existing code believes that
read/write of 32-bit values is atomic.  So, you can use just uint32
instead of pg_atomic_uint32.

--
Regards,
Alexander Korotkov




Re: Slow standby snapshot

2021-11-09 Thread Michail Nikolaev
Hello, Andrey.

Thanks for your feedback.

> The current patch addresses another problem. In the presence of an old enough
> transaction, enumeration of KnownAssignedXids under shared lock prevents adding
> new transactions under exclusive lock. And recovery effectively pauses.

Actually, I see two problems here (caused by the presence of old long
transactions). The first one is lock contention, which causes recovery
pauses. The second one is just high CPU usage on the standby in
KnownAssignedXidsGetAndSetXmin.

> All in all, I think using the proposed "KnownAssignedXidsNext" patch solves a real
> problem, and the problem of binary searches should be addressed by compressing
> KnownAssignedXids more often.

I updated the patch a little. KnownAssignedXidsGetAndSetXmin now
causes fewer cache misses because some values are stored in variables
(registers). I think it is better to not lean on the compiler here
because of `volatile` args.
Also, I have added some comments.

Best regards,
Michail.
From 1d55c6fae8cc160eadd705da0d70d9e7fb5bc00f Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Wed, 10 Nov 2021 00:02:18 +0300
Subject: [PATCH v2] known assignment xid next

---
 src/backend/storage/ipc/procarray.c | 67 -
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 892f0f6799..422004ad31 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -405,6 +406,7 @@ void
 CreateSharedProcArray(void)
 {
 	bool		found;
+	int			i;
 
 	/* Create or attach to the ProcArray shared structure */
 	procArray = (ProcArrayStruct *)
@@ -446,6 +448,12 @@ CreateSharedProcArray(void)
 			ShmemInitStruct("KnownAssignedXidsValid",
 			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+		KnownAssignedXidsNext = (pg_atomic_uint32 *)
+ShmemInitStruct("KnownAssignedXidsNext",
+mul_size(sizeof(pg_atomic_uint32), TOTAL_MAX_CACHED_SUBXIDS),
+&found);
+		for (i = 0; i < TOTAL_MAX_CACHED_SUBXIDS; i++)
+			pg_atomic_init_u32(&KnownAssignedXidsNext[i], 1);
 	}
 }
 
@@ -4517,7 +4525,13 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * XID entry itself.  This preserves the property that the XID entries are
  * sorted, so we can do binary searches easily.  Periodically we compress
  * out the unused entries; that's much cheaper than having to compress the
- * array immediately on every deletion.
+ * array immediately on every deletion. Also, we lazily maintain an offset
+ * in KnownAssignedXidsNext[] array to skip known to be invalid xids. It
+ * helps to skip the gaps; it could significantly increase performance in
+ * the case of long transactions on the primary. KnownAssignedXidsNext[] is
+ * updating while taking the snapshot. In general case KnownAssignedXidsNext
+ * contains not an offset to the next valid xid but a number which tends to
+ * the offset to next valid xid.
  *
  * The actually valid items in KnownAssignedXids[] and KnownAssignedXidsValid[]
  * are those with indexes tail <= i < head; items outside this subscript range
@@ -4544,7 +4558,10 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * head/tail pointers.  (We could dispense with the spinlock if we were to
  * create suitable memory access barrier primitives and use those instead.)
  * The spinlock must be taken to read or write the head/tail pointers unless
- * the caller holds ProcArrayLock exclusively.
+ * the caller holds ProcArrayLock exclusively. Access to KnownAssignedXidsNext
+ * is not especially protected by any lock because it is just some kind of hint
+ * to reduce the scan cost, but at least shared ProcArrayLock is held anyway -
+ * it is required to access KnownAssignedXids.
  *
  * Algorithmic analysis:
  *
@@ -4555,7 +4572,7 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  *		must happen)
  *	* Compressing the array is O(S) and requires exclusive lock
  *	* Removing an XID is O(logS) and requires exclusive lock
- *	* Taking a snapshot is O(S) and requires shared lock
+ *	* Taking a snapshot is O(S), O(N) next call; requires shared lock
  *	* Checking for an XID is O(logS) and requires shared lock
  *
  * In comparison, using a hash table for KnownAssignedXids would mean that
@@ -4615,12 +4632,13 @@ KnownAssignedXidsCompress(bool force)
 	 * re-aligning data to 0th element.
 	 */
 	compress_index = 0;
-	for (i = tail; i < head; i++)
+	for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
 	{
 		if (KnownAssignedXidsValid[i])
 		{
 			KnownAssignedXids[compress_index] = KnownAssignedXids[i];
 			KnownAssignedXidsValid[compress_index] = true;
+			pg_atomic_write_u32(&KnownAssignedXidsNext[compress_index], 1);

Re: Slow standby snapshot

2021-11-07 Thread Andrey Borodin
Sorry for such a late reply. I've been thinking about possible approaches.
KnownAssignedXids over a hashtable was in fact implemented long before and
rejected [0].

> On 3 Aug 2021, at 22:35, Andres Freund wrote:
> 
> On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
>>> On 3 Aug 2021, at 03:01, Andres Freund wrote:
>>> On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
 The main idea is simple optimistic optimization - store offset to next
 valid entry. So, in most cases, we could just skip all the gaps.
 Of course, it adds some additional impact for workloads without long
 (few seconds) transactions but it is almost not detectable (because of
 CPU caches).
>>> 
>>> I'm doubtful that that's really the right direction. For workloads that
>>> are replay heavy we already often can see the cost of maintaining the
>>> known xids datastructures show up significantly - not surprising, given
>>> the datastructure. And for standby workloads with active primaries the
>>> cost of searching through the array in all backends is noticeable as
>>> well.  I think this needs a bigger data structure redesign.
>> 
>> KnownAssignedXids implements a simple membership-test idea. What kind of
>> redesign would you suggest? The proposed optimisation makes it close to optimal,
>> but needs eventual compression.
> 
> Binary searches are very inefficient on modern CPUs (unpredictable memory
> accesses, unpredictable branches). We constantly need to do binary searches
> during replay to remove xids from the array. I don't see how you can address
> that with just the current datastructure.
The current patch addresses another problem. In the presence of an old enough
transaction, enumeration of KnownAssignedXids under shared lock prevents adding
new transactions under exclusive lock. And recovery effectively pauses.

Binary searches can consume 10-15 cache misses, which is an unreasonable amount
of memory waits. But that's a somewhat different problem.
Also, binsearch is not that expensive when we compress KnownAssignedXids often.

>> Maybe use a hashtable of running transactions? It will be slightly faster
>> when adding/removing single transactions. But much worse when doing
>> KnownAssignedXidsRemove().
> 
> Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
> write KnownAssignedXidsGet[AndSetXmin]()?
I was thinking about the inefficient KnownAssignedXidsRemovePreceding() in a
hashtable. But, probably, this is not such a frequent operation.

>> Maybe use a tree? (AVL/RB or something like that) It will be slightly
>> better, because it does not need eventual compression like the existing array.
> 
> I'm not entirely sure what datastructure would work best. I can see something
> like a radix tree work well, or a skiplist style approach. Or a hashtable:
> 
> I'm not sure that we need to care as much about the cost of
> KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
> less frequent. But more importantly, it'd not be hard to maintain an
> occasionally (or opportunistically) maintained denser version for
> GetSnapshotData() - there's only a single writer, making the concurrency
> issues a lot simpler.

I've been prototyping a radix tree for a while.
Here every 4 xids are summarized by the minimum Xid and the number of underlying
Xids. Of course, 4 is an arbitrary number; the summarization area must be of
cacheline size.
                ┌───────┐
                │ 1 / 9 │
                └───────┘
┌───────┬───────┬───────┬───────┐
│ 1 / 3 │ 5 / 0 │ 9 / 3 │ D / 3 │
└───────┴───────┴───────┴───────┘
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 1 │ 2 │   │ 4 │   │   │   │   │ 9 │   │ B │ C │ D │ E │ F │   │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

(Each inner node holds "minimum Xid / number of live Xids" for the group
of four slots below it.)
The bottom layer is the current array (TransactionId *KnownAssignedXids).
When we remove an Xid we touch close to the theoretical minimum of cachelines:
I'd say 5-7 instead of the 10-15 of a binsearch (in case of millions of entries
in KnownAssignedXids).
Enumerating running Xids is not that difficult either: we would need to scan
O(xip) memory, not the whole KnownAssignedXids array.
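
A rough sketch of such a summary node (hypothetical; this struct never
appeared in any posted patch):

#include <stdint.h>

typedef uint32_t TransactionId;	/* stand-in for the PostgreSQL typedef */

/*
 * Hypothetical inner node: covers a cacheline-sized group of children
 * and stores the minimum xid plus the count of live xids underneath,
 * so a removal can walk down while touching only a few cachelines.
 */
typedef struct KAXSummaryNode
{
	TransactionId min_xid;		/* smallest xid in the subtree ("1", "5", "9", "D" above) */
	uint32_t	nxids;			/* live xids underneath ("9" at the root) */
} KAXSummaryNode;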

But the overall complexity of this approach does not seem good to me.

All in all, I think using the proposed "KnownAssignedXidsNext" patch solves a real
problem, and the problem of binary searches should be addressed by compressing
KnownAssignedXids more often.

Re: Slow standby snapshot

2021-10-02 Thread Michail Nikolaev
Hello, Andres.

Could you please clarify how to better deal with the situation?

According to your previous letter, I think there was some
misunderstanding regarding the latest patch version (but I am not
sure), because, as far as I understand, the provided optimization (a
lazily calculated optional offset to the next valid xid) fits your
wishes. It was described in the previous letter in more detail.

And now it is not clear to me how to move forward :)

There is an option to try to find some better data structure (like
some tricky linked hash map), but that is going to be a huge change
without any confidence of getting a more effective version (because
the provided changes already make the structure pretty effective).

Another option I see is to use the optimization from the latest patch version
and get a significant TPS increase (5-7%) for the typical standby read
scenario. The patch is small and does not affect other scenarios in a
negative way. I could probably make an additional set of performance
tests and provide some simulation to prove that the
pg_atomic_uint32-related code is correct (if required).

Or just leave the issue and hope someone else will try to fix it in
the future :)

Thanks a lot,
Michail.




Re: Slow standby snapshot

2021-09-30 Thread Michael Paquier
On Tue, Aug 10, 2021 at 12:45:17AM +0300, Michail Nikolaev wrote:
> Thanks for the feedback again.

From what I can see, there has been some feedback from Andres here,
and the thread has been idle for six weeks now, so I have marked this patch
as RwF in the CF app.
--
Michael


signature.asc
Description: PGP signature


Re: Slow standby snapshot

2021-08-09 Thread Michail Nikolaev
Hello, Andres.

Thanks for the feedback again.

> The problem is that we don't want to add a lot of work to
> KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
> for that moment and building a sorted, gap-free, linear array of xids isn't
> cheap. In my experience it's more common to be bottlenecked by replay CPU
> performance than on replica snapshot building.

Yes, but my patch adds almost the smallest possible amount of work to
KnownAssignedXidsAdd/Remove - a single write to the array by index.
It differs from the first version in this thread, which is based on linked lists.
The "next valid offset" is just an "optimistic optimization" - it means
"you can safely skip ahead by KnownAssignedXidsNext[i] while searching for
the next valid entry".
But KnownAssignedXidsNext is not updated by Add/Remove.

> During recovery, where there's only one writer to the procarray / known xids,
> it might not be hard to opportunistically build a dense version of the known
> xids whenever a snapshot is built.

AFAIU the patch does exactly the same.
On the first snapshot building, offsets to the next valid entry are
set. So, a dense version is created on-demand.
And this version is reused (even partly if something was removed) on
the next snapshot building.

> I'm not entirely sure what datastructure would work best. I can see something
> like a radix tree work well, or a skiplist style approach. Or a hashtable:

We could try to use some other structure (for example - linked hash
map) - but the additional cost (memory management, links, hash
calculation) will probably significantly reduce performance.
And it is a much harder step to perform.

So, I think "next valid offset" optimization is a good trade-off for now:
* KnownAssignedXidsAdd/Remove are almost not affected in their complexity
* KnownAssignedXidsGetAndSetXmin is eliminated from the CPU top on
typical read scenario - so, more TPS, less ProcArrayLock contention
* it complements GetSnapshotData() scalability - now on standby
* changes are small

Thanks,
Michail.




Re: Slow standby snapshot

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 22:23:58 +0300, Michail Nikolaev wrote:
> > I'm not sure that we need to care as much about the cost of
> > KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes 
> > that
> > less frequent.
> 
> It is still about 5-7% of CPU for a typical workload, a considerable
> amount for me.

I'm not saying we shouldn't optimize things. Just that it's less pressing. And
what kind of price we're willing to pay to optimize may have changed.


> And a lot of systems still work on 12 and 13.

I don't see us backporting performance improvements around this to 12 and 13,
so I don't think that matters much... We've done that a few times, but usually
when there's some accidentally quadratic behaviour or such.


> > But more importantly, it'd not be hard to maintain an
> > occasionally (or opportunistically) maintained denser version for
> > GetSnapshotData() - there's only a single writer, making the concurrency
> > issues a lot simpler.
> 
> Could you please explain it in more detail?
> Single writer and GetSnapshotData() already exclusively hold
> ProcArrayLock at the moment of KnownAssignedXidsRemove,
> KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.

GetSnapshotData() only locks ProcArrayLock in shared mode.

The problem is that we don't want to add a lot of work to
KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
for that moment and building a sorted, gap-free, linear array of xids isn't
cheap. In my experience it's more common to be bottlenecked by replay CPU
performance than on replica snapshot building.

During recovery, where there's only one writer to the procarray / known xids,
it might not be hard to opportunistically build a dense version of the known
xids whenever a snapshot is built.
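
A hedged sketch of what that could look like inside procarray.c (invented
names; synchronization, xmax filtering, and the writer-side invalidation are
deliberately elided, and nothing like this exists in the posted patches):

static TransactionId denseKnownAssignedXids[TOTAL_MAX_CACHED_SUBXIDS];
static int	denseKnownAssignedCount = 0;
static bool denseKnownAssignedValid = false;	/* cleared by the startup process */

static int
KnownAssignedXidsGetDense(TransactionId *xarray)
{
	if (!denseKnownAssignedValid)
	{
		/* Rebuild the dense copy from the sparse array once. */
		int			count = 0;

		for (int i = procArray->tailKnownAssignedXids;
			 i < procArray->headKnownAssignedXids; i++)
		{
			if (KnownAssignedXidsValid[i])
				denseKnownAssignedXids[count++] = KnownAssignedXids[i];
		}
		denseKnownAssignedCount = count;
		denseKnownAssignedValid = true;
	}

	/* Subsequent snapshots just copy the dense version. */
	memcpy(xarray, denseKnownAssignedXids,
		   denseKnownAssignedCount * sizeof(TransactionId));
	return denseKnownAssignedCount;
}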

Greetings,

Andres Freund




Re: Slow standby snapshot

2021-08-03 Thread Michail Nikolaev
Hello, Andres.

Thanks for your feedback.

>> Maybe use a hashtable of running transactions? It will be slightly faster
>> when adding/removing single transactions. But much worse when doing
>> KnownAssignedXidsRemove().
> Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
> write KnownAssignedXidsGet[AndSetXmin]()?

It actually was a hashtable in 2010. It was replaced by Simon Riggs
in commit 2871b4618af1acc85665eec0912c48f8341504c4.

> I'm not sure that we need to care as much about the cost of
> KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
> less frequent.

It is still about 5-7% of CPU for a typical workload, a considerable
amount for me. And a lot of systems still work on 12 and 13.
The proposed approach eliminates KnownAssignedXidsGetAndSetXmin from
the top with a small changeset.

> But more importantly, it'd not be hard to maintain an
> occasionally (or opportunistically) maintained denser version for
> GetSnapshotData() - there's only a single writer, making the concurrency
> issues a lot simpler.

Could you please explain it in more detail?
Single writer and GetSnapshotData() already exclusively hold
ProcArrayLock at the moment of KnownAssignedXidsRemove,
KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.

BTW, the tiny thing we could also fix now is (comment from code):

> (We could dispense with the spinlock if we were to
>  * create suitable memory access barrier primitives and use those instead.)
>  * The spinlock must be taken to read or write the head/tail pointers unless
>  * the caller holds ProcArrayLock exclusively.

Thanks,
Michail.




Re: Slow standby snapshot

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
> > On 3 Aug 2021, at 03:01, Andres Freund wrote:
> > On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
> >> The main idea is simple optimistic optimization - store offset to next
> >> valid entry. So, in most cases, we could just skip all the gaps.
> >> Of course, it adds some additional impact for workloads without long
> >> (few seconds) transactions but it is almost not detectable (because of
> >> CPU caches).
> > 
> > I'm doubtful that that's really the right direction. For workloads that
> > are replay heavy we already often can see the cost of maintaining the
> > known xids datastructures show up significantly - not surprising, given
> > the datastructure. And for standby workloads with active primaries the
> > cost of searching through the array in all backends is noticeable as
> > well.  I think this needs a bigger data structure redesign.
> 
> KnownAssignedXids implements a simple membership-test idea. What kind of
> redesign would you suggest? The proposed optimisation makes it close to optimal,
> but needs eventual compression.

Binary searches are very inefficient on modern CPUs (unpredictable memory
accesses, unpredictable branches). We constantly need to do binary searches
during replay to remove xids from the array. I don't see how you can address
that with just the current datastructure.


> Maybe use a hashtable of running transactions? It will be slightly faster
> when adding/removing single transactions. But much worse when doing
> KnownAssignedXidsRemove().

Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?


> Maybe use a tree? (AVL/RB or something like that) It will be slightly better,
> because it does not need eventual compression like the existing array.

I'm not entirely sure what datastructure would work best. I can see something
like a radix tree work well, or a skiplist style approach. Or a hashtable:

I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent. But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.


Greetings,

Andres Freund




Re: Slow standby snapshot

2021-08-02 Thread Andrey Borodin
Hi,

> On 3 Aug 2021, at 03:01, Andres Freund wrote:
> 
> Hi,
> 
> On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
>> The main idea is simple optimistic optimization - store offset to next
>> valid entry. So, in most cases, we could just skip all the gaps.
>> Of course, it adds some additional impact for workloads without long
>> (few seconds) transactions but it is almost not detectable (because of
>> CPU caches).
> 
> I'm doubtful that that's really the right direction. For workloads that
> are replay heavy we already often can see the cost of maintaining the
> known xids datastructures show up significantly - not surprising, given
> the datastructure. And for standby workloads with active primaries the
> cost of searching through the array in all backends is noticeable as
> well.  I think this needs a bigger data structure redesign.

KnownAssignedXids implements a simple membership-test idea. What kind of redesign
would you suggest? The proposed optimisation makes it close to optimal, but needs
eventual compression.

Maybe use a hashtable of running transactions? It will be slightly faster when 
adding/removing single transactions. But much worse when doing
KnownAssignedXidsRemove().

Maybe use a tree? (AVL/RB or something like that) It will be slightly better,
because it does not need eventual compression like the existing array.

Best regards, Andrey Borodin.



Re: Slow standby snapshot

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
> The main idea is simple optimistic optimization - store offset to next
> valid entry. So, in most cases, we could just skip all the gaps.
> Of course, it adds some additional impact for workloads without long
> (few seconds) transactions but it is almost not detectable (because of
> CPU caches).

I'm doubtful that that's really the right direction. For workloads that
are replay heavy we already often can see the cost of maintaining the
known xids datastructures show up significantly - not surprising, given
the datastructure. And for standby workloads with active primaries the
cost of searching through the array in all backends is noticeable as
well.  I think this needs a bigger data structure redesign.

Greetings,

Andres Freund




Re: Slow standby snapshot

2021-08-02 Thread Michail Nikolaev
Hello.

> I have tried such an approach, but it looks like it is not effective,
> probably because of CPU caching issues.

It was a mistake on my part. I have repeated the approach and got good
results with a small and non-invasive patch.

The main idea is simple optimistic optimization - store offset to next
valid entry. So, in most cases, we could just skip all the gaps.
Of course, it adds some additional impact for workloads without long
(few seconds) transactions but it is almost not detectable (because of
CPU caches).
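
Condensed, the scan in KnownAssignedXidsGetAndSetXmin then looks roughly like
this (a simplified reading of the attached patch: xmin tracking, locking, and
edge cases are omitted, and tail/head/xarray are as in procarray.c):

	int			prev = -1;
	int			count = 0;
	int			i;

	for (i = tail; i < head; i += pg_atomic_read_u32(&KnownAssignedXidsNext[i]))
	{
		if (KnownAssignedXidsValid[i])
		{
			/*
			 * Remember how far we had to walk from the previous valid
			 * entry, so the next scan can jump the same gap in one step.
			 */
			if (prev >= 0)
				pg_atomic_write_u32(&KnownAssignedXidsNext[prev], i - prev);
			prev = i;

			xarray[count++] = KnownAssignedXids[i];
		}
	}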

* TEST

The next testing setup was used:

max_connections=5000 (taken from real RDS installation)
pgbench -i -s 10 -U postgres -d postgres

# emulate typical OLTP load
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres

#emulate typical cache load on replica
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres

# emulate some typical long transactions up to 60 seconds on primary
echo "\set t random(0, 60)
BEGIN;
select txid_current();
select pg_sleep(:t);
COMMIT;" > slow_transactions.bench
pgbench -f /home/nkey/pg/slow_transactions.bench -p 5432 -j 1 -c 10 -n
-P 1 -T 18000 -U postgres postgres

* RESULTS

*REL_13_STABLE* - 23.02% vs 0.76%

non-patched:
  23.02%  postgres  [.] KnownAssignedXidsGetAndSetXmin
   2.56%  postgres  [.] base_yyparse
   2.15%  postgres  [.] AllocSetAlloc
   1.68%  postgres  [.] MemoryContextAllocZeroAligned
   1.51%  postgres  [.] hash_search_with_hash_value
   1.26%  postgres  [.] SearchCatCacheInternal
   1.03%  postgres  [.] hash_bytes
   0.92%  postgres  [.] pg_checksum_block
   0.89%  postgres  [.] expression_tree_walker
   0.81%  postgres  [.] core_yylex
   0.69%  postgres  [.] palloc
   0.68%  [kernel]  [k] do_syscall_64
   0.59%  postgres  [.] _bt_compare
   0.54%  postgres  [.] new_list

patched:
   3.09%  postgres  [.] base_yyparse
   3.00%  postgres  [.] AllocSetAlloc
   2.01%  postgres  [.] MemoryContextAllocZeroAligned
   1.89%  postgres  [.] SearchCatCacheInternal
   1.80%  postgres  [.] hash_search_with_hash_value
   1.27%  postgres  [.] expression_tree_walker
   1.27%  postgres  [.] pg_checksum_block
   1.18%  postgres  [.] hash_bytes
   1.10%  postgres  [.] core_yylex
   0.96%  [kernel]  [k] do_syscall_64
   0.86%  postgres  [.] palloc
   0.84%  postgres  [.] _bt_compare
   0.79%  postgres  [.] new_list
   0.76%  postgres  [.] KnownAssignedXidsGetAndSetXmin

*MASTER* - 6.16% vs ~0%
(includes snapshot scalability optimization by Andres Freund (1))

non-patched:
   6.16%  postgres  [.] KnownAssignedXidsGetAndSetXmin
   3.05%  postgres  [.] AllocSetAlloc
   2.59%  postgres  [.] base_yyparse
   1.95%  postgres  [.] hash_search_with_hash_value
   1.87%  postgres  [.] MemoryContextAllocZeroAligned
   1.85%  postgres  [.] SearchCatCacheInternal
   1.27%  postgres  [.] hash_bytes
   1.16%  postgres  [.] expression_tree_walker
   1.06%  postgres  [.] core_yylex
   0.94%  [kernel]  [k] do_syscall_64

patched:
   3.35%  postgres  [.] base_yyparse
   2.84%  postgres  [.] AllocSetAlloc
   1.89%  postgres  [.] hash_search_with_hash_value
   1.82%  postgres  [.] MemoryContextAllocZeroAligned
   1.79%  postgres  [.] SearchCatCacheInternal
   1.49%  postgres  [.] pg_checksum_block
   1.26%  postgres  [.] hash_bytes
   1.26%  postgres  [.] expression_tree_walker
   1.08%  postgres  [.] core_yylex
   1.04%  [kernel]  [k] do_syscall_64
   0.81%  postgres  [.] palloc

Looks like it is possible to get a significant TPS increase on a very
typical standby workload.
Currently, I have no environment to measure TPS accurately. Could you
please try it on yours?

I have attached two versions of the patch - for master and REL_13_STABLE.
Also, I am going to add a patch to commitfest (2).

Thanks,
MIchail.

(1): https://commitfest.postgresql.org/29/2500/
(2): https://commitfest.postgresql.org/34/3271/
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index c7816fcfb3..27486c096b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -267,6 +267,7 @@ static PGPROC *allProcs;
  */
 static TransactionId *KnownAssignedXids;
 static bool *KnownAssignedXidsValid;
+static pg_atomic_uint32 *KnownAssignedXidsNext;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -405,6 +406,7 @@ void
 CreateSharedProcArray(void)
 {
 	bool		found;
+	int			i;
 
 	/* Create or attach to the ProcArray shared structure */
 	procArray = (ProcArrayStruct *)
@@ -446,6

Re: Slow standby snapshot

2021-07-11 Thread Michail Nikolaev
Hello, Kirill.

> Also, maybe it is better to reduce the invasiveness by using a simpler
> approach. For example, use the first bit to mark the xid as valid
> and the last 7 bits (128 values) as an optimistic offset to the next
> valid xid (jump by 127 steps in the worst scenario).
> What do you think?

I have tried such an approach, but it looks like it is not effective,
probably because of CPU caching issues.
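
For reference, the rejected single-byte encoding would have looked roughly
like this (illustrative only; all names are invented):

#include <stdint.h>

#define KAX_VALID_BIT	0x80	/* first bit: entry is valid */
#define KAX_OFFSET_MASK	0x7F	/* last 7 bits: optimistic skip, up to 127 */

static inline int
kax_is_valid(uint8_t b)
{
	return (b & KAX_VALID_BIT) != 0;
}

static inline int
kax_next_offset(uint8_t b)
{
	int			off = b & KAX_OFFSET_MASK;

	return off > 0 ? off : 1;	/* never let the scan stall */
}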

I have looked again at your patch; it seems like it has a lot of
issues at the moment:

* error in KnownAssignedXidsGetOldestXmin, `i` is uninitialized, logic is wrong
* error in compressing function
(```KnownAssignedXidsValidDLL[compress_index].prv = prv;```, `prv` is
never updated)
* probably other errors?
* compilation warnings
* the logic around `KAX_DLL_ENTRY_INVALID` looks a little complex
* variable/method placement is bad (see `KAX_E_INVALID` and others)
* need to update documentation about KnownAssignedXidsValid, see ```To
keep individual deletions cheap, we need to allow gaps in the array```
in procarray.c
* formatting is broken

Do you have plans to update it? If not, I could try to rewrite it.

Also, what about adding the patch to the commitfest?

Thanks,
Michail.




Re: Slow standby snapshot

2021-06-13 Thread Michail Nikolaev
Hello.

> I recently ran into a problem in one of our production postgresql clusters.
> I had noticed lock contention on procarray lock on standby, which causes WAL
> replay lag growth.

Yes, I saw the same issue on my production cluster.

> 1) set max_connections to big number, like 10

I made the tests with a more realistic value - 5000. It is a valid value
for Amazon RDS, for example (the default is
LEAST({DBInstanceClassMemory/9531392}, 5000)).

The test looks like this:

pgbench -i -s 10 -U postgres -d postgres
pgbench -b select-only -p 6543 -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
pgbench -b simple-update -j 1 -c 50 -n -P 1 -T 18000 -U postgres postgres
long transaction on primary - begin;select txid_current();
perf top -p 

So, on Postgres 14 (master) the non-patched version looks like this:

   5.13%  postgres  [.] KnownAssignedXidsGetAndSetXmin
   4.61%  postgres  [.] pg_checksum_block
   2.54%  postgres  [.] AllocSetAlloc
   2.44%  postgres  [.] base_yyparse

It is too much to spend 5-6% of CPU running through an array :) I think
it should be fixed for both the 13 and 14 versions.

The patched version looks like this (I was unable to spot
KnownAssignedXidsGetAndSetXmin):

   3.08%  postgres  [.] pg_checksum_block
   2.89%  postgres  [.] AllocSetAlloc
   2.66%  postgres  [.] base_yyparse
   2.00%  postgres  [.] MemoryContextAllocZeroAligned

On Postgres 13 the non-patched version looks even worse (it definitely needs
to be fixed in my opinion):

  26.44%  postgres  [.] KnownAssignedXidsGetAndSetXmin
   2.17%  postgres  [.] base_yyparse
   2.01%  postgres  [.] AllocSetAlloc
   1.55%  postgres  [.] MemoryContextAllocZeroAligned

But your patch does not apply to REL_13_STABLE. Could you please
provide two versions?

Also, there are warnings while building with the patch:

procarray.c:4595:9: warning: ISO C90 forbids mixed
declarations and code [-Wdeclaration-after-statement]
4595 | int prv = -1;
 | ^~~
procarray.c: In function ‘KnownAssignedXidsGetOldestXmin’:
procarray.c:5056:5: warning: variable ‘tail’ set but not used
[-Wunused-but-set-variable]
5056 | tail;
 | ^~~~
procarray.c:5067:38: warning: ‘i’ is used uninitialized in
this function [-Wuninitialized]
5067 | i = KnownAssignedXidsValidDLL[i].nxt;


Some of them are clear errors, so please recheck the code.

Also, maybe it is better to reduce the invasiveness by using a simpler
approach. For example, use the first bit to mark the xid as valid
and the last 7 bits (128 values) as an optimistic offset to the next
valid xid (jump by 127 steps in the worst scenario).
What do you think?

Also, it is a good idea to register the patch in the commitfest app
(https://commitfest.postgresql.org/).

Thanks,
Michail.




Re: Slow standby snapshot

2021-05-20 Thread Kirill Reshke
sorry, forgot to add a patch to the letter



On Thu, 20 May 2021 at 13:52, Кирилл Решке wrote:

> Hi,
> I recently ran into a problem in one of our production postgresql clusters.
> I had noticed lock contention on procarray lock on standby, which causes
> WAL replay lag growth.
> To reproduce this, you can do the following:
>
> 1) set max_connections to big number, like 10
> 2) begin a transaction on primary
> 3) start pgbench workload on primary and on standby
>
> After a while it will be possible to see KnownAssignedXidsGetAndSetXmin in
> perf top consuming about 75% of CPU.
>
> %%
>   PerfTop:1060 irqs/sec  kernel: 0.0%  exact:  0.0% [4000Hz cycles:u],
>  (target_pid: 273361)
>
> ---
>
> 73.92%  postgres   [.] KnownAssignedXidsGetAndSetXmin
>  1.40%  postgres   [.] base_yyparse
>  0.96%  postgres   [.] LWLockAttemptLock
>  0.84%  postgres   [.] hash_search_with_hash_value
>  0.84%  postgres   [.] AtEOXact_GUC
>  0.72%  postgres   [.] ResetAllOptions
>  0.70%  postgres   [.] AllocSetAlloc
>  0.60%  postgres   [.] _bt_compare
>  0.55%  postgres   [.] core_yylex
>  0.42%  libc-2.27.so   [.] __strlen_avx2
>  0.23%  postgres   [.] LWLockRelease
>  0.19%  postgres   [.] MemoryContextAllocZeroAligned
>  0.18%  postgres   [.] expression_tree_walker.part.3
>  0.18%  libc-2.27.so   [.] __memmove_avx_unaligned_erms
>  0.17%  postgres   [.] PostgresMain
>  0.17%  postgres   [.] palloc
>  0.17%  libc-2.27.so   [.] _int_malloc
>  0.17%  postgres   [.] set_config_option
>  0.17%  postgres   [.] ScanKeywordLookup
>  0.16%  postgres   [.] _bt_checkpage
>
> %%
>
>
> We have tried to fix this by using a BitMapSet instead of the boolean array
> KnownAssignedXidsValid, but this does not help much.
>
> Instead, using a doubly linked list helps a little more; we got +1000 tps
> on the pgbench workload with the patched postgresql. The general idea of this patch
> is that, instead of memorizing which elements in KnownAssignedXids are
> valid, let's maintain a doubly linked list of them. This solution works
> in exactly the same way, except that taking a snapshot on the replica is
> now O(running transactions) instead of O(head - tail), which is significantly
> faster under some workloads. The patch helps to reduce CPU usage of
> KnownAssignedXidsGetAndSetXmin to ~48% instead of ~74%, but does not eliminate
> it from perf top.
>
> The problem is better reproduced on PG13 since PG14 has some snapshot
> optimization.
>
> Thanks!
>
> Best regards, reshke
>
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5ff8cab394..165cf56ea9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -255,9 +255,18 @@ static PGPROC *allProcs;
  * Bookkeeping for tracking emulated transactions in recovery
  */
 static TransactionId *KnownAssignedXids;
-static bool *KnownAssignedXidsValid;
+
+typedef struct {
+	int prv;
+	int nxt;
+} KnownAssignedXidValidNode;
+
+// Doubly linked list of valid xids
+static KnownAssignedXidValidNode *KnownAssignedXidsValidDLL;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
+
+
 /*
  * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is
  * the highest xid that might still be running that we don't have in
@@ -348,6 +357,9 @@ static void GlobalVisUpdateApply(ComputeXidHorizonsResult *horizons);
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
  */
+
+static KnownAssignedXidValidNode KAX_E_INVALID;
+
 Size
 ProcArrayShmemSize(void)
 {
@@ -380,13 +392,20 @@ ProcArrayShmemSize(void)
 		size = add_size(size,
 		mul_size(sizeof(TransactionId),
  TOTAL_MAX_CACHED_SUBXIDS));
-		size = add_size(size,
-		mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS));
+size = add_size(size,
+		mul_size(sizeof(KnownAssignedXidValidNode), TOTAL_MAX_CACHED_SUBXIDS));
 	}
 
+KAX_E_INVALID.prv = -1;
+KAX_E_INVALID.nxt = TOTAL_MAX_CACHED_SUBXIDS;
+
 	return size;
 }
 
+#define KAX_DLL_INDEX_VALID(i) (-1 < i && i < TOTAL_MAX_CACHED_SUBXIDS)
+#define KAX_DLL_ENTRY_INVALID(i) (-1 == KnownAssignedXidsValidDLL[i].prv && KnownAssignedXidsValidDLL[i].nxt == TOTAL_MAX_CACHED_SUBXIDS)
+
+
 /*
  * Initialize the shared PGPROC array during postmaster startup.
  */
@@ -431,10 +450,15 @@ CreateSharedProcArray(void)
 			mul_size(sizeof(TransactionId),
 	 TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
-		KnownAssignedXidsValid = (bool *)
-			ShmemInitStruct("KnownAssignedXidsValid",
-			mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS),
+		
+KnownAssignedXidsValidDLL = (KnownAssignedXidValidNode*)
+			ShmemInitStruct("KnownAssignedXidsValidDLL",
+			mul_size(sizeof(KnownAssignedXidValidNode), TOTAL_MAX_CACHED_SUBXIDS),
 			&found);
+
+

Slow standby snapshot

2021-05-20 Thread Кирилл Решке
Hi,
I recently ran into a problem in one of our production postgresql clusters.
I had noticed lock contention on procarray lock on standby, which causes
WAL replay lag growth.
To reproduce this, you can do the following:

1) set max_connections to big number, like 10
2) begin a transaction on primary
3) start pgbench workload on primary and on standby

After a while it will be possible to see KnownAssignedXidsGetAndSetXmin in
perf top consuming about 75% of CPU.

%%
  PerfTop:1060 irqs/sec  kernel: 0.0%  exact:  0.0% [4000Hz cycles:u],
 (target_pid: 273361)
---

73.92%  postgres   [.] KnownAssignedXidsGetAndSetXmin
 1.40%  postgres   [.] base_yyparse
 0.96%  postgres   [.] LWLockAttemptLock
 0.84%  postgres   [.] hash_search_with_hash_value
 0.84%  postgres   [.] AtEOXact_GUC
 0.72%  postgres   [.] ResetAllOptions
 0.70%  postgres   [.] AllocSetAlloc
 0.60%  postgres   [.] _bt_compare
 0.55%  postgres   [.] core_yylex
 0.42%  libc-2.27.so   [.] __strlen_avx2
 0.23%  postgres   [.] LWLockRelease
 0.19%  postgres   [.] MemoryContextAllocZeroAligned
 0.18%  postgres   [.] expression_tree_walker.part.3
 0.18%  libc-2.27.so   [.] __memmove_avx_unaligned_erms
 0.17%  postgres   [.] PostgresMain
 0.17%  postgres   [.] palloc
 0.17%  libc-2.27.so   [.] _int_malloc
 0.17%  postgres   [.] set_config_option
 0.17%  postgres   [.] ScanKeywordLookup
 0.16%  postgres   [.] _bt_checkpage

%%


We have tried to fix this by using a BitMapSet instead of the boolean array
KnownAssignedXidsValid, but this does not help much.

Instead, using a doubly linked list helps a little more; we got +1000 tps
on the pgbench workload with the patched postgresql. The general idea of this patch
is that, instead of memorizing which elements in KnownAssignedXids are
valid, let's maintain a doubly linked list of them. This solution works
in exactly the same way, except that taking a snapshot on the replica is
now O(running transactions) instead of O(head - tail), which is significantly
faster under some workloads. The patch helps to reduce CPU usage of
KnownAssignedXidsGetAndSetXmin to ~48% instead of ~74%, but does not eliminate
it from perf top.
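
For reference, the list manipulation this relies on can be sketched as follows
(it uses the KnownAssignedXidValidNode struct from the patch posted in the
follow-up; the helper name is invented and locking is elided):

/*
 * Invented helper illustrating the O(1) unlink.  In the patch, prv == -1
 * and nxt == TOTAL_MAX_CACHED_SUBXIDS together mark an entry as invalid.
 */
static void
KnownAssignedXidsUnlink(KnownAssignedXidValidNode *nodes, int i)
{
	int			prv = nodes[i].prv;
	int			nxt = nodes[i].nxt;

	if (prv >= 0)
		nodes[prv].nxt = nxt;
	if (nxt < TOTAL_MAX_CACHED_SUBXIDS)
		nodes[nxt].prv = prv;

	/* Mark the removed entry itself as invalid. */
	nodes[i].prv = -1;
	nodes[i].nxt = TOTAL_MAX_CACHED_SUBXIDS;
}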

The problem is better reproduced on PG13 since PG14 has some snapshot
optimization.

Thanks!

Best regards, reshke