Re: IDEA: pg_stat_statements tracking utility statements by tag?

2020-07-29 Thread Martijn van Oosterhout
On Wed, 29 Jul 2020 at 15:40, Julien Rouhaud  wrote:

> On Wed, Jul 29, 2020 at 2:42 PM Fujii Masao 
> wrote:
> >
> >
> > Or, we should extend the existing query normalization to handle also DDL?
>
> +1, introducing DDL normalization seems like a better way to go in the
> long run.  Defining what should and shouldn't be normalized can be
> tricky though.
>

In principle, the only things that really need to be normalised are
SAVEPOINT and cursor names, which are essentially random strings that have
no effect on the result. Most other stuff is material to the query.

That said, I think "aggregate by tag" has value all by itself. Being able
to collapse all CREATE TABLE statements into a single line can be useful in
some situations.

Ideally the results of FETCH "cursor" should be combined with the DECLARE,
but I really don't know how to go about that.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/


IDEA: pg_stat_statements tracking utility statements by tag?

2020-07-29 Thread Martijn van Oosterhout
Hoi hackers,

We've been using the pg_stat_statements extension to get an idea of the
queries used in the database, but the table is being filled with entries
like:

SAVEPOINT sa_savepoint_NNN;
RELEASE SAVEPOINT sa_savepoint_NNN;
DECLARE "c_7f9451c4dcc0_5" CURSOR WITHOUT HOLD ...
FETCH FORWARD 250 FROM "c_7f9451b03908_5"

Since the unique id is different for each query, the aggregation does
nothing and there are quite a lot of these drowning out the normal queries
(yes, I'm aware this is an issue in itself). The only way to deal with this
is "pg_stat_statements.track_utility=off". However, it occurs to me that if
you just tracked the tags rather than the full query text you could at
least track the number of such queries and how much time they take. So the
above queries would be tracked under SAVEPOINT, RELEASE, DECLARE CURSOR and
(I guess) FETCH respectively. But it would also catch DDL.
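
In the meantime, a rough client-side workaround is to collapse the
generated names before aggregating, along the lines of the query below.
The regexes just match the entries shown above, and the time column is
called total_exec_time on newer pg_stat_statements versions, so treat this
as a sketch rather than a recipe:

SELECT regexp_replace(query,
                      'sa_savepoint_[0-9]+|"c_[0-9a-f]+_[0-9]+"',
                      '?', 'g')   AS normalized_query,
       sum(calls)                 AS calls,
       sum(total_time)            AS total_time
FROM pg_stat_statements
GROUP BY 1
ORDER BY 3 DESC;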

Does this sound like something for which a patch would be accepted?

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/


Re: LISTEN/NOTIFY testing woes

2019-11-24 Thread Martijn van Oosterhout
Hoi Tom,

On Sun, 24 Nov 2019 at 02:01, Tom Lane  wrote:
>
> This seems both undesirable for our own testing, and rather bogus
> from users' standpoints as well.  However, I think a simple fix is
> available: just make the SQL pg_notification_queue_usage() function
> advance the queue tail before measuring, as in 0002 below.  This will
> restore the behavior of that function to what it was before 51004c717,
> and it doesn't seem like it'd cost any performance in any plausible
> use-cases.

This was one of those open points in the previous patches where it
wasn't quite clear what the correct behaviour should be. This fixes
the issue, but the question in my mind is: do we want to document this
fact, and can users rely on this behaviour? If we go with the argument
that the delay in cleaning up should be entirely invisible, then I
guess this patch is the right one, since it makes those changes
invisible. Arguably, not doing this means we'd have to document that
the values may be out of date.

So I think this patch does the right thing.
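
For reference, the behaviour under discussion is easy to poke at from psql;
the channel name and payload below are arbitrary:

-- session 1: make sure there is at least one listener
LISTEN queue_demo;

-- session 2: push a burst of notifications through the queue, then measure
SELECT pg_notify('queue_demo', 'x' || g) FROM generate_series(1, 10000) g;
SELECT pg_notification_queue_usage();
-- with 0002 the tail is advanced before measuring, so once the listener
-- has caught up this reports (close to) zero rather than a stale value
-- left over from the now-lazy cleanup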

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-17 Thread Martijn van Oosterhout
Hoi Tom,

On Mon, 16 Sep 2019 at 15:33, Tom Lane  wrote:
>
> Martijn van Oosterhout  writes:
> > I think I like the idea of having SignalBackends do the waking up of a
> > slow backend, but I'm not enthused by the "let's wake up (at once)
> > everyone that is behind". That's one of the issues I was explicitly
> > trying to solve. If there are any significant number of "slow"
> > backends then we get the "thundering herd" again.
>
> But do we care?  With asyncQueueAdvanceTail gone from the listeners,
> there's no longer an exclusive lock for them to contend on.  And,
> again, I failed to see any significant contention even in HEAD as it
> stands; so I'm unconvinced that you're solving a live problem.

You're right, they only acquire a shared lock, which is much less of a
problem. And I forgot that we're still reducing the load from a few
hundred signals and exclusive locks per NOTIFY to perhaps a dozen
shared locks every thousand messages. You'd be hard-pressed to
demonstrate there's a real problem here.

So I think your patch is fine as is.

Looking at the release cycle it looks like the earliest either of
these patches will appear in a release is PG13, right?

Thanks again.
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-16 Thread Martijn van Oosterhout
Hoi Tom,

On Mon, 16 Sep 2019 at 00:14, Tom Lane  wrote:
>
> I spent some more time thinking about this, and I'm still not too
> satisfied with this patch's approach.  It seems to me the key insights
> we're trying to make use of are:
>
> 1. We don't really need to keep the global tail pointer exactly
> up to date.  It's bad if it falls way behind, but a few pages back
> is fine.

Agreed.

> 2. When sending notifies, only listening backends connected to our
> own database need be awakened immediately.  Backends connected to
> other DBs will need to advance their queue pointer sometime, but
> again it doesn't need to be right away.

Agreed.

> 3. It's bad for multiple processes to all be trying to do
> asyncQueueAdvanceTail concurrently: they'll contend for exclusive
> access to the AsyncQueueLock.  Therefore, having the listeners
> do it is really the wrong thing, and instead we should do it on
> the sending side.

Agreed, but I'd add that in databases that are largely idle there may
never be a sender, so those listeners need to be caught up some other
way.

> However, the patch as presented doesn't go all the way on point 3,
> instead having listeners maybe-or-maybe-not do asyncQueueAdvanceTail
> in asyncQueueReadAllNotifications.  I propose that we should go all
> the way and just define tail-advancing as something that happens on
> the sending side, and only once every few pages.  I also think we
> can simplify the handling of other-database listeners by including
> them in the set signaled by SignalBackends, but only if they're
> several pages behind.  So that leads me to the attached patch;
> what do you think?

I think I like the idea of having SignalBackends do the waking up of a
slow backend, but I'm not enthused by the "let's wake up (at once)
everyone that is behind". That's one of the issues I was explicitly
trying to solve. If there are any significant number of "slow"
backends then we get the "thundering herd" again. If the number of
slow backends exceeds the number of cores then commits across the
system could be held up for quite a while (which is what prompted this
patch; delays of multiple seconds were not unusual).

The maybe/maybe-not in asyncQueueReadAllNotifications is "if I was
behind, then I probably got woken up, hence I need to wake up someone
else". This ensures the cleanup proceeds in an orderly fashion, leaving
gaps where the lock isn't held so COMMITs can proceed.

> BTW, in my hands it seems like point 2 (skip wakening other-database
> listeners) is the only really significant win here, and of course
> that only wins when the notify traffic is spread across a fair number
> of databases.  Which I fear is not the typical use-case.  In single-DB
> use-cases, point 2 helps not at all.  I had a really hard time measuring
> any benefit from point 3 --- I eventually saw a noticeable savings
> when I tried having one notifier and 100 listen-only backends, but
> again that doesn't seem like a typical use-case.  I could not replicate
> your report of lots of time spent in asyncQueueAdvanceTail's lock
> acquisition.  I wonder whether you're using a very large max_connections
> setting and we already fixed most of the problem with that in bca6e6435.
> Still, this patch doesn't seem to make any cases worse, so I don't mind
> if it's just improving unusual use-cases.

I'm not sure if it's an unusual use-case, but it is my use-case :).
Specifically, there are 100+ instances of the same application running
on the same cluster with wildly different usage patterns. Some will be
idle because no-one is logged in, some will be quite busy. Although
there are only 2 listeners per database, that's still a lot of
listeners that can be behind. Though I agree that bca6e6435 will have
mitigated quite a lot (yes, max_connections is quite high). Another
mitigation would be to spread across more smaller database clusters,
which we need to do anyway.

That said, your approach is conceptually simpler, which is also worth
something, and it gets essentially all the same benefits for more
normal use cases. If QUEUE_CLEANUP_DELAY were raised a bit we could
mitigate the rest on the client side by having idle databases send a
dummy NOTIFY every now and then to trigger cleanup for their database
(sketched below). The flip side is that slow backends will then have
further to catch up, and thus hold the lock for longer. It's not worth
making it configurable, so we have to guess; 16 is perhaps a good
compromise.
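
Something like the following, run periodically (e.g. from cron) against
each mostly-idle database, would be enough. The channel name is arbitrary
and nothing needs to LISTEN on it; the point is just that a committed
NOTIFY from that database nudges its listeners (and the sender) to catch
up so the tail can eventually be advanced:

SELECT pg_notify('queue_keepalive', '');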

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-15 Thread Martijn van Oosterhout
On Sat, 14 Sep 2019 at 17:08, Tom Lane  wrote:
> Martijn van Oosterhout  writes:
> > On Fri, 13 Sep 2019 at 22:04, Tom Lane  wrote:
> >> But, really ... do we need the backendTryAdvanceTail flag at all?

> None of this seems to respond to my point: it looks to me like it would
> work fine if you simply dropped the patch's additions in PreCommit_Notify
> and ProcessCompletedNotifies, because there is already enough logic to
> decide when to call asyncQueueAdvanceTail.  In particular, the result from
> Signal[MyDB]Backends tells us whether anyone else was awakened, and
> ProcessCompletedNotifies already does asyncQueueAdvanceTail if not.
> As long as we did awaken someone, the ball's now in their court to
> make sure asyncQueueAdvanceTail happens eventually.

Ah, I think I see what you're getting at. As written,
asyncQueueReadAllNotifications() only calls asyncQueueAdvanceTail() if
*it* was a slow backend (advanceTail =
QUEUE_SLOW_BACKEND(MyBackendId)). In a situation where some databases
regularly use NOTIFY and a few others never do (but are still
listening), this can lead to the tail never being advanced.

However, I guess you're thinking of asyncQueueReadAllNotifications()
triggering if the queue as a whole was too long. This could in
principle work, but it does mean that at some point all backends
sending NOTIFY are going to start calling asyncQueueAdvanceTail()
every time until the tail gets advanced, and if there are many idle
listening backends behind, this could take a while. The slowest backend
might receive more signals while it is processing and so end up
running asyncQueueAdvanceTail() twice. The fact that signals coalesce
stops the process from getting completely out of hand, but it does feel
a little uncontrolled.

The whole point of this patch is to ensure that only one backend at a
time is woken up to call asyncQueueAdvanceTail().

But you do point out that the return value of SignalMyDBBackends() is
used wrongly. The fact that no-one got signalled only means there were
no other listeners on this database, which says nothing about global
queue cleanup. What you actually want to know is whether you're the
only listener in the whole system, and you can test for that directly
(QUEUE_FIRST_BACKEND == MyBackendId &&
QUEUE_NEXT_BACKEND(MyBackendId) == InvalidBackendId). I can adjust
this in the next version if necessary; it's fairly harmless as is,
since it only triggers when a database is notifying only itself, which
probably isn't that common.

I hope I have correctly understood this time.

Have a nice weekend.
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-14 Thread Martijn van Oosterhout
Hoi Tom,


On Fri, 13 Sep 2019 at 22:04, Tom Lane  wrote:
>
> This throws multiple compiler warnings for me:

Fixed.

> Also, I don't exactly believe this bit:
[snip]
> It seems unlikely that insertion would stop exactly at a page boundary,
> but that seems to be what this is looking for.

This is how asyncQueueAddEntries() works. Entries are never split over
pages. If there is not enough room, then it advances to the beginning
of the next page and returns. Hence here the offset is zero. I could
set the global inside asyncQueueAddEntries() but that seems icky.
Another alternative is to have asyncQueueAddEntries() return a boolean
"moved to new page", but that's just a long-winded way of doing what
it does now.

> But, really ... do we need the backendTryAdvanceTail flag at all?
> I'm dubious, because it seems like asyncQueueReadAllNotifications
> would have already covered the case if we're listening.  If we're
> not listening, but we signalled some other listeners, it falls
> to them to kick us if we're the slowest backend.  If we're not the
> slowest backend then doing asyncQueueAdvanceTail isn't useful.

There are multiple issues here. asyncQueueReadAllNotifications() is
going to be called by each listener simultaneously, so each listener
is going to come to the same conclusion. On the other side, there is
no guarantee we wake up anyone as a result of the NOTIFY, e.g. if
there are no listeners in the current database. To be sure the tail
advance is attempted, you have to trigger it on the sending side. The
global is there because at the point we are inserting entries we are
still in a user transaction, potentially holding many table locks (the
issue we were running into in the first place). By setting
backendTryAdvanceTail we can move the work to
ProcessCompletedNotifies(), which runs after the transaction has
committed and the locks have been released.

> I agree with getting rid of the asyncQueueAdvanceTail call in
> asyncQueueUnregister; on reflection doing that there seems pretty unsafe,
> because we're not necessarily in a transaction and hence anything that
> could possibly error is a bad idea.  However, it'd be good to add a
> comment explaining that we're not doing that and why it's ok not to.

Comment added.

> I'm fairly unimpressed with the "kick a random slow backend" logic.
> There can be no point in kicking any but the slowest backend, ie
> one whose pointer is exactly the oldest.  Since we're already computing
> the min pointer in that loop, it would actually take *less* logic inside
> the loop to remember the/a backend that had that pointer value, and then
> decide afterwards whether it's slow enough to merit a kick.

Adjusted this. I'm not sure it's actually clearer this way, but it is
less work inside the loop. A small change is that now it won't signal
anyone if this backend is the slowest, which is more correct.

Thanks for the feedback. Attached is version 3.

Have a nice weekend,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From 539d97b47c4535314c23df22e5e87ecc43149f3a Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout 
Date: Sat, 14 Sep 2019 11:01:11 +0200
Subject: [PATCH 1/2] Improve performance of async notifications

Advancing the tail pointer requires an exclusive lock which can block
backends from other databases, so it's worth keeping these attempts to a
minimum.

Instead of tracking the slowest backend exactly we update the queue more
lazily, only checking when we switch to a new SLRU page.  Additionally,
instead of waking up every slow backend at once, we do them one at a time.
---
 src/backend/commands/async.c | 167 ++-
 1 file changed, 124 insertions(+), 43 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index f26269b5ea..ffd7c7e90b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -73,10 +73,11 @@
  *	  Finally, after we are out of the transaction altogether, we check if
  *	  we need to signal listening backends.  In SignalBackends() we scan the
  *	  list of listening backends and send a PROCSIG_NOTIFY_INTERRUPT signal
- *	  to every listening backend (we don't know which backend is listening on
- *	  which channel so we must signal them all). We can exclude backends that
- *	  are already up to date, though.  We don't bother with a self-signal
- *	  either, but just process the queue directly.
+ *	  to every listening backend for the relevant database (we don't know
+ *	  which backend is listening on which channel so we must signal them
+ *	  all).  We can exclude backends that are already up to date, though.
+ *	  We don't bother with a self-signal either, but just process the queue
+ *	  directly.
  *
  * 5. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
  *	  sets the process's latch, which triggers the event to be processed
@@ -89,13 +90,25 @@
  *	  Inbound-notify

Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-11 Thread Martijn van Oosterhout
Hoi Tom,


On Wed, 11 Sep 2019 at 00:18, Tom Lane  wrote:

>
> I pushed 0001 after doing some hacking on it --- it was sloppy about
> datatypes, and about whether the invalid-entry value is 0 or -1,
> and it was just wrong about keeping the list in backendid order.
> (You can't conditionally skip looking for where to put the new
> entry, if you want to maintain the order.  I thought about just
> defining the list as unordered, which would simplify joining the
> list initially, but that could get pretty cache-unfriendly when
> there are lots of entries.)
>
> 0002 is now going to need a rebase, so please do that.
>
>
Thanks for this, and good catch. Looks like I didn't test the first patch
by itself very well.

Here is the rebased second patch.

Thanks in advance,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From bc4b1b458564f758b7fa1c1f7b0397aade71db06 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout 
Date: Mon, 3 Jun 2019 17:13:31 +0200
Subject: [PATCH 1/2] Improve performance of async notifications

Advancing the tail pointer requires an exclusive lock which can block
backends from other databases, so it's worth keeping these attempts to a
minimum.

Instead of tracking the slowest backend exactly we update the queue more
lazily, only checking when we switch to a new SLRU page.  Additionally,
instead of waking up every slow backend at once, we do them one at a time.
---
 src/backend/commands/async.c | 142 +--
 1 file changed, 101 insertions(+), 41 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index f26269b5ea..b9dd0ca139 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -73,10 +73,11 @@
  *	  Finally, after we are out of the transaction altogether, we check if
  *	  we need to signal listening backends.  In SignalBackends() we scan the
  *	  list of listening backends and send a PROCSIG_NOTIFY_INTERRUPT signal
- *	  to every listening backend (we don't know which backend is listening on
- *	  which channel so we must signal them all). We can exclude backends that
- *	  are already up to date, though.  We don't bother with a self-signal
- *	  either, but just process the queue directly.
+ *	  to every listening backend for the relevant database (we don't know
+ *	  which backend is listening on which channel so we must signal them
+ *	  all).  We can exclude backends that are already up to date, though.
+ *	  We don't bother with a self-signal either, but just process the queue
+ *	  directly.
  *
  * 5. Upon receipt of a PROCSIG_NOTIFY_INTERRUPT signal, the signal handler
  *	  sets the process's latch, which triggers the event to be processed
@@ -89,13 +90,25 @@
  *	  Inbound-notify processing consists of reading all of the notifications
  *	  that have arrived since scanning last time. We read every notification
  *	  until we reach either a notification from an uncommitted transaction or
- *	  the head pointer's position. Then we check if we were the laziest
- *	  backend: if our pointer is set to the same position as the global tail
- *	  pointer is set, then we move the global tail pointer ahead to where the
- *	  second-laziest backend is (in general, we take the MIN of the current
- *	  head position and all active backends' new tail pointers). Whenever we
- *	  move the global tail pointer we also truncate now-unused pages (i.e.,
- *	  delete files in pg_notify/ that are no longer used).
+ *	  the head pointer's position.
+ *
+ * 6. To avoid SLRU wraparound and minimize disk space the tail pointer
+ *	  needs to be advanced so that old pages can be truncated.  This
+ *	  however requires an exclusive lock and as such should be done
+ *	  infrequently.
+ *
+ *	  When a new notification is added, the writer checks to see if the
+ *	  tail pointer is more than QUEUE_CLEANUP_DELAY pages behind.  If
+ *	  so, it attempts to advance the tail, and if there are slow
+ *	  backends (perhaps because all the notifications were for other
+ *	  databases), wakes one of them up by sending a signal.
+ *
+ *	  When the slow backend processes the queue it notes it was behind
+ *	  and so also tries to advance the tail, possibly waking up another
+ *	  slow backend.  Eventually all backends will have processed the
+ *	  queue and the global tail pointer is moved to a new page and we
+ *	  also truncate now-unused pages (i.e., delete files in pg_notify/
+ *	  that are no longer used).
  *
  * An application that listens on the same channel it notifies will get
  * NOTIFY messages for its own NOTIFYs.  These can be ignored, if not useful,
@@ -211,6 +224,12 @@ typedef struct QueuePosition
 	 (x).page != (y).page ? (x) : \
 	 (x).offset > (y).offset ? (x) : (y))
 
+/* how many pages does a backend need to be behind before it needs to be signalled */
+#define QUEUE_CLEANUP_DELAY 4
+
+/* is a backend so far behind it needs to be signalled? */
+#define QUE

[PATCH] Improve performance of NOTIFY over many databases (v2)

2019-08-02 Thread Martijn van Oosterhout
Hoi hackers,

Here is a reworked version of the previous patches.

The original three patches have been collapsed into one, as given the
changes discussed it didn't make sense to keep them separate. There
are now two patches (plus a third that just helps with testing):

Patch 1: Tracks the listening backends in a list so non-listening
backends can be quickly skipped over. This is separate because it's
orthogonal to the rest of the changes and there are other ways to do
this.

Patch 2: This is the meat of the change. It implements all the
suggestions discussed:

- The queue tail is now only updated lazily, whenever the notify queue
moves to a new page. This did require a new global to track this state
through the transaction commit, but it seems worth it.

- Only backends for the current database are signalled when a
notification is made.

- Slow backends are woken up one at a time rather than all at once.

- A backend is allowed to lag up to 4 SLRU pages behind before being
signalled. This is a tradeoff between how often to get woken up versus
how much work to do once woken up.

- All the relevant comments have been updated to describe the new
algorithm. Locking should also be correct now.

This means that in the normal case, where listening backends get a
notification occasionally, no-one will ever be considered slow. An
exclusive lock for cleanup will happen about once per SLRU page.
There are still the exclusive locks for adding notifications, but
that's unavoidable.

One minor issue is that pg_notification_queue_usage() will now return
a small but non-zero number (about 3e-6) even when nothing is really
going on. This could be fixed by having it take an exclusive lock
instead and updating to the latest values but that barely seems worth
it.

Performance-wise it's even better than my original patches, with about
20-25% reduction in CPU usage in my test setup (using the test script
sent previously).

Here is the log output from my postgres, where you see the signalling in action:

--
16:42:48.673 [10188] martijn@test_131 DEBUG:  PreCommit_Notify
16:42:48.673 [10188] martijn@test_131 DEBUG:  NOTIFY QUEUE = (74,896)...(79,0)
16:42:48.673 [10188] martijn@test_131 DEBUG:  backendTryAdvanceTail -> true
16:42:48.673 [10188] martijn@test_131 DEBUG:  AtCommit_Notify
16:42:48.673 [10188] martijn@test_131 DEBUG:  ProcessCompletedNotifies
16:42:48.673 [10188] martijn@test_131 DEBUG:  backendTryAdvanceTail -> false
16:42:48.673 [10188] martijn@test_131 DEBUG:  asyncQueueAdvanceTail
16:42:48.673 [10188] martijn@test_131 DEBUG:  waking backend 137 (pid 10055)
16:42:48.673 [10055] martijn@test_067 DEBUG:  ProcessIncomingNotify
16:42:48.673 [10187] martijn@test_131 DEBUG:  ProcessIncomingNotify
16:42:48.673 [10055] martijn@test_067 DEBUG:  asyncQueueAdvanceTail
16:42:48.673 [10055] martijn@test_067 DEBUG:  waking backend 138 (pid 10056)
16:42:48.673 [10187] martijn@test_131 DEBUG:  ProcessIncomingNotify: done
16:42:48.673 [10055] martijn@test_067 DEBUG:  ProcessIncomingNotify: done
16:42:48.673 [10056] martijn@test_067 DEBUG:  ProcessIncomingNotify
16:42:48.673 [10056] martijn@test_067 DEBUG:  asyncQueueAdvanceTail
16:42:48.673 [10056] martijn@test_067 DEBUG:  ProcessIncomingNotify: done
16:42:48.683 [9991] martijn@test_042 DEBUG:  Async_Notify(changes)
16:42:48.683 [9991] martijn@test_042 DEBUG:  PreCommit_Notify
16:42:48.683 [9991] martijn@test_042 DEBUG:  NOTIFY QUEUE = (75,7744)...(79,32)
16:42:48.683 [9991] martijn@test_042 DEBUG:  AtCommit_Notify
-

Have a nice weekend.
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From 82366f1dbc0fc234fdd10dbc15519b3cf7104684 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout 
Date: Tue, 23 Jul 2019 16:49:30 +0200
Subject: [PATCH 1/3] Maintain queue of listening backends to speed up loops

In particular, the loop to advance the tail pointer is called fairly often
while holding an exclusive lock.
---
 src/backend/commands/async.c | 45 +++-
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 6e9c580ec6..ba0b1baecc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -217,6 +217,7 @@ typedef struct QueueBackendStatus
 {
 	int32		pid;			/* either a PID or InvalidPid */
 	Oid			dboid;			/* backend's database OID, or InvalidOid */
+	int			nextListener;	/* backendid of next listener, 0=last */
 	QueuePosition pos;			/* backend has read queue up to here */
 } QueueBackendStatus;
 
@@ -247,6 +248,7 @@ typedef struct AsyncQueueControl
 	QueuePosition tail;			/* the global tail is equivalent to the pos of
  * the "slowest" backend */
 	TimestampTz lastQueueFillWarn;	/* time of last queue-full msg */
+	int			firstListener;	/* backendId of first listener, 0=none */
 	QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
 	/* backend[0] is not used; used entries are from [1] to [MaxBackends] */
 } AsyncQueueControl;
@@ -257,8 +

Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-24 Thread Martijn van Oosterhout
On Tue, 23 Jul 2019 at 23:32, Tom Lane  wrote:
>
> Martijn van Oosterhout  writes:
> > I mean tracking the listening backends specifically, so you can
> > replace the loops:
> > for (i=0; i < MaxBackends; i++)
> > with
> > for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = 
> > QUEUE_NEXT_LISTENING_BACKEND(i))
>
> Ah ... but surely you would not put such info in AsyncQueueEntry,
> where there'd be a separate copy for each message.  I think you
> meant to add the info to AsyncQueueControl.

Umm, yeah. Got that mixed up.

> It might be better to redefine the backends[] array as being mostly
> contiguous (ie, a new backend takes the first free slot not the one
> indexed by its own BackendId), at the price of needing to store
> BackendId in each slot explicitly instead of assuming it's equal to
> the array index.  I suspect the existing data structure is putting too
> much of a premium on making sizeof(QueueBackendStatus) a power of 2.

This would require adding a "MyListenerId" to each backend, which I'm not
sure helps readability, and there's a chance of mixing the ids up. The
power-of-2-ness is, I think, indeed overrated.

I'll give it a shot and see how it looks.

Have a nice day,

-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-23 Thread Martijn van Oosterhout
On Tue, 23 Jul 2019 at 19:21, Tom Lane  wrote:
>
> Martijn van Oosterhout  writes:
> > 2. Add a field to AsyncQueueEntry which points to the next listening
> > backend. This would allow the loops over all listening backends to
> > complete much faster, especially in the normal case where there are
> > not many listeners relative to the number of backends. The downside is
> > this requires an exclusive lock to remove listeners, but that doesn't
> > seem a big problem.
>
> I don't understand how that would work?  The sending backend doesn't
> know what the "next listening backend" is.  Having to scan the whole
> queue when a listener unlistens seems pretty awful too, especially
> if you need exclusive lock while doing so.

I mean tracking the listening backends specifically, so you can
replace the loops:

for (i=0; i < MaxBackends; i++)

with

for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = QUEUE_NEXT_LISTENING_BACKEND(i))

Such loops occur often when trying to advance the tail, when adding a new
listener, when sending a notify, etc., all while holding an (exclusive)
lock. It seems like such an easy win to only loop over the listening
backends rather than all of them.

> > Do 2 & 3 seem like a good direction to go? I can probably work something up.
>
> I'm on board with 3, obviously.  Not following what you have in mind
> for 2.

Hope this clears it up a bit. Only waking up one backend at a time is a
good idea, but it needs some careful thinking to prove it actually works.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-23 Thread Martijn van Oosterhout
Hoi Tom,

Thank you for the detailed response. Sorry for the delay, I was on holiday.

You are absolutely correct when you point out that the queue pointers
were accessed without the lock and this is probably unsafe.  For the
first two patches this can be remedied, though I find it makes the
logic a bit harder to follow.  The comments will need to be updated to
reflect the new logic. I hope to post something soon.

As for your point about the third patch, you are right that it's
probably not saving many cycles. However, I do think it's worthwhile
actually optimising this loop, because the number of backends that are
listening is likely to be much smaller than the total number of
backends, so there are a lot of cycles being wasted here already. Fixing
the thundering herd issue (like in sinval, as you point out) doesn't
actually reduce the amount of work being done, it just spreads it out.
Since readers and writers block each other, blocking a writer means
blocking commits across the whole cluster.

There are a number of possible improvements here:

1. Do what sinval does and separate the reader and writer locks so
they can't block each other. This is the ultimate solution, but it's a
significant refactor and it's not clear that's actually worthwhile
here. This would almost be adopting the sinvaladt structure wholesale.

2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends. The downside is
this requires an exclusive lock to remove listeners, but that doesn't
seem a big problem.

3. The other idea from sinval where you only wake up one worker at a
time is a good one, as you point out. This seems quite doable; however,
it seems wasteful to try to wake everyone up the moment we switch to
a new page. The longer you delay, the lower the chance you need to wake
anyone at all, because they'll have caught up by themselves. A single
SLRU page can hold hundreds, or even thousands, of messages.

Do 2 & 3 seem like a good direction to go? I can probably work something up.

Thanks in advance,
Martijn


> Martijn van Oosterhout  writes:
> > Please find attached updated versions of the patches, I've now tested
> > them. Also attached is a reproduction script to verify that they
> > actually work.
>
> I looked through these (a bit cursorily).
>
> I'm generally on board with the idea of 0001, but not with the patch
> details.  As coded, asyncQueueAdvanceTail is supposing that it can
> examine the shared QUEUE_HEAD and QUEUE_TAIL pointers without any
> lock whatsoever.  That's probably unsafe, and if it is safe for some
> reason, you haven't made the argument why.  Moreover, it seems
> unnecessary to make any such assumption.  Why not put back the
> advanceTail tests you removed, but adjust them so that advanceTail
> isn't set true unless QUEUE_HEAD and QUEUE_TAIL point to different
> pages?  (Note that in the existing coding, those tests are made
> while holding an appropriate lock, so it's safe to look at those
> pointers there.)
>
> It might be a good idea to make a macro encapsulating this new,
> more complicated rule for setting advanceTail, instead of relying
> on keeping the various call sites in sync.
>
> More attention to comments is also needed.  For instance, you've
> made a lie out of the documentation of the tail pointer:
>
> QueuePosition tail; /* the global tail is equivalent to the pos of
>  * the "slowest" backend */
>
> It needs to say something like "is <= the pos of the slowest backend",
> instead.  I think the explanation of why this algorithm is good could
> use more effort, too.
>
> Comments for 0002 are about the same: for no explained reason, and
> certainly no savings, you've put the notify_all test in an unsafe
> place rather than a safe one (viz, two lines down, *after* taking
> the relevant lock).  And 0002 needs more commentary about why
> its optimization is safe and useful, too.  In particular it's
> not obvious why QUEUE_HEAD being on a different page from QUEUE_TAIL
> has anything to do with whether we should wake up other backends.
>
> I'm not very persuaded by 0003, mainly because it seems likely to
> me that 0001 and 0002 will greatly reduce the possibility that
> the early-exit can happen.  So it seems like it's adding cycles
> (in a spot where we hold exclusive lock) without a good chance of
> saving any cycles.
>
> Taking a step back in hopes of seeing the bigger picture ...
> as you already noted, these changes don't really fix the "thundering
> herd of wakeups" problem, they just arrange for it to happen
> once per SLRU page rather t

Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-06-05 Thread Martijn van Oosterhout
Hoi hackers,

Please find attached updated versions of the patches; I've now tested
them. Also attached is a reproduction script to verify that they
actually work.

To test you need to create 150 databases as described in the script,
then simply execute it. Before patching you get the following results
(last figure is the CPU usage of Postgres):

1559749330 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg:
0.01 [0.01/0.01/0.01/0.01/0.01], 269.07%
1559749335 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg:
0.01 [0.01/0.01/0.01/0.01/0.01], 268.07%
1559749340 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg:
0.01 [0.01/0.01/0.01/0.01/0.01], 270.94%

After patching you get the following:

1559749840 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.02, Avg:
0.01 [0.01/0.01/0.01/0.01/0.01], 5.09%
1559749845 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg:
0.01 [0.01/0.01/0.01/0.01/0.01], 5.06%
1559749850 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg:
0.01 [0.01/0.01/0.01/0.01/0.01], 4.75%

The async queue functions in postgres also no longer appear in the
perf output (below measuring threshold).
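
The script itself isn't attached here, but the databases in the log above
are simply named test_NNN; 150 of them can be created with a psql snippet
along these lines (illustrative only, adjust to taste):

SELECT format('CREATE DATABASE test_%s', lpad(g::text, 3, '0'))
FROM generate_series(1, 150) AS g \gexec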

As for general method, it seems like the actual optimisation here is
that the async queue tail pointer is only updated once per SLRU page
instead of every message. This would require a significantly larger
patch, but shouldn't be too difficult. Thoughts?

Have a nice day,
Martijn

On Tue, 4 Jun 2019 at 09:08, Martijn van Oosterhout  wrote:
>
> Hoi hackers,
>
> We've been having issues with NOTIFYs blocking over multiple databases
> (see [1] for more details). That was 9.4 but we've updated the
> database to 11.3 and still have the same issue. Now however we could
> use perf to do profiling and got the following profile (useless
> details elided):
>
> --32.83%--ProcessClientReadInterrupt
>--32.68%--ProcessNotifyInterrupt
>   --32.16%--asyncQueueReadAllNotifications
>  --23.37%--asyncQueueAdvanceTail
> --20.49%--LWLockAcquire
>--18.93%--LWLockQueueSelf
>   --12.99%--LWLockWaitListLock
>
> (from: perf record -F 99 -ag -- sleep 600)
>
> That shows that more than 20% of the time is spent in that single
> function, waiting for an exclusive lock on the AsyncQueueLock. This
> will block any concurrent session doing a NOTIFY in any database on
> the system. This would certainly explain the symptoms we're seeing
> (process xxx still waiting for AccessExclusiveLock on object 0 of
> class 1262 of database 0).
>
> Analysis of the code leads me to the following hypothesis (and hence
> to the attached patches):
>
> We have ~150 databases, each of which has 2 active backends with an
> active LISTEN. When a NOTIFY happens anywhere on any database it
> (under an exclusive lock) makes a list of 300 backends to send a
> signal to. It then wakes up all of those backends. Each backend then
> examines the message and all but one discards it as being for the
> wrong database. Each backend then calls asyncQueueAdvanceTail (because
> the current position of each backend was the tail) which then
> takes an exclusive lock and checks all the other backends to see if
> the tail can be advanced. All of these will conclude 'no', except the
> very last one which concludes the tail can be advanced by about 50
> bytes or so.
>
> So the inner loop of asyncQueueAdvanceTail will, while holding a
> global exclusive lock, execute 2*150*4000 (max backends) = 1.2 million
> times for basically no benefit. During this time, no other transaction
> anywhere in the system that does a NOTIFY will be able to commit.
>
> The attached patches attempt to reduce the overhead in two ways:
>
> Patch 1: Changes asyncQueueAdvanceTail to do nothing unless the
> QUEUE_HEAD is on a different page than the QUEUE_TAIL. The idea is
> that there's no point trying to advance the tail unless we can
> actually usefully truncate the SLRU. This does however mean that
> asyncQueueReadAllNotifications always has to call
> asyncQueueAdvanceTail since it can no longer be guaranteed that any
> backend is still at the tail, which is one of the assumptions of the
> current code. Not sure if this is a problem or if it can be improved
> without tracking much more state.
>
> Patch 2: Changes SignalBackends to only notify other backends when (a)
> they're the same database as me or (b) the notify queue has advanced
> to a new SLRU page. This avoids backends being woken up for messages
> which they are not interested in.
>
> As a consequence of these changes, we can reduce the number of
> exclusive locks and backend wake ups in our case by a factor of 300.
> You still however get a thundering herd at the end of each SLRU page.
>
> Note: these patches have not yet been extensively 

[PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-06-04 Thread Martijn van Oosterhout
Hoi hackers,

We've been having issues with NOTIFYs blocking over multiple databases
(see [1] for more details). That was 9.4 but we've updated the
database to 11.3 and still have the same issue. Now however we could
use perf to do profiling and got the following profile (useless
details elided):

--32.83%--ProcessClientReadInterrupt
   --32.68%--ProcessNotifyInterrupt
  --32.16%--asyncQueueReadAllNotifications
 --23.37%--asyncQueueAdvanceTail
--20.49%--LWLockAcquire
   --18.93%--LWLockQueueSelf
  --12.99%--LWLockWaitListLock

(from: perf record -F 99 -ag -- sleep 600)

That shows that more than 20% of the time is spent in that single
function, waiting for an exclusive lock on the AsyncQueueLock. This
will block any concurrent session doing a NOTIFY in any database on
the system. This would certainly explain the symptoms we're seeing
(process xxx still waiting for AccessExclusiveLock on object 0 of
class 1262 of database 0).

Analysis of the code leads me to the following hypothesis (and hence
to the attached patches):

We have ~150 databases, each of which has 2 active backends with an
active LISTEN. When a NOTIFY happens anywhere on any database it
(under an exclusive lock) makes a list of 300 backends to send a
signal to. It then wakes up all of those backends. Each backend then
examines the message and all but one discards it as being for the
wrong database. Each backend then calls asyncQueueAdvanceTail (because
the current position of each backend was the tail) which then
takes an exclusive lock and checks all the other backends to see if
the tail can be advanced. All of these will conclude 'no', except the
very last one which concludes the tail can be advanced by about 50
bytes or so.

So the inner loop of asyncQueueAdvanceTail will, while holding a
global exclusive lock, execute 2*150*4000 (max backends) = 1.2 million
times for basically no benefit. During this time, no other transaction
anywhere in the system that does a NOTIFY will be able to commit.

The attached patches attempt to reduce the overhead in two ways:

Patch 1: Changes asyncQueueAdvanceTail to do nothing unless the
QUEUE_HEAD is on a different page than the QUEUE_TAIL. The idea is
that there's no point trying to advance the tail unless we can
actually usefully truncate the SLRU. This does however mean that
asyncQueueReadAllNotifications always has to call
asyncQueueAdvanceTail since it can no longer be guaranteed that any
backend is still at the tail, which is one of the assumptions of the
current code. Not sure if this is a problem or if it can be improved
without tracking much more state.

Patch 2: Changes SignalBackends to only notify other backends when (a)
they're the same database as me or (b) the notify queue has advanced
to a new SLRU page. This avoids backends being woken up for messages
which they are not interested in.

As a consequence of these changes, we can reduce the number of
exclusive locks and backend wake ups in our case by a factor of 300.
You still however get a thundering herd at the end of each SLRU page.

Note: these patches have not yet been extensively tested, and so
should only be used as a basis for discussion.

Comments? Suggestions?

[1] 
https://www.postgresql.org/message-id/cadwg95t0j9zf0uwdcmh81kmndsitavhxmbvgyqrrjcd-ilw...@mail.gmail.com

-- 
Martijn van Oosterhout  http://svana.org/kleptog/
From 17c241a2e307e70465c235248c17ac70d34fd175 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout 
Date: Mon, 3 Jun 2019 17:13:31 +0200
Subject: [PATCH 1/2] Only try advancing tail pointer when it's useful.

Advancing the tail pointer requires an exclusive lock which can block
backends from other databases, so it's worth keeping these attempts to a
minimum.
---
 src/backend/commands/async.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 738e6ec7e2..c1b0705234 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -89,11 +89,9 @@
  *	  Inbound-notify processing consists of reading all of the notifications
  *	  that have arrived since scanning last time. We read every notification
  *	  until we reach either a notification from an uncommitted transaction or
- *	  the head pointer's position. Then we check if we were the laziest
- *	  backend: if our pointer is set to the same position as the global tail
- *	  pointer is set, then we move the global tail pointer ahead to where the
- *	  second-laziest backend is (in general, we take the MIN of the current
- *	  head position and all active backends' new tail pointers). Whenever we
+ *	  the head pointer's position. Then we check if we can advance the global
+ *	  tail pointer to a new page, if so we take the MIN of the current
+ *	  head position and all active backends' new tail pointers. Whenever we
  *	  move the global tail pointer we also truncate now-unused pages