Re: recovery modules

2023-01-27 Thread Nathan Bossart
On Fri, Jan 27, 2023 at 08:17:46PM -0800, Nathan Bossart wrote:
> On Fri, Jan 27, 2023 at 05:55:42PM -0800, Andres Freund wrote:
>> On 2023-01-27 16:59:10 -0800, Nathan Bossart wrote:
>>> I think it would be weird for the archive module and
>>> recovery module interfaces to look so different, but if that's okay, I can
>>> change it.
>> 
>> I'm a bit sad about the archive module case - I wonder if we should change it
>> now, there can't be many users of it out there. And I think it's more likely
>> that we'll eventually want multiple archiving scripts to run concurrently -
>> which will be quite hard with the current interface (no private state).
> 
> I'm open to that.  IIUC it wouldn't require too many changes to existing
> archive modules, and if it gets us closer to batching or parallelism, it's
> probably worth doing sooner than later.

Here is a work-in-progress patch set for adjusting the archive modules
interface.  Is this roughly what you had in mind?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From f865032283d0c937ff1c47441d70a2b11e89d3f4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH 1/4] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 8ecdb9ca23..36800127e8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char	   *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;
 
 
 /*
@@ -416,8 +416,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();
 
 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+!ArchiveCallbacks.check_configured_cb())
 			{
 ereport(WARNING,
 		(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -518,7 +518,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);
 
-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -824,7 +824,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -837,7 +837,7 @@ LoadArchiveLibrary(void)
  errmsg("both archive_command and archive_library set"),
  errdetail("Only one of archive_command, archive_library may be set.")));
 
-	memset(, 0, sizeof(ArchiveModuleCallbacks));
+	memset(, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -854,9 +854,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 (errmsg("archive modules have to define the symbol %s", "_PG_archive_module_init")));
 
-	(*archive_init) ();
+	(*archive_init) ();
 
-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 (errmsg("archive modules must register an archive callback")));
 
@@ -869,6 +869,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

>From 1ea6fb293a1e69e6fdd39e1d6b96a0af601d8542 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH 2/4] move archive module exports to dedicated header

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c |  1 +
 src/backend/postmaster/shell_archive.c  |  2 +-
 src/backend/utils/misc/guc_tables.c |  1 +
 src/include/postmaster/archive_module.h | 54 +
 src/include/postmaster/pgarch.h | 39 --
 6 files changed, 58 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -32,7 +32,7 @@
 
 #include "common/int.h"
 #include "miscadmin.h"
-#include "postmaster/pgarch.h"
+#include "postmaster/archive_module.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "utils/guc.h"
diff --git a/src/backend/postmaster/pgarch.c 

Re: Deadlock between logrep apply worker and tablesync worker

2023-01-27 Thread Amit Kapila
On Sat, Jan 28, 2023 at 9:36 AM houzj.f...@fujitsu.com
 wrote:
>
> On Friday, January 27, 2023 8:16 PM Amit Kapila 
> >
> > On Fri, Jan 27, 2023 at 3:45 PM vignesh C  wrote:
> > >
> > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila 
> > wrote:
> > > >
> > > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > > related to it. If we want we can optimize it so that we can acquire
> > > > the lock on the specific origin as mentioned in comments
> > > > replorigin_drop_by_name() but it was not clear that this operation
> > > > would be frequent enough.
> > >
> > > Here is an attached patch to lock the replication origin record using
> > > LockSharedObject instead of locking pg_replication_origin relation in
> > > ExclusiveLock mode. Now tablesync worker will wait only if the
> > > tablesync worker is trying to drop the same replication origin which
> > > has already been dropped by the apply worker, the other tablesync
> > > workers will be able to successfully drop the replication origin
> > > without any wait.
> > >
> >
> > There is a code in the function replorigin_drop_guts() that uses the
> > functionality introduced by replorigin_exists(). Can we reuse this function 
> > for
> > the same?
>
> Maybe we can use SearchSysCacheExists1 to check the existence instead of
> adding a new function.
>

Yeah, I think that would be better.

> One comment about the patch.
>
> @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool 
> missing_ok, bool nowait)
> ...
> +   /* Drop the replication origin if it has not been dropped already */
> +   if (replorigin_exists(roident))
> replorigin_drop_guts(rel, roident, nowait);
>
> If developer pass missing_ok as false, should we report an ERROR here
> instead of silently return ?
>

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts() which is a static function and called
from only one place, so, will it be required to check at both places?

-- 
With Regards,
Amit Kapila.




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Thomas Munro
On Sat, Jan 28, 2023 at 4:42 PM Andres Freund  wrote:
> Did you use the same compiler / compilation flags as when elver hit it?
> Clearly Tomas' case was with at least some optimizations enabled.

I did use the same compiler version and optimisation level, clang
llvmorg-13.0.0-0-gd7b669b3a303 at -O2.




Re: Something is wrong with wal_compression

2023-01-27 Thread Tom Lane
Thomas Munro  writes:
> Reading Andres's comments and realising how relatively young
> txid_status() is compared to txid_current(), I'm now wondering if we
> shouldn't just disclaim the whole thing in back branches.

My thoughts were trending in that direction too.  It's starting
to sound like we aren't going to be able to make a fix that
we'd be willing to risk back-patching, even if it were completely
compatible at the user level.

Still, the idea that txid_status() isn't trustworthy is rather
scary.  I wonder whether there is a failure mode here that's
exhibitable without using that.

regards, tom lane




Re: Something is wrong with wal_compression

2023-01-27 Thread Thomas Munro
On Sat, Jan 28, 2023 at 4:57 PM Andrey Borodin  wrote:
> It's not trustworthy anyway. Xid wraparound might happen during
> reconnect. I suspect we can design a test that will show that it does
> not always show correct results during xid->2pc conversion (there is a
> point in time when xid is not in regular and not in 2pc, and I'm not
> sure ProcArrayLock is held). Maybe there are other edge cases.

I'm not sure I understand the edge cases, but it is true that this can
only give you the answer until the CLOG is truncated, which is pretty
arbitrary and you could be unlucky.  I guess a reliable version of
this would have new policies about CLOG retention, and CLOG segment
filenames derived from 64 bit xids so they don't wrap around.

> Anyway, if a user wants to know the status of xid in case of
> disconnection they have prepared xacts.

Yeah.  The original proposal mentioned that, but that this was a
"lighter" alternative.

Reading Andres's comments and realising how relatively young
txid_status() is compared to txid_current(), I'm now wondering if we
shouldn't just disclaim the whole thing in back branches.  Maybe if we
want to rescue it in master, there could be a "reliable" argument,
defaulting to false, or whatever, and we could eventually make the
amortisation improvement.




Re: suppressing useless wakeups in logical/worker.c

2023-01-27 Thread Amit Kapila
On Fri, Jan 27, 2023 at 4:07 AM Tom Lane  wrote:
>
> Thanks, pushed.
>
> Returning to the prior patch ... I don't much care for this:
>
> +/* Maybe there will be a free slot in a second... */
> +retry_time = TimestampTzPlusSeconds(now, 1);
> +LogRepWorkerUpdateSyncStartWakeup(retry_time);
>
> We're not moving the goalposts very far on unnecessary wakeups if
> we have to do that.  Do we need to get a wakeup on sync slot free?
> Although having to send that to every worker seems ugly.  Maybe this
> is being done in the wrong place and we need to find a way to get
> the launcher to handle it.
>
> As for the business about process_syncing_tables being only called
> conditionally, I was already of the opinion that the way it's
> getting called is loony.  Why isn't it called from LogicalRepApplyLoop
> (and noplace else)?

Currently, it seems to be called after processing transaction end
commands or when we are not in any transaction. As per my
understanding, that is when we can ensure the sync between tablesync
and apply worker. For example, say when tablesync worker is doing the
initial copy, the apply worker went ahead and processed some
additional xacts (WAL), now the tablesync worker needs to process all
those transactions after initial sync and before it can mark the state
as SYNCDONE. So that can be checked only at transaction boundries.

However, it is not very clear to me why the patch needs the below code.
@@ -3615,7 +3639,33 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  if (!dlist_is_empty(_mapping))
  wait_time = WalWriterDelay;
  else
- wait_time = NAPTIME_PER_CYCLE;
+ {
+ TimestampTz nextWakeup = DT_NOEND;
+
+ /*
+ * Since process_syncing_tables() is called conditionally, the
+ * tablesync worker start wakeup time might be in the past, and we
+ * can't know for sure when it will be updated again.  Rather than
+ * spinning in a tight loop in this case, bump this wakeup time by
+ * a second.
+ */
+ now = GetCurrentTimestamp();
+ if (wakeup[LRW_WAKEUP_SYNC_START] < now)
+ wakeup[LRW_WAKEUP_SYNC_START] =
TimestampTzPlusSeconds(wakeup[LRW_WAKEUP_SYNC_START], 1);

Do we see unnecessary wakeups without this, or delay in sync?

BTW, do we need to do something about wakeups in
wait_for_relation_state_change()?

-- 
With Regards,
Amit Kapila.




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 23:18:39 -0500, Tom Lane wrote:
> I also saw it on florican, which is/was an i386 machine using clang and
> pretty standard build options other than
>   'CFLAGS' => '-msse2 -O2',
> so I think this isn't too much about machine architecture or compiler
> flags.

Ah. Florican dropped of the BF status page and I was too lazy to look
deeper. You have a penchant for odd architectures, so it didn't seem too crazy
:)


> Machine speed might matter though.  elver is a good deal faster than
> florican was, and dikkop is slower yet.  I gather Thomas has seen this
> only once on elver, but I saw it maybe a dozen times over a couple of
> years on florican, and now dikkop has hit it after not so many runs.

Re-reading the old thread, it is interesting that you tried hard to reproduce
it outside of the BF, without success:
https://postgr.es/m/2398828.1646000688%40sss.pgh.pa.us

Such problems are quite annoying. Last time I hit such a case was
https://postgr.es/m/20220325052654.3xpbmntatyofau2w%40alap3.anarazel.de
but I can't see anything like that being the issue here.

Greetings,

Andres Freund




RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-27 Thread Takamichi Osumi (Fujitsu)
Hi,

On Friday, January 27, 2023 8:00 PM Amit Kapila  wrote:
> On Fri, Jan 27, 2023 at 1:39 PM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> > On Wednesday, January 25, 2023 11:24 PM I wrote:
> > > Attached the updated v22.
> > Hi,
> >
> > During self-review, I noticed some changes are required for some
> > variable types related to 'min_apply_delay' value, so have conducted
> > the adjustment changes for the same.
> >
> 
> So, you have changed min_apply_delay from int64 to int32, but you haven't
> mentioned the reason for the same? We use 'int' for the similar parameter
> recovery_min_apply_delay, so, ideally, it makes sense but still better to 
> tell your
> reason explicitly.
Yes. It's because I thought I need to make this feature consistent with the 
recovery_min_apply_delay.
This feature handles the range same as the recovery_min_apply delay from 0 to 
INT_MAX now
so should be adjusted to match it.


> Few comments
> =
> 1.
> @@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
>   XLogRecPtr subskiplsn; /* All changes finished at this LSN are
>   * skipped */
> 
> + int32 subminapplydelay; /* Replication apply delay (ms) */
> +
>   NameData subname; /* Name of the subscription */
> 
>   Oid subowner BKI_LOOKUP(pg_authid); /* Owner of the subscription */
> 
> Why are you placing this after subskiplsn? Earlier it was okay because we want
> the 64 bit value to be aligned but now, isn't it better to keep it after 
> subowner?
Moved it after subowner.


> 2.
> +
> + diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
> + TimestampTzPlusMilliseconds(finish_ts,
> + MySubscription->minapplydelay));
> 
> The above code appears a bit unreadable. Can we store the result of
> TimestampTzPlusMilliseconds() in a separate variable say "TimestampTz
> delayUntil;"?
Agreed. Fixed.

Attached the updated patch v24.


Best Regards,
Takamichi Osumi



v24-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v24-0001-Time-delayed-logical-replication-subscriber.patch


Re: Something is wrong with wal_compression

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 19:57:35 -0800, Andrey Borodin wrote:
> On Fri, Jan 27, 2023 at 7:40 PM Tom Lane  wrote:
> >
> > What are you using it for, that you don't care whether the answer
> > is trustworthy?
> >
> 
> It's not trustworthy anyway. Xid wraparound might happen during
> reconnect.

I think that part would be approximately fine, as long as you can live with
an answer of "too old". The xid returned by txid_status/pg_current_xact_id()
is 64bit, and there is code to verify that the relevant range is covered by
the clog.

However - there's nothing preventing the xid to become too old in case of a
crash.

If you have an open connection, you can prevent the clog from being truncated
by having an open snapshot. But you can't really without using e.g. 2PC if you
want to handle crashes - obviously snapshots don't survive them.


I really don't think txid_status() can be used for anything but informational
probing of the clog / procarray.



> I suspect we can design a test that will show that it does not always show
> correct results during xid->2pc conversion (there is a point in time when
> xid is not in regular and not in 2pc, and I'm not sure ProcArrayLock is
> held). Maybe there are other edge cases.

Unless I am missing something, that would be very bad [TM], completely
independent of txid_status(). The underlying functions like
TransactionIdIsInProgress() are used for MVCC.

Greetings,

Andres Freund




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Tom Lane
Andres Freund  writes:
> Except that you're saying that you hit this on elver (amd64), I think it'd be
> interesting that we see the failure on an arm host, which has a less strict
> memory order model than x86.

I also saw it on florican, which is/was an i386 machine using clang and
pretty standard build options other than
'CFLAGS' => '-msse2 -O2',
so I think this isn't too much about machine architecture or compiler
flags.

Machine speed might matter though.  elver is a good deal faster than
florican was, and dikkop is slower yet.  I gather Thomas has seen this
only once on elver, but I saw it maybe a dozen times over a couple of
years on florican, and now dikkop has hit it after not so many runs.

regards, tom lane




Re: recovery modules

2023-01-27 Thread Nathan Bossart
On Fri, Jan 27, 2023 at 05:55:42PM -0800, Andres Freund wrote:
> On 2023-01-27 16:59:10 -0800, Nathan Bossart wrote:
>> I think it would be weird for the archive module and
>> recovery module interfaces to look so different, but if that's okay, I can
>> change it.
> 
> I'm a bit sad about the archive module case - I wonder if we should change it
> now, there can't be many users of it out there. And I think it's more likely
> that we'll eventually want multiple archiving scripts to run concurrently -
> which will be quite hard with the current interface (no private state).

I'm open to that.  IIUC it wouldn't require too many changes to existing
archive modules, and if it gets us closer to batching or parallelism, it's
probably worth doing sooner than later.

> I was wondering why we implement "shell" via a separate mechanism from
> restore_library. I.e. a) why doesn't restore_library default to 'shell',
> instead of an empty string, b) why aren't restore_command et al implemented
> using a restore module.

I think that's the long-term idea.  For archive modules, there were
concerns about backward compatibility [0].

[0] 
https://postgr.es/m/CABUevEx8cKy%3D%2BYQU_3NaeXnZV2bSB7Lk6EE%2B-FEcmE4JO4V1hg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Something is wrong with wal_compression

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 19:49:17 -0800, Andres Freund wrote:
> It's quite commonly used as part of trigger based replication tools (IIRC
> that's its origin), monitoring, as part of client side logging, as part of
> snapshot management.

Forgot one: Queues.

The way it's used for trigger based replication, queues and also some
materialized aggregation tooling, is that there's a trigger that inserts into
a "log" table. And that log table has a column into which txid_current() will
be inserted. Together with txid_current_snapshot() etc that's used to get a
(at least semi) "transactional" order out of such log tables.

I believe that's originally been invented by londiste / skytool, later slony
migrated to it. The necessary C code was added as contrib/txid in 1f92630fc4e
2007-10-07 and then moved into core a few days later in 18e3fcc31e7.


For those cases making txid_current() flush would approximately double the WAL
flush rate.

Greetings,

Andres Freund




RE: Deadlock between logrep apply worker and tablesync worker

2023-01-27 Thread houzj.f...@fujitsu.com
On Friday, January 27, 2023 8:16 PM Amit Kapila 
> 
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C  wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila 
> wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > related to it. If we want we can optimize it so that we can acquire
> > > the lock on the specific origin as mentioned in comments
> > > replorigin_drop_by_name() but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking pg_replication_origin relation in
> > ExclusiveLock mode. Now tablesync worker will wait only if the
> > tablesync worker is trying to drop the same replication origin which
> > has already been dropped by the apply worker, the other tablesync
> > workers will be able to successfully drop the replication origin
> > without any wait.
> >
> 
> There is a code in the function replorigin_drop_guts() that uses the
> functionality introduced by replorigin_exists(). Can we reuse this function 
> for
> the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead of
adding a new function.

One comment about the patch.

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool 
missing_ok, bool nowait)
...
+   /* Drop the replication origin if it has not been dropped already */
+   if (replorigin_exists(roident))
replorigin_drop_guts(rel, roident, nowait);

If developer pass missing_ok as false, should we report an ERROR here
instead of silently return ?

Best Regards,
Hou zj


Re: Something is wrong with wal_compression

2023-01-27 Thread Andrey Borodin
On Fri, Jan 27, 2023 at 7:40 PM Tom Lane  wrote:
>
> What are you using it for, that you don't care whether the answer
> is trustworthy?
>

It's not trustworthy anyway. Xid wraparound might happen during
reconnect. I suspect we can design a test that will show that it does
not always show correct results during xid->2pc conversion (there is a
point in time when xid is not in regular and not in 2pc, and I'm not
sure ProcArrayLock is held). Maybe there are other edge cases.

Anyway, if a user wants to know the status of xid in case of
disconnection they have prepared xacts.

Best regards, Andrey Borodin.




Re: Something is wrong with wal_compression

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 22:39:56 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
> >> FWIW, my vote goes for a more expensive but reliable function even in
> >> stable branches.
> 
> > I very strenuously object. If we make txid_current() (by way of
> > pg_current_xact_id()) flush WAL, we'll cause outages.
> 
> What are you using it for, that you don't care whether the answer
> is trustworthy?

It's quite commonly used as part of trigger based replication tools (IIRC
that's its origin), monitoring, as part of client side logging, as part of
snapshot management.

txid_current() predates pg_xact_status() by well over 10 years. Clearly we had
lots of uses for it before pg_xact_status() was around.

Greetings,

Andres Freund




Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 22:23:58 +1300, Thomas Munro wrote:
> After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
> check loops, on the same FBSD 13.1 machine as elver which has failed
> like this once before, I haven't been able to reproduce this on
> REL_12_STABLE.

Did you use the same compiler / compilation flags as when elver hit it?
Clearly Tomas' case was with at least some optimizations enabled.

Except that you're saying that you hit this on elver (amd64), I think it'd be
interesting that we see the failure on an arm host, which has a less strict
memory order model than x86.

IIUC elver previously hit this on 12?

Greetings,

Andres Freund




Re: Something is wrong with wal_compression

2023-01-27 Thread Tom Lane
Andres Freund  writes:
> On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
>> FWIW, my vote goes for a more expensive but reliable function even in
>> stable branches.

> I very strenuously object. If we make txid_current() (by way of
> pg_current_xact_id()) flush WAL, we'll cause outages.

What are you using it for, that you don't care whether the answer
is trustworthy?

regards, tom lane




Re: heapgettup refactoring

2023-01-27 Thread David Rowley
"On Wed, 25 Jan 2023 at 10:17, Melanie Plageman
 wrote:
> I've added a comment but I didn't include the function name in it -- I
> find it repetitive when the comments above functions do that -- however,
> I'm not strongly attached to that.

I think the general format for header comments is:

/*
 * 
 *\t\t
 *
 * [Further details]
 */

We've certainly got places that don't follow that, but I don't think
that's any reason to have no comment or invent some new format.

heapam.c seems to have some other format where we do: "
- ". I generally just try to copy
the style from the surrounding code. I think generally, people won't
argue if you follow the style from the surrounding code, but there'd
be exceptions to that, I'm sure.

I'll skip further review of 0001 here as the whole
ScanDirectionNoMovement case is being discussed on the other thread.

v6-0002:

1. heapgettup_initial_block() needs a header comment to mention what
it does and what it returns. It would be good to make it obvious that
it returns InvalidBlockNumber when there are no blocks to scan.

2. After heapgettup_initial_block(), you're checking "if (block ==
InvalidBlockNumber). It might be worth a mention something like

/*
 * Check if we got to the end of the scan already.  This could happen for
 * an empty relation or if parallel workers scanned everything before we
 * got a chance.
 */

the backward scan comment wouldn't mention parallel workers.

v6-0003:

3. Can you explain why you removed the snapshot local variable in heapgettup()?

4. I think it might be a good idea to use unlikely() in if
(!scan->rs_inited). The idea is to help coax the compiler into moving
that code off to a cold path. That's likely especially important if
heapgettup_initial_block is inlined, which I see it is marked as.

v6-0004:

5. heapgettup_start_page() and heapgettup_continue_page() both need a
header comment to explain what they do and what the inputs and output
arguments are.

6. I'm not too sure what the following comment means:

/* block and lineoff now reference the physically next tid */

"block" is just a parameter to the function and its value is not
adjusted. The comment makes it sound like something was changed.

(I think these might be just not well updated from having split this
out from the 0006 patch as the same comment makes more sense in 0006)

v6-0005:

7. heapgettup_advance_block() needs a header comment.

8. Is there a reason why heapgettup_advance_block() handle backward
scans first? I'd expect you should just follow the lead of the other
functions and do ScanDirectionIsForward first.

v6-0006

David




Re: Generating code for query jumbling through gen_node_support.pl

2023-01-27 Thread Michael Paquier
On Tue, Jan 24, 2023 at 03:57:56PM +0900, Michael Paquier wrote:
> Still, my plan here is to enforce the loading of 
> pg_stat_statements with compute_query_id = regress and
> utility_query_id = jumble (if needed) in a new buildfarm machine,

Actually, about this specific point, I have been able to set up a
buildfarm machine that uses shared_preload_libraries =
pg_stat_statements and compute_query_id = regress in the base
configuration, which is uncovered yet.  This works as long as one sets
up EXTRA_INSTALL => "contrib/pg_stat_statements" in build_env.
--
Michael


signature.asc
Description: PGP signature


Re: Something is wrong with wal_compression

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
> On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
> > On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
> >> There is no
> >> doubt that the current situation is unacceptable, though, so maybe we
> >> really should just do it and make a faster one later.  Anyone else
> >> want to vote on this?
> > 
> > I wasn't aware of the existence of pg_xact_status, so I suspect that it
> > is not a widely known and used feature.  After reading the documentation,
> > I'd say that anybody who uses it will want it to give a reliable answer.
> > So I'd agree that it is better to make it more expensive, but live up to
> > its promise.

> A code search within the Debian packages (codesearch.debian.net) and
> github does not show that it is not actually used, pg_xact_status() is
> reported as parts of copies of the Postgres code in the regression
> tests.

Not finding a user at codesearch.debian.net provides useful information for C
APIs, but a negative result for an SQL exposed function doesn't provide any
information. Those callers will largely be in application code, which largely
won't be in debian.

And as noted two messages up, we wouldn't need to flush in pg_xact_status(),
we'd need to flush in pg_current_xact_id()/txid_current().


> FWIW, my vote goes for a more expensive but reliable function even in
> stable branches.

I very strenuously object. If we make txid_current() (by way of
pg_current_xact_id()) flush WAL, we'll cause outages.


Greetings,

Andres Freund




Re: Something is wrong with wal_compression

2023-01-27 Thread Maciek Sakrejda
On Fri, Jan 27, 2023, 18:58 Andres Freund  wrote:

> Hi,
>
> On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
> > It would be pg_current_xact_id() that would have to pay the cost of
> > the WAL flush, not pg_xact_status() itself, but yeah that's what the
> > patch does (with some optimisations).  I guess one question is whether
> > there are any other reasonable real world uses of
> > pg_current_xact_id(), other than the original goal[1].
>
> txid_current() is a lot older than pg_current_xact_id(), and they're
> backed by
> the same code afaict. 8.4 I think.
>
> Unfortunately txid_current() is used in plenty montiring setups IME.
>
> I don't think it's a good idea to make a function that was quite cheap for
> 15
> years, suddenly be several orders of magnitude more expensive...


As someone working on a monitoring tool that uses it (well, both), +1. We'd
have to rethink a few things if this becomes a performance concern.

Thanks,
Maciek


Re: Something is wrong with wal_compression

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
> It would be pg_current_xact_id() that would have to pay the cost of
> the WAL flush, not pg_xact_status() itself, but yeah that's what the
> patch does (with some optimisations).  I guess one question is whether
> there are any other reasonable real world uses of
> pg_current_xact_id(), other than the original goal[1].

txid_current() is a lot older than pg_current_xact_id(), and they're backed by
the same code afaict. 8.4 I think.

Unfortunately txid_current() is used in plenty montiring setups IME.

I don't think it's a good idea to make a function that was quite cheap for 15
years, suddenly be several orders of magnitude more expensive...

Greetings,

Andres Freund




Re: bug: copy progress reporting of backends which run multiple COPYs

2023-01-27 Thread Michael Paquier
On Sat, Jan 21, 2023 at 02:45:40AM +0100, Matthias van de Meent wrote:
> Let me think about it. I think it would work, but I'm not sure that's
> a great option - it adds backend state that we would need to add
> cleanup handles for. But then again, COPY ... TO could use TRIGGER to
> trigger actual COPY FROM statements, which would also break progress
> reporting in a vanilla instance without extensions.
> 
> I'm not sure what the right thing to do is here.

Simply disabling COPY reporting for file_fdw does not sound appealing
to me, because that can be really useful for users.  As long as you
rely on two code paths that call separately the progress reporting,
things are doomed with the current infrastructure.  For example,
thinking about some fancy cases, you could you create an index that
uses a function as expression, function that includes utility commands
that do progress reporting.  Things would equally go wrong in the
progress view.

What are the assertions/warnings that get triggered in file_fdw?
Joining together file_fdw with a plain COPY is surely a fancy case,
even if COPY TO would allow that.

>> Would you do anything different in the master branch, with no
>> compatibility constraints ?  I think the progress reporting would still
>> be limited to one row per backend, not one per CopyFrom().
> 
> I think I would at least introduce another parameter to BeginCopyFrom
> for progress reporting (instead of relying on pstate != NULL), like
> how we have a bit in reindex_index's params->options that specifies
> whether we want progress reporting (which is unset for parallel
> workers iirc).

This makes sense, independently of the discussion about what should be
done with cross-runs of these APIs.  This could be extended with
user-controllable options for each one of them.  It does not take care
of the root of the problem, just bypasses it.
--
Michael


signature.asc
Description: PGP signature


Re: Add n_tup_newpage_upd to pg_stat table views

2023-01-27 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 6:44 PM Andres Freund  wrote:
> I don't think that'd make it particularly easy to figure out how often
> out-of-space causes non-HOT updates to go out of page, and how often it causes
> potential HOT updates to go out of page.  If you just have a single index,
> it's not too hard, but after that seems decidedly nontrival.
>
> But I might just be missing what you're suggesting.

It would be useless for that, of course. But it would be a good proxy
for what specific indexes force non-hot updates due to HOT safety
issues. This would work independently of the issue of what's going on
in the heap. That matters too, of course, but in practice the main
problem is likely the specific combination of indexes and updates.
(Maybe it would just be an issue with heap fill factor, at times, but
even then you'd still want to rule out basic HOT safety issues first.)

If you see one particular index that gets a far larger number of
non-hot updates that are reported as "logical changes to the indexed
columns", then dropping that index has the potential to make the HOT
update situation far better.

--
Peter Geoghegan




Re: Add n_tup_newpage_upd to pg_stat table views

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 17:59:32 -0800, Peter Geoghegan wrote:
> > I think this might cause some trouble for existing monitoring setups after 
> > an
> > upgrade. Suddenly the number of updates will appear way lower than
> > before... And if we end up eventually distinguishing between different 
> > reasons
> > for out-of-page updates, or hot/non-hot out-of-page that'll happen again.
> 
> Uh...no it won't? The new counter is totally independent of the existing
> HOT counter, and the transactional all-updates counter. It's just that
> there is an enum that encodes which of the two non-transactional "sub
> counters" to use (either for HOT updates or new-page-migration
> updates).
>
> > I wish we'd included HOT updates in the total number of updates, and just 
> > kept
> > HOT updates a separate counter that'd always be less than updates in total.
> 
> Uh...we did in fact do it that way to begin with?

Sorry, I misread the diff, and then misremembered some old issue.


> > From that angle: Perhaps it'd be better to have counter for how many times a
> > page is found to be full during an update?
> 
> Didn't Corey propose a patch to add just that? Do you mean something
> more specific, like a tracker for when an UPDATE leaves a page full,
> without needing to go to a new page itself?

Nope, I just had a brainfart.


> > Similarly, it's a bit sad that we can't distinguish between the number of
> > potential-HOT out-of-page updates and the other out-of-page updates. But
> > that'd mean even more counters.
> 
> ISTM that it would make more sense to do that at the index level
> instead. It wouldn't be all that hard to teach ExecInsertIndexTuples()
> to remember whether each index received the indexUnchanged hint used
> by bottom-up deletion, which is approximately the same thing, but
> works at the index level.

I don't think that'd make it particularly easy to figure out how often
out-of-space causes non-HOT updates to go out of page, and how often it causes
potential HOT updates to go out of page.  If you just have a single index,
it's not too hard, but after that seems decidedly nontrival.

But I might just be missing what you're suggesting.


Greetings,

Andres Freund




Re: Something is wrong with wal_compression

2023-01-27 Thread Michael Paquier
On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
> On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
>> There is no
>> doubt that the current situation is unacceptable, though, so maybe we
>> really should just do it and make a faster one later.  Anyone else
>> want to vote on this?
> 
> I wasn't aware of the existence of pg_xact_status, so I suspect that it
> is not a widely known and used feature.  After reading the documentation,
> I'd say that anybody who uses it will want it to give a reliable answer.
> So I'd agree that it is better to make it more expensive, but live up to
> its promise.

A code search within the Debian packages (codesearch.debian.net) and
github does not show that it is not actually used, pg_xact_status() is
reported as parts of copies of the Postgres code in the regression
tests.

FWIW, my vote goes for a more expensive but reliable function even in
stable branches.  Even 857ee8e mentions that this could be used on a
lost connection, so we don't even satisfy the use case of the original
commit as things stand (right?), because lost connection could just be
a result of a crash, and if crash recovery reassigns the XID, then the
client gets it wrong.
--
Michael


signature.asc
Description: PGP signature


Re: Small omission in type_sanity.sql

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 20:39:04 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Tom, is there a reason we run the various sanity tests early-ish in the
> > schedule? It does seem to reduce their effectiveness a bit...
> 
> Originally, those tests were mainly needed to sanity-check the
> hand-maintained initial catalog data, so it made sense to run them
> early.

It's also kinda useful to have some basic validity testing early on, because
if there's something wrong with the catalog values, it'll cause lots of issues
later.


> > Problems:
> > - "Cross-check against pg_type entry" is far too strict about legal 
> > combinations
> >   of typstorage
> 
> Perhaps, but it's enforcing policy about what we want in the
> initial catalog data, not what is possible to support.

True in generaly, but I don't think it matters much in this specific case. We
don't gain much by forbidding 'e' -> 'x' mismatches, given that we allow 'x'
-> 'p'.


There's a lot more such cases in opr_sanity. There's a lot of tests in it that
only make sense for validating the initial catalog contents. It might be
useful to run a more lenient version of it later though.


> So there's a bit of divergence of goals here too.  Maybe we need to split up
> the tests into initial-data-only tests (run early) and tests that should
> hold for user-created objects too (run late)?

Yea, I think so. A bit worried about the duplication that might require.

But the *sanity tests also do also encode a lot of good cross-checks, that are
somewhat easy to break in code (and / or have been broken in the past), so I
think it's worth pursuing.

Greetings,

Andres Freund




Re: Add n_tup_newpage_upd to pg_stat table views

2023-01-27 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 3:55 PM Andres Freund  wrote:
> I wonder if it's quite detailed enough - we can be forced to do out-of-page
> updates because we actually are out of space, or because we reach the max
> number of line pointers we allow in a page. HOT pruning can't remove dead line
> pointers, so that doesn't necessarily help.

It would be hard to apply that kind of information, I suspect. Maybe
it's still worth having, though.

> Similarly, it's a bit sad that we can't distinguish between the number of
> potential-HOT out-of-page updates and the other out-of-page updates. But
> that'd mean even more counters.

ISTM that it would make more sense to do that at the index level
instead. It wouldn't be all that hard to teach ExecInsertIndexTuples()
to remember whether each index received the indexUnchanged hint used
by bottom-up deletion, which is approximately the same thing, but
works at the index level.

This is obviously more useful, because you have index-granularity
information that can guide users in how to index to maximize the
number of HOT updates. And, even if changing things around didn't lead
to the hoped-for improvement in the rate of HOT updates, it would at
least still allow the indexes on the table to use bottom-up deletion
more often, on average.

Admittedly this has some problems. The index_unchanged_by_update()
logic probably isn't as sophisticated as it ought to be because it's
driven by the statement-level extraUpdatedCols bitmap set, and not a
per-tuple test, like the HOT safety test in heap_update() is.
But...that should probably be fixed anyway.

> I think this might cause some trouble for existing monitoring setups after an
> upgrade. Suddenly the number of updates will appear way lower than
> before... And if we end up eventually distinguishing between different reasons
> for out-of-page updates, or hot/non-hot out-of-page that'll happen again.

Uh...no it won't? The new counter is totally independent of the existing
HOT counter, and the transactional all-updates counter. It's just that
there is an enum that encodes which of the two non-transactional "sub
counters" to use (either for HOT updates or new-page-migration
updates).

> I wish we'd included HOT updates in the total number of updates, and just kept
> HOT updates a separate counter that'd always be less than updates in total.

Uh...we did in fact do it that way to begin with?

> From that angle: Perhaps it'd be better to have counter for how many times a
> page is found to be full during an update?

Didn't Corey propose a patch to add just that? Do you mean something
more specific, like a tracker for when an UPDATE leaves a page full,
without needing to go to a new page itself?

If so, then that does require defining what that really means, because
it isn't trivial. Do you assume that all updates have a successor
version that is equal in size to that of the UPDATE that gets counted
by this hypothetical other counter of yours?

--
Peter Geoghegan




Re: Using WaitEventSet in the postmaster

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-28 14:25:38 +1300, Thomas Munro wrote:
> The nearby thread about searching for uses of volatile reminded me: we
> can now drop a bunch of these in postmaster.c.  The patch I originally
> wrote to do that as part of this series somehow morphed into an
> experimental patch to nuke all global variables[1],

Hah.


> but of course we should at least drop the now redundant use of volatile and
> sigatomic_t.  See attached.

+1

Greetings,

Andres Freund




Re: recovery modules

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 16:59:10 -0800, Nathan Bossart wrote:
> On Fri, Jan 27, 2023 at 04:23:19PM -0800, Andres Freund wrote:
> >> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path,
> >> + const char 
> >> *lastRestartPointFileName);
> >> +typedef void (*RecoveryArchiveCleanupCB) (const char 
> >> *lastRestartPointFileName);
> >> +typedef void (*RecoveryEndCB) (const char *lastRestartPointFileName);
> >> +typedef void (*RecoveryShutdownCB) (void);
> > 
> > I think the signature of these forces bad coding practices, because there's 
> > no
> > way to have state within a recovery module (as there's no parameter for it).
> > 
> > It's possible we would eventually support multiple modules, e.g. restoring
> > from shorter term file based archiving and from longer term archiving in 
> > some
> > blob store. Then we'll regret not having a varible for this.
> 
> Are you suggesting that we add a "void *arg" to each one of these?

Yes.  Or pass a pointer to a struct with a "private_data" style field to all
of them.



> >> +extern RecoveryModuleCallbacks RecoveryContext;
> > 
> > I think that'll typically be interpreteted as a MemoryContext by readers.
> 
> How about RecoveryCallbacks?

Callbacks is better than Context here imo, but I think 'Recovery' makes it
sound like this actually performs WAL replay or such. Seems like it should be
RestoreCallbacks at the very least?


> > Also, why is this a global var? Exported too?
> 
> It's needed in xlog.c, xlogarchive.c, and xlogrecovery.c.  Would you rather
> it be static to xlogarchive.c and provide accessors for the others?

Maybe? Something about this feels wrong to me, but I can't entirely put my
finger on it.


> >> +/*
> >> + * Type of the shared library symbol _PG_recovery_module_init that is 
> >> looked up
> >> + * when loading a recovery library.
> >> + */
> >> +typedef void (*RecoveryModuleInit) (RecoveryModuleCallbacks *cb);
> > 
> > I think this is a bad way to return callbacks. This way the
> > RecoveryModuleCallbacks needs to be modifiable, which makes the job for the
> > compiler harder (and isn't the greatest for security).
> > 
> > I strongly encourage you to follow the model used e.g. by tableam. The init
> > function should return a pointer to a *constant* struct. Which is 
> > compile-time
> > initialized with the function pointers.
> > 
> > See the bottom of heapam_handler.c for how that ends up looking.
> 
> Hm.  I used the existing strategy for archive modules and logical decoding
> output plugins here.

Unfortunately I didn't realize the problem when I was designing the output
plugin interface. But there's probably too many users of it out there now to
change it.

The interface does at least provide a way to have its own "per instance"
state, via the startup callback and 
LogicalDecodingContext->output_plugin_private.


The worst interface in this area is index AMs - the handler returns a pointer
to a palloced struct with callbacks. That then is copied into a new allocation
in the relcache entry. We have hundreds to thousands of copies of what
bthandler() sets up in memory. Without any sort of benefit.


> I think it would be weird for the archive module and
> recovery module interfaces to look so different, but if that's okay, I can
> change it.

I'm a bit sad about the archive module case - I wonder if we should change it
now, there can't be many users of it out there. And I think it's more likely
that we'll eventually want multiple archiving scripts to run concurrently -
which will be quite hard with the current interface (no private state).

Just btw: It's imo a bit awkward for the definition of the archiving plugin
interface to be in pgarch.h: "Exports from postmaster/pgarch.c" doesn't
describe that well. A dedicated header seems cleaner.


> 
> >> +void
> >> +LoadRecoveryCallbacks(void)
> >> +{
> >> +  RecoveryModuleInit init;
> >> +
> >> +  /*
> >> +   * If the shell command is enabled, use our special initialization
> >> +   * function.  Otherwise, load the library and call its
> >> +   * _PG_recovery_module_init().
> >> +   */
> >> +  if (restoreLibrary[0] == '\0')
> >> +  init = shell_restore_init;
> >> +  else
> >> +  init = (RecoveryModuleInit)
> >> +  load_external_function(restoreLibrary, 
> >> "_PG_recovery_module_init",
> >> + false, NULL);
> > 
> > Why a special rule for shell, instead of just defaulting the GUC to it?
> 
> I'm not following this one.  The default value of the restore_library GUC
> is an empty string, which means that the shell commands should be used.

I was wondering why we implement "shell" via a separate mechanism from
restore_library. I.e. a) why doesn't restore_library default to 'shell',
instead of an empty string, b) why aren't restore_command et al implemented
using a restore module.


> >> +  /*
> >> +   * If using shell commands, 

Re: Using WaitEventSet in the postmaster

2023-01-27 Thread Tom Lane
Thomas Munro  writes:
> The nearby thread about searching for uses of volatile reminded me: we
> can now drop a bunch of these in postmaster.c.  The patch I originally
> wrote to do that as part of this series somehow morphed into an
> experimental patch to nuke all global variables[1], but of course we
> should at least drop the now redundant use of volatile and
> sigatomic_t.  See attached.

+1

regards, tom lane




Re: Small omission in type_sanity.sql

2023-01-27 Thread Tom Lane
Andres Freund  writes:
> Tom, is there a reason we run the various sanity tests early-ish in the
> schedule? It does seem to reduce their effectiveness a bit...

Originally, those tests were mainly needed to sanity-check the
hand-maintained initial catalog data, so it made sense to run them
early.  Since we taught genbki.pl to do a bunch more work, that's
perhaps a bit less pressing.

There's at least one test that intentionally sets up a bogus btree
opclass, which we'd have to drop again if we wanted to run the
sanity checks later.  Not sure what other issues might surface.
You could find out easily enough, of course ...

> Problems:
> - "Cross-check against pg_type entry" is far too strict about legal 
> combinations
>   of typstorage

Perhaps, but it's enforcing policy about what we want in the
initial catalog data, not what is possible to support.  So
there's a bit of divergence of goals here too.  Maybe we need
to split up the tests into initial-data-only tests (run early)
and tests that should hold for user-created objects too
(run late)?

regards, tom lane




Re: Using WaitEventSet in the postmaster

2023-01-27 Thread Thomas Munro
The nearby thread about searching for uses of volatile reminded me: we
can now drop a bunch of these in postmaster.c.  The patch I originally
wrote to do that as part of this series somehow morphed into an
experimental patch to nuke all global variables[1], but of course we
should at least drop the now redundant use of volatile and
sigatomic_t.  See attached.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKH_RPAo%3DNgPfHKj--565aL1qiVpUGdWt1_pmJehY%2Bdmw%40mail.gmail.com
From 74f8b703f1a23b47bf613813014bc432c62c75d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 28 Jan 2023 14:08:23 +1300
Subject: [PATCH] Remove unneeded volatile qualitifiers from postmaster.c.

Several flags were marked volatile and in some cases use sigatomic_t
because they were accessed from signal handlers.  After commit 7389aad6,
we can just use unqualified bool.
---
 src/backend/postmaster/postmaster.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 62fba5fcee..f92dbc2270 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -359,17 +359,17 @@ bool		ClientAuthInProgress = false;	/* T during new-client
 bool		redirection_done = false;	/* stderr redirected for syslogger? */
 
 /* received START_AUTOVAC_LAUNCHER signal */
-static volatile sig_atomic_t start_autovac_launcher = false;
+static bool start_autovac_launcher = false;
 
 /* the launcher needs to be signaled to communicate some condition */
-static volatile bool avlauncher_needs_signal = false;
+static bool avlauncher_needs_signal = false;
 
 /* received START_WALRECEIVER signal */
-static volatile sig_atomic_t WalReceiverRequested = false;
+static bool WalReceiverRequested = false;
 
 /* set when there's a worker that needs to be started up */
-static volatile bool StartWorkerNeeded = true;
-static volatile bool HaveCrashedWorker = false;
+static bool StartWorkerNeeded = true;
+static bool HaveCrashedWorker = false;
 
 /* set when signals arrive */
 static volatile sig_atomic_t pending_pm_pmsignal;
-- 
2.38.1



Re: Small omission in type_sanity.sql

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-18 14:51:32 -0500, Melanie Plageman wrote:
> I only changed these few lines in type_sanity to be more correct; I
> didn't change anything else in regress to actually exercise them (e.g.
> ensuring a partitioned table is around when running type_sanity). It
> might be worth moving type_sanity down in the parallel schedule?

Unfortunately it's not entirely trivial to do that. Despite the comment at the
top of type_sanity.sql:
-- None of the SELECTs here should ever find any matching entries,
-- so the expected output is easy to maintain ;-).

there are actually a few queries in there that are expected to return
objects. And would return more at the end of the tests.

That doesn't seem great.


Tom, is there a reason we run the various sanity tests early-ish in the
schedule? It does seem to reduce their effectiveness a bit...


Problems:
- shell types show up in a bunch of queries, e.g. "Look for illegal values in
  pg_type fields." due to NOT t1.typisdefined
- the omission of various relkinds noted by Melanie shows in "Look for illegal
  values in pg_class fields"
- pg_attribute query doesn't know about dropped columns
- "Cross-check against pg_type entry" is far too strict about legal combinations
  of typstorage



> It does seem a bit hard to remember to update these tests in
> type_sanity.sql when adding some new value for a pg_class field. I
> wonder if there is a better way of testing this.

As evidenced by the above list of failures, moving the test to the end of the
regression tests would help, if we can get there.



> --- All tables and indexes should have an access method.
> -SELECT c1.oid, c1.relname
> -FROM pg_class as c1
> -WHERE c1.relkind NOT IN ('S', 'v', 'f', 'c') and
> -c1.relam = 0;
> - oid | relname 
> --+-
> +-- All tables and indexes except partitioned tables should have an access
> +-- method.
> +SELECT oid, relname, relkind, relam
> +FROM pg_class
> +WHERE relkind NOT IN ('S', 'v', 'f', 'c', 'p') and
> +relam = 0;
> + oid | relname | relkind | relam 
> +-+-+-+---
>  (0 rows)

Don't think that one is right, a partitioned table doesn't have an AM.


> --- Conversely, sequences, views, types shouldn't have them
> -SELECT c1.oid, c1.relname
> -FROM pg_class as c1
> -WHERE c1.relkind IN ('S', 'v', 'f', 'c') and
> -c1.relam != 0;
> - oid | relname 
> --+-
> +-- Conversely, sequences, views, types, and partitioned tables shouldn't have
> +-- them
> +SELECT oid, relname, relkind, relam
> +FROM pg_class
> +WHERE relkind IN ('S', 'v', 'f', 'c', 'p') and
> +relam != 0;
> + oid | relname | relkind | relam 
> +-+-+-+---
>  (0 rows)

Particularly because you include them again here :)


Greetings,

Andres Freund




Re: recovery modules

2023-01-27 Thread Nathan Bossart
On Fri, Jan 27, 2023 at 04:23:19PM -0800, Andres Freund wrote:
>> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path,
>> +   const char 
>> *lastRestartPointFileName);
>> +typedef void (*RecoveryArchiveCleanupCB) (const char 
>> *lastRestartPointFileName);
>> +typedef void (*RecoveryEndCB) (const char *lastRestartPointFileName);
>> +typedef void (*RecoveryShutdownCB) (void);
> 
> I think the signature of these forces bad coding practices, because there's no
> way to have state within a recovery module (as there's no parameter for it).
> 
> It's possible we would eventually support multiple modules, e.g. restoring
> from shorter term file based archiving and from longer term archiving in some
> blob store. Then we'll regret not having a varible for this.

Are you suggesting that we add a "void *arg" to each one of these?  Or put
the arguments into a struct?  Or something else?

>> +extern RecoveryModuleCallbacks RecoveryContext;
> 
> I think that'll typically be interpreteted as a MemoryContext by readers.

How about RecoveryCallbacks?

> Also, why is this a global var? Exported too?

It's needed in xlog.c, xlogarchive.c, and xlogrecovery.c.  Would you rather
it be static to xlogarchive.c and provide accessors for the others?

>> +/*
>> + * Type of the shared library symbol _PG_recovery_module_init that is 
>> looked up
>> + * when loading a recovery library.
>> + */
>> +typedef void (*RecoveryModuleInit) (RecoveryModuleCallbacks *cb);
> 
> I think this is a bad way to return callbacks. This way the
> RecoveryModuleCallbacks needs to be modifiable, which makes the job for the
> compiler harder (and isn't the greatest for security).
> 
> I strongly encourage you to follow the model used e.g. by tableam. The init
> function should return a pointer to a *constant* struct. Which is compile-time
> initialized with the function pointers.
> 
> See the bottom of heapam_handler.c for how that ends up looking.

Hm.  I used the existing strategy for archive modules and logical decoding
output plugins here.  I think it would be weird for the archive module and
recovery module interfaces to look so different, but if that's okay, I can
change it.

>> +void
>> +LoadRecoveryCallbacks(void)
>> +{
>> +RecoveryModuleInit init;
>> +
>> +/*
>> + * If the shell command is enabled, use our special initialization
>> + * function.  Otherwise, load the library and call its
>> + * _PG_recovery_module_init().
>> + */
>> +if (restoreLibrary[0] == '\0')
>> +init = shell_restore_init;
>> +else
>> +init = (RecoveryModuleInit)
>> +load_external_function(restoreLibrary, 
>> "_PG_recovery_module_init",
>> +   false, NULL);
> 
> Why a special rule for shell, instead of just defaulting the GUC to it?

I'm not following this one.  The default value of the restore_library GUC
is an empty string, which means that the shell commands should be used.

>> +/*
>> + * If using shell commands, remove callbacks for any commands that are 
>> not
>> + * set.
>> + */
>> +if (restoreLibrary[0] == '\0')
>> +{
>> +if (recoveryRestoreCommand[0] == '\0')
>> +RecoveryContext.restore_cb = NULL;
>> +if (archiveCleanupCommand[0] == '\0')
>> +RecoveryContext.archive_cleanup_cb = NULL;
>> +if (recoveryEndCommand[0] == '\0')
>> +RecoveryContext.recovery_end_cb = NULL;
> 
> I'd just mandate that these are implemented and that the module has to handle
> if it doesn't need to do anything.

Wouldn't this just force module authors to write empty functions for the
parts they don't need?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 18:28:16 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I'm sure we have a few places that aren't that careful, but I would hope 
> > it's
> > not a large number. Are you thinking of specific "patterns" we've repeated 
> > all
> > over, or just a few cases you recall?
> 
> I recall that we used to have dependencies on, for example, the LWLock
> functions being out-of-line.  Probably that specific pain point has
> been cleaned up, but it surprises me not at all to hear that there
> are more.

We did clean up a fair bit, some via "infrastructure" fixes. E.g. our
spinlocks didn't use to be a barrier a good while back (c.f. 0709b7ee72e), and
that required putting volatile on things that couldn't move across the lock
boundaries.  I think that in turn was what caused the LWLock issue you
mention, as back then lwlocks used spinlocks.

The increased use of atomics instead of "let's just do a dirty read", fixed a
few instances too.


> I agree that there are probably not a huge number of places that would
> need to be fixed, but I'm not sure how we'd go about finding them.

Yea, that's the annoying part...


One thing we can look for is the use of volatile, which we used to use a lot
for preventing code rearrangement (for lack of barrier primitives in the bad
old days). Both Robert and I removed a bunch of that kind of use of volatile,
and from memory some of them wouldn't have been safe with LTO.

It's really too bad that we [have to] use volatile around signal handlers and
for PG_TRY too, otherwise it'd be easier to search for.

Kinda wondering if we ought to add a sig_volatile, err_volatile or such.


But the main thing probably is to just regularly test LTO and look for
problems. Perhaps worth adding a BF animal that uses -O3 + LTO?

I don't immediately see how to squeeze using PGO into the BF build process
(since we'd have to build without PGO, run some workload, build with PGO -
without any source modifications inbetween)...

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Tomas Vondra
On 1/27/23 22:19, Andres Freund wrote:
> Hi,
> 
> On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
>> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:
>>
>>> Huh? Why did you remove the GUC?
>>
>> After reading previous threads, my optimism level of getting it ever
>> in shape of being widely accepted degraded significantly (mainly due
>> to the discussion of wider category of 'WAL I/O throttling' especially
>> in async case, RPO targets in async case and potentially calculating
>> global bandwidth).
> 
> I think it's quite reasonable to limit this to a smaller scope. Particularly
> because those other goals are pretty vague but ambitious goals. IMO the
> problem with a lot of the threads is precisely that that they aimed at a level
> of generallity that isn't achievable in one step.
> 

+1 to that

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Tomas Vondra



On 1/27/23 22:33, Andres Freund wrote:
> Hi,
> 
> On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
>> On 1/27/23 08:18, Bharath Rupireddy wrote:
 I think my idea of only forcing to flush/wait an LSN some distance in the 
 past
 would automatically achieve that?
>>>
>>> I'm sorry, I couldn't get your point, can you please explain it a bit more?
>>>
>>
>> The idea is that we would not flush the exact current LSN, because
>> that's likely somewhere in the page, and we always write the whole page
>> which leads to write amplification.
>>
>> But if we backed off a bit, and wrote e.g. to the last page boundary,
>> that wouldn't have this issue (either the page was already flushed -
>> noop, or we'd have to flush it anyway).
> 
> Yep.
> 
> 
>> We could even back off a bit more, to increase the probability it was
>> actually flushed / sent to standby.
> 
> That's not the sole goal, from my end: I'd like to avoid writing out +
> flushing the WAL in too small chunks.  Imagine a few concurrent vacuums or
> COPYs or such - if we're unlucky they'd each end up exceeding their "private"
> limit close to each other, leading to a number of small writes of the
> WAL. Which could end up increasing local commit latency / iops.
> 
> If we instead decide to only ever flush up to something like
>   last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ
> 
> we'd make sure that the throttling mechanism won't cause a lot of small
> writes.
> 

I'm not saying we shouldn't do this, but I still don't see how this
could make a measurable difference. At least assuming a sensible value
of the throttling limit (say, more than 256kB per backend), and OLTP
workload running concurrently. That means ~64 extra flushes/writes per
16MB segment (at most). Yeah, a particular page might get unlucky and be
flushed by multiple backends, but the average still holds. Meanwhile,
the OLTP transactions will generate (at least) an order of magnitude
more flushes.

> 
>>> Keeping replication lag under check enables one to provide a better
>>> RPO guarantee as discussed in the other thread
>>> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
>>>
>>
>> Isn't that a bit over-complicated? RPO generally only cares about xacts
>> that committed (because that's what you want to not lose), so why not to
>> simply introduce a "sync mode" that simply uses a bit older LSN when
>> waiting for the replica? Seems much simpler and similar to what we
>> already do.
> 
> I don't think that really helps you that much. If there's e.g. a huge VACUUM /
> COPY emitting loads of WAL you'll suddenly see commit latency of a
> concurrently committing transactions spike into oblivion. Whereas a general
> WAL throttling mechanism would throttle the VACUUM, without impacting the
> commit latency of normal transactions.
> 

True, but it solves the RPO goal which is what the other thread was about.

IMHO it's useful to look at this as a resource scheduling problem:
limited WAL bandwidth consumed by backends, with the bandwidth
distributed using some sort of policy.

The patch discussed in this thread uses fundamentally unfair policy,
with throttling applied only on backends that produce a lot of WAL. And
trying to leave the OLTP as unaffected as possible.

The RPO thread seems to be aiming for a "fair" policy, providing the
same fraction of bandwidth to all processes. This will affect all xacts
the same way (sleeps proportional to amount of WAL generated by the xact).

Perhaps we want such alternative scheduling policies, but it'll probably
require something like the autovacuum throttling.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovery modules

2023-01-27 Thread Michael Paquier
On Fri, Jan 27, 2023 at 04:09:39PM -0800, Andres Freund wrote:
> I think it would be hard to write a good module that isn't just implementing
> the existing commands without it. Needing to clean up archives and reacting to
> the end of recovery seems a pretty core task.

FWIW, recovery_end_command is straight-forward as it is done by the
startup process, so that's an easy take.  You could split the cake
into two parts, as well, aka first focus on restore_command and
recovery_end_command as a first step, then we could try to figure out 
how archive_cleanup_command would fit in this picture with the
checkpointer or a different process.  There are a bunch of deployments
where WAL archive retention is controlled by the age of the backups,
not by the backend deciding when these should be removed as a
checkpoint runs depending on the computed redo LSN, so recovery
modules would still be useful with just coverage for restore_command
and recovery_end_command.

> I was briefly wondering whether it'd be worth trying to offload things like
> archive_cleanup_command from checkpointer to a different process, for
> robustness. But given that it's pretty much required for performance that the
> module runs in the startup process, that ship probably has sailed.

Yeah, agreed that this could be interesting.  That could leverage the
work of the checkpointer.  Nathan has proposed a patch for that
recently, as far as I recall, to offload some tasks from the startup
and checkpointer processes:
https://commitfest.postgresql.org/41/3448/

So that pretty much goes into the same area?
--
Michael


signature.asc
Description: PGP signature


Re: recovery modules

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-25 16:34:21 +0900, Michael Paquier wrote:
> diff --git a/src/include/access/xlogarchive.h 
> b/src/include/access/xlogarchive.h
> index 299304703e..71c9b88165 100644
> --- a/src/include/access/xlogarchive.h
> +++ b/src/include/access/xlogarchive.h
> @@ -30,9 +30,45 @@ extern bool XLogArchiveIsReady(const char *xlog);
>  extern bool XLogArchiveIsReadyOrDone(const char *xlog);
>  extern void XLogArchiveCleanup(const char *xlog);
>  
> -extern bool shell_restore(const char *file, const char *path,
> -   const char 
> *lastRestartPointFileName);
> -extern void shell_archive_cleanup(const char *lastRestartPointFileName);
> -extern void shell_recovery_end(const char *lastRestartPointFileName);
> +/*
> + * Recovery module callbacks
> + *
> + * These callback functions should be defined by recovery libraries and
> + * returned via _PG_recovery_module_init().  For more information about the
> + * purpose of each callback, refer to the recovery modules documentation.
> + */
> +typedef bool (*RecoveryRestoreCB) (const char *file, const char *path,
> +const char 
> *lastRestartPointFileName);
> +typedef void (*RecoveryArchiveCleanupCB) (const char 
> *lastRestartPointFileName);
> +typedef void (*RecoveryEndCB) (const char *lastRestartPointFileName);
> +typedef void (*RecoveryShutdownCB) (void);

I think the signature of these forces bad coding practices, because there's no
way to have state within a recovery module (as there's no parameter for it).


It's possible we would eventually support multiple modules, e.g. restoring
from shorter term file based archiving and from longer term archiving in some
blob store. Then we'll regret not having a varible for this.


> +typedef struct RecoveryModuleCallbacks
> +{
> + RecoveryRestoreCB restore_cb;
> + RecoveryArchiveCleanupCB archive_cleanup_cb;
> + RecoveryEndCB recovery_end_cb;
> + RecoveryShutdownCB shutdown_cb;
> +} RecoveryModuleCallbacks;
> +
> +extern RecoveryModuleCallbacks RecoveryContext;

I think that'll typically be interpreteted as a MemoryContext by readers.

Also, why is this a global var? Exported too?


> +/*
> + * Type of the shared library symbol _PG_recovery_module_init that is looked 
> up
> + * when loading a recovery library.
> + */
> +typedef void (*RecoveryModuleInit) (RecoveryModuleCallbacks *cb);

I think this is a bad way to return callbacks. This way the
RecoveryModuleCallbacks needs to be modifiable, which makes the job for the
compiler harder (and isn't the greatest for security).

I strongly encourage you to follow the model used e.g. by tableam. The init
function should return a pointer to a *constant* struct. Which is compile-time
initialized with the function pointers.

See the bottom of heapam_handler.c for how that ends up looking.


> +void
> +LoadRecoveryCallbacks(void)
> +{
> + RecoveryModuleInit init;
> +
> + /*
> +  * If the shell command is enabled, use our special initialization
> +  * function.  Otherwise, load the library and call its
> +  * _PG_recovery_module_init().
> +  */
> + if (restoreLibrary[0] == '\0')
> + init = shell_restore_init;
> + else
> + init = (RecoveryModuleInit)
> + load_external_function(restoreLibrary, 
> "_PG_recovery_module_init",
> +false, NULL);

Why a special rule for shell, instead of just defaulting the GUC to it?


> + /*
> +  * If using shell commands, remove callbacks for any commands that are 
> not
> +  * set.
> +  */
> + if (restoreLibrary[0] == '\0')
> + {
> + if (recoveryRestoreCommand[0] == '\0')
> + RecoveryContext.restore_cb = NULL;
> + if (archiveCleanupCommand[0] == '\0')
> + RecoveryContext.archive_cleanup_cb = NULL;
> + if (recoveryEndCommand[0] == '\0')
> + RecoveryContext.recovery_end_cb = NULL;

I'd just mandate that these are implemented and that the module has to handle
if it doesn't need to do anything.



> + /*
> +  * Check for invalid combinations of the command/library parameters and
> +  * load the callbacks.
> +  */
> + CheckMutuallyExclusiveGUCs(restoreLibrary, "restore_library",
> +
> recoveryRestoreCommand, "restore_command");
> + CheckMutuallyExclusiveGUCs(restoreLibrary, "restore_library",
> +recoveryEndCommand, 
> "recovery_end_command");
> + before_shmem_exit(call_recovery_module_shutdown_cb, 0);
> + LoadRecoveryCallbacks();

This kind of sequence is duplicated into several places.

Greetings,

Andres Freund




Re: How to find the number of cached pages for a relation?

2023-01-27 Thread Amin
Thank you Andres.

If I want to do "a" ( Do one probe of the buffer mapping table for each
block of the relation. Thus O(#relation blocks)) what function calls can I
use, assuming I only have access to the relation id? How can I access and
scan the buffer mapping table?

On Fri, Jan 13, 2023 at 6:27 PM Andres Freund  wrote:

> Hi,
>
> On 2023-01-13 17:28:31 -0800, Amin wrote:
> > Before scanning a relation, in the planner stage, I want to make a call
> to
> > retrieve information about how many pages will be a hit for a specific
> > relation. The module pg_buffercache seems to be doing a similar thing.
> > Also, pg_statio_all_tables seems to be having that information, but it is
> > updated after execution. However, I want the information before
> execution.
> > Also not sure how pg_statio_all_tables is created and how I can access it
> > in the code.
>
> There's no cheap way to do that. Currently the only ways are to:
>
> a) Do one probe of the buffer mapping table for each block of the
>relation. Thus O(#relation blocks).
>
> b) Scan all of buffer headers, check which are for the relation. Thus
>O(#NBuffers)
>
> Neither of which are a good idea during planning.
>
>
> It might be a bit more realistic to get very rough estimates:
>
> You could compute the table's historic cache hit ratio from pgstats (i.e.
> use
> the data backing pg_statio_all_tables). Of course that's not going to be
> specific to your query (for index scans etc), and might have changed more
> recently. It'd also be completely wrong after a restart.
>
> If we had information about *recent* cache hit patterns for the relation,
> it'd
> be a lot better, but we don't have the infrastructure for that, and
> introducing it would increase the size of the stats entries noticably.
>
> Another way could be to probe the buffer mapping table for a small subset
> of
> the locks and infer the likelihood of other blocks being in shared buffers
> that way.
>
> A third way could be to track the cache hit for relations in backend local
> memory, likely in the relache entry. The big disadvantage would be that
> query
> plans would differ between connections and that connections would need to
> "warm up" to have good plans. But it'd handle restarts nicely.
>
> Greetings,
>
> Andres Freund
>


Re: recovery modules

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 15:28:21 -0800, Nathan Bossart wrote:
> The more I think about this, the more I wonder whether we really need to
> include archive_cleanup_command and recovery_end_command in recovery
> modules.

I think it would be hard to write a good module that isn't just implementing
the existing commands without it. Needing to clean up archives and reacting to
the end of recovery seems a pretty core task.



> Another weird thing with the checkpointer is that the restore_library will
> stay loaded long after recovery is finished, and it'll be loaded regardless
> of whether recovery is required in the first place.

I don't see a problem with that. And I suspect we might even end up there
for other reasons.

I was briefly wondering whether it'd be worth trying to offload things like
archive_cleanup_command from checkpointer to a different process, for
robustness. But given that it's pretty much required for performance that the
module runs in the startup process, that ship probably has sailed.

Greetings,

Andres Freund




Re: Add n_tup_newpage_upd to pg_stat table views

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 18:23:39 -0500, Corey Huinker wrote:
> This patch adds the n_tup_newpage_upd to all the table stat views.
>
> Just as we currently track HOT updates, it should be beneficial to track
> updates where the new tuple cannot fit on the existing page and must go to
> a different one.

I like that idea.


I wonder if it's quite detailed enough - we can be forced to do out-of-page
updates because we actually are out of space, or because we reach the max
number of line pointers we allow in a page. HOT pruning can't remove dead line
pointers, so that doesn't necessarily help.

Which e.g. means that:

> Hopefully this can give users some insight as to whether their current
> fillfactor settings need to be adjusted.

Isn't that easy, because you can have a page with just a visible single tuple
on, but still be unable to do a same-page update. The fix instead is to VACUUM
(more aggressively).


OTOH, just seeing that there's high percentage "out-of-page updates" provides
more information than we have right now.  And the alternative would be to add
yet another counter.


Similarly, it's a bit sad that we can't distinguish between the number of
potential-HOT out-of-page updates and the other out-of-page updates. But
that'd mean even more counters.


I guess we could try to add tracepoints to allow to distinguish between those
cases instead? Not a lot of people use those though.



> @@ -372,8 +372,11 @@ pgstat_count_heap_update(Relation rel, bool hot)
>   pgstat_info->trans->tuples_updated++;
>
>   /* t_tuples_hot_updated is nontransactional, so just advance it 
> */
> - if (hot)
> + if (hut == PGSTAT_HEAPUPDATE_HOT)
>   pgstat_info->t_counts.t_tuples_hot_updated++;
> + else if (hut == PGSTAT_HEAPUPDATE_NEW_PAGE)
> + pgstat_info->t_counts.t_tuples_newpage_updated++;
> +
>   }
>  }
>

I think this might cause some trouble for existing monitoring setups after an
upgrade. Suddenly the number of updates will appear way lower than
before... And if we end up eventually distinguishing between different reasons
for out-of-page updates, or hot/non-hot out-of-page that'll happen again.

I wish we'd included HOT updates in the total number of updates, and just kept
HOT updates a separate counter that'd always be less than updates in total.


>From that angle: Perhaps it'd be better to have counter for how many times a
page is found to be full during an update?


> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -3155,7 +3155,8 @@ heap_update(Relation relation, ItemPointer otid, 
> HeapTuple newtup,
>   pagefree;
>   boolhave_tuple_lock = false;
>   booliscombo;
> - booluse_hot_update = false;
> + PgStat_HeapUpdateType update_type = PGSTAT_HEAPUPDATE_NON_HOT;
> +
>   boolkey_intact;
>   boolall_visible_cleared = false;
>   boolall_visible_cleared_new = false;
> @@ -3838,10 +3839,11 @@ l2:
>* changed.
>*/
>   if (!bms_overlap(modified_attrs, hot_attrs))
> - use_hot_update = true;
> + update_type = PGSTAT_HEAPUPDATE_HOT;
>   }
>   else
>   {
> + update_type = PGSTAT_HEAPUPDATE_NEW_PAGE;
>   /* Set a hint that the old page could use prune/defrag */
>   PageSetFull(page);
>   }
> @@ -3875,7 +3877,7 @@ l2:
>*/
>   PageSetPrunable(page, xid);
>
> - if (use_hot_update)
> + if (update_type == PGSTAT_HEAPUPDATE_HOT)

It's a bit weird that heap_update() uses a pgstat type to decide what to
do. But not sure there's a much better alternative.

Greetings,

Andres Freund




Re: recovery modules

2023-01-27 Thread Michael Paquier
On Thu, Jan 26, 2023 at 09:40:58PM -0800, Nathan Bossart wrote:
> Yeah, there are some problems here.  If we ERROR, we'll just bounce back to
> the sigsetjmp() block once a second, and we'll never pick up configuration
> reloads, shutdown signals, etc.  If we FATAL, we'll just rapidly restart
> over and over.  Given the dicussion about misconfigured archiving
> parameters [0], I doubt folks will be okay with giving priority to one or
> the other.
> 
> I'm currently thinking that the checkpointer should set a flag and clear
> the recovery callbacks when a misconfiguration is detected.  Anytime the
> checkpointer tries to use the archive-cleanup callback, a WARNING would be
> emitted.  This is similar to an approach I proposed for archiving
> misconfigurations (that we didn't proceed with) [1].  Given the
> aformentioned problems, this approach might be more suitable for the
> checkpointer than it is for the archiver.

So, by doing that, archive_library would be ignored.  What should be
the checkpointer do when a aconfiguration error is found on
misconfiguration?  Would archive_cleanup_command be equally ignored or
could there be a risk to see it used by the checkpointer?

Ignoring it would be fine as the user intended the use of a library,
rather than enforcing its use via a value one did not really want.
So, on top of cleaning the callbacks, archive_cleanup_command should
also be cleaned up in the checkpointer?  Issuing one WARNING per
checkpoint would be indeed much better than looping over and over,
impacting the system reliability.

Thoughts or comments from anyone?
--
Michael


signature.asc
Description: PGP signature


Re: heapgettup() with NoMovementScanDirection unused in core?

2023-01-27 Thread Tom Lane
David Rowley  writes:
> My personal preference would have been to call it
> ScanDirectionCombine, so the naming is more aligned to the 4 other
> macro names that start with ScanDirection in sdir.h, but I'm not going
> to fuss over it.

No objection to that.

regards, tom lane




Re: heapgettup() with NoMovementScanDirection unused in core?

2023-01-27 Thread David Rowley
On Sat, 28 Jan 2023 at 12:15, Tom Lane  wrote:
> /*
>  * Determine the net effect of two direction specifications.
>  * This relies on having ForwardScanDirection = +1, BackwardScanDirection = 
> -1,
>  * and will probably not do what you want if applied to any other values.
>  */
> #define CombineScanDirections(a, b)  ((a) * (b))
>
> The main thing this'd buy us is being able to grep for uses of the
> trick.  If it's written as just multiplication, good luck being
> able to find what's depending on that, should you ever need to.

Yeah, I think the multiplication macro is a good way of doing it.
Having the definition of it close to the ScanDirection enum's
definition is likely a very good idea so that anyone adjusting the
enum values is more likely to notice that it'll cause an issue. A
small note on the enum declaration about the -1, +1 values being
exploited in various places might be a good idea too. I see v6-0006 in
[1] further exploits this, so that's further reason to document that.

My personal preference would have been to call it
ScanDirectionCombine, so the naming is more aligned to the 4 other
macro names that start with ScanDirection in sdir.h, but I'm not going
to fuss over it.

David

[1] 
https://postgr.es/m/caakru_zyixwws1wxszneoy+sjoh_+f5uho-1ufhyi-u0d6z...@mail.gmail.com


David




Re: recovery modules

2023-01-27 Thread Nathan Bossart
On Thu, Jan 26, 2023 at 09:40:58PM -0800, Nathan Bossart wrote:
> On Wed, Jan 25, 2023 at 04:34:21PM +0900, Michael Paquier wrote:
>> The loop part is annoying..  I've never been a fan of adding this
>> cross-value checks for the archiver modules in the first place, and it
>> would make things much simpler in the checkpointer if we need to think
>> about that as we want these values to be reloadable.  Perhaps this
>> could just be an exception where we just give priority on one over the
>> other archive_cleanup_command?  The startup process has a well-defined
>> sequence after a failure, while the checkpointer is designed to remain
>> robust.
> 
> Yeah, there are some problems here.  If we ERROR, we'll just bounce back to
> the sigsetjmp() block once a second, and we'll never pick up configuration
> reloads, shutdown signals, etc.  If we FATAL, we'll just rapidly restart
> over and over.  Given the dicussion about misconfigured archiving
> parameters [0], I doubt folks will be okay with giving priority to one or
> the other.
> 
> I'm currently thinking that the checkpointer should set a flag and clear
> the recovery callbacks when a misconfiguration is detected.  Anytime the
> checkpointer tries to use the archive-cleanup callback, a WARNING would be
> emitted.  This is similar to an approach I proposed for archiving
> misconfigurations (that we didn't proceed with) [1].  Given the
> aformentioned problems, this approach might be more suitable for the
> checkpointer than it is for the archiver.

The more I think about this, the more I wonder whether we really need to
include archive_cleanup_command and recovery_end_command in recovery
modules.  Another weird thing with the checkpointer is that the
restore_library will stay loaded long after recovery is finished, and it'll
be loaded regardless of whether recovery is required in the first place.
Of course, that typically won't cause any problems, and we could wait until
we need to do archive cleanup to load the library (and call its shutdown
callback when recovery is finished), but this strikes me as potentially
more complexity than the feature is worth.  Perhaps we should just focus on
covering the restore_command functionality for now and add the rest later.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Tom Lane
Andres Freund  writes:
> On 2023-01-27 15:06:37 -0500, Tom Lane wrote:
>> There are a lot of places where we're implicitly relying on
>> cross-compilation-unit optimizations NOT happening, because the code isn't
>> adequately decorated with memory barriers and the like.

> We have a fallback compiler barrier implementation doing that, but it
> shouldn't be used on any halfway reasonable compiler. Cross-compilation-unit
> calls don't provide a memory barrier - I assume you're thinking about a
> compiler barrier?

Sorry, yeah, I was being sloppy there.

> I'm sure we have a few places that aren't that careful, but I would hope it's
> not a large number. Are you thinking of specific "patterns" we've repeated all
> over, or just a few cases you recall?

I recall that we used to have dependencies on, for example, the LWLock
functions being out-of-line.  Probably that specific pain point has
been cleaned up, but it surprises me not at all to hear that there
are more.

I agree that there are probably not a huge number of places that would
need to be fixed, but I'm not sure how we'd go about finding them.

regards, tom lane




Add n_tup_newpage_upd to pg_stat table views

2023-01-27 Thread Corey Huinker
This patch adds the n_tup_newpage_upd to all the table stat views.

Just as we currently track HOT updates, it should be beneficial to track
updates where the new tuple cannot fit on the existing page and must go to
a different one.

Hopefully this can give users some insight as to whether their current
fillfactor settings need to be adjusted.

My chosen implementation replaces the hot-update boolean with an
update_type which is currently a three-value enum. I favored that
only slightly over adding a separate newpage-update boolean because the two
events are mutually exclusive and fewer parameters is less overhead and one
less assertion check. The relative wisdom of this choice may not come to
light until we add a new measurement and see whether that new measurement
overlaps either is-hot or is-new-page.
From 55204b3d2f719f5dd8c308ea722606a40b3d09b8 Mon Sep 17 00:00:00 2001
From: coreyhuinker 
Date: Fri, 27 Jan 2023 17:56:59 -0500
Subject: [PATCH v1] Add n_tup_newpage_upd to pg_stat_all_tables and
 pg_stat_xact_all_tables. This value reflects the number of tuples updated
 where the new tuple was placed on a different page than the previous version.

---
 doc/src/sgml/monitoring.sgml |  9 +
 src/backend/access/heap/heapam.c | 10 ++
 src/backend/catalog/system_views.sql |  4 +++-
 src/backend/utils/activity/pgstat_relation.c |  8 ++--
 src/backend/utils/adt/pgstatfuncs.c  | 18 ++
 src/include/catalog/pg_proc.dat  |  9 +
 src/include/pgstat.h | 14 --
 src/test/regress/expected/rules.out  | 12 +---
 8 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..f0291a21fb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4523,6 +4523,15 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+ 
+  
+   n_tup_newpage_upd bigint
+  
+  
+   Number of rows updated where the new row is on a different page than the previous version
+  
+ 
+
  
   
n_live_tup bigint
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e6024a980b..f5aa429a9a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3155,7 +3155,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 pagefree;
 	bool		have_tuple_lock = false;
 	bool		iscombo;
-	bool		use_hot_update = false;
+	PgStat_HeapUpdateType update_type = PGSTAT_HEAPUPDATE_NON_HOT;
+
 	bool		key_intact;
 	bool		all_visible_cleared = false;
 	bool		all_visible_cleared_new = false;
@@ -3838,10 +3839,11 @@ l2:
 		 * changed.
 		 */
 		if (!bms_overlap(modified_attrs, hot_attrs))
-			use_hot_update = true;
+			update_type = PGSTAT_HEAPUPDATE_HOT;
 	}
 	else
 	{
+		update_type = PGSTAT_HEAPUPDATE_NEW_PAGE;
 		/* Set a hint that the old page could use prune/defrag */
 		PageSetFull(page);
 	}
@@ -3875,7 +3877,7 @@ l2:
 	 */
 	PageSetPrunable(page, xid);
 
-	if (use_hot_update)
+	if (update_type == PGSTAT_HEAPUPDATE_HOT)
 	{
 		/* Mark the old tuple as HOT-updated */
 		HeapTupleSetHotUpdated();
@@ -3986,7 +3988,7 @@ l2:
 	if (have_tuple_lock)
 		UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
 
-	pgstat_count_heap_update(relation, use_hot_update);
+	pgstat_count_heap_update(relation, update_type);
 
 	/*
 	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..292a9b88b3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -665,6 +665,7 @@ CREATE VIEW pg_stat_all_tables AS
 pg_stat_get_tuples_updated(C.oid) AS n_tup_upd,
 pg_stat_get_tuples_deleted(C.oid) AS n_tup_del,
 pg_stat_get_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+pg_stat_get_tuples_newpage_updated(C.oid) AS n_tup_newpage_upd,
 pg_stat_get_live_tuples(C.oid) AS n_live_tup,
 pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,
 pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
@@ -696,7 +697,8 @@ CREATE VIEW pg_stat_xact_all_tables AS
 pg_stat_get_xact_tuples_inserted(C.oid) AS n_tup_ins,
 pg_stat_get_xact_tuples_updated(C.oid) AS n_tup_upd,
 pg_stat_get_xact_tuples_deleted(C.oid) AS n_tup_del,
-pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd
+pg_stat_get_xact_tuples_hot_updated(C.oid) AS n_tup_hot_upd,
+pg_stat_get_xact_tuples_newpage_updated(C.oid) AS n_tup_newpage_upd
 FROM pg_class C LEFT JOIN
  pg_index I ON C.oid = I.indrelid
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
diff --git 

Re: heapgettup() with NoMovementScanDirection unused in core?

2023-01-27 Thread Tom Lane
Melanie Plageman  writes:
> On Fri, Jan 27, 2023 at 05:05:16PM -0500, Tom Lane wrote:
>> AFAIR, there is noplace today that depends on the exact encoding
>> of ForwardScanDirection and BackwardScanDirection, and I'm not
>> sure that we want to introduce such a dependency.

> I think you mean the enum value when you say encoding? I actually
> started using the ScanDirection value in the refactor of heapgettup()
> and heapgettup_pagemode() which I proposed here [1]. Why wouldn't we
> want to introduce such a dependency?

It's just that in general, depending on the numeric values of an
enum isn't a great coding practice.

After thinking about it for awhile, I'd be happier if we added
something like this to sdir.h, and then used it rather than
directly depending on multiplication:

/*
 * Determine the net effect of two direction specifications.
 * This relies on having ForwardScanDirection = +1, BackwardScanDirection = -1,
 * and will probably not do what you want if applied to any other values.
 */
#define CombineScanDirections(a, b)  ((a) * (b))

The main thing this'd buy us is being able to grep for uses of the
trick.  If it's written as just multiplication, good luck being
able to find what's depending on that, should you ever need to.

>> 4. I don't think the proposed test case is worth the cycles.

> Just the one I wrote or any test case?

I think that all this code is quite well-tested already, so I'm
not sure what's the point of adding another test for it.

regards, tom lane




Re: improving user.c error messages

2023-01-27 Thread Nathan Bossart
On Fri, Jan 27, 2023 at 07:31:19PM +0100, Alvaro Herrera wrote:
> On 2023-Jan-26, Nathan Bossart wrote:
>>  ereport(ERROR,
>>  
>> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>> - errmsg("permission denied: bootstrap 
>> user must be superuser")));
>> + errmsg("permission denied to alter 
>> role"),
>> + errdetail("The bootstrap user must be 
>> superuser.")));
> 
> I think this one isn't using the right errcode; this is not a case of
> insufficient privileges.  There's no priv you can acquire that lets you
> do it.  So I'd change it to unsupported operation.

І fixed this in v4.  I've also attached a second patch in which I've
adjusted the messages that Peter mentioned upthread.

One thing that feels a bit odd is how some of the DETAILs mention the
operation being attempted while others do not.  For example, we have

ERROR:  permission denied to drop role
DETAIL: You must have SUPERUSER privilege to drop roles with SUPERUSER.

In this case, the DETAIL explains the action that is prohibited.  In other
cases, we have something like

ERROR:  permission denied to alter role
DETAIL: You must have CREATEROLE privilege and ADMIN OPTION on role 
"myrole".

which does not.  I think this is okay because adding "to alter the role" to
the end of the DETAIL seems kind of awkward.  But in other cases, such as

ERROR:  permission denied to use replication slots
DETAIL:  You must have REPLICATION privilege.

adding the operation to the end seems less awkward (i.e., "You must have
REPLICATION privilege to use replication slots.").  I don't think there's
any information lost by omitting the action in the DETAIL, so perhaps this
is just a stylistic choice.  I think I'm inclined to add the action to the
DETAIL whenever it doesn't make the message lengthy and awkward, and leave
it out otherwise.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d369e35e6a44cb9708a08a1f32d1755e04f04de1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 26 Jan 2023 11:05:13 -0800
Subject: [PATCH v4 1/2] Improve user.c error messages.

---
 src/backend/commands/user.c   | 168 --
 src/test/regress/expected/create_role.out |  77 ++
 src/test/regress/expected/dependency.out  |   4 +
 src/test/regress/expected/privileges.out  |  23 +--
 4 files changed, 197 insertions(+), 75 deletions(-)

diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index 3a92e930c0..26b533f1be 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -316,23 +316,33 @@ CreateRole(ParseState *pstate, CreateRoleStmt *stmt)
 		if (!has_createrole_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("permission denied to create role")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles.",
+			   "CREATEROLE")));
 		if (issuper)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must be superuser to create superusers")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "SUPERUSER", "SUPERUSER")));
 		if (createdb && !have_createdb_privilege())
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have createdb permission to create createdb users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "CREATEDB", "CREATEDB")));
 		if (isreplication && !has_rolreplication(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have replication permission to create replication users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "REPLICATION", "REPLICATION")));
 		if (bypassrls && !has_bypassrls_privilege(currentUserId))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("must have bypassrls to create bypassrls users")));
+	 errmsg("permission denied to create role"),
+	 errdetail("You must have %s privilege to create roles with %s.",
+			   "BYPASSRLS", "BYPASSRLS")));
 	}
 
 	/*
@@ -744,10 +754,18 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 	roleid = authform->oid;
 
 	/* To mess with a superuser in any way you gotta be superuser. */
-	if (!superuser() && (authform->rolsuper || dissuper))
+	if (!superuser() && authform->rolsuper)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to alter superuser roles or change superuser attribute")));
+ errmsg("permission denied to alter 

Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 15:06:37 -0500, Tom Lane wrote:
> There are a lot of places where we're implicitly relying on
> cross-compilation-unit optimizations NOT happening, because the code isn't
> adequately decorated with memory barriers and the like.

We have a fallback compiler barrier implementation doing that, but it
shouldn't be used on any halfway reasonable compiler. Cross-compilation-unit
calls don't provide a memory barrier - I assume you're thinking about a
compiler barrier?

I'm sure we have a few places that aren't that careful, but I would hope it's
not a large number. Are you thinking of specific "patterns" we've repeated all
over, or just a few cases you recall?

Greetings,

Andres Freund




Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 10:05:09 -0700, João Paulo Labegalini de Carvalho wrote:
> I am investigating the benefits of different profile-guided optimizations
> (PGO) and link-time optimizations (LTO) versus binary optimizers (e.g.
> BOLT) for applications such as PostgreSQL.
> 
> I am facing issues when applying LTO to PostgreSQL as the produced binary
> seems broken (the server dies quickly after it has started). This is
> definitely a compiler bug, but I was wondering if anyone here  have
> experimented with LTO for PostgreSQL.

What compiler / version / flags / OS did you try?


FWIW, I've experimented with LTO and PGO a bunch, both with gcc and clang. I
did hit a crash in gcc, but that did turn out to be a compiler bug, and
actually reduced to something not even needing LTO.

I saw quite substantial speedups with PGO, but I only tested very specific
workloads. IIRC it was >15% gain in concurrent readonly pgbench.


I dimly recall failing to get some benefit out of bolt for some reason that I
unfortunately don't even vaguely recall.

Greetings,

Andres Freund




Re: Non-superuser subscription owners

2023-01-27 Thread Mark Dilger



> On Jan 27, 2023, at 1:35 PM, Robert Haas  wrote:
> 
>> I started out asking what benefits it provides to own a subscription one
>> cannot modify. But I think it is a good capability to have, to restrict the
>> set of relations that replication could target.  Although perhaps it'd be
>> better to set the "replay user" as a separate property on the subscription?
> 
> That's been proposed previously, but for reasons I don't quite
> remember it seems not to have happened. I don't think it achieved
> consensus.
> 
>> Does owning a subscription one isn't allowed to modify useful outside of 
>> that?
> 
> Uh, possibly that's a question for Mark or Jeff. I don't know. I can't
> see what they would be, but I just work here.

If the owner cannot modify the subscription, then the owner degenerates into a 
mere "run-as" user.  Note that this isn't how things work now, and even if we 
disallowed owners from modifying the connection string, there would still be 
other attributes the owner could modify, such as the set of publications 
subscribed.


More generally, my thinking on this thread is that there needs to be two 
nosuperuser roles:  A higher privileged role which can create a subscription, 
and a lower privileged role serving the "run-as" function.  Those shouldn't be 
the same, because the "run-as" concept doesn't logically need to have 
subscription creation power, and likely *shouldn't* have that power.  Depending 
on which sorts of attributes a subscription object has, such as the connection 
string, the answer differs for whether the owner/"run-as" user should get to 
change those attributes.  One advantage of Jeff's idea of using a server object 
rather than a string is that it becomes more plausibly safe to allow the 
subscription owner to make changes to that attribute of the subscription.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: heapgettup() with NoMovementScanDirection unused in core?

2023-01-27 Thread Melanie Plageman
On Fri, Jan 27, 2023 at 05:05:16PM -0500, Tom Lane wrote:
> Melanie Plageman  writes:
> > I have written the patch to remove the unreachable code in
> > heapgettup_pagemode]().
> 
> A few thoughts ...
> 
> 1. Do we really need quite so many Asserts?  I'd kind of lean
> to just having one, at some higher level of the executor.

Yes, perhaps I was a bit overzealous putting them in functions called
for every tuple.

I'm not sure where in the executor would make the most sense.
ExecInitSeqScan() comes to mind, but I'm not sure that covers all of the
desired cases.

> 
> 2. I'm not sure if we want to do this:
> 
> - direction = estate->es_direction;
> - /* flip direction if this is an overall backward scan */
> - if (ScanDirectionIsBackward(((IndexScan *) 
> node->ss.ps.plan)->indexorderdir))
> - {
> - if (ScanDirectionIsForward(direction))
> - direction = BackwardScanDirection;
> - else if (ScanDirectionIsBackward(direction))
> - direction = ForwardScanDirection;
> - }
> + direction = estate->es_direction * ((IndexScan *) 
> node->ss.ps.plan)->indexorderdir;
> 
> AFAIR, there is noplace today that depends on the exact encoding
> of ForwardScanDirection and BackwardScanDirection, and I'm not
> sure that we want to introduce such a dependency.  If we do it
> at least deserves a comment here, and you probably ought to adjust
> the wishy-washy comment in sdir.h as well.  Taking out the existing
> comment explaining what this code is doing is not an improvement
> either.

I think you mean the enum value when you say encoding? I actually
started using the ScanDirection value in the refactor of heapgettup()
and heapgettup_pagemode() which I proposed here [1]. Why wouldn't we
want to introduce such a dependency?

> 
> 3. You didn't update the header comment for heapgettup, nor the
> one in pathnodes.h for IndexPath.indexscandir.

Oops -- thanks for catching those!

> 
> 4. I don't think the proposed test case is worth the cycles.

Just the one I wrote or any test case?

- Melanie

[1] 
https://www.postgresql.org/message-id/flat/CAAKRu_ZJg_N7zHtWP%2BJoSY_hrce4%2BGKioL137Y2c2En-kuXQ7g%40mail.gmail.com#8a106c6625bc069cf439230cd9fa1000




Re: heapgettup() with NoMovementScanDirection unused in core?

2023-01-27 Thread Tom Lane
Melanie Plageman  writes:
> I have written the patch to remove the unreachable code in
> heapgettup_pagemode]().

A few thoughts ...

1. Do we really need quite so many Asserts?  I'd kind of lean
to just having one, at some higher level of the executor.

2. I'm not sure if we want to do this:

-   direction = estate->es_direction;
-   /* flip direction if this is an overall backward scan */
-   if (ScanDirectionIsBackward(((IndexScan *) 
node->ss.ps.plan)->indexorderdir))
-   {
-   if (ScanDirectionIsForward(direction))
-   direction = BackwardScanDirection;
-   else if (ScanDirectionIsBackward(direction))
-   direction = ForwardScanDirection;
-   }
+   direction = estate->es_direction * ((IndexScan *) 
node->ss.ps.plan)->indexorderdir;

AFAIR, there is noplace today that depends on the exact encoding
of ForwardScanDirection and BackwardScanDirection, and I'm not
sure that we want to introduce such a dependency.  If we do it
at least deserves a comment here, and you probably ought to adjust
the wishy-washy comment in sdir.h as well.  Taking out the existing
comment explaining what this code is doing is not an improvement
either.

3. You didn't update the header comment for heapgettup, nor the
one in pathnodes.h for IndexPath.indexscandir.

4. I don't think the proposed test case is worth the cycles.

regards, tom lane




Re: Non-superuser subscription owners

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 16:35:11 -0500, Robert Haas wrote:
> > Maybe a daft question:
> >
> > Have we considered using a "normal grant", e.g. on the database, instead of 
> > a
> > role?  Could it e.g. be useful to grant a user the permission to create a
> > subscription in one database, but not in another?
> 
> Potentially, but I didn't think we'd want to burn through permissions
> bits that fast, even given 7b378237aa805711353075de142021b1d40ff3b0.
> Still, if the consensus is otherwise, I can change it.

I don't really have an opinion on what's better. I looked briefly whether
there was discussion around ithis but I didn't see anything.

pg_create_subcription feels a bit different than most of the other pg_*
roles. For most of those there is no schema object to tie permissions to. But
here there is.

But I think there's good arguments against a GRANT approach, too. GRANT ALL ON
DATABASE would suddenly be dangerous. How does it interact with database
ownership? Etc.


> Then I guess we'd end up with GRANT CREATE ON DATABASE and GRANT CREATE
> SUBSCRIPTION ON DATABASE, which I'm sure wouldn't be confusing at all.

Heh. I guess it could just be GRANT SUBSCRIBE.



> Or, another thought, maybe this should be checking for CREATE on the
> current database + also pg_create_subscription. That seems like it
> might be the right idea, actually.

Yes, that seems like a good idea.



> > The subscription code already does ownership checks before locking and now
> > there's also the passwordrequired before.  Is it possible that this could 
> > open
> > up some sort of race? Could e.g. the user change the ownership to the
> > superuser in one session, do an ALTER in the other?
> >
> > It looks like your change won't increase the danger of that, as the
> > superuser() check just checks the current users permissions.
> 
> I'm not entirely clear whether there's a hazard there.

I'm not at all either. It's just a code pattern that makes me anxious - I
suspect there's a few places it makes us more vulnerable.


> If there is, I think we could fix it by moving the LockSharedObject call up
> higher, above object_ownercheck. The only problem with that is it lets you
> lock an object on which you have no permissions: see
> 2ad36c4e44c8b513f6155656e1b7a8d26715bb94. To really fix that, we'd need an
> analogue of RangeVarGetRelidExtended.

Yea, we really should have something like RangeVarGetRelidExtended() for other
kinds of objects. It'd take a fair bit of work / time to use it widely, but
it'll take even longer if we start in 5 years ;)

Perhaps the bulk of RangeVarGetRelidExtended() could be generalized by having
a separate name->oid lookup callback?

Greetings,

Andres Freund




Re: heapgettup() with NoMovementScanDirection unused in core?

2023-01-27 Thread Melanie Plageman
Hi,

I have written the patch to remove the unreachable code in
heapgettup_pagemode]().

On Wed, Jan 25, 2023 at 10:02 AM Tom Lane  wrote:
>
> I wonder if we couldn't also get rid of this confusingly-inconsistent
> alternative usage in the planner:
>
>  * 'indexscandir' is one of:
>  *ForwardScanDirection: forward scan of an ordered index
>  *BackwardScanDirection: backward scan of an ordered index
>  *NoMovementScanDirection: scan of an unordered index, or don't care
>  * (The executor doesn't care whether it gets ForwardScanDirection or
>  * NoMovementScanDirection for an indexscan, but the planner wants to
>  * distinguish ordered from unordered indexes for building pathkeys.)
>
> While that comment's claim is plausible, I think it's been wrong for
> years.  AFAICS indxpath.c makes pathkeys before it ever does this:
>
>   index_is_ordered ?
>   ForwardScanDirection :
>   NoMovementScanDirection,
>
> and nothing depends on that later either.  So I think we could
> simplify this to something like "indexscandir is either
> ForwardScanDirection or BackwardScanDirection.  (Unordered
> index types need not support BackwardScanDirection.)"
>

I also did what I *think* Tom is suggesting here -- make index scan's
scan direction always forward or backward...

Maybe the set should be two patches...dunno.

- Melanie
From be8119af72499aec03e59200c9db44f8520ccebf Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 27 Jan 2023 14:02:03 -0500
Subject: [PATCH v1] Remove NoMovementScanDirection inconsistencies

Remove use of NoMovementScanDirection for index scans. Unordered indexes
will now always use ForwardScanDirection.

Remove unreachable code in heapgettup() and heapgettup_pagemode()
handling NoMovementScanDirection.

Add assertions in case table AMs try to use scan directions other than
forward and backward.

Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY%3D_T8QEqZqOL02rpi2bw%40mail.gmail.com
---
 src/backend/access/heap/heapam.c | 73 
 src/backend/commands/explain.c   |  3 -
 src/backend/executor/nodeIndexonlyscan.c | 17 +++---
 src/backend/executor/nodeIndexscan.c | 17 +++---
 src/backend/optimizer/path/indxpath.c|  8 +--
 src/backend/optimizer/util/pathnode.c|  5 +-
 src/include/access/tableam.h |  6 ++
 src/test/regress/expected/hash_index.out | 27 +
 src/test/regress/sql/hash_index.sql  | 23 
 9 files changed, 87 insertions(+), 92 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e6024a980b..be0555f5c4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -523,6 +523,9 @@ heapgettup(HeapScanDesc scan,
 	int			linesleft;
 	ItemId		lpp;
 
+	Assert(dir == ForwardScanDirection ||
+		   dir == BackwardScanDirection);
+
 	/*
 	 * calculate next starting lineoff, given scan direction
 	 */
@@ -583,7 +586,7 @@ heapgettup(HeapScanDesc scan,
 
 		linesleft = lines - lineoff + 1;
 	}
-	else if (backward)
+	else
 	{
 		/* backward parallel scan not supported */
 		Assert(scan->rs_base.rs_parallel == NULL);
@@ -653,34 +656,6 @@ heapgettup(HeapScanDesc scan,
 
 		linesleft = lineoff;
 	}
-	else
-	{
-		/*
-		 * ``no movement'' scan direction: refetch prior tuple
-		 */
-		if (!scan->rs_inited)
-		{
-			Assert(!BufferIsValid(scan->rs_cbuf));
-			tuple->t_data = NULL;
-			return;
-		}
-
-		block = ItemPointerGetBlockNumber(&(tuple->t_self));
-		if (block != scan->rs_cblock)
-			heapgetpage((TableScanDesc) scan, block);
-
-		/* Since the tuple was previously fetched, needn't lock page here */
-		page = BufferGetPage(scan->rs_cbuf);
-		TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, page);
-		lineoff = ItemPointerGetOffsetNumber(&(tuple->t_self));
-		lpp = PageGetItemId(page, lineoff);
-		Assert(ItemIdIsNormal(lpp));
-
-		tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		tuple->t_len = ItemIdGetLength(lpp);
-
-		return;
-	}
 
 	/*
 	 * advance the scan until we find a qualifying tuple or run out of stuff
@@ -861,6 +836,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	int			linesleft;
 	ItemId		lpp;
 
+	Assert(dir == ForwardScanDirection ||
+		   dir == BackwardScanDirection);
+
 	/*
 	 * calculate next starting lineindex, given scan direction
 	 */
@@ -918,7 +896,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 
 		linesleft = lines - lineindex;
 	}
-	else if (backward)
+	else
 	{
 		/* backward parallel scan not supported */
 		Assert(scan->rs_base.rs_parallel == NULL);
@@ -978,38 +956,6 @@ heapgettup_pagemode(HeapScanDesc scan,
 
 		linesleft = lineindex + 1;
 	}
-	else
-	{
-		/*
-		 * ``no movement'' scan direction: refetch prior tuple
-		 */
-		if (!scan->rs_inited)
-		{
-			Assert(!BufferIsValid(scan->rs_cbuf));
-			tuple->t_data = NULL;
-			return;
-		}
-
-		block = 

Re: Non-superuser subscription owners

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 4:09 PM Andres Freund  wrote:
> Hm, compared to postgres_fdw, the user has far less control over what's
> happening using that connection. Is there a way a subscription owner can
> trigger evaluation of near-arbitrary SQL on the publisher side?

I'm not aware of one, but what I think it would let you do is
exfiltrate data you're not entitled to see.

> I started out asking what benefits it provides to own a subscription one
> cannot modify. But I think it is a good capability to have, to restrict the
> set of relations that replication could target.  Although perhaps it'd be
> better to set the "replay user" as a separate property on the subscription?

That's been proposed previously, but for reasons I don't quite
remember it seems not to have happened. I don't think it achieved
consensus.

> Does owning a subscription one isn't allowed to modify useful outside of that?

Uh, possibly that's a question for Mark or Jeff. I don't know. I can't
see what they would be, but I just work here.

> Maybe a daft question:
>
> Have we considered using a "normal grant", e.g. on the database, instead of a
> role?  Could it e.g. be useful to grant a user the permission to create a
> subscription in one database, but not in another?

Potentially, but I didn't think we'd want to burn through permissions
bits that fast, even given 7b378237aa805711353075de142021b1d40ff3b0.
Still, if the consensus is otherwise, I can change it. Then I guess
we'd end up with GRANT CREATE ON DATABASE and GRANT CREATE
SUBSCRIPTION ON DATABASE, which I'm sure wouldn't be confusing at all.

Or, another thought, maybe this should be checking for CREATE on the
current database + also pg_create_subscription. That seems like it
might be the right idea, actually.

> The subscription code already does ownership checks before locking and now
> there's also the passwordrequired before.  Is it possible that this could open
> up some sort of race? Could e.g. the user change the ownership to the
> superuser in one session, do an ALTER in the other?
>
> It looks like your change won't increase the danger of that, as the
> superuser() check just checks the current users permissions.

I'm not entirely clear whether there's a hazard there. If there is, I
think we could fix it by moving the LockSharedObject call up higher,
above object_ownercheck. The only problem with that is it lets you
lock an object on which you have no permissions: see
2ad36c4e44c8b513f6155656e1b7a8d26715bb94. To really fix that, we'd
need an analogue of RangeVarGetRelidExtended.

> and throwing an error like that will at the very least leak the connection,
> fd, fd reservation. Which I just had fixed :). At the very least you'd need to
> copy the stuff that "bad_connection:" does.

OK.

> I did wonder whether we should make libpqrcv_connect() use errsave() to return
> errors.  Or whether we should make libpqrcv register a memory context reset
> callback that'd close the libpq connection.

Yeah. Using errsave() might be better, but not sure I want to tackle
that just now.

> That is a somewhat odd API.  Why does it throw for some things, but not
> others? Seems a bit cleaner to pass in a parameter indicating whether it
> should throw when not finding a password? Particularly because you already
> pass that to walrcv_connect().

Will look into that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 21:45:16 +0100, Tomas Vondra wrote:
> On 1/27/23 08:18, Bharath Rupireddy wrote:
> >> I think my idea of only forcing to flush/wait an LSN some distance in the 
> >> past
> >> would automatically achieve that?
> > 
> > I'm sorry, I couldn't get your point, can you please explain it a bit more?
> > 
> 
> The idea is that we would not flush the exact current LSN, because
> that's likely somewhere in the page, and we always write the whole page
> which leads to write amplification.
> 
> But if we backed off a bit, and wrote e.g. to the last page boundary,
> that wouldn't have this issue (either the page was already flushed -
> noop, or we'd have to flush it anyway).

Yep.


> We could even back off a bit more, to increase the probability it was
> actually flushed / sent to standby.

That's not the sole goal, from my end: I'd like to avoid writing out +
flushing the WAL in too small chunks.  Imagine a few concurrent vacuums or
COPYs or such - if we're unlucky they'd each end up exceeding their "private"
limit close to each other, leading to a number of small writes of the
WAL. Which could end up increasing local commit latency / iops.

If we instead decide to only ever flush up to something like
  last_page_boundary - 1/8 * throttle_pages * XLOG_BLCKSZ

we'd make sure that the throttling mechanism won't cause a lot of small
writes.


> > Keeping replication lag under check enables one to provide a better
> > RPO guarantee as discussed in the other thread
> > https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
> > 
> 
> Isn't that a bit over-complicated? RPO generally only cares about xacts
> that committed (because that's what you want to not lose), so why not to
> simply introduce a "sync mode" that simply uses a bit older LSN when
> waiting for the replica? Seems much simpler and similar to what we
> already do.

I don't think that really helps you that much. If there's e.g. a huge VACUUM /
COPY emitting loads of WAL you'll suddenly see commit latency of a
concurrently committing transactions spike into oblivion. Whereas a general
WAL throttling mechanism would throttle the VACUUM, without impacting the
commit latency of normal transactions.

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 12:48:43 +0530, Bharath Rupireddy wrote:
> Looking at the patch, the feature, in its current shape, focuses on
> improving replication lag (by throttling WAL on the primary) only when
> synchronous replication is enabled. Why is that? Why can't we design
> it for replication in general (async, sync, and logical replication)?
> 
> Keeping replication lag under check enables one to provide a better
> RPO guarantee as discussed in the other thread
> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.

I think something narrower and easier to achieve is a good thing. We've
already had loads of discussion for the more general problem, without a lot of
actual progress.

Greetings,

Andres Freund




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 12:06:49 +0100, Jakub Wartak wrote:
> On Thu, Jan 26, 2023 at 4:49 PM Andres Freund  wrote:
> 
> > Huh? Why did you remove the GUC?
> 
> After reading previous threads, my optimism level of getting it ever
> in shape of being widely accepted degraded significantly (mainly due
> to the discussion of wider category of 'WAL I/O throttling' especially
> in async case, RPO targets in async case and potentially calculating
> global bandwidth).

I think it's quite reasonable to limit this to a smaller scope. Particularly
because those other goals are pretty vague but ambitious goals. IMO the
problem with a lot of the threads is precisely that that they aimed at a level
of generallity that isn't achievable in one step.


> I've assumed that it is a working sketch, and as such not having GUC name
> right now (just for sync case) would still allow covering various other
> async cases in future without breaking compatibility with potential name GUC
> changes (see my previous "wal_throttle_larger_transactions="
> proposal ).

It's harder to benchmark something like this without a GUC, so I think it's
worth having, even if it's not the final name.


> > SyncRepWaitForLSN() has this comment:
> >  * 'lsn' represents the LSN to wait for.  'commit' indicates whether this 
> > LSN
> >  * represents a commit record.  If it doesn't, then we wait only for the WAL
> >  * to be flushed if synchronous_commit is set to the higher level of
> >  * remote_apply, because only commit records provide apply feedback.
> 
> Hm, not sure if I understand: are you saying that we should (in the
> throttled scenario) have some special feedback msgs or not --
> irrespective of the setting? To be honest the throttling shouldn't
> wait for the standby full setting, it's just about slowdown fact (so
> IMHO it would be fine even in remote_write/remote_apply scenario if
> the remote walreceiver just received the data, not necessarily write
> it into file or wait for for applying it). Just this waiting for a
> round-trip ack about LSN progress would be enough to slow down the
> writer (?). I've added some timing log into the draft and it shows
> more or less constantly solid RTT even as it stands:

My problem was that the function header for SyncRepWaitForLSN() seems to say
that we don't wait at all if commit=false and synchronous_commit <
remote_apply. But I think that might just be bad formulation.

   [...] 'commit' indicates whether this LSN
 * represents a commit record.  If it doesn't, then we wait only for the WAL
 * to be flushed if synchronous_commit is set to the higher level of
 * remote_apply, because only commit records provide apply feedback.

because the code does something entirely different afaics:

/* Cap the level for anything other than commit to remote flush only. */
if (commit)
mode = SyncRepWaitMode;
else
mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);


Greetings,

Andres Freund




Re: Non-superuser subscription owners

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 14:42:01 -0500, Robert Haas wrote:
> At first, I found it a bit tempting to see this as a further
> indication that the force-a-password approach is not the right idea,
> because the test case clearly memorializes a desire *not* to require a
> password in this situation. However, the loopback-to-superuser attack
> is just as viable for subscription as it in other cases, and my
> previous patch would have done nothing to block it.

Hm, compared to postgres_fdw, the user has far less control over what's
happening using that connection. Is there a way a subscription owner can
trigger evaluation of near-arbitrary SQL on the publisher side?


> So what I did instead is add a password_required attribute, just like what
> postgres_fdw has. As in the case of postgres_fdw, the actual rule is that if
> the attribute is false, a password is not required, and if the attribute is
> true, a password is required unless you are a superuser.  If you're a
> superuser, it still isn't. This is a slightly odd set of semantics but it
> has precedent and practical advantages. Also, as in the case of
> postgres_fdw, only a superuser can set password_required=false, and a
> subscription that has that setting can only be modified by a superuser, no
> matter who owns it.

I started out asking what benefits it provides to own a subscription one
cannot modify. But I think it is a good capability to have, to restrict the
set of relations that replication could target.  Although perhaps it'd be
better to set the "replay user" as a separate property on the subscription?

Does owning a subscription one isn't allowed to modify useful outside of that?



> Even though I hate the require-a-password stuff with the intensity of
> a thousand suns, I think this is better than the previous patch,
> because it's more consistent with what we do elsewhere and because it
> blocks the loopback-connection-to-superuser attack. I think we
> *really* need to develop a better system for restricting proxied
> connections (no matter how proxied) and I hope that we do that soon.
> But inventing something for this purpose that differs from what we do
> elsewhere will make that task harder, not easier.
> 
> Thoughts?

I think it's reasonable to mirror behaviour from elsewhere, and it'd let us
have this feature relatively soon - I think it's a common need to do this as a
non-superuser. It's IMO a very good idea to not subscribe as a superuser, even
if set up by a superuser...

But I also would understand if you / somebody else chose to focus on
implementing a less nasty connection model.


> Subject: [PATCH v2] Add new predefined role pg_create_subscriptions.

Maybe a daft question:

Have we considered using a "normal grant", e.g. on the database, instead of a
role?  Could it e.g. be useful to grant a user the permission to create a
subscription in one database, but not in another?


> @@ -1039,6 +1082,16 @@ AlterSubscription(ParseState *pstate, 
> AlterSubscriptionStmt *stmt,
>  
>   sub = GetSubscription(subid, false);
>  
> + /*
> +  * Don't allow non-superuser modification of a subscription with
> +  * password_required=false.
> +  */
> + if (!sub->passwordrequired && !superuser())
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +  
> errmsg("password_required=false is superuser-only"),
> +  errhint("Subscriptions with 
> the password_required option set to false may only be created or modified by 
> the superuser.")));
> +
>   /* Lock the subscription so nobody else can do anything with it. */
>   LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

The subscription code already does ownership checks before locking and now
there's also the passwordrequired before.  Is it possible that this could open
up some sort of race? Could e.g. the user change the ownership to the
superuser in one session, do an ALTER in the other?

It looks like your change won't increase the danger of that, as the
superuser() check just checks the current users permissions.


> @@ -180,6 +180,13 @@ libpqrcv_connect(const char *conninfo, bool logical, 
> const char *appname,
>   if (PQstatus(conn->streamConn) != CONNECTION_OK)
>   goto bad_connection_errmsg;
>  
> + if (must_use_password && !PQconnectionUsedPassword(conn->streamConn))
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
> +  errmsg("password is required"),
> +  errdetail("Non-superuser cannot connect if the 
> server does not request a password."),
> +  errhint("Target server's authentication method 
> must be changed.")));
> +

The documentation of libpqrcv_connect() says that:
 * Returns NULL on error and fills the err with 

Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security

2023-01-27 Thread Robert Haas
On Wed, Jan 25, 2023 at 6:22 PM Jacob Champion  wrote:
> Sure: Ambient authority [1] means that something is granted access based
> on some aspect of its existence that it can't remove (or even
> necessarily enumerate). Up above, when you said "I cannot choose not to
> be myself," that's a clear marker that ambient authority is involved.
> Examples of ambient authn/z factors might include an originating IP
> address, the user ID of a connected peer process, the use of a loopback
> interface, a GPS location, and so on. So 'peer' and 'ident' are ambient
> authentication methods.

OK.

> 1) Forwarding the original ambient context along with the request, so
> the server can check it too.

Right, so a protocol extension. Reasonable idea, but a big lift. Not
only do you need everyone to be running a new enough version of
PostgreSQL, but existing proxies like pgpool and pgbouncer need
updates, too.

> 2) Explicitly combining the request with the proof of authority needed
> to make it, as in capability-based security [3].

As far as I can see, that link doesn't address how you'd make this
approach work across a network.

> 3) Dropping as many implicitly-held privileges as possible before making
> a request. This doesn't solve the problem but may considerably reduce
> the practical attack surface.

Right. I definitely don't object to this kind of approach, but I don't
think it can ever be sufficient by itself.

> > e.g.
> >
> > all all all local all all - deny # block access through UNIX sockets
> > all all all 127.0.0.0/8 all all - deny # block loopback interface via IPv4
> >
> > Or:
> >
> > postgres_fdw all all all all all authentication=cleartext,md5,sasl
> > allow # allow postgres_fdw with password-ish authentication
>
> I think this style focuses on absolute configuration flexibility at the
> expense of usability. It obfuscates the common use cases. (I have the
> exact same complaint about our HBA and ident configs, so I may be
> fighting uphill.)

That's probably somewhat true, but on the other hand, it also is more
powerful than what you're describing. In your system, is there some
way the DBA can say "hey, you can connect to any of the machines on
this list of subnets, but nothing else"? Or equally, "hey, you may NOT
connect to any machine on this list of subnets, but anything else is
fine"? Or "you can connect to these subnets without SSL, but if you
want to talk to anything else, you need to use SSL"? I would feel a
bit bad saying that those are just use cases we don't care about. Most
people likely wouldn't use that kind of flexibility, so maybe it
doesn't really matter, but it seems kind of nice to have. Your idea
seems to rely on us being able to identify all of the policies that a
user is likely to want and give names to each one, and I don't feel
very confident that that's realistic. But maybe I'm misinterpreting
your idea?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Syncrep and improving latency due to WAL throttling

2023-01-27 Thread Tomas Vondra
On 1/27/23 08:18, Bharath Rupireddy wrote:
> On Thu, Jan 26, 2023 at 9:21 PM Andres Freund  wrote:
>>
>>> 7. I think we need to not let backends throttle too frequently even
>>> though they have crossed wal_throttle_threshold bytes. The best way is
>>> to rely on replication lag, after all the goal of this feature is to
>>> keep replication lag under check - say, throttle only when
>>> wal_distance > wal_throttle_threshold && replication_lag >
>>> wal_throttle_replication_lag_threshold.
>>
>> I think my idea of only forcing to flush/wait an LSN some distance in the 
>> past
>> would automatically achieve that?
> 
> I'm sorry, I couldn't get your point, can you please explain it a bit more?
> 

The idea is that we would not flush the exact current LSN, because
that's likely somewhere in the page, and we always write the whole page
which leads to write amplification.

But if we backed off a bit, and wrote e.g. to the last page boundary,
that wouldn't have this issue (either the page was already flushed -
noop, or we'd have to flush it anyway).

We could even back off a bit more, to increase the probability it was
actually flushed / sent to standby. That would still work, because the
whole point is not to allow one process to generate too much unflushed
WAL, forcing the other (small) xacts to wait at commit.

Imagine we have the limit set to 8MB, i.e. the backend flushes WAL after
generating 8MB of WAL. If we flush to the exact current LSN, the other
backends will wait for ~4MB on average. If we back off to 1MB, the wait
average increases to ~4.5MB. (This is simplified, as it ignores WAL from
the small xacts. But those flush regularly, which limit the amount. It
also ignores there might be multiple large xacts.)

> Looking at the patch, the feature, in its current shape, focuses on
> improving replication lag (by throttling WAL on the primary) only when
> synchronous replication is enabled. Why is that? Why can't we design
> it for replication in general (async, sync, and logical replication)?
> 

This focuses on sync rep, because that's where the commit latency comes
from. Async doesn't have that issue, because it doesn't wait for the
standby.

In particular, the trick is in penalizing the backends generating a lot
of WAL, while leaving the small xacts alone.

> Keeping replication lag under check enables one to provide a better
> RPO guarantee as discussed in the other thread
> https://www.postgresql.org/message-id/CAHg%2BQDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw%40mail.gmail.com.
> 

Isn't that a bit over-complicated? RPO generally only cares about xacts
that committed (because that's what you want to not lose), so why not to
simply introduce a "sync mode" that simply uses a bit older LSN when
waiting for the replica? Seems much simpler and similar to what we
already do.

Yeah, if someone generates a lot of WAL in uncommitted transaction, all
of that may be lost. But who cares (from the RPO point of view)?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: GUCs to control abbreviated sort keys

2023-01-27 Thread Jeff Davis
On Fri, 2023-01-27 at 11:41 -0800, Peter Geoghegan wrote:
> I cannot recreate the issue you describe.

Interesting. For my test:

glibc  2.35  ICU 70.1
gcc11.3.0LLVM 14.0.0

> It's not impossible that the perl program you wrote produces
> non-deterministic output

It is non-deterministic, but I tried with two generated files, and got
similar results.

Right now I suspect the ICU version might be the reason. I'll try with
72.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Authentication fails for md5 connections if ~/.postgresql/postgresql.{crt and key} exist

2023-01-27 Thread Cary Huang

> I think the sslcertmode=disable option that I introduced in [1] solves
> this issue too; would it work for your case? That whole patchset is
> meant to tackle the general case of the problem you've described.
>
> (Eventually I'd like to teach the server not to ask for a client
>  certificate if it's not going to use it.)

there is an option in pg_hba.conf on the server side called "clientcert" 
that can be specified besides the auth method that controls if certain 
client connections are required to send client certificate for 
additional verification. The value of "clientcert" can be "verify-ca" or 
"verify-full". For example:


hostssl    all all 127.0.0.1/32 md5 
clientcert=verify-full


If clientcert is not requested by the server, but yet the client still 
sends the certificate, the server will still verify it. This is the case 
in this discussion.


I agree that it is a more elegant approach to add "sslcertmode=disable" 
on the client side to prevent sending default certificate.


But, if the server does request clientcert but client uses 
"sslcertmode=disable" to connect and not give a certificate, it would 
also result in authentication failure. In this case, we actually would 
want to ignore "sslcertmode=disable" and send default certificates if found.


It would perhaps to better change the parameter to 
"defaultclientcert=on-demand" on the client side that will:


1. not send the existing default certificate if server does not request 
a certificate
2. send the existing default certificate if server does request a 
certificate while the client does not use "sslcert" parameter to specify 
another non-default certificate


I put "default" in the parameter name to indicate that it only applies 
to default certificate. If user specifies a non-default certificate 
using "sslcert" parameter, "defaultclientcert" should not be used and 
client should give error if both exists.



Cary Huang

HighGo Software Canada
www.highgo.ca






Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Tom Lane
=?UTF-8?Q?Jo=C3=A3o_Paulo_Labegalini_de_Carvalho?=  
writes:
> I am facing issues when applying LTO to PostgreSQL as the produced binary
> seems broken (the server dies quickly after it has started). This is
> definitely a compiler bug, but I was wondering if anyone here  have
> experimented with LTO for PostgreSQL.

There are a lot of places where we're implicitly relying on
cross-compilation-unit optimizations NOT happening, because the
code isn't adequately decorated with memory barriers and the like.
So I wouldn't necessarily assume that the misbehavior you're seeing
represents anything that the compiler folks would consider a bug.

In the long run we might be interested in trying to make this
work better, but I don't know of anyone working on it now.

regards, tom lane




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 3:02 PM Tom Lane  wrote:
> The string-hacking was fully as tedious as I expected.  However, the
> output looks pretty nice, and this does have the advantage that the
> pre-programmed substitutions become a lot more robust: they are no
> longer dependent on the initdb code exactly matching what is in
> postgresql.conf.sample.

Awesome! Thank you very much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Tom Lane
I wrote:
>>> Anyway, it seems like I gotta work harder.  I'll produce a
>>> new patch.

The string-hacking was fully as tedious as I expected.  However, the
output looks pretty nice, and this does have the advantage that the
pre-programmed substitutions become a lot more robust: they are no
longer dependent on the initdb code exactly matching what is in
postgresql.conf.sample.

regards, tom lane

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..724188b1b5 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -433,6 +433,23 @@ PostgreSQL documentation
 Other, less commonly used, options are also available:
 
 
+ 
+  -c name=value
+  --set name=value
+  
+   
+Forcibly set the server parameter name
+to value during initdb,
+and also install that setting in the
+generated postgresql.conf file,
+so that it will apply during future server runs.
+This option can be given more than once to set several parameters.
+It is primarily useful when the environment is such that the server
+will not start at all using the default parameters.
+   
+  
+ 
+
  
   -d
   --debug
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..c955c0837e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -48,6 +48,7 @@
 
 #include "postgres_fe.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -82,6 +83,13 @@
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
+/* simple list of strings */
+typedef struct _stringlist
+{
+	char	   *str;
+	struct _stringlist *next;
+} _stringlist;
+
 static const char *const auth_methods_host[] = {
 	"trust", "reject", "scram-sha-256", "md5", "password", "ident", "radius",
 #ifdef ENABLE_GSS
@@ -142,6 +150,8 @@ static char *pwfilename = NULL;
 static char *superuser_password = NULL;
 static const char *authmethodhost = NULL;
 static const char *authmethodlocal = NULL;
+static _stringlist *extra_guc_names = NULL;
+static _stringlist *extra_guc_values = NULL;
 static bool debug = false;
 static bool noclean = false;
 static bool noinstructions = false;
@@ -242,7 +252,10 @@ static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			const char *token, const char *replacement);
-
+static char **replace_guc_value(char **lines,
+const char *guc_name, const char *guc_value,
+bool mark_as_comment);
+static bool guc_value_requires_quotes(const char *guc_value);
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
 static FILE *popen_check(const char *command, const char *mode);
@@ -253,6 +266,7 @@ static void check_input(char *path);
 static void write_version_file(const char *extrapath);
 static void set_null_conf(void);
 static void test_config_settings(void);
+static bool test_specific_config_settings(int test_conns, int test_buffs);
 static void setup_config(void);
 static void bootstrap_template1(void);
 static void setup_auth(FILE *cmdfd);
@@ -360,9 +374,34 @@ escape_quotes_bki(const char *src)
 }
 
 /*
- * make a copy of the array of lines, with token replaced by replacement
+ * Add an item at the end of a stringlist.
+ */
+static void
+add_stringlist_item(_stringlist **listhead, const char *str)
+{
+	_stringlist *newentry = pg_malloc(sizeof(_stringlist));
+	_stringlist *oldentry;
+
+	newentry->str = pg_strdup(str);
+	newentry->next = NULL;
+	if (*listhead == NULL)
+		*listhead = newentry;
+	else
+	{
+		for (oldentry = *listhead; oldentry->next; oldentry = oldentry->next)
+			 /* skip */ ;
+		oldentry->next = newentry;
+	}
+}
+
+/*
+ * Make a copy of the array of lines, with token replaced by replacement
  * the first time it occurs on each line.
  *
+ * The original data structure is not changed, but we share any unchanged
+ * strings with it.  (This definition lends itself to memory leaks, but
+ * we don't care too much about leaks in this program.)
+ *
  * This does most of what sed was used for in the shell script, but
  * doesn't need any regexp stuff.
  */
@@ -416,6 +455,168 @@ replace_token(char **lines, const char *token, const char *replacement)
 	return result;
 }
 
+/*
+ * Make a copy of the array of lines, replacing the possibly-commented-out
+ * assignment of parameter guc_name with a live assignment of guc_value.
+ * The value will be suitably quoted.
+ *
+ * If mark_as_comment is true, the replacement line is prefixed with '#'.
+ * This is used for fixing up cases where the effective default might not
+ * match what is in postgresql.conf.sample.
+ *
+ * We assume there's at most one matching assignment.  If we find no match,
+ * append a new line with the desired assignment.
+ *
+ * The original data structure is not changed, but we share any unchanged
+ * 

Re: Show various offset arrays for heap WAL records

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
 wrote:
> I believe I have addressed this in the attached patch.

I'm not sure what's best in terms of formatting details but I
definitely like the idea of making pg_waldump show more details. I'd
even like to have a way to extract the tuple data, when it's
operations on tuples and we have those tuples in the payload. That'd
be a lot more verbose than what you are doing here, though, and I'm
not saying you should go do it right now or anything like that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-01-27 Thread Robert Haas
On Thu, Jan 19, 2023 at 8:46 PM Andres Freund  wrote:
> > If we already had (or have) that logic someplace else, it would
> > probably make sense to reuse it
>
> We hve. See at least postgres_fdw's check_conn_params(), dblink's
> dblink_connstr_check() and dblink_security_check().

In the patch I posted previously, I had some other set of checks, more
or less along the lines suggested by Jeff. I looked into revising that
approach and making the behavior match exactly what we do in those
places instead. I find that it breaks 027_nosuperuser.pl.
Specifically, where without the patch I get "ok 6 - nosuperuser admin
with all table privileges can replicate into unpartitioned", with the
patch it goes boom, because the subscription can't connect any more
due to the password requirement.

At first, I found it a bit tempting to see this as a further
indication that the force-a-password approach is not the right idea,
because the test case clearly memorializes a desire *not* to require a
password in this situation. However, the loopback-to-superuser attack
is just as viable for subscription as it in other cases, and my
previous patch would have done nothing to block it. So what I did
instead is add a password_required attribute, just like what
postgres_fdw has. As in the case of postgres_fdw, the actual rule is
that if the attribute is false, a password is not required, and if the
attribute is true, a password is required unless you are a superuser.
If you're a superuser, it still isn't. This is a slightly odd set of
semantics but it has precedent and practical advantages. Also, as in
the case of postgres_fdw, only a superuser can set
password_required=false, and a subscription that has that setting can
only be modified by a superuser, no matter who owns it.

Even though I hate the require-a-password stuff with the intensity of
a thousand suns, I think this is better than the previous patch,
because it's more consistent with what we do elsewhere and because it
blocks the loopback-connection-to-superuser attack. I think we
*really* need to develop a better system for restricting proxied
connections (no matter how proxied) and I hope that we do that soon.
But inventing something for this purpose that differs from what we do
elsewhere will make that task harder, not easier.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Add-new-predefined-role-pg_create_subscriptions.patch
Description: Binary data


Re: GUCs to control abbreviated sort keys

2023-01-27 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis  wrote:
> On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> > Maybe an easier way to enable or disable it in the source code with a
> > #define would serve this.  Making it a GUC right away seems a bit
> > heavy-handed.  Further exploration and tweaking might well require
> > further source code changes, so relying on a source code level toggle
> > would seem appropriate.
>
> I am using these GUCs for testing the various collation paths in my
> collation refactoring branch.

I'm fine with adding the GUC as a developer option. I think that there
is zero chance of the changes to tuplesort.c having appreciable
overhead.

> I find them pretty useful, and when I saw a regression, I thought
> others might think it was useful, too. But if not I'll just leave them
> in my branch and withdraw from this thread.

I cannot recreate the issue you describe. With abbreviated keys, your
exact test case takes 00:16.620 on my system. Without abbreviated
keys, it takes 00:21.255.

To me it appears to be a moderately good case for abbreviated keys,
though certainly not as good as some cases that I've seen -- ~3x
improvements are common enough.

As a point of reference, the same test case with the C collation and
with abbreviated keys takes 00:10.822. When I look at the "trace_sort"
output for the C collation with abbreviated keys, and compare it to
the equivalent "trace_sort" output for the original "en-US-x-icu"
collation from your test case, it is clear that the overhead of
generating collated abbreviated keys within ICU is relatively high --
the initial scan of the table (which is where we generate all
abbreviated keys here) takes 4.45 seconds in the ICU case, and only
1.65 seconds in the "C" locale case. I think that you should look into
that same difference on your own system, so that we can compare and
contrast.

The underlying issue might well have something to do with the ICU
version you're using, or some other detail of your environment. I'm
using Debian unstable here. Postgres links to the system ICU, which is
ICU 72.

It's not impossible that the perl program you wrote produces
non-deterministic output, which should be controlled for, since it
might just be significant. I see this on my system, having run the
perl program as outlined in your test case:

$ ls -l /tmp/strings.txt
-rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt
$ sha1sum /tmp/strings.txt
22f60dc12527c215c8e3992e49d31dc531261a83  /tmp/strings.txt

Does that match what you see on your system?

--
Peter Geoghegan




Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread Komяpa
Hi,

We have implemented LTO in PostGIS's build system a couple releases ago. It
definitely gives +10% on heavy maths. Unfortunately we did not manage to
get it running under FreeBSD because of default system linker issues so we
had to hide it under --with-lto switch which we recommend to everyone.

I did not experiment with Postgres itself but there are definitely traces
of numerous LTO-enabled private builds on the web.

On Fri, Jan 27, 2023 at 8:05 PM João Paulo Labegalini de Carvalho <
jaopaul...@gmail.com> wrote:

> Hi all,
>
> I am investigating the benefits of different profile-guided optimizations
> (PGO) and link-time optimizations (LTO) versus binary optimizers (e.g.
> BOLT) for applications such as PostgreSQL.
>
> I am facing issues when applying LTO to PostgreSQL as the produced binary
> seems broken (the server dies quickly after it has started). This is
> definitely a compiler bug, but I was wondering if anyone here  have
> experimented with LTO for PostgreSQL.
>
> Thanks,
>
> --
> João Paulo L. de Carvalho
> Ph.D Computer Science |  IC-UNICAMP | Campinas , SP - Brazil
> Postdoctoral Research Fellow | University of Alberta | Edmonton, AB -
> Canada
> joao.carva...@ic.unicamp.br
> joao.carva...@ualberta.ca
>


Re: improving user.c error messages

2023-01-27 Thread Alvaro Herrera
While we're here,

On 2023-Jan-26, Nathan Bossart wrote:

> @@ -838,7 +867,8 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
>   if (!should_be_super && roleid == BOOTSTRAP_SUPERUSERID)
>   ereport(ERROR,
>   
> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> -  errmsg("permission denied: bootstrap 
> user must be superuser")));
> +  errmsg("permission denied to alter 
> role"),
> +  errdetail("The bootstrap user must be 
> superuser.")));

I think this one isn't using the right errcode; this is not a case of
insufficient privileges.  There's no priv you can acquire that lets you
do it.  So I'd change it to unsupported operation.


I was confused a bit by this one:

> /* an unprivileged user can change their own password */
> if (dpassword && roleid != currentUserId)
> ereport(ERROR,
> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> -errmsg("must have CREATEROLE privilege to change another 
> user's password")));
> +errmsg("permission denied to alter role"),
> +errdetail("To change another role's password, you must 
> have %s privilege and %s on the role.",
> +  "CREATEROLE", "ADMIN OPTION")));
> }

In no other message we say what operation is being attempted in the
errdetail; all the others start with "You must have" and that's it.
However, looking closer I think this one being different is okay,
because the errmsg() you're using is vague, and I think the error report
would be confusing if you were to remove the "To change another role's
password" bit.

The patch looks good to me.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 12:52 AM Andres Freund  wrote:
> I agree with bringing high-level context into the decision about whether to
> freeze agressively - my problem with the eager freezing strategy patch isn't
> that it did that too much, it's that it didn't do it enough.
>
>
> But I also don't think what I describe above is really comparable to "table
> level" eager freezing though - the potential worst case overhead is a small
> fraction of the WAL volume, and there's zero increase in data write volume.

All I meant was that I initially thought that you were trying to
replace the FPI thing with something at the same level of ambition,
that could work in a low context way. But I now see that you're
actually talking about something quite a bit more ambitious for
Postgres 16, which is structurally similar to a freezing strategy,
from a code point of view -- it relies on high-level context for the
VACUUM/table as a whole. I wasn't equating it with the eager freezing
strategy in any other way.

It might also be true that this other thing happens to render the FPI
mechanism redundant. I'm actually not completely sure that it will
just yet. Let me verify my understanding of your proposal:

You mean that we'd take the page LSN before doing anything with the
page, right at the top of lazy_scan_prune, at the same point that
"fpi_before" is initialized currently. Then, if we subsequently
dirtied the page (as determined by its LSN, so as to focus on "dirtied
via WAL logged operation") during pruning, *and* if the "lsn_before"
of the page was from before our cutoff (derived via "  lsn_threshold =
 insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1" or similar),
*and* if the page is eligible to become all-frozen, then we'd freeze
the page.

That's it, right? It's about pages that *we* (VACUUM) dirtied, and
wrote records and/or FPIs for already?

> I suspect the absolute worst case of "always freeze dirty pages" is when a
> single tuple on the page gets updated immediately after every time we freeze
> the page - a single tuple is where the freeze record is the least space
> efficient. The smallest update is about the same size as the smallest freeze
> record.  For that to amount to a large WAL increase you'd a crazy rate of such
> updates interspersed with vacuums. In slightly more realistic cases (i.e. not
> column less tuples that constantly get updated and freezing happening all the
> time) you end up with a reasonably small WAL rate overhead.

Other thing is that we'd be doing this in situations where we already
know that a VISIBLE record is required, which is comparable in size to
a FREEZE_PAGE record with one tuple/plan (around 64 bytes). The
smallest WAL records are mostly just generic WAL record header
overhead.

> Obviously that's a pointless workload, but I do think that
> analyzing the "outer boundaries" of the regression something can cause, can be
> helpful.

I agree about the "outer boundaries" being a useful guide.

> I think one way forward with the eager strategy approach would be to have a
> very narrow gating condition for now, and then incrementally expand it in
> later releases.
>
> One use-case where the eager strategy is particularly useful is
> [nearly-]append-only tables - and it's also the one workload that's reasonably
> easy to detect using stats. Maybe something like
> (dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
> or so.
>
> That'll definitely leave out loads of workloads where eager freezing would be
> useful - but are there semi-reasonable workloads where it'll hurt badly? I
> don't *think* so.

I have no further plans to work on eager freezing strategy, or
anything of the sort, in light of recent developments. My goal at this
point is very unambitious: to get the basic page-level freezing work
into a form that makes sense as a standalone thing for Postgres 16. To
put things on a good footing, so that I can permanently bow out of all
work on VACUUM having left everything in good order. That's all.

Now, that might still mean that I'd facilitate future work of this
sort, by getting the right basic structure in place. But my
involvement in any work on freezing or anything of the sort ends here,
both as a patch author and a committer of anybody else's work. I'm
proud of the work I've done on VACUUM, but I'm keen to move on from
it.

> > What about unlogged/temporary tables? The obvious thing to do there is
> > what I did in the patch that was reverted (freeze whenever the page
> > will thereby become all-frozen), and forget about LSNs. But you have
> > already objected to that part, specifically.
>
> My main concern about that is the data write amplification it could cause when
> page is clean when we start freezing.  But I can't see a large potential
> downside to always freezing unlogged/temp tables when the page is already
> dirty.

But we have to dirty the page anyway, just to set PD_ALL_VISIBLE. That
was always a gating condition. Actually, that may have 

Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Andres Freund
Hi,

On 2023-01-27 12:53:58 -0500, Robert Haas wrote:
> On Fri, Jan 27, 2023 at 12:58 AM Andres Freund  wrote:
> > Essentially the "any fpi" logic is a very coarse grained way of using the 
> > page
> > LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> > since the last write" is a good metric to avoid unnecessary freezing - it's
> > too coarse. But I think using the LSN is the right thought. What about
> > something like
> >
> >   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
> >   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
> >  FreezeMe();
> >
> > I probably got some details wrong, what I am going for with lsn_threshold is
> > that we'd freeze an already dirty page if it's not been updated within 10% 
> > of
> > the LSN distance to the last VACUUM.
>
> I think this might not be quite the right idea for a couple of reasons.

It's definitely not perfect.  If we had an approximate LSN->time map as
general infrastructure, we could do a lot better. I think it'd be reasonably
easy to maintain that in the autovacuum launcher, for example.


One thing worth calling out here, because it's not obvious from the code
quoted above in isolation, is that what I was trying to refine here was the
decision when to perform opportunistic freezing *of already dirty pages that
do not require an FPI*.

So all that we need to prevent here is freezing very hotly updated data, where
the WAL overhead of the freeze records would be noticable, because we
constantly VACUUM, due to the high turnover.


> First, suppose that the table is being processed just by autovacuum
> (no manual VACUUM operations) and that the rate of WAL generation is
> pretty even, so that LSN age is a good proxy for time. If autovacuum
> processes the table once per hour, this will freeze if it hasn't been
> updated in the last six minutes. That sounds good. But if autovacuum
> processes the table once per day, then this will freeze if it hasn't
> been updated in 2.4 hours. That might be OK, but it sounds a little on
> the long side.

You're right. I was thinking of the "lsn_since_last_vacuum" because I was
posulating it being useful elsewhere in the thread (but for eager strategy
logic) - but here that's really not very relevant.

Given that we're dealing with already dirty pages not requiring an FPI, I
think a much better "reference LSN" would be the LSN of the last checkpoint
(LSN of the last checkpoint record, not the current REDO pointer).


> Second, and more seriously, I think this would, in some circumstances,
> lead to tremendously unstable behavior. Suppose somebody does a bunch
> of work on a table and then they're like "oh, we should clean up,
> VACUUM" and it completes quickly because it's been a while since the
> last vacuum and so it doesn't freeze much. Then, for whatever reason,
> they decide to run it one more time, and it goes bananas and starts
> freezing all kinds of stuff because the LSN distance since the last
> vacuum is basically zero. Or equally, you run a manual VACUUM, and you
> get completely different behavior depending on how long it's been
> since the last autovacuum ran.

I don't think this quite applies to the scenario at hand, because it's
restricted to already dirty pages. And the max increased overhead is also
small due to that - so occasionally getting it wrong is that impactful.

Greetings,

Andres Freund




Re: improving user.c error messages

2023-01-27 Thread Alvaro Herrera
On 2023-Jan-27, Tom Lane wrote:

> Good point.  My vote is for standardizing on *not* mentioning it.
> Error messages should say "you need privilege X".  That is not
> the place to go into all the ways you could hold privilege X
> (one of which is being superuser).

+1

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El sabio habla porque tiene algo que decir;
el tonto, porque tiene que decir algo" (Platon).




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 12:58 AM Andres Freund  wrote:
> Essentially the "any fpi" logic is a very coarse grained way of using the page
> LSN as a measurement. As I said, I don't think "has a checkpoint occurred
> since the last write" is a good metric to avoid unnecessary freezing - it's
> too coarse. But I think using the LSN is the right thought. What about
> something like
>
>   lsn_threshold =  insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
>   if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
>  FreezeMe();
>
> I probably got some details wrong, what I am going for with lsn_threshold is
> that we'd freeze an already dirty page if it's not been updated within 10% of
> the LSN distance to the last VACUUM.

I think this might not be quite the right idea for a couple of reasons.

First, suppose that the table is being processed just by autovacuum
(no manual VACUUM operations) and that the rate of WAL generation is
pretty even, so that LSN age is a good proxy for time. If autovacuum
processes the table once per hour, this will freeze if it hasn't been
updated in the last six minutes. That sounds good. But if autovacuum
processes the table once per day, then this will freeze if it hasn't
been updated in 2.4 hours. That might be OK, but it sounds a little on
the long side. If autovacuum processes the table once per week, then
this will freeze if it hasn't been updated in 16.8 hours. That sounds
too conservative. Conversely, if autovacuum processes the table every
3 minutes, then this will freeze the data if it hasn't been updated in
the last 18 seconds, which sounds awfully aggressive. Maybe I'm wrong
here, but I feel like the absolute amount of wall-clock time we're
talking about here probably matters to some degree. I'm not sure
whether a strict time-based threshold like, say, 10 minutes would be a
good idea, leaving aside the difficulties of implementation. It might
be right to think that if the table is being vacuumed a lot, freezing
more aggressively is smart, and if it's being vacuumed infrequently,
freezing less aggressively is smart, because if the table has enough
activity that it's being vacuumed frequently, that might also be a
sign that we need to freeze more aggressively in order to avoid having
things go sideways. However, I'm not completely sure about that, and I
think it's possible that we need some guardrails to avoid going too
far in either direction.

Second, and more seriously, I think this would, in some circumstances,
lead to tremendously unstable behavior. Suppose somebody does a bunch
of work on a table and then they're like "oh, we should clean up,
VACUUM" and it completes quickly because it's been a while since the
last vacuum and so it doesn't freeze much. Then, for whatever reason,
they decide to run it one more time, and it goes bananas and starts
freezing all kinds of stuff because the LSN distance since the last
vacuum is basically zero. Or equally, you run a manual VACUUM, and you
get completely different behavior depending on how long it's been
since the last autovacuum ran.

In some ways, I think this proposal has many of the same problems as
vacuum_freeze_min_age. In both cases, the instinct is that we should
use something on the page to let us know how long it's been since the
page was modified, and proceed on the theory that if the page has not
been modified recently, it probably isn't about to be modified again.
That's a reasonable instinct, but the rate of XID advancement and the
rate of LSN advancement are both highly variable, even on a system
that's always under some load.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Show various offset arrays for heap WAL records

2023-01-27 Thread Melanie Plageman
Hi,

I have taken a stab at doing some of the tasks listed in this email.

I have made the new files rmgr_utils.c/h.

I have come up with a standard format that I like for the output and
used it in all the heap record types.

Examples below:

snapshotConflictHorizon: 2184, nplans: 2, plans [ { xmax: 0, infomask:
2816, infomask2: 2, ntuples: 5, offsets: [ 10, 11, 12, 18, 71 ] }, {
xmax: 0, infomask: 11008, infomask2: 2, ntuples: 2, offsets: [ 72, 73
] } ]

snapshotConflictHorizon: 2199, nredirected: 4, ndead: 0, nunused: 4,
redirected: [ 1->38, 2->39, 3->40, 4->41 ], dead: [], unused: [ 24,
25, 26, 27, 37 ]

I started documenting it in the rmgr_utils.h header file in a comment,
however it may be worth a README?

I haven't polished this description of the format (or added examples,
etc) or used it in the btree-related functions because I assume the
format and helper function API will need more discussion.

This is still a rough draft, as I anticipate changes will be requested.
I would split it into multiple patches, etc. But I am looking for
feedback on the suggested format and the array formatting helper
function API.

Perhaps there should also be example output of the offset arrays in
pgwalinspect docs?

I've changed the array format helper functions that Andres added to be a
single function with an additional layer of indirection so that any
record with an array can use it regardless of type and format of the
individual elements. The signature is based somewhat off of qsort_r()
and allows the user to pass a function with the the desired format of
the elements.

On a semi-unrelated note, I think it might be nice to have a comment in
heapam_xlog.h about what the infobits fields actually are and why they
exist -- e.g. we only need a subset of infomask[2] bits in these
records.
I put a random comment in the code where I think it should go.
I will delete it later, of course.

On Mon, Jan 9, 2023 at 11:00 PM Peter Geoghegan  wrote:
>
> On Mon, Jan 9, 2023 at 1:58 PM Andres Freund  wrote:
> > The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
> > XLOG_HEAP2_FREEZE_PAGE.
>
> I'm bound to end up doing the same in index access methods. Might make
> sense for the utility routines to live somewhere more centralized, at
> least when code reuse is likely. Practically every index AM has WAL
> records that include a sorted page offset number array, just like
> these ones. It's a very standard thing, obviously.

I plan to add these if the format and API I suggested seems like the
right direction.

On Tue, Jan 10, 2023 at 2:35 PM Andres Freund  wrote:
>
> > > I chose to include infomask[2] for the different freeze plans mainly 
> > > because
> > > it looks odd to see different plans without a visible reason. But I'm not 
> > > sure
> > > that's the right choice.
> >
> > I don't think that it is particularly necessary to do so in order for
> > the output to make sense -- pg_waldump is inherently a tool for
> > experts. What it comes down to for me is whether or not this
> > information is sufficiently useful to display, and/or can be (or needs
> > to be) controlled via some kind of verbosity knob.
>
> It seemed useful enough to me, but I likely also stare more at this stuff than
> most. Compared to the list of offsets it's not that much content.
>

Personally, I like having the infomasks for the freeze plans. If we
someday have a more structured input to rmgr_desc, we could then easily
have them in their own column and use functions like
heap_tuple_infomask_flags() on them.

> > How hard would it be to invent a general mechanism to control the verbosity
> > of what we'll show for each WAL record?
>
> Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
> void(*rm_desc) (StringInfo buf, XLogReaderState *record);
>
> so we'd need to patch all of them. That might be worth doing at some point,
> but I don't want to tackle it right now.

In terms of a more structured format, it seems like it would make the
most sense to pass a JSON or composite datatype structure to rm_desc
instead of that StringInfo.

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

On Mon, Jan 16, 2023 at 10:09 PM Peter Geoghegan  wrote:
>
> On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan  wrote:
> > On Wed, Jan 11, 2023 at 3:00 PM Andres Freund  wrote:
> > > What are your thoughts about the place for the helper functions? You're ok
> > > with rmgrdesc_utils.[ch]?
> >
> > Yeah, that seems okay.
>
> BTW, while playing around with this patch today, I noticed that it
> won't display the number of elements in each offset array directly.
> Perhaps it's worth including that, too?

I believe I have addressed this in the attached patch.

- Melanie


v1-0001-Add-rmgr_desc-utilities.patch
Description: Binary data


Re: Add LZ4 compression in pg_dump

2023-01-27 Thread Justin Pryzby
On Thu, Jan 26, 2023 at 12:22:45PM -0600, Justin Pryzby wrote:
> That commit also added this to pg-dump.c:
> 
> +   case PG_COMPRESSION_ZSTD:
> +   pg_fatal("compression with %s is not yet supported", 
> "ZSTD");
> +   break;
> +   case PG_COMPRESSION_LZ4:
> +   pg_fatal("compression with %s is not yet supported", 
> "LZ4");
> +   break;
> 
> In 002, that could be simplified by re-using the supports_compression()
> function.  (And maybe the same in WriteDataToArchive()?)

The first patch aims to minimize references to ".gz" and "GZIP" and
ZLIB.  pg_backup_directory.c comments still refers to ".gz".  I think
the patch should ideally change to refer to "the compressed file
extension" (similar to compress_io.c), avoiding the need to update it
later.

I think the file extension stuff could be generalized, so it doesn't
need to be updated in multiple places (pg_backup_directory.c and
compress_io.c).  Maybe it's useful to add a function to return the
extension of a given compression method.  It could go in compression.c,
and be useful in basebackup.

For the 2nd patch:

I might be in the minority, but I still think some references to "gzip"
should say "zlib":

+} GzipCompressorState;
+
+/* Private routines that support gzip compressed data I/O */
+static void
+DeflateCompressorGzip(ArchiveHandle *AH, CompressorState *cs, bool flush)

In my mind, three things here are misleading, because it doesn't use
gzip headers:

| GzipCompressorState, DeflateCompressorGzip, "gzip compressed".

This comment is about exactly that:

  * underlying stream. The second API is a wrapper around fopen/gzopen and
  * friends, providing an interface similar to those, but abstracts away
  * the possible compression. Both APIs use libz for the compression, but
  * the second API uses gzip headers, so the resulting files can be easily
  * manipulated with the gzip utility.

AIUI, Michael says that it's fine that the user-facing command-line
options use "-Z gzip" (even though the "custom" format doesn't use gzip
headers).  I'm okay with that, as long as that's discussed/understood.

-- 
Justin




Re: Improve GetConfigOptionValues function

2023-01-27 Thread Tom Lane
Nitin Jadhav  writes:
>> Both of you are arguing as though GUC_NO_SHOW_ALL is a security
>> property.  It is not, or at least it's so trivially bypassable
>> that it's useless to consider it one.  All it is is a de-clutter
>> mechanism.

> Understood. If that is the case, then I am ok with the patch.

Pushed v4, then.

regards, tom lane




Optimizing PostgreSQL with LLVM's PGO+LTO

2023-01-27 Thread João Paulo Labegalini de Carvalho
Hi all,

I am investigating the benefits of different profile-guided optimizations
(PGO) and link-time optimizations (LTO) versus binary optimizers (e.g.
BOLT) for applications such as PostgreSQL.

I am facing issues when applying LTO to PostgreSQL as the produced binary
seems broken (the server dies quickly after it has started). This is
definitely a compiler bug, but I was wondering if anyone here  have
experimented with LTO for PostgreSQL.

Thanks,

-- 
João Paulo L. de Carvalho
Ph.D Computer Science |  IC-UNICAMP | Campinas , SP - Brazil
Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada
joao.carva...@ic.unicamp.br
joao.carva...@ualberta.ca


Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-27 Thread Masahiko Sawada
On Wed, Jan 25, 2023 at 3:27 PM Amit Kapila  wrote:
>
> On Wed, Jan 25, 2023 at 10:05 AM Amit Kapila  wrote:
> >
> > On Wed, Jan 25, 2023 at 3:15 AM Peter Smith  wrote:
> > >
> > > 1.
> > > @@ -210,7 +210,7 @@ int logical_decoding_work_mem;
> > >  static const Size max_changes_in_memory = 4096; /* XXX for restore only 
> > > */
> > >
> > >  /* GUC variable */
> > > -int logical_decoding_mode = LOGICAL_DECODING_MODE_BUFFERED;
> > > +int logical_replication_mode = LOGICAL_REP_MODE_BUFFERED;
> > >
> > >
> > > I noticed that the comment /* GUC variable */ is currently only above
> > > the logical_replication_mode, but actually logical_decoding_work_mem
> > > is a GUC variable too. Maybe this should be rearranged somehow then
> > > change the comment "GUC variable" -> "GUC variables"?
> > >
> >
> > I think moving these variables together doesn't sound like a good idea
> > because logical_decoding_work_mem variable is defined with other
> > related variable. Also, if we are doing the last comment, I think that
> > will obviate the need for this.
> >
> > > ==
> > > src/backend/utils/misc/guc_tables.c
> > >
> > > @@ -4908,13 +4908,13 @@ struct config_enum ConfigureNamesEnum[] =
> > >   },
> > >
> > >   {
> > > - {"logical_decoding_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> > > + {"logical_replication_mode", PGC_USERSET, DEVELOPER_OPTIONS,
> > >   gettext_noop("Allows streaming or serializing each change in logical
> > > decoding."),
> > >   NULL,
> > >   GUC_NOT_IN_SAMPLE
> > >   },
> > > - _decoding_mode,
> > > - LOGICAL_DECODING_MODE_BUFFERED, logical_decoding_mode_options,
> > > + _replication_mode,
> > > + LOGICAL_REP_MODE_BUFFERED, logical_replication_mode_options,
> > >   NULL, NULL, NULL
> > >   },
> > >
> > > That gettext_noop string seems incorrect. I think Kuroda-san
> > > previously reported the same, but then you replied it has been fixed
> > > already [1]
> > >
> > > > I felt the description seems not to be suitable for current behavior.
> > > > A short description should be like "Sets a behavior of logical 
> > > > replication", and
> > > > further descriptions can be added in lond description.
> > > I adjusted the description here.
> > >
> > > But this doesn't look fixed to me. (??)
> > >
> >
> > Okay, so, how about the following for the 0001 patch:
> > short desc: Controls when to replicate each change.
> > long desc: On the publisher, it allows streaming or serializing each
> > change in logical decoding.
> >
>
> I have updated the patch accordingly and it looks good to me. I'll
> push this first patch early next week (Monday) unless there are more
> comments.

The patch looks good to me too. Thank you for the patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Tom Lane
"David G. Johnston"  writes:
> On Fri, Jan 27, 2023 at 8:53 AM Tom Lane  wrote:
>> Anyway, it seems like I gotta work harder.  I'll produce a
>> new patch.

> How about just adding a "section" to the end of the file as needed:

> # AdHoc Settings Specified During InitDB
> max_connections=75
> ...

Nah, I think that would be impossibly confusing.  One way or another
the live setting has to be near where the GUC is documented.

We will have to do add-at-the-end for custom GUCs, of course,
but in that case there's no matching comment to confuse you.

regards, tom lane




Re: New strategies for freezing, advancing relfrozenxid early

2023-01-27 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 6:48 AM Robert Haas  wrote:
> > One of the key strengths of systems like Postgres is the ability to
> > inexpensively store a relatively large amount of data that has just
> > about zero chance of being read, let alone modified. While at the same
> > time having decent OLTP performance for the hot data. Not nearly as
> > good as an in-memory system, mind you -- and yet in-memory systems
> > remain largely a niche thing.
>
> I think it's interesting that TPC-C suffers from the kind of problem
> that your patch was intended to address. I hadn't considered that. But
> I do not think it detracts from the basic point I was making, which is
> that you need to think about the downsides of your patch, not just the
> upsides.
>
> If you want to argue that there is *no* OLTP workload that will be
> harmed by freezing as aggressively as possible, then that would be a
> good argument in favor of your patch, because it would be arguing that
> the downside simply doesn't exist, at least for OLTP workloads. The
> fact that you can think of *one particular* OLTP workload that can
> benefit from the patch is just doubling down on the "my patch has an
> upside" argument, which literally no one is disputing.

You've treated me to another multi paragraph talking down, as if I was
still clinging to my original position, which is of course not the
case. I've literally said I'm done with VACUUM for good, and that I
just want to put a line under this. Yet you still persist in doing
this sort of thing. I'm not fighting you, I'm not fighting Andres.

I was making a point about the need to do something in this area in
general. That's all.

-- 
Peter Geoghegan




Re: Timeline ID hexadecimal format

2023-01-27 Thread Sébastien Lardière

On 27/01/2023 15:55, Peter Eisentraut wrote:

On 27.01.23 14:52, Sébastien Lardière wrote:
The attached patch proposes to change the format of timelineid from 
%u to %X.


I think your complaint has merit.  But note that if we did a change 
like this, then log files or reports from different versions would 
have different meaning without a visual difference, which is kind of 
what you complained about in the first place.  At least we should do 
something like 0x%X.


Indeed, but the messages that puzzled was in one log file, just 
together, not in some differents versions.


But yes, it should be documented somewhere, actually, I can't find any 
good place for that,


While digging, It seems that recovery_target_timeline should be given in 
decimal, not in hexadecimal, which seems odd to me ; and pg_controldata 
use decimal too, not hexadecimal…


So, if this idea is correct, the given patch is not enough.

Anyway, do you think it is a good idea or not ?




Regarding .po files, I don't know how to manage them. Is there any 
routine to spread the modifications? Or should I identify and change 
each message?


Don't worry about this.  This is handled elsewhere.



nice,


regards,


--
Sébastien





Re: improving user.c error messages

2023-01-27 Thread Tom Lane
Robert Haas  writes:
> I almost hate to bring this up since I'm not sure how far we want to
> go down this rat hole, but what should be our policy about mentioning
> superuser? I don't think we're entirely consistent right now, and I'm
> not sure whether every error message needs to mention that if you were
> the superuser you could do everything. Is that something we should
> mention always, never, or in some set of circumstances?

Good point.  My vote is for standardizing on *not* mentioning it.
Error messages should say "you need privilege X".  That is not
the place to go into all the ways you could hold privilege X
(one of which is being superuser).

regards, tom lane




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread David G. Johnston
On Fri, Jan 27, 2023 at 8:53 AM Tom Lane  wrote:

> Robert Haas  writes:
> > The idea is that instead of:
>
> > replace_token(conflines, "#max_connections = 100", repltok);
>
> > You'd write something like:
>
> > replace_guc_value(conflines, "max_connections", repltok);
>
> > Which would look for a line matching /^#max_connections\s+=\s/, and
> > then identify everything following that point up to the first #. It
> > would replace all that stuff with repltok, but if the replacement is
> > shorter than the original, it would pad with spaces to get back to the
> > original length. And otherwise it would add a single space, so that if
> > you set a super long GUC value there's still at least one space
> > between the end of the value and the comment that follows.
>
> Well, yeah, I was trying to avoid writing that ;-).  There's even
> one more wrinkle: we might already have removed the initial '#',
> if one does say "-c max_connections=N", because this logic won't
> know whether the -c switch matches one of initdb's predetermined
> substitutions.
>
> > There might be some quoting-related problems with this idea, not sure.
>
> '#' in a value might confuse it, but we could probably take the last '#'
> not the first.
>
> Anyway, it seems like I gotta work harder.  I'll produce a
> new patch.
>
>
How about just adding a "section" to the end of the file as needed:

# AdHoc Settings Specified During InitDB
max_connections=75
...

David J.


Re: improving user.c error messages

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 10:52 AM Nathan Bossart
 wrote:
> IMHO superuser should typically only be mentioned when it is the only way
> to do something.  Since superusers have all privileges, I think logs like
> "superuser or privileges of X" are kind of redundant.  If Robert has
> privileges of X, we wouldn't say "privileges of X or Robert."  We'd just
> point to X.  Ultimately, I feel like mentioning superuser in error messages
> usually just makes the message longer without adding any useful
> information.

That's kind of my opinion too, but I'm not sure whether there are
cases where it will lead to confusion.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Tom Lane
Robert Haas  writes:
> The idea is that instead of:

> replace_token(conflines, "#max_connections = 100", repltok);

> You'd write something like:

> replace_guc_value(conflines, "max_connections", repltok);

> Which would look for a line matching /^#max_connections\s+=\s/, and
> then identify everything following that point up to the first #. It
> would replace all that stuff with repltok, but if the replacement is
> shorter than the original, it would pad with spaces to get back to the
> original length. And otherwise it would add a single space, so that if
> you set a super long GUC value there's still at least one space
> between the end of the value and the comment that follows.

Well, yeah, I was trying to avoid writing that ;-).  There's even
one more wrinkle: we might already have removed the initial '#',
if one does say "-c max_connections=N", because this logic won't
know whether the -c switch matches one of initdb's predetermined
substitutions.

> There might be some quoting-related problems with this idea, not sure.

'#' in a value might confuse it, but we could probably take the last '#'
not the first.

Anyway, it seems like I gotta work harder.  I'll produce a
new patch.

regards, tom lane




Re: improving user.c error messages

2023-01-27 Thread Nathan Bossart
On Fri, Jan 27, 2023 at 08:31:32AM -0500, Robert Haas wrote:
> I almost hate to bring this up since I'm not sure how far we want to
> go down this rat hole, but what should be our policy about mentioning
> superuser? I don't think we're entirely consistent right now, and I'm
> not sure whether every error message needs to mention that if you were
> the superuser you could do everything. Is that something we should
> mention always, never, or in some set of circumstances?

IMHO superuser should typically only be mentioned when it is the only way
to do something.  Since superusers have all privileges, I think logs like
"superuser or privileges of X" are kind of redundant.  If Robert has
privileges of X, we wouldn't say "privileges of X or Robert."  We'd just
point to X.  Ultimately, I feel like mentioning superuser in error messages
usually just makes the message longer without adding any useful
information.

I recognize that this is a bold opinion and that the policy to mention
superuser might need to be more nuanced in practice...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 10:34 AM Tom Lane  wrote:
> One idea if we want to make it work like that could be to stop
> trying to edit out the default value, and instead make the file
> contents look like, say,
>
> #huge_pages = try   # on, off, or try
> huge_pages = off# set by initdb

How about just making replace_token() a little smarter, and maybe renaming it?

The idea is that instead of:

replace_token(conflines, "#max_connections = 100", repltok);

You'd write something like:

replace_guc_value(conflines, "max_connections", repltok);

Which would look for a line matching /^#max_connections\s+=\s/, and
then identify everything following that point up to the first #. It
would replace all that stuff with repltok, but if the replacement is
shorter than the original, it would pad with spaces to get back to the
original length. And otherwise it would add a single space, so that if
you set a super long GUC value there's still at least one space
between the end of the value and the comment that follows.

There might be some quoting-related problems with this idea, not sure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Named Operators

2023-01-27 Thread Matthias van de Meent
On Fri, 27 Jan 2023 at 16:26, Peter Eisentraut
 wrote:
>
> On 12.01.23 14:55, Matthias van de Meent wrote:
> >> Matter of taste, I guess. But more importantly, defining an operator
> >> gives you many additional features that the planner can use to
> >> optimize your query differently, which it can't do with functions. See
> >> the COMMUTATOR, HASHES, etc. clause in the CREATE OPERATOR command.
> > I see. Wouldn't it be better then to instead make it possible for the
> > planner to detect the use of the functions used in operators and treat
> > them as aliases of the operator? Or am I missing something w.r.t.
> > differences between operator and function invocation?
> >
> > E.g. indexes on `int8pl(my_bigint, 1)` does not match queries for
> > `my_bigint + 1` (and vice versa), while they should be able to support
> > that, as OPERATOR(pg_catalog.+(int8, int8)) 's function is int8pl.
>
> I have been thinking about something like this for a long time.
> Basically, we would merge pg_proc and pg_operator internally.  Then, all
> the special treatment for operators would also be available to
> two-argument functions.

And single-argument functions in case of prefix operators, right?




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 25.01.23 22:25, Tom Lane wrote:
>> The specified settings
>> are applied on the command line of the initial probe calls
>> (which happen before we've made any config files), and then they
>> are added to postgresql.auto.conf, which causes them to take
>> effect for the bootstrap backend runs as well as subsequent
>> postmaster starts.

> I would have expected them to be edited into postgresql.conf.  What are 
> the arguments for one or the other?

TBH, the driving reason was that the string-munging code we have in
initdb isn't up to snuff for that: it wants to substitute for an
exactly-known string, which we won't have in the case of an
arbitrary GUC.

One idea if we want to make it work like that could be to stop
trying to edit out the default value, and instead make the file
contents look like, say,

#huge_pages = try   # on, off, or try
huge_pages = off# set by initdb

Then we just need to be able to find the GUC's entry.

> Btw., something that I have had in my notes for a while, but with this 
> it would now be officially exposed:  Not all options can be safely set 
> during bootstrap.  For example,
>  initdb -D data -c track_commit_timestamp=on
> will fail an assertion.  This might be an exception, or there might be 
> others.

Interesting.  We'd probably want to sprinkle some more
do-nothing-in-bootstrap-mode tests as we discover that sort of thing.

regards, tom lane




Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Robert Haas
On Wed, Jan 25, 2023 at 4:26 PM Tom Lane  wrote:
> So this invents an initdb switch "-c NAME=VALUE" just like the
> one that the server itself has long had.

HUGE +1 from me. This will, I think, be extremely convenient in many situations.

> The specified settings
> are applied on the command line of the initial probe calls
> (which happen before we've made any config files), and then they
> are added to postgresql.auto.conf, which causes them to take
> effect for the bootstrap backend runs as well as subsequent
> postmaster starts.

I agree with others that it would seem more natural to edit them in
postgresql.conf itself, but I also think it doesn't matter nearly as
much as getting the feature in some form.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Named Operators

2023-01-27 Thread Peter Eisentraut

On 12.01.23 14:55, Matthias van de Meent wrote:

Matter of taste, I guess. But more importantly, defining an operator
gives you many additional features that the planner can use to
optimize your query differently, which it can't do with functions. See
the COMMUTATOR, HASHES, etc. clause in the CREATE OPERATOR command.

I see. Wouldn't it be better then to instead make it possible for the
planner to detect the use of the functions used in operators and treat
them as aliases of the operator? Or am I missing something w.r.t.
differences between operator and function invocation?

E.g. indexes on `int8pl(my_bigint, 1)` does not match queries for
`my_bigint + 1` (and vice versa), while they should be able to support
that, as OPERATOR(pg_catalog.+(int8, int8)) 's function is int8pl.


I have been thinking about something like this for a long time. 
Basically, we would merge pg_proc and pg_operator internally.  Then, all 
the special treatment for operators would also be available to 
two-argument functions.






Re: Set arbitrary GUC options during initdb

2023-01-27 Thread Isaac Morland
On Fri, 27 Jan 2023 at 09:49, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 25.01.23 22:25, Tom Lane wrote:
> > So this invents an initdb switch "-c NAME=VALUE" just like the
> > one that the server itself has long had.
>
> This seems useful.
>
> > The specified settings
> > are applied on the command line of the initial probe calls
> > (which happen before we've made any config files), and then they
> > are added to postgresql.auto.conf, which causes them to take
> > effect for the bootstrap backend runs as well as subsequent
> > postmaster starts.
>
> I would have expected them to be edited into postgresql.conf.  What are
> the arguments for one or the other?
>

That would be my expectation also. I believe that is how it works now for
options which can be set by initdb, such as locale and port. I view
postgresql.auto.conf being for temporary changes, or changes related to
different instances within a replication setup, or whatever other uses
people come up with - but not for the permanent configuration established
by initdb.

In particular, I would be surprised if removing a postgresql.auto.conf
completely disabled an instance. Obviously, in my replication setup
example, the replication would be broken, but the basic operation of the
instance would still be possible.

Also, if somebody wants to put a change in postgresql.auto.conf, they can
easily do it using ALTER SYSTEM once the instance is running, or by just
writing out their own postgresql.auto.conf before starting it. Putting a
change in postgresql.conf programmatically is a bit of a pain.


Re: Lazy JIT IR code generation to increase JIT speed with partitions

2023-01-27 Thread Dmitry Dolgov
> On Fri, Jan 27, 2023 at 10:02:32AM +0100, David Geier wrote:
> It's very curious as to why we didn't really see that when dumping the
> bitcode. It seems like the bitcode is always different enough to not spot
> that.

As I've noted off-list, there was noticeable difference in the dumped
bitcode, which I haven't noticed since we were talking mostly about
differences between executions of the same query.




Re: old_snapshot_threshold bottleneck on replica

2023-01-27 Thread Robert Haas
On Fri, Jan 27, 2023 at 9:30 AM Maxim Orlov  wrote:
> I thank you for your advices. I've dived deeper into the problem and I think 
> v2 patch is wrong.

Cool!

> Accessing threshold_timestamp and threshold_xid in 
> TransactionIdLimitedForOldSnapshots
> without lock would lead to an improper xlimit calculation.

That would be a bummer.

> So, my choice would be (3b). My goal is to optimize access to the 
> threshold_timestamp to avoid
> multiple spinlock acquisition on read. In the same time, simultaneous access 
> to these variable
> (threshold_timestamp and threshold_xid) should be protected with spinlock.
>
> I remove atomic for threshold_xid and add comments on mutex_threshold. PFA, 
> v3. I

Interesting, but it's still not entirely clear to me from reading the
comments why we should think that this is safe.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




  1   2   >