Re: Reducing power consumption on idle servers

2022-11-16 Thread Simon Riggs
On Thu, 17 Nov 2022 at 07:36, Bharath Rupireddy
 wrote:

> > promote_trigger_file is not tested and there are better ways, so
> > deprecating it in this release is fine.
>
> Hm, but..
>
> > Anyone that relies on it can update their mechanisms to a supported
> > one with a one-line change. Realistically, anyone using it won't be on
> > the latest release anyway, at least for a long time, since if they use
> > manual methods then they are well behind the times.
>
> I may be overly pessimistic here - the change from 5 sec to 60 sec for
> detecting promote_trigger_file will have a direct impact on failovers
> I believe.

No, it will have a direct effect only on people using promote_trigger_file
who do not read and act upon the deprecation notice before upgrading
by making a one line change to their failover scripts.

Since pretty much everyone doing HA uses external HA software (cloud
or otherwise) this shouldn't affect anyone.
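
For illustration, the kind of one-line change meant here could be replacing
the "touch the trigger file" step of a failover script with an explicit
promotion request (a sketch; the parameters shown are just pg_promote()'s
documented defaults, and "pg_ctl promote" works equally well):

SELECT pg_promote(wait => true, wait_seconds => 60);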

-- 
Simon Riggs                http://www.EnterpriseDB.com/




RE: Data is copied twice when specifying both child and parent table in publication

2022-11-16 Thread wangw.f...@fujitsu.com
On Thurs, Nov 17, 2022 at 13:58 PM vignesh C  wrote:
> On Wed, 16 Nov 2022 at 14:28, wangw.f...@fujitsu.com
>  wrote:
> >
> > On Mon, Nov 14, 2022 at 0:56 AM vignesh C  wrote:
> > > >
> > > > Attach new patches.
> > >
> >
> > Thanks for your comments.
> >
> > > Here we are having tables list to store the relids and table_infos
> > > list which stores pubid along with relid. Here tables list acts as a
> > > temporary list to get filter_partitions and then delete the
> > > published_rel from table_infos. Will it be possible to directly
> > > operate on table_infos list and remove the temporary tables list used.
> > > We might have to implement comparator, deduplication functions and
> > > change filter_partitions function to work directly on published_rel
> > > type list.
> > > +   /*
> > > +* Record the published table and the
> > > corresponding publication so
> > > +* that we can get row filters and column list 
> > > later.
> > > +*
> > > +* When a table is published by multiple
> > > publications, to obtain
> > > +* all row filters and column list, the
> > > structure related to this
> > > +* table will be recorded multiple times.
> > > +*/
> > > +   foreach(lc, pub_elem_tables)
> > > +   {
> > > +   published_rel *table_info =
> > > (published_rel *) malloc(sizeof(published_rel));
> > > +
> > > +   table_info->relid = lfirst_oid(lc);
> > > +   table_info->pubid = pub_elem->oid;
> > > +   table_infos = lappend(table_infos, 
> > > table_info);
> > > +   }
> > > +
> > > +   tables = list_concat(tables, pub_elem_tables);
> > >
> > > Thoughts?
> >
> > I think we could only deduplicate published tables per publication to get 
> > all
> > row filters and column lists for each published table later.
> > I removed the temporary list 'tables' and modified the API of the function
> > filter_partitions to handle published_rel type list.
> >
> > Attach the new patch set.
> 
> Thanks for the update patch.

Thanks for your comment.

> One suggestion:
> +/* Records association between publication and published table */
> +typedef struct
> +{
> +   Oid relid;  /* OID of published table */
> +   Oid pubid;  /* OID of publication
> that publishes this
> +* table. */
> +} published_rel;
> +
> 
> +   /*
> +* Record the published table and the
> corresponding publication so
> +* that we can get row filters and column list later.
> +*
> +* When a table is published by multiple
> publications, to obtain
> +* all row filters and column list, the
> structure related to this
> +* table will be recorded multiple times.
> +*/
> +   foreach(lc, pub_elem_tables)
> +   {
> +   published_rel *table_info =
> (published_rel *) malloc(sizeof(published_rel));
> +
> +   table_info->relid = lfirst_oid(lc);
> +   table_info->pubid = pub_elem->oid;
> +   table_infos = lappend(table_infos, 
> table_info);
> +   }
> 
> In this format if there are n relations in publication we will store
> pubid n times, in all tables publication there will many thousands of
> tables. We could avoid storing the pubid for every relid, instead we
> could  represent it like below to avoid storing publication id for
> each tables:
> 
> +/* Records association between publication and published tables */
> +typedef struct
> +{
> +   List*relids,  /* OIDs of the publisher tables */
> +   Oid pubid;   /* OID of publication that publishes this
> + * tables. */
> +}published_rel;
> 
> Thoughts?

I think this complicates the function filter_partitions, because with such a
node type we would need to concatenate the 'relids' lists of every node in
the 'table_infos' list into a temporary list inside filter_partitions, then
filter that temporary list and process the 'table_infos' list according to
the filtering result.
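
For reference, the scenario in the subject line can be reproduced roughly
like this on the publisher (a sketch; the table and publication names are
made up):

CREATE TABLE parent (a int) PARTITION BY RANGE (a);
CREATE TABLE child PARTITION OF parent FOR VALUES FROM (0) TO (10);
INSERT INTO parent VALUES (1);
CREATE PUBLICATION pub FOR TABLE parent, child
    WITH (publish_via_partition_root = true);

A subscription that includes "pub" may then copy the same row twice during
initial table synchronization - once via "parent" and once via "child" -
which is the behaviour this patch set aims to fix.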

Regards,
Wang wei


Re: Reducing power consumption on idle servers

2022-11-16 Thread Bharath Rupireddy
On Wed, Nov 16, 2022 at 8:35 PM Simon Riggs
 wrote:
>
> On Wed, 16 Nov 2022 at 12:47, Bharath Rupireddy
>  wrote:
> >
> > On Wed, Nov 16, 2022 at 2:34 PM Simon Riggs
> >  wrote:
> > >
> > > Reposting v6 now so that patch tester doesn't think this has failed
> > > when the patch on other thread gets applied.
> >
> > Intention of the patch, that is, to get rid of promote_trigger_file
> > GUC sometime in future, looks good to me. However, the timeout change
> > to 60 sec from 5 sec seems far-reaching. While it discourages the use
> > of the GUC, it can impact many existing production servers that still
> > rely on promote_trigger_file as it can increase the failover times
> > impacting SLAs around failover.
>
> The purpose of 60s is to allow for power reduction, so 5s won't do.

I understand. I know it's a bit hard to measure the power savings; I'm
wondering if there's any information, not necessarily related to
postgres, about how much power is saved in general when a certain number
of waits/polls/system calls is eliminated.

> promote_trigger_file is not tested and there are better ways, so
> deprecating it in this release is fine.

Hm, but..

> Anyone that relies on it can update their mechanisms to a supported
> one with a one-line change. Realistically, anyone using it won't be on
> the latest release anyway, at least for a long time, since if they use
> manual methods then they are well behind the times.

I may be overly pessimistic here - I believe the change from 5 sec to 60 sec
for detecting promote_trigger_file will have a direct impact on failovers.
After upgrading to the version with the 60 sec wait, failover times will
increase if developers fail to notice the deprecation notice for
promote_trigger_file. I'm not sure what the best way to discourage use of
promote_trigger_file is; maybe the change from 5 sec to 60 sec is the way.
I'm not really sure.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Assertion failure with barriers in parallel hash join

2022-11-16 Thread Thomas Munro
On Thu, Nov 17, 2022 at 8:01 PM David Geier  wrote:
> Can we make progress with this patch in the current commit fest, or discuss 
> what is still missing to bring this in?

Hi David,
Sorry for the delay.  I'll aim to get this done in the next few days.




Re: ubsan fails on 32bit builds

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 17:42:30 -0800, Andres Freund wrote:
> Afaict the problem is that
>   proc = (PGPROC *) &(waitQueue->links);
> 
> is a gross gross hack - this isn't actually a PGPROC, it's pointing to an
> SHM_QUEUE, but *not* one embedded in PGPROC.  It kinda works because ->links
> is at offset 0 in PGPROC, which means that
>   SHMQueueInsertBefore(&(proc->links), &(MyProc->links));
> will turn &proc->links back into waitQueue->links. Which we then can enqueue
> again.
> 
> I don't see the point of this hack, even leaving ubsan's valid complaints
> aside. Why bother having this, sometimes, fake PGPROC pointer when we could
> just use a SHM_QUEUE* to determine the insertion point?

As done in the attached patch. With this ubsan passes both on 32bit and 64bit.

Greetings,

Andres Freund
From 776f002952f92c764e98dfe8c180a00c1f60ab09 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 16 Nov 2022 20:00:59 -0800
Subject: [PATCH 1/3] Fix mislabeling of PROC_QUEUE->links as PGPROC, fixing
 ubsan on 32bit

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20221117014230.op5kmgypdv2dt...@awork3.anarazel.de
Backpatch:
---
 src/backend/storage/lmgr/proc.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 13fa07b0ff7..214866502ed 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1050,13 +1050,13 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	uint32		hashcode = locallock->hashcode;
 	LWLock	   *partitionLock = LockHashPartitionLock(hashcode);
 	PROC_QUEUE *waitQueue = &(lock->waitProcs);
+	SHM_QUEUE  *waitQueuePos;
 	LOCKMASK	myHeldLocks = MyProc->heldLocks;
 	TimestampTz standbyWaitStart = 0;
 	bool		early_deadlock = false;
 	bool		allow_autovacuum_cancel = true;
 	bool		logged_recovery_conflict = false;
 	ProcWaitStatus myWaitStatus;
-	PGPROC	   *proc;
 	PGPROC	   *leader = MyProc->lockGroupLeader;
 	int			i;
 
@@ -1104,13 +1104,16 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	 * we are only considering the part of the wait queue before my insertion
 	 * point.
 	 */
-	if (myHeldLocks != 0)
+	if (myHeldLocks != 0 && waitQueue->size > 0)
 	{
 		LOCKMASK	aheadRequests = 0;
+		SHM_QUEUE  *proc_node;
 
-		proc = (PGPROC *) waitQueue->links.next;
+		proc_node = waitQueue->links.next;
 		for (i = 0; i < waitQueue->size; i++)
 		{
+			PGPROC	   *proc = (PGPROC *) proc_node;
+
 			/*
 			 * If we're part of the same locking group as this waiter, its
 			 * locks neither conflict with ours nor contribute to
@@ -1118,7 +1121,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 */
 			if (leader != NULL && leader == proc->lockGroupLeader)
 			{
-proc = (PGPROC *) proc->links.next;
+proc_node = proc->links.next;
 continue;
 			}
 			/* Must he wait for me? */
@@ -1157,20 +1160,21 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		}
 
 		/*
-		 * If we fall out of loop normally, proc points to waitQueue head, so
-		 * we will insert at tail of queue as desired.
+		 * If we iterated through the whole queue, cur points to the waitQueue
+		 * head, so we will insert at tail of queue as desired.
 		 */
+		waitQueuePos = proc_node;
 	}
 	else
 	{
 		/* I hold no locks, so I can't push in front of anyone. */
-		proc = (PGPROC *) &(waitQueue->links);
+		waitQueuePos = &waitQueue->links;
 	}
 
 	/*
-	 * Insert self into queue, ahead of the given proc (or at tail of queue).
+	 * Insert self into queue, at the position determined above.
 	 */
-	SHMQueueInsertBefore(&(proc->links), &(MyProc->links));
+	SHMQueueInsertBefore(waitQueuePos, &MyProc->links);
 	waitQueue->size++;
 
 	lock->waitMask |= LOCKBIT_ON(lockmode);
-- 
2.38.0



Odd behavior with pg_stat_statements and queries called from SQL functions

2022-11-16 Thread Maciek Sakrejda
I noticed an odd behavior today in pg_stat_statements query
normalization for queries called from SQL-language functions. If I
have three functions that call an essentially identical query (the
functions are only marked SECURITY DEFINER to prevent inlining):

maciek=# create or replace function f1(f1param text) returns text
language sql as 'select f1param' security definer;
CREATE FUNCTION
maciek=# create or replace function f2(f2param text) returns text
language sql as 'select f2param' security definer;
CREATE FUNCTION
maciek=# create or replace function f3(text) returns text language sql
as 'select $1' security definer;
CREATE FUNCTION

and I have pg_stat_statements.track = 'all', so that queries called
from functions are tracked, these all end up with the same query id in
pg_stat_statements, but the query text includes the parameter name (if
one is referenced in the query in the function). E.g., if I call f1
first, then f2 and f3, I get:

maciek=# select queryid, query, calls from pg_stat_statements where
queryid = 6741491046520556186;
       queryid       |     query      | calls
---------------------+----------------+-------
 6741491046520556186 | select f1param |     3
(1 row)

If I call f3 first, then f2 and f1, I get

maciek=# select queryid, query, calls from pg_stat_statements where
queryid = 6741491046520556186;
       queryid       |   query   | calls
---------------------+-----------+-------
 6741491046520556186 | select $1 |     3
(1 row)

I understand that the query text may be captured differently for
different queries that map to the same id, but it seems confusing that
parameter names referenced in queries called from functions are not
normalized away, since they're not germane to the query execution
itself, and the context of the function is otherwise stripped away by
this point. I would expect that all three of these queries end up in
pg_stat_statements with the query text "select $1". Thoughts?

Thanks,
Maciek




Re: when the startup process doesn't (logging startup delays)

2022-11-16 Thread Bharath Rupireddy
On Thu, Nov 17, 2022 at 12:21 AM Robert Haas  wrote:
>
> On Wed, Nov 16, 2022 at 1:47 AM Bharath Rupireddy
>  wrote:
> > That can be done, only if we can disable the timeout in another place
> > when the StandbyMode is set to true in ReadRecord(), that is, after
> > the standby server finishes crash recovery and enters standby mode.
>
> Oh, interesting. I didn't realize that we would need to worry about that case.
>
> > I'm attaching the v3 patch for further review. Please find the CF
> > entry here - https://commitfest.postgresql.org/41/4012/.
>
> I kind of dislike having to have logic for this in two places. Seems
> like it could create future bugs.

Duplication is a problem, I agree, and I have an idea here -
how about introducing a new function, say EnableStandbyMode(), that
sets StandbyMode to true and disables the startup progress timeout,
something like the attached?

> How about the attached approach, instead? This way, the first time the
> timer expires after we reach standby mode, we reactively disable it.

Hm. I'm not really sure it's a good idea. While it simplifies the
code, has_startup_progress_timeout_expired() gets called for every
WAL record in standby mode. Isn't that unnecessary?
Currently, the if (!StandbyMode) condition blocks those function calls.
And I'm also a little concerned that we'd move the StandbyMode variable
to startup.c, which has so far been tied to xlogrecovery.c. Maybe these
are not really concerns at all; maybe others are okay with this approach.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b25838215534488c53855a5cd5009fba09bd358a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 17 Nov 2022 07:13:27 +
Subject: [PATCH v4] Disable STARTUP_PROGRESS_TIMEOUT in standby mode

In standby mode, we don't report progress of recovery, to avoid
bloating the server logs, as the standby is always in recovery unless
promoted. However, startup_progress_timeout_handler() still gets called
every log_startup_progress_interval seconds, which is unnecessary.

Therefore, we disable the timeout in standby mode.

Reported-by: Thomas Munro
Authors: Bharath Rupireddy, Simon Riggs
Reviewed-by: Robert Haas
Backpatch-through: 15
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c | 21 +---
 src/backend/postmaster/startup.c  | 29 ---
 src/include/postmaster/startup.h  |  2 ++
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..2feae1ebd3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -386,6 +386,7 @@ static bool recoveryStopAfter;
 /* prototypes for local functions */
 static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *replayTLI);
 
+static void EnableStandbyMode(void);
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
 static bool read_backup_label(XLogRecPtr *checkPointLoc,
@@ -470,6 +471,20 @@ XLogRecoveryShmemInit(void)
 	ConditionVariableInit(&XLogRecoveryCtl->recoveryNotPausedCV);
 }
 
+static void
+EnableStandbyMode(void)
+{
+	StandbyMode = true;
+
+	/*
+	 * To avoid server log bloat, we don't report recovery progress in a
+	 * standby as it will always be in recovery unless promoted. We disable
+	 * startup progress timeout in standby mode to avoid calling
+	 * startup_progress_timeout_handler() unnecessarily.
+	 */
+	disable_startup_progress_timeout();
+}
+
 /*
  * Prepare the system for WAL recovery, if needed.
  *
@@ -603,7 +618,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 */
 		InArchiveRecovery = true;
 		if (StandbyModeRequested)
-			StandbyMode = true;
+			EnableStandbyMode();
 
 		/*
 		 * When a backup_label file is present, we want to roll forward from
@@ -740,7 +755,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			InArchiveRecovery = true;
 			if (StandbyModeRequested)
-StandbyMode = true;
+EnableStandbyMode();
 		}
 
 		/* Get the last valid checkpoint record. */
@@ -3115,7 +3130,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
 InArchiveRecovery = true;
 if (StandbyModeRequested)
-	StandbyMode = true;
+	EnableStandbyMode();
 
 SwitchIntoArchiveRecovery(xlogreader->EndRecPtr, replayTLI);
 minRecoveryPoint = xlogreader->EndRecPtr;
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index f99186eab7..2705e425e8 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -314,11 +314,22 @@ startup_progress_timeout_handler(void)
 	

Re: Parallel Full Hash Join

2022-11-16 Thread Thomas Munro
On Thu, Nov 17, 2022 at 5:22 PM Ian Lawrence Barwick  wrote:
> This patch is marked as "Waiting for Committer" in the current commitfest [1]
> with yourself as committer; do you have any plans to move ahead with this?

Yeah, sorry for lack of progress.  Aiming to get this in shortly.




Typo for xl_running_xacts

2022-11-16 Thread Japin Li

Hi, hackers

I found some typos relating to xl_running_xacts in the comments.
Attached a patch to fix those.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5006a5c464..44d94bf656 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -93,7 +93,7 @@
  * Initially the machinery is in the START stage. When an xl_running_xacts
  * record is read that is sufficiently new (above the safe xmin horizon),
  * there's a state transition. If there were no running xacts when the
- * running_xacts record was generated, we'll directly go into CONSISTENT
+ * xl_running_xacts record was generated, we'll directly go into CONSISTENT
  * state, otherwise we'll switch to the BUILDING_SNAPSHOT state. Having a full
  * snapshot means that all transactions that start henceforth can be decoded
  * in their entirety, but transactions that started previously can't. In
@@ -1331,7 +1331,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 */
 
 	/*
-	 * xl_running_xact record is older than what we can use, we might not have
+	 * xl_running_xacts record is older than what we can use, we might not have
 	 * all necessary catalog rows anymore.
 	 */
 	if (TransactionIdIsNormal(builder->initial_xmin_horizon) &&
@@ -1399,7 +1399,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * encountered.  In that case, switch to BUILDING_SNAPSHOT state, and
 	 * record xl_running_xacts->nextXid.  Once all running xacts have finished
 	 * (i.e. they're all >= nextXid), we have a complete catalog snapshot.  It
-	 * might look that we could use xl_running_xact's ->xids information to
+	 * might look that we could use xl_running_xacts->xids information to
 	 * get there quicker, but that is problematic because transactions marked
 	 * as running, might already have inserted their commit record - it's
 	 * infeasible to change that with locking.


Re: Assertion failure with barriers in parallel hash join

2022-11-16 Thread David Geier

Hi Thomas,

Can we make progress with this patch in the current commit fest, or 
discuss what is still missing to bring this in?


Thanks!

--
David Geier
(ServiceNow)

On 6/6/22 17:01, David Geier wrote:

Hi Thomas,

Correct. We're running with disabled parallel leader participation and 
we have to do so, because another custom plan node we built relies on 
that.

That would be great. Anything else I can help with to get this patch in?

Thanks!

--
David
(ServiceNow)

On Fri, 3 Jun 2022 at 00:06, Thomas Munro  wrote:

On Thu, Jun 2, 2022 at 9:31 PM David Geier 
wrote:
> We recently encountered the same bug in the field. Oleksii
Kozlov managed to come up with reproduction steps, which reliably
trigger it. Interestingly, the bug does not only manifest as a
failing assertion, but also as a segmentation fault, in builds with
assertions disabled as well as enabled (!). So it can crash
production environments. We applied the proposed patch v3 from
Melanie to the REL_14_3 branch and can confirm that with the patch
neither the assertion failure nor the segmentation fault occurs any more.

Thanks for the report, testing, and for creating the CF entry.

I assume you are using parallel_leader_participation=off, and the
reason we haven't heard more about this is because few people do that.
By coincidence I was just about to restart a bunch of hash join
projects and have been paging the topic area back into my brain, so
I'll start with another round of testing/analysis of this bug/patch
next week.


Re: Fix order of checking ICU options in initdb and create database

2022-11-16 Thread Peter Eisentraut

On 29.10.22 13:33, Marina Polyakova wrote:
2. initdb/create database report problems with the ICU locale/encoding 
although they may already report that ICU is not supported in this build:


2.1.

$ initdb --locale-provider icu hoge
...
initdb: error: ICU locale must be specified

$ initdb --locale-provider icu --icu-locale en-US hoge
...
initdb: error: ICU is not supported in this build

$ createdb --locale-provider icu hoge
createdb: error: database creation failed: ERROR:  ICU locale must be 
specified


$ createdb --locale-provider icu --icu-locale en-US hoge
createdb: error: database creation failed: ERROR:  ICU is not supported 
in this build


I'm not in favor of changing this.  The existing code intentionally 
tries to centralize the "ICU is not supported in this build" knowledge 
in a few places.  Your patch tries to make this check early, but in the 
process adds more places where ICU support needs to be checked 
explicitly.  This increases the code size and also creates a future 
burden to maintain that level of checking.  I think building without ICU 
should be considered a marginal configuration at this point, so we don't 
need to go out of our way to create a perfect user experience for this 
configuration, as long as we check somewhere in the end.






Re: Fix order of checking ICU options in initdb and create database

2022-11-16 Thread Peter Eisentraut

On 29.10.22 15:09, Marina Polyakova wrote:

On 2022-10-29 14:33, Marina Polyakova wrote:

Hello!

This is the last proposed patch on this subject [1] moved to a
separate thread for Commitfest.


Also added a patch to export with_icu when running src/bin/scripts tests 
[1].


I have committed the meson change.





Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-16 Thread Amit Kapila
On Wed, Nov 16, 2022 at 11:56 PM Andres Freund  wrote:
>
> On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:
> > On Wed, Nov 16, 2022 at 7:30 AM Andres Freund  wrote:
> > >
> > > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> > > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund  
> > > > wrote:
> > > > > nor do we enforce in an obvious place that we
> > > > > don't already hold a snapshot.
> > > > >
> > > >
> > > > We have a check for (FirstXactSnapshot == NULL) in
> > > > RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
> > > > sufficient?
> > >
> > > I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
> > > still be bad. And I think checking in SetTransactionSnapshot() is too 
> > > late,
> > > we've already overwritten MyProc->xmin by that point.
> > >
> >
> > So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
> > we have the Assert for Assert(!FirstSnapshotSet)?
> > Assert(FirstXactSnapshot == NULL);
> > Assert(!HistoricSnapshotActive());
>
> I don't think that'd catch a catalog snapshot. But perhaps the better answer
> for the catalog snapshot is to just invalidate it explicitly. The user doesn't
> have control over the catalog snapshot being taken, and it's not too hard to
> imagine the walsender code triggering one somewhere.
>
> So maybe we should add something like:
>
> InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
> if (HaveRegisteredOrActiveSnapshot())
>   elog(ERROR, "cannot build an initial slot snapshot when snapshots exist")
> Assert(!HistoricSnapshotActive());
>
> I think we'd not need to assert FirstXactSnapshot == NULL or !FirstSnapshotSet
> with that, because those would show up in HaveRegisteredOrActiveSnapshot().
>

In the attached patch, I have incorporated this change and other
feedback. I think this should probably help us find the reason for
this problem when we see it in the future.

-- 
With Regards,
Amit Kapila.


v2-0001-Add-additional-checks-while-creating-the-initial-.patch
Description: Binary data


Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-11-16 Thread Bharath Rupireddy
On Wed, Nov 16, 2022 at 6:55 PM Maxim Orlov  wrote:
>
>> These functions are marked as 'STRICT', meaning a null is returned,
>> without even calling the function, if any of the input parameters is
>> null. I think we can keep the behaviour the same as its friends.
>
> Thanks for the explanations. I think you are right.
>
> Confirm. And a timeline_id is added.
>>
>> While on this, I noticed that the pg_walfile_name_offset() isn't
>> covered in tests. I took an opportunity and added a simple test case
>> along with pg_walfile_offset_lsn().
>>
>> I'm attaching the v3 patch set for further review.
>
> Great job! We should mark this patch as RFC, shall we?

Please do, if you feel so. Thanks for your review.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Supporting TAP tests with MSVC and Windows

2022-11-16 Thread Noah Misch
On Mon, Aug 22, 2022 at 09:44:42AM -0400, Andrew Dunstan wrote:
> On 2022-08-21 Su 20:40, Noah Misch wrote:
> > This (commit 13d856e of 2015-07-29) added the following:
> >
> > --- a/src/test/perl/TestLib.pm
> > +++ b/src/test/perl/TestLib.pm
> > @@ -242,7 +288,17 @@ sub command_exit_is
> > print("# Running: " . join(" ", @{$cmd}) ."\n");
> > my $h = start $cmd;
> > $h->finish();
> > -   is($h->result(0), $expected, $test_name);
> > +
> > +   # On Windows, the exit status of the process is returned directly as the
> > +   # process's exit code, while on Unix, it's returned in the high bits
> > +   # of the exit code (see WEXITSTATUS macro in the standard <sys/wait.h>
> > +   # header file). IPC::Run's result function always returns exit code >> 8,
> > +   # assuming the Unix convention, which will always return 0 on Windows as
> > +   # long as the process was not terminated by an exception. To work around
> > +   # that, use $h->full_result on Windows instead.
> > +   my $result = ($Config{osname} eq "MSWin32") ?
> > +   ($h->full_results)[0] : $h->result(0);
> > +   is($result, $expected, $test_name);
> >  }
> >
> > That behavior came up again in the context of a newer IPC::Run test case.  
> > I'm
> > considering changing the IPC::Run behavior such that the above would have 
> > been
> > unnecessary.  However, if I do, the above code would want to adapt to handle
> > pre-change and post-change IPC::Run versions.  If you have an opinion on
> > whether or how IPC::Run should change, I welcome comments on
> > https://github.com/toddr/IPC-Run/issues/161.
> 
> Assuming it changes, we'll have to have a version test here. I don't
> think we can have a flag day where we suddenly require IPC::Run's
> bleeding edge on Windows. So changing it is a good thing, but it won't
> help us much.

IPC::Run git now has the change, and we may release soon.  Here's what I plan
to push to make PostgreSQL cope.
Author: Noah Misch 
Commit: Noah Misch 

Account for IPC::Run::result() Windows behavior change.

This restores compatibility with the not-yet-released successor of
version 20220807.0.  Back-patch to 9.4, which introduced this code.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 99d3345..b139190 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -782,15 +782,11 @@ sub command_exit_is
my $h = IPC::Run::start $cmd;
$h->finish();
 
-   # On Windows, the exit status of the process is returned directly as the
-   # process's exit code, while on Unix, it's returned in the high bits
-   # of the exit code (see WEXITSTATUS macro in the standard <sys/wait.h>
-   # header file). IPC::Run's result function always returns exit code >> 8,
-   # assuming the Unix convention, which will always return 0 on Windows as
-   # long as the process was not terminated by an exception. To work around
-   # that, use $h->full_results on Windows instead.
+   # Normally, if the child called exit(N), IPC::Run::result() returns N.  On
+   # Windows, with IPC::Run v20220807.0 and earlier, full_results() is the
+   # method that returns N (https://github.com/toddr/IPC-Run/issues/161).
my $result =
-   ($Config{osname} eq "MSWin32")
+ ($Config{osname} eq "MSWin32" && $IPC::Run::VERSION <= 20220807.0)
  ? ($h->full_results)[0]
  : $h->result(0);
is($result, $expected, $test_name);


Re: Data is copied twice when specifying both child and parent table in publication

2022-11-16 Thread vignesh C
On Wed, 16 Nov 2022 at 14:28, wangw.f...@fujitsu.com
 wrote:
>
> On Mon, Nov 14, 2022 at 0:56 AM vignesh C  wrote:
> > >
> > > Attach new patches.
> >
>
> Thanks for your comments.
>
> > Here we are having tables list to store the relids and table_infos
> > list which stores pubid along with relid. Here tables list acts as a
> > temporary list to get filter_partitions and then delete the
> > published_rel from table_infos. Will it be possible to directly
> > operate on table_infos list and remove the temporary tables list used.
> > We might have to implement comparator, deduplication functions and
> > change filter_partitions function to work directly on published_rel
> > type list.
> > +   /*
> > +* Record the published table and the
> > corresponding publication so
> > +* that we can get row filters and column list 
> > later.
> > +*
> > +* When a table is published by multiple
> > publications, to obtain
> > +* all row filters and column list, the
> > structure related to this
> > +* table will be recorded multiple times.
> > +*/
> > +   foreach(lc, pub_elem_tables)
> > +   {
> > +   published_rel *table_info =
> > (published_rel *) malloc(sizeof(published_rel));
> > +
> > +   table_info->relid = lfirst_oid(lc);
> > +   table_info->pubid = pub_elem->oid;
> > +   table_infos = lappend(table_infos, 
> > table_info);
> > +   }
> > +
> > +   tables = list_concat(tables, pub_elem_tables);
> >
> > Thoughts?
>
> I think we could only deduplicate published tables per publication to get all
> row filters and column lists for each published table later.
> I removed the temporary list 'tables' and modified the API of the function
> filter_partitions to handle published_rel type list.
>
> Attach the new patch set.

Thanks for the update patch.
One suggestion:
+/* Records association between publication and published table */
+typedef struct
+{
+   Oid relid;  /* OID of published table */
+   Oid pubid;  /* OID of publication
that publishes this
+* table. */
+} published_rel;
+

+   /*
+* Record the published table and the
corresponding publication so
+* that we can get row filters and column list later.
+*
+* When a table is published by multiple
publications, to obtain
+* all row filters and column list, the
structure related to this
+* table will be recorded multiple times.
+*/
+   foreach(lc, pub_elem_tables)
+   {
+   published_rel *table_info =
(published_rel *) malloc(sizeof(published_rel));
+
+   table_info->relid = lfirst_oid(lc);
+   table_info->pubid = pub_elem->oid;
+   table_infos = lappend(table_infos, table_info);
+   }

In this format, if there are n relations in a publication we will store the
pubid n times, and in an all-tables publication there will be many thousands
of tables. We could avoid storing the pubid for every relid; instead we
could represent it like below, to avoid storing the publication id for
each table:

+/* Records association between publication and published tables */
+typedef struct
+{
+   List   *relids;  /* OIDs of the publisher tables */
+   Oid pubid;   /* OID of publication that publishes these
+ * tables. */
+} published_rel;

Thoughts?

Regards,
Vignesh




RE: Fix the README file for MERGE command

2022-11-16 Thread Waithant Myo (Fujitsu)
Hi Richard,

Thank you for your time.

Best Regards,
Myo Wai Thant
From: Richard Guo 
Sent: Thursday, November 17, 2022 10:31 AM
To: Myo, Waithant/Myo W. 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Fix the README file for MERGE command


On Wed, Nov 16, 2022 at 4:37 PM Waithant Myo (Fujitsu)  wrote:
The actions of the MERGE command can be specified as follows: INSERT, UPDATE,
DELETE and DO NOTHING.
However, in the README file, the word ‘UPDATE’ appears twice where one
occurrence should be ‘DELETE’.

Therefore, I have attached a patch file which fixes this word usage.
It would be great if you could take a look at it.

Apparently this is a typo.  Good catch! +1.

Thanks
Richard


Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-16 Thread Amit Kapila
On Wed, Nov 16, 2022 at 11:56 PM Andres Freund  wrote:
>
> On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:
> > On Wed, Nov 16, 2022 at 7:30 AM Andres Freund  wrote:
> > >
> > > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> > > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund  
> > > > wrote:
> > > > > nor do we enforce in an obvious place that we
> > > > > don't already hold a snapshot.
> > > > >
> > > >
> > > > We have a check for (FirstXactSnapshot == NULL) in
> > > > RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
> > > > sufficient?
> > >
> > > I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
> > > still be bad. And I think checking in SetTransactionSnapshot() is too 
> > > late,
> > > we've already overwritten MyProc->xmin by that point.
> > >
> >
> > So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
> > we have the Assert for Assert(!FirstSnapshotSet)?
> > Assert(FirstXactSnapshot == NULL);
> > Assert(!HistoricSnapshotActive());
>
> I don't think that'd catch a catalog snapshot. But perhaps the better answer
> for the catalog snapshot is to just invalidate it explicitly. The user doesn't
> have control over the catalog snapshot being taken, and it's not too hard to
> imagine the walsender code triggering one somewhere.
>
> So maybe we should add something like:
>
> InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
>

The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
We already have a check (/* so we don't overwrite the existing value
*/
if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
which ensures that we don't overwrite MyProc->xmin, so the above
comment seems contradictory to me.

-- 
With Regards,
Amit Kapila.




Re: predefined role(s) for VACUUM and ANALYZE

2022-11-16 Thread Nathan Bossart
On Wed, Nov 16, 2022 at 03:09:47PM -0500, Andrew Dunstan wrote:
> OK, reading the history I think everyone is on board with expanding
> AclMode from uint32 to uint64. Is that right?

I skimmed through this thread again, and AFAICT folks are okay with this
approach.  I'm not aware of any remaining concerns.

> If so I'm intending to
> commit at least the first two of these patches fairly soon.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: ps command does not show walsender's connected db

2022-11-16 Thread Ian Lawrence Barwick
On Tue, Oct 11, 2022 at 20:06 Bharath Rupireddy  wrote:
>
> On Tue, Oct 11, 2022 at 2:11 PM bt22nakamorit
>  wrote:
> >
> > 2022-10-10 16:12 Bharath Rupireddy wrote:
> > > Thanks. LGTM.
> > Thank you for your review!
> > I have this issue posted on Commitfest 2022-11 with title "show
> > walsender's connected db for ps command entry".
> > May I change the status to "Ready for Committer"?
>
> Done - https://commitfest.postgresql.org/40/3937/.

Hi

Fujii-san is marked as committer on the commitfest entry for this patch [1];
are you able to go ahead and get it committed?

[1] https://commitfest.postgresql.org/40/3937/

Thanks

Ian Barwick




Re: Parallel Full Hash Join

2022-11-16 Thread Ian Lawrence Barwick
On Fri, Apr 8, 2022 at 20:30 Thomas Munro  wrote:
>
> On Wed, Jan 12, 2022 at 10:30 AM Melanie Plageman
>  wrote:
> > On Fri, Nov 26, 2021 at 3:11 PM Thomas Munro  wrote:
> > > #3 0x009cf57e in ExceptionalCondition (conditionName=0x29cae8
> > > "BarrierParticipants(>shared->batch_barrier) == 1",
> > > errorType=, fileName=0x2ae561 "nodeHash.c",
> > > lineNumber=lineNumber@entry=2224) at assert.c:69
> > > No locals.
> > > #4 0x0071575e in ExecParallelScanHashTableForUnmatched
> > > (hjstate=hjstate@entry=0x80a60a3c8,
> > > econtext=econtext@entry=0x80a60ae98) at nodeHash.c:2224
> >
> > I believe this assert can be safely removed.
>
> Agreed.
>
> I was looking at this with a view to committing it, but I need more
> time.  This will be at the front of my queue when the tree reopens.
> I'm trying to find the tooling I had somewhere that could let you test
> attaching and detaching at every phase.
>
> The attached version is just pgindent'd.

Hi Thomas

This patch is marked as "Waiting for Committer" in the current commitfest [1]
with yourself as committer; do you have any plans to move ahead with this?

[1] https://commitfest.postgresql.org/40/2903/

Regards

Ian Barwick




Re: pglz compression performance, take two

2022-11-16 Thread Ian Lawrence Barwick
On Fri, Nov 5, 2021 at 14:51 Andrey Borodin  wrote:
>
> Thanks for the review Mark! Sorry it took too long to reply on my side.
>
> > On 28 June 2021, at 21:05, Mark Dilger  wrote:
> >
> >> #define PGLZ_HISTORY_SIZE   0x0fff - 1  /* to avoid compare in 
> >> iteration */
> > ...
> >> static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
> > ...
> >>if (hist_next == PGLZ_HISTORY_SIZE + 1)
> >
> > These are the only uses of PGLZ_HISTORY_SIZE.  Perhaps you could just 
> > define the symbol as 0x0fff and skip the -1 and +1 business?
> Fixed.
>
> >> /* --
> >> * pglz_compare -
> >> *
> >> *  Compares 4 bytes at pointers
> >> * --
> >> */
> >> static inline bool
> >> pglz_compare32(const void *ptr1, const void *ptr2)
> >> {
> >>return memcmp(ptr1, ptr2, 4) == 0;
> >> }
> >
> > The comment function name differs from the actual function name.
> >
> > Also, pglz_compare returns an offset into the string, whereas 
> > pglz_compare32 returns a boolean.  This is fairly unintuitive.  The "32" 
> > part of pglz_compare32 implies doing the same thing as pglz_compare but 
> > where the string is known to be 4 bytes in length.  Given that 
> > pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using 
> > /pglz_compare/ in its name?
> I've removed pglz_compare32 entirely. It was a simple substitution for 
> memcmp().
>
> >
> >>/*
> >> * Determine length of match. A better match must be larger than the
> >> * best so far. And if we already have a match of 16 or more bytes,
> >> * it's worth the call overhead to use memcmp()
> >
> > This comment is hard to understand, given the code that follows.  The first 
> > block calls memcmp(), which seems to be the function overhead you refer to. 
> >  The second block calls the static inline function pglz_compare32, which 
> > internally calls memcmp().  Superficially, there seems to be a memcmp() 
> > function call either way.  The difference is that in the first block's call 
> > to memcmp(), the length is a runtime value, and in the second block, it is 
> > a compile-time known value.  If you are depending on the compiler to notice 
> > this distinction and optimize the second call, perhaps you can mention that 
> > explicitly?  Otherwise, reading and understanding the comment takes more 
> > effort.
> I've updated the comment for the second branch with the fixed-size memcmp().
> Frankly, I'm not sure "if (memcmp(input_pos, hist_pos, 4) == 0)" is worth the
> complexity; the internals of "pglz_compare(0, len_bound, input_pos + 0,
> hist_pos + 0);" would execute almost the same instructions.
>
> >
> > I took a quick look for other places in the code that try to beat the 
> > performance of memcmp on short strings.  In varlena.c, rest_of_char_same() 
> > seems to do so.  We also use comparisons on NameData, which frequently 
> > contains strings shorter than 16 bytes.  Is it worth sharing a static 
> > inline function that uses your optimization in other places?  How confident 
> > are you that your optimization really helps?
> Honestly, I do not know. The overall patch effect consists of stacking up 
> many small optimizations. They have a net effect, but are too noisy to 
> measure independently. That's mostly the reason why I didn't know what to 
> reply for so long.
>
>
> > On 5 Nov 2021, at 01:47, Tomas Vondra  wrote:
> >
> > Andrey, can you update the patch per Mark's review? I'll do my best to get 
> > it committed sometime in this CF.
>
> Cool! Here's the patch.

HI!

This patch is marked as "Ready for Committer" in the current commitfest [1]
but has seen no further activity for more than a year. Given that it's on
its 10th commitfest, it would be useful to clarify its status one way or
the other.

[1] https://commitfest.postgresql.org/40/2897/

Thanks

Ian Barwick




Re: CI and test improvements

2022-11-16 Thread Justin Pryzby
On Wed, Nov 16, 2022 at 08:08:32PM -0800, Andres Freund wrote:
> I also don't like my "cores_script". Not quite sure yet how to do that
> more cleanly.

I don't know which is cleaner:

ls /core* && mv /tmp/core* /tmp/cores

find / -maxdepth 1 -type f -name 'core*' -print0 |
xargs -r0 mv -vt /tmp/cores

for a in /core*; do [ ! -e "$a" ] || mv "$a" /tmp/cores; done




Re: CI and test improvements

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 21:58:39 -0600, Justin Pryzby wrote:
> On Wed, Nov 16, 2022 at 07:48:14PM -0800, Andres Freund wrote:
> > I've used this a bunch on personal branches, and I think it's the way to
> > go. It doesn't take long, saves a lot of cycles when one pushes something
> > broken. Starts to runs the CompilerWarnings task after a minimal amount of
> > sanity checking, instead of having to wait for a task running all tests,
> > without the waste of running it immediately and failing all the different
> > configurations, which takes forever.
> 
> Well, I don't hate it.
> 
> But I don't think you should call "ccache -z":

Agreed - that was really just for "development" of the task.

I also don't like my "cores_script". Not quite sure yet how to do that
more cleanly.

Greetings,

Andres Freund




Re: CI and test improvements

2022-11-16 Thread Justin Pryzby
On Wed, Nov 16, 2022 at 07:48:14PM -0800, Andres Freund wrote:
> I've used this a bunch on personal branches, and I think it's the way to
> go. It doesn't take long, saves a lot of cycles when one pushes something
> broken. Starts to runs the CompilerWarnings task after a minimal amount of
> sanity checking, instead of having to wait for a task running all tests,
> without the waste of running it immediately and failing all the different
> configurations, which takes forever.

Well, I don't hate it.

But I don't think you should call "ccache -z":

On Tue, Oct 18, 2022 at 12:09:30PM -0500, Justin Pryzby wrote:
> I realized that ccache -z clears out not only the global stats, but the
> per-file cache stats (from which the global stats are derived) - which
> obviously makes the cache work poorly.


-- 
Justin




Re: HOT chain validation in verify_heapam()

2022-11-16 Thread Himanshu Upadhyaya
On Wed, Nov 16, 2022 at 11:23 PM Robert Haas  wrote:

> On Wed, Nov 16, 2022 at 4:51 AM Himanshu Upadhyaya
>  wrote:
> > yes, got it, have tried to test and it is giving false corruption in
> case of subtransaction.
> > I think a better way to have this check is, we need to check that if
> pred_xmin is
> > aborted then current_xmin should be aborted only. So there is no way
> that we
> > validate corruption with in_progress txid.
>
> Please note that you can't use TransactionIdDidAbort here, because
> that will return false for transactions aborted by a crash. You have
> to check that it's not in progress and then afterwards check that it's
> not committed. Also note that if you check whether it's committed
> first and then check whether it's in progress afterwards, there's a
> race condition: it might commit just after you verify that it isn't
> committed yet, and then it won't be in progress any more and will look
> aborted.
>
> I disagree with the idea that we can't check in progress. I think the
> checks could look something like this:
>
> pred_in_progress = TransactionIdIsInProgress(pred_xmin);
> current_in_progress = TransactionIdIsInProgress(current_xmin);
> if (pred_in_progress)
> {
>  if (current_in_progress)
> return ok;
>  // recheck to avoid race condition
>  if (TransactionIdIsInProgress(pred_xmin))
>  {
>  if (TransactionIdDidCommit(current_xmin))
>  return corruption: predecessor xmin in progress, but
> current xmin committed;
>  else
>  return corruption: predecessor xmin in progress, but
> current xmin aborted;
>  }
>
I think we can have a situation where pred_xmin is in progress but
curr_xmin is aborted; consider the example below:
 ‘postgres[14723]=#’BEGIN;
BEGIN
‘postgres[14723]=#*’insert into test2 values (1,1);
INSERT 0 1
‘postgres[14723]=#*’savepoint s1;
SAVEPOINT
‘postgres[14723]=#*’update test2 set a =2;
UPDATE 1
‘postgres[14723]=#*’rollback to savepoint s1;
ROLLBACK

Now pred_xmin is in progress but curr_xmin is aborted; am I missing
anything here?
I think if pred_xmin is aborted and curr_xmin is in progress we should
consider it a corruption case, but vice versa is not true.


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?

2022-11-16 Thread Tom Lane
Dmitry Koterov  writes:
> PG13+. Assume we have two identical queries with no arguments (as a plain
> text, e.g. passed to PQexec - NOT to PQexecParams!):

> - one with "a=X AND b IN(...)"
> - and one with "a=X and b=ANY('{...}')

> The question: is it guaranteed that the planner will always choose
> identical plans for them (or, at least, the plan for ANY will not match an
> existing index worse than the plan with IN)?

This depends greatly on what "..." represents.  But if it's a list
of constants, they're probably equivalent.  transformAExprIn()
offers some caveats:

 * We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
 * possible if there is a suitable array type available.  If not, we fall
 * back to a boolean condition tree with multiple copies of the lefthand
 * expression.  Also, any IN-list items that contain Vars are handled as
 * separate boolean conditions, because that gives the planner more scope
 * for optimization on such clauses.
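
For illustration, the Var caveat can be seen with the "test" table from the
original message (a sketch; the exact plan shape will vary, but the a + 1
item in the second query cannot be folded into the array form):

-- all-constant list: transformed into b = ANY ('{1,2,3}'::bigint[])
EXPLAIN SELECT * FROM test WHERE a = 10 AND b IN (1, 2, 3);

-- an IN-list item that references a column is kept as a separate
-- boolean condition
EXPLAIN SELECT * FROM test WHERE a = 10 AND b IN (1, 2, a + 1);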

regards, tom lane




Re: CI and test improvements

2022-11-16 Thread Andres Freund
Hi,

On 2022-10-02 14:54:21 -0700, Andres Freund wrote:
> On 2022-10-02 16:35:06 -0500, Justin Pryzby wrote:
> > On Sun, Oct 02, 2022 at 01:52:01PM -0700, Andres Freund wrote:
> > > On 2022-10-01 18:36:41 -0700, Andres Freund wrote:
> > > > I am wondering if we should instead introduce a new "quickcheck" task 
> > > > that
> > > > just compiles and runs maybe one test and have *all* other tests depend 
> > > > on
> > > > that.  Wasting a precious available windows instance to just fail to 
> > > > build or
> > > > immediately fail during tests doesn't really make sense.
> > 
> > > With a primed cache this takes ~32s, not too bad imo. 12s of that is
> > > cloning the repo.
> > 
> > Maybe - that would avoid waiting 4 minutes for a windows instance to
> > start in the (hopefully atypical) case of a patch that fails in 1-2
> > minutes under linux/freebsd.
> > 
> > If the patch were completely broken, the windows task would take ~4min
> > to start, plus up to ~4min before failing to compile or failing an early
> > test.  6-8 minutes isn't nothing, but doesn't seem worth the added
> > complexity.
> 
> Btw, the motivation to work on this just now was that I'd like to enable more
> sanitizers (undefined,alignment for linux-meson, address for
> linux-autoconf). Yes, we could make the dependency on freebsd instead, but I'd
> like to try to enable the clang-only memory sanitizer there (if it works on
> freebsd)...

I've used this a bunch on personal branches, and I think it's the way to
go. It doesn't take long, saves a lot of cycles when one pushes something
broken. Starts to runs the CompilerWarnings task after a minimal amount of
sanity checking, instead of having to wait for a task running all tests,
without the waste of running it immediately and failing all the different
configurations, which takes forever.

Greetings,

Andres Freund




Is the plan for IN(1,2,3) always the same as for =ANY('{1,2,3}') when using PQexec with no params?

2022-11-16 Thread Dmitry Koterov
Hi.

PG13+. Assume we have two identical queries with no arguments (as a plain
text, e.g. passed to PQexec - NOT to PQexecParams!):

- one with "a=X AND b IN(...)"
- and one with "a=X and b=ANY('{...}')

The question: is it guaranteed that the planner will always choose
identical plans for them (or, at least, the plan for ANY will not match an
existing index worse than the plan with IN)? In my experiments it shows
that the answer is "yes", but I don't know the PG internals to make sure
that it's true in ALL situations.

Assumptions:

- The number of values in IN/ANY is of medium cardinality (say, 10-100
values)
- Again, all those values are static; no parameters are involved; plain
simple SQL as a text
- There is also another column "a" which is compared against a constant
("a" is of 1000x lower cardinality than "b" to make it interesting), and an
index on (a, b)

Example:

create table test(a bigint, b bigint);
create index test_idx on test(a, b);

truncate test;
insert into test(a, b) select round(s/1), s from generate_series(1,
100) s;

# explain analyze select * from test where *a=10 and b in(1,2,3)*;

 Index Only Scan using test_idx on test  (cost=0.42..13.31 rows=1 width=16)
   Index Cond: ((a = 10) AND (b = ANY ('{1,2,3}'::bigint[])))

# explain analyze select * from test where *a=10 and b=any('{1,2,3}')*;

 Index Only Scan using test_idx on test  (cost=0.42..13.31 rows=1 width=16)
   Index Cond: ((a = 10) AND (b = ANY ('{1,2,3}'::bigint[])))

It shows exactly the same plan here. *Would it always be the same, or may it
be different?* (E.g. my worry is that for the IN variant, the planner can
use the statistics against the actual values in that IN parentheses, whilst
when using ANY('{...}'), I can imagine that in some circumstances, it would
ignore the actual values within the literal and instead build a "generic"
plan which may not utilize the index properly; is it the case?)

P.S.
A slightly correlated question was raised on StackOverflow, e.g.
https://stackoverflow.com/questions/34627026/in-vs-any-operator-in-postgresql
- but it wasn't about this particular usecase, they were discussing more
complicated things, like when each element of an IN/ANY clause is actually
a pair of elements (which makes the planner go crazy in some
circumstances). My question is way simpler, I don't deal with clauses like
"IN((1,2),(3,4),(5,6))" etc.; it's all about the plain 2-column selects.
Unfortunately, it's super-hard to find more info about this question,
because both "in" and "any" are stop-words in search engines, so they don't
show good answers, including any of the PG mail list archives.

P.S.S.
Why does the answer matter? Because for "IN(1,2,3)" case, e.g.
pg_stat_statements generalizes the query to "IN($1,$2,$3)" and doesn't
coalesce multiple queries into one, whilst for "=ANY('{1,2,3}')", it
coalesces them all to "=ANY($1)". Because those 1,2,3,... lists have a
different cardinality all the time, the logs/stats are basically flooded
with useless variants of the same query. It also applies to e.g. logging to
Datadog which normalizes the queries. We'd love to use "=ANY(...)" variant
everywhere and never use IN() anymore, but are scared of getting some
unexpected regressions.


Re: pg_upgrade test failure

2022-11-16 Thread Justin Pryzby
On Tue, Nov 08, 2022 at 01:16:09AM +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this.  My next guess is that the attached
> might help.

I took the liberty of adding a CF entry for this
https://commitfest.postgresql.org/41/4011/

And afterwards figured I could be a little bit wasteful and run the
tests using meson test --repeat, rather than let cfbot do it over the
course of a month.
https://cirrus-ci.com/task/5115893722644480

So I didn't find evidence that it doesn't resolve the issue (but this
also doesn't prove that it will work).

-- 
Justin




Re: User functions for building SCRAM secrets

2022-11-16 Thread Michael Paquier
On Thu, Nov 10, 2022 at 11:14:34PM -0500, Jonathan S. Katz wrote:
> On 10/31/22 8:56 PM, Michael Paquier wrote:
>> Well, one could pass a salt based on something generated by random()
>> to emulate what we currently do in the default case, as well.  The
>> salt length is an extra possibility, letting it be randomly generated
>> by pg_strong_random().
> 
> Sure, this is a good point. From a SQL level we can get that from pgcrypto
> "gen_random_bytes".

Could it be something we could just push into core?  FWIW, I've used
that quite a bit in the past to cheaply build long random strings of
data for other things.  Without pgcrypto, random() with
generate_series() has always been kind of..  fun.
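
For example (a sketch; gen_random_bytes() requires the pgcrypto extension,
and the fallback shown is just one way to emulate it with core functions):

-- with pgcrypto: a 16-byte random salt
SELECT gen_random_bytes(16);

-- without pgcrypto: build 16 random bytes by hand
SELECT decode(string_agg(lpad(to_hex((random() * 255)::int), 2, '0'), ''), 'hex')
FROM generate_series(1, 16);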

+SELECT scram_build_secret_sha256(NULL);
+ERROR:  password must not be null
+SELECT scram_build_secret_sha256(NULL, NULL);
+ERROR:  password must not be null
+SELECT scram_build_secret_sha256(NULL, NULL, NULL);
+ERROR:  password must not be null

This is just testing three times the same thing as per the defaults.
I would cut the second and third cases.

git diff --check reports some whitespaces.

scram_build_secret_sha256_internal() is missing SASLprep on the
password string.  Perhaps the best thing to do here is just to extend
pg_be_scram_build_secret() with more arguments so that callers can
optionally pass down a custom salt with its length, leaving the
responsibility to pg_be_scram_build_secret() to create a random salt
if nothing has been given?
--
Michael




Re: logical decoding and replication of sequences, take 2

2022-11-16 Thread Andres Freund
Hi,


On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply that independently. But:
>
>   (a) What would you do with increments of sequences created/reset in a
>   transaction? Can't apply those outside the transaction, because it
>   might be rolled back (and that state is not visible on primary).

I think a reasonable approach could be to actually perform different WAL
logging for that case. It'll require a bit of machinery, but could actually
result in *less* WAL logging overall, because we don't need to emit a WAL
record for each SEQ_LOG_VALS sequence values.



>   (b) What about increments created before we have a proper snapshot?
>   There may be transactions dependent on the increment. This is what
>   ultimately led to revert of the patch.

I don't understand this - why would we ever need to process those increments
from before we have a snapshot?  Wouldn't they, by definition, be before the
slot was active?

To me this is the rough equivalent of logical decoding not giving the initial
state of all tables. You need some process outside of logical decoding to get
that (obviously we have some support for that via the exported data snapshot
during slot creation).

I assume that part of the initial sync would have to be a new sequence
synchronization step that reads all the sequence states on the publisher and
ensures that the subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical replication replay
support for sequences will have to be a bit careful about not decreasing the
subscriber's sequence values - the standby initially will be ahead of the
increments we'll see in the WAL. But that seems inevitable given the
non-transactional nature of sequences.
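
(For context, the reason the copied state runs ahead: nextval() WAL-logs a
value up to SEQ_LOG_VALS entries in advance, so e.g.

    CREATE SEQUENCE s;
    SELECT nextval('s');  -- returns 1, but WAL already records a value ~32 ahead

and after a crash, or on a subscriber synchronized from that state, the next
value handed out can skip ahead rather than be 2. Illustrative only.)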



> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

That approach seems quite wrong to me.


> > I'm going to confess that I have no really specific idea how to
> > implement that. I'm just not sufficiently familiar with this code.
> > However, I suspect that the solution lies in changing things on the
> > decoding side rather than in the WAL format. I feel like the
> > information that we need in order to do the right thing must already
> > be present in the WAL. If it weren't, then how could crash recovery
> > work correctly, or physical replication? At any given moment, you can
> > choose to promote a physical standby, and at that point the state you
> > observe on the new primary had better be some state that existed on
> > the primary at some point in its history. At any moment, you can
> > unplug the primary, restart it, and run crash recovery, and if you do,
> > you had better end up with some state that existed on the primary at
> > some point shortly before the crash.

One minor exception here is that there's no real time bound to see the last
few sequence increments if nothing after the XLOG_SEQ_LOG records forces a WAL
flush.


> > I think that there are actually a
> > few subtle inaccuracies in the last two sentences, because actually
> > the order in which transactions become visible on a physical standby
> > can differ from the order in which it happens on the primary, but I
> > don't think that actually changes the picture much. The point is that
> > the WAL is the definitive source of information about what happened
> > and in what order it happened, and we use it in that way already in
> > the context of physical replication, and of standbys. If logical
> > decoding has a problem with some case that those systems handle
> > correctly, the problem is with logical decoding, not the WAL format.
> >
>
> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.

> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even 

Re: Meson add host_system to PG_VERSION_STR

2022-11-16 Thread Michael Paquier
On Wed, Nov 16, 2022 at 02:12:05PM +0100, Juan José Santamaría Flecha wrote:
> On Wed, Nov 16, 2022 at 10:50 AM Peter Eisentraut <
> peter.eisentr...@enterprisedb.com> wrote:
>> Perhaps some examples before and after on different platforms could be
>> shown.
> 
> Yes, that would make clear what the patch is trying to do. For version() we
> get:
> 
> Configure:
>  PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Debian
> 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
> 
> Meson:
>  PostgreSQL 16devel on x86_64, compiled by gcc-6.3.0
> 
> Patched:
>  PostgreSQL 16devel on x86_64-linux, compiled by gcc-6.3.0

Ah, thanks.  I was not following this point.  Adding the host
information for consistency makes sense, indeed.
--
Michael




Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-16 Thread Michael Paquier
On Wed, Nov 16, 2022 at 10:53:02AM +0800, Julien Rouhaud wrote:
> While being the same inclusion infrastructure, it's likely that people will
> have different usage.  I'm assuming that for GUCs the main usage is to have
> your automation tool put one of your template conf for instance 
> (small_vm.conf,
> big_vm.conf or something like that) in an included directory, so you don't
> really need a lot of them.  You can also rely on ALTER SYSTEM to avoid 
> manually
> handling configuration files entirely.
> 
> For authentication there are probably very different pattern.  The use case I
> had when writing this patch is some complex application that relies on many
> services, each service having dedicated role and authentication, and services
> can be enabled or disabled dynamically pretty much anytime.  I had to write
> code to merge possibly new entries with existing pg_hba/pg_ident configuration
> files.  With this feature it would be much easier (and robust) to simply have 
> a
> main pg_hba/pg_ident that includes some directory, and have the service
> enable/disable simply creates/removes a dedicated file for each service, and 
> we
> then would usually have at least a dozen files there.
> 
> I'm assuming people doing multi-tenant can also have similar usage.

So your main application for HBA/ident is include_dir/..  For GUCs, we
would just skip the directory if it has no files, but the directory
has to exist.  Your patch is behaving the same.
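
For illustration, the kind of layout being described would look something
like this (hypothetical file names):

    # pg_hba.conf
    include_dir hba.d

    # hba.d/service_foo.conf -- dropped in when the service is enabled,
    # removed again when it is disabled
    hostssl  foo_db  foo_role  10.0.0.0/8  scram-sha-256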

>> but that does not
>> mean, it seems to me, that there should be an inconsistency in the way
>> we process those commands because one has implemented a feature but
>> not the other.  On the contrary, I'd rather try to make them
>> consistent.
> 
> You mean stopping at the first error, even if it's only for the view 
> reporting?
> That will make the reload consistent, but the view will be a bit useless then.

Yep, I meant to stop the generation of the TokenizedAuthLines at the
first inclusion error by letting tokenize_auth_file() return a status
to be able to stop the recursions.  But after some second-thoughts
pondering about this, I see why I am wrong and why you are right.  As
you say, stopping the generation of the TokenizedAuthLines would just
limit the amount of data reported at once in the view, the golden rule
for HBA/ident being that we would reload nothing as long as there is
at least one error reported when parsing things.  So giving more
control to tokenize_auth_file() to stop a recursion, say by making it
return a boolean status, makes little sense.

> It would be nice to have the information that an "include_if_exists" file
> didn't exist, but having a log-level message in the "error" column is a clear
> POLA violation.  People will probably just do something like
> 
> SELECT file_name, line_number, error FROM pg_hba_file_rules WHERE error IS NOT
> NULL;
> 
> and report an error if any row is found.  Having to parse the error field to
> know if that's really an error or not is going to be a huge foot-gun.  Maybe 
> we
> could indeed report the problem in err_msg but for include_if_exists display 
> it
> in some other column of the view?

Hmm.  One possibility would be to add a direct mention to the
"include", "include_dir" and "include_if_exists" clauses through
pg_hba_file_rules.type?  I don't see how to do that without making the
patch much more invasive as per the existing separation between
tokenization and Hba/IdentLine filling, though, and perhaps the error 
provided with the full path to the file would provide enough context
for one to know if the failure is happening on an included file for
database/user lists or a full file from an "include" clause.  It also
means that include_if_exists remains transparent all the time.
Simpler may be for the best here, at the end.  

By the way, I am wondering whether process_included_authfile() is
the most intuitive interface here.  The only thing that prevents a
common routine to process the include commands is the failure on
GetConfFilesInDir(), where we need to build a TokenizedAuthLine when
the others have already done so in tokenize_file_with_context().  Could
it be cleaner to have a small routine like makeTokenizedAuthLine()
that gets reused when we fail scanning a directory to build a
TokenizedAuthLine, in combination with a small-ish routine working on
a directory like ParseConfigDirectory() but for HBA/ident?  Or we
could just drop process_included_authfile() entirely?  On failure,
this would make the code do a next_line all the time for all the
include clauses.
--
Michael




ubsan fails on 32bit builds

2022-11-16 Thread Andres Freund
Hi,

I am working on polishing my patch to make CI use sanitizers. Unfortunately
using -fsanitize=alignment,undefined causes tests to fail on 32bit builds.

https://cirrus-ci.com/task/5092504471601152
https://api.cirrus-ci.com/v1/artifact/task/5092504471601152/testrun/build-32/testrun/recovery/022_crash_temp_files/log/022_crash_temp_files_node_crash.log

../src/backend/storage/lmgr/proc.c:1173:2: runtime error: member access within 
misaligned address 0xf4019e54 for type 'struct PGPROC', which requires 8 byte 
alignment
0xf4019e54: note: pointer points here
  e0 0d 09 f4 54 9e 01 f4  54 9e 01 f4 00 00 00 00  00 00 00 00 00 00 00 00  00 
00 00 00 00 00 00 00
  ^
==65203==Using libbacktrace symbolizer.
#0 0x57076f46 in ProcSleep ../src/backend/storage/lmgr/proc.c:1173
#1 0x57054cf7 in WaitOnLock ../src/backend/storage/lmgr/lock.c:1859
#2 0x57058e4f in LockAcquireExtended ../src/backend/storage/lmgr/lock.c:1101
#3 0x57058f82 in LockAcquire ../src/backend/storage/lmgr/lock.c:752
#4 0x57051bb8 in XactLockTableWait ../src/backend/storage/lmgr/lmgr.c:702
#5 0x569c31b3 in _bt_doinsert ../src/backend/access/nbtree/nbtinsert.c:225
#6 0x569cff09 in btinsert ../src/backend/access/nbtree/nbtree.c:200
#7 0x569ac19d in index_insert ../src/backend/access/index/indexam.c:193
#8 0x56c72af6 in ExecInsertIndexTuples 
../src/backend/executor/execIndexing.c:416
#9 0x56d014c7 in ExecInsert ../src/backend/executor/nodeModifyTable.c:1065
...


I can reproduce this locally.

At first I thought the problem was caused by:
46d6e5f5679 Display the time when the process started waiting for the lock, in 
pg_locks, take 2

as pg_atomic_uint64 is 8 byte aligned on x86 - otherwise one gets into
terrible terrible performance territory because atomics can be split across
cachelines - but 46d6e5f5679 didn't teach ProcGlobalShmemSize() /
InitProcGlobal() that allocations need to be aligned to a larger
size. However, we've made ShmemAllocRaw() use cacheline alignment, which
should suffice.  And indeed - ProcGlobal->allProcs is aligned correctly, and
sizeof(PGPROC) % 8 == 0.  It doesn't seem great to rely on that, but ...


Printing out *proc in proc.c:1173 seems indicates clearly that it's not a
valid proc for some reason.

(gdb) p myHeldLocks
$26 = 0
(gdb) p lock->waitProcs
$27 = {links = {prev = 0xf33c4b5c, next = 0xf33c4b5c}, size = 0}
(gdb) p &(waitQueue->links)
$29 = (SHM_QUEUE *) 0xf33c4b5c
(gdb) p proc
$30 = (PGPROC *) 0xf33c4b5c

Afaict the problem is that
proc = (PGPROC *) &(waitQueue->links);

is a gross gross hack - this isn't actually a PGPROC, it's pointing to an
SHM_QUEUE, but *not* one embedded in PGPROC.  It kinda works because ->links
is at offset 0 in PGPROC, which means that
SHMQueueInsertBefore(&(proc->links), &(MyProc->links));
will turn &(proc->links) back into waitQueue->links. Which we then can enqueue
again.

I don't see the point of this hack, even leaving ubsan's valid complaints
aside. Why bother having this, sometimes, fake PGPROC pointer when we could
just use a SHM_QUEUE* to determine the insertion point?
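
Roughly (simplified sketch, not the actual proc.c code):

    /* today: treat the queue header itself as a fake PGPROC; only "works"
     * because offsetof(PGPROC, links) == 0, and it trips ubsan because the
     * header isn't 8-byte aligned the way a real PGPROC has to be */
    PROC_QUEUE *waitQueue = &(lock->waitProcs);
    PGPROC     *proc = (PGPROC *) &(waitQueue->links);
    SHMQueueInsertBefore(&(proc->links), &(MyProc->links));

    /* instead: just carry the SHM_QUEUE position around */
    SHM_QUEUE  *insertBefore = &(waitQueue->links);
    SHMQueueInsertBefore(insertBefore, &(MyProc->links));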

Greetings,

Andres Freund




Re: logical decoding and replication of sequences, take 2

2022-11-16 Thread Tomas Vondra



On 11/16/22 22:05, Robert Haas wrote:
> On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
>  wrote:
>> The other option might be to make these messages non-transactional, in
>> which case we'd separate the ordering from COMMIT ordering, evading the
>> reordering problem.
>>
>> That'd mean we'd ignore rollbacks (which seems fine), we could probably
>> optimize this by checking if the state actually changed, etc. But we'd
>> also need to deal with transactions created in the (still uncommitted)
>> transaction. But I'm also worried it might lead to the same issue with
>> non-transactional behaviors that forced revert in v15.
> 
> I think it might be a good idea to step back slightly from
> implementation details and try to agree on a theoretical model of
> what's happening here. Let's start by banishing the words
> transactional and non-transactional from the conversation and talk
> about what logical replication is trying to do.
> 

OK, let's try.

> We can imagine that the replicated objects on the primary pass through
> a series of states S1, S2, ..., Sn, where n keeps going up as new
> state changes occur. The state, for our purposes here, is the contents
> of the database as they could be observed by a user running SELECT
> queries at some moment in time chosen by the user. For instance, if
> the initial state of the database is S1, and then the user executes
> BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
> state that differs from S1 in that both of those rows are now part of
> the database contents. There is no state where one of those rows is
> visible and the other is not. That was never observable by the user,
> except from within the transaction as it was executing, which we can
> and should discount. I believe that the goal of logical replication is
> to bring about a state of affairs where the set of states observable
> on the standby is a subset of the states observable on the primary.
> That is, if the primary goes from S1 to S2 to S3, the standby can do
> the same thing, or it can go straight from S1 to S3 without ever
> making it possible for the user to observe S2. Either is correct
> behavior. But the standby cannot invent any new states that didn't
> occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
> S3, or something like that. It can only consolidate changes that
> occurred separately on the primary, never split them up. Neither can
> it reorder them.
> 

I mostly agree, and in a way the last patch aims to do roughly this,
i.e. make sure that the state after each transaction matches the state a
user might observe on the primary (modulo implementation challenges).

There's a couple of caveats, though:

1) Maybe we should focus more on "actually observed" state instead of
"observable". Who cares if the sequence moved forward in a transaction
that was ultimately rolled back? No committed transaction should have
observed those values - in a way, the last "valid" state of the sequence
is the last value generated in a transaction that ultimately committed.

2) I think what matters more is that we never generate a duplicate value.
That is, if you generate a value from a sequence, commit a transaction
and replicate it, then the logical standby should not generate the same
value from the sequence. This guarantee seems necessary for "failover"
to logical standby.

> Now, if you accept this as a reasonable definition of correctness,
> then the next question is what consequences it has for transactional
> and non-transactional behavior. If all behavior is transactional, then
> we've basically got to replay each primary transaction in a single
> standby transaction, and commit those transactions in the same order
> that the corresponding primary transactions committed. We could
> legally choose to merge a group of transactions that committed one
> after the other on the primary into a single transaction on the
> standby, and it might even be a good idea if they're all very tiny,
> but it's not required. But if there are non-transactional things
> happening, then there are changes that become visible at some time
> other than at a transaction commit. For example, consider this
> sequence of events, in which each "thing" that happens is
> transactional except where the contrary is noted:
> 
> T1: BEGIN;
> T2: BEGIN;
> T1: Do thing 1;
> T2: Do thing 2;
> T1: Do a non-transactional thing;
> T1: Do thing 3;
> T2: Do thing 4;
> T2: COMMIT;
> T1: COMMIT;
> 
> From the point of the user here, there are 4 observable states here:
> 
> S1: Initial state.
> S2: State after the non-transactional thing happens.
> S3: State after T2 commits (reflects the non-transactional thing plus
> things 2 and 4).
> S4: State after T1 commits.
> 
> Basically, the non-transactional thing behaves a whole lot like a
> separate transaction. That non-transactional operation ought to be
> replicated before T2, which ought to be replicated before T1. Maybe
> logical replication ought to treat 

Re: Fix the README file for MERGE command

2022-11-16 Thread Richard Guo
On Wed, Nov 16, 2022 at 4:37 PM Waithant Myo (Fujitsu) <
myo.waith...@fujitsu.com> wrote:

> The actions of MERGE command can be specified as follows: INSERT, UPDATE,
> DELETE and DO NOTHING.
>
> However, in the README file, the ‘UPDATE’ word is described 2 times
> instead of ‘DELETE’.
>
>
>
> Therefore, I attached the patch file which fix this word usage.
>
> It would be great if you could take a look at it.
>

Apparently this is a typo.  Good catch! +1.

Thanks
Richard


Re: when the startup process doesn't (logging startup delays)

2022-11-16 Thread Thomas Munro
On Thu, Nov 17, 2022 at 7:51 AM Robert Haas  wrote:
+ * up, since standby mode is a state that is intendeded to persist

typo

Otherwise LGTM.




Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-16 Thread Peter Geoghegan
On Wed, Nov 16, 2022 at 4:25 PM Andres Freund  wrote:
> > Anyway, worth calling this out directly in these comments IMV. We're
> > addressing two closely related things that assign opposite meanings to
> > InvalidTransactionId, which is rather confusing.
>
> It makes sense to call this out, but I'd
> s/snapshotConflictHorizon format XIDs/cutoff with snapshotConflictHorizon 
> semantics/
>
> or such?

WFM.

-- 
Peter Geoghegan




Re: libpq compression (part 2)

2022-11-16 Thread Andrey Borodin
On Tue, Nov 15, 2022 at 7:17 PM Justin Pryzby  wrote:
>

Also I've found one more TODO item for the patch. Currently in
fe-connect.c the patch is doing a buffer reset:

    conn->inStart = conn->inCursor = conn->inEnd = 0;

This effectively consumes bytes up to the current cursor. However, packet
inspection is continued. The patch works because in most cases the
following code will issue a re-read of the message. Coincidentally.

    /* Get the type of request. */
    if (pqGetInt((int *) &areq, 4, conn))
    {
        /* We'll come back when there are more data */
        return PGRES_POLLING_READING;
    }

But I think we need a proper
goto keep_going;
to start from the beginning of the message.


Thank you!

Best regards, Andrey Borodin.




Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 15:37:40 -0800, Peter Geoghegan wrote:
> On Wed, Nov 16, 2022 at 3:27 PM Andres Freund  wrote:
> > What are "snapshotConflictHorizon format XIDs"? I guess you mean format in 
> > the
> > sense of having the semantics of snapshotConflictHorizon?
> 
> Yes. That is the only possible way that any recovery conflict ever
> works on the REDO side, with the exception of a few
> not-very-interesting cases such as DROP TABLESPACE.
> 
> GetConflictingVirtualXIDs() assigns a special meaning to
> InvalidTransactionId which is the *opposite* of the special meaning
> that snapshotConflictHorizon-based values assign to
> InvalidTransactionId. At one point they actually did the same
> definition for InvalidTransactionId, but that was changed soon after
> hot standby first went in (when we taught btree delete records to not
> use ludicrously conservative cutoffs that caused needless conflicts).
> 
> Anyway, worth calling this out directly in these comments IMV. We're
> addressing two closely related things that assign opposite meanings to
> InvalidTransactionId, which is rather confusing.

It makes sense to call this out, but I'd
s/snapshotConflictHorizon format XIDs/cutoff with snapshotConflictHorizon 
semantics/

or such?

Greetings,

Andres Freund




Re: About displaying NestLoopParam

2022-11-16 Thread Tom Lane
Greg Stark  writes:
> I do note that the code in question was added in this commit in 2010.
> That predates the addition of LATERAL in 2013.

Yeah.  It's pretty clear from the comments that I was concerned about
false matches of PARAM_EXEC numbers.  I think that was a live issue
at the time but is so no longer, cf. 46c508fbc and 1db5667ba.
The possibility of LATERAL references is what makes it interesting
to search higher in the plan tree, so there wasn't any real reason to
take any risk of a false match.

I think I might've also been concerned about printing misleading
names for any Vars we did find, due to them belonging to a different
query level.  That's probably a dead issue too now that ruleutils
assigns unique aliases to all RTEs in the query (I'm not sure if
it did at the time).

Looking at this now, it seems a little weird to me that we allow
LATERAL values to be passed down directly into the subplan rather
than having them go through the parParam mechanism.  (If they did,
ruleutils' restriction would be fine.)  I don't know of a reason
to change that, though.

> I suppose those
> comments may be talking about InitPlans for things like constant
> subqueries that have been pulled up to InitPlans in queries like:
> explain verbose select * from x join y on (x.i=y.j) where y.j+1=(select 5) ;
> Which your patch doesn't eliminate the $0 in.

No, because that $0 is for a subplan/initplan output, which we don't
have any other sort of name for.  Your example produces output that
explains what it is:

   InitPlan 1 (returns $0)
 ...
   Filter: ((y.j + 1) = $0)

although I'm not sure that we document that anywhere user-facing.

> Fwiw your patch applied for me and built without warnings and seems to
> work for all the queries I've thrown at it so far. That's hardly an
> exhaustive test of course.

I'm content to apply this (although I quibble with removal of some
of the commentary).  Worst case, somebody will find an example where
it produces wrong/misleading output, and we can revert it.  But
the regression test changes show that it does produce useful output
in at least some cases.

regards, tom lane




Re: PATCH: Using BRIN indexes for sorted output

2022-11-16 Thread Tomas Vondra
On 11/16/22 22:52, Greg Stark wrote:
> Fwiw tuplesort does do something like what you want for the top-k
> case. At least it used to last I looked -- not sure if it went out
> with the tapesort ...
>
> For top-k it inserts new tuples into the heap data structure and then
> pops the top element out of the heap. That keeps a fixed number of
> elements in the heap. It's always inserting and removing at the same
> time. I don't think it would be very hard to add a tuplesort interface
> to access that behaviour.
> 


Bounded sorts are still there, implemented using a heap (which is what
you're talking about, I think). I actually looked at it some time ago,
and it didn't look like a particularly good match for the general case
(without explicit LIMIT). Bounded sorts require specifying number of
tuples, and then discard the remaining tuples. But you don't know how
many tuples you'll actually find until the next minval - you have to
keep them all.

Maybe we could feed the tuples into a (sorted) heap incrementally, and
consume tuples until the next minval value. I'm not against exploring
that idea, but it certainly requires more work than just slapping some
interface to existing code.

> For something like BRIN you would sort the ranges by minvalue then
> insert all the tuples for each range. Before inserting tuples for a
> new range you would first pop out all the tuples that are < the
> minvalue for the new range.
> 

Well, yeah. That's pretty much exactly what the last version of this
patch (from October 23) does.

> I'm not sure how you handle degenerate BRIN indexes that behave
> terribly. Like, if many BRIN ranges covered the entire key range.
> Perhaps there would be a clever way to spill the overflow and switch
> to quicksort for the spilled tuples without wasting lots of work
> already done and without being too inefficient.

In two ways:

1) Don't have such BRIN index - if it has many degraded ranges, it's
bound to perform poorly even for WHERE conditions. We've lived with this
until now, I don't think this makes the issue any worse.

2) Improving statistics for BRIN indexes - until now the BRIN costing is
very crude, we have almost no information about how wide the ranges are,
how much they overlap, etc. The 0001 part (discussed in a thread [1])
aims to provide much better statistics. Yes, the costing still doesn't
use that information very much.


regards


[1] https://commitfest.postgresql.org/40/3952/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-16 Thread Peter Geoghegan
On Wed, Nov 16, 2022 at 3:27 PM Andres Freund  wrote:
> The "(also...) formulation seems a bit odd. How about "an obsolescent heap
> tuple that the caller is physically removing, e.g. via HOT pruning or index
> deletion." or such?

Okay, WFM.

> > + * snapshotConflictHorizon format values are how all hot Standby conflicts 
> > are
> > + * generated by REDO routines (at least wherever a granular cutoff is 
> > used).
>
> Not quite parsing for me.

I meant something like: this is a cutoff that works in the same way as
any other cutoff involved with recovery conflicts, in general, with
the exception of those cases that have very coarse grained conflicts,
such as DROP TABLESPACE.

I suppose it would be better to just say the first part. Will fix.

> > + /*
> > +  * It's quite possible that final snapshotConflictHorizon value will 
> > be
> > +  * invalid in final WAL record, indicating that we definitely don't 
> > need to
> > +  * generate a conflict
> > +  */
>
> *the final
>
> Isn't this already described in the header?

Sort of, but arguably it makes sense to call it out specifically.
Though on second thought, yeah, lets just get rid of it.

> What are "snapshotConflictHorizon format XIDs"? I guess you mean format in the
> sense of having the semantics of snapshotConflictHorizon?

Yes. That is the only possible way that any recovery conflict ever
works on the REDO side, with the exception of a few
not-very-interesting cases such as DROP TABLESPACE.

GetConflictingVirtualXIDs() assigns a special meaning to
InvalidTransactionId which is the *opposite* of the special meaning
that snapshotConflictHorizon-based values assign to
InvalidTransactionId. At one point they actually did the same
definition for InvalidTransactionId, but that was changed soon after
hot standby first went in (when we taught btree delete records to not
use ludicrously conservative cutoffs that caused needless conflicts).

Anyway, worth calling this out directly in these comments IMV. We're
addressing two closely related things that assign opposite meanings to
InvalidTransactionId, which is rather confusing.

-- 
Peter Geoghegan




Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 14:14:30 -0800, Peter Geoghegan wrote:
>  /*
> - * If 'tuple' contains any visible XID greater than latestRemovedXid,
> - * ratchet forwards latestRemovedXid to the greatest one found.
> - * This is used as the basis for generating Hot Standby conflicts, so
> - * if a tuple was never visible then removing it should not conflict
> - * with queries.
> + * Maintain snapshotConflictHorizon for caller by ratcheting forward its 
> value
> + * using any committed XIDs contained in 'tuple', an obsolescent heap tuple
> + * that caller is in the process of physically removing via pruning.
> + * (Also supports generating index deletion snapshotConflictHorizon values.)

The "(also...) formulation seems a bit odd. How about "an obsolescent heap
tuple that the caller is physically removing, e.g. via HOT pruning or index
deletion." or such?


> + * snapshotConflictHorizon format values are how all hot Standby conflicts 
> are
> + * generated by REDO routines (at least wherever a granular cutoff is used).

Not quite parsing for me.

> + * Caller must initialize its value to InvalidTransactionId, which is 
> generally
> + * interpreted as "definitely no need for a recovery conflict".
> + *
> + * Final value must reflect all heap tuples that caller will physically 
> remove
> + * via the ongoing pruning operation.  ResolveRecoveryConflictWithSnapshot() 
> is
> + * passed the final value (taken from caller's WAL record) by a REDO routine.

> + /*
> +  * It's quite possible that final snapshotConflictHorizon value will be
> +  * invalid in final WAL record, indicating that we definitely don't 
> need to
> +  * generate a conflict
> +  */

*the final

Isn't this already described in the header?


> @@ -3337,12 +3337,17 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool 
> excludeXmin0,
>   * GetConflictingVirtualXIDs -- returns an array of currently active VXIDs.
>   *
>   * Usage is limited to conflict resolution during recovery on standby 
> servers.
> - * limitXmin is supplied as either latestRemovedXid, or InvalidTransactionId
> - * in cases where we cannot accurately determine a value for 
> latestRemovedXid.
> + * limitXmin is supplied as either a snapshotConflictHorizon format XID, or 
> as
> + * InvalidTransactionId in cases where caller cannot accurately determine a
> + * safe snapshotConflictHorizon value.
>   *
>   * If limitXmin is InvalidTransactionId then we want to kill everybody,
>   * so we're not worried if they have a snapshot or not, nor does it really
> - * matter what type of lock we hold.
> + * matter what type of lock we hold.  Caller must avoid calling here with
> + * snapshotConflictHorizon format XIDs that were set to InvalidTransactionId

What are "snapshotConflictHorizon format XIDs"? I guess you mean format in the
sense of having the semantics of snapshotConflictHorizon?



Greetings,

Andres Freund




Re: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-16 Thread Justin Pryzby
Note that the patch is passing tests when using autoconf build but
failing for meson builds.

http://cfbot.cputube.org/sandro-santilli.html

I imagine you need to make the corresponding change to ./meson.build
that you made to ./Makefile.

https://wiki.postgresql.org/wiki/Meson_for_patch_authors
https://wiki.postgresql.org/wiki/Meson

You can test under cirrusci from your github account, with instructions
at: ./src/tools/ci/README

Cheers,
-- 
Justin




RE: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-16 Thread Regina Obe
> On Sun, Nov 13, 2022 at 11:46:50PM -0500, Regina Obe wrote:
> > > Re: Sandro Santilli
> > > > I'm attaching an updated version of the patch. This time the patch
> > > > is tested. Nothing changes unless the .control file for the
> > > > subject extension doesn't have a "wildcard_upgrades = true"
statement.
> > > >
> > > > When wildcard upgrades are enabled, a file with a "%" symbol as
> > > > the "source" part of the upgrade path will match any version and
> > >
> > > Fwiw I believe wildcard_upgrades isn't necessary in the .control file.
> > > If there are no % files, none would be used anyway, and if there
> > > are, it's
> > clear
> > > it's meant as wildcard since % won't appear in any remotely sane
> > > version number.
> >
> > I also like the idea of skipping the wildcard_upgrades syntax.
> > Then there is no need to have a conditional control file for PG 16 vs.
> > older versions.
> 
> Here we go. Attached a version of the patch with no "wildcard_upgrades"
> controlling it.
> 
> --strk;

I think you should increment the version number on the file name of this
patch.
You had one earlier called 0001-...
The one before that was missing a version number entirely.

Maybe call this 0003-...

Thanks,
Regina





Re: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-16 Thread 'Sandro Santilli'
On Sun, Nov 13, 2022 at 11:46:50PM -0500, Regina Obe wrote:
> > Re: Sandro Santilli
> > > I'm attaching an updated version of the patch. This time the patch is
> > > tested. Nothing changes unless the .control file for the subject
> > > extension doesn't have a "wildcard_upgrades = true" statement.
> > >
> > > When wildcard upgrades are enabled, a file with a "%" symbol as the
> > > "source" part of the upgrade path will match any version and
> > 
> > Fwiw I believe wildcard_upgrades isn't necessary in the .control file.
> > If there are no % files, none would be used anyway, and if there are, it's
> clear
> > it's meant as wildcard since % won't appear in any remotely sane version
> > number.
> 
> I also like the idea of skipping the wildcard_upgrades syntax.
> Then there is no need to have a conditional control file for PG 16 vs. older
> versions.

Here we go. Attached a version of the patch with no
"wildcard_upgrades" controlling it.

--strk;
>From 9b138eae95e0d389bee3776247ba9d7d5144bcc5 Mon Sep 17 00:00:00 2001
From: Sandro Santilli 
Date: Wed, 14 Sep 2022 11:10:10 +0200
Subject: [PATCH] Allow wildcard (%) in extension upgrade paths

A wildcard character "%" will be accepted in the
"source" side of the upgrade script and be considered
usable to upgrade any version to the "target" side.

Includes regression test and documentation.
---
 doc/src/sgml/extend.sgml  |  8 
 src/backend/commands/extension.c  | 42 ---
 src/test/modules/test_extensions/Makefile |  6 ++-
 .../expected/test_extensions.out  | 15 +++
 .../test_extensions/sql/test_extensions.sql   |  7 
 .../test_ext_wildcard1--%--2.0.sql|  6 +++
 .../test_ext_wildcard1--1.0.sql   |  6 +++
 .../test_ext_wildcard1.control|  3 ++
 8 files changed, 85 insertions(+), 8 deletions(-)
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 46e873a166..c79140f669 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -1081,6 +1081,14 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr
  1.1).
 
 
+
+ The literal value % can be used as the
+ old_version component in an extension
+ update script for it to match any version. Such wildcard update
+ scripts will only be used when no explicit path is found from
+ old to target version.
+
+
 
  Given that a suitable update script is available, the command
  ALTER EXTENSION UPDATE will update an installed extension
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 1a62e5dac5..e3ea9dba30 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -128,6 +128,7 @@ static void ApplyExtensionUpdates(Oid extensionOid,
   bool cascade,
   bool is_create);
 static char *read_whole_file(const char *filename, int *length);
+static bool file_exists(const char *name);
 
 
 /*
@@ -890,7 +891,14 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	if (from_version == NULL)
 		elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version);
 	else
+	{
+		if ( ! file_exists(filename) )
+		{
+			/* if filename does not exist, try wildcard */
+			filename = get_extension_script_filename(control, "%", version);
+		}
 		elog(DEBUG1, "executing extension script for \"%s\" update from version '%s' to '%s'", control->name, from_version, version);
+	}
 
 	/*
 	 * If installing a trusted extension on behalf of a non-superuser, become
@@ -1214,14 +1222,19 @@ identify_update_path(ExtensionControlFile *control,
 
 	/* Find shortest path */
 	result = find_update_path(evi_list, evi_start, evi_target, false, false);
+	if (result != NIL)
+		return result;
 
-	if (result == NIL)
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"",
-		control->name, oldVersion, newVersion)));
+	/* Find wildcard path, if no explicit path was found */
+	evi_start = get_ext_ver_info("%", &evi_list);
+	result = find_update_path(evi_list, evi_start, evi_target, false, false);
+	if (result != NIL)
+		return result;
 
-	return result;
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"",
+	control->name, oldVersion, newVersion)));
 }
 
 /*
@@ -3392,3 +3405,20 @@ read_whole_file(const char *filename, int *length)
 	buf[*length] = '\0';
 	return buf;
 }
+
+static bool
+file_exists(const char *name)
+{
+	struct stat st;
+
+	Assert(name != NULL);
+
+	if (stat(name, &st) == 0)
+	

Re: Suppressing useless wakeups in walreceiver

2022-11-16 Thread Thomas Munro
On Wed, Nov 16, 2022 at 5:24 PM Nathan Bossart  wrote:
> On Wed, Nov 16, 2022 at 04:57:08PM +1300, Thomas Munro wrote:
> > On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart  
> > wrote:
> >> Another option might be to just force initial reply/feedback messages when
> >> streaming starts.  The attached patch improves src/test/recovery test
> >> runtime just like the previous one I posted.
> >
> > Yeah, looks good in my tests.  I think we just need to make it
> > conditional so we don't force it if someone has
> > wal_receiver_status_interval disabled.
>
> Yeah, that makes sense.  IIUC setting "force" to false would have the same
> effect for the initial round of streaming, but since writePtr/flushPtr will
> be set in later rounds, no reply would be guaranteed.  That might be good
> enough for the tests, but it seems a bit inconsistent.  So, your patch is
> probably the way to go.

On second thoughts, there's not much point in that, since we always
force replies under various other circumstances anyway.  There isn't
really a 'never send replies' mode.  Committed the way you had it
before.  Thanks!

I can't unsee that we're spending ~35 seconds waiting for catchup in
718 separate wait_for_catchup calls.  The problem is entirely on the
waiting side (we already do WalRcvForceReply() in xlogrecovery.c when
going idle), and there's probably a nice condition variable-based
improvement here but I decided to hop over that rabbit hole today.




Re: Index not getting cleaned even though vacuum is running

2022-11-16 Thread Matheus Alcantara
Hi

--- Original Message ---
On Tuesday, November 15th, 2022 at 12:38, Karthik Jagadish (kjagadis) 
 wrote:


> Hi,
> 
> We notice that vacuum is happening at regular intervals but the space 
> occupied by indexes is always increasing. Any pointers as to why would this 
> happen?
> 
> Some outputs below. Auto vacuum is enabled but we notice index size is 
> growing.
> 
> $ psql -U postgres -d cgms -c "SELECT 
> pg_size_pretty(SUM(pg_relation_size(table_schema||'.'||table_name))) as size 
> from information_schema.tables"
> 
> size
> 
> ---
> 
> 25 GB
> 
> (1 row)
> 
> $ psql -U postgres -d cgms -c "SELECT 
> pg_size_pretty(SUM(pg_indexes_size(table_schema||'.'||table_name) + 
> pg_relation_size(table_schema||'.'||table_name))) as size from 
> information_schema.tables"
> 
>   size
> 
> 
> 
> 151 GB
> 
> (1 row)
> 
> $ sudo du -hsc /var/lib/pgsql/12/data
> 
> 154G    /var/lib/pgsql/12/data
> 
> 154G    total
> 
> Appreciate if someone can give some pointers.
> 
> Regards,
> 
> Karthik

As far as I know vacuum just marks the space of dead rows as available for future
reuse, so I think it's expected that the size doesn't decrease.


"The standard form of VACUUM removes dead row versions in tables and indexes
and marks the space available for future reuse. However, it will not return the
space to the operating system, except in the special case where one or more
pages at the end of a table become entirely free and an exclusive table lock
can be easily obtained. In contrast, VACUUM FULL actively compacts tables by
writing a complete new version of the table file with no dead space. This
minimizes the size of the table, but can take a long time. It also requires
extra disk space for the new copy of the table, until the operation completes."

https://www.postgresql.org/docs/current/routine-vacuuming.html
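
If the goal is to actually give the space back, something along these lines
is usually needed (table/index names are placeholders):

    -- rebuild a bloated index without blocking writes (PostgreSQL 12+):
    REINDEX INDEX CONCURRENTLY some_bloated_index;

    -- or rewrite the table and its indexes entirely (takes an exclusive
    -- lock and needs temporary extra disk space):
    VACUUM FULL some_table;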




--
Matheus Alcantara






Re: Allow single table VACUUM in transaction block

2022-11-16 Thread Tom Lane
Greg Stark  writes:
> I think this requesting autovacuum worker should be a distinct
> command. Or at least an explicit option to vacuum.

+1.  That'd reduce confusion, and perhaps we could remove some
of the restrictions.

regards, tom lane




Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-16 Thread Peter Geoghegan
On Tue, Nov 15, 2022 at 8:48 PM Peter Geoghegan  wrote:
> Okay, let's go with snapshotConflictHorizon. I'll use that name in the
> next revision, which I should be able to post tomorrow.

Attached is a somewhat cleaned up version that uses that symbol name
for everything.

--
Peter Geoghegan


v2-0001-Standardize-rmgrdesc-recovery-conflict-XID-output.patch
Description: Binary data


Re: Allow single table VACUUM in transaction block

2022-11-16 Thread Greg Stark
I think the idea of being able to request an autovacuum worker for a
specific table is actually very good. I think it's what most users
actually want when they are running vacuum. In fact in previous jobs
people have built infrastructure that basically duplicates autovacuum
just so they could do this.

However I'm not a fan of commands that sometimes do one thing and
sometimes magically do something very different. I don't like the idea
that the same vacuum command would sometimes run in-process and
sometimes do this out of process request. And the rules for when it
does which are fairly complex to explain -- it runs in process unless
you're in a transaction when it runs out of process unless it's a
temporary table ...

I think this requesting autovacuum worker should be a distinct
command. Or at least an explicit option to vacuum.

Also, I was confused reading the thread above about mention of VACUUM
FULL. I don't understand why it's relevant at all. We certainly can't
force VACUUM FULL when it wasn't requested on potentially large
tables.




Re: PATCH: Using BRIN indexes for sorted output

2022-11-16 Thread Greg Stark
Fwiw tuplesort does do something like what you want for the top-k
case. At least it used to last I looked -- not sure if it went out
with the tapesort ...

For top-k it inserts new tuples into the heap data structure and then
pops the top element out of the heap. That keeps a fixed number of
elements in the heap. It's always inserting and removing at the same
time. I don't think it would be very hard to add a tuplesort interface
to access that behaviour.

For something like BRIN you would sort the ranges by minvalue then
insert all the tuples for each range. Before inserting tuples for a
new range you would first pop out all the tuples that are < the
minvalue for the new range.

I'm not sure how you handle degenerate BRIN indexes that behave
terribly. Like, if many BRIN ranges covered the entire key range.
Perhaps there would be a clever way to spill the overflow and switch
to quicksort for the spilled tuples without wasting lots of work
already done and without being too inefficient.




Re: [DOCS] Stats views and functions not in order?

2022-11-16 Thread David G. Johnston
On Tue, Nov 15, 2022 at 6:39 PM Peter Smith  wrote:

>
> I was also wondering (but have not yet done) if the content *outside*
> the tables should be reordered to match the table 28.1/28.2 order.
>
> Thoughts?
>
>
I would love to do away with the ToC listing of view names in 28.2
altogether.

Also, make it so each view ends up being its own separate page.

The name of the views in the table should then be the hyperlinks to those
pages.

Basically the way Chapter 54.1 works.  Though the interplay between the top
Chapter 54 and 54.1 is a bit repetitive.

https://www.postgresql.org/docs/current/views.html

I wonder whether having the table be structured but the ToC be purely
alphabetical would be considered a good idea...

The tables need hyperlinks regardless.  I wouldn't insist on changing the
ordering to match the table, especially with the hyperlinks, but I also
wouldn't reject it.  Figuring out how to make them one-per-page would be
time better spent though.

David J.


Re: Meson add host_system to PG_VERSION_STR

2022-11-16 Thread Juan José Santamaría Flecha
On Wed, Nov 16, 2022 at 8:02 PM Andres Freund  wrote:

>
> Given we're looking at improving this, should we also add 32/64-bit piece?
>
> If so, we probably should move building PG_VERSION_STR to later so we can
> use
> SIZEOF_VOID_P - configure.ac does that too.
>
> With extra_version set to -andres the attached results in:
>
> PostgreSQL 16devel-andres on x86_64-linux, compiled by gcc-13.0.0, 64-bit
>

WFM.

Regards,

Juan José Santamaría Flecha


ScanSourceDatabasePgClass

2022-11-16 Thread David Christensen
Hi Robert,

In 9c08aea6a you introduce the block-by-block strategy for creating a
copy of the database.  In the main loop, it makes this call:

buf = ReadBufferWithoutRelcache(rlocator, MAIN_FORKNUM, blkno,
                                RBM_NORMAL, bstrategy, false);

Here, the last parameter is "false" for the permanence factor of this
relation.  Since we know that pg_class is in fact a permanent
relation, this ends up causing issues for the TDE patches that I am
working on updating, due using the opposite value when calculating the
page's IV and thus failing the decryption when trying to create a
database based on template0.

Is there a reason why this needs to be "false" here?  I recognize that
this routine is accessing the table outside of a formal connection, so
there might be more subtle effects that I am not aware of.  If so this
should be documented.  If it's an oversight, I think we should change it
to "true" to match the actual permanence state of the relation.

I did test changing it to true and didn't notice any adverse effects
in `make installcheck-world`, but let me know if there is more to this
story than meets the eye.

(I did review the original discussion at
https://www.postgresql.org/message-id/flat/CA%2BTgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T%3DCYCvRyFHywSBUQ%40mail.gmail.com
and did not notice any discussion of this specific parameter choice.)

Thanks,

David




Re: About displaying NestLoopParam

2022-11-16 Thread Greg Stark
So I guess I don't have much to add since I don't really understand
the Param infrastructure, certainly not any better than you seem to.

I do note that the code in question was added in this commit in 2010.
That predates the addition of LATERAL in 2013. I suppose those
comments may be talking about InitPlans for things like constant
subqueries that have been pulled up to InitPlans in queries like:

explain verbose select * from x join y on (x.i=y.j) where y.j+1=(select 5) ;

Which your patch doesn't eliminate the $0 in. I don't know if the code
you're removing is just for efficiency -- to avoid trawling through
nodes of the plan that can't be relevant -- or for correctness.

Fwiw your patch applied for me and built without warnings and seems to
work for all the queries I've thrown at it so far. That's hardly an
exhaustive test of course.

commit 1cc29fe7c60ba643c114979dbe588d3a38005449
Author: Tom Lane 
Date:   Tue Jul 13 20:57:19 2010 +

Teach EXPLAIN to print PARAM_EXEC Params as the referenced expressions,
rather than just $N.  This brings the display of nestloop-inner-indexscan
plans back to where it's been, and incidentally improves the display of
SubPlan parameters as well.  In passing, simplify the EXPLAIN code by
having it deal primarily in the PlanState tree rather than separately
searching Plan and PlanState trees.  This is noticeably cleaner for
subplans, and about a wash elsewhere.

One small difference from previous behavior is that EXPLAIN will no longer
qualify local variable references in inner-indexscan plan nodes, since it
no longer sees such nodes as possibly referencing multiple tables.  Vars
referenced through PARAM_EXEC Params are still forcibly qualified, though,
so I don't think the display is any more confusing than before.  Adjust a
couple of examples in the documentation to match this behavior.

On Tue, 20 Sept 2022 at 05:00, Richard Guo  wrote:
>
>
> On Fri, Sep 16, 2022 at 5:59 PM Richard Guo  wrote:
>>
>> I'm not saying this is a bug, but just curious why param 0 cannot be
>> displayed as the referenced expression. And I find the reason is that in
>> function find_param_referent(), we have the 'in_same_plan_level' flag
>> controlling that if we have emerged from a subplan, i.e. not the same
>> plan level any more, we would not look further for the matching
>> NestLoopParam. Param 0 suits this situation.
>>
>> And there is a comment there also saying,
>>
>> /*
>>  * NestLoops transmit params to their inner child only; also, once
>>  * we've crawled up out of a subplan, this couldn't possibly be
>>  * the right match.
>>  */
>
>
> After thinking of this for more time, I still don't see the reason why
> we cannot display NestLoopParam after we've emerged from a subplan.
>
> It seems these params are from parameterized subqueryscan and their
> values are supplied by an upper nestloop. These params should have been
> processed in process_subquery_nestloop_params() that we just add the
> PlannerParamItem entries to root->curOuterParams, in the form of
> NestLoopParam, using the same PARAM_EXEC slots.
>
> So I propose the patch attached to remove the 'in_same_plan_level' flag
> so that we can display NestLoopParam across subplan. Please correct me
> if I'm wrong.
>
> Thanks
> Richard



-- 
greg




Re: logical decoding and replication of sequences, take 2

2022-11-16 Thread Robert Haas
On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
 wrote:
> The other option might be to make these messages non-transactional, in
> which case we'd separate the ordering from COMMIT ordering, evading the
> reordering problem.
>
> That'd mean we'd ignore rollbacks (which seems fine), we could probably
> optimize this by checking if the state actually changed, etc. But we'd
> also need to deal with transactions created in the (still uncommitted)
> transaction. But I'm also worried it might lead to the same issue with
> non-transactional behaviors that forced revert in v15.

I think it might be a good idea to step back slightly from
implementation details and try to agree on a theoretical model of
what's happening here. Let's start by banishing the words
transactional and non-transactional from the conversation and talk
about what logical replication is trying to do.

We can imagine that the replicated objects on the primary pass through
a series of states S1, S2, ..., Sn, where n keeps going up as new
state changes occur. The state, for our purposes here, is the contents
of the database as they could be observed by a user running SELECT
queries at some moment in time chosen by the user. For instance, if
the initial state of the database is S1, and then the user executes
BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
state that differs from S1 in that both of those rows are now part of
the database contents. There is no state where one of those rows is
visible and the other is not. That was never observable by the user,
except from within the transaction as it was executing, which we can
and should discount. I believe that the goal of logical replication is
to bring about a state of affairs where the set of states observable
on the standby is a subset of the states observable on the primary.
That is, if the primary goes from S1 to S2 to S3, the standby can do
the same thing, or it can go straight from S1 to S3 without ever
making it possible for the user to observe S2. Either is correct
behavior. But the standby cannot invent any new states that didn't
occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
S3, or something like that. It can only consolidate changes that
occurred separately on the primary, never split them up. Neither can
it reorder them.

Now, if you accept this as a reasonable definition of correctness,
then the next question is what consequences it has for transactional
and non-transactional behavior. If all behavior is transactional, then
we've basically got to replay each primary transaction in a single
standby transaction, and commit those transactions in the same order
that the corresponding primary transactions committed. We could
legally choose to merge a group of transactions that committed one
after the other on the primary into a single transaction on the
standby, and it might even be a good idea if they're all very tiny,
but it's not required. But if there are non-transactional things
happening, then there are changes that become visible at some time
other than at a transaction commit. For example, consider this
sequence of events, in which each "thing" that happens is
transactional except where the contrary is noted:

T1: BEGIN;
T2: BEGIN;
T1: Do thing 1;
T2: Do thing 2;
T1: Do a non-transactional thing;
T1: Do thing 3;
T2: Do thing 4;
T2: COMMIT;
T1: COMMIT;

From the point of the user here, there are 4 observable states here:

S1: Initial state.
S2: State after the non-transactional thing happens.
S3: State after T2 commits (reflects the non-transactional thing plus
things 2 and 4).
S4: State after T1 commits.

Basically, the non-transactional thing behaves a whole lot like a
separate transaction. That non-transactional operation ought to be
replicated before T2, which ought to be replicated before T1. Maybe
logical replication ought to treat it in exactly that way: as a
separate operation that needs to be replicated after any earlier
transactions that completed prior to the history shown here, but
before T2 or T1. Alternatively, you can merge the non-transactional
change into T2, i.e. the first transaction that committed after it
happened. But you can't merge it into T1, even though it happened in
T1. If you do that, then you're creating states on the standby that
never existed on the primary, which is wrong. You could argue that
this is just nitpicking: who cares if the change in the sequence value
doesn't get replicated at exactly the right moment? But I don't think
it's a technicality at all: I think if we don't make the operation
appear to happen at the same point in the sequence as it became
visible on the master, then there will be endless artifacts and corner
cases to the bottom of which we will never get. Just like if we
replicated the actual transactions out of order, chaos would ensue,
because there can be logical dependencies between them, so too can
there be logical dependencies between non-transactional operations, or
between a 

Re: Making Vars outer-join aware

2022-11-16 Thread Tom Lane
Richard Guo  writes:
> I'm reviewing the part about multiple version clauses, and I find a case
> that may not work as expected.  I tried with some query as below
>  (A leftjoin (B leftjoin C on (Pbc)) on (Pab)) left join D on (Pcd)
> Assume Pbc is strict for B and Pcd is strict for C.
> According to identity 3, we know one of its equivalent form is
>  ((A leftjoin B on (Pab)) leftjoin C on (Pbc)) left join D on (Pcd)
> For outer join clause Pcd, we would generate two versions from the first
> form
> Version 1: C Vars with nullingrels as {A/B}
> Version 2: C Vars with nullingrels as {B/C, A/B}
> I understand version 2 is reasonable as the nullingrels from parser
> would be set as that.  But it seems version 1 is not applicable in
> either form.

Hmm.  Looking at the data structures generated for the first form,
we have

B/C join:

  {SPECIALJOININFO 
  :min_lefthand (b 2)
  :min_righthand (b 3)
  :syn_lefthand (b 2)
  :syn_righthand (b 3)
  :jointype 1 
  :ojrelid 4 
  :commute_above_l (b 7)
  :commute_above_r (b 5)
  :commute_below (b)

A/B join:

  {SPECIALJOININFO 
  :min_lefthand (b 1)
  :min_righthand (b 2)
  :syn_lefthand (b 1)
  :syn_righthand (b 2 3 4)
  :jointype 1 
  :ojrelid 5 
  :commute_above_l (b)
  :commute_above_r (b)
  :commute_below (b 4)

everything-to-D join:

  {SPECIALJOININFO 
  :min_lefthand (b 1 2 3 4 5)
  :min_righthand (b 6)
  :syn_lefthand (b 1 2 3 4 5)
  :syn_righthand (b 6)
  :jointype 1 
  :ojrelid 7 
  :commute_above_l (b)
  :commute_above_r (b)
  :commute_below (b 4)

So we've marked the 4 and 7 joins as possibly commuting, but they
cannot commute according to 7's min_lefthand set.  I don't think
the extra clone condition is terribly harmful --- it's useless
but shouldn't cause any problems.  However, if these joins should be
able to commute then the min_lefthand marking is preventing us
from considering legal join orders (and has been doing so all along,
that's not new in this patch).  It looks to me like they should be
able to commute (giving your third form), so this is a pre-existing
planning deficiency.

Without having looked too closely, I suspect this is coming from
the delay_upper_joins/check_outerjoin_delay stuff in initsplan.c.
That's a chunk of logic that I'd like to nuke altogether, and maybe
we will be able to do so once this patchset is a bit further along.
But I've not had time to look at it yet.

I'm not entirely clear on whether the strange selection of clone
clauses for this example is a bug in process_postponed_left_join_quals
or if that function is just getting misled by the bogus min_lefthand
value.

> Looking at the two forms again, it seems the expected two versions for
> Pcd should be
> Version 1: C Vars with nullingrels as {B/C}
> Version 2: C Vars with nullingrels as {B/C, A/B}
> With this we may have another problem that the two versions are both
> applicable at the C/D join according to clause_is_computable_at(), in
> both forms.

At least when I tried it just now, clause_is_computable_at correctly
rejected the first version, because we've already computed A/B when
we are trying to form the C/D join so we expect it to be listed in
varnullingrels.

regards, tom lane




Re: Split index and table statistics into different types of stats

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-15 10:48:40 +0100, Drouvot, Bertrand wrote:
> diff --git a/src/backend/utils/adt/pgstatfuncs.c 
> b/src/backend/utils/adt/pgstatfuncs.c
> index ae3365d917..be7f175bf1 100644
> --- a/src/backend/utils/adt/pgstatfuncs.c
> +++ b/src/backend/utils/adt/pgstatfuncs.c
> @@ -36,24 +36,34 @@
>  
>  #define HAS_PGSTAT_PERMISSIONS(role)  (has_privs_of_role(GetUserId(), 
> ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
>  
> +#define PGSTAT_FETCH_STAT_ENTRY(entry, stat_name) ((entry == NULL) ? 0 : 
> (int64) (entry->stat_name))
> +
>  Datum
> -pg_stat_get_numscans(PG_FUNCTION_ARGS)
> +pg_stat_get_index_numscans(PG_FUNCTION_ARGS)
>  {
>   Oid relid = PG_GETARG_OID(0);
>   int64   result;
> - PgStat_StatTabEntry *tabentry;
> + PgStat_StatIndEntry *indentry = pgstat_fetch_stat_indentry(relid);
>  
> - if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
> - result = 0;
> - else
> - result = (int64) (tabentry->numscans);
> + result = PGSTAT_FETCH_STAT_ENTRY(indentry, numscans);
>  
>   PG_RETURN_INT64(result);
>  }

This still leaves a fair bit of boilerplate. ISTM that the function body
really should just be a single line.

Might even be worth defining the whole function via a macro. Perhaps something 
like

PGSTAT_DEFINE_REL_FIELD_ACCESSOR(PGSTAT_KIND_INDEX, pg_stat_get_index, 
numscans);
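
For illustration, such a macro could be spelled roughly like this (a sketch
only, not necessarily what the patch will do; it reuses PGSTAT_FETCH_STAT_ENTRY
and pgstat_fetch_stat_indentry() from the quoted hunk, and ignores the kind
argument, which would presumably select the entry type and fetch function):

#define PGSTAT_DEFINE_REL_FIELD_ACCESSOR(kind, fn_prefix, stat_name) \
	Datum \
	fn_prefix##_##stat_name(PG_FUNCTION_ARGS) \
	{ \
		Oid			relid = PG_GETARG_OID(0); \
		PgStat_StatIndEntry *entry = pgstat_fetch_stat_indentry(relid); \
		PG_RETURN_INT64(PGSTAT_FETCH_STAT_ENTRY(entry, stat_name)); \
	}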

I think the logic to infer which DB oid to use for a stats entry could be
shared between different kinds of stats. We don't need to duplicate it.

Is there any reason to not replace the "double lookup" in
pgstat_fetch_stat_tabentry() with IsSharedRelation()?
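
In sketch form, that suggestion would look something like this (untested,
assuming the existing pgstat_fetch_entry() helper):

PgStat_StatTabEntry *
pgstat_fetch_stat_tabentry(Oid relid)
{
	Oid			dboid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;

	return (PgStat_StatTabEntry *)
		pgstat_fetch_entry(PGSTAT_KIND_RELATION, dboid, relid);
}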


This should probably be done in a preparatory commit.

Greetings,

Andres Freund




Re: predefined role(s) for VACUUM and ANALYZE

2022-11-16 Thread Andrew Dunstan


On 2022-11-15 Tu 00:08, Nathan Bossart wrote:
> On Mon, Nov 14, 2022 at 03:40:04PM -0800, Nathan Bossart wrote:
>> Thanks for taking a look!  Here is a rebased version of the patch set.
> Oops, apparently object_aclcheck() cannot be used for pg_class.  Here is
> another version that uses pg_class_aclcheck() instead.  I'm not sure how I
> missed this earlier.
>

OK, reading the history I think everyone is on board with expanding
AclMode from uint32 to uint64. Is that right? If so I'm intending to
commit at least the first two of these patches fairly soon.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 14:19:32 -0500, Robert Haas wrote:
> I have never really understood why we drive background writer or
> checkpointer statistics through the statistics collector.

To some degree it is required for durability - the stats system needs to know
how to write out those stats. But that wasn't ever a good reason to send
messages to the stats collector - it could just read the stats from shared
memory after all.

There's also integration with snapshots of the stats, resetting them, etc.

There's also the complexity that some of the stats e.g. for checkpointer
aren't about work the checkpointer did, but just have ended up there for
historical raisins. E.g. the number of fsyncs and writes done by backends.

See below:

> Here again, for things like table statistics, there is no choice, because we
> could have an unbounded number of tables and need to keep statistics about
> all of them. The statistics collector can handle that by allocating more
> memory as required. But there is only one background writer and only one
> checkpointer, so that is not needed in those cases. Why not just have them
> expose anything they want to expose through shared memory directly?

That's how it is in 15+. The memory for "fixed-numbered" or "global"
statistics are maintained by the stats system, but in plain shared memory,
allocated at server start. Not via the hash table.

Right now stats updates for the checkpointer use the "changecount" approach to
updates. For now that makes sense, because we update the stats only
occasionally (after a checkpoint or when writing in CheckpointWriteDelay()) -
a stats viewer seeing the checkpoint count go up, without yet seeing the
corresponding buffers written would be misleading.

I don't think we'd want every buffer write or whatnot go through the
changecount mechanism, on some non-x86 platforms that could be noticeable. But
if we didn't stage the stats updates locally I think we could make most of the
stats changes without that overhead.  For updates that just increment a single
counter there's simply no benefit in the changecount mechanism afaict.
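
To spell out the shape of that protocol (an illustrative sketch with made-up
struct and function names, not the actual checkpointer code): the writer
leaves the counter odd while the fields are in flux, and readers retry until
they see the same even value before and after copying the fields.

typedef struct ProgressShmem
{
	uint32		changecount;
	uint64		num_checkpoints;
	uint64		buffers_written;
} ProgressShmem;

static void
progress_update(ProgressShmem *p, uint64 ckpts, uint64 written)
{
	p->changecount++;			/* now odd: update in progress */
	pg_write_barrier();
	p->num_checkpoints = ckpts;
	p->buffers_written = written;
	pg_write_barrier();
	p->changecount++;			/* even again: fields are consistent */
}

static void
progress_read(ProgressShmem *p, uint64 *ckpts, uint64 *written)
{
	uint32		before;
	uint32		after;

	do
	{
		before = p->changecount;
		pg_read_barrier();
		*ckpts = p->num_checkpoints;
		*written = p->buffers_written;
		pg_read_barrier();
		after = p->changecount;
	} while (before != after || (before & 1) != 0);
}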

I didn't want to do that change during the initial shared memory stats work,
it already was bigger than I could handle...


It's not quite clear to me what the best path forward is for
buf_written_backend / buf_fsync_backend, which currently are reported via the
checkpointer stats. I think the best path might be to stop counting them via
the CheckpointerShmem->num_backend_writes etc and just populate the fields in
the view (for backward compat) via the proposed [1] pg_stat_io patch.  Doing
that accounting with CheckpointerCommLock held exclusively isn't free.



> If the statistics collector provides services that we care about, like
> persisting data across restarts or making snapshots for transactional
> behavior, then those might be reasons to go through it even for the
> background writer or checkpointer. But if so, we should be explicit
> about what the reasons are, both in the mailing list discussion and in
> code comments. Otherwise I fear that we'll just end up doing something
> in a more complicated way than is really necessary.

I tried to provide at least some of that in the comments at the start of
pgstat.c in 15+. There's very likely more that should be added, but I think
it's a decent start.

Greetings,

Andres Freund


[1] 
https://www.postgresql.org/message-id/CAOtHd0ApHna7_p6mvHoO%2BgLZdxjaQPRemg3_o0a4ytCPijLytQ%40mail.gmail.com




Re: locale -a missing on Alpine Linux?

2022-11-16 Thread Tom Lane
Christoph Moench-Tegeder  writes:
> ## Peter Eisentraut (peter.eisentr...@enterprisedb.com):
>> First of all, is this a standard installation of this OS, or is perhaps 
>> something incomplete, broken, or unusual about the current OS installation?

> Alpine uses musl libc, on which you need package musl-locales to get
> a /usr/bin/locale.
> https://pkgs.alpinelinux.org/package/edge/community/x86/musl-locales

Ah.  And that also shows that if you didn't install that package,
you don't have any locales either, except presumably C/POSIX.

So probably we should treat failure of the locale command as okay
and just press on with no non-built-in locales.
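
A sketch of what that could look like in pg_import_system_collations()
(illustrative only, not a tested patch; the exact message wording and return
convention are assumptions):

	FILE	   *locale_a_handle = OpenPipeStream("locale -a", "r");

	if (locale_a_handle == NULL)
	{
		ereport(NOTICE,
				(errmsg("could not execute command \"%s\", skipping import of system collations",
						"locale -a")));
		PG_RETURN_INT32(0);
	}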

regards, tom lane




Re: locale -a missing on Alpine Linux?

2022-11-16 Thread Christoph Moench-Tegeder
## Peter Eisentraut (peter.eisentr...@enterprisedb.com):

> First of all, is this a standard installation of this OS, or is perhaps 
> something incomplete, broken, or unusual about the current OS installation?

Alpine uses musl libc, on which you need package musl-locales to get
a /usr/bin/locale.
https://pkgs.alpinelinux.org/package/edge/community/x86/musl-locales
https://musl.libc.org/

Regards,
Christoph

-- 
Spare Space




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-16 Thread Robert Haas
On Wed, Nov 16, 2022 at 5:32 AM Bharath Rupireddy
 wrote:
> -1 for CheckpointerShmemStruct as it is being used for running
> checkpoints and I don't think adding stats to it is a great idea.
> Instead, extending PgStat_CheckpointerStats and using shared memory
> stats for reporting progress/last checkpoint related stats is a good
> idea IMO.

I agree with Andres: progress reporting isn't really quite the same
thing as stats, and either place seems like it could be reasonable. I
don't presently have an opinion on which is a better fit, but I don't
think the fact that CheckpointerShmemStruct is used for running
checkpoints rules anything out. Progress reporting is *also* about
running checkpoints. Any historical data you want to expose might not
be about running checkpoints, but, uh, so what? I don't really see
that as a strong argument against it fitting into this struct.

> I also think that a new pg_stat_checkpoint view is needed
> because, right now, the PgStat_CheckpointerStats stats are exposed via
> the pg_stat_bgwriter view, having a separate view for checkpoint stats
> is good here.

Yep.

> Also, removing CheckpointStatsData and moving all of
> those members to PgStat_CheckpointerStats, of course, by being careful
> about the amount of shared memory required, is also a good idea IMO.
> Going forward, PgStat_CheckpointerStats and pg_stat_checkpoint view
> can be a single point of location for all the checkpoint related
> stats.

I'm not sure that I completely follow this part, or that I agree with
it. I have never really understood why we drive background writer or
checkpointer statistics through the statistics collector. Here again,
for things like table statistics, there is no choice, because we could
have an unbounded number of tables and need to keep statistics about
all of them. The statistics collector can handle that by allocating
more memory as required. But there is only one background writer and
only one checkpointer, so that is not needed in those cases. Why not
just have them expose anything they want to expose through shared
memory directly?

If the statistics collector provides services that we care about, like
persisting data across restarts or making snapshots for transactional
behavior, then those might be reasons to go through it even for the
background writer or checkpointer. But if so, we should be explicit
about what the reasons are, both in the mailing list discussion and in
code comments. Otherwise I fear that we'll just end up doing something
in a more complicated way than is really necessary.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Meson add host_system to PG_VERSION_STR

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 09:01:04 +0900, Michael Paquier wrote:
> On Wed, Nov 16, 2022 at 12:08:56AM +0100, Juan José Santamaría Flecha wrote:
> > As mentioned here [1] it might be interesting to complete the returned
> > information by version() when compiled with meson by including the
> > host_system.
>
> The meson build provides extra_version, which would be able to do the
> same, no?  The information would be appended to PG_VERSION_STR through
> PG_VERSION.

I don't really follow: Including the operating system in PG_VERSION_STR,
as we're doing in autoconf, seems orthogonal to extra_version? Adding linux
into extra_version would result in linux showing up in e.g.
SHOW server_version;
which doesn't seem right.


I think there's a further deficiency in the PG_VERSION_STR the meson build
generates - we use the build system's CPU. Autoconf shows $host, not $build.


For comparison, on my machine autoconf shows:
  PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc-12 (Debian 
12.2.0-9) 12.2.0, 64-bit
whereas with meson we currently end up with
  PostgreSQL 16devel on x86_64, compiled by gcc-13.0.0

I still don't think it makes sense to try to copy (or invoke)
config.guess. Particularly when targetting windows, but even just having to
keep updating config.guess in perpetuity seems unnecessary.

Given we're looking at improving this, should we also add a 32/64-bit piece?

If so, we probably should move building PG_VERSION_STR to later so we can use
SIZEOF_VOID_P - configure.ac does that too.

With extra_version set to -andres the attached results in:

PostgreSQL 16devel-andres on x86_64-linux, compiled by gcc-13.0.0, 64-bit

Greetings,

Andres Freund
diff --git i/meson.build w/meson.build
index 058382046e1..843b9e19c6b 100644
--- i/meson.build
+++ w/meson.build
@@ -143,12 +143,11 @@ cdata.set_quoted('PACKAGE_TARNAME', 'postgresql')
 
 pg_version += get_option('extra_version')
 cdata.set_quoted('PG_VERSION', pg_version)
-cdata.set_quoted('PG_VERSION_STR', 'PostgreSQL @0@ on @1@, compiled by @2@-@3@'.format(
-  pg_version, build_machine.cpu_family(), cc.get_id(), cc.version()))
 cdata.set_quoted('PG_MAJORVERSION', pg_version_major.to_string())
 cdata.set('PG_MAJORVERSION_NUM', pg_version_major)
 cdata.set('PG_MINORVERSION_NUM', pg_version_minor)
 cdata.set('PG_VERSION_NUM', pg_version_num)
+# PG_VERSION_STR is built later, it depends on compiler test results
 cdata.set_quoted('CONFIGURE_ARGS', '')
 
 
@@ -2435,6 +2434,15 @@ cdata.set('MEMSET_LOOP_LIMIT', memset_loop_limit)
 cdata.set_quoted('DLSUFFIX', dlsuffix)
 
 
+# built later than the rest of the version metadata, we need SIZEOF_VOID_P
+cdata.set_quoted('PG_VERSION_STR',
+  'PostgreSQL @0@ on @1@-@2@, compiled by @3@-@4@, @5@-bit'.format(
+pg_version, host_machine.cpu_family(), host_system,
+cc.get_id(), cc.version(), cdata.get('SIZEOF_VOID_P') * 8,
+  )
+)
+
+
 
 ###
 # Threading


locale -a missing on Alpine Linux?

2022-11-16 Thread Peter Eisentraut

Since 2fe3bdbd691a, initdb has been failing on malleefowl:

performing post-bootstrap initialization ... sh: locale: not found
2022-11-15 23:48:44.288 EST [10436] FATAL:  could not execute command 
"locale -a": command not found
2022-11-15 23:48:44.288 EST [10436] STATEMENT:  SELECT 
pg_import_system_collations('pg_catalog');


That's precisely the kind of thing this patch was supposed to catch, but 
obviously it's not good that initdb is now failing.


First of all, is this a standard installation of this OS, or is perhaps 
something incomplete, broken, or unusual about the current OS installation?





Re: when the startup process doesn't (logging startup delays)

2022-11-16 Thread Robert Haas
On Wed, Nov 16, 2022 at 1:47 AM Bharath Rupireddy
 wrote:
> That can be done, only if we can disable the timeout in another place
> when the StandbyMode is set to true in ReadRecord(), that is, after
> the standby server finishes crash recovery and enters standby mode.

Oh, interesting. I didn't realize that we would need to worry about that case.

> I'm attaching the v3 patch for further review. Please find the CF
> entry here - https://commitfest.postgresql.org/41/4012/.

I kind of dislike having to have logic for this in two places. Seems
like it could create future bugs.

How about the attached approach, instead? This way, the first time the
timer expires after we reach standby mode, we reactively disable it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


another-way.patch
Description: Binary data


Re: libpq support for NegotiateProtocolVersion

2022-11-16 Thread Jacob Champion
On Tue, Nov 15, 2022 at 2:19 AM Peter Eisentraut
 wrote:
> I think for the current code, the following would be an appropriate
> adjustment:
>
> diff --git a/src/interfaces/libpq/fe-connect.c
> b/src/interfaces/libpq/fe-connect.c
> index 746e9b4f1efc..d15fb96572d9 100644
> --- a/src/interfaces/libpq/fe-connect.c
> +++ b/src/interfaces/libpq/fe-connect.c
> @@ -3412,8 +3412,7 @@ PQconnectPoll(PGconn *conn)
>  /* Get the type of request. */
>  if (pqGetInt((int *) &msgLength, 4, conn))
>  {
> -   /* We'll come back when there are more data */
> -   return PGRES_POLLING_READING;
> +   goto error_return;
>  }
>  msgLength -= 4;
>
> And then the handling of the 'v' message in my patch would also be
> adjusted like that.

Yes -- though that particular example may be dead code, since we
should have already checked that there are at least four more bytes in
the buffer.

--Jacob




Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:
> On Wed, Nov 16, 2022 at 7:30 AM Andres Freund  wrote:
> >
> > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund  wrote:
> > > > nor do we enforce in an obvious place that we
> > > > don't already hold a snapshot.
> > > >
> > >
> > > We have a check for (FirstXactSnapshot == NULL) in
> > > RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
> > > sufficient?
> >
> > I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
> > still be bad. And I think checking in SetTransactionSnapshot() is too late,
> > we've already overwritten MyProc->xmin by that point.
> >
> 
> So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
> we have the Assert for Assert(!FirstSnapshotSet)?
> Assert(FirstXactSnapshot == NULL);
> Assert(!HistoricSnapshotActive());

I don't think that'd catch a catalog snapshot. But perhaps the better answer
for the catalog snapshot is to just invalidate it explicitly. The user doesn't
have control over the catalog snapshot being taken, and it's not too hard to
imagine the walsender code triggering one somewhere.

So maybe we should add something like:

InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
if (HaveRegisteredOrActiveSnapshot())
  elog(ERROR, "cannot build an initial slot snapshot when snapshots exist")
Assert(!HistoricSnapshotActive());

I think we'd not need to assert FirstXactSnapshot == NULL or !FirstSnapshotSet
with that, because those would show up in HaveRegisteredOrActiveSnapshot().

Greetings,

Andres Freund




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 16:01:55 +0530, Bharath Rupireddy wrote:
> -1 for CheckpointerShmemStruct as it is being used for running
> checkpoints and I don't think adding stats to it is a great idea.

Why?  Imo the data needed for progress reporting aren't really "stats". We'd
not accumulate counters over time, just for the current checkpoint.

I think it might even be useful for other parts of the system to know what the
checkpointer is doing, e.g. bgwriter or autovacuum could adapt the behaviour
if checkpointer can't keep up. Somehow it'd feel wrong to use the stats system
as the source of such adjustments - but perhaps my gut feeling on that isn't
right.

The best argument for combining progress reporting with accumulating stats is
that we could likely share some of the code. Having accumulated stats for all
the checkpoint phases would e.g. be quite valuable.


> Instead, extending PgStat_CheckpointerStats and using shared memory
> stats for reporting progress/last checkpoint related stats is a good
> idea IMO

There's certainly some potential for deduplicating state and to make stats
updated more frequently. But that doesn't necessarily mean that putting the
checkpoint progress into PgStat_CheckpointerStats is a good idea (nor the
opposite).


> I also think that a new pg_stat_checkpoint view is needed
> because, right now, the PgStat_CheckpointerStats stats are exposed via
> the pg_stat_bgwriter view, having a separate view for checkpoint stats
> is good here.

I agree that we should do that, but largely independent of the architectural
question at hand.

Greetings,

Andres Freund




Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 14:49:59 +0100, Mats Kindahl wrote:
> I think the discussion went a little sideways, so let me recap what I'm
> suggesting:
> 
>1. I mentioned that there is a missing callback when the filenode is
>unlinked and this is particularly evident when dropping a table.
>2. It was correctly pointed out to me that an implementor need to ensure
>that dropping a table is transactional.
>3. I argued that the callback is still correct and outlined how this can
>be handled by a table access method using xact callbacks, if necessary.

I still think 3) isn't a solution to 2). The main issue is that xact callbacks
don't address that storage has to be removed during redo as well. For that
dropping storage has to be integrated with commit / abort records.

I don't think a custom WAL rmgr addresses this either - it has to be
integrated with the commit record, and the record has to be replayed as a whole.


> I see huge potential in the table access method and would like to do my
> part in helping it in succeeding. I noted that the API is biased in how the
> existing heap implementation works and is also very focused on
> implementations of "local" storage engines. For more novel architectures
> (for example, various sorts of distributed architectures) and to be easier
> to work with, I think that the API can be improved in a few places. This is
> a first step in the direction of making the API both easier to use as well
> as enabling more novel use-cases.

I agree with that - there's lots more work to be done and the evolution from
not having the abstraction clearly shows. Some of the deficiencies are easy to
fix, but others are there because there's no quick solution to them.


> > To me it still seems fundamentally the wrong direction to implement a "drop
> > relation callback" tableam callback.
> >
> 
> This is not really a "drop table" callback, it is just the most obvious
> case where this is missing. So, just to recap the situation as it looks
> right now.
> 
> Here is (transactional) truncate table:
> 
>1. Allocate a new file node in the same tablespace as the table
>2. Add the file node to the list of pending node to delete
>3. Overwrite the existing file node in the relation with the new one
>4. Call table_relation_set_new_filenode to tell extension that there is
>a new filenode for the relation
> 
> Here is drop table:
> 
>1. Add the existing file node to the list of pending deletes
>2. Remove the table from the catalogs
> 
> For an extension writer, the disappearance of the old file node is
> "invisible" since there is no callback about this, but it is very clear
> when a new file node is allocated. In addition to being inconsistent, it
> adds an extra burden on the extension writer. To notice that a file node
> has been unlinked you can register a transaction handler and investigate
> the pending list at commit or abort time. Even though possible, there are
> two problems with this: 1) the table access method is notified "late" in
> the transaction that the file node is going away, and 2) it is
> unnecessarily complicated to register a transaction handler only for
> inspecting this.
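
(As a rough illustration of the workaround described above; everything here
other than the xact-callback API itself is hypothetical. The table AM
registers a transaction callback and processes its own bookkeeping of
dropped relations at commit/abort time.)

static void
my_tableam_xact_callback(XactEvent event, void *arg)
{
	if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
	{
		/*
		 * Walk the AM's private list of relations dropped in this
		 * transaction and unlink or keep their storage accordingly.
		 */
		my_tableam_process_pending_drops(event == XACT_EVENT_COMMIT);
	}
}

void
my_tableam_init(void)
{
	RegisterXactCallback(my_tableam_xact_callback, NULL);
}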

Using a transaction callback solely also doesn't address the redo issue...


I think to make this viable for filenodes that don't look like md.c's, you'd
have to add a way to make commit/abort records extensible by custom
rmgrs. Then the replay of a commit/abort could call the custom rmgr for the
"sub-record" during replay.


> Telling the access method that the filenode is unlinked by adding a
> callback is by far the best solution since it does not affect existing
> extensions and will give the table access methods opportunities to act on
> it immediately.

I'm loath to add a callback that I don't think can be used correctly without
further changes.

Greetings,

Andres Freund




Re: Add sub-transaction overflow status in pg_stat_activity

2022-11-16 Thread Robert Haas
On Tue, Nov 15, 2022 at 2:29 PM Andres Freund  wrote:
> Hence the suggestion to show the pid of the session with the most subxacts. We
> probably also should add a bunch of accessor functions for people that want
> more detail... But just seeing in one place what's problematic would be the
> big get, the rest will be a small percentage of users IME.

I guess all I can say here is that my experience differs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Small miscellaneous fixes

2022-11-16 Thread Ranier Vilela
Em qua., 16 de nov. de 2022 às 03:59, Michael Paquier 
escreveu:

> On Tue, Oct 04, 2022 at 08:23:16AM -0300, Ranier Vilela wrote:
> > Both are correct, I missed the pqsignal calls.
> >
> > Attached patch to change this.
>
> The change for pgbench is missing and this is only changing
> pg_test_fsync.  Switching to sig_atomic_t would be fine on non-WIN32
> as these are used in signal handlers, but are we sure that this is
> fine on WIN32 for pg_test_fsync where we rely on a separate thread to
> control the timing of the alarm?
>
Well, I tested here on Windows 10 64 bits with sig_atomic_t alarm_triggered
and it works fine.
ctrl + c breaks the exe.

Windows 10 64bits
SSD 256GB

For curiosity, this is the test results:
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync

ctrl + c

C:\postgres_debug\bin>pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync

ctrl + c

C:\postgres_debug\bin>pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  9495,720 ops/sec 105 usecs/op
fdatasync   444,174 ops/sec2251 usecs/op
fsync   398,487 ops/sec2509 usecs/op
fsync_writethrough  342,018 ops/sec2924 usecs/op
open_sync   n/a

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  4719,825 ops/sec 212 usecs/op
fdatasync   442,138 ops/sec2262 usecs/op
fsync   401,163 ops/sec2493 usecs/op
fsync_writethrough  397,198 ops/sec2518 usecs/op
open_sync   n/a

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
 1 * 16kB open_sync write   n/a
 2 *  8kB open_sync writes  n/a
 4 *  4kB open_sync writes  n/a
 8 *  2kB open_sync writes  n/a
16 *  1kB open_sync writes  n/a

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close  77,808 ops/sec   12852 usecs/op
write, close, fsync  77,469 ops/sec   12908 usecs/op

Non-sync'ed 8kB writes:
write139789,685 ops/sec   7 usecs/op

regards,
Ranier Vilela


Re: HOT chain validation in verify_heapam()

2022-11-16 Thread Robert Haas
On Wed, Nov 16, 2022 at 4:51 AM Himanshu Upadhyaya
 wrote:
> yes, got it, have tried to test and it is giving false corruption in case of 
> subtransaction.
> I think a better way to have this check is, we need to check that if 
> pred_xmin is
> aborted then current_xmin should be aborted only. So there is no way that we
> validate corruption with in_progress txid.

Please note that you can't use TransactionIdDidAbort here, because
that will return false for transactions aborted by a crash. You have
to check that it's not in progress and then afterwards check that it's
not committed. Also note that if you check whether it's committed
first and then check whether it's in progress afterwards, there's a
race condition: it might commit just after you verify that it isn't
committed yet, and then it won't be in progress any more and will look
aborted.

I disagree with the idea that we can't check in progress. I think the
checks could look something like this:

pred_in_progress = TransactionIdIsInProgress(pred_xmin);
current_in_progress = TransactionIdIsInProgress(current_xmin);
if (pred_in_progress)
{
    if (current_in_progress)
        return ok;
    // recheck to avoid race condition
    if (TransactionIdIsInProgress(pred_xmin))
    {
        if (TransactionIdDidCommit(current_xmin))
            return corruption: predecessor xmin in progress, but current xmin committed;
        else
            return corruption: predecessor xmin in progress, but current xmin aborted;
    }
    // fallthrough: when we entered this if-block pred_xmin was still in progress but no longer;
    pred_in_progress = false;
}

if (TransactionIdDidCommit(pred_xmin))
    return ok;

if (current_in_progress)
    return corruption: predecessor xmin aborted, but current xmin in progress;
else if (TransactionIdDidCommit(current_xmin))
    return corruption: predecessor xmin aborted, but current xmin committed;

The error messages as phrased here aren't actually what we should use;
they would need rephrasing. But I think, or hope anyway, that the
logic works. I think you basically just need the 1 recheck: if you see
the predecessor xmin in progress and then the current xmin in
progress, you have to go back and check that the predecessor xmin is
still in progress, because otherwise both could have committed or
aborted together in between.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] configurable out of disk space elog level

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 15:59:09 +0300, Maxim Orlov wrote:
> Patch is posted as PoC is attached.

> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -40,6 +40,7 @@
>  #include "storage/sync.h"
>  #include "utils/hsearch.h"
>  #include "utils/memutils.h"
> +#include "utils/spccache.h"
>  
>  /*
>   *   The magnetic disk storage manager keeps track of open file
> @@ -479,14 +480,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, 
> BlockNumber blocknum,
>  
>   if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, 
> WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
>   {
> + int elevel = 
> get_tablespace_elevel(reln->smgr_rlocator.locator.spcOid);
> +

You can't do catalog access below the bufmgr.c layer. It could lead to all
kinds of nastiness, including potentially recursing back to md.c. Even leaving
that aside, we can't do catalog accesses in all kinds of environments that
this currently is active in - most importantly it's affecting the startup
process. We don't do catalog accesses in the startup process, and even if we
were to do so, we couldn't unconditionally because the catalog might not even
be consistent at this point (nor is it guaranteed that the wal_level even
allows to access catalogs during recovery).

I'm not convinced by the usecase in the first place, but purely technically I
think it's not viable to make this a tablespace option.

Greetings,

Andres Freund




Re: Make a 100_bugs.pl test more faster.

2022-11-16 Thread Tom Lane
"Anton A. Melnikov"  writes:
> Here is a separate patch for the node usage optimization mentioned above.
> It decreases the CPU usage during 100_bugs.pl by about 30%.

Hmm ... as written, this isn't testing the same thing, because you
didn't disable the FOR ALL TABLES publications created in the earlier
steps, so we're redundantly syncing more publications in the later
ones.  Cleaning those up seems to make it a little faster yet,
so pushed with that adjustment.

regards, tom lane




Re: MERGE regress test

2022-11-16 Thread Alvaro Herrera
On 2022-Nov-16, Teja Mupparti wrote:

> A quick question on merge regress-test
> 
> https://github.com/postgres/postgres/blob/REL_15_STABLE/src/test/regress/expected/merge.out#L846
> should there be an ERROR, or does the comment need a fix? What's the expected behavior?

Hmm, good find.  As I recall, I was opposed to the idea of throwing an
error if the WHEN expression writes to the database, and the previous
implementation had some hole, so I just ripped it out after discussing
it; but I evidently failed to notice this test case about it.

However, I can't find any mailing list discussion about this point.
Maybe I just asked Simon off-list about it.

IMO just deleting that test is a sufficient fix.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru




Re: [PoC] Let libpq reject unexpected authentication requests

2022-11-16 Thread Jacob Champion
On Tue, Nov 15, 2022 at 11:07 PM Michael Paquier  wrote:
> I am beginning to look at the last version proposed, which has been
> marked as RfC.  Does this patch need a refresh in light of a9e9a9f and
> 0873b2d?  The changes for libpq_append_conn_error() should be
> straight-forward.

Updated in v13, thanks!

--Jacob
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 85e3b9d913..52b1ba927e 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -550,9 +550,8 @@ pg_SASL_init(PGconn *conn, int payloadlen)
if ((conn->sasl_mechs_denied && found)
|| (!conn->sasl_mechs_denied && !found))
{
-   appendPQExpBuffer(&conn->errorMessage,
- libpq_gettext("auth 
method \"%s\" requirement failed: server requested unacceptable SASL mechanism 
\"%s\"\n"),
- conn->require_auth, 
selected_mechanism);
+   libpq_append_conn_error(conn, "auth method \"%s\" 
requirement failed: server requested unacceptable SASL mechanism \"%s\"",
+   
conn->require_auth, selected_mechanism);
goto error;
}
}
@@ -904,14 +903,12 @@ check_expected_areq(AuthRequest areq, PGconn *conn)
 */
if (!conn->ssl_cert_requested)
{
-   appendPQExpBufferStr(&conn->errorMessage,
-
libpq_gettext("server did not request a certificate"));
+   libpq_append_conn_error(conn, "server did not request a 
certificate");
return false;
}
else if (!conn->ssl_cert_sent)
{
-   appendPQExpBufferStr(&conn->errorMessage,
-
libpq_gettext("server accepted connection without a valid certificate"));
+   libpq_append_conn_error(conn, "server accepted 
connection without a valid certificate");
return false;
}
}
@@ -997,10 +994,8 @@ check_expected_areq(AuthRequest areq, PGconn *conn)
if (!reason)
reason = auth_description(areq);
 
-   appendPQExpBuffer(&conn->errorMessage,
- libpq_gettext("auth method 
\"%s\" requirement failed: %s\n"),
- conn->require_auth, reason);
-
+   libpq_append_conn_error(conn, "auth method \"%s\" requirement 
failed: %s",
+   
conn->require_auth, reason);
return result;
}
 
diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
index f40af2a4f9..46fc8e4940 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -1307,9 +1307,8 @@ connectOptions2(PGconn *conn)
else if (!negated)
{
conn->status = CONNECTION_BAD;
-   appendPQExpBuffer(&conn->errorMessage,
- 
libpq_gettext("negative require_auth method \"%s\" cannot be mixed with 
non-negative methods"),
- 
method);
+   libpq_append_conn_error(conn, "negative 
require_auth method \"%s\" cannot be mixed with non-negative methods",
+   
method);
 
free(part);
return false;
@@ -1321,9 +1320,8 @@ connectOptions2(PGconn *conn)
else if (negated)
{
conn->status = CONNECTION_BAD;
-   appendPQExpBuffer(&conn->errorMessage,
- 
libpq_gettext("require_auth method \"%s\" cannot be mixed with negative 
methods"),
- method);
+   libpq_append_conn_error(conn, "require_auth 
method \"%s\" cannot be mixed with negative methods",
+   
method);
 
free(part);
return false;
@@ -1395,9 +1393,8 @@ connectOptions2(PGconn *conn)
else
{
conn->status = CONNECTION_BAD;
-  

Re: meson oddities

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 11:54:10 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-11-16 10:53:59 +0100, Peter Eisentraut wrote:
> >> Could you explain this in more detail?
> 
> > If I just want to install postgres into a prefix without 'postgresql' added 
> > in
> > a bunch of directories, e.g. because I already have pg-$version to be in the
> > prefix, there's really no good way to do so - you can't even specify
> > --sysconfdir or such, because we just override that path.
> 
> At least for the libraries, the point of the 'postgresql' subdir IMO
> is to keep backend-loadable extensions separate from random libraries.
> It's not great that we may fail to do that depending on what the
> initial part of the library path is.

Agreed, extensions really should never be in a path searched by the dynamic
linker, even if the prefix contains 'postgres'.

To me that's a separate thing from adding postgresql to datadir, sysconfdir,
includedir, docdir... On a green field I'd say the 'extension library'
directory should just always be extensions/ or such.

Greetings,

Andres Freund




Re: meson oddities

2022-11-16 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-16 10:53:59 +0100, Peter Eisentraut wrote:
>> Could you explain this in more detail?

> If I just want to install postgres into a prefix without 'postgresql' added in
> a bunch of directories, e.g. because I already have pg-$version to be in the
> prefix, there's really no good way to do so - you can't even specify
> --sysconfdir or such, because we just override that path.

At least for the libraries, the point of the 'postgresql' subdir IMO
is to keep backend-loadable extensions separate from random libraries.
It's not great that we may fail to do that depending on what the
initial part of the library path is.

I could get behind allowing the user to specify that path explicitly
and then not modifying it; but when we're left to our own devices
I think we should preserve that separation.

regards, tom lane




Re: [PoC] Reducing planning time when tables have many partitions

2022-11-16 Thread Thom Brown
On Mon, 7 Nov 2022 at 06:33, Zhang Mingli  wrote:
>
> HI,
>
> Regards,
> Zhang Mingli
> On Nov 7, 2022, 14:26 +0800, Tom Lane , wrote:
>
> Andrey Lepikhov  writes:
>
> I'm still in review of your patch now. At most it seems ok, but are you
> really need both eq_sources and eq_derives lists now?
>
>
> Didn't we just have this conversation? eq_sources needs to be kept
> separate to support the "broken EC" logic. We don't want to be
> regurgitating derived clauses as well as originals in that path.
>
> Aha, we have that conversation in another thread(Reducing duplicativeness of 
> EquivalenceClass-derived clauses
> ) : https://www.postgresql.org/message-id/644164.1666877342%40sss.pgh.pa.us

Once the issue Tom identified has been resolved, I'd like to test
drive newer patches.

Thom




Re: meson oddities

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 10:53:59 +0100, Peter Eisentraut wrote:
> On 16.11.22 00:40, Andres Freund wrote:
> > Somewhat relatedly, I wonder if we should have a better way to 
> > enable/disable
> > the 'pgsql' path logic. It's pretty annoying that prefix basically doesn't
> > work if it doesn't contain 'pgsql' or 'postgres'.
> 
> Could you explain this in more detail?

If I just want to install postgres into a prefix without 'postgresql' added in
a bunch of directories, e.g. because I already have pg-$version to be in the
prefix, there's really no good way to do so - you can't even specify
--sysconfdir or such, because we just override that path.

And because many of our binaries are major version specific you pretty much
need to include the major version in the prefix, making the 'postgresql' we
add redundant.

I think the easiest way today is to use a temporary prefix and then just
rename the installation path. But that obviously doesn't deal well with
rpaths, at least as long as we don't use relative rpaths.

Greetings,

Andres Freund




Re: heapgettup refactoring

2022-11-16 Thread Peter Eisentraut

On 04.11.22 16:51, Melanie Plageman wrote:

Thanks for the review!
Attached is v2 with feedback addressed.


Your 0001 had already been pushed.

I have pushed your 0002.

I have also pushed the renaming of page -> block, dp -> page separately. 
 This should reduce the size of your 0003 a bit.


Please produce an updated version of the 0003 patch for further review.





Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-16 Thread Tom Lane
Simon Riggs  writes:
> On Wed, 16 Nov 2022 at 00:09, Tom Lane  wrote:
>> No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed
>> is set (ie, somebody somewhere/somewhen overflowed), then it does
>> SubTransGetTopmostTransaction and searches only the xips with the result.
>> This behavior requires that all live subxids be correctly mapped by
>> SubTransGetTopmostTransaction, or we'll draw false conclusions.

> Your comments are correct wrt to the existing coding, but not to the
> patch, which is coded as described and does not suffer those issues.

Ah, OK.

Still ... I really do not like this patch.  It introduces a number of
extremely fragile assumptions, and I think those would come back to
bite us someday, even if they hold now which I'm still unsure about.
It doesn't help that you've chosen to document them only at the place
making them and not at the place(s) likely to break them.

Also, to be blunt, this is not Ready For Committer.  It's more WIP,
because even if the code is okay there are comments all over the system
that you've invalidated.  (At the very least, the file header comments
in subtrans.c and the comments in struct SnapshotData need work; I've
not looked hard but I'm sure there are more places with comments
bearing on these data structures.)

Perhaps it would be a good idea to split up the patch.  The business
about making pg_subtrans flat rather than a tree seems like a good
idea in any event, although as I said it doesn't seem like we've got
a fleshed-out version of that here.  We could push forward on getting
that done and then separately consider the rest of it.

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-16 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
 wrote:
>
>
> On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> >  wrote:
> > >
> > >
> > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > > wrote:
> > > > Thanks! Please let me know if there is something I can help with.
> > >
> > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > >
> > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> >
> > Which tests do you use to get this assertion failure? I've confirmed
> > there is a bug in 0005 patch but without it, "make check-world"
> > passed.
>
> Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> happened, sorry for the noise.

Good to know. No problem.

> I'm attaching a test I wrote to stress test branch prediction in search, and 
> while trying it out I found two possible issues.

Thank you for testing!

>
> It's based on the random int load test, but tests search speed. Run like this:
>
> select * from bench_search_random_nodes(10 * 1000 * 1000)
>
> It also takes some care to include all the different node kinds, restricting 
> the possible keys by AND-ing with a filter. Here's a simple demo:
>
> filter = ((uint64)1<<40)-1;
> LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> 62663, n256 = 3130
>
> Just using random integers leads to >99% using the smallest node. I wanted to 
> get close to having the same number of each, but that's difficult while still 
> using random inputs. I ended up using
>
> filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
>
> which gives
>
> LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> 182670, n256 = 1024
>
> Which seems okay for the task. One puzzling thing I found while trying 
> various filters is that sometimes the reported tree height would change. For 
> example:
>
> filter = (((uint64) 1<<32) | (0xFF<<24));
> LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> 62632, n256 = 3161
>
> 1) Any idea why the tree height would be reported as 7 here? I didn't expect 
> that.

In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
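
A quick standalone check of that point (0xFF is a plain int, so on typical
platforms 0xFF << 24 overflows into the sign bit and the negative result
sign-extends when converted to a 64-bit unsigned type):

#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	uint64_t	bad = ((uint64_t) 1 << 32) | (0xFF << 24);
	uint64_t	good = ((uint64_t) 1 << 32) | ((uint64_t) 0xFF << 24);

	printf("bad  = %016" PRIx64 "\n", bad);		/* ffffffffff000000 */
	printf("good = %016" PRIx64 "\n", good);	/* 00000001ff000000 */
	return 0;
}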

>
> 2) It seems that 0004 actually causes a significant slowdown in this test (as 
> in the attached, using the second filter above and with turboboost disabled):
>
> v9 0003: 2062 2051 2050
> v9 0004: 2346 2316 2321
>
> That means my idea for the pointer struct might have some problems, at least 
> as currently implemented. Maybe in the course of separating out and polishing 
> that piece, an inefficiency will fall out. Or, it might be another reason to 
> template local and shared separately. Not sure yet. I also haven't tried to 
> adjust this test for the shared memory case.

I'll also run the test on my environment and do the investigation tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Reducing power consumption on idle servers

2022-11-16 Thread Simon Riggs
On Wed, 16 Nov 2022 at 12:47, Bharath Rupireddy
 wrote:
>
> On Wed, Nov 16, 2022 at 2:34 PM Simon Riggs
>  wrote:
> >
> > Reposting v6 now so that patch tester doesn't think this has failed
> > when the patch on other thread gets applied.
>
> Intention of the patch, that is, to get rid of promote_trigger_file
> GUC sometime in future, looks good to me. However, the timeout change
> to 60 sec from 5 sec seems far-reaching. While it discourages the use
> of the GUC, it can impact many existing production servers that still
> rely on promote_trigger_file as it can increase the failover times
> impacting SLAs around failover.

The purpose of 60s is to allow for power reduction, so 5s won't do.

promote_trigger_file is not tested and there are better ways, so
deprecating it in this release is fine.

Anyone that relies on it can update their mechanisms to a supported
one with a one-line change. Realistically, anyone using it won't be on
the latest release anyway, at least for a long time, since if they use
manual methods then they are well behind the times.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: [BUG] Logical replica crash if there was an error in a function.

2022-11-16 Thread Anton A. Melnikov

Thanks a lot for the fast reply!

On 03.11.2022 18:29, Tom Lane wrote:

If we were just adding a
query or two to an existing scenario, that could be okay; but spinning
up and syncing a whole new primary and standby database is *expensive*
when you multiply it by the number of times developers and buildfarm
animals are going to run the tests in the future.



On 15.11.2022 04:59, Tom Lane wrote:

"Anton A. Melnikov"  writes:

On 02.11.2022 21:02, Tom Lane wrote:

I don't think the cost of that test case is justified by the tiny
probability that it'd ever catch anything.



Seems it is possible to do a test without these remarks. The attached
test uses existing nodes and checks the specific error message.


My opinion remains unchanged: this would be a very poor use of test
cycles.


Sorry, I didn't fully understand what was required and
added some functions to the test that spend extra CPU time. But I found
that it is possible to make a test according to the previous remarks by adding
only a few extra queries to an existing test without any additional syncing.

Experimentally, I could not observe any significant difference in
CPU usage due to this test addition.
The difference in CPU usage was less than the standard error.
See 100_bugs-CPU-time.txt attached.


Additionally
i've tried to reduce overall number of nodes previously
used in this test in a similar way.


Optimizing existing tests isn't an answer to that.  We could
install those optimizations without adding a new test case.


Oh sure! Here is a separate patch for this optimization:
https://www.postgresql.org/message-id/eb7aa992-c2d7-6ce7-4942-0c784231a362%40inbox.ru
In contrast with the previous case, there the difference in CPU usage
during 100_bugs.pl is substantial; it decreases by approximately 30%.


With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companycommit 2449158ba576d7c6d97852d14f85dadb3aced262
Author: Anton A. Melnikov 
Date:   Wed Nov 16 11:46:54 2022 +0300

Add test for syntax error in the function in a logical replication
worker. See dea834938.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 6247aa7730..eb4ab6d359 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -69,6 +69,51 @@ $node_publisher->wait_for_catchup('sub1');
 
 pass('index predicates do not cause crash');
 
+# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru
+
+# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language
+# CREATE FUNCTION or DO command executed in a logical replication worker,
+# we'd suffer a null pointer dereference or assertion failure.
+
+$node_subscriber->safe_psql('postgres', q{
+	create or replace procedure rebuild_test(
+	) as
+	$body$
+	declare
+		l_code  text:= E'create or replace function public.test_selector(
+	) returns setof public.tab1 as
+	\$body\$
+		select * from  error_name
+	\$body\$ language sql;';
+	begin
+		execute l_code;
+		perform public.test_selector();
+	end
+	$body$ language plpgsql;
+	create or replace function test_trg()
+	returns trigger as
+	$body$
+		declare
+		begin
+			call public.rebuild_test();
+			return NULL;
+		end
+	$body$ language plpgsql;
+	create trigger test_trigger after insert on tab1 for each row
+execute function test_trg();
+	alter table tab1 enable replica trigger test_trigger;
+});
+
+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+	"ERROR:  relation \"error_name\" does not exist at character"
+);
+
+ok($result,
+	"ERROR: Logical decoding doesn't fail on function error");
+
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
## src/test/subscription/100_bugs.pl 
Before adding test:
Files=1, Tests=7, 14 wallclock secs ( 0.02 usr  0.00 sys +  7.96 cusr  2.24 
csys = 10.22 CPU)
Files=1, Tests=7, 15 wallclock secs ( 0.02 usr  0.00 sys +  8.21 cusr  2.13 
csys = 10.36 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.01 usr  0.01 sys +  8.69 cusr  2.17 
csys = 10.88 CPU)
Files=1, Tests=7, 15 wallclock secs ( 0.01 usr  0.00 sys +  8.34 cusr  2.22 
csys = 10.57 CPU)
Files=1, Tests=7, 15 wallclock secs ( 0.01 usr  0.00 sys +  8.64 cusr  2.19 
csys = 10.84 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.02 usr  0.00 sys +  8.18 cusr  2.20 
csys = 10.40 CPU)
Files=1, Tests=7, 13 wallclock secs ( 0.01 usr  0.00 sys +  8.02 cusr  2.23 
csys = 10.26 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.01 usr  0.01 sys +  8.20 cusr  2.14 
csys = 10.36 CPU)
Files=1, Tests=7, 13 wallclock secs ( 0.02 usr  0.00 sys +  8.04 cusr  2.19 
csys = 10.25 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.01 usr  0.00 sys +  8.20 cusr  2.09 
csys = 10.30 CPU)
Average CPU usage = 10.44+-0.07s

After adding test:
Files=1, Tests=8, 14 

MERGE regress test

2022-11-16 Thread Teja Mupparti
Hi,



A quick question on merge regress-test



https://github.com/postgres/postgres/blob/REL_15_STABLE/src/test/regress/expected/merge.out#L846



should there be an ERROR, or does the comment need a fix? What's the expected behavior?



Regards

Teja



Re: const qualifier for list APIs

2022-11-16 Thread Tom Lane
Ashutosh Bapat  writes:
> Functions like lappend_*() in list.c do not modify the second
> argument. So it can be qualified as const. Any reason why we don't do
> that? Is it because the target pointer ptr_value is not const
> qualified?

It would be a lie in many (most?) cases, wherever somebody later pulls the
pointer out of the list without applying "const" to it.  So I can't see
that adding "const" there would be an improvement.

> So the coding practice though questionable, is
> safe and avoids unnecessary pallocs. But SonarQube does complain about
> it.

Maybe an explicit cast to (void *) would shut it up.
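
Concretely, that suggestion amounts to something like the following
(hypothetical caller; the List API itself stays unchanged and the
const-stripping is made explicit at the call site):

static List *
collect_names(List *result, const char *name)
{
	return lappend(result, (void *) name);
}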

regards, tom lane




Make a 100_bugs.pl test more faster.

2022-11-16 Thread Anton A. Melnikov

Hello!

The previous discussion was here:
https://www.postgresql.org/message-id/flat/b570c367-ba38-95f3-f62d-5f59b9808226%40inbox.ru


On 15.11.2022 04:59, Tom Lane wrote:

"Anton A. Melnikov"  writes:

Additionally
i've tried to reduce overall number of nodes previously
used in this test in a similar way.


Optimizing existing tests isn't an answer to that.  We could
install those optimizations without adding a new test case.


Here is a separate patch for the node usage optimization mentioned above.
It decreases the CPU usage during 100_bugs.pl by about 30%.

There are also some experimental data: 100_bugs-CPU-usage.txt


Would be glad for any comments and concerns.

With best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companycommit 72ea85b5f6a64f2c53e6a55455d134b228b6994b
Author: Anton A. Melnikov 
Date:   Mon Nov 14 08:30:26 2022 +0300

Reuse nodes in subscription/t/100_bugs.pl test to make it faster.

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 6247aa7730..525584b2b6 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -81,9 +81,8 @@ $node_subscriber->stop('fast');
 # identity set before accepting updates.  If it did not it would cause
 # an error when an update was attempted.
 
-$node_publisher = PostgreSQL::Test::Cluster->new('publisher2');
-$node_publisher->init(allows_streaming => 'logical');
-$node_publisher->start;
+$node_publisher->rotate_logfile();
+$node_publisher->start();
 
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION pub FOR ALL TABLES");
@@ -226,13 +225,13 @@ $node_sub->stop('fast');
 # target table's relcache was not being invalidated. This leads to skipping
 # UPDATE/DELETE operations during apply on the subscriber side as the columns
 # required to search corresponding rows won't get logged.
-$node_publisher = PostgreSQL::Test::Cluster->new('publisher3');
-$node_publisher->init(allows_streaming => 'logical');
-$node_publisher->start;
 
-$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber3');
-$node_subscriber->init(allows_streaming => 'logical');
-$node_subscriber->start;
+$node_publisher->rotate_logfile();
+$node_publisher->start();
+
+$node_subscriber->rotate_logfile();
+$node_subscriber->rotate_logfile(); # call twice to synchronize lognames between master and replica
+$node_subscriber->start();
 
 $node_publisher->safe_psql('postgres',
 	"CREATE TABLE tab_replidentity_index(a int not null, b int not null)");
## src/test/subscription/100_bugs.pl 
Before reuse nodes:
Files=1, Tests=7, 12 wallclock secs ( 0.02 usr  0.00 sys +  8.02 cusr  1.96 
csys = 10.00 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.02 usr  0.00 sys +  7.99 cusr  2.23 
csys = 10.24 CPU)
Files=1, Tests=7, 15 wallclock secs ( 0.01 usr  0.00 sys +  8.31 cusr  2.16 
csys = 10.48 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.01 usr  0.00 sys +  8.29 cusr  2.08 
csys = 10.38 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.01 usr  0.00 sys +  8.01 cusr  2.14 
csys = 10.16 CPU)
Files=1, Tests=7, 13 wallclock secs ( 0.02 usr  0.00 sys +  8.13 cusr  2.04 
csys = 10.19 CPU)
Files=1, Tests=7, 13 wallclock secs ( 0.02 usr  0.00 sys +  8.11 cusr  2.11 
csys = 10.24 CPU)
Files=1, Tests=7, 14 wallclock secs ( 0.02 usr  0.00 sys +  8.04 cusr  2.19 
csys = 10.25 CPU)
Files=1, Tests=7, 12 wallclock secs ( 0.01 usr  0.00 sys +  8.02 cusr  2.07 
csys = 10.10 CPU)
Files=1, Tests=7, 13 wallclock secs ( 0.02 usr  0.00 sys +  8.05 cusr  2.09 
csys = 10.16 CPU)
Average CPU usage = 10.22+-0.04s

After reuse nodes:
Files=1, Tests=7, 11 wallclock secs ( 0.01 usr  0.00 sys +  5.68 cusr  1.56 
csys =  7.25 CPU)
Files=1, Tests=7, 11 wallclock secs ( 0.02 usr  0.00 sys +  5.61 cusr  1.60 
csys =  7.23 CPU)
Files=1, Tests=7, 11 wallclock secs ( 0.02 usr  0.00 sys +  5.66 cusr  1.64 
csys =  7.32 CPU)
Files=1, Tests=7, 12 wallclock secs ( 0.02 usr  0.00 sys +  5.76 cusr  1.63 
csys =  7.41 CPU)
Files=1, Tests=7, 11 wallclock secs ( 0.02 usr  0.00 sys +  5.60 cusr  1.61 
csys =  7.23 CPU)
Files=1, Tests=7, 10 wallclock secs ( 0.02 usr  0.00 sys +  5.63 cusr  1.65 
csys =  7.30 CPU)
Files=1, Tests=7, 10 wallclock secs ( 0.02 usr  0.00 sys +  5.64 cusr  1.58 
csys =  7.24 CPU)
Files=1, Tests=7, 12 wallclock secs ( 0.01 usr  0.00 sys +  5.59 cusr  1.75 
csys =  7.35 CPU)
Files=1, Tests=7, 11 wallclock secs ( 0.01 usr  0.00 sys +  5.54 cusr  1.74 
csys =  7.29 CPU)
Files=1, Tests=7, 10 wallclock secs ( 0.02 usr  0.00 sys +  5.62 cusr  1.57 
csys =  7.21 CPU)
Average CPU usage = 7.28+-0.02s


Re: closing file in adjust_data_dir

2022-11-16 Thread Ted Yu
On Wed, Nov 16, 2022 at 12:28 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 16.11.22 04:31, Ted Yu wrote:
> > On Wed, 16 Nov 2022 at 11:15, Ted Yu  > > wrote:
> >  > On Tue, Nov 15, 2022 at 7:12 PM Japin Li  > > wrote:
> >  >> After some rethinking, I find the origin code do not have
> problems.
> >  >>
> >  >> If fd is NULL or fgets() returns NULL, the process exits.
> > Otherwise, we
> >  >> call
> >  >> pclose() to close fd.  The code isn't straightforward, however,
> > it is
> >  >> correct.
> >
> > Hi,
> > Please take a look at the following:
> >
> > https://en.cppreference.com/w/c/io/fgets
> > 
> > Quote: If the failure has been caused by some other error, sets the
> > /error/ indicator (see ferror()
> > ) on |stream|. The contents
> > of the array pointed to by |str| are indeterminate (it may not even be
> > null-terminated).
>
> That has nothing to do with the return value of fgets().
>
Hi, Peter:
Here is how the return value from pclose() is handled in other places:

+   if (pclose_rc != 0)
+   {
+   ereport(ERROR,

The above is very easy to understand, while the check in `adjust_data_dir` is
somewhat harder to comprehend.

I think the formulation presented in patch v1 aligns with existing checks of
the return value from pclose().
It also gives a unique error message in the case that the return value from
pclose() indicates an error.
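
Roughly the shape I have in mind (a sketch only, not necessarily the exact v1
patch; fd, cmd, filename and progname are the variables already used in
adjust_data_dir):

    fd = popen(cmd, "r");
    if (fd == NULL || fgets(filename, sizeof(filename), fd) == NULL)
    {
        write_stderr(_("%s: could not determine the data directory using command \"%s\"\n"),
                     progname, cmd);
        exit(1);
    }
    if (pclose(fd) != 0)
    {
        /* separate check, so a failing command gets its own message */
        write_stderr(_("%s: command \"%s\" failed\n"), progname, cmd);
        exit(1);
    }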

Cheers


Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

2022-11-16 Thread Mats Kindahl
Hello all,

I think the discussion went a little sideways, so let me recap what I'm
suggesting:

   1. I mentioned that there is a missing callback when the filenode is
   unlinked and this is particularly evident when dropping a table.
   2. It was correctly pointed out to me that an implementor needs to ensure
   that dropping a table is transactional.
   3. I argued that the callback is still correct and outlined how this can
   be handled by a table access method using xact callbacks, if necessary.

I see huge potential in the table access method and would like to do my
part in helping it succeed. I noted that the API is biased toward how the
existing heap implementation works and is also very focused on
implementations of "local" storage engines. For more novel architectures
(for example, various sorts of distributed architectures), and to make the
API easier to work with, I think it can be improved in a few places. This is
a first step toward making the API both easier to use and able to support
more novel use-cases.

Ease of writing implementations is mostly about having the right callbacks
in the right places and providing the right information, but in some cases
it is simply not possible to implement efficient functionality with the
current interface. It can be useful to separate these two kinds of
enhancements when discussing the API, but I think both are important for
table access methods to be practically usable and to leverage the full
power of this concept.

On Tue, Aug 2, 2022 at 1:44 AM Andres Freund  wrote:

> Hi,
>
> On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
> >2. In the storage layer, the function RelationDropStorage is called,
> >which will record the table to be dropped in the pendingDeletes
> >
> > When committing (or aborting) the transaction, there are two calls that
> are
> > interesting, in this order:
> >
> >1. CallXactCallbacks which calls registered callbacks
> >2. smgrDoPendingDeletes, which calls the storage layer directly to
> >perform the actual deletion, if necessary.
> >
> > Now, suppose that we want to replace the storage layer with a different
> > one. It is straightforward to replace it by implementing the Table AM
> > methods above, but we are missing a callback on dropping the table. If we
> > have that, we can record the table-to-be-dropped in a similar manner to
> how
> > the heap AM does it and register a transaction callback using
> > RegisterXactCallback.
>
>   I don't think implementing dropping relation data at-commit/rollback using
> xact callbacks can be correct. The dropping needs to be integrated with the
> commit / abort records, so it is redone during crash recovery - that's not
> possible with xact callbacks.
>

Yes, but this patch is about making the extension aware that a file node is
being unlinked.


> To me it still seems fundamentally the wrong direction to implement a "drop
> relation callback" tableam callback.
>

This is not really a "drop table" callback; it is just the most obvious case
where the callback is missing. So, just to recap the situation as it looks
right now.

Here is (transactional) truncate table:

   1. Allocate a new file node in the same tablespace as the table
   2. Add the file node to the list of pending nodes to delete
   3. Overwrite the existing file node in the relation with the new one
   4. Call table_relation_set_new_filenode to tell the extension that there is
   a new filenode for the relation

Here is drop table:

   1. Add the existing file node to the list of pending deletes
   2. Remove the table from the catalogs

For an extension writer, the disappearance of the old file node is
"invisible" since there is no callback about this, but it is very clear
when a new file node is allocated. In addition to being inconsistent, it
adds an extra burden on the extension writer. To notice that a file node
has been unlinked you can register a transaction handler and investigate
the pending list at commit or abort time. Even though possible, there are
two problems with this: 1) the table access method is notified "late" in
the transaction that the file node is going away, and 2) it is
unnecessarily complicated to register a transaction handler only for
inspecting this.
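
To make the second point concrete, here is a rough sketch of that workaround
(hypothetical extension code, not part of the attached patch):

#include "postgres.h"
#include "access/xact.h"
#include "fmgr.h"
#include "nodes/pg_list.h"

PG_MODULE_MAGIC;

/* hypothetical bookkeeping of file nodes we expect to be unlinked */
static List *pending_unlinks = NIL;

static void
my_am_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
    {
        /*
         * Only here, at end of transaction, do we learn that the file
         * nodes recorded earlier are really going away; release any
         * external resources tied to them now.
         */
        list_free(pending_unlinks);
        pending_unlinks = NIL;
    }
}

void
_PG_init(void)
{
    RegisterXactCallback(my_am_xact_callback, NULL);
}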

Telling the access method that the filenode is unlinked by adding a
callback is by far the best solution since it does not affect existing
extensions and will give the table access methods opportunities to act on
it immediately.

I have attached an updated patch that changes the names of the callbacks
since there was a name change. I had also missed the case of unlinking a
file node when tables were truncated, so I have added a callback for this
as well.

Best wishes,
Mats Kindahl


> Greetings,
>
> Andres Freund
>
From 5023c5d9deea7008e4b74f02564262714fb8c6b6 Mon Sep 17 00:00:00 2001
From: Mats Kindahl 
Date: Mon, 31 Oct 2022 14:27:03 +0100
Subject: [PATCH] Add callback for unlinking file node of table

When dropping or 

Re: [PoC] configurable out of disk space elog level

2022-11-16 Thread Pavel Borisov
Hi, Maxim!
> My proposal is to add a tablespace option in order to be able to configure 
> which behaviour is appropriate for a
> particular user. I've decided to call this option “on_no_space” for now. If 
> anyone has a better naming for this feature,
> please, report.
>
> So, the idea is to add both GUC and tablespace option “on_no_space”. The 
> tablespace option defines the behaviour of the
> cluster for a particular tablespace in “on_no_space” situation. The GUC 
> defines the default value of tablespace option.

I suppose there can be a kind of attack with this feature, i.e.:

- If someone already has his own tablespace, he can do:
ALTER TABLESPACE my SET on_no_space=fatal; // This needs tablespace
ownership, not superuser permission.
- Then fill up his own database with garbage until his tablespace is full.
- Then the whole cluster will go FATAL, even though the other users'
tablespaces still have plenty of free space.

If this can be avoided, I think the patch can be useful.

Regards,
Pavel Borisov.




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-11-16 Thread Maxim Orlov
On Tue, 15 Nov 2022 at 13:02, Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Fri, Nov 11, 2022 at 5:52 PM Maxim Orlov  wrote:
>
> > But, in my view, some improvements may be proposed. We should be more,
> let's say, cautious (or paranoid if you wish),
> > in pg_walfile_offset_lsn while dealing with user input values. What I
> mean by that is:
> >  - to init the input vars of sscanf, i.e. tli, log and seg;
> >  - check the return value of sscanf call;
> >  - check filename max length.
>
> IsXLogFileName() will take care of this. Also, I've added a new inline
> function XLogIdFromFileName() that parses file name and returns log
> and seg along with XLogSegNo and timeline id. This new function avoids
> an extra sscanf() call as well.
>
> > Another question arises for me: can this function be called during
> recovery? If not, we should ereport about this, shouldn't we?
>
> It's just a utility function and doesn't depend on any of the server's
> current state (unlike pg_walfile_name()), hence it doesn't matter if
> this function is called during recovery.
>
> > And one last note here: pg_walfile_offset_lsn accepts NULL values
> and returns NULL in this case. From a theoretical
> > point of view, this is perfectly fine. Actually, I think this is exactly
> how it is supposed to be, but I'm not sure if there are other opinions here.
>
> These functions are marked as 'STRICT', meaning a null is returned,
> without even calling the function, if any of the input parameters is
> null. I think we can keep the behaviour the same as its friends.
>

Thanks for the explanations. I think you are right.


> >errmsg("offset must not be negative or greater than or equal to the
> WAL segment size")));
>
> Changed.
>

Confirmed. And a timeline_id is added.


> While on this, I noticed that the pg_walfile_name_offset() isn't
> covered in tests. I took an opportunity and added a simple test case
> along with pg_walfile_offset_lsn().
>
> I'm attaching the v3 patch set for further review.
>

Great job! Shall we mark this patch as RFC?

-- 
Best regards,
Maxim Orlov.


Re: Meson add host_system to PG_VERSION_STR

2022-11-16 Thread Juan José Santamaría Flecha
On Wed, Nov 16, 2022 at 10:50 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 16.11.22 01:01, Michael Paquier wrote:
> >
> > The meson build provides extra_version, which would be able to do the
> > same, no?  The information would be appended to PG_VERSION_STR through
> > PG_VERSION.
>
> I think this is meant to achieve parity between the version strings
> generated by configure and by meson.
>
> Perhaps some examples before and after on different platforms could be
> shown.
>

Yes, that would make clear what the patch is trying to do. For version() we
get:

Configure:
 PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Debian
6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

Meson:
 PostgreSQL 16devel on x86_64, compiled by gcc-6.3.0

Patched:
 PostgreSQL 16devel on x86_64-linux, compiled by gcc-6.3.0

Regards,

Juan José Santamaría Flecha


Re: Commit fest 2022-11

2022-11-16 Thread Ian Lawrence Barwick
2022年11月14日(月) 22:38 Ian Lawrence Barwick :
>
> 2022年11月14日(月) 22:23 James Coleman :
> >
> > On Mon, Nov 14, 2022 at 7:08 AM Ian Lawrence Barwick  
> > wrote:
> > >
> > > 2022年11月9日(水) 8:12 Justin Pryzby :
> > 
> > > > If my script is not wrong, these patches add TAP tests, but don't update
> > > > the requisite ./meson.build file.  It seems like it'd be reasonable to
> > > > set them all as WOA until that's done.
> > > >
> > > > $ for a in `git branch -a |sort |grep commitfest/40`; do : echo 
> > > > "$a..."; x=`git log -1 --compact-summary "$a"`; echo "$x" |grep 
> > > > '/t/.*pl.*new' >/dev/null || continue; echo "$x" |grep -Fw meson 
> > > > >/dev/null && continue; git log -1 --oneline "$a"; done
> > > > ... [CF 40/3558] Allow file inclusion in pg_hba and pg_ident files
> > > > ... [CF 40/3628] Teach pg_waldump to extract FPIs from the WAL stream
> > > > ... [CF 40/3646] Skip replicating the tables specified in except table 
> > > > option
> > > > ... [CF 40/3663] Switching XLog source from archive to streaming when 
> > > > primary available
> > > > ... [CF 40/3670] pg_rewind: warn when checkpoint hasn't happened after 
> > > > promotion
> > > > ... [CF 40/3729] Testing autovacuum wraparound
> > > > ... [CF 40/3877] vacuumlo: add test to vacuumlo for test coverage
> > > > ... [CF 40/3985] TDE key management patches
> > >
> > > Looks like your script is correct, will update accordingly.
> > >
> >
> > It's possible this has been discussed before, but it seems less than
> > ideal to have notifications about moving to WOA be sent only in a bulk
> > email hanging off the "current CF" email chain as opposed to being
> > sent to the individual threads. Perhaps that's something the app
> > should do for us in this situation. Without that though the patch
> > authors are left to wade through unrelated discussion, and, probably
> > more importantly, the patch discussion thread doesn't show the current
> > state (I think bumping there is more likely to prompt activity as
> > well).
>
> FWIW I've been manually "bumping" the respective threads, which is somewhat
> time-consuming but seems to have been quite productive in terms of getting
> patches updated.
>
> Will do same for the above once I've confirmed what is being requested,
> (which I presume is adding the new tests to the 'tests' array in the 
> respective
> "meson.build" file; just taking the opportunity to familiariize myself with 
> it).

Various mails since sent to the appropriate threads; I took the opportunity
to create a short wiki page:

   https://wiki.postgresql.org/wiki/Meson_for_patch_authors

with relevant details (AFAICS) for anyone not familiar with meson; corrections/
improvements welcome.

In the meantime I notice a number of patches in cfbot are currently failing on
TAP test "test_custom_rmgrs/t/001_basic.pl" in some environments. This was
added the other day in commit ae168c794f; there is already a fix for the issue
( 36e0358e70 ) but the cfbot doesn't have that commit yet.

Regards

Ian Barwick




[PoC] configurable out of disk space elog level

2022-11-16 Thread Maxim Orlov
Hi!


PROBLEM

Our customer stumbled onto the following behaviour of a Postgres cluster: if
disk space is exhausted, Postgres continues to work as long as WAL can still
be written successfully. Thus, upon disk space exhaustion, clients will get
“ERROR: could not extend file “base/X/X”: No space left on device”
messages and transactions will be aborted, but the cluster continues to work
for quite some time.

This behaviour of PostgreSQL is, of course, perfectly legitimate. The cluster
just passes the OS error on to the user and can do nothing about it,
expecting that space may become available later.

On the other hand, users continue to send more data and get more and more
transactions aborted.

There are several possible ways to diagnose the described situation:
 —external monitoring system;
 —log analysis;
 —create/drop table and analyse results.

Each one has advantages and disadvantages; I'm not going to dive deeper into
them here, if you don't mind.

In this particular case, the customer mentioned above would be glad to have a
mechanism to stop the cluster. Again, only in this concrete case.


PROPOSAL

My proposal is to add a tablespace option so that the appropriate behaviour
can be configured for a particular user. I've decided to call this option
“on_no_space” for now. If anyone has a better name for this feature, please
let me know.

So, the idea is to add both a GUC and a tablespace option “on_no_space”. The
tablespace option defines the behaviour of the cluster for a particular
tablespace in an out-of-space situation. The GUC defines the default value of
the tablespace option.

The patch, posted as a PoC, is attached.

Here's what it looks like:
===
== Create 100Mb disk
$ dd if=/dev/zero of=/tmp/foo.img bs=100M count=1
$ mkfs.ext4 /tmp/foo.img
$ mkdir /tmp/foo
$ sudo mount -t ext4 -o loop /tmp/foo.img /tmp/foo
$ sudo chown -R orlov:orlov /tmp/foo
===
== Into psql
postgres=# CREATE TABLESPACE foo LOCATION '/tmp/foo' WITH
(on_no_space=fatal);
CREATE TABLESPACE
postgres=# \db+
                                List of tablespaces
 Name | Owner | Location | Access privileges |       Options       |  Size   | Description
------+-------+----------+-------------------+---------------------+---------+-------------
 foo  | orlov | /tmp/foo |                   | {on_no_space=fatal} | 0 bytes |
...

postgres=# CREATE TABLE bar(qux int, quux text) WITH (autovacuum_enabled =
false) TABLESPACE foo;
CREATE TABLE
postgres=# INSERT INTO bar(qux, quux) SELECT id, md5(id::text) FROM
generate_series(1, 1000) AS id;
FATAL:  could not extend file "pg_tblspc/16384/PG_16_202211121/5/16385": No
space left on device
HINT:  Check free disk space.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
===


CAVEATS

Again, I've posted this patch as a PoC. It is not a complete implementation
of the described functionality. AFAICS, there are the following problems:
 - I had to put a get_tablespace_elevel call in RelationGetBufferForTuple in
   order to get the tablespace into the cache; otherwise, a cache miss in
   get_tablespace triggers an assertion failure in lock.c:887
   (Assert("!IsRelationExtensionLockHeld")). This assertion was added by
   commit 15ef6ff4 (see [0] for details).
 - What should the error level be when mdextend is called for something other
   than inserting a tuple into a heap (WAL replay, for example)?

Maybe adding just the GUC, without the ability to customize the "out of disk
space" behaviour for individual tablespaces, is enough?
I would appreciate your opinions on the subject.

-- 
Best regards,
Maxim Orlov.

[0]
https://www.postgresql.org/message-id/flat/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B%2BSs%3DGn1LA%40mail.gmail.com


v1-0001-Add-out-of-disk-space-elog-level.patch
Description: Binary data


Re: out of memory in crosstab()

2022-11-16 Thread Joe Conway

On 11/16/22 02:47, Amit Langote wrote:

A customer seems to have run into $subject.  Here's a reproducer they shared:



With the following logged:

LOG:  server process (PID 121846) was terminated by signal 9: Killed


That's the Linux OOM killer. Was this running in a container or under 
systemd with memory.limit_in_bytes set? If so, perhaps they need a 
higher setting.




The problem seems to be spi_printtup() continuing to allocate memory
to expand _SPI_current->tuptable to store the result of crosstab()'s
input query that's executed using:

 /* Retrieve the desired rows */
 ret = SPI_execute(sql, true, 0);

Note that this asks SPI to retrieve and store *all* result rows of the
query in _SPI_current->tuptable, and if there happen to be so many
rows, as in the case of above example, spi_printtup() ends up asking
for a bit too much memory.


check


The easiest fix for this seems to be for crosstab() to open a
cursor (SPI_cursor_open) and fetch the rows in batches
(SPI_cursor_fetch) rather than all in one go.  I have implemented that
in the attached.  Maybe the patch should address other functions that
potentially have the same problem.


Seems reasonable. I didn't look that closely at the patch, but I do 
think that there needs to be some justification for the selected batch 
size, and/or it should be made configurable.
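
For context, a minimal sketch of the cursor/batch pattern being discussed (not
the actual patch; the fetch count is a placeholder, which is exactly the number
that needs justifying, and SPI_connect() is assumed to have been called
already):

#include "postgres.h"
#include "executor/spi.h"

#define CROSSTAB_FETCH_COUNT 1000   /* placeholder batch size */

static void
scan_source_query(const char *sql)
{
    Portal      portal;

    portal = SPI_cursor_open_with_args("crosstab_src", sql,
                                       0, NULL, NULL, NULL,
                                       true, 0);
    for (;;)
    {
        SPI_cursor_fetch(portal, true, CROSSTAB_FETCH_COUNT);
        if (SPI_processed == 0)
            break;

        for (uint64 i = 0; i < SPI_processed; i++)
        {
            HeapTuple   tuple = SPI_tuptable->vals[i];

            /* process one source row here */
            (void) tuple;
        }
        /* free this batch before fetching the next one */
        SPI_freetuptable(SPI_tuptable);
    }
    SPI_cursor_close(portal);
}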



I also wondered about fixing this by making _SPI_current->tuptable use
a tuplestore that can spill to disk as its backing store rather than a
plain C HeapTuple array, but haven't checked how big of a change that
would be; SPI_tuptable is referenced in many places across the tree.
Though I suspect that idea has enough merits to give that a try
someday.


Seems like a separate patch at the very least


Thoughts on whether this should be fixed and the fix be back-patched?


-1 on backpatching -- this is not a bug, and the changes are non-trivial

Joe

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




