Re: Rethinking the implementation of ts_headline()

2023-01-17 Thread Alvaro Herrera
On 2023-Jan-16, Tom Lane wrote:

> I get this with the patch:
> 
>      ts_headline     |     ts_headline     
> ---------------------+---------------------
>  {ipsum} ... {labor} | {ipsum} ... {labor}
> (1 row)
> 
> which is what I'd expect, because it removes the artificial limit on
> cover length that I added in 78e73e875.  So I'm wondering why you got a
> different result.

Indeed, that's what I get now, too, after re-applying the patches.
I find no way to reproduce the bogus result I got yesterday.

Sorry for the noise.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801




Re: Exit walsender before confirming remote flush in logical replication

2023-01-17 Thread Amit Kapila
On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > In logical replication, apply precedes write and flush, so we have no
> > indication whether a record is "replicated" to the standby other than the
> > apply LSN. On the other hand, logical replication has no business with
> > switchover, so that assurance is useless. Thus I think we can
> > (practically) ignore apply_lsn at shutdown. It seems subtly
> > irregular, though.
>
> Another consideration is that the condition (!pq_is_send_pending()) ensures
> that there are no pending messages, including other packets. Currently we
> force walsenders to clean up all messages before shutting down, even if it
> is a keepalive one. I cannot think of any problems caused by this, but we
> can keep the condition in the case of logical replication.
>

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) the walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long delay has been
configured for applying the WAL (we don't yet have such a feature for
logical replication, but a discussion of it is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose only job is to flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying, it can get stuck for a long time waiting for some
heavy-weight lock to be released by some other long-running
transaction in a backend. Similarly, if the user has configured a
large value for time-delayed apply, it can lead to the network buffer
between the walsender and the receiver process filling up.

The condition that makes the shutdown wait for all WAL to be sent has
two parts: (a) it confirms that there is no pending WAL to be sent;
(b) it confirms that all the WAL sent has been flushed by the client. As
per our understanding, both these conditions exist to allow a clean
switchover/failover, which seems to be useful only for physical
replication. Logical replication doesn't provide such
functionality.

The proposed patch tries to eliminate condition (b) for logical
replication in the hope that this will allow the shutdown to
complete in most cases. No specific reason has been discussed for not
doing the same for (a) in logical replication.
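
To make the two parts concrete, here is a minimal standalone sketch of the
gate being discussed; the function and parameter names are illustrative
only and this is not the actual walsender.c code:

#include <stdbool.h>

/*
 * Illustrative sketch only: the shutdown gate has two parts, and the
 * proposal is to drop part (b) for logical walsenders while keeping (a).
 */
static bool
ready_to_exit_on_shutdown(bool output_buffer_empty,           /* condition (a) */
                          bool client_flushed_all_sent_wal,   /* condition (b) */
                          bool is_logical_walsender)
{
    if (!output_buffer_empty)
        return false;           /* (a) applies to both kinds of walsender */

    if (!is_logical_walsender && !client_flushed_all_sent_wal)
        return false;           /* (b) only matters for a clean switchover */

    return true;
}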

Now, to proceed here we have the following options: (1) Fix (b) as
proposed by the patch and document the risks related to (a); (2) Fix
both (a) and (b); (3) Do nothing and document that users need to
unblock the subscribers to complete the shutdown.

Thoughts?

-- 
With Regards,
Amit Kapila.




RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread houzj.f...@fujitsu.com
On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada  
wrote:
> 
> On Tue, Jan 17, 2023 at 12:37 PM houzj.f...@fujitsu.com
>  wrote:
> > Attach the new version 0001 patch which addressed all other comments.
> >
> 
> Thank you for updating the patch. Here is one comment:
> 
> @@ -426,14 +427,24 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> 
>  			/*
>  			 * Show the leader only for active parallel workers.  This
> -			 * leaves the field as NULL for the leader of a parallel
> -			 * group.
> +			 * leaves the field as NULL for the leader of a parallel group
> +			 * or the leader of parallel apply workers.
>  			 */
>  			if (leader && leader->pid != beentry->st_procpid)
>  			{
>  				values[28] = Int32GetDatum(leader->pid);
>  				nulls[28] = false;
>  			}
> +			else
> +			{
> +				int			leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> +
> +				if (leader_pid != InvalidPid)
> +				{
> +					values[28] = Int32GetDatum(leader_pid);
> +					nulls[28] = false;
> +				}
> +			}
>  			}
> 
> I'm slightly concerned that there could be overhead of executing
> GetLeaderApplyWorkerPid () for every backend process except for parallel
> query workers. The number of such backends could be large and
> GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> sense to check (st_backendType == B_BG_WORKER) before calling
> GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> LogicalRepWorkerLock which is not likely to be contended.

Thanks for the comment and I think your suggestion makes sense.
I have added the check before getting the leader pid. Here is the new version 
patch.
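
Roughly, with the added check the quoted hunk takes the following shape (a
simplified sketch only; please see the attached patch for the real
pg_stat_get_activity() change):

    else if (beentry->st_backendType == B_BG_WORKER)
    {
        int         leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);

        if (leader_pid != InvalidPid)
        {
            values[28] = Int32GetDatum(leader_pid);
            nulls[28] = false;
        }
    }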

Best regards,
Hou zj


v83-0001-Display-the-leader-apply-worker-s-PID-for-parall.patch
Description:  v83-0001-Display-the-leader-apply-worker-s-PID-for-parall.patch


RE: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread houzj.f...@fujitsu.com
On Tuesday, January 17, 2023 12:34 PM shveta malik  
wrote:
> 
> On Tue, Jan 17, 2023 at 9:07 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, January 17, 2023 11:32 AM Peter Smith
>  wrote:
> > > OK. I didn't know there was another header convention that you were
> > > following.
> > > In that case, it is fine to leave the name as-is.
> >
> > Thanks for confirming!
> >
> > Attach the new version 0001 patch which addressed all other comments.
> >
> > Best regards,
> > Hou zj
> 
> Hello Hou-san,
> 
> 1. Do we need to extend test-cases to review the leader_pid column in pg_stats
> tables?

Thanks for the comments.

We currently don't have any tests for the view, so I feel we can extend
them later as a separate patch.

> 2. Do we need to follow the naming convention for
> 'GetLeaderApplyWorkerPid' like other functions in the same file which starts
> with 'logicalrep_'

We have agreed [1] to follow the naming convention for functions in 
logicallauncher.h
which are mainly used for other modules.

[1] 
https://www.postgresql.org/message-id/CAHut%2BPtgj%3DDY8F1cMBRUxsZtq2-faW%3D%3D5-dSuHSPJGx1a_vBFQ%40mail.gmail.com

Best regards,
Hou zj


Re: [PATCH] Add native windows on arm64 support

2023-01-17 Thread Niyas Sait



On 16/12/2022 10:52, Niyas Sait wrote:

On 12/12/2022 23:56, Michael Paquier wrote:

On Mon, Dec 12, 2022 at 01:38:37PM +, Niyas Sait wrote:

On 05/12/2022 18:14, Andres Freund wrote:
I think the old build system specific part is really minimal in the
patch. I can strip out those if that's preferred.

Removing all the changes from src/tools/msvc/ is an approach that
works for me.



I've attached a new version (v7) that removes changes to support old 
MSVC build system.



Hello,

Can I get a review for the latest patch, please?

--
Niyas




Re: Asynchronous and "direct" IO support for PostgreSQL.

2023-01-17 Thread Wenjing Zeng


> On Sep 1, 2021, at 13:56, Andres Freund  wrote:
> 
> Hi,
> 
> Attached is an updated patch AIO series. The major changes are:
> - rebased onto master (Andres)
> - lots of progress on posix AIO backend (Thomas)
> - lots of progress towards a windows native AIO implementation - not yet quite
>  merged (Thomas & David)
> - considerably improved "worker" io_method (Thomas)
> - some preliminary patches merged (Thomas) and thus dropped
> - error handling overhaul, AIO references now use resource owners
> - quite a few more localized bugfixes
> - further CI improvements
> 
> Unfortunately there's a few tests that don't pass on windows. At least some of
> those failures also happen on master - hence the alternative output file added
> in the last commit.
> 
> Thanks to Thomas there's now a new wiki page for AIO support:
> https://wiki.postgresql.org/wiki/AIO
> It's currently mostly a shared todo list
> 
> My own next steps are to try to get some of the preliminary patches merged
> into master, and to address some of the higher level things that aren't yet
> quite right with the AIO interface, and to split the "main" AIO patch into
> smaller patches.
> 
> I hope that we soon send in a new version with native AIO support for
> windows. I'm mostly interested in that to make sure that we get the shared
> infrastructure right.
> 
> Melanie has some work improving bitmap heap scan AIO support and some IO stats
> / explain improvements.
> 
> I think a decent and reasonably simple example for the way the AIO interface
> can be used to do faster IO is
> v3-0028-aio-Use-AIO-in-nbtree-vacuum-scan.patch.gz which adds AIO for nbtree
> vacuum. It's not perfectly polished, but I think it shows that it's not too
> hard to add AIO usage to individual once the general infrastructure is in
> place.
> 
> I've attached the code for posterity, but the series is large enough that I
> don't think it makes sense to do that all that often... The code is at
> https://github.com/anarazel/postgres/tree/aio

Hi Andres,

I noticed this feature and did some testing.
In the code on GitHub's aio branch, the function

static void
pgaio_write_smgr_retry(PgAioInProgress *io)
{
    uint32 off;
    AioBufferTag *tag = &io->scb_data.write_smgr.tag;
    SMgrRelation reln = smgropen(tag->rlocator.locator, tag->rlocator.backend);

    io->op_data.read.fd = smgrfd(reln, tag->forkNum, tag->blockNum, &off);
    Assert(off == io->op_data.read.offset);
}

seems like it should be:

    io->op_data.write.fd = smgrfd(reln, tag->forkNum, tag->blockNum, &off);
    Assert(off == io->op_data.write.offset);


Best regards,
Wenjing

> 
> Greetings,
> 
> Andres Freund
> 



Cross-partition UPDATE and foreign table partitions

2023-01-17 Thread Antonin Houska
I was wondering why ExecCrossPartitionUpdateForeignKey() has an unused
argument "oldslot" and wanted to suggest its removal. However, before I did,
it occurred to me that callers may want to pass the whole slot when the
partition is a foreign table, i.e. when the "tupleid" argument cannot be
used. (In that case the problem would be that the function implementation is
incomplete.)

However, when checking how cross-partition UPDATE works internally for foreign
tables, I saw surprising behavior. The attached script creates partitioned
table "a" with foreign table partitions "a1" and "a2". If you then run the
following commands

INSERT INTO a VALUES (1), (10);
UPDATE a SET i=11 WHERE i=1;
TABLE a1;

you'll see that the tuples are correctly routed into the partitions, but the
UPDATE is simply executed on the "a1" partition. Instead, I'd expect it to
delete the tuple from "a1" and insert it into "a2". That looks like a bug.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

CREATE EXTENSION IF NOT EXISTS postgres_fdw WITH SCHEMA public;

CREATE SERVER s1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (
dbname 'postgres',
host 'localhost',
port '5432'
);

CREATE USER MAPPING FOR CURRENT_ROLE SERVER s1;

CREATE TABLE public.a (
i integer NOT NULL
)
PARTITION BY RANGE (i);


CREATE TABLE public.a1 (
i integer NOT NULL
);

CREATE FOREIGN TABLE public.a1_loc (
i integer NOT NULL
)
SERVER s1
OPTIONS (
table_name 'a1'
);

CREATE TABLE public.a2 (
i integer NOT NULL
);

CREATE FOREIGN TABLE public.a2_loc (
i integer NOT NULL
)
SERVER s1
OPTIONS (
table_name 'a2'
);

ALTER TABLE ONLY public.a ATTACH PARTITION public.a1_loc FOR VALUES FROM (0) TO 
(10);
ALTER TABLE ONLY public.a ATTACH PARTITION public.a2_loc FOR VALUES FROM (10) 
TO (20);

ALTER TABLE ONLY public.a1
ADD CONSTRAINT a1_pkey PRIMARY KEY (i);

ALTER TABLE ONLY public.a2
ADD CONSTRAINT a2_pkey PRIMARY KEY (i);



Update comments in multixact.c

2023-01-17 Thread shiy.f...@fujitsu.com
Hi,

I noticed that commit 5212d447fa updated some comments in multixact.c because
SLRU truncation for multixacts is performed during VACUUM, instead of
checkpoint. Should the following comments which mentioned checkpointer be
changed, too?

1.
* we compute it (using nextMXact if none are valid).  Each backend is
* required not to attempt to access any SLRU data for MultiXactIds older
* than its own OldestVisibleMXactId[] setting; this is necessary because
* the checkpointer could truncate away such data at any instant.

2.
 * We set the OldestVisibleMXactId for a given transaction the first time
 * it's going to inspect any MultiXactId.  Once we have set this, we are
 * guaranteed that the checkpointer won't truncate off SLRU data for
 * MultiXactIds at or after our OldestVisibleMXactId.

Regards,
Shi yu




Re: minor bug

2023-01-17 Thread Laurenz Albe
On Mon, 2023-01-16 at 19:59 +0100, Torsten Förtsch wrote:
> not sure if this is known behavior.
> 
> Server version is 14.6 (Debian 14.6-1.pgdg110+1).
> 
> In a PITR setup I have these settings:
> 
> recovery_target_xid = '852381'
> recovery_target_inclusive = 'false'
> 
> In the log file I see this message:
> 
> LOG:  recovery stopping before commit of transaction 852381, time 2000-01-01 
> 00:00:00+00
> 
> But:
> 
> postgres=# select * from pg_last_committed_xact();
>   xid   |           timestamp           | roident 
> --------+-------------------------------+---------
>  852380 | 2023-01-16 18:00:35.054495+00 |       0
> 
> So, the timestamp displayed in the log message is certainly wrong.

Redirected to -hackers.

If recovery stops at a WAL record that has no timestamp, you get this
bogus recovery stop time.  I think we should show the recovery stop time
only if time was the target, as in the attached patch.

Yours,
Laurenz Albe
From 622e52bbd652fc8872448e46c3ca0bc78dd847fe Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Tue, 17 Jan 2023 10:38:40 +0100
Subject: [PATCH] Don't show bogus recovery stop time
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If target for archive recovery was different from time, the
WAL record that ended recovery might not have a timestamp.
In that case, the log message at the end of recovery would
show a recovery time of midnight on 2000-01-01.

Reported-by: Torsten Förtsch
Author: Laurenz Albe
Discussion: https://postgr.es/m/cakkg4_kuevpqbmyoflajx7opaqk6cvwkvx0hrcfjspfrptx...@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index bc3c3eb3e7..ce3718ca2d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2570,19 +2570,21 @@ recoveryStopsBefore(XLogReaderState *record)
 		recoveryStopLSN = InvalidXLogRecPtr;
 		recoveryStopName[0] = '\0';
 
-		if (isCommit)
+		/* for targets other than time, we might have no stop time */
+		if (recoveryTarget == RECOVERY_TARGET_XID)
 		{
 			ereport(LOG,
-	(errmsg("recovery stopping before commit of transaction %u, time %s",
+	(errmsg("recovery stopping before %s of transaction %u, time %s",
+			(isCommit ? "commit" : "abort"),
 			recoveryStopXid,
 			timestamptz_to_str(recoveryStopTime))));
 		}
 		else
 		{
 			ereport(LOG,
-	(errmsg("recovery stopping before abort of transaction %u, time %s",
-			recoveryStopXid,
-			timestamptz_to_str(recoveryStopTime))));
+	(errmsg("recovery stopping before %s of transaction %u",
+			(isCommit ? "commit" : "abort"),
+			recoveryStopXid)));
 		}
 	}
 
-- 
2.39.0



Re: Cross-partition UPDATE and foreign table partitions

2023-01-17 Thread Antonin Houska
Antonin Houska  wrote:

> I was wondering why ExecCrossPartitionUpdateForeignKey() has an unused
> argument "oldslot" and wanted to suggest its removal. However, before I did,
> it occurred to me that callers may want to pass the whole slot when the
> partition is a foreign table, i.e. when the "tupleid" argument cannot be
> used. (In that case the problem would be that the function implementation is
> incomplete.)
> 
> However, when checking how cross-partition UPDATE works internally for foreign
> tables, I saw surprising behavior. The attached script creates partitioned
> table "a" with foreign table partitions "a1" and "a2". If you then run the
> following commands
> 
> INSERT INTO a VALUES (1), (10);
> UPDATE a SET i=11 WHERE i=1;
> TABLE a1;
> 
> you'll see that the tuples are correctly routed into the partitions, but the
> UPDATE is simply executed on the "a1" partition. Instead, I'd expect it to
> delete the tuple from "a1" and insert it into "a2". That looks like a bug.

Well, as it usually happens, I found a related information as soon as I had
sent a report. The documentation of CREATE FOREIGN TABLE says:

"However it is not currently possible to move a row from a foreign-table
partition to another partition. An UPDATE that would require doing that will
fail due to the partitioning constraint, assuming that that is properly
enforced by the remote server."

So the remaining question is whether the "oldslot" argument of
ExecCrossPartitionUpdateForeignKey() will be used in the future or should be
removed. Note that the ExecUpdateAct() passes its "slot" variable for it,
which seems to contain the *new* version of the tuple rather than the
old. Some cleanup may be needed.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Logical replication timeout problem

2023-01-17 Thread Amit Kapila
On Mon, Jan 16, 2023 at 10:06 PM Ashutosh Bapat
 wrote:
>
> On Wed, Jan 11, 2023 at 4:11 PM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Mon, Jan 9, 2023 at 13:04 PM Amit Kapila  wrote:
> > >
> >
> > Thanks for your comments.
> >
> > > One more thing, I think it would be better to expose a new callback
> > > API via reorder buffer as suggested previously [2] similar to other
> > > reorder buffer APIs instead of directly using reorderbuffer API to
> > > invoke plugin API.
> >
> > Yes, I agree. I think it would be better to add a new callback API on the
> > HEAD. So, I improved the fix approach:
> > Introduce a new optional callback to update the process. This callback
> > function is invoked at the end inside the main loop of the function
> > ReorderBufferProcessTXN() for each change. In this way, I think it seems
> > that similar timeout problems could be avoided.
>
> I am a bit worried about the indirections that the wrappers and hooks
> create. Output plugins call OutputPluginUpdateProgress() in callbacks
> but I don't see why  ReorderBufferProcessTXN() needs a callback to
> call OutputPluginUpdateProgress.
>

Yeah, I think we can do it as we are doing the previous approach but
we need an additional wrapper (update_progress_cb_wrapper()) as the
current patch has so that we can add error context information. This
is similar to why we have a wrapper for all other callbacks like
change_cb_wrapper.
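
The wrapper pattern itself is small; here is a standalone sketch of it
(install error-context information, delegate to the plugin callback,
restore), with illustrative names only, not the actual logical.c code:

#include <stddef.h>

typedef void (*update_progress_cb) (void *ctx);

struct errcontext_frame
{
    const char *callback_name;
    struct errcontext_frame *previous;
};

static struct errcontext_frame *errcontext_stack = NULL;

static void
update_progress_cb_wrapper(void *ctx, update_progress_cb callback)
{
    struct errcontext_frame frame = {"update_progress", errcontext_stack};

    errcontext_stack = &frame;  /* error reports can now name the callback */
    callback(ctx);              /* the real work happens in the plugin */
    errcontext_stack = frame.previous;
}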

-- 
With Regards,
Amit Kapila.




Re: minor bug

2023-01-17 Thread Michael Paquier
On Tue, Jan 17, 2023 at 10:42:03AM +0100, Laurenz Albe wrote:
> If recovery stops at a WAL record that has no timestamp, you get this
> bogus recovery stop time.  I think we should show the recovery stop time
> only if time was the target, as in the attached patch.

Good catch!  I'll try to take a look.
--
Michael


signature.asc
Description: PGP signature


Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)

2023-01-17 Thread Alvaro Herrera
On 2022-Dec-12, Amit Langote wrote:

> On Sun, Dec 11, 2022 at 6:25 PM Amit Langote  wrote:
> > I've attached 0001 to remove those extraneous code blocks and add a
> > comment mentioning that userid need not be recomputed.
> >
> > While staring at the build_simple_rel() bit mentioned above, I
> > realized that this code fails to set userid correctly in the
> > inheritance parent rels that are child relations of subquery parent
> > relations, such as UNION ALL subqueries.  In that case, instead of
> > copying the userid (= 0) of the parent rel, the child should look up
> > its own RTEPermissionInfo, which should be there, and use the
> > checkAsUser value from there.  I've attached 0002 to fix this hole.  I
> > am not sure whether there's a way to add a test case for this in the
> > core suite.
> 
> Ah, I realized we could just expand the test added by 553d2ec27 with a
> wrapper view (to test checkAsUser functionality) and a UNION ALL query
> over the view (to test this change).

Hmm, but if I run this test without the code change in 0002, the test
also passes (to wit: the plan still has two hash joins), so I understand
that either you're testing something that's not changed by the patch,
or the test case itself isn't really what you wanted.

As for 0001, it seems simpler to me to put the 'userid' variable in the
same scope as 'onerel', and then compute it just once and don't bother
with it at all afterwards, as in the attached.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n".
It did not say "Hello New Jersey\n", nor "Hello USA\n".
gpg: Signature made Mon 16 Jan 2023 17:23:19 CET
gpg: using RSA key E2C96E4A9BCA7E92A8E3DA551C20ACB9D5C564AE
gpg: Good signature from "Álvaro Herrera " [ultimate]
gpg: aka "Álvaro Herrera (PostgreSQL Global Development Group) " [ultimate]
gpg: aka "Álvaro Herrera (2ndQuadrant) " [ultimate]
commit cdc09715305d98457c8240b579528b7835c39d58
Author: Alvaro Herrera  [Álvaro Herrera ]
AuthorDate: Sun Dec 11 18:12:05 2022 +0900
CommitDate: Mon Jan 16 17:23:19 2023 +0100

Remove some dead code in selfuncs.c

RelOptInfo.userid is the same for all relations in a given
inheritance tree, so the code in examine_variable() and
examine_simple_variable() that wants to repeat the ACL checks on
the root parent rel instead of a given leaf child relation need not
recompute userid too.

Author: Amit Langote
Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/20221210201753.ga27...@telsasoft.com

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 75bc20c7c9..0a5632699d 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -500,7 +500,6 @@ find_join_rel(PlannerInfo *root, Relids relids)
  *
  * Otherwise these fields are left invalid, so GetForeignJoinPaths will not be
  * called for the join relation.
- *
  */
 static void
 set_foreign_rel_properties(RelOptInfo *joinrel, RelOptInfo *outer_rel,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 57de51f0db..4e4888dde4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5080,6 +5080,18 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid,
 		 */
 		ListCell   *ilist;
 		ListCell   *slist;
+		Oid			userid;
+
+		/*
+		 * Determine the user ID to use for privilege checks: either
+		 * onerel->userid if it's set (e.g., in case we're accessing the table
+		 * via a view), or the current user otherwise.
+		 *
+		 * If we drill down to child relations, we keep using the same userid:
+		 * it's going to be the same anyway, due to how we set up the relation
+		 * tree (q.v. build_simple_rel).
+		 */
+		userid = OidIsValid(onerel->userid) ? onerel->userid : GetUserId();
 
 		foreach(ilist, onerel->indexlist)
 		{
@@ -5150,18 +5162,10 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid,
 			{
 /* Get index's table for permission check */
 RangeTblEntry *rte;
-Oid			userid;
 
 rte = planner_rt_fetch(index->rel->relid, root);
 Assert(rte->rtekind == RTE_RELATION);
 
-/*
- * Use onerel->userid if it's set, in case
- * we're accessing the table via a view.
- */
-userid = OidIsValid(onerel->userid) ?
-	onerel->userid : GetUserId();
-
 /*
  * For simplicity, we insist on the whole
  * table being selectable, rather than trying
@@ -5212,9 +5216,6 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid,
 		rte = planner_rt_fetch(varno, root);
 		Assert(rte->rtekind == RTE_RELATION);
 
-		userid = OidIsValid(onerel->userid) ?
-			onerel->userid : GetUserId();
-
 		vardata->acl_ok =
 			rte->securityQuals == NIL &&
 			(pg_class_aclc

Helper functions for wait_for_catchup() in Cluster.pm

2023-01-17 Thread Drouvot, Bertrand

Hi hackers,

please find attached a patch proposal to define $SUBJECT.

The idea has been raised in [1], where we are adding more calls to 
wait_for_catchup() in 'replay' mode.

The current code already has 25 of those, so the proposed patch defines a
new wait_for_replay_catchup() function.

While at it, it also adds:

- wait_for_write_catchup(): called 5 times
- wait_for_sent_catchup() and wait_for_flush_catchup(), for consistency
  purposes (although there are currently no occurrences of wait_for_catchup()
  in 'sent' or 'flush' mode).

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: 
https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5ipet%40awork3.anarazel.de

diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 52644c2c0d..0a7e2d4521 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -75,8 +75,8 @@ $node_a->safe_psql('postgres',
 $node_a->safe_psql('postgres', 'CHECKPOINT');
 
 my $lsn = $node_a->lsn('write');
-$node_a->wait_for_catchup('node_b', 'write', $lsn);
-$node_b->wait_for_catchup('node_c', 'write', $lsn);
+$node_a->wait_for_write_catchup('node_b', $lsn);
+$node_b->wait_for_write_catchup('node_c', $lsn);
 
 # Promote C
 #
@@ -160,7 +160,7 @@ in A, after C was promoted
 $node_a->safe_psql('postgres',
"INSERT INTO tbl1 values ('in A, after rewind')");
 
-$node_b->wait_for_catchup('node_c', 'replay', $node_a->lsn('write'));
+$node_b->wait_for_replay_catchup('node_c', $node_a->lsn('write'));
 
 check_query(
'SELECT * FROM tbl1',
diff --git a/src/test/modules/brin/t/02_wal_consistency.pl b/src/test/modules/brin/t/02_wal_consistency.pl
index 5983ef208e..8d1dc68194 100644
--- a/src/test/modules/brin/t/02_wal_consistency.pl
+++ b/src/test/modules/brin/t/02_wal_consistency.pl
@@ -70,6 +70,6 @@ my ($ret, $out, $err) = $whiskey->psql(
});
 cmp_ok($out, '>=', 1);
 
-$whiskey->wait_for_catchup($charlie, 'replay', $whiskey->lsn('insert'));
+$whiskey->wait_for_replay_catchup($charlie, $whiskey->lsn('insert'));
 
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 04921ca3a3..3e3aeea0c6 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2711,6 +2711,70 @@ sub wait_for_catchup
 
 =pod
 
+=item $node->wait_for_flush_catchup(standby_name, target_lsn)
+
+Helper function for wait_for_catchup when waiting for the flush_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_flush_catchup
+{
+   my ($self, $standby_name, $target_lsn) = @_;
+
+   $self->wait_for_catchup($standby_name, 'flush', $target_lsn);
+}
+
+=pod
+
+=item $node->wait_for_replay_catchup(standby_name, target_lsn)
+
+Helper function for wait_for_catchup when waiting for the replay_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_replay_catchup
+{
+   my ($self, $standby_name, $target_lsn) = @_;
+
+   $self->wait_for_catchup($standby_name, 'replay', $target_lsn);
+}
+
+=pod
+
+=item $node->wait_for_sent_catchup(standby_name, target_lsn)
+
+Helper function for wait_for_catchup when waiting for the sent_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_sent_catchup
+{
+   my ($self, $standby_name, $target_lsn) = @_;
+
+   $self->wait_for_catchup($standby_name, 'sent', $target_lsn);
+}
+
+=pod
+
+=item $node->wait_for_write_catchup(standby_name, target_lsn)
+
+Helper function for wait_for_catchup when waiting for the write_lsn column
+to catchup.
+
+=cut
+
+sub wait_for_write_catchup
+{
+   my ($self, $standby_name, $target_lsn) = @_;
+
+   $self->wait_for_catchup($standby_name, 'write', $target_lsn);
+}
+
+=pod
+
 =item $node->wait_for_slot_catchup(slot_name, mode, target_lsn)
 
 Wait for the named replication slot to equal or pass the supplied target_lsn.
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 23a90dd85b..c90f9c8383 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -48,8 +48,8 @@ $node_primary->safe_psql('postgres',
 
 # Wait for standbys to catch up
 my $primary_lsn = $node_primary->lsn('write');
-$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
-$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $primary_lsn);
+$node_primary->wait_for_replay_catchup($node_standby_1, $primary_lsn);
+$node_standby_1->wait_for_replay_catchup($node_standby_2, $primary_lsn);
 
 my $result =
   $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
@@ -67,8 +67,8 @@ $node_primary->safe_psql('postgres',
 
 # Wait for standbys to catch up
 $primary_lsn = $node_primary->lsn('write');
-$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
-$node_standby_1->wait_for_catchup($node_standby_2, 'replay', 

Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Lukas Fittl
On Fri, Jan 6, 2023 at 1:19 AM Jelte Fennema  wrote:

> Nice addition! And the code looks pretty straight forward.
>

Thanks for reviewing!

> The current patch triggers warnings:
> https://cirrus-ci.com/task/6016013976731648 Looks like you need to add
> void as the argument.
>

Fixed in v2 attached. This also adds a simple regression test, as well as
fixes the parallel working handling.

> Do you have some performance comparison between TIMING ON and TIMING
> SAMPLING?
>

Here are some benchmarks of auto_explain overhead on my ARM-based M1
Macbook for the following query run with pgbench on a scale factor 100 data
set:

SELECT COUNT(*) FROM pgbench_branches JOIN pgbench_accounts USING (bid)
JOIN pgbench_tellers USING (bid) WHERE bid = 42;

(the motivation is to use a query that is more complex than the standard
pgbench select-only test query)

avg latency (best of 3), -T 300, -c 4, -s 100, shared_buffers 2GB, fsync
off, max_parallel_workers_per_gather 0:

master, log_timing = off: 871 ms (878 / 877 / 871)
patch, log_timing = off: 869 ms (882 / 880 / 869)
patch, log_timing = on: 890 ms (917 / 930 / 890)
patch, log_timing = sampling, samplefreq = 1000: 869 ms (887 / 869 / 894)

Additionally, here is Andres' benchmark from [1], with the sampling option
added:

%  psql -Xc 'DROP TABLE IF EXISTS t; CREATE TABLE t AS SELECT * FROM
generate_series(1, 10) g(i);' postgres && pgbench -n -r -t 100 -f
<(echo -e "SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING OFF) SELECT
COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM
t;EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;") postgres |grep '^ '
DROP TABLE
SELECT 10
 3.507   0  SELECT COUNT(*) FROM t;
 3.476   0  EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
 3.576   0  EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM t;
 5.096   0  EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;

My pg_test_timing data for reference:

% pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 23.65 ns
Histogram of timing durations:
  < us   % of total  count
     1     97.64472  123876325
     2      2.35421    2986658
     4      0.00022        277
     8      0.00016        202
    16      0.00064        815
    32      0.00005         64

> In InstrStartSampling there's logic to increase/decrease the frequency of
> an already existing timer. It's not clear to me when this can occur. I'd
> expect sampling frequency to remain constant throughout an explain plan. If
> it's indeed needed, I think a code comment would be useful to explain why
> this edge case is necessary.
>

Clarified in a code comment in v2. This is needed for handling nested
statements, which could have different sampling frequencies for each nesting
level; i.e., a function might want to sample its queries at a higher
frequency than its caller.
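
As an illustration of the idea only (not the patch's actual timer code, and
assuming a samplefreq_hz between 1 and 1,000,000), re-arming an interval
timer at a different rate is cheap:

#include <sys/time.h>

/*
 * Illustrative sketch: re-arm the sampling timer when a nested statement
 * asks for a different frequency, and call it again with the caller's
 * frequency once the nested level finishes.
 */
static void
set_sampling_timer(int samplefreq_hz)
{
    struct itimerval iv;

    iv.it_interval.tv_sec = 0;
    iv.it_interval.tv_usec = 1000000 / samplefreq_hz;
    iv.it_value = iv.it_interval;

    setitimer(ITIMER_REAL, &iv, NULL);  /* the signal handler takes a sample */
}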

Thanks,
Lukas

[1] https://postgr.es/m/20230116213913.4oseovlzvc2674z7%40awork3.anarazel.de

-- 
Lukas Fittl


v2-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch
Description: Binary data


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Takamichi Osumi (Fujitsu)
Hi,


On Saturday, January 14, 2023 3:27 PM vignesh C  wrote:
> 1) Since the min_apply_delay = 3, but you have specified 2s, there might be a
> possibility that it can log delay as 1000ms due to pub/sub/network delay and
> the test can fail randomly, If we cannot ensure this log file value,
> check_apply_delay_time verification alone should be sufficient.
> +is($result, qq(5|1|5), 'check if the new rows were applied to
> +subscriber');
> +
> +check_apply_delay_log("logical replication apply delay", "2000");
You are right. Removed the left-time check of the 1st call of 
check_apply_delay_log(). 


> 2) I'm not sure if this will add any extra coverage as the altering value of
> min_apply_delay is already tested in the regression, if so this test can be
> removed:
> +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
> +$node_subscriber->safe_psql('postgres',
> +   "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay =
> 8646)"
> +);
> +
> +# New row to trigger apply delay.
> +$node_publisher->safe_psql('postgres',
> +   "INSERT INTO test_tab VALUES (0, 'foobar')");
> +
> +check_apply_delay_log("logical replication apply delay", "8000");
While addressing this point, I've noticed that there is a behavior
difference between physical replication's recovery_min_apply_delay and this
feature when the replication is stopped during a delay.

At present, in the latter case, the apply worker exits without applying the
suspended transaction after an ALTER SUBSCRIPTION ... DISABLE command for
the subscription. Meanwhile, there is no "disabling" command for physical
replication, but as one example I checked what happens when a standby is
promoted during the recovery_min_apply_delay delay of physical replication.
The transaction becomes visible when the standby is promoted in the middle
of the delay.

I'm not sure if I should make the time-delayed LR aligned with this behavior.
Does someone have an opinion on this?

By the way, the above test code can be used for the test case
when the apply worker is in a delay but the transaction has been canceled by
ALTER SUBSCRIPTION DISABLE command. So, I didn't remove it at this stage.
> 3) We generally keep the subroutines before the tests, it can be kept
> accordingly:
> 3.a)
> +sub check_apply_delay_log
> +{
> +   my ($message, $expected) = @_;
> +   $expected = 0 unless defined $expected;
> +
> +   my $old_log_location = $log_location;
> 
> 3.b)
> +sub check_apply_delay_time
> +{
> +   my ($primary_key, $expected_diffs) = @_;
> +
> +   my $inserted_time_on_pub = $node_publisher->safe_psql('postgres',
> qq[
> +   SELECT extract(epoch from c) FROM test_tab WHERE a =
> $primary_key;
> +   ]);
> +
Fixed.

 
> 4) typo "more then once" should be "more than once"
> + regress_testsub | regress_subscription_user | f   |
> {testpub,testpub1,testpub2} | f  | off   | d|
> f| any|0 | off
> | dbname=regress_doesnotexist | 0/0
>  (1 row)
> 
>  -- fail - publication used more then once @@ -316,10 +316,10 @@ ERROR:
> publication "testpub3" is not in subscription "regress_testsub"
>  -- ok - delete publications
>  ALTER SUBSCRIPTION regress_testsub DROP PUBLICATION testpub1,
> testpub2 WITH (refresh = false);
>  \dRs+
This was an existing typo on HEAD. It is addressed in another thread [1].

 
> 5) This can be changed to "Is it larger than expected?"
> +   # Is it larger than expected ?
> +   cmp_ok($logged_delay, '>', $expected,
> +   "The wait time of the apply worker is long
> enough expectedly"
> +   );
Fixed.
 
> 6) 2022 should be changed to 2023
> +++ b/src/test/subscription/t/032_apply_delay.pl
> @@ -0,0 +1,170 @@
> +
> +# Copyright (c) 2022, PostgreSQL Global Development Group
> +
> +# Test replication apply delay
Fixed.


> 7) Termination full stop is not required for single line comments:
> 7.a)
> +use Test::More;
> +
> +# Create publisher node.
> +my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
> 
> 7.b) +
> +# Create subscriber node.
> +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
> 
> 7.c) +
> +# Create some preexisting content on publisher.
> +$node_publisher->safe_psql('postgres',
> 
> 7.d) similarly in rest of the files
Removed the periods for single line comments.


> 8) Is it possible to add one test for spooling also?
There is a streaming transaction case in the TAP test already.


I made some minor comment modifications along with the above changes.
Kindly have a look at v16.

[1] - 
https://www.postgresql.org/message-id/flat/TYCPR01MB83737EA140C79B7D099F65E8EDC69%40TYCPR01MB8373.jpnprd01.prod.outlook.com


Best Regards,
Takamichi Osumi



v16-0001-Time-delayed-logical-replication-subscriber.patch
Description:  v16-0001-Time-delayed-logical-replication-subscriber.patch


RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-01-17 Thread wangw.f...@fujitsu.com
On Wed, Jan 11, 2023 4:31 PM Melih Mutlu  wrote:
> Rebased the patch to resolve conflicts.

Thanks for your patch set.

Here are some comments:

v3-0001* patch
===

1. typedefs.list
I think we also need to add "walrcv_slot_snapshot_fn" to this file.

v7-0002* patch
===
1. About function ReplicationOriginNameForLogicalRep()
Do we need to modify the API of this function? I think the original API could
also meet the current needs. Since this is not a static function, I think it
seems better to keep the original API if there is no reason. Please let me know
if I'm missing something.

-

2. Comment atop the function GetSubscriptionRelReplicationSlot
+/*
+ * Get replication slot name of subscription table.
+ *
+ * Returns null if the subscription table does not have a replication slot.
+ */

Since this function always returns NULL, I think it would be better for the
comment to describe the value set in "slotname" here instead of the
function's return value.

If you agree with this, please also kindly modify the comment atop the function
GetSubscriptionRelOrigin.

-

3. typo
+* At this point, there shouldn't be any existing 
replication
+* origin wit the same name.

wit -> with

-

4. In function CreateSubscription
+   values[Anum_pg_subscription_sublastusedid - 1] = Int64GetDatum(1);

I think it might be better to initialize this field to NULL or 0 here.
Because in the patch, we always ignore the initialized value when launching
the sync worker in the function process_syncing_tables_for_apply. And I think
we could document in pg-doc that this value means that no tables have been
synced yet.

-

5. New member "created_slot" in structure LogicalRepWorker
+   /*
+* Indicates if the sync worker created a replication slot or it reuses 
an
+* existing one created by another worker.
+*/
+   boolcreated_slot;

I think the second half of the sentence looks inaccurate.
Because I think this flag could be false even when we reuse an existing slot
created by another worker. Assuming the first run for the worker tries to sync
a table which is synced by another sync worker before, and the relstate is set
to SUBREL_STATE_FINISHEDCOPY by another sync worker, I think this flag will not
be set to true. (see function LogicalRepSyncTableStart)

So, what if we simplify the description here and just say that this worker
already has its default slot?

If I'm not missing something and you agree with this, please also kindly modify
the relevant comment atop the if-statement (!MyLogicalRepWorker->created_slot)
in the function LogicalRepSyncTableStart.

Regards,
Wang Wei


Re: Small miscellaneus fixes (Part II)

2023-01-17 Thread Ranier Vilela
On Tue, Jan 17, 2023 at 04:37, John Naylor <john.nay...@enterprisedb.com> wrote:

>
> On Mon, Jan 16, 2023 at 1:28 PM John Naylor 
> wrote:
> >
> >
> > I wrote:
> > > ...but arguably the earlier check is close enough that it's silly to
> assert in the "else" branch, and I'd be okay with just nuking those lines.
> Another thing that caught my attention is the assumption that unsetting a
> bit is so expensive that we have to first check if it's set, so we may as
> well remove "IS_BRACKET(Np->Num)" as well.
> >
> > The attached is what I mean. I'll commit this this week unless there are
> objections.
>
> I've pushed this and the cosmetic fix in pg_dump. Those were the only
> things I saw that had some interest, so I closed the CF entry.
>
Thanks for the commits.

regards,
Ranier Vilela


Re: Helper functions for wait_for_catchup() in Cluster.pm

2023-01-17 Thread Alvaro Herrera
On 2023-Jan-17, Drouvot, Bertrand wrote:

> The idea has been raised in [1], where we are adding more calls to
> wait_for_catchup() in 'replay' mode.

This seems mostly useless as presented.  Maybe if you're able to reduce
the noise on the second argument it would be worth something -- namely,
if the wrapper function receives a node instead of an LSN: perhaps
wait_for_replay_catchup() would use the flush LSN from the given node,
wait_for_write_catchup() would use the write LSN, and
wait_for_sent_catchup() would use the insert LSN.  (I didn't check in
your patch if there are callsites that do something else).  This would
in several cases let you also remove the line with the assignment of
appropriate LSN to a separate variable.  If you did it that way, maybe
the code would become a tiny bit smaller overall.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: xml2: add test for coverage

2023-01-17 Thread vignesh C
On Fri, 25 Nov 2022 at 18:08, Peter Eisentraut
 wrote:
>
> On 23.08.22 03:38, Dong Wook Lee wrote:
> > I made a small patch for xml2 to improve test coverage.
> > However, there was a problem using the functions below.
> >
> > - xpath_number
> > - xpath_bool
> > - xpath_nodeset
> > - xpath_list
> >
> > Do you have any advice on how to use this function correctly?
> > It would also be good to add an example of using the function to the 
> > document.
>
> I can confirm that these functions could use more tests and more
> documentation and examples.  But given that you registered a patch in
> the commit fest, it should be you who provides a patch to solve those
> issues.  Are you still working on this, or were you just looking for
> help on how to solve this?

Hi DongWook Lee,

Are you planning to work on this and provide an updated patch? If you
are not planning to work on it, we can update the commitfest entry
accordingly.

Regards,
Vignesh




Re: vacuumlo: add test to vacuumlo for test coverage

2023-01-17 Thread vignesh C
On Wed, 16 Nov 2022 at 10:18, Ian Lawrence Barwick  wrote:
>
> On Sat, Sep 3, 2022 at 17:28, Dong Wook Lee  wrote:
> >
> > Hi hackers,
> > I write a tiny patch about vacuumlo to improve test coverage.
> > I hope my work is meaningful.
>
> Hi
>
> While reviewing the patch backlog, we have determined that this patch adds
> one or more TAP tests but has not added the test to the "meson.build" file.
>
> To do this, locate the relevant "meson.build" file for each test and add it
> in the 'tests' dictionary, which will look something like this:
>
>   'tap': {
> 'tests': [
>   't/001_basic.pl',
> ],
>   },
>
> For some additional details please see this Wiki article:
>
>   https://wiki.postgresql.org/wiki/Meson_for_patch_authors
>
> For more information on the meson build system for PostgreSQL see:
>
>   https://wiki.postgresql.org/wiki/Meson

Hi DongWook Lee,

Please plan to work on the comment and provide an updated patch. As CommitFest
2023-01 is currently underway, this would be an excellent time to
update the patch and get it into better shape.

Regards,
Vignesh




Re: Fix database creation during installchecks for ICU cluster

2023-01-17 Thread vignesh C
On Tue, 29 Nov 2022 at 20:24, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thanks for the patch!
>
>
> On 10/29/22 12:54, Marina Polyakova wrote:
> >
> > 1) The ECPG tests fail because they use the SQL_ASCII encoding [2],
> > the database template0 uses the ICU locale provider and SQL_ASCII is
> > not supported by ICU:
> >
> > $ make -C src/interfaces/ecpg/ installcheck
> > ...
> > == creating database "ecpg1_regression" ==
> > ERROR:  encoding "SQL_ASCII" is not supported with ICU provider
> > ERROR:  database "ecpg1_regression" does not exist
> > command failed: "/home/marina/postgresql/master/my/inst/bin/psql" -X
> > -c "CREATE DATABASE \"ecpg1_regression\" TEMPLATE=template0
> > ENCODING='SQL_ASCII'" -c "ALTER DATABASE \"ecpg1_regression\" SET
> > lc_messages TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_monetary
> > TO 'C';ALTER DATABASE \"ecpg1_regression\" SET lc_numeric TO 'C';ALTER
> > DATABASE \"ecpg1_regression\" SET lc_time TO 'C';ALTER DATABASE
> > \"ecpg1_regression\" SET bytea_output TO 'hex';ALTER DATABASE
> > \"ecpg1_regression\" SET timezone_abbreviations TO 'Default';" "postgres"
> >
>
> I can confirm that same error happens on my end and your patch fixes the
> issue. But, do ECPG tests really require SQL_ASCII encoding? I removed
> ECPG tests' encoding line [1], rebuilt it and 'make -C
> src/interfaces/ecpg/ installcheck' passed without applying your patch.
>
>
> >
> > 2) The option --no-locale in pg_regress is described as "use C locale"
> > [3]. But in this case the created databases actually use the ICU
> > locale provider with the ICU cluster locale from template0 (see
> > diff_check_backend_used_provider.txt):
> >
> > $ make NO_LOCALE=1 installcheck
>
>
> This works on my end without applying your patch. Commands I used are:
>
> $ ./configure --with-icu --prefix=$PWD/build_dir
> $ make && make install && export PATH=$PWD/build_dir/bin:$PATH
> $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D
> data -l logfile start
> $ make NO_LOCALE=1 installcheck

Hi Marina Polyakova,

Since it is working without your patch, is this patch required for any
other scenarios?

Regards,
Vignesh




[PATCH] Constify proclist.h

2023-01-17 Thread Aleksander Alekseev
Hi hackers,

This is a follow-up to [1] and c8ad4d81.

> Additionally Bharath pointed out that there are other pieces of code
> that we may want to change in a similar fashion,
> proclist.h/proclist_types.h as one example. I didn't do this yet
> because I would like to know the community opinion first on whether we
> should do this at all.

Since the consensus seems to be to constify everything possible here
is the patch for proclist.h. There is nothing to change in
proclist_types.h.

[1]: 
https://postgr.es/m/CAJ7c6TM2%3D08mNKD9aJg8vEY9hd%2BG4L7%2BNvh30UiNT3kShgRgNg%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev


v1-0001-Constify-proclist.h.patch
Description: Binary data


RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> At present, in the latter case, the apply worker exits without applying the
> suspended transaction after an ALTER SUBSCRIPTION ... DISABLE command for
> the subscription. Meanwhile, there is no "disabling" command for physical
> replication, but as one example I checked what happens when a standby is
> promoted during the recovery_min_apply_delay delay of physical replication.
> The transaction becomes visible when the standby is promoted in the middle
> of the delay.
> 
> I'm not sure if I should make the time-delayed LR aligned with this behavior.
> Does someone have an opinion on this?

I'll put my opinion here. The current specification is correct; we should not
follow the physical replication manner.
One motivation for this feature is to offer an opportunity to correct data
loss errors. When accidental delete events occur, the DBA can stop propagation
to subscribers by disabling the subscription, with the patch as it stands.
IIUC, when the subscription is disabled before transactions are started,
workers exit and stop applying. This feature delays starting transactions, so
we should regard such an alteration as having been executed before the
transaction.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Amit Kapila
On Tue, Jan 17, 2023 at 4:30 PM Takamichi Osumi (Fujitsu)
 wrote:
>
> On Saturday, January 14, 2023 3:27 PM vignesh C  wrote:
>
> > 2) I'm not sure if this will add any extra coverage as the altering value of
> > min_apply_delay is already tested in the regression, if so this test can be
> > removed:
> > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
> > +$node_subscriber->safe_psql('postgres',
> > +   "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay =
> > 8646)"
> > +);
> > +
> > +# New row to trigger apply delay.
> > +$node_publisher->safe_psql('postgres',
> > +   "INSERT INTO test_tab VALUES (0, 'foobar')");
> > +
> > +check_apply_delay_log("logical replication apply delay", "8000");
> While addressing this point, I've noticed that there is a behavior
> difference between physical replication's recovery_min_apply_delay and this
> feature when the replication is stopped during a delay.
>
> At present, in the latter case, the apply worker exits without applying the
> suspended transaction after an ALTER SUBSCRIPTION ... DISABLE command for
> the subscription.
>

In the previous paragraph, you mentioned a behavior difference while
stopping the replication, but it is not clear where this DISABLE
command comes into that scenario.

> Meanwhile, there is no "disabling" command for physical replication, but as
> one example I checked what happens when a standby is promoted during the
> recovery_min_apply_delay delay of physical replication. The transaction
> becomes visible when the standby is promoted in the middle of the delay.
>

What causes such a transaction to be visible after promotion? Ideally,
if the commit doesn't succeed, the transaction shouldn't be visible.
Do we allow the transaction waiting due to the delay to get committed on
promotion?

> I'm not sure if I should make the time-delayed LR aligned with this behavior.
> Does someone have an opinion on this?
>

Can you please explain a bit more as asked above to understand the difference?

-- 
With Regards,
Amit Kapila.




Re: Logical replication timeout problem

2023-01-17 Thread Ashutosh Bapat
On Tue, Jan 17, 2023 at 3:34 PM Amit Kapila  wrote:

> >
> > I am a bit worried about the indirections that the wrappers and hooks
> > create. Output plugins call OutputPluginUpdateProgress() in callbacks
> > but I don't see why  ReorderBufferProcessTXN() needs a callback to
> > call OutputPluginUpdateProgress.
> >
>
> Yeah, I think we can do it as we are doing the previous approach but
> we need an additional wrapper (update_progress_cb_wrapper()) as the
> current patch has so that we can add error context information. This
> is similar to why we have a wrapper for all other callbacks like
> change_cb_wrapper.
>

Ultimately OutputPluginUpdateProgress() will be called, which in turn
will call ctx->update_progress. I don't see wrappers around
OutputPluginWrite or OutputPluginPrepareWrite, but I see that those
two are always called from the output plugin, so indirectly they are
called through a wrapper. I also see that update_progress_cb_wrapper()
is similar, as far as the wrapper is concerned, to
ReorderBufferUpdateProgress() in the earlier patch.
ReorderBufferUpdateProgress() looks more readable than the wrapper.

If we want to keep the wrapper, at least we should use a different
variable name. update_progress also exists in LogicalDecodingContext
and will be indirectly called from ReorderBuffer::update_progress.
Somebody might think that there's some recursion involved there.
That's a mighty confusion.

-- 
Best Wishes,
Ashutosh Bapat




Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Karl O. Pinc
On Tue, 17 Jan 2023 06:57:23 +0100
Brar Piening  wrote:

> On 17.01.2023 at 02:05, Karl O. Pinc wrote:
> > Or maybe the right way is to set a mode at the very top,
> > the first apply-templates call, and not mess with the
> > built-in templates at all.  (You'd write your own
> > "postgres-mode" templates the same way, to "wrap"
> > and call the default templates.)
> >
> > Think of the mode as an implicit argument that's preserved and
> > passed down through each template invocation without having to
> > be explicitly specified by the calling code.  
> 
> I think the document you're missing is [1].
> 
> There are multiple ways to customize DocBook XSL output and it sounds
> like you want me to write a customization layer which I didn't do
> because there is precedent that the typical "way to do it" (TM) in the
> PostgreSQL project is [2].
> 
> Regards,
> 
> Brar
> 
> [1] http://www.sagehill.net/docbookxsl/CustomizingPart.html
> [2] http://www.sagehill.net/docbookxsl/ReplaceTemplate.html
> 

Sagehill is normally very good.  But in this case [2] does not
apply.  Or rather, it applies but it is overkill, because you
do not want to replace what a template is producing.  You
want to add to what a template is producing.  So you want to
wrap the template, with your new code adding output before/after
what the original produces.

[1] does not contain this technique.

If you're not willing to try I am willing to see if I can
produce an example to work from.  My XSLT is starting to
come back.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2023-01-17 Thread Aleksander Alekseev
Hi hackers,

> Yeah, pg_upgrade will briefly start and stop the old server to make
> sure all the WAL is replayed, and won't transfer any of the files
> over. AFAIK, major-version WAL changes are fine; it was the previous
> claim that we could do it in a minor version that I was unsure about.

OK, here is the patchset v53 where I mostly modified the commit
messages. It is explicitly said that 0001 modifies the WAL records and
why we decided to do it in this patch. Additionally any mention of
64-bit XIDs is removed since it is not guaranteed that the rest of the
patches are going to be accepted. 64-bit SLRU page numbering is a
valuable change per se.

Changing the status of the CF entry to RfC apparently was a bit
premature. It looks like the patchset can use a few more rounds of
review.

In 0002:

```
-#define TransactionIdToCTsPage(xid) \
-((xid) / (TransactionId) COMMIT_TS_XACTS_PER_PAGE)
+static inline int64
+TransactionIdToCTsPageInternal(TransactionId xid, bool lock)
+{
+    FullTransactionId fxid,
+                      nextXid;
+    uint32            epoch;
+
+    if (lock)
+        LWLockAcquire(XidGenLock, LW_SHARED);
+
+    /* make a local copy */
+    nextXid = ShmemVariableCache->nextXid;
+
+    if (lock)
+        LWLockRelease(XidGenLock);
+
+    epoch = EpochFromFullTransactionId(nextXid);
+    if (xid > XidFromFullTransactionId(nextXid))
+        --epoch;
+
+    fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+    return fxid.value / (uint64) COMMIT_TS_XACTS_PER_PAGE;
+}
```

I'm pretty confident that shared memory can't be accessed like this
without taking a lock. Although it may work on x64, in general we can
get garbage, unless nextXid is accessed atomically and has a
corresponding atomic type. On top of that I'm pretty sure
TransactionIds can't be compared with the regular comparison
operators. All in all, so far I don't understand why this piece of
code should be so complicated.
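
For comparison, a plain lock-protected read, in the spirit of
ReadNextFullTransactionId() in varsup.c (reproduced here from memory and
assuming the usual transam/lwlock headers, so treat it as a sketch rather
than the exact code), is simply:

static inline FullTransactionId
read_next_full_xid(void)
{
    FullTransactionId fxid;

    LWLockAcquire(XidGenLock, LW_SHARED);
    fxid = ShmemVariableCache->nextXid;
    LWLockRelease(XidGenLock);

    return fxid;
}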

The same applies to:

```
-#define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE)
+static inline int64
+TransactionIdToPageInternal(TransactionId xid, bool lock)
```

... in subtrans.c

Maxim, perhaps you could share with us what your reasoning was here?

-- 
Best regards,
Aleksander Alekseev


v53-0002-Utilize-64-bit-SLRU-page-numbers-in-SLRU-callers.patch
Description: Binary data


v53-0001-Use-64-bit-numbering-of-SLRU-pages.patch
Description: Binary data


v53-0003-pg_upgrade-from-32-bit-to-64-bit-SLRU-page-numbe.patch
Description: Binary data


Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Andrew Dunstan


On 2023-01-16 Mo 21:58, Tom Lane wrote:
> I've pushed the per-branch AdjustUpgrade.pm files and tested by performing
> a fresh round of buildfarm runs with the patched TestUpgradeXversion.pm
> file.  I think we're in good shape with this project.
>
> I dunno if we want to stretch buildfarm owners' patience with yet
> another BF client release right now.  On the other hand, I'm antsy
> to see if we can un-revert 1b4d280ea after doing a little more
> work in AdjustUpgrade.pm.
>
>   


It looks like the only animals doing the cross version tests are crake,
drongo and fairywren. These are all mine, so I don't think we need to do
a new release for this.

I think the next step is to push the buildfarm client changes, and
update those three animals to use it, and make sure nothing breaks. I'll
go and do those things now. Then you should be able to try your unrevert.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Robert Haas
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund  wrote:
> Does anybody see a reason to not move forward with this aspect? We do a fair
> amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
> Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> BufferUsage).

I read through 0001 and it seems basically fine to me. Comments:

1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

2. I'm not entirely sure that the new .?S_PER_.?S macros are
worthwhile but maybe they are, and in any case I don't care very much.

3. I've always found 'struct timespec' to be pretty annoying
notationally, so I like the fact that this patch would reduce use of
it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Record queryid when auto_explain.log_verbose is on

2023-01-17 Thread torikoshia

On 2023-01-16 22:07, Julien Rouhaud wrote:

Hi,

On Mon, Jan 16, 2023 at 09:36:59PM +0900, torikoshia wrote:


As far as I read the manual below, auto_explain.log_verbose should record
logs equivalent to the VERBOSE option of EXPLAIN.


Ah good catch, that's clearly an oversight!


Attached patch makes auto_explain also print query identifiers.

What do you think?


@@ -407,6 +408,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 				ExplainPrintTriggers(es, queryDesc);
 			if (es->costs)
 				ExplainPrintJITSummary(es, queryDesc);
+			if (es->verbose && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+				ExplainPropertyInteger("Query Identifier", NULL, (int64)
+									   queryDesc->plannedstmt->queryId, es);

For interactive EXPLAIN the query identifier is printed just after the plan,
before the triggers and the JIT summary, so auto_explain should do the same.

Thanks for the comment!
Agreed and updated the patch.


On 2023-01-17 03:53, Justin Pryzby wrote:

On Mon, Jan 16, 2023 at 09:36:59PM +0900, torikoshia wrote:

Attached patch makes auto_explain also print query identifiers.


This was asked during the initial patch; does your patch address the
issues here ?

https://www.postgresql.org/message-id/20200308142644.vlihk7djpwqjkp7w%40nol


Thanks!
I may be misunderstanding something, but it seems that the issue occurred
because queryid was calculated in pgss_post_parse_analyze() at that time.


```
--- queryid_exposure-v6.diff, which is the patch just before the discussion
@@ -792,16 +801,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)

..snip..

 	if (query->utilityStmt)
 	{
-		query->queryId = UINT64CONST(0);
+		if (pgss_track_utility && PGSS_HANDLED_UTILITY(query->utilityStmt)
+			&& pstate->p_sourcetext)
+		{
+			const char *querytext = pstate->p_sourcetext;
+			int			query_location = query->stmt_location;
+			int			query_len = query->stmt_len;
+
+			/*
+			 * Confine our attention to the relevant part of the string,
+			 * if the query is a portion of a multi-statement source string.
+			 */
+			querytext = pgss_clean_querytext(pstate->p_sourcetext,
+											 &query_location,
+											 &query_len);
+
+			query->queryId = pgss_compute_utility_queryid(querytext, query_len);

```

Currently queryId is not calculated in pgss_post_parse_analyze(), so the
issue does not occur, does it?
I confirmed that the attached patch logs the queryid for some utility
statements.


```
LOG:  0: duration: 0.017 ms  plan:
Query Text: prepare p1 as select 1;
Result  (cost=0.00..0.01 rows=1 width=4)
  Output: 1
Query Identifier: -1801652217649936326
```

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From c113e00fc681289e33f72855d86b93b0d44360d2 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 17 Jan 2023 22:42:19 +0900
Subject: [PATCH v2] Make auto_explain record query identifier

When auto_explain.log_verbose is on, auto_explain should record logs
equivalent to the VERBOSE option of EXPLAIN. However, when
compute_query_id is on, query identifiers are only printed by the
VERBOSE option of EXPLAIN.
This patch makes auto_explain also print the query identifier.
---
 contrib/auto_explain/auto_explain.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c3ac27ae99..ee6917e605 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -21,6 +21,7 @@
 #include "jit/jit.h"
 #include "nodes/params.h"
 #include "utils/guc.h"
+#include "utils/queryjumble.h"
 
 PG_MODULE_MAGIC;
 
@@ -403,6 +404,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			ExplainQueryText(es, queryDesc);
 			ExplainQueryParameters(es, queryDesc->params, auto_explain_log_parameter_max_length);
 			ExplainPrintPlan(es, queryDesc);
+			if (es->verbose && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+				ExplainPropertyInteger("Query Identifier", NULL, (int64)
+									   queryDesc->plannedstmt->queryId, es);
 			if (es->analyze && auto_explain_log_triggers)
 ExplainPrintTriggers(es, queryDesc);
 			if (es->costs)

base-commit: 15eb1228800a35042ef318f941173aa8b89002d1
-- 
2.25.1



GSoC 2023

2023-01-17 Thread Ilaria Battiston

Greetings -hackers,

Our beloved Google Summer of Code is back for 2023, with a format 
similar to 2022: both medium and large sized projects can be proposed, 
with more flexibility on end dates. The program will be open to students 
and open source beginners, as stated in this blog post: 
https://opensource.googleblog.com/2022/11/get-ready-for-google-summer-of-code-2023.html


Now is the time to work on getting together a set of projects we'd like 
to have GSoC students work on over the summer. Similar to last year, we 
need to have a good set of projects for students to choose from in 
advance of the deadline for mentoring organizations.


However, as noted in the blog post above, project length expectations 
may vary. Please decide accordingly based on your requirements and 
availability! Also, there is going to be only one intermediate 
evaluation, similarly to last year.


GSoC timeline: https://developers.google.com/open-source/gsoc/timeline

The deadline for Mentoring organizations to apply is: February 7. The 
list of accepted organizations will be published around February 22.


Unsurprisingly, we'll need to have an Ideas page again, so I've gone 
ahead and created one (copying last year's):

https://wiki.postgresql.org/wiki/GSoC_2023

Google discusses what makes a good "Ideas" list here:
https://google.github.io/gsocguides/mentor/defining-a-project-ideas-list.html

All the entries are marked with '2022' to indicate they were pulled from 
last year. If the project from last year is still relevant, please 
update it to be '2023' and make sure to update all of the information 
(in particular, make sure to list yourself as a mentor and remove the 
other mentors, as appropriate). Please also be sure to update the 
project's scope to be appropriate for the new guidelines.


New entries are certainly welcome and encouraged, just be sure to note 
them as '2023' when you add them. Projects from last year which were 
worked on but have significant follow-on work to be completed are 
absolutely welcome as well - simply update the description appropriately 
and mark it as being for '2023'.


When we get closer to actually submitting our application, I'll clean 
out the '2022' entries that didn't get any updates. Also - if there are 
any projects that are no longer appropriate (maybe they were completed, 
for example and no longer need work), please feel free to remove them. 
The page is still work in progress, so it's entirely possible I missed 
some updates where a GSoC project was completed independently of GSoC 
(and if I removed any that shouldn't have been - feel free to add them 
back by copying from the 2022 page).


As a reminder, each idea on the page should be in the format that the 
other entries are in and should include:

- Project title/one-line description
- Brief, 2-5 sentence, description of the project
- Description of programming skills needed and estimation of the 
difficulty level

- Project size
- List of potential mentors
- Expected Outcomes

As with last year, please consider PostgreSQL to be an "Umbrella" 
project and that anything which would be considered "PostgreSQL Family" 
per the News/Announce policy [1] is likely to be acceptable as a 
PostgreSQL GSoC project.


In other words, if you're a contributor or developer on WAL-G, barman, 
pgBackRest, the PostgreSQL website (pgweb), the PgEU/PgUS website code 
(pgeu-system), pgAdmin4, pgbouncer, pldebugger, the PG RPMs (pgrpms), 
the JDBC driver, the ODBC driver, or any of the many other PG Family 
projects, please feel free to add a project for consideration! If we get 
quite a few, we can organize the page further based on which project or 
maybe what skills are needed or similar.


Let's have another great year of GSoC with PostgreSQL!

Thanks!

Ilaria & Stephen

[1]: https://www.postgresql.org/about/policies/news-and-events/





Re: Switching XLog source from archive to streaming when primary available

2023-01-17 Thread Bharath Rupireddy
On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart  wrote:
>
> On Tue, Oct 18, 2022 at 07:31:33AM +0530, Bharath Rupireddy wrote:
> > In summary, the standby state machine in WaitForWALToBecomeAvailable()
> > exhausts all the WAL in pg_wal before switching to streaming after
> > failing to fetch from archive. The v8 patch proposed upthread deviates
> > from this behaviour. Hence, attaching v9 patch that keeps the
> > behaviour as-is, that means, the standby exhausts all the WAL in
> > pg_wal before switching to streaming after fetching WAL from archive
> > for at least streaming_replication_retry_interval milliseconds.
>
> I think this is okay.  The following comment explains why archives are
> preferred over existing files in pg_wal:
>
>  * When doing archive recovery, we always prefer an archived log file 
> even
>  * if a file of the same name exists in XLOGDIR.  The reason is that 
> the
>  * file in XLOGDIR could be an old, un-filled or partly-filled version
>  * that was copied and restored as part of backing up $PGDATA.
>
> With your patch, we might replay one of these "old" files in pg_wal instead
> of the complete version of the file from the archives,

That's true even today, without the patch, no? We're not changing the
existing behaviour of the state machine. Can you explain how it
happens with the patch?

On HEAD, after failing to read from the archive, exhaust all wal from
pg_wal and then switch to streaming mode. With the patch, after
reading from the archive for at least
streaming_replication_retry_interval milliseconds, exhaust all wal
from pg_wal and then switch to streaming mode.

> but I think that is
> still correct.  We'll just replay whatever exists in pg_wal (which may be
> un-filled or partly-filled) before attempting streaming.  If that fails,
> we'll go back to trying the archives again.
>
> Would you mind testing this scenario?

How about something like below for testing the above scenario? If it
looks okay, I can add it as a new TAP test file.

1. Generate WAL files f1 and f2 and archive them.
2. Check the replay lsn and WAL file name on the standby; when it has
replayed up to f2, stop the standby.
3. Set recovery to fail on the standby, and stop the standby.
4. Generate f3, f4 (partially filled) on the primary.
5. Manually copy f3, f4 to the standby's pg_wal.
6. Start the standby. Since recovery is set to fail and there are new
WAL files (f3, f4) under its pg_wal, it must replay those WAL files
(check the replay lsn and WAL file name, it must be f4) before
switching to streaming.
7. Generate f5 on the primary.
8. The standby should receive f5 and replay it (check the replay lsn
and WAL file name, it must be f5).
9. Set streaming to fail on the standby and set recovery to succeed.
10. Generate f6 on the primary.
11. The standby should receive f6 via archive and replay it (check the
replay lsn and WAL file name, it must be f6).

If needed, we can look out for these messages to confirm it works as expected:
elog(DEBUG2, "switched WAL source from %s to %s after %s",
 xlogSourceNames[oldSource], xlogSourceNames[currentSource],
 lastSourceFailed ? "failure" : "success");
ereport(LOG,
(errmsg("restored log file \"%s\" from archive",
xlogfname)));

Essentially, it covers what the documentation
https://www.postgresql.org/docs/devel/warm-standby.html says:

"In standby mode, the server continuously applies WAL received from
the primary server. The standby server can read WAL from a WAL archive
(see restore_command) or directly from the primary over a TCP
connection (streaming replication). The standby server will also
attempt to restore any WAL found in the standby cluster's pg_wal
directory. That typically happens after a server restart, when the
standby replays again WAL that was streamed from the primary before
the restart, but you can also manually copy files to pg_wal at any
time to have them replayed."

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: daitch_mokotoff module

2023-01-17 Thread Dag Lem
Paul Ramsey  writes:

>> On Jan 12, 2023, at 7:30 AM, Dag Lem  wrote:
>> 

[...]

>> 
>> Sure, I can do that. You don't think this much example text will be
>> TL;DR?
>
> I can only speak for myself, but examples are the meat of
> documentation learning, so as long as they come with enough
> explanatory context to be legible it's worth having them, IMO.
>

I have updated the documentation; hopefully it is more accessible now.

I also corrected the documentation for the other functions in fuzzystrmatch
(the function name and argument type were listed in the wrong order).

Crossing fingers that someone will eventually change the status to
"Ready for Committer" :-)

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org

Re: Make use of assign_checkpoint_completion_target() to calculate CheckPointSegments correctly

2023-01-17 Thread Bharath Rupireddy
On Tue, Jan 17, 2023 at 12:31 PM Michael Paquier  wrote:
>
> On Mon, Dec 26, 2022 at 06:12:34PM +0530, Bharath Rupireddy wrote:
> > It looks like assign_checkpoint_completion_target() is defined [1],
> > but never used, because of which CheckPointSegments may miss to
> > account for changed checkpoint_completion_target. I'm attaching a tiny
> > patch to fix this.
> >
> > Thoughts?
>
> Oops.  It looks like you are right here.  This would impact the
> calculation of CheckPointSegments on reload when
> checkpoint_completion_target is updated.  This is wrong since we've
> switched to max_wal_size as of 88e9823, so this had better be
> backpatched all the way down.
>
> Thoughts?

+1 to backpatching, since otherwise a change to
checkpoint_completion_target will not take effect immediately.
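
For reference, the unused hook is presumably along these lines (paraphrased
from memory, not a quote of the patch):

/*
 * Sketch of the assign hook in question (paraphrased); once registered as
 * checkpoint_completion_target's assign_hook, a reload re-derives
 * CheckPointSegments from the new value.
 */
void
assign_checkpoint_completion_target(double newval, void *extra)
{
	CheckPointCompletionTarget = newval;
	CalculateCheckpointSegments();
}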

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread Masahiko Sawada
On Tue, Jan 17, 2023 at 6:14 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jan 17, 2023 at 12:37 PM houzj.f...@fujitsu.com
> >  wrote:
> > > Attach the new version 0001 patch which addressed all other comments.
> > >
> >
> > Thank you for updating the patch. Here is one comment:
> >
> > @@ -426,14 +427,24 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
> >
> > /*
> >  * Show the leader only for active parallel
> > workers.  This
> > -* leaves the field as NULL for the
> > leader of a parallel
> > -* group.
> > +* leaves the field as NULL for the
> > leader of a parallel group
> > +* or the leader of parallel apply workers.
> >  */
> > if (leader && leader->pid !=
> > beentry->st_procpid)
> > {
> > values[28] =
> > Int32GetDatum(leader->pid);
> > nulls[28] = false;
> > }
> > +   else
> > +   {
> > +   int
> > leader_pid = GetLeaderApplyWorkerPid(beentry->st_procpid);
> > +
> > +   if (leader_pid != InvalidPid)
> > +   {
> > +   values[28] =
> > Int32GetDatum(leader_pid);
> > +   nulls[28] = false;
> > +   }
> > +   }
> > }
> >
> > I'm slightly concerned that there could be overhead of executing
> > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > query workers. The number of such backends could be large and
> > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> > sense to check (st_backendType == B_BG_WORKER) before calling
> > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > LogicalRepWorkerLock which is not likely to be contended.
>
> Thanks for the comment and I think your suggestion makes sense.
> I have added the check before getting the leader pid. Here is the new version 
> patch.

Thank you for updating the patch. Looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Tomas Vondra




On 1/15/23 21:22, Andres Freund wrote:
> Hi,
> 
> On 2023-01-13 09:11:06 +0100, David Geier wrote:
>> Mostly I'm wondering if the sampling based approach gains us enough to be
>> worth it, once the patch to use RDTSC hopefully landed (see [1]).
> 
> Well, I'm not sure we have a path forward on it. There's portability and
> accuracy concerns. But more importantly:
> 
>> I believe that with the RDTSC patch the overhead of TIMING ON is lower than
>> the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
>> be really useful, it would be great if we could on top of TIMING SAMPLING
>> also lower the overhead of ANALYZE itself further (e.g. by using a fast path
>> for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
>> InstrStartNode() and InstrStopNode() have a ton of branches and without all
>> the typically deactivated code the implementation would be very small and
>> could be placed in an inlinable function.
> 
> Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
> at the cost of a bit less detail in the explain, of course.  Which could make
> it a lot more feasible to enable something like auto_explain.log_timing in
> busy workloads.
> 
> For the sampling mode we don't really need something like
> InstrStart/StopNode. We just need a pointer to node currently executing - not
> free to set, but still a heck of a lot cheaper than InstrStopNode(), even
> without ->need_timer etc. Then the timer just needs to do
> instr->sampled_total += (now - last_sample)
> last_sample = now
> 

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.
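
For illustration, the counter-only variant would look roughly like this
(hypothetical names, nothing that exists in the tree today):

#include <stdint.h>

typedef struct SamplingCounter
{
	uint64_t	sample_count;	/* timer ticks observed while in this node */
} SamplingCounter;

/* set by the executor to whatever node is currently running */
static SamplingCounter *volatile current_node_counter = NULL;

/* called from the sampling timer (signal) handler */
static void
sampling_tick(void)
{
	SamplingCounter *c = current_node_counter;

	if (c != NULL)
		c->sample_count++;
}

/* EXPLAIN would then report sample_count / sampling_rate as the approximate time */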

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

   (now - last_sample)

ends up being 0 (assuming I correctly understand the code).

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

> 
> I've been thinking that we should consider making more of the instrumentation
> code work like that. The amount of work we're doing in InstrStart/StopNode()
> has steadily risen. When buffer usage and WAL usage are enabled, we're
> executing over 1000 instructions! And each single Instrumentation node is ~450
> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> embedded.
> 
> If we instead have InstrStartNode() set up a global pointer to the
> Instrumentation node, we can make the instrumentation code modify both the
> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> changes - right now nodes "automatically" include the IO/WAL incurred in child
> nodes, but that's just a small bit of additional summin-up to be done during
> EXPLAIN.
> 

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

> 
> Separately, I think we should consider re-ordering Instrumentation so that
> bufusage_start, walusage_start are after the much more commonly used
> elements. We're forcing ntuples, nloops, .. onto separate cachelines, even
> though they're accounted for unconditionally.
> 

+1 to that


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Add <> support to sepgsql_restorecon

2023-01-17 Thread Ted X Toth
The intent of this patch is not to stop all relabeling, only to stop
sepgsql_restorecon from doing a bulk relabel. I believe sepgsql_object_relabel
is called by the 'SECURITY LABEL' statement, which I'm using to set the label
of db objects to a specific context that I would not want altered later by a
restorecon. This is particularly important in an MLS (multi-level security)
environment where, for example, if a row were labeled at the 'secret' level I
would not want restorecon to relabel it, possibly causing a downgrade.

The new status of this patch is: Ready for Committer


Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-01-16 Mo 21:58, Tom Lane wrote:
>> I dunno if we want to stretch buildfarm owners' patience with yet
>> another BF client release right now.  On the other hand, I'm antsy
>> to see if we can un-revert 1b4d280ea after doing a little more
>> work in AdjustUpgrade.pm.

> It looks like the only animals doing the cross version tests crake,
> drongo and fairywren. These are all mine, so I don't think we need to do
> a new release for this.

copperhead, kittiwake, snapper, and tadarida were running them
until fairly recently.

regards, tom lane




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Robert Haas
On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan  wrote:
> On Mon, Jan 16, 2023 at 8:25 AM Robert Haas  wrote:
> > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > That looks completely magical from a user perspective. Some users
> > aren't going to understand autovacuum behavior at all. Some will, and
> > will be able to compare age(relfrozenxid) against
> > autovacuum_freeze_max_age. Very few people are going to think to
> > compare age(relfrozenxid) against some formula based on
> > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
>
> What do you think of Andres' autovacuum_no_auto_cancel_age proposal?

I like it better than your proposal. I don't think it's a fundamental
improvement and I would rather see a fundamental improvement, but I
can see it being better than nothing.

> > I do like the idea of driving the auto-cancel behavior off of the
> > results of previous attempts to vacuum the table. That could be done
> > independently of the XID age of the table.
>
> Even when the XID age of the table has already significantly surpassed
> autovacuum_freeze_max_age, say due to autovacuum worker starvation?
>
> > If we've failed to vacuum
> > the table, say, 10 times, because we kept auto-cancelling, it's
> > probably appropriate to force the issue.
>
> I suggested 1000 times upthread. 10 times seems very low, at least if
> "number of times cancelled" is the sole criterion, without any
> attention paid to relfrozenxid age or some other tiebreaker.

Hmm, I think that a threshold of 1000 is far too high to do much good.
By the time we've tried to vacuum a table 1000 times and failed every
time, I anticipate that the situation will be pretty dire, regardless
of why we thought the table needed to be vacuumed in the first place.
In the best case, with autovacuum_naptime = 1 minute, failing 1000 times
means that we've delayed vacuuming the table for at least 16 hours.
That's assuming that there's a worker available to retry every minute
and that we fail quickly. If it's a 2 hour vacuum operation and we
typically fail about halfway through, it could take us over a month to
hit 1000 failures. There are many tables out there that get enough
inserts, updates, and deletes that a 16-hour delay will result in
irreversible bloat, never mind a 41-day delay. After even a few days,
wraparound could become critical even if bloat isn't.

I'm not sure why a threshold of 10 would be too low. It seems to me
that if we fail ten times in a row to vacuum a table and fail for the
same reason every time, we're probably going to keep failing for that
reason. If that is true, we will be better off if we force the issue
sooner rather than later. There's no value in letting the table bloat
out the wazoo and the cluster approach a wraparound shutdown before we
insist. Consider a more mundane example. If I try to start my car or
my dishwasher or my computer or my toaster oven ten times and it fails
ten times in a row, and the failure mode appears to be the same each
time, I am not going to sit there and try 990 more times hoping things
get better, because that seems very unlikely to help. Honestly,
depending on the situation, I might not even get to ten times before I
switch to doing some form of troubleshooting and/or calling someone
who could repair the device.

In fact I think there's a decent argument that a threshold of ten is
possibly too high here, too. If you wait until the tenth try before
you try not auto-cancelling, then a table with a workload that makes
auto-cancelling 100% probable will get vacuumed 10% as often as it
would otherwise. I think there are cases where that would be OK, but
probably on the whole it's not going to go very well. The only problem
I see with lowering the threshold below ~10 is that the signal starts
to get weak. If something fails for the same reason ten times in a row
you can be pretty sure it's a chronic problem. If you made the
threshold, say, three, you'd probably start making bad decisions
sometimes -- you'd think that you had a chronic problem when really
you just got a bit unlucky.

To get back to the earlier question above, I think that if the
retries-before-not-auto-cancelling threshold were low enough to be
effective, you wouldn't necessarily need to consider XID age as a
second reason for not auto-cancelling. You would want to force the
behavior anyway when you hit emergency mode, because that should force
all the mitigations we have, but I don't know that you need to do
anything before that.

> > It doesn't really matter
> > whether the autovacuum triggered because of bloat or because of XID
> > age. Letting either of those things get out of control is bad.
>
> While inventing a new no-auto-cancel behavior that prevents bloat from
> getting completely out of control may well have merit, I don't see why
> it needs to be attached to this other effort.

It doesn't, but I think it would be a lot more beneficial than just
adding a new GUC. A lot of the fu

Re: minor bug

2023-01-17 Thread Tom Lane
Laurenz Albe  writes:
> On Mon, 2023-01-16 at 19:59 +0100, Torsten Förtsch wrote:
>> So, the timestamp displayed in the log message is certainly wrong.

> If recovery stops at a WAL record that has no timestamp, you get this
> bogus recovery stop time.  I think we should show the recovery stop time
> only if time was the target, as in the attached patch.

I don't think that is a tremendously useful definition: the user
already knows what recoveryStopTime is, or can find it out from
their settings easily enough.

I seem to recall that the original idea was to report the timestamp
of the commit/abort record we are stopping at.  Maybe my memory is
faulty, but I think that'd be significantly more useful than the
current behavior, *especially* when the replay stopping point is
defined by something other than time.

(Also, the wording of the log message suggests that that's what
the reported time is supposed to be.  I wonder if somebody messed
this up somewhere along the way.)

regards, tom lane




Re: Issue attaching a table to a partitioned table with an auto-referenced foreign key

2023-01-17 Thread Guillaume Lelarge
Quick ping, just to make sure someone can get a look at this issue :)
Thanks.


Le ven. 6 janv. 2023 à 11:07, Guillaume Lelarge  a
écrit :

> Hello,
>
> One of our customers has an issue with partitions and foreign keys. He
> works on a v13, but the issue is also present on v15.
>
> I attach a SQL script showing the issue, and the results on 13.7, 13.9,
> and 15.1. But I'll explain the script here, and its behaviour on 13.9.
>
> There is one partitioned table, two partitions and a foreign key. The
> foreign key references the same table:
>
> create table t1 (
>   c1 bigint not null,
>   c1_old bigint null,
>   c2 bigint not null,
>   c2_old bigint null,
>   primary key (c1, c2)
>   )
>   partition by list (c1);
> create table t1_a   partition of t1 for values in (1);
> create table t1_def partition of t1 default;
> alter table t1 add foreign key (c1_old, c2_old) references t1 (c1, c2) on
> delete restrict on update restrict;
>
> I have a SQL function that shows me some information from pg_constraint
> (the code of the function is in the attached SQL script). Here is the result of
> this function after creating the table, its partitions, and its foreign key:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey1 | t1 | t1_a   | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (5 rows)
>
> The constraint works great :
>
> insert into t1 values(1, NULL, 2, NULL);
> insert into t1 values(2, 1,2, 2);
> delete from t1 where c1 = 1;
> psql:ticket15010_v3.sql:34: ERROR:  update or delete on table "t1_a"
> violates foreign key constraint "t1_c1_old_c2_old_fkey1" on table "t1"
> DETAIL:  Key (c1, c2)=(1, 2) is still referenced from table "t1".
>
> This error is normal since the line I want to delete is referenced on the
> other line.
>
> If I try to detach the partition, it also gives me an error.
>
> alter table t1 detach partition t1_a;
> psql:ticket15010_v3.sql:36: ERROR:  removing partition "t1_a" violates
> foreign key constraint "t1_c1_old_c2_old_fkey1"
> DETAIL:  Key (c1_old, c2_old)=(1, 2) is still referenced from table "t1".
>
> Sounds good to me too (well, I'd like it to be smarter and find that the
> constraint is still good after the detach, but I can understand why it
> won't allow it).
>
> The pg_constraint didn't change of course:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey1 | t1 | t1_a   | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (5 rows)
>
> Now, I'll delete the whole table contents, and I'll detach the partition:
>
> delete from t1;
> alter table t1 detach partition t1_a;
>
> It seems to be working, but the content of pg_constraint is weird:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 |
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (4 rows)
>
> I understand why the ('t1_c1_old_c2_old_fkey1', 't1', 't1_a',
> 't1_c1_old_c2_old_fkey') tuple has gone but I don't understand why the
> ('t1_c1_old_c2_old_fkey', 't1_a', 't1', NULL) tuple is still there.
>
> Anyway, I attach the partition:
>
> alter table t1 attach partition t1_a for values in (1);
>
> But pg_constraint has not changed:
>
> select * from show_constraints();
> conname |   t|  tref  |   coparent
> +++---
>  t1_c1_old_c2_old_fkey  | t1 | t1 |
>  t1_c1_old_c2_old_fkey  | t1_a   | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey  | t1_def | t1 | t1_c1_old_c2_old_fkey
>  t1_c1_old_c2_old_fkey2 | t1 | t1_def | t1_c1_old_c2_old_fkey
> (4 rows)
>
> I was expecting to see the fifth tuple coming back, but alas, no.
>
> And as a result, the foreign key doesn't work anymore:
>
> insert into t1 values(1, NULL, 2, NULL);
> insert into t1 values(2, 1,2, 2);
> delete from t1 where c1 = 1;
>
> Well, let's truncate the partitioned table, and drop the partition:
>
> truncate t1;
> drop table t1_a;
>
> The content of pg_constraint looks good to me:
>
> select * from show_constraints();
> conname |   

Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Andrew Dunstan


On 2023-01-17 Tu 10:18, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 2023-01-16 Mo 21:58, Tom Lane wrote:
>>> I dunno if we want to stretch buildfarm owners' patience with yet
>>> another BF client release right now.  On the other hand, I'm antsy
>>> to see if we can un-revert 1b4d280ea after doing a little more
>>> work in AdjustUpgrade.pm.
>> It looks like the only animals doing the cross version tests crake,
>> drongo and fairywren. These are all mine, so I don't think we need to do
>> a new release for this.
> copperhead, kittiwake, snapper, and tadarida were running them
> until fairly recently.
>
>   


Ah, yes, true, I didn't look far enough back.

The new file can be downloaded from

- it's a drop-in replacement.

FYI crake has just passed the test with flying colours.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Tom Lane
Andrew Dunstan  writes:
> FYI crake has just passed the test with flying colours.

Cool.  I await the Windows machines' results with interest.

regards, tom lane




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 08:46:12 -0500, Robert Haas wrote:
> On Fri, Jan 13, 2023 at 2:56 PM Andres Freund  wrote:
> > Does anybody see a reason to not move forward with this aspect? We do a fair
> > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
> > just using nanoseconds. We'd also save memory in BufferUsage (144-122 
> > bytes),
> > Instrumentation (16 bytes saved in Instrumentation itself, 32 via
> > BufferUsage).

Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:

0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
  warn

  Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
  just allow to assign 0.

0002) Convert instr_time to uint64

  This is the cleaned up version of the prior patch. The main change is
  that it deduplicated a lot of the code between the architectures.

0003) Add INSTR_TIME_SET_SECOND()

  This is used in 0004. Just allows setting an instr_time to a time in
  seconds, allowing for a cheaper loop exit condition in 0004.

0004) report nanoseconds in pg_test_timing


I also couldn't help hacking a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticeable speedup. But that's for another
day.
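
Roughly, the trick is the following (a sketch that assumes a constant, known
TSC frequency and GCC/Clang 128-bit integers; not the code I'd actually
propose):

#include <stdint.h>

#define CYC2NS_SHIFT 32

static uint64_t cyc2ns_mult;	/* fixed-point ns-per-cycle, set once at startup */

static void
cyc2ns_init(double tsc_hz)
{
	/* nanoseconds per cycle, scaled by 2^32 */
	cyc2ns_mult = (uint64_t) ((1e9 / tsc_hz) * 4294967296.0);
}

static inline uint64_t
cyc2ns(uint64_t cycles)
{
	/* widen to 128 bit so the multiplication cannot overflow */
	return (uint64_t) (((__uint128_t) cycles * cyc2ns_mult) >> CYC2NS_SHIFT);
}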

I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?


> I read through 0001 and it seems basically fine to me. Comments:
>
> 1. pg_clock_gettime_ns() doesn't follow pgindent conventions.

Fixed.


> 2. I'm not entirely sure that the new .?S_PER_.?S macros are
> worthwhile but maybe they are, and in any case I don't care very much.

There are now fewer, but those I'd like to keep. I just end up counting digits
manually way too many times.


> 3. I've always found 'struct timespec' to be pretty annoying
> notationally, so I like the fact that this patch would reduce use of
> it.

Same.

Greetings,

Andres Freund
From c1024a9dfa7f5645200b7fa68e8bce5561c9cee0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 16 Jan 2023 10:04:42 -0800
Subject: [PATCH v7 1/4] Zero initialize instr_time uses causing compiler
 warnings

These are all not necessary from a correctness POV. However, in a subsequent
patch instr_time will be simplified to an int64, at which point gcc would
otherwise start to warn about the changed places.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlog.c   | 4 
 src/backend/storage/buffer/bufmgr.c | 4 
 src/backend/storage/file/buffile.c  | 4 
 src/bin/psql/common.c   | 6 ++
 4 files changed, 18 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f47fb75700..7d65b1d4159 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 /* Measure I/O timing to write WAL data */
 if (track_wal_io_timing)
 	INSTR_TIME_SET_CURRENT(start);
+else
+	INSTR_TIME_SET_ZERO(start);
 
 pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 written = pg_pwrite(openLogFile, from, nleft, startoffset);
@@ -8150,6 +8152,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
 		INSTR_TIME_SET_CURRENT(start);
+	else
+		INSTR_TIME_SET_ZERO(start);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (sync_method)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a6..800a4248c95 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			if (track_io_timing)
 INSTR_TIME_SET_CURRENT(io_start);
+			else
+INSTR_TIME_SET_ZERO(io_start);
 
 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
 
@@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 
 	if (track_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c5464b6aa62..0a51624df3b 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file)
 
 	if (track_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
 
 	/*
 	 * Read whatever we can get, up to a full bufferload.
@@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile *f

Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
> I don't understand why we would even use timestamps, in this case? AFAIK
> "sampling profilers" simply increment a counter for the executing node,
> and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...


> That also does not have issues with timestamp "rounding" - considering
> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
> possible the node completes within 1ms, in which case
> 
>(now - last_sample)
> 
> ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there are others re-computing timestamps.


> And I don't think there's any particularly good way to correct this.
> 
> It seems ExplainNode() attempts to do some correction, but I doubt
> that's very reliable, as these fast nodes will have sampled_total=0, so
> no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?


> > I've been thinking that we should consider making more of the 
> > instrumentation
> > code work like that. The amount of work we're doing in InstrStart/StopNode()
> > has steadily risen. When buffer usage and WAL usage are enabled, we're
> > executing over 1000 instructions! And each single Instrumentation node is 
> > ~450
> > bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> > embedded.
> > 
> > If we instead have InstrStartNode() set up a global pointer to the
> > Instrumentation node, we can make the instrumentation code modify both the
> > "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> > current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> > changes - right now nodes "automatically" include the IO/WAL incurred in 
> > child
> > nodes, but that's just a small bit of additional summin-up to be done during
> > EXPLAIN.
> > 
> 
> That's certainly one way to implement that. I wonder if we could make
> that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.


Greetings,

Andres Freund




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Tom Lane
Andres Freund  writes:
> Here's an updated version of the move to representing instr_time as
> nanoseconds. It's now split into a few patches:

I took a quick look through this.

> 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
>   warn
>   Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
>   just allow to assign 0.

I think it's probably wise to keep the macro.  If we ever rethink this
again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
even if it would work with just compare-to-zero.  I'm almost tempted
to suggest you define instr_time as a struct with a uint64 field,
just to help keep us honest about that.
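
Something along these lines, say (just a sketch, assuming the usual c.h
typedefs):

/* wrap the value so callers have to go through the macros */
typedef struct instr_time
{
	uint64		ticks;			/* elapsed time in nanoseconds */
} instr_time;

#define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
#define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)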

> 0003) Add INSTR_TIME_SET_SECOND()
>   This is used in 0004. Just allows setting an instr_time to a time in
>   seconds, allowing for a cheaper loop exit condition in 0004.

Code and comments are inconsistent about whether it's SET_SECOND or
SET_SECONDS.  I think I prefer the latter, but don't care that much.

> 0004) report nanoseconds in pg_test_timing

Didn't examine 0004 in any detail, but the others look good to go
other than these nits.

regards, tom lane




Re: CI and test improvements

2023-01-17 Thread Justin Pryzby
The autoconf system runs all tap tests in t/*.pl, but meson requires
enumerating them in ./meson.build.

This checks for and finds no missing tests in the current tree:

$ for pl in `find src contrib -path '*/t/*.pl'`; do base=${pl##*/}; 
dir=${pl%/*}; meson=${dir%/*}/meson.build; grep "$base" "$meson" >/dev/null || 
echo "$base is missing from $meson"; done

However, this finds two real problems and one false-positive with
missing regress/isolation tests:

$ for makefile in `find src contrib -name Makefile`; do for testname in `sed -r 
'/^(REGRESS|ISOLATION) =/!d; s///; :l; /\\$/{s///; N; b l}; s/\n//g' 
"$makefile"`; do meson=${makefile%/Makefile}/meson.build; grep -Fw "$testname" 
"$meson" >/dev/null || echo "$testname is missing from $meson"; done; done
guc_privs is missing from src/test/modules/unsafe_tests/meson.build
oldextversions is missing from contrib/pg_stat_statements/meson.build
$(CF_PGP_TESTS) is missing from contrib/pgcrypto/meson.build

I also tried but failed to write something to warn if "meson test" was
run with a list of tests but without tmp_install.  Help wanted.

I propose to put something like this into "SanityCheck".

-- 
Justin




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Tomas Vondra
On 1/17/23 18:02, Andres Freund wrote:
> Hi,
> 
> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
>> I don't understand why we would even use timestamps, in this case? AFAIK
>> "sampling profilers" simply increment a counter for the executing node,
>> and then approximate the time as proportional to the count.
> 
> The timer interrupt distances aren't all that evenly spaced, particularly
> under load, and are easily distorted by having to wait for IO, an lwlock ...
> 

OK, so the difference is that these events (I/O, lwlocks) may block
signals, and after signals get unblocked we only get a single event for
each signal. Yeah, the timestamp handles that case better.

> 
>> That also does not have issues with timestamp "rounding" - considering
>> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
>> possible the node completes within 1ms, in which case
>>
>>(now - last_sample)
>>
>> ends up being 0 (assuming I correctly understand the code).
> 
> That part should be counting in nanoseconds, I think? Unless I misunderstand
> something?
> 

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

Perhaps for many executions it works out, because some executions will
cross the boundary, and the average will converge to the right value.

> We already compute the timestamp inside timeout.c, but don't yet pass that to
> timeout handlers. I think there's others re-computing timestamps.
> 
> 
>> And I don't think there's any particularly good way to correct this.
>>
>> It seems ExplainNode() attempts to do some correction, but I doubt
>> that's very reliable, as these fast nodes will have sampled_total=0, so
>> no matter what you multiply this with, it'll still be 0.
> 
> That's just the scaling to the "actual time" that you're talking about above,
> no?
> 

Maybe, not sure.

> 
>>> I've been thinking that we should consider making more of the 
>>> instrumentation
>>> code work like that. The amount of work we're doing in InstrStart/StopNode()
>>> has steadily risen. When buffer usage and WAL usage are enabled, we're
>>> executing over 1000 instructions! And each single Instrumentation node is 
>>> ~450
>>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
>>> embedded.
>>>
>>> If we instead have InstrStartNode() set up a global pointer to the
>>> Instrumentation node, we can make the instrumentation code modify both the
>>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
>>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
>>> changes - right now nodes "automatically" include the IO/WAL incurred in 
>>> child
>>> nodes, but that's just a small bit of additional summin-up to be done during
>>> EXPLAIN.
>>>
>>
>> That's certainly one way to implement that. I wonder if we could make
>> that work without the global pointer, but I can't think of any.
> 
> I don't see a realistic way at least. We could pass down an
> "InstrumentationContext" through everything that needs to do IO and WAL. But
> that seems infeasible at this point.
> 

Why infeasible?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> On Fri, Jan 13, 2023 at 9:55 PM Andres Freund  wrote:
> When I express skepticism of very high autovacuum_freeze_max_age
> settings, it's mostly just that I don't think that age(relfrozenxid)
> is at all informative in the way that something that triggers
> autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> less informative it becomes. I'm sure that it can be safe to use very
> high autovacuum_freeze_max_age values, but it seems like a case of
> using a very complicated and unreliable thing to decide when to
> VACUUM.

Both you and Robert said this, and I have seen it be true, but typically not
for large high-throughput OLTP databases, where I found increasing
relfrozenxid to be important. Sure, there's probably some up/down through the
day / week, but it's likely to be pretty predictable.

I think the problem is that an old relfrozenxid doesn't tell you how much
outstanding work there is. Perhaps that's what both of you meant...


I think that's not the fault of relfrozenxid as a trigger, but that we simply
don't keep enough other stats. We should imo at least keep track of:

In pg_class:

- The number of all frozen pages, like we do for all-visible

  That'd give us a decent upper bound for the amount of work we need to do to
  increase relfrozenxid. It's important that this is crash safe (thus no
  pg_stats), and it only needs to be written when we'd likely make other
  changes to the pg_class row anyway.


In pgstats:

- The number of dead items, incremented both by the heap scan and
  opportunistic pruning

  This would let us estimate how urgently we need to clean up indexes.

- The xid/mxid horizons during the last vacuum

- The number of pages with tuples that couldn't be removed due to the horizon
  during the last vacuum

  Together with the horizon, this would let us avoid repeated vacuums that
  won't help. Tracking the number of pages instead of tuples allows much
  better cost/benefit estimation of another vacuum.

- The number of pages with tuples that couldn't be frozen

  Similar to the dead tuple one, except that it'd help avoid repeated vacuums
  to increase relfrozenxid, when it won't be able to help.
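
To make that concrete, the extra bookkeeping listed above could be sketched
like this (purely illustrative; none of these fields exist today, and the
types assume the usual postgres.h/access headers):

/* would live in pg_class, updated together with other pg_class changes */
typedef struct PgClassVacuumExtra
{
	BlockNumber relallfrozen;	/* # of all-frozen pages, like relallvisible */
} PgClassVacuumExtra;

/* would live in pgstats, maintained by vacuum and opportunistic pruning */
typedef struct PgStatVacuumExtra
{
	int64		dead_items;					/* dead item pointers seen */
	TransactionId last_vacuum_xid_horizon;	/* horizons of the last vacuum */
	MultiXactId last_vacuum_mxid_horizon;
	int64		pages_blocked_by_horizon;	/* tuples not yet removable */
	int64		pages_not_freezable;		/* tuples not yet freezable */
} PgStatVacuumExtra;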


> > I've seen a ~2TB table grow to ~20TB due to dead tuples, at which point the
> > server crash-restarted due to WAL ENOSPC. I think in that case there wasn't
> > even DDL, they just needed to make the table readonly, for a brief moment, a
> > few times a day. The problem started once the indexes grew to be large 
> > enough
> > that the time for (re-)finding dead tuples + and an index scan phase got 
> > large
> > enough that they were unlucky to be killed a few times in a row. After that
> > autovac never got through the index scan phase. Not doing any "new" work, 
> > just
> > collecting the same dead tids over and over, scanning the indexes for those
> > tids, never quite finishing.
>
> That makes sense to me, though I wonder if there would have been
> another kind of outage (not the one you actually saw) had the
> autocancellation behavior somehow been disabled after the first few
> cancellations.

I suspect so, but it's hard to know.


> This sounds like a great argument in favor of suspend-and-resume as a
> way of handling autocancellation -- no useful work needs to be thrown
> away for AV to yield for a minute or two. One ambition of mine for the
> visibility map snapshot infrastructure was to be able to support
> suspend-and-resume. It wouldn't be that hard for autovacuum to
> serialize everything, and use that to pick up right where an
> autovacuum worker left off when it was autocancelled. Same
> OldestXmin starting point as before, same dead_items array, same
> number of scanned_pages (and pages left to scan).

Hm, that seems like a lot of work. Without having held a lock you don't even know
whether your old dead items still apply. Of course it'd improve the situation
substantially, if we could get it.



> > If you know postgres internals, it's easy. You know that autovac stops
> > auto-cancelling after freeze_max_age and you know that the lock queue is
> > ordered. So you issue DROP TABLE / TRUNCATE / whatever and immediately
> > afterwards cancel the autovac worker. But if you aren't, it'll take a while to
> > figure out that the DROP TABLE isn't progressing due to a lock conflict, at
> > which point you'll cancel the statement (possibly having wreaked havoc with
> > all other accesses). Then you figure out that you need to cancel
> > autovacuum. After you try the DROP TABLE again - but the next worker has
> > gotten to work on the table.
>
> Yeah, that's pretty bad. Maybe DROP TABLE and TRUNCATE should be
> special cases? Maybe they should always be able to auto cancel an
> autovacuum?

Yea, I think so. It's not obvious how to best pass down that knowledge into
ProcSleep(). It'd have to be in the LOCALLOCK, I think. Looks like the best
way would be to change LockAcquireExtended() to get a flag

Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Brar Piening

On 17.01.2023 at 14:12, Karl O. Pinc wrote:

If you're not willing to try I am willing to see if I can
produce an example to work from.  My XSLT is starting to
come back.


I'm certainly willing to try but I'd appreciate an example in any case.

My XSLT skills are mostly learning by doing combined with trial and error.

Regards,

Brar





Re: recovery modules

2023-01-17 Thread Nathan Bossart
On Tue, Jan 17, 2023 at 02:32:03PM +0900, Michael Paquier wrote:
> Could it be cleaner in the long term to remove entirely
> BuildRestoreCommand() and move the conversion of the xlogpath with
> make_native_path() one level higher in the stack?

Yeah, this seems cleaner.  I removed BuildRestoreCommand() in v8.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c710f5a9e294b198ce6bb2e8d9404cb26a76b913 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 23 Dec 2022 16:53:38 -0800
Subject: [PATCH v8 1/2] Refactor code for restoring files via shell.

Presently, restore_command uses a different code path than
archive_cleanup_command and recovery_end_command.  These code paths
are similar and can be easily combined.
---
 src/backend/access/transam/shell_restore.c | 99 ++
 src/backend/access/transam/xlogarchive.c   |  1 -
 src/common/Makefile|  1 -
 src/common/archive.c   | 60 -
 src/common/meson.build |  1 -
 src/common/percentrepl.c   | 13 +--
 src/fe_utils/archive.c | 11 ++-
 src/include/common/archive.h   | 21 -
 src/tools/msvc/Mkvcbuild.pm|  2 +-
 9 files changed, 56 insertions(+), 153 deletions(-)
 delete mode 100644 src/common/archive.c
 delete mode 100644 src/include/common/archive.h

diff --git a/src/backend/access/transam/shell_restore.c b/src/backend/access/transam/shell_restore.c
index 7753a7d667..4752be0b9f 100644
--- a/src/backend/access/transam/shell_restore.c
+++ b/src/backend/access/transam/shell_restore.c
@@ -20,16 +20,14 @@
 
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
-#include "common/archive.h"
 #include "common/percentrepl.h"
 #include "storage/ipc.h"
 #include "utils/wait_event.h"
 
-static void ExecuteRecoveryCommand(const char *command,
+static bool ExecuteRecoveryCommand(const char *command,
    const char *commandName,
-   bool failOnSignal,
-   uint32 wait_event_info,
-   const char *lastRestartPointFileName);
+   bool failOnSignal, bool exitOnSigterm,
+   uint32 wait_event_info, int fail_elevel);
 
 /*
  * Attempt to execute a shell-based restore command.
@@ -40,25 +38,16 @@ bool
 shell_restore(const char *file, const char *path,
 			  const char *lastRestartPointFileName)
 {
+	char	   *nativePath = pstrdup(path);
 	char	   *cmd;
-	int			rc;
+	bool		ret;
 
 	/* Build the restore command to execute */
-	cmd = BuildRestoreCommand(recoveryRestoreCommand, path, file,
-			  lastRestartPointFileName);
-
-	ereport(DEBUG3,
-			(errmsg_internal("executing restore command \"%s\"", cmd)));
-
-	/*
-	 * Copy xlog from archival storage to XLOGDIR
-	 */
-	fflush(NULL);
-	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
-	rc = system(cmd);
-	pgstat_report_wait_end();
-
-	pfree(cmd);
+	make_native_path(nativePath);
+	cmd = replace_percent_placeholders(recoveryRestoreCommand,
+	   "restore_command", "frp", file,
+	   lastRestartPointFileName, nativePath);
+	pfree(nativePath);
 
 	/*
 	 * Remember, we rollforward UNTIL the restore fails so failure here is
@@ -84,17 +73,11 @@ shell_restore(const char *file, const char *path,
 	 *
 	 * We treat hard shell errors such as "command not found" as fatal, too.
 	 */
-	if (rc != 0)
-	{
-		if (wait_result_is_signal(rc, SIGTERM))
-			proc_exit(1);
-
-		ereport(wait_result_is_any_signal(rc, true) ? FATAL : DEBUG2,
-(errmsg("could not restore file \"%s\" from archive: %s",
-		file, wait_result_to_str(rc))));
-	}
+	ret = ExecuteRecoveryCommand(cmd, "restore_command", true, true,
+ WAIT_EVENT_RESTORE_COMMAND, DEBUG2);
+	pfree(cmd);
 
-	return (rc == 0);
+	return ret;
 }
 
 /*
@@ -103,9 +86,14 @@ shell_restore(const char *file, const char *path,
 void
 shell_archive_cleanup(const char *lastRestartPointFileName)
 {
-	ExecuteRecoveryCommand(archiveCleanupCommand, "archive_cleanup_command",
-		   false, WAIT_EVENT_ARCHIVE_CLEANUP_COMMAND,
-		   lastRestartPointFileName);
+	char	   *cmd;
+
+	cmd = replace_percent_placeholders(archiveCleanupCommand,
+	   "archive_cleanup_command",
+	   "r", lastRestartPointFileName);
+	(void) ExecuteRecoveryCommand(cmd, "archive_cleanup_command", false, false,
+  WAIT_EVENT_ARCHIVE_CLEANUP_COMMAND, WARNING);
+	pfree(cmd);
 }
 
 /*
@@ -114,9 +102,14 @@ shell_archive_cleanup(const char *lastRestartPointFileName)
 void
 shell_recovery_end(const char *lastRestartPointFileName)
 {
-	ExecuteRecoveryCommand(recoveryEndCommand, "recovery_end_command", true,
-		   WAIT_EVENT_RECOVERY_END_COMMAND,
-		   lastRestartPointFileName);
+	char	   *cmd;
+
+	cmd = replace_percent_placeholders(recoveryEndCommand,
+	   "recovery_end_command",
+	   "r", lastRestartPointFileName);
+	(void) ExecuteRecoveryCommand(cmd, "recovery_end_command", true, false,
+  WAIT_EVENT_RECOVERY_END_COMMAND, WARNING);
+	pfre

Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 10:26:52 -0500, Robert Haas wrote:
> On Mon, Jan 16, 2023 at 11:11 PM Peter Geoghegan  wrote:
> > On Mon, Jan 16, 2023 at 8:25 AM Robert Haas  wrote:
> > > I really dislike formulas like Min(freeze_max_age * 2, 1 billion).
> > > That looks completely magical from a user perspective. Some users
> > > aren't going to understand autovacuum behavior at all. Some will, and
> > > will be able to compare age(relfrozenxid) against
> > > autovacuum_freeze_max_age. Very few people are going to think to
> > > compare age(relfrozenxid) against some formula based on
> > > autovacuum_freeze_max_age. I guess if we document it, maybe they will.
> >
> > What do you think of Andres' autovacuum_no_auto_cancel_age proposal?
> 
> I like it better than your proposal. I don't think it's a fundamental
> improvement and I would rather see a fundamental improvement, but I
> can see it being better than nothing.

That's similar to my feelings about it.

I do think it'll be operationally nice to have at least some window where an
autovacuum is triggered due to age and where it won't prevent cancels. In many
situations it'll likely suffice if that window is autovacuum_naptime *
xids_per_sec large, but of course that's easily enough exceeded.


> > > I do like the idea of driving the auto-cancel behavior off of the
> > > results of previous attempts to vacuum the table. That could be done
> > > independently of the XID age of the table.
> >
> > Even when the XID age of the table has already significantly surpassed
> > autovacuum_freeze_max_age, say due to autovacuum worker starvation?
> >
> > > If we've failed to vacuum
> > > the table, say, 10 times, because we kept auto-cancelling, it's
> > > probably appropriate to force the issue.
> >
> > I suggested 1000 times upthread. 10 times seems very low, at least if
> > "number of times cancelled" is the sole criterion, without any
> > attention paid to relfrozenxid age or some other tiebreaker.
> 
> Hmm, I think that a threshold of 1000 is far too high to do much good.

Agreed.


> By the time we've tried to vacuum a table 1000 times and failed every
> time, I anticipate that the situation will be pretty dire, regardless
> of why we thought the table needed to be vacuumed in the first place.

Agreed.


> In the best case, with autovacuum_naptime=1minute, failing 1000 times
> means that we've delayed vacuuming the table for at least 16 hours.

Perhaps it'd make sense for an auto-cancelled worker to signal the launcher to
do a cycle of vacuuming? Or even to just try to vacuum the table again
immediately? After all, we know that the table is going to be on the schedule
of the next worker immediately. Of course we shouldn't retry indefinitely, but
...


> In fact I think there's a decent argument that a threshold of ten is
> possibly too high here, too. If you wait until the tenth try before
> you try not auto-cancelling, then a table with a workload that makes
> auto-cancelling 100% probable will get vacuumed 10% as often as it
> would otherwise. I think there are cases where that would be OK, but
> probably on the whole it's not going to go very well.

That's already kind of the case - we'll only block auto-cancelling when
exceeding autovacuum_freeze_max_age, all the other autovacuums will be
cancelable.


> The only problem I see with lowering the threshold below ~10 is that the
> signal starts to get weak. If something fails for the same reason ten times
> in a row you can be pretty sure it's a chronic problem. If you made the
> threshold say three you'd probably start making bad decisions sometimes --
> you'd think that you had a chronic problem when really you just got a bit
> unlucky.

Yea. Schema migrations in prod databases typically have to run in
single-statement or very small transactions, for obvious reasons. Needing to
lock the same table exclusively a few times during a schema migration is
pretty normal, particularly when foreign keys are involved. Getting blocked by
autovacuum in the middle of a schema migration is NASTY.

This is why I'm a bit worried that 10 might be too low... It's not absurd for
a schema migration to create 10 new tables referencing an existing table in
need of vacuuming.

Perhaps we should track when the first failure was, and take that into
account? Clearly having all 10 autovacuums on the same table cancelled is
different when those 10 cancellations happened in the last 10 *
autovacuum_naptime minutes, than when the last successful autovacuum was hours
ago.
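
As a toy illustration of that idea (thresholds and names entirely made up, not a proposal for the actual logic):

#include <stdbool.h>
#include <time.h>

/*
 * Block auto-cancellation only when the table has been cancelled repeatedly
 * AND has gone a long time without a successful autovacuum, so that a short
 * burst of cancellations during a schema migration doesn't trip it.
 */
bool
should_block_autocancel(int cancels_since_last_success,
                        time_t last_success, time_t now,
                        double naptime_secs)
{
    if (difftime(now, last_success) < 10 * naptime_secs)
        return false;           /* recent success: a burst isn't alarming yet */

    return cancels_since_last_success >= 10;
}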


> To get back to the earlier question above, I think that if the
> retries-before-not-auto-cancelling threshold were low enough to be
> effective, you wouldn't necessarily need to consider XID age as a
> second reason for not auto-cancelling. You would want to force the
> behavior anyway when you hit emergency mode, because that should force
> all the mitigations we have, but I don't know that you need to do
> anything before that.

Hm, without further restrictions, that has me wo

Re: almost-super-user problems that we haven't fixed yet

2023-01-17 Thread Nathan Bossart
On Mon, Jan 16, 2023 at 09:06:10PM -0500, Robert Haas wrote:
> On Mon, Jan 16, 2023 at 5:37 PM Nathan Bossart wrote:
>> On Mon, Jan 16, 2023 at 02:29:56PM -0500, Robert Haas wrote:
>> > 4. You can reserve a small number of connections for the superuser
>> > with superuser_reserved_connections, but there's no way to do a
>> > similar thing for any other user. As mentioned above, a CREATEROLE
>> > user could set connection limits for every created role such that the
>> > sum of those limits is less than max_connections by some margin, but
>> > that restricts each of those roles individually, not all of them in
>> > the aggregate. Maybe we could address this by inventing a new GUC
>> > reserved_connections and a predefined role
>> > pg_use_reserved_connections.
>>
>> I've written something like this before, and I'd be happy to put together a
>> patch if there is interest.
> 
> Cool. I had been thinking of coding it up myself, but you doing it works, too.

Alright.  The one design question I have is whether this should be a new
set of reserved connections or replace superuser_reserved_connections
entirely.

If we create a new batch of reserved connections, only roles with
privileges of pg_use_reserved_connections would be able to connect if the
number of remaining slots is greater than superuser_reserved_connections
but less than or equal to superuser_reserved_connections +
reserved_connections.  Only superusers would be able to connect if the
number of remaining slots is less than or equal to
superuser_reserved_connections.  This helps avoid blocking new superuser
connections even if you've reserved some connections for non-superusers.
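
To spell out that arithmetic, here is a stand-alone sketch (the enum and function names are invented for illustration, not proposed code):

#include <stdio.h>

typedef enum
{
    SLOTS_OPEN_TO_ALL,          /* any role may take a slot */
    SLOTS_NEED_RESERVED_PRIV,   /* pg_use_reserved_connections or superuser */
    SLOTS_NEED_SUPERUSER        /* superuser only */
} SlotClass;

static SlotClass
classify_remaining_slots(int remaining, int superuser_reserved, int reserved)
{
    if (remaining <= superuser_reserved)
        return SLOTS_NEED_SUPERUSER;
    if (remaining <= superuser_reserved + reserved)
        return SLOTS_NEED_RESERVED_PRIV;
    return SLOTS_OPEN_TO_ALL;
}

int
main(void)
{
    /* e.g. superuser_reserved_connections = 3, reserved_connections = 5 */
    for (int remaining = 1; remaining <= 10; remaining++)
        printf("%d remaining -> class %d\n", remaining,
               classify_remaining_slots(remaining, 3, 5));
    return 0;
}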

If we replace superuser_reserved_connections, we're basically opening up
the existing functionality to non-superusers, which is simpler and probably
more in the spirit of this thread, but it doesn't provide a way to prevent
blocking new superuser connections.

My preference is the former approach.  This is closest to what I've written
before, and if I read your words carefully, it seems to be what you are
proposing.  WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:
> On 1/17/23 18:02, Andres Freund wrote:
> > On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
> >> That also does not have issues with timestamp "rounding" - considering
> >> e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
> >> possible the node completes within 1ms, in which case
> >>
> >>(now - last_sample)
> >>
> >> ends up being 0 (assuming I correctly understand the code).
> > 
> > That part should be counting in nanoseconds, I think? Unless I misunderstand
> > something?
> > 
> 
> The higher precision does not help, because both values come from the
> *sampled* timestamp (i.e. the one updated from the signal handler). So
> if the node happens to execute between two signals, the values are going
> to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.


I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.
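
A rough model of that hybrid, just to illustrate the shape of it (the cutoff of 16 exactly-timed executions and the struct are invented for this sketch; the "first few milliseconds" part is not modelled):

#include <stdbool.h>
#include <stdint.h>

typedef struct NodeTiming
{
    uint64_t    nexec;          /* executions of this node so far */
    uint64_t    exact_ns;       /* time from explicit clock reads */
    uint64_t    sampled_ns;     /* time attributed by the sampling timer */
} NodeTiming;

/* Decide, per execution, whether to pay for exact clock reads. */
static inline bool
use_exact_timing(const NodeTiming *t)
{
    return t->nexec < 16;
}

void
node_finished(NodeTiming *t, uint64_t exact_elapsed_ns, uint64_t sampled_elapsed_ns)
{
    if (use_exact_timing(t))
        t->exact_ns += exact_elapsed_ns;
    else
        t->sampled_ns += sampled_elapsed_ns;
    t->nexec++;
}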



> >>> I've been thinking that we should consider making more of the instrumentation
> >>> code work like that. The amount of work we're doing in InstrStart/StopNode()
> >>> has steadily risen. When buffer usage and WAL usage are enabled, we're
> >>> executing over 1000 instructions! And each single Instrumentation node is ~450
> >>> bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> >>> embedded.
> >>>
> >>> If we instead have InstrStartNode() set up a global pointer to the
> >>> Instrumentation node, we can make the instrumentation code modify both the
> >>> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> >>> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> >>> changes - right now nodes "automatically" include the IO/WAL incurred in child
> >>> nodes, but that's just a small bit of additional summing-up to be done during
> >>> EXPLAIN.
> >>>
> >>
> >> That's certainly one way to implement that. I wonder if we could make
> >> that work without the global pointer, but I can't think of any.
> > 
> > I don't see a realistic way at least. We could pass down an
> > "InstrumentationContext" through everything that needs to do IO and WAL. But
> > that seems infeasible at this point.

> Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
which'd require their callers to have access to the context.  That's hundreds
if not thousands places.

Adding that many function parameters might turn out to be noticable runtime
wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use by-reference struct for the
not-always-used parameters, that then also could contain this context.

Greetings,

Andres Freund




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Here's an updated version of the move to representing instr_time as
> > nanoseconds. It's now split into a few patches:
> 
> I took a quick look through this.

Thanks!


> > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
> >   warn
> >   Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
> >   just allow to assign 0.
> 
> I think it's probably wise to keep the macro.  If we ever rethink this
> again, we'll be glad we kept it.  Similarly, IS_ZERO is a good idea
> even if it would work with just compare-to-zero.

Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?


> I'm almost tempted to suggest you define instr_time as a struct with a
> uint64 field, just to help keep us honest about that.

I can see that making sense. Unless somebody pipes up with opposition to that
plan soon, I'll see how it goes.
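
Roughly what that could look like, combined with a value-style zero macro (purely illustrative, not the actual instr_time.h contents):

#include <stdint.h>

typedef struct instr_time
{
    uint64_t    ticks;          /* nanoseconds */
} instr_time;

#define INSTR_TIME_ZERO         ((instr_time) {0})
#define INSTR_TIME_SET_ZERO(t)  ((t).ticks = 0)
#define INSTR_TIME_IS_ZERO(t)   ((t).ticks == 0)

void
example(void)
{
    instr_time  start = INSTR_TIME_ZERO;    /* usable right in the declaration */

    (void) start;
    (void) INSTR_TIME_IS_ZERO(start);
}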


> > 0003) Add INSTR_TIME_SET_SECOND()
> >   This is used in 0004. Just allows setting an instr_time to a time in
> >   seconds, allowing for a cheaper loop exit condition in 0004.
> 
> Code and comments are inconsistent about whether it's SET_SECOND or
> SET_SECONDS.  I think I prefer the latter, but don't care that much.

That's probably because I couldn't decide... So I'll go with your preference.


> > 0004) report nanoseconds in pg_test_timing
> 
> Didn't examine 0004 in any detail, but the others look good to go
> other than these nits.

Thanks for looking!

Greetings,

Andres Freund




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
> > > @@ -359,6 +360,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
> > >   .snapshot_cb = pgstat_checkpointer_snapshot_cb,
> > >   },
> > >
> > > + [PGSTAT_KIND_IO] = {
> > > + .name = "io_ops",
> >
> > That should be "io" now I think?
> >
> 
> Oh no! I didn't notice this was broken. I've added pg_stat_have_stats()
> to the IO stats tests now.
> 
> It would be nice if pgstat_get_kind_from_str() could be used in
> pg_stat_reset_shared() to avoid having to remember to change both.

It's hard to make that work, because of the historical behaviour of that
function :(


> Also:
> - Since recovery_prefetch doesn't have a statistic kind, it doesn't fit
>   well into this paradigm

I think that needs a rework anyway - it went in at about the same time as the
shared mem stats patch, so it doesn't quite cohere.


> On a separate note, should we be setting have_[io/slru/etc]stats to
> false in the reset all functions?

That'd not work reliably, because other backends won't do the same. I don't
see a benefit in doing it differently in the local connection than the other
connections.


> > > +typedef struct PgStat_BackendIO
> > > +{
> > > + PgStat_Counter data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > +} PgStat_BackendIO;
> >
> > Would it bother you if we swapped the order of iocontext and iobject here and
> > related places? It makes more sense to me semantically, and should now be
> > pretty easy, code wise.
> 
> So, thinking about this I started noticing inconsistencies in other
> areas around this order:
> For example: ordering of objects mentioned in commit messages and comments,
> ordering of parameters (like in pgstat_count_io_op() [currently in
> reverse order]).
> 
> I think we should make a final decision about this ordering and then
> make everywhere consistent (including ordering in the view).
> 
> Currently the order is:
> BackendType
>   IOContext
> IOObject
>   IOOp
> 
> You are suggesting this order:
> BackendType
>   IOObject
> IOContext
>   IOOp
> 
> Could you explain what you find more natural about this ordering (as I
> find the other more natural)?

The object we're performing IO on determines more things than the context. So
it just seems like the natural hierarchical fit. The context is a sub-category
of the object. Consider how it'll look like if we also have objects for 'wal',
'temp files'. It'll make sense to group by just the object, but it won't make
sense to group by just the context.

If it were trivial to do I'd use a different IOContext for each IOObject. But
it'd make it much harder. So there'll just be a bunch of values of IOContext
that'll only be used for one or a subset of the IOObjects.


The reason to put BackendType at the top is pragmatic - one backend is of a
single type, but can do IO for all kinds of objects/contexts. So any other
hierarchy would make the locking etc much harder.


> This is one possible natural sentence with these objects:
> 
> During COPY, a client backend may read in data from a permanent
> relation.
> This order is:
> IOContext
>   BackendType
> IOOp
>   IOObject
> 
> I think English sentences are often structured subject, verb, object --
> but in our case, we have an extra thing that doesn't fit neatly
> (IOContext).

"..., to avoid polluting the buffer cache it uses the bulk (read|write)
strategy".


> Also, IOOp in a sentence would be in the middle (as the
> verb). I made it last because a) it feels like the smallest unit b) it
> would make the code a lot more annoying if it wasn't last.

Yea, I think pragmatically that is the right choice.



> > > Subject: [PATCH v47 3/5] pgstat: Count IO for relations
> >
> > Nearly happy with this now. See one minor nit below.
> >
> > I don't love the counting in register_dirty_segment() and mdsyncfiletag(), but
> > I don't have a better idea, and it doesn't seem too horrible.
> 
> You don't like it because such things shouldn't be in md.c -- since we
> went to the trouble of having function pointers and making it general?

It's more of a gut feeling than well reasoned ;)



> > > +-- Change the tablespace so that the table is rewritten directly, then SELECT
> > > +-- from it to cause it to be read back into shared buffers.
> > > +SET allow_in_place_tablespaces = true;
> > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> >
> > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > checkpoints done as part of CREATE/DROP TABLESPACE?
> >
> > Or, at least combine this with the CHECKPOINTs above?
> 
> I see a checkpoint is requested when dropping the tablespace if not all
> the files in it are deleted. It seems like if the DROP TABLE for the
> permanent table is before the explicit checkpoints in the test, then the
> DROP TABLESPACE will not cause an additional checkpoint.

Unfortunately, that'

Re: Refactor recordExtObjInitPriv()

2023-01-17 Thread Peter Eisentraut

On 16.01.23 23:43, Nathan Bossart wrote:

On Mon, Jan 16, 2023 at 12:01:47PM +0100, Peter Eisentraut wrote:

I have updated the patch as you suggested and split out the aggregate issue
into a separate patch for clarity.


LGTM


committed





Re: constify arguments of copy_file() and copydir()

2023-01-17 Thread Peter Eisentraut

On 17.01.23 07:24, Michael Paquier wrote:

On Mon, Jan 16, 2023 at 10:53:40AM +0900, Michael Paquier wrote:

+1.


While I don't forget about this thread..  Any objections if I were to
apply that?


Looks good to me.





[PATCH] Teach planner to further optimize sort in distinct

2023-01-17 Thread Ankit Kumar Pandey
Hi, this is an extension of the `teach planner to evaluate multiple windows in
the optimal order` work, applied to the distinct operation.


Based on the earlier discussion
(https://www.postgresql.org/message-id/flat/CAApHDvr7rSCVXzGfVa1L9pLpkKj6-s8NynK8o%2B98X9sKjejnQQ%40mail.gmail.com#e01327a3053d9281c40f281ef7105b42),


> All I imagine you need to do for it
> is to invent a function in pathkeys.c which is along the lines of what
> pathkeys_count_contained_in() does, but returns a List of pathkeys
> which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
> that does not exist as a pathkey in keys1. In
> create_final_distinct_paths(), you can then perform an incremental
> sort on any input_path which has a non-empty return list and in
> create_incremental_sort_path(), you'll pass presorted_keys as the
> number of pathkeys in the path, and the required pathkeys the
> input_path->pathkeys + the pathkeys returned from the new function.


There is a bit of confusion in the wording here:

"returns a List of pathkeys
which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
that does not exist as a pathkey in keys1."

You mean extract the common keys, regardless of ordering, right?

Example: keys1 = (a,b,c), keys2 = (b,a)

returns (a,b)

and

keys1 = (a,b,c), keys = (d)

returns = ()

which translates to

needed_pathkeys = (a,b,c) = key2

input_pathkeys = (b,a) key1

returns (b,a) = common_keys

new needed_pathkeys = unique(common_keys + old needed_pathkeys)

=> new needed_pathkeys = (b,a,c)

The new needed_pathkeys matches input_pathkeys.

This is what I implemented in the patch.


The patched version yields the following plans:

set enable_hashagg=0;
set enable_seqscan=0;

explain (costs off) select distinct relname,relkind,count(*) over (partition by relkind) from pg_Class;
   QUERY PLAN
-
 Unique
   ->  Incremental Sort
 Sort Key: relkind, relname, (count(*) OVER (?))
 Presorted Key: relkind
 ->  WindowAgg
   ->  Sort
 Sort Key: relkind
 ->  Seq Scan on pg_class
(8 rows)

explain (costs off) select distinct a, b, count(*) over (partition by b, a) from abcd;

   QUERY PLAN

 Unique
   ->  Incremental Sort
 Sort Key: b, a, (count(*) OVER (?))
 Presorted Key: b, a
 ->  WindowAgg
   ->  Incremental Sort
 Sort Key: b, a
 Presorted Key: b
 ->  Index Scan using b_idx on abcd
(9 rows)

explain (costs off) select distinct a, b, count(*) over (partition by c, d) from abcd;

   QUERY PLAN

 Unique
   ->  Sort
 Sort Key: a, b, (count(*) OVER (?))
 ->  WindowAgg
   ->  Incremental Sort
 Sort Key: c, d
 Presorted Key: c
 ->  Index Scan using c_idx on abcd
(8 rows)


The issue with index paths still remains: pathkeys get purged by truncate_useless_pathkeys


and hence are not available in create_final_distinct_paths for the above optimizations.



I have attached a patch for the reference.


Thanks,

Ankit

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 609df93dc9..13f6006577 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -1968,3 +1968,32 @@ has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel)
 		return true;			/* might be able to use them for ordering */
 	return false;/* definitely useless */
 }
+
+/*
+ * extract_common_pathkeys
+ *		Return a List of the pathkeys in keys1 that also appear in keys2,
+ *		preserving the order of keys1, or NIL if the two lists have no
+ *		pathkeys in common.
+ */
+List *
+extract_common_pathkeys(List* keys1, List *keys2)
+{
+	List *new_pk = NIL;
+	ListCell	*l1;
+	ListCell	*l2;
+	foreach(l1, keys1)
+	{
+		PathKey*pathkey1 = (PathKey *) lfirst(l1);
+		foreach(l2, keys2)
+		{
+			PathKey*pathkey2 = (PathKey *) lfirst(l2);
+			if (pathkey1 == pathkey2)
+			{
+new_pk = lappend(new_pk, pathkey1);
+break;
+			}
+		}
+	}
+	return new_pk;
+
+}
\ No newline at end of file
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 044fb24666..1802d28e75 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4844,11 +4844,28 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
 			Path	   *sorted_path;
 			bool		is_sorted;
 			int			presorted_keys;
+			List		*common_keys;
 
 			is_sorted = pathkeys_count_contained_in(needed_pathkeys,
 	input_path->pathkeys,
 	&presorted_keys);
 
+			/*
+			 * Check if there are common pathkeys (regardless of ordering)
+			 */
+			common_keys = extract_common_pathkeys(inp

Re: Remove source code display from \df+?

2023-01-17 Thread Isaac Morland
On Thu, 12 Jan 2023 at 12:06, Isaac Morland  wrote:

Thanks everybody. So based on the latest discussion I will:
>
> 1) rename the column from “Source code” to “Internal name”; and
> 2) change the contents to NULL except when the language (identified by
> oid) is INTERNAL or C.
>
> Patch forthcoming, I hope.
>

I've attached a patch for this. It turns out to simplify the existing code
in one way because the recently added call to pg_get_function_sqlbody() is
no longer needed since it applies only to SQL functions, which will now
display as a blank column.

I implemented the change and was surprised to see that no tests failed.
Turns out that while there are several tests for \df, there were none for
\df+. I added a couple, just using \df+ on some functions that appear to me
to be present on plain vanilla Postgres.

I was initially concerned about translation support for the column heading,
but it turns out that \dT+ already has a column with the exact same name so
it appears we don’t need any new translations.

I welcome comments and feedback. Now to try to find something manageable to
review.


0001-Remove-source-code-display-from-df.patch
Description: Binary data


Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-17 Thread Tomas Vondra
On 1/9/23 09:44, Ilya Gladyshev wrote:
> On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
>> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
>>> ...
>>>
>>> @committers: Is it okay to add nparts_done to IndexStmt ?
>>
>> Any hint about this ?
>>

AFAIK fields added at the end of a struct are seen as acceptable from the
ABI point of view. It's not risk-free, but we did that multiple times
when fixing bugs, IIRC.

The primary risk is old extensions (built on older minor version)
running on new server, getting confused by new fields (and implied
shifts in the structs). But fields at the end should be safe - the
extension simply ignores the stuff at the end. The one problem would be
arrays of structs, because even a field at the end changes the array
stride. But I don't think we do that with IndexStmt ...

Of course, if the "old" extension itself allocates the struct and passes
it to core code, that might still be an issue, because it'll allocate a
smaller struct, and core might see bogus data at the end.

On the other hand, new extensions on an old server may get confused too,
because they may try setting a field that does not exist.

So ultimately it's about weighing risks vs. benefits - evaluating
whether fixing the issue is actually worth it.

The question is if/how many such extensions messing with IndexStmt in
this way actually exist. That is, allocate IndexStmt (or array of it). I
haven't found any, but maybe some extensions for index or partition
management do it? Not sure.

But ...

Do we actually need the new parts_done field? I mean, we already do
track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
st_progress_param array. Can't we simply read it from there? Then we
would not have ABI issues with the new field added to IndexStmt.

>> This should be resolved before the "CIC on partitioned tables" patch,
>> which I think is otherwise done.
> 
> I suggest that we move on with the IndexStmt patch and see what the
> committers have to say about it. I have brushed the patch up a bit,
> fixing TODOs and adding docs as per our discussion above.
> 

I did take a look at the patch, so here are my 2c:

1) num_leaf_partitions says it's "excluding foreign tables" but then it
uses RELKIND_HAS_STORAGE() which excludes various other relkinds, e.g.
partitioned tables etc. Minor, but perhaps a bit confusing.

2) I'd probably say count_leaf_partitions() instead.

3) The new part in DefineIndex counting leaf partitions should have a
comment before

if (!OidIsValid(parentIndexId))
{ ... }

4) It's a bit confusing that one of the branches in DefineIndex just
sets stmt->parts_done without calling pgstat_progress_update_param
(while the other one does both). AFAICS the call is not needed because
we already updated it during the recursive DefineIndex call, but maybe
the comment should mention that?


As for the earlier discussion about the "correct" behavior for leaf vs.
non-leaf partitions and whether to calculate partitions in advance:

* I agree it's desirable to count partitions in advance, instead of
adding incrementally. The view is meant to provide "overview" of the
CREATE INDEX progress, and imagine you get

  partitions_total partitions_done
10   9

so that you believe you're ~90% done. But then it jumps to the next
child and now you get

  partitions_total partitions_done
20  10

which makes the view a bit useless for it's primary purpose, IMHO.


* I don't care very much about leaf vs. non-leaf partitions. If we
exclude non-leaf ones, fine with me. But the number of non-leaf ones
should be much smaller than leaf ones, and if the partition already has
a matching index that distorts the tracking too. Furthermore the
partitions may have different size etc. so the progress is only
approximate anyway.

I wonder if we could improve this to track the size of partitions for
total/done? That'd make leaf/non-leaf distinction unnecessary, because
non-leaf partitions have size 0.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Tomas Vondra
On 1/17/23 19:46, Andres Freund wrote:
> Hi,
> 
> On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:
>> On 1/17/23 18:02, Andres Freund wrote:
>>> On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:
 That also does not have issues with timestamp "rounding" - considering
 e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
 possible the node completes within 1ms, in which case

(now - last_sample)

 ends up being 0 (assuming I correctly understand the code).
>>>
>>> That part should be counting in nanoseconds, I think? Unless I misunderstand
>>> something?
>>>
>>
>> The higher precision does not help, because both values come from the
>> *sampled* timestamp (i.e. the one updated from the signal handler). So
>> if the node happens to execute between two signals, the values are going
>> to be the same, and the difference is 0.
> 
> In that case there simply wasn't any sample for the node, and a non-timestamp
> based sample counter wouldn't do anything different?
> 

Yeah, you're right.

> If you're worried about the case where a timer does fire during execution of
> the node, but exactly once, that should provide a difference between the last
> sampled timestamp and the current time. It'll attribute a bit too much to the
> in-progress nodes, but well, that's sampling for you.
> 
> 
> I think a "hybrid" explain mode might be worth thinking about. Use the
> "current" sampling method for the first execution of a node, and for the first
> few milliseconds of a query (or perhaps the first few timestamp
> acquisitions). That provides an accurate explain analyze for short queries,
> without a significant slowdown. Then switch to sampling, which provides decent
> attribution for a bit longer running queries.
> 

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and
then measures those accurately), while these patches sample timestamps.

> 
> 
> > I've been thinking that we should consider making more of the instrumentation
> > code work like that. The amount of work we're doing in InstrStart/StopNode()
> > has steadily risen. When buffer usage and WAL usage are enabled, we're
> > executing over 1000 instructions! And each single Instrumentation node is ~450
> > bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
> > embedded.
>
> If we instead have InstrStartNode() set up a global pointer to the
> Instrumentation node, we can make the instrumentation code modify both the
> "global" counters (pgBufferUsage, pgWalUsage) and, if set,
> current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
> changes - right now nodes "automatically" include the IO/WAL incurred in child
> nodes, but that's just a small bit of additional summing-up to be done during
> EXPLAIN.
>

 That's certainly one way to implement that. I wonder if we could make
 that work without the global pointer, but I can't think of any.
>>>
>>> I don't see a realistic way at least. We could pass down an
>>> "InstrumentationContext" through everything that needs to do IO and WAL. But
>>> that seems infeasible at this point.
> 
>> Why infeasible?
> 
> Primarily the scale of the change. We'd have to pass down the context into all
> table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
> which'd require their callers to have access to the context.  That's hundreds
> if not thousands places.
> 
> Adding that many function parameters might turn out to be noticable runtime
> wise, due to increased register movement. I think for a number of the places
> where we currently don't, we ought to use by-reference struct for the
> not-always-used parameters, that then also could contain this context.
> 

OK, I haven't realized we'd have to pass it to that many places.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: CI and test improvements

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 11:35:09 -0600, Justin Pryzby wrote:
> The autoconf system runs all tap tests in t/*.pl, but meson requires
> enumerating them in ./meson.build.

Yes. It was a mistake that we ever used t/*.pl for make. For one, it means
that make can't control concurrency meaningfully, due to the varying number of
tests run with one prove instance. It's also the only thing that tied us to
prove, which is one hell of a buggy mess.


> This checks for and finds no missing tests in the current tree:
>
> $ for pl in `find src contrib -path '*/t/*.pl'`; do base=${pl##*/}; dir=${pl%/*}; meson=${dir%/*}/meson.build; grep "$base" "$meson" >/dev/null || echo "$base is missing from $meson"; done

Likely because I do something similar locally.

# prep
m test --list > /tmp/tests.txt

# Check if all tap tests are known to meson
for f in $(git ls-files|grep -E '(t|test)/.*.pl$'|sort);do t=$(echo $f|sed -E -e 's/^.*\/([^/]*)\/(t|test)\/(.*)\.pl$/\1\/\3/');grep -q -L $t /tmp/tests.txt || echo $f;done


# Check if all regression / isolation tests are known to meson
#
# Expected to find plpgsql due to extra 'src' directory level, src/test/mb
# because it's not run anywhere and sepgsql, because that's not tested yet
for d in $(find ~/src/postgresql -type d \( -name sql -or -name specs \) );do 
t=$(basename $(dirname $d)); grep -q -L $t /tmp/tests.txt || echo $d; done



> However, this finds two real problems and one false-positive with
> missing regress/isolation tests:

Which the above does *not* test for. Good catch.

I'll push the fix for those as soon as tests passed on my personal repo.


> $ for makefile in `find src contrib -name Makefile`; do for testname in `sed -r '/^(REGRESS|ISOLATION) =/!d; s///; :l; /$/{s///; N; b l}; s/\n//g' "$makefile"`; do meson=${makefile%/Makefile}/meson.build; grep -Fw "$testname" "$meson" >/dev/null || echo "$testname is missing from $meson"; done; done
> guc_privs is missing from src/test/modules/unsafe_tests/meson.build

Yep. That got added during the development of the meson port, so it's not too surprising.


> oldextversions is missing from contrib/pg_stat_statements/meson.build

This one however, is odd. Not sure how that happened.


> $(CF_PGP_TESTS) is missing from contrib/pgcrypto/meson.build

Assume that's the false positive?


> I also tried but failed to write something to warn if "meson test" was
> run with a list of tests but without tmp_install.  Help wanted.

That doesn't even catch the worst case - when there's tmp_install, but it's
too old.

The proper solution would be to make the creation of tmp_install a dependency
of the relevant tests. Unfortunately meson still auto-propagates those to
dependencies of the 'all' target (for historical reasons), and creating the
temp install is too slow on some machines to make that tolerable. I think
there's an open PR to change that. Once that's in a released meson version
that's in somewhat widespread use, we should change that.

The other path forward is to allow running the tests without
tmp_install. There's not that much we'd need to allow running directly from
the source tree - the biggest thing is a way to load extensions from a list of
paths. This option is especially attractive because it'd allow to run
individual tests without a fully built sourcetree. No need to build other
binaries when you just want to test psql, or more extremely, pg_test_timing.


> I propose to put something like this into "SanityCheck".

Perhaps we instead could add it as a separate "meson-only" test? Then it'd
fail on developer's machines, instead of later in CI. We could pass the test
information from the 'tests' array, or it could look at the metadata in
meson-info/intro-tests.json

Greetings,

Andres Freund




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Robert Haas
On Tue, Jan 17, 2023 at 1:02 PM Andres Freund  wrote:
> On 2023-01-16 13:58:21 -0800, Peter Geoghegan wrote:
> > On Fri, Jan 13, 2023 at 9:55 PM Andres Freund  wrote:
> > When I express skepticism of very high autovacuum_freeze_max_age
> > settings, it's mostly just that I don't think that age(relfrozenxid)
> > is at all informative in the way that something that triggers
> > autovacuum ought to be. Worst of all, the older relfrozenxid gets, the
> > less informative it becomes. I'm sure that it can be safe to use very
> > high autovacuum_freeze_max_age values, but it seems like a case of
> > using a very complicated and unreliable thing to decide when to
> > VACUUM.
>
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

I think so, at least in my case. The reason why the default
autovacuum_freeze_max_age is 300m is not because going over 300m is a
problem, but because we don't know how long it's going to take to run
all of the vacuums, and we're still going to be consuming XIDs in the
meantime, and we have no idea how fast we're consuming XIDs either.
For all we know it might take days, and we might be burning through
XIDs really quickly, so we might need a ton of headroom.

That's not to say that raising the setting might not be a sensible
thing to do in a context where you know what the workload is. If you
know that the vacuums will complete quickly OR that the XID burn
rate is low, you can raise it drastically and be fine. We just can't
assume that will be true everywhere.
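As a rough illustration of the headroom math: at a hypothetical burn rate of
10,000 XIDs per second, 300 million XIDs of headroom last only about eight
hours, so a vacuum pass that needs a day to finish would blow straight
through it.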

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:
>
> In pg_class:
>
> - The number of all frozen pages, like we do for all-visible
>
>   That'd give us a decent upper bound for the amount of work we need to do to
>   increase relfrozenxid. It's important that this is crash safe (thus no
>   pg_stats), and it only needs to be written when we'd likely make other
>   changes to the pg_class row anyway.

I'm not sure how useful this is because a lot of the work is from
scanning the indexes.

> In pgstats:
>
> - The number of dead items, incremented both by the heap scan and
>   opportunistic pruning
>
>   This would let us estimate how urgently we need to clean up indexes.

I don't think this is true because btree indexes are self-cleaning in
some scenarios and not in others.

> - The xid/mxid horizons during the last vacuum
>
> > - The number of pages with tuples that couldn't be removed due to the horizon
>   during the last vacuum
>
> - The number of pages with tuples that couldn't be frozen

Not bad to know, but if the horizon we could use advances by 1, we
can't tell whether that allows pruning nothing additional or another
billion tuples.

I'm not trying to take the position that XID age is a totally useless
metric. I don't think it is. If XID age is high, you know you have a
problem, and the higher it is, the more urgent that problem is.
Furthemore, if XID is low, you know that you don't have that
particular problem. You might have some other one, but that's OK: this
one metric doesn't have to answer every question to be useful.
However, where XID age really falls down as a metric is that it
doesn't tell you what it's going to take to solve the problem. The
answer, at ten thousand feet, is always vacuum. But how long will that
vacuum run? We don't know. Do we need the XID horizon to advance
first, and if so, how far? We don't know. Do we need auto-cancellation
to be disabled? We don't know. That's where we get into a lot of
trouble here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: almost-super-user problems that we haven't fixed yet

2023-01-17 Thread Robert Haas
On Tue, Jan 17, 2023 at 1:42 PM Nathan Bossart  wrote:
> Alright.  The one design question I have is whether this should be a new
> set of reserved connections or replace superuser_reserved_connections
> entirely.

I think it should definitely be something new, not a replacement.

> If we create a new batch of reserved connections, only roles with
> privileges of pg_use_reserved_connections would be able to connect if the
> number of remaining slots is greater than superuser_reserved_connections
> but less than or equal to superuser_reserved_connections +
> reserved_connections.  Only superusers would be able to connect if the
> number of remaining slots is less than or equal to
> superuser_reserved_connections.  This helps avoid blocking new superuser
> connections even if you've reserved some connections for non-superusers.

This is precisely what I had in mind.

I think the documentation will need some careful wordsmithing,
including adjustments to superuser_reserved_connections. We want to
recast superuser_reserved_connections as a final reserve to be touched
after even reserved_connections has been exhausted.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 10:02 AM Andres Freund  wrote:
> Both you and Robert said this, and I have seen it be true, but typically not
> for large high-throughput OLTP databases, where I found increasing
> relfrozenxid to be important. Sure, there's probably some up/down through the
> day / week, but it's likely to be pretty predictable.
>
> I think the problem is that an old relfrozenxid doesn't tell you how much
> outstanding work there is. Perhaps that's what both of you meant...

That's what I meant, yes.

> I think that's not the fault of relfrozenxid as a trigger, but that we simply
> don't keep enough other stats. We should imo at least keep track of:

If you assume that there is chronic undercounting of dead tuples
(which I think is very common), then of course anything that triggers
vacuuming is going to help with that problem -- it might be totally
inadequate, but still make the critical difference by not allowing the
system to become completely destabilized. I absolutely accept that
users that are relying on that exist, and that those users ought to
not have things get even worse -- I'm pragmatic. But overall, what we
should be doing is fixing the real problem, which is that the dead
tuples accounting is deeply flawed. Actually, it's not just that the
statistics are flat out wrong; the whole model is flat-out wrong.

The assumptions that work well for optimizer statistics quite simply
do not apply here. Random sampling for this is just wrong, because
we're not dealing with something that follows a distribution that can
be characterized with a sufficiently large sample. With optimizer
statistics, the entire contents of the table is itself a sample taken
from the wider world -- so even very stale statistics can work quite
well (assuming that the schema is well normalized). Whereas the
autovacuum dead tuples stuff is characterized by constant change. I
mean of course it is -- that's the whole point! The central limit
theorem obviously just doesn't work for something like this  -- we
cannot generalize from a sample, at all.

I strongly suspect that it still wouldn't be a good model even if the
information was magically always correct. It might actually be worse
in some ways! Most of my arguments against the model are not arguments
against the accuracy of the statistics as such. They're actually
arguments against the fundamental relevance of the information itself,
to the actual problem at hand. We are not interested in information
for its own sake; we're interested in making better decisions about
autovacuum scheduling. Those may only have a very loose relationship.

How many dead heap-only tuples are equivalent to one LP_DEAD item?
What about page-level concentrations, and the implication for
line-pointer bloat? I don't have a good answer to any of these
questions myself. And I have my doubts that there are *any* good
answers. Even these questions are the wrong questions (they're just
less wrong). Fundamentally, we're deciding when the next autovacuum
should run against each table. Presumably it's going to have to happen
some time, and when it does happen it happens to the table as a whole.
And with a larger table it probably doesn't matter if it happens +/- a
few hours from some theoretical optimal time. Doesn't it boil down to
that?

If we taught the system to do the autovacuum work early because it's a
relatively good time for it from a system level point of view (e.g.
it's going to be less disruptive right now), that would be useful and
easy to justify on its own terms. But it would also tend to make the
system much less vulnerable to undercounting dead tuples, since in
practice there'd be a decent chance of getting to them early enough
that it at least wasn't extremely painful any one time. It's much
easier to understand that the system is quiescent than it is to
understand bloat.

BTW, I think that the insert-driven autovacuum stuff added to 13 has
made the situation with bloat significantly better. Of course it
wasn't really designed to do that at all, but it still has, kind of by
accident, in roughly the same way that antiwraparound autovacuums help
with bloat by accident. So maybe we should embrace "happy accidents"
like that a bit more. It doesn't necessarily matter if we do the right
thing for a reason that turns out to have not been the best reason.
I'm certainly not opposed to it, despite my complaints about relying
on age(relfrozenxid).

> In pgstats:
> (Various stats)

Overall, what I like about your ideas here is the emphasis on bounding
the worst case, and the emphasis on the picture at the page level over
the tuple level.

I'd like to use the visibility map more for stuff here, too. It is
totally accurate about all-visible/all-frozen pages, so many of my
complaints about statistics don't really apply. Or need not apply, at
least. If 95% of a table's pages are all-frozen in the VM, then of
course it's pretty unlikely to be the right time to VACUUM the table
if it's to clean up bloat

Re: cutting down the TODO list thread

2023-01-17 Thread Bruce Momjian
On Wed, Jan 11, 2023 at 02:09:56PM +0700, John Naylor wrote:
> I've come up with some revised language, including s/15/16/ and removing the
> category of "[E]" (easier to implement), since it wouldn't be here if it were
> actually easy:

I think it is still possible for a simple item to be identified as
wanted and easy, but not completed and put on the TODO list.

> WARNING for Developers: This list contains some known PostgreSQL bugs, some
> feature requests, and some things we are not even sure we want. This is not
> meant to be a resource for beginning developers to get ideas for things to 
> work
> on. 
> 
> All of these items are hard, and some are perhaps impossible. Some of these
> items might have become unnecessary since they were added. Others might be
> desirable but:
> 
> - a large amount of work is required
> - the problems are subtler than they might appear
> - the desirable semantics aren't clear
> - there are tradeoffs that there's not consensus about
> - some combinations of the above
> 
> If you really need a feature that is listed below, it will be worth reading the
> linked email thread if there is one, since it will often show the difficulties,
> or perhaps contain previous failed attempts to get a patch committed. If after
> that you still want to work on it, be prepared to first discuss the value of
> the feature. Do not assume that you can start coding and expect it to be
> committed. Always discuss design on the Hackers list before starting to code.
> 
> Over time, it may become clear that a TODO item has become outdated or
> otherwise determined to be either too controversial or not worth the
> development effort. Such items should be retired to the Not Worth Doing page.
> 
> [D] marks changes that are done, and will appear in the PostgreSQL 16 release.

I think we risk overloading people with too many words above, and they
will not read it fully.  Can it be simplified?  I wonder if some of this
belongs in the developer's FAQ and could be linked to from the TODO list.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: cutting down the TODO list thread

2023-01-17 Thread Bruce Momjian
On Mon, Jan 16, 2023 at 05:17:23PM +0700, John Naylor wrote:
> 
> I wrote:
> 
> > We could also revise the developer FAQ:
> > - remove phrase "Outstanding features are detailed in Todo."
> > - add suggestion to check the Todo or Not_worth_doing pages to see if the
> desired feature is undesirable or problematic
> > - rephrase "Working in isolation is not advisable because others might be
> working on the same TODO item, or you might have misunderstood the TODO item."
> so it doesn't mention 'TODO' at all.
> 
> There is also
> 
> https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F#TODOs
> 
> Changing the description of what links to the todo list will probably do more
> to reduce confusion than language in the todo list itself.

Agreed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Andres Freund
Hi,

On 2023-01-17 14:57:27 -0500, Robert Haas wrote:
> > In pg_class:
> >
> > - The number of all frozen pages, like we do for all-visible
> >
> >   That'd give us a decent upper bound for the amount of work we need to do to
> >   increase relfrozenxid. It's important that this is crash safe (thus no
> >   pg_stats), and it only needs to be written when we'd likely make other
> >   changes to the pg_class row anyway.
> 
> I'm not sure how useful this is because a lot of the work is from
> scanning the indexes.

We can increase relfrozenxid before the index scans, unless we ran out of dead
tuple space. We already have code for that in failsafe mode, in some way. But
we really also should increase


> > In pgstats:
> >
> > - The number of dead items, incremented both by the heap scan and
> >   opportunistic pruning
> >
> >   This would let us estimate how urgently we need to clean up indexes.
> 
> I don't think this is true because btree indexes are self-cleaning in
> some scenarios and not in others.

I mainly meant it from the angle of whether we need to clean up dead items in the
heap to keep the table from bloating because we stop using those pages -
which requires index scans. But even for the index scan portion, it'd give us
a better bound than we have today.

We probably should track the number of killed tuples in indexes.


> > - The xid/mxid horizons during the last vacuum
> >
> > - The number of pages with tuples that couldn't be removed due to the horizon
> >   during the last vacuum
> >
> > - The number of pages with tuples that couldn't be frozen
> 
> Not bad to know, but if the horizon we could use advances by 1, we
> can't tell whether that allows pruning nothing additional or another
> billion tuples.

Sure. But it'd be a lot better than scanning it again and again when nothing
has changed because something still holds back the horizon. We could improve upon
it later by tracking the average or even bins of ages.


> However, where XID age really falls down as a metric is that it
> doesn't tell you what it's going to take to solve the problem. The
> answer, at ten thousand feet, is always vacuum. But how long will that
> vacuum run? We don't know. Do we need the XID horizon to advance
> first, and if so, how far? We don't know. Do we need auto-cancellation
> to be disabled? We don't know. That's where we get into a lot of
> trouble here.

Agreed. I think the metrics I proposed would help some, by at least providing
sensible upper boundaries (for work) and minimal requirements (horizon during
last vacuum).

Greetings,

Andres Freund




Re: document the need to analyze partitioned tables

2023-01-17 Thread Bruce Momjian
On Thu, Jan 12, 2023 at 03:27:47PM -0800, Nathan Bossart wrote:
> On Wed, Oct 05, 2022 at 10:37:01AM +0200, Laurenz Albe wrote:
> > On Mon, 2022-03-28 at 15:05 +0200, Tomas Vondra wrote:
> >> I've pushed the last version, and backpatched it to 10 (not sure I'd
> >> call it a bugfix, but I certainly agree with Justin it's worth
> >> mentioning in the docs, even on older branches).
> > 
> > I'd like to suggest an improvement to this.  The current wording could
> > be read to mean that dead tuples won't get cleaned up in partitioned tables.
> 
> Well, dead tuples won't get cleaned up in partitioned tables, as
> partitioned tables do not have storage.  But I see what you mean.  Readers
> might misinterpret this to mean that autovacuum will not process the
> partitions.  There's a good definition of what the docs mean by
> "partitioned table" [0], but FWIW it took me some time before I
> consistently read "partitioned table" to mean "only the thing with relkind
> set to 'p'" and not "both the partitioned table and its partitions."  So,
> while the current wording is technically correct, I think it'd be
> reasonable to expand it to help avoid confusion.
> 
> Here is my take on the wording:
> 
>   Since all the data for a partitioned table is stored in its partitions,
>   autovacuum does not process partitioned tables.  Instead, autovacuum
>   processes the individual partitions that are regular tables.  This
>   means that autovacuum only gathers statistics for the regular tables
>   that serve as partitions and not for the partitioned tables.  Since
>   queries may rely on a partitioned table's statistics, you should
>   collect statistics via the ANALYZE command when it is first populated,
>   and again whenever the distribution of data in its partitions changes
>   significantly.

Uh, what about autovacuum's handling of partitioned tables?  This makes
it sound like it ignores them because it talks about manual ANALYZE.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-17 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 08:44:36PM +0100, Tomas Vondra wrote:
> On 1/9/23 09:44, Ilya Gladyshev wrote:
> > On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
> >> On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> >>> ...
> >>>
> >>> @committers: Is it okay to add nparts_done to IndexStmt ?
> >>
> >> Any hint about this ?
> >>
> 
> AFAIK fields added at the end of a struct is seen as acceptable from the
> ABI point of view. It's not risk-free, but we did that multiple times
> when fixing bugs, IIRC.

My question isn't whether it's okay to add a field at the end of a
struct in general, but rather whether it's acceptable to add this field
at the end of this struct, and not because it's unsafe to do in a minor
release, but whether someone is going to say that it's an abuse of the
data structure.

> Do we actually need the new parts_done field? I mean, we already do
> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
> st_progress_param array. Can't we simply read it from there? Then we
> would not have ABI issues with the new field added to IndexStmt.

Good idea to try.

> As for the earlier discussion about the "correct" behavior for leaf vs.
> non-leaf partitions and whether to calculate partitions in advance:
> 
> * I agree it's desirable to count partitions in advance, instead of
> adding incrementally. The view is meant to provide "overview" of the
> CREATE INDEX progress, and imagine you get
> 
>   partitions_total partitions_done
> 10   9
> 
> so that you believe you're ~90% done. But then it jumps to the next
> child and now you get
> 
>   partitions_total partitions_done
> 20  10
> 
> which makes the view a bit useless for its primary purpose, IMHO.

To be clear, that's the current, buggy behavior, and this thread is
about fixing it.  The proposed patches all ought to avoid that.

But the bug isn't caused by not "calculating partitions in advance".
Rather, the issue is that currently, the "total" is overwritten while
recursing.

That's a separate question from whether indirect partitions are counted
or not.
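
(For reference, the counters under discussion are the ones exposed in the
progress view; a minimal sketch of the query a user would be watching while
CREATE INDEX runs on a partitioned table:)

SELECT pid, relid::regclass AS rel, phase,
       partitions_total, partitions_done
  FROM pg_stat_progress_create_index;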

> I wonder if we could improve this to track the size of partitions for
> total/done? That'd make leaf/non-leaf distinction unnecessary, because
> non-leaf partitions have size 0.

Maybe, but it's out of scope for this patch.

Thanks for looking.

-- 
Justin




Re: document the need to analyze partitioned tables

2023-01-17 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 03:53:24PM -0500, Bruce Momjian wrote:
> On Thu, Jan 12, 2023 at 03:27:47PM -0800, Nathan Bossart wrote:
> > On Wed, Oct 05, 2022 at 10:37:01AM +0200, Laurenz Albe wrote:
> > > On Mon, 2022-03-28 at 15:05 +0200, Tomas Vondra wrote:
> > >> I've pushed the last version, and backpatched it to 10 (not sure I'd
> > >> call it a bugfix, but I certainly agree with Justin it's worth
> > >> mentioning in the docs, even on older branches).
> > > 
> > > I'd like to suggest an improvement to this.  The current wording could
> > > be read to mean that dead tuples won't get cleaned up in partitioned 
> > > tables.
> > 
> > Well, dead tuples won't get cleaned up in partitioned tables, as
> > partitioned tables do not have storage.  But I see what you mean.  Readers
> > might misinterpret this to mean that autovacuum will not process the
> > partitions.  There's a good definition of what the docs mean by
> > "partitioned table" [0], but FWIW it took me some time before I
> > consistently read "partitioned table" to mean "only the thing with relkind
> > set to 'p'" and not "both the partitioned table and its partitions."  So,
> > while the current wording is technically correct, I think it'd be
> > reasonable to expand it to help avoid confusion.
> > 
> > Here is my take on the wording:
> > 
> > Since all the data for a partitioned table is stored in its partitions,
> > autovacuum does not process partitioned tables.  Instead, autovacuum
> > processes the individual partitions that are regular tables.  This
> > means that autovacuum only gathers statistics for the regular tables
> > that serve as partitions and not for the partitioned tables.  Since
> > queries may rely on a partitioned table's statistics, you should
> > collect statistics via the ANALYZE command when it is first populated,
> > and again whenever the distribution of data in its partitions changes
> > significantly.
> 
> Uh, what about autovacuum's handling of partitioned tables?  This makes
> it sound like it ignores them because it talks about manual ANALYZE.

If we're referring to the *partitioned* tables themselves, then it does ignore them.
See:

|commit 6f8127b7390119c21479f5ce495b7d2168930e82
|Author: Alvaro Herrera 
|Date:   Mon Aug 16 17:27:52 2021 -0400
|
|Revert analyze support for partitioned tables

Maybe (all?) the clarification the docs need is to say:
"Partitioned tables are not *themselves* processed by autovacuum."

-- 
Justin




Re: Extracting cross-version-upgrade knowledge from buildfarm client

2023-01-17 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-01-16 Mo 21:58, Tom Lane wrote:
>> I dunno if we want to stretch buildfarm owners' patience with yet
>> another BF client release right now.  On the other hand, I'm antsy
>> to see if we can un-revert 1b4d280ea after doing a little more
>> work in AdjustUpgrade.pm.

> I think the next step is to push the buildfarm client changes, and
> update those three animals to use it, and make sure nothing breaks. I'll
> go and do those things now. Then you should be able to try your unrevert.

It looks like unrevert will require ~130 lines in AdjustUpgrade.pm,
which is not great but not awful either.  I think this is ready to
go once you've vetted your remaining buildfarm animals.

regards, tom lane

diff --git a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
index 7cf4ced392..5bed1d6839 100644
--- a/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
+++ b/src/test/perl/PostgreSQL/Test/AdjustUpgrade.pm
@@ -268,6 +268,12 @@ sub adjust_old_dumpfile
 	# Version comments will certainly not match.
 	$dump =~ s/^-- Dumped from database version.*\n//mg;
 
+	if ($old_version < 16)
+	{
+		# Fix up some view queries that no longer require table-qualification.
+		$dump = _mash_view_qualifiers($dump);
+	}
+
 	if ($old_version >= 14 && $old_version < 16)
 	{
 		# Fix up some privilege-set discrepancies.
@@ -396,6 +402,133 @@ sub adjust_old_dumpfile
 	return $dump;
 }
 
+
+# Data for _mash_view_qualifiers
+my @_unused_view_qualifiers = (
+	# Present at least since 9.2
+	{ obj => 'VIEW public.trigger_test_view',  qual => 'trigger_test' },
+	{ obj => 'VIEW public.domview',qual => 'domtab' },
+	{ obj => 'VIEW public.my_property_normal', qual => 'customer' },
+	{ obj => 'VIEW public.my_property_secure', qual => 'customer' },
+	{ obj => 'VIEW public.pfield_v1',  qual => 'pf' },
+	{ obj => 'VIEW public.rtest_v1',   qual => 'rtest_t1' },
+	{ obj => 'VIEW public.rtest_vview1',   qual => 'x' },
+	{ obj => 'VIEW public.rtest_vview2',   qual => 'rtest_view1' },
+	{ obj => 'VIEW public.rtest_vview3',   qual => 'x' },
+	{ obj => 'VIEW public.rtest_vview5',   qual => 'rtest_view1' },
+	{ obj => 'VIEW public.shoelace_obsolete',  qual => 'shoelace' },
+	{ obj => 'VIEW public.shoelace_candelete', qual => 'shoelace_obsolete' },
+	{ obj => 'VIEW public.toyemp', qual => 'emp' },
+	{ obj => 'VIEW public.xmlview4',   qual => 'emp' },
+	# Since 9.3 (some of these were removed in 9.6)
+	{ obj => 'VIEW public.tv', qual => 't' },
+	{ obj => 'MATERIALIZED VIEW mvschema.tvm', qual => 'tv' },
+	{ obj => 'VIEW public.tvv',qual => 'tv' },
+	{ obj => 'MATERIALIZED VIEW public.tvvm',  qual => 'tvv' },
+	{ obj => 'VIEW public.tvvmv',  qual => 'tvvm' },
+	{ obj => 'MATERIALIZED VIEW public.bb',qual => 'tvvmv' },
+	{ obj => 'VIEW public.nums',   qual => 'nums' },
+	{ obj => 'VIEW public.sums_1_100', qual => 't' },
+	{ obj => 'MATERIALIZED VIEW public.tm',qual => 't' },
+	{ obj => 'MATERIALIZED VIEW public.tmm',   qual => 'tm' },
+	{ obj => 'MATERIALIZED VIEW public.tvmm',  qual => 'tvm' },
+	# Since 9.4
+	{
+		obj  => 'MATERIALIZED VIEW public.citext_matview',
+		qual => 'citext_table'
+	},
+	{
+		obj  => 'OR REPLACE VIEW public.key_dependent_view',
+		qual => 'view_base_table'
+	},
+	{
+		obj  => 'OR REPLACE VIEW public.key_dependent_view_no_cols',
+		qual => 'view_base_table'
+	},
+	# Since 9.5
+	{
+		obj  => 'VIEW public.dummy_seclabel_view1',
+		qual => 'dummy_seclabel_tbl2'
+	},
+	{ obj => 'VIEW public.vv',  qual => 'test_tablesample' },
+	{ obj => 'VIEW public.test_tablesample_v1', qual => 'test_tablesample' },
+	{ obj => 'VIEW public.test_tablesample_v2', qual => 'test_tablesample' },
+	# Since 9.6
+	{
+		obj  => 'MATERIALIZED VIEW public.test_pg_dump_mv1',
+		qual => 'test_pg_dump_t1'
+	},
+	{ obj => 'VIEW public.test_pg_dump_v1', qual => 'test_pg_dump_t1' },
+	{ obj => 'VIEW public.mvtest_tv',   qual => 'mvtest_t' },
+	{
+		obj  => 'MATERIALIZED VIEW mvtest_mvschema.mvtest_tvm',
+		qual => 'mvtest_tv'
+	},
+	{ obj => 'VIEW public.mvtest_tvv',   qual => 'mvtest_tv' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tvvm', qual => 'mvtest_tvv' },
+	{ obj => 'VIEW public.mvtest_tvvmv', qual => 'mvtest_tvvm' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_bb',   qual => 'mvtest_tvvmv' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tm',   qual => 'mvtest_t' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tmm',  qual => 'mvtest_tm' },
+	{ obj => 'MATERIALIZED VIEW public.mvtest_tvmm', qual => 'mvtest_tvm' },
+	# Since 10 (some removed in 12)
+	{ obj => 'VIEW public.itestv10',  qual => 'itest10' },
+	{ obj => 'VIEW public.itestv11',  qual => 'itest11' },
+	{ obj => 'VIEW public.xmltableview2', qual => '"xmltable"' },
+	# Since 12
+	{
+		obj  => 'MATERIALIZED VIEW public.tableam_tblmv_

Re: document the need to analyze partitioned tables

2023-01-17 Thread Bruce Momjian
On Tue, Jan 17, 2023 at 03:00:50PM -0600, Justin Pryzby wrote:
> On Tue, Jan 17, 2023 at 03:53:24PM -0500, Bruce Momjian wrote:
> > On Thu, Jan 12, 2023 at 03:27:47PM -0800, Nathan Bossart wrote:
> > > Here is my take on the wording:
> > > 
> > >   Since all the data for a partitioned table is stored in its partitions,
> > >   autovacuum does not process partitioned tables.  Instead, autovacuum
> > >   processes the individual partitions that are regular tables.  This
> > >   means that autovacuum only gathers statistics for the regular tables
> > >   that serve as partitions and not for the partitioned tables.  Since
> > >   queries may rely on a partitioned table's statistics, you should
> > >   collect statistics via the ANALYZE command when it is first populated,
> > >   and again whenever the distribution of data in its partitions changes
> > >   significantly.
> > 
> > Uh, what about autovacuum's handling of partitioned tables?  This makes
> > it sound like it ignores them because it talks about manual ANALYZE.
> 
> If we're referring to the *partitioned* table, then it does ignore them.
> See:
> 
> |commit 6f8127b7390119c21479f5ce495b7d2168930e82
> |Author: Alvaro Herrera 
> |Date:   Mon Aug 16 17:27:52 2021 -0400
> |
> |Revert analyze support for partitioned tables

Yes, I see that patch was trying to combine the statistics of individual
partitions into a partitioned table summary.

> Maybe (all?) the clarification the docs need is to say:
> "Partitioned tables are not *themselves* processed by autovacuum."

Yes, I think the lack of autovacuum needs to be specifically mentioned
since most people assume autovacuum handles _all_ statistics updating.

Can someone summarize how bad it is that we have no statistics on partitioned
tables?  It sounds bad to me.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-01-17 Thread Melanie Plageman
v49 attached

On Tue, Jan 17, 2023 at 2:12 PM Andres Freund  wrote:
> On 2023-01-17 12:22:14 -0500, Melanie Plageman wrote:
>
> > > > +typedef struct PgStat_BackendIO
> > > > +{
> > > > + PgStat_Counter 
> > > > data[IOCONTEXT_NUM_TYPES][IOOBJECT_NUM_TYPES][IOOP_NUM_TYPES];
> > > > +} PgStat_BackendIO;
> > >
> > > Would it bother you if we swapped the order of iocontext and iobject here 
> > > and
> > > related places? It makes more sense to me semantically, and should now be
> > > pretty easy, code wise.
> >
> > So, thinking about this I started noticing inconsistencies in other
> > areas around this order:
> > For example: ordering of objects mentioned in commit messages and comments,
> > ordering of parameters (like in pgstat_count_io_op() [currently in
> > reverse order]).
> >
> > I think we should make a final decision about this ordering and then
> > make everywhere consistent (including ordering in the view).
> >
> > Currently the order is:
> > BackendType
> >   IOContext
> > IOObject
> >   IOOp
> >
> > You are suggesting this order:
> > BackendType
> >   IOObject
> > IOContext
> >   IOOp
> >
> > Could you explain what you find more natural about this ordering (as I
> > find the other more natural)?
>
> The object we're performing IO on determines more things than the context. So
> it just seems like the natural hierarchical fit. The context is a sub-category
> of the object. Consider how it'll look if we also have objects for 'wal',
> 'temp files'. It'll make sense to group by just the object, but it won't make
> sense to group by just the context.
>
> If it were trivial to do I'd use a different IOContext for each IOObject. But
> it'd make it much harder. So there'll just be a bunch of values of IOContext
> that'll only be used for one or a subset of the IOObjects.
>
>
> The reason to put BackendType at the top is pragmatic - one backend is of a
> single type, but can do IO for all kinds of objects/contexts. So any other
> hierarchy would make the locking etc much harder.
>
>
> > This is one possible natural sentence with these objects:
> >
> > During COPY, a client backend may read in data from a permanent
> > relation.
> > This order is:
> > IOContext
> >   BackendType
> > IOOp
> >   IOObject
> >
> > I think English sentences are often structured subject, verb, object --
> > but in our case, we have an extra thing that doesn't fit neatly
> > (IOContext).
>
> "..., to avoid polluting the buffer cache it uses the bulk (read|write)
> strategy".
>
>
> > Also, IOOp in a sentence would be in the middle (as the
> > verb). I made it last because a) it feels like the smallest unit b) it
> > would make the code a lot more annoying if it wasn't last.
>
> Yea, I think pragmatically that is the right choice.

I have changed the order and updated all the places using
PgStat_BktypeIO as well as in all locations in which it should be
ordered for consistency (that I could find in the pass I did) -- e.g.
the view definition, function signatures, comments, commit messages,
etc.
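
To make the grouping argument concrete, this is roughly how I'd expect the view
to be queried (just a sketch -- the view and column names here assume the
current proposal, i.e. something like pg_stat_io with backend_type, object and
context columns):

-- summing over contexts for a given object is meaningful ...
SELECT backend_type, object, sum(reads) AS reads, sum(writes) AS writes
  FROM pg_stat_io
 GROUP BY backend_type, object;
-- ... whereas summing over objects for a given context mixes unrelated IO.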

> > > > +-- Change the tablespace so that the table is rewritten directly, then 
> > > > SELECT
> > > > +-- from it to cause it to be read back into shared buffers.
> > > > +SET allow_in_place_tablespaces = true;
> > > > +CREATE TABLESPACE regress_io_stats_tblspc LOCATION '';
> > >
> > > Perhaps worth doing this in tablespace.sql, to avoid the additional
> > > checkpoints done as part of CREATE/DROP TABLESPACE?
> > >
> > > Or, at least combine this with the CHECKPOINTs above?
> >
> > I see a checkpoint is requested when dropping the tablespace if not all
> > the files in it are deleted. It seems like if the DROP TABLE for the
> > permanent table is before the explicit checkpoints in the test, then the
> > DROP TABLESPACE will not cause an additional checkpoint.
>
> Unfortunately, that's not how it works :(. See the comment above mdunlink():
>
> > * For regular relations, we don't unlink the first segment file of the rel,
> > * but just truncate it to zero length, and record a request to unlink it 
> > after
> > * the next checkpoint.  Additional segments can be unlinked immediately,
> > * however.  Leaving the empty file in place prevents that relfilenumber
> > * from being reused.  The scenario this protects us from is:
> > ...
>
>
> > Is this what you are suggesting? Dropping the temporary table should not
> > have an effect on this.
>
> I was wondering about simply moving that portion of the test to
> tablespace.sql, where we already created a tablespace.
>
>
> An alternative would be to propose splitting tablespace.sql into one portion
> running at the start of parallel_schedule, and one at the end. Historically,
> we needed tablespace.sql to be optional due to causing problems when
> replicating to another instance on the same machine, but now we have
> allow_in_place_tablespaces.

It seems like the best way would be to split up the tablespace test file
as you suggested and drop the tablespace at the e

Re: Update comments in multixact.c

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 1:33 AM shiy.f...@fujitsu.com
 wrote:
> I noticed that commit 5212d447fa updated some comments in multixact.c because
> SLRU truncation for multixacts is performed during VACUUM, instead of
> checkpoint. Should the following comments which mentioned checkpointer be
> changed, too?

Yes, I think so.

-- 
Peter Geoghegan




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Robert Haas
On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan  wrote:
> If you assume that there is chronic undercounting of dead tuples
> (which I think is very common), ...

Why do you think that?

> How many dead heap-only tuples are equivalent to one LP_DEAD item?
> What about page-level concentrations, and the implication for
> line-pointer bloat? I don't have a good answer to any of these
> questions myself.

Seems a bit pessimistic. If we had unlimited resources and all
operations were infinitely fast, the optimal strategy would be to
vacuum after every insert, update, or delete. But in reality, that
would be prohibitively expensive, so we're making a trade-off.
Batching together cleanup for many table modifications reduces the
amortized cost of cleaning up after one such operation very
considerably. That's critical. But if we batch too much together, then
the actual cleanup doesn't happen soon enough to keep us out of
trouble.

If we had an oracle that could provide us with perfect information,
we'd ask it, among other things, how much work will be required to
vacuum right now, and how much benefit would we get out of doing so.
The dead tuple count is related to the first question. It's not a
direct, linear relationship, but it's not completely unrelated,
either. Maybe we could refine the estimates by gathering more or
different statistics than we do now, but ultimately it's always going
to be a trade-off between getting the work done sooner (and thus maybe
preventing table growth or a wraparound shutdown) and being able to do
more work at once (and thus being more efficient). The current system
set of counters predates HOT and the visibility map, so it's not
surprising if needs updating, but if you're argue that the whole
concept is just garbage, I think that's an overreaction.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Doc: add XML ID attributes to and tags.

2023-01-17 Thread Tom Lane
Peter Eisentraut  writes:
> On 12.01.23 00:05, Tom Lane wrote:
>> That reminds me that I was going to suggest fixing the few existing
>> variances from the "use '-' not '_'" policy:
>> 
>> $ grep 'id="[a-zA-Z0-9-]*_' *sgml ref/*sgml
>> config.sgml: > xreflabel="plan_cache_mode">

> should be fixed

>> libpq.sgml:
>> libpq.sgml:
>> libpq.sgml:
>> libpq.sgml:

> I think we can leave these.  They are internally consistent.

>> pgbuffercache.sgml:  

> should be fixed

>> ref/pg_checksums.sgml: 

> pretty bogus

OK, done like that.

regards, tom lane




Re: Rework of collation code, extensibility

2023-01-17 Thread Peter Geoghegan
On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis  wrote:
> The first goal I had was simply that the code was really hard to
> understand and work on, and refactoring was justified to improve the
> situation.
>
> The second goal, which is somewhat dependent on the first goal, is that
> we really need an ability to support multiple ICU libraries, and I
> wanted to do some common groundwork that would be needed for any
> approach we choose there, and provide some hooks to get us there. You
> are right that this goal influenced the first goal.

I don't disagree that it was somewhat independent of the first goal. I
just think that it makes sense to "round up to fully dependent".
Basically it's not independent enough to be worth talking about as an
independent thing, just as a practical matter - it's confusing at the
level of things like the commit message. There is a clear direction
that you're going in here from the start, and your intentions in 0001
do matter to somebody that's just looking at 0001 in isolation. That
is my opinion, at least.

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

> ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
> basic testing and it doesn't seem like it's slower than using the
> length. If passing the length is faster for some reason, it would
> complicate the API because we'd need an entry point that's expecting
> nul-termination and lengths, which is awkward (as Peter E. pointed
> out).

That's good. I'm happy to leave it at that. I was only enquiring.

> I felt it was a little clearer amongst the other code, to a casual
> reader, but I suppose it's a style thing. I will change it if you
> insist.

I certainly won't insist.

> I'd have to expose the pg_locale_t struct, which didn't seem desirable
> to me. Do you think it's enough of a performance concern to be worth
> some ugliness there?

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

-- 
Peter Geoghegan




PGDOCS - sgml linkend using single-quotes

2023-01-17 Thread Peter Smith
Hi,

I happened to notice some examples of SGML linkends that were using
single quotes instead of double quotes.

It didn't seem to be the conventional style because grepping (from
doc/src/sgml folder) showed only a tiny fraction using single quotes.

(single-quotes)
$ grep --include=*.sgml -rn . -e "linkend='" | wc -l
12

(double-quotes)
$ grep --include=*.sgml -rn . -e 'linkend="' | wc -l
5915

~~

PSA patch that makes them all use double quotes.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Replace-linkend-single-quotes-with-double-quotes.patch
Description: Binary data


Re: doc: add missing "id" attributes to extension packaging page

2023-01-17 Thread Karl O. Pinc
On Tue, 17 Jan 2023 19:13:38 +0100
Brar Piening  wrote:

> On 17.01.2023 at 14:12, Karl O. Pinc wrote:
> > If you're not willing to try I am willing to see if I can
> > produce an example to work from.  My XSLT is starting to
> > come back.  
> 
> I'm certainly willing to try but I'd appreciate an example in any
> case.

It's good you asked.  After poking about the XSLT 1.0 spec to refresh
my memory I abandoned the "mode" approach entirely.  The real "right
way" is to use the import mechanism.

I've attached a patch that "wraps" the section.heading template
and adds extra stuff to the HTML output generated by the stock
template.  (example_section_heading_override.patch)

There's a bug.  All that goes into the html is a comment, not
a hoverable link.  But the technique is clear.

On my system (Debian 11, bullseye) I found the URI to import
by looking at:
/usr/share/xml/docbook/stylesheet/docbook-xsl/catalog.xml
(Which is probably the right place to look.)
Ultimately, that file is findable via: /etc/xml/catalog
The "best way" on
Debian is: /usr/share/doc/docbook-xsl/README.gz
In other words, the README that comes with the style sheets.

Supposedly, the href=URLs are really URIs and will be good
no matter what/when.  The XSLT processor should know to look
at the system catalogs and read the imported style sheet
from the local file system.

It might be useful to add --nonet to the xsltproc invocation(s)
in the Makefile(s).  Just in case; to keep from retrieving
stylesheets from the net.  (If the option is not already there.
I didn't look.)

If this is the first time that PG uses the XSLT import mechanism
I imagine that "things could go wrong" depending on what sort
of system is being used to build the docs.  I'm not worried,
but it is something to note for the committer.

> My XSLT skills are mostly learning by doing combined with trial and
> error.

I think of XSLT as a functional programming language.  Recursion is
a big deal, and data directed programming can be a powerful technique
because XSLT is so good with data structures.
(https://mitp-content-server.mit.edu/books/content/sectbyfn/books_pres_0/6515/sicp.zip/full-text/book/book-Z-H-17.html#%_sec_2.4.3)

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/stylesheet-html-common.xsl b/doc/src/sgml/stylesheet-html-common.xsl
index 9df2782ce4..88e084e190 100644
--- a/doc/src/sgml/stylesheet-html-common.xsl
+++ b/doc/src/sgml/stylesheet-html-common.xsl
@@ -12,6 +12,10 @@
   all HTML output variants (chunked and single-page).
   -->
 
+
+http://cdn.docbook.org/release/xsl/current/html/sections.xsl"/>
+
 
 
 
@@ -301,4 +305,53 @@ set   toc,title
   
 
 
+
+
+  
+  
+  
+  
+  
+  
+  
+
+  
+
+
+
+
+  
+  
+
+  
+
+  #
+  
+
+  
+
+
+  id_link
+
+ #
+  
+
+
+  
+  
+
+  
+   element without id. Please add an id to make it usable
+   as stable anchor in the public HTML documentation.
+
+
+  Please add an id to the 
+  
+   element in the sgml source code.
+
+  
+
+  
+
+
 
diff --git a/doc/src/sgml/stylesheet.css b/doc/src/sgml/stylesheet.css
index 6410a47958..e4174e0613 100644
--- a/doc/src/sgml/stylesheet.css
+++ b/doc/src/sgml/stylesheet.css
@@ -163,3 +163,13 @@ acronym		{ font-style: inherit; }
 width: 75%;
   }
 }
+
+/* Links to ids of headers and definition terms */
+a.id_link {
+color: inherit;
+visibility: hidden;
+}
+
+*:hover > a.id_link {
+visibility: visible;
+}


Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-01-17 Thread Jacob Champion
On Sun, Jan 15, 2023 at 12:03 PM Andrey Chudnovsky
 wrote:
> 2. Removed Device Code implementation in libpq. Several reasons:
>- Reduce scope and focus on the protocol first.
>- Device code implementation uses iddawc dependency. Taking this
> dependency is a controversial step which requires broader discussion.
>- Device code implementation without iddaws would significantly
> increase the scope of the patch, as libpq needs to poll the token
> endpoint, setup different API calls, e.t.c.
>- That flow should canonically only be used for clients which can't
> invoke browsers. If it is the only flow to be implemented, it can be
> used in the context when it's not expected by the OAUTH protocol.

I'm not understanding the concern in the final point -- providers
generally require you to opt into device authorization, at least as far
as I can tell. So if you decide that it's not appropriate for your use
case... don't enable it. (And I haven't seen any claims that opting into
device authorization weakens the other flows in any way. So if we're
going to implement a flow in libpq, I still think device authorization
is the best choice, since it works on headless machines as well as those
with browsers.)

All of this points at a bigger question to the community: if we choose
not to provide a flow implementation in libpq, is adding OAUTHBEARER
worth the additional maintenance cost?

My personal vote would be "no". I think the hook-only approach proposed
here would ensure that only larger providers would implement it in
practice, and in that case I'd rather spend cycles on generic SASL.

> 3. Temporarily removed test suite. We are actively working on aligning
> the tests with the latest changes. Will add a patch with tests soon.

Okay. Case in point, the following change to the patch appears to be
invalid JSON:

> +   appendStringInfo(&buf,
> +   "{ "
> +   "\"status\": \"invalid_token\", "
> +   "\"openid-configuration\": \"%s\","
> +   "\"scope\": \"%s\" ",
> +   "\"issuer\": \"%s\" ",
> +   "}",

Additionally, the "issuer" field added here is not part of the RFC. I've
written my thoughts about unofficial extensions upthread but haven't
received a response, so I'm going to start being more strident: Please,
for the sake of reviewers, call out changes you've made to the spec, and
why they're justified.

The patches seem to be out of order now (and the documentation in the
commit messages has been removed).

Thanks,
--Jacob




Re: Removing redundant grouping columns

2023-01-17 Thread Tom Lane
I wrote:
> Yeah, sideswiped by 3c6fc5820 apparently.  No substantive change needed.

And immediately sideswiped by da5800d5f.

If nobody has any comments on this, I'm going to go ahead and push
it.  The value of the improvement is rapidly paling in comparison
to the patch's maintenance effort.

regards, tom lane

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 21237d18ef..473fa45bd4 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -3736,6 +3736,13 @@ appendGroupByClause(List *tlist, deparse_expr_cxt *context)
 	 */
 	Assert(!query->groupingSets);
 
+	/*
+	 * We intentionally print query->groupClause not processed_groupClause,
+	 * leaving it to the remote planner to get rid of any redundant GROUP BY
+	 * items again.  This is necessary in case processed_groupClause reduced
+	 * to empty, and in any case the redundancy situation on the remote might
+	 * be different than what we think here.
+	 */
 	foreach(lc, query->groupClause)
 	{
 		SortGroupClause *grp = (SortGroupClause *) lfirst(lc);
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c0267a99d2..385b76a8cb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2864,16 +2864,13 @@ select c2 * (random() <= 1)::int as c2 from ft2 group by c2 * (random() <= 1)::i
 -- GROUP BY clause in various forms, cardinal, alias and constant expression
 explain (verbose, costs off)
 select count(c2) w, c2 x, 5 y, 7.0 z from ft1 group by 2, y, 9.0::int order by 2;
-  QUERY PLAN   

- Sort
+ QUERY PLAN 
+
+ Foreign Scan
Output: (count(c2)), c2, 5, 7.0, 9
-   Sort Key: ft1.c2
-   ->  Foreign Scan
- Output: (count(c2)), c2, 5, 7.0, 9
- Relations: Aggregate on (public.ft1)
- Remote SQL: SELECT count(c2), c2, 5, 7.0, 9 FROM "S 1"."T 1" GROUP BY 2, 3, 5
-(7 rows)
+   Relations: Aggregate on (public.ft1)
+   Remote SQL: SELECT count(c2), c2, 5, 7.0, 9 FROM "S 1"."T 1" GROUP BY 2, 3, 5 ORDER BY c2 ASC NULLS LAST
+(4 rows)
 
 select count(c2) w, c2 x, 5 y, 7.0 z from ft1 group by 2, y, 9.0::int order by 2;
   w  | x | y |  z  
@@ -2990,14 +2987,13 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
 QUERY PLAN 
 ---
  GroupAggregate
-   Output: ($0), sum(ft1.c1)
-   Group Key: $0
+   Output: $0, sum(ft1.c1)
InitPlan 1 (returns $0)
  ->  Seq Scan on pg_catalog.pg_enum
->  Foreign Scan on public.ft1
- Output: $0, ft1.c1
+ Output: ft1.c1
  Remote SQL: SELECT "C 1" FROM "S 1"."T 1"
-(8 rows)
+(7 rows)
 
 select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
  exists |  sum   
@@ -3382,14 +3378,13 @@ select array_agg(c1 order by c1 using operator(public.<^)) from ft2 where c2 = 6
 --
  GroupAggregate
Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
-   Group Key: ft2.c2
->  Sort
- Output: c2, c1
+ Output: c1, c2
  Sort Key: ft2.c1 USING <^
  ->  Foreign Scan on public.ft2
-   Output: c2, c1
+   Output: c1, c2
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
-(9 rows)
+(8 rows)
 
 -- This should not be pushed either.
 explain (verbose, costs off)
@@ -3458,14 +3453,13 @@ select array_agg(c1 order by c1 using operator(public.<^)) from ft2 where c2 = 6
 --
  GroupAggregate
Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
-   Group Key: ft2.c2
->  Sort
- Output: c2, c1
+ Output: c1, c2
  Sort Key: ft2.c1 USING <^
  ->  Foreign Scan on public.ft2
-   Output: c2, c1
+   Output: c1, c2
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
-(9 rows)
+(8 rows)
 
 -- Cleanup
 drop operator class my_op_class using btree;
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 53f890bb48..50d23f922c 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -3345,9 +3345,9 @@ estimate_path_cost_size(PlannerInfo *root,
 			}
 
 			/* Get number of grouping columns and possible number of groups */
-			numGroupCols = list_length(root->parse->groupCla

Re: [PATCH] Add native windows on arm64 support

2023-01-17 Thread Andres Freund
Hi,

On 2022-12-16 10:52:23 +, Niyas Sait wrote:
> Subject: [PATCH v7] Enable postgres native build for windows-arm64 platform

>  elif host_cpu == 'arm' or host_cpu == 'aarch64'
>  
> -  prog = '''
> +  if cc.get_id() == 'msvc'
> +cdata.set('USE_ARMV8_CRC32C', false)
> +cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> +have_optimized_crc = true

I dimly recall that windows might actually require the relevant extension on
arm?


> +  else
> +prog = '''
>  #include 

I'd just make this include #ifdef _MSV_VER (or whatever it is).


>  int main(void)
> @@ -1960,18 +1966,19 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
> without -march=armv8-a+crc',
> -  args: test_c_args)
> -# Use ARM CRC Extension unconditionally
> -cdata.set('USE_ARMV8_CRC32C', 1)
> -have_optimized_crc = true
> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
> with -march=armv8-a+crc',
> -  args: test_c_args + ['-march=armv8-a+crc'])
> -# Use ARM CRC Extension, with runtime check
> -cflags_crc += '-march=armv8-a+crc'
> -cdata.set('USE_ARMV8_CRC32C', false)
> -cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> -have_optimized_crc = true
> +if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
> without -march=armv8-a+crc',
> +args: test_c_args)

Seems like it'd be easier to read if you don't re-indent this, but just have
the cc.get_id() == 'msvc' part of this if/else-if.


Greetings,

Andres Freund




Re: [EXTERNAL] Re: Support load balancing in libpq

2023-01-17 Thread Jacob Champion
On Fri, Jan 13, 2023 at 10:44 AM Jacob Champion  wrote:
> And my thought was that the one-time
> initialization could be moved to a place that doesn't need to know the
> connection options at all, to make it easier to reason about the
> architecture. Say, next to the WSAStartup machinery.

(And after marinating on this over the weekend, it occurred to me that
keeping the per-connection option while making the PRNG global
introduces an additional hazard, because two concurrent connections
can now fight over the seed value.)

--Jacob




Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-17 Thread Peter Geoghegan
On Tue, Jan 17, 2023 at 2:11 PM Robert Haas  wrote:
> On Tue, Jan 17, 2023 at 3:08 PM Peter Geoghegan  wrote:
> > If you assume that there is chronic undercounting of dead tuples
> > (which I think is very common), ...
>
> Why do you think that?

For the reasons I gave about statistics, random sampling, the central
limit theorem. All that stuff. This matches the experience of Andres.
And is obviously the only explanation behind the reliance on
antiwraparound autovacuums for cleaning up bloat in larger OLTP
databases. It just fits: the dead tuples approach can sometimes be so
completely wrong that even an alternative triggering condition based
on something that is virtually unrelated to the thing we actually care
about can do much better in practice. Consistently, reliably, for a
given table/workload.

> > How many dead heap-only tuples are equivalent to one LP_DEAD item?
> > What about page-level concentrations, and the implication for
> > line-pointer bloat? I don't have a good answer to any of these
> > questions myself.
>
> Seems a bit pessimistic. If we had unlimited resources and all
> operations were infinitely fast, the optimal strategy would be to
> vacuum after every insert, update, or delete. But in reality, that
> would be prohibitively expensive, so we're making a trade-off.

To a large degree, that's my point. I don't know how to apply this
information, so having detailed information doesn't seem like the main
problem.

> If we had an oracle that could provide us with perfect information,
> we'd ask it, among other things, how much work will be required to
> vacuum right now, and how much benefit would we get out of doing so.

And then what would we do? What about costs?

Even if we were omniscient, we still wouldn't be omnipotent. We're
still subject to the laws of physics. VACUUM would still be something
that more or less works at the level of the whole table, or not at
all. So being omniscient seems kinda overrated to me. Adding more
information does not in general lead to better outcomes.

> The dead tuple count is related to the first question. It's not a
> direct, linear relationship, but it's not completely unrelated,
> either. Maybe we could refine the estimates by gathering more or
> different statistics than we do now, but ultimately it's always going
> to be a trade-off between getting the work done sooner (and thus maybe
> preventing table growth or a wraparound shutdown) and being able to do
> more work at once (and thus being more efficient). The current set of
> counters predates HOT and the visibility map, so it's not surprising if
> it needs updating, but if you're arguing that the whole
> concept is just garbage, I think that's an overreaction.

What I'm arguing is that principally relying on any one thing is
garbage. If you have only one thing that creates pressure to VACUUM
then there can be a big impact whenever it turns out to be completely
wrong. Whereas if VACUUM can run because of (say) 3 moderate signals
taken together, then it's much less likely that we'll be completely
wrong. In general my emphasis is on avoiding disaster in all its
forms. Vacuuming somewhat early more often is perhaps suboptimal, but
far from a disaster. It's the kind of thing that we can manage.

By all means, let's make the dead tuples/dead items stuff less naive
(e.g. make it distinguish between LP_DEAD items and dead heap-only
tuples). But even then, we shouldn't continue to completely rely on it
in the way that we do right now. In other words, I'm fine with adding
more information that is more accurate, as long as we stop making the
mistake of failing to treat it as somewhat suspect, and certainly don't
rely on it completely if at all possible. In particular, we
need to think about both costs and benefits at all times.

-- 
Peter Geoghegan




Re: Can we let extensions change their dumped catalog schemas?

2023-01-17 Thread Jacob Champion
On 1/12/23 11:04, Jacob Champion wrote:
> On Wed, Jan 11, 2023 at 1:03 PM Tom Lane  wrote:
>> Jacob Champion  writes:
>>> Right, I think it would have to be opt-in. Say, a new control file
>>> option dump_version or some such.
>>
>> That would require all the installed extensions to cope with this
>> the same way, which does not seem like a great assumption.
> 
> How so? Most installed extensions would not opt into a version dump,
> I'd imagine.

As a concrete example, Timescale's extension control file could look
like this:

default_version = '2.x.y'
module_pathname = '$libdir/timescaledb-2.x.y'
...
dump_version = true

which would then cause pg_dump to issue a VERSION for its CREATE
EXTENSION line. Other extensions would remain with the default
(dump_version = false), so they'd be dumped without an explicit VERSION.

(And in the case of option 3, the name of the control file option
changes -- dump_internals, maybe? -- but it still doesn't affect other
installed extensions.)

--Jacob




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-17 Thread Tomas Vondra



On 1/17/23 21:59, Justin Pryzby wrote:
> On Tue, Jan 17, 2023 at 08:44:36PM +0100, Tomas Vondra wrote:
>> On 1/9/23 09:44, Ilya Gladyshev wrote:
>>> On Sun, 2023-01-08 at 10:48 -0600, Justin Pryzby wrote:
 On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> ...
>
> @committers: Is it okay to add nparts_done to IndexStmt ?

 Any hint about this ?

>>
>> AFAIK fields added at the end of a struct is seen as acceptable from the
>> ABI point of view. It's not risk-free, but we did that multiple times
>> when fixing bugs, IIRC.
> 
> My question isn't whether it's okay to add a field at the end of a
> struct in general, but rather whether it's acceptable to add this field
> at the end of this struct, and not because it's unsafe to do in a minor
> release, but whether someone is going to say that it's an abuse of the
> data structure.
> 

Ah, you mean whether it's the right place for the parameter?

I don't think it is, really. IndexStmt is meant to be a description of
the CREATE INDEX statement, not something that includes info about how
it's processed. But it's the only struct we pass to the DefineIndex for
child indexes, so the only alternatives I can think of are a global
variable and the (existing) param array field.

Nevertheless, ABI compatibility is still relevant for backbranches.


>> Do we actually need the new parts_done field? I mean, we already do
>> track the value - at PROGRESS_CREATEIDX_PARTITIONS_DONE index in the
>> st_progress_param array. Can't we simply read it from there? Then we
>> would not have ABI issues with the new field added to IndexStmt.
> 
> Good idea to try.
> 

OK

>> As for the earlier discussion about the "correct" behavior for leaf vs.
>> non-leaf partitions and whether to calculate partitions in advance:
>>
>> * I agree it's desirable to count partitions in advance, instead of
>> adding incrementally. The view is meant to provide "overview" of the
>> CREATE INDEX progress, and imagine you get
>>
>>   partitions_total partitions_done
>> 10   9
>>
>> so that you believe you're ~90% done. But then it jumps to the next
>> child and now you get
>>
>>   partitions_total partitions_done
>> 20  10
>>
>> which makes the view a bit useless for its primary purpose, IMHO.
> 
> To be clear, that's the current, buggy behavior, and this thread is
> about fixing it.  The proposed patches all ought to avoid that.
> 
> But the bug isn't caused by not "calculating partitions in advance".
> Rather, the issue is that currently, the "total" is overwritten while
> recursing.
> 

You're right the issue is about overwriting the total - not sure what I
was thinking about when writing this. I guess I got distracted by the
discussion about "preliminary planning" etc. Sorry for the confusion.

> That's a separate question from whether indirect partitions are counted
> or not.
> 
>> I wonder if we could improve this to track the size of partitions for
>> total/done? That'd make leaf/non-leaf distinction unnecessary, because
>> non-leaf partitions have size 0.
> 
> Maybe, but it's out of scope for this patch.
> 

+1, it was just an idea for future.


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: constify arguments of copy_file() and copydir()

2023-01-17 Thread Michael Paquier
On Tue, Jan 17, 2023 at 08:23:49PM +0100, Peter Eisentraut wrote:
> Looks good to me.

Thanks, done.
--
Michael




Re: constify arguments of copy_file() and copydir()

2023-01-17 Thread Nathan Bossart
On Wed, Jan 18, 2023 at 09:05:52AM +0900, Michael Paquier wrote:
> Thanks, done.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: almost-super-user problems that we haven't fixed yet

2023-01-17 Thread Nathan Bossart
On Tue, Jan 17, 2023 at 02:59:31PM -0500, Robert Haas wrote:
> On Tue, Jan 17, 2023 at 1:42 PM Nathan Bossart  
> wrote:
>> If we create a new batch of reserved connections, only roles with
>> privileges of pg_use_reserved_connections would be able to connect if the
>> number of remaining slots is greater than superuser_reserved_connections
>> but less than or equal to superuser_reserved_connections +
>> reserved_connections.  Only superusers would be able to connect if the
>> number of remaining slots is less than or equal to
>> superuser_reserved_connections.  This helps avoid blocking new superuser
>> connections even if you've reserved some connections for non-superusers.
> 
> This is precisely what I had in mind.

Great.  Here is a first attempt at the patch.

> I think the documentation will need some careful wordsmithing,
> including adjustments to superuser_reserved_connections. We want to
> recast superuser_reserved_connections as a final reserve to be touched
> after even reserved_connections has been exhausted.

I tried to do this, but there is probably still room for improvement,
especially for the parts that discuss the relationship between
max_connections, superuser_reserved_connections, and reserved_connections.
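
For illustration, the intended usage ends up looking roughly like this (a
sketch using the GUC and predefined role names proposed here; the role being
granted is made up):

-- leave the last 3 slots for superusers only, and the 5 slots before
-- that for roles with the pg_use_reserved_connections privilege
ALTER SYSTEM SET superuser_reserved_connections = 3;
ALTER SYSTEM SET reserved_connections = 5;
GRANT pg_use_reserved_connections TO monitoring_role;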

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e153d2f22d4303d0bb5e8134391ebf1fa446172c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 17 Jan 2023 13:58:56 -0800
Subject: [PATCH v1 1/2] Rename ReservedBackends to SuperuserReservedBackends.

This is in preparation for adding a new reserved_connections GUC
that will use the ReservedBackends variable name.
---
 src/backend/postmaster/postmaster.c | 18 +-
 src/backend/utils/init/postinit.c   |  4 ++--
 src/backend/utils/misc/guc_tables.c |  2 +-
 src/include/postmaster/postmaster.h |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cedc1b9f0..470704f364 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -204,15 +204,15 @@ char	   *Unix_socket_directories;
 char	   *ListenAddresses;
 
 /*
- * ReservedBackends is the number of backends reserved for superuser use.
- * This number is taken out of the pool size given by MaxConnections so
+ * SuperuserReservedBackends is the number of backends reserved for superuser
+ * use.  This number is taken out of the pool size given by MaxConnections so
  * number of backend slots available to non-superusers is
- * (MaxConnections - ReservedBackends).  Note what this really means is
- * "if there are <= ReservedBackends connections available, only superusers
- * can make new connections" --- pre-existing superuser connections don't
- * count against the limit.
+ * (MaxConnections - SuperuserReservedBackends).  Note what this really means
+ * is "if there are <= SuperuserReservedBackends connections available, only
+ * superusers can make new connections" --- pre-existing superuser connections
+ * don't count against the limit.
  */
-int			ReservedBackends;
+int			SuperuserReservedBackends;
 
 /* The socket(s) we're listening to. */
 #define MAXLISTEN	64
@@ -908,11 +908,11 @@ PostmasterMain(int argc, char *argv[])
 	/*
 	 * Check for invalid combinations of GUC settings.
 	 */
-	if (ReservedBackends >= MaxConnections)
+	if (SuperuserReservedBackends >= MaxConnections)
 	{
 		write_stderr("%s: superuser_reserved_connections (%d) must be less than max_connections (%d)\n",
 	 progname,
-	 ReservedBackends, MaxConnections);
+	 SuperuserReservedBackends, MaxConnections);
 		ExitPostmaster(1);
 	}
 	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..6fa696fe8d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -927,8 +927,8 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	 * limited by max_connections or superuser_reserved_connections.
 	 */
 	if (!am_superuser && !am_walsender &&
-		ReservedBackends > 0 &&
-		!HaveNFreeProcs(ReservedBackends))
+		SuperuserReservedBackends > 0 &&
+		!HaveNFreeProcs(SuperuserReservedBackends))
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
  errmsg("remaining connection slots are reserved for non-replication superuser connections")));
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..5aa2cda8f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2163,7 +2163,7 @@ struct config_int ConfigureNamesInt[] =
 			gettext_noop("Sets the number of connection slots reserved for superusers."),
 			NULL
 		},
-		&ReservedBackends,
+		&SuperuserReservedBackends,
 		3, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 203177

RE: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-17 Thread Takamichi Osumi (Fujitsu)
Hi,

On Tuesday, January 17, 2023 9:54 PM Amit Kapila  
wrote:
> On Tue, Jan 17, 2023 at 4:30 PM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> > On Saturday, January 14, 2023 3:27 PM vignesh C 
> wrote:
> >
> > > 2) I'm not sure if this will add any extra coverage as the altering
> > > value of min_apply_delay is already tested in the regression, if so
> > > this test can be
> > > removed:
> > > +# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
> > > +$node_subscriber->safe_psql('postgres',
> > > +   "ALTER SUBSCRIPTION tap_sub SET (min_apply_delay =
> > > 8646)"
> > > +);
> > > +
> > > +# New row to trigger apply delay.
> > > +$node_publisher->safe_psql('postgres',
> > > +   "INSERT INTO test_tab VALUES (0, 'foobar')");
> > > +
> > > +check_apply_delay_log("logical replication apply delay",
> > > +"8000");
> > While addressing this point, I've noticed that there is a behavior
> > difference between physical replication's recovery_min_apply_delay and
> > this feature when stopping the replication during delays.
> >
> > At present, in the latter case,
> > the apply worker exits without applying the suspended transaction
> > after ALTER SUBSCRIPTION DISABLE command for the subscription.
> >
> 
> In the previous paragraph, you said the behavior difference while stopping the
> replication but it is not clear from where this DISABLE command comes in that
> scenario.
Sorry for my unclear description. By "stopping the replication" I meant
disabling the subscription during the "min_apply_delay" wait time in a logical
replication setup.

I raised this discussion point to define how the time-delayed apply worker
should behave when a transaction is delayed by the "min_apply_delay" parameter
and the user additionally issues ALTER SUBSCRIPTION ... DISABLE during the
delay. When it comes to physical replication, it's hard to find a perfect
counterpart to LR's ALTER SUBSCRIPTION DISABLE command, so for comparison I
chose the scenario of promoting a secondary during "recovery_min_apply_delay".
After the promotion of the secondary, the transaction that was committed on the
primary but delayed on the secondary becomes visible. This would be because
CheckForStandbyTrigger in recoveryApplyDelay returns true and we apply the
record by breaking out of the wait. I checked and got the LOG message
"received promote request" in the secondary log when I tested this case.

> > Meanwhile, there is no "disabling" command for physical replication,
> > but I checked the behavior about what happens for promoting a
> > secondary during the delay of recovery_min_apply_delay for physical
> replication as one example.
> > The transaction has become visible even in the promoting in the middle of
> delay.
> >
> 
> What causes such a transaction to be visible after promotion? Ideally, if the
> commit doesn't succeed, the transaction shouldn't be visible.
> Do, we allow the transaction waiting due to delay to get committed on
> promotion?
The commit succeeded on the primary, and then I promoted the secondary
during the "recovery_min_apply_delay" wait for that transaction. As a result,
the transaction turned out to be visible on the promoted secondary.


> > I'm not sure if I should make the time-delayed LR aligned with this 
> > behavior.
> > Does someone has an opinion for this ?
> >
> 
> Can you please explain a bit more as asked above to understand the
> difference?
So, the current difference is that the time-delayed apply worker in logical
replication does not apply the delayed transaction on the subscriber when the
subscription is disabled during the delay, whereas (in the promotion example)
physical replication does apply the delayed transaction.
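
A hypothetical sketch of the apply-side wait, with invented names (this is not
the actual patch code), just to mark where the divergence happens:

while (true)
{
    long        diffms;

    diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
                                             apply_delay_until);
    if (diffms <= 0)
        break;              /* delay elapsed: apply the transaction */

    /*
     * Today the worker notices ALTER SUBSCRIPTION ... DISABLE around here
     * and exits without applying, unlike the promotion case on the physical
     * side, which breaks the wait and applies the record.
     */
    CHECK_FOR_INTERRUPTS();

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     diffms, 0 /* wait event omitted in this sketch */);
    ResetLatch(MyLatch);
}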



Best Regards,
Takamichi Osumi





Re: CI and test improvements

2023-01-17 Thread Justin Pryzby
On Tue, Jan 17, 2023 at 11:56:42AM -0800, Andres Freund wrote:
> > $(CF_PGP_TESTS) is missing from contrib/pgcrypto/meson.build
> 
> Assume that's the false positive?

Yes

> > I also tried but failed to write something to warn if "meson test" was
> > run with a list of tests but without tmp_install.  Help wanted.
> 
> That doesn't even catch the worst case - when there's tmp_install, but it's
> too old.

I don't understand what you mean by "too old" ?

> > I propose to put something like this into "SanityCheck".
> 
> Perhaps we instead could add it as a separate "meson-only" test? Then it'd
> fail on developer's machines, instead of later in CI. We could pass the test
> information from the 'tests' array, or it could look at the metadata in
> meson-info/intro-tests.json

I guess you mean that it should be *able* to fail on developer machines
*in addition* to cirrusci.

But a meson-only test might not be so helpful, as it assumes that the
developer is using meson, in which case the problem would tend not to
have occurred in the first place.

BTW I also noticed that:

meson.build:meson_binpath_r = run_command(python, 'src/tools/find_meson', check: true)
meson.build-
meson.build-if meson_binpath_r.returncode() != 0 or meson_binpath_r.stdout() == ''
meson.build-  error('huh, could not run find_meson.\nerrcode: @0@\nstdout: @1@\nstderr: @2@'.format(

The return code will never be nonzero since check==true, right ?

-- 
Justin




Re: Backends stalled in 'startup' state

2023-01-17 Thread Ashwin Agrawal
On Tue, Jan 17, 2023 at 4:52 PM Ashwin Agrawal  wrote:

>
> We recently saw many backends (close to max_connection limit) get stalled
> in 'startup' in one of the production environments for Greenplum (fork of
> PostgreSQL). Tracing the reason, it was found all the tuples created by
> bootstrap (xmin=1) in pg_attribute were at super high block numbers (for
> example beyond 30,000). Tracing the reason for the backend startup stall
> exactly matched Tom's reasoning in [1]. Stalls became much longer in
> presence of sub-transaction overflow or presence of long running
> transactions as tuple visibility took longer. The thread ruled out the
> possibility of system catalog rows to be present in higher block numbers
> instead of in front for pg_attribute.
>
> This thread provides simple reproduction on the latest version of
> PostgreSQL and RCA for how bootstrap catalog entries can move to higher
> blocks and as a result cause stalls for backend starts. Simple fix to avoid
> the issue provided at the end.
>
> The cause is syncscan triggering during VACUUM FULL. VACUUM FULL rewrites
> the table by performing a seqscan as well, and
> heapam_relation_copy_for_cluster() tells that scan it is free to use
> syncscan. Hence the scan may start not from block 0 but from some other
> block already in cache, which opens the possibility of moving the bootstrap
> tuples anywhere else in the table.
>
> --
> Repro
> --
> -- create database to play
> drop database if exists test;
> create database test;
> \c test
>
> -- function just to create many tables to increase pg_attribute size
> -- (ideally many column table might do the job more easily)
> CREATE OR REPLACE FUNCTION public.f(id integer)
>  RETURNS void
>  LANGUAGE plpgsql
>  STRICT
> AS $function$
> declare
>   sql text;
>   i int;
> begin
>   for i in id..id+ loop
> sql='create table if not exists tbl'||i||' (id int)';
> execute sql;
>   end loop;
> end;
> $function$;
>
> select f(1);
> select f(2);
> select f(3);
> select f(4);
>
> -- validate pg_attribute size is greater than 1/4 of shared_buffers
> -- for syncscan to trigger
> show shared_buffers;
> select pg_size_pretty(pg_relation_size('pg_attribute'));
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>
> -- perform seq scan of pg_attribute to page past bootstrapped tuples
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
>
> -- this will internally use syncscan starting with a block after the bootstrap tuples
> -- and hence move the bootstrap tuples last, to higher block numbers
> vacuum full pg_attribute;
>
> --
> Sample run
> --
> show shared_buffers;
>  shared_buffers
> ----------------
>  128MB
> (1 row)
>
> select pg_size_pretty(pg_relation_size('pg_attribute'));
>  pg_size_pretty
> ----------------
>  40 MB
> (1 row)
>
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>  ctid  | xmin | attrelid |   attname
> -------+------+----------+--------------
>  (0,1) |    1 |     1255 | oid
>  (0,2) |    1 |     1255 | proname
>  (0,3) |    1 |     1255 | pronamespace
>  (0,4) |    1 |     1255 | proowner
>  (0,5) |    1 |     1255 | prolang
> (5 rows)
>
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
> COPY 2000
> vacuum full pg_attribute;
> VACUUM
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
>    ctid    | xmin | attrelid |   attname
> -----------+------+----------+--------------
>  (5115,14) |    1 |     1255 | oid
>  (5115,15) |    1 |     1255 | proname
>  (5115,16) |    1 |     1255 | pronamespace
>  (5115,17) |    1 |     1255 | proowner
>  (5115,18) |    1 |     1255 | prolang
> (5 rows)
>
>
> Note:
> -- used logic causing the problem to fix it as well on the system :-)
> -- scan till block where bootstrap tuples are located
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1
> limit 5;
> -- now, due to syncscan triggering, it will pick the blocks with the bootstrap
> -- tuples first and help bring them back to the front
> vacuum full pg_attribute;
>
> --
> Patch to avoid the problem:
> --
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index a3414a76e8..4c031914a3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -757,7 +757,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>      pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
>                                   PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
> 
> -    tableScan = table_beginscan(OldHea
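
A minimal sketch of the idea behind the diff, assuming the usual
table_beginscan_strat() signature (this is a sketch, not the exact patch): start
the rewrite's seqscan with syncscan disabled so it always begins at block 0 and
the bootstrap tuples keep their low block numbers.

/* in heapam_relation_copy_for_cluster(), sketch only */
tableScan = table_beginscan_strat(OldHeap, SnapshotAny, 0, (ScanKey) NULL,
                                  true,    /* allow_strat */
                                  false);  /* allow_sync: no syncscan */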

Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-01-17 Thread Andrey Chudnovsky
> All of this points at a bigger question to the community: if we choose
> not to provide a flow implementation in libpq, is adding OAUTHBEARER
> worth the additional maintenance cost?

> My personal vote would be "no". I think the hook-only approach proposed
> here would ensure that only larger providers would implement it in
> practice

Flow implementations in libpq are definitely a long term plan, and I
agree that it would democratise the adoption.
In the previous posts in this conversation I outlined the ones I think
we should support.

However, I don't see why it's strictly necessary to couple those.
As long as the SASL exchange for the OAUTHBEARER mechanism is supported by
the protocol, the client side can evolve at its own pace.
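
For illustration, the OAUTHBEARER initial client response defined in RFC 7628
is just a small framed string carrying an opaque bearer token; the SASL layer
does not care how the client obtained it. A rough, self-contained sketch (not
the patch's code):

#include <stdio.h>
#include <stddef.h>

/*
 * RFC 7628 initial client response: gs2 header, then key=value pairs
 * separated by 0x01, terminated by two 0x01 bytes.  The token is opaque
 * at this layer, which is why client-side flows can evolve independently.
 */
static int
build_oauthbearer_initial_response(char *buf, size_t buflen, const char *token)
{
    return snprintf(buf, buflen, "n,,\x01" "auth=Bearer %s\x01\x01", token);
}

int
main(void)
{
    char    msg[256];
    int     len;

    len = build_oauthbearer_initial_response(msg, sizeof(msg),
                                             "token-from-any-flow");
    printf("initial response is %d bytes\n", len);
    return 0;
}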

At the same time, the current implementation allows clients to start building
provider-agnostic OAUTH support by using iddawc or the OAUTH client
implementations of their respective platforms. So I wouldn't say "larger
providers" here, but rather "more motivated clients"; the two definitely
overlap, but this keeps the system open.

> I'm not understanding the concern in the final point -- providers
> generally require you to opt into device authorization, at least as far
> as I can tell. So if you decide that it's not appropriate for your use
> case... don't enable it. (And I haven't seen any claims that opting into
> device authorization weakens the other flows in any way. So if we're
> going to implement a flow in libpq, I still think device authorization
> is the best choice, since it works on headless machines as well as those
> with browsers.)
I agree with the statement that Device code is the best first choice
if we absolutely have to pick one.
Though I don't think we have to.

While the device flow can be used for all kinds of user-facing applications,
it's specifically designed for input-constrained scenarios, as clearly stated
in the abstract of https://www.rfc-editor.org/rfc/rfc8628.
The authorization code flow with PKCE is recommended by the RFCs and major
providers for cases where it's feasible.
The long-term goal is to provide both, though I don't see why implementing
the backbone protocol first wouldn't add value.

Another point: user authentication is only one side of the whole story; the
other critical side is system-to-system authentication, where we have client
credentials and certificates. The latter is much harder to implement
generically, as provider-specific tokens need to be signed.

Adding to the reasoning above, I think libpq support for specific flows can
come in later iterations, after the protocol support.

> in that case I'd rather spend cycles on generic SASL.
I see two approaches to generic SASL:
(a) Generic SASL as a framework used in the protocol, with the mechanisms
implemented on top and exposed to DBAs as auth types to configure in hba.
This is the direction we're going here, which is well aligned with the
existing hba-based auth configuration.
(b) Generic SASL exposed to developers on the server and client side to
extend on. This seems a much longer shot; the specific points of major
ambiguity are the libpq distribution model (which you pointed to) and the
potential pluggability of insecure mechanisms.

I do see (a) as a sweet spot with a lot of value for various participants
and much less ambiguity.

> Additionally, the "issuer" field added here is not part of the RFC. I've
> written my thoughts about unofficial extensions upthread but haven't
> received a response, so I'm going to start being more strident: Please,
> for the sake of reviewers, call out changes you've made to the spec, and
> why they're justified.
Thanks for your feedback on this. We had this discussion as well, and added
the field as a convenience for the client to identify the provider. I don't
see a reason why an issuer would be absolutely necessary, so we take your
point that sticking to the RFCs is the safer choice.

> The patches seem to be out of order now (and the documentation in the
> commit messages has been removed).
Feedback taken. Work in progress.

On Tue, Jan 17, 2023 at 2:44 PM Jacob Champion  wrote:
>
> On Sun, Jan 15, 2023 at 12:03 PM Andrey Chudnovsky
>  wrote:
> > 2. Removed Device Code implementation in libpq. Several reasons:
> >- Reduce scope and focus on the protocol first.
> >- Device code implementation uses iddawc dependency. Taking this
> > dependency is a controversial step which requires broader discussion.
> >- Device code implementation without iddaws would significantly
> > increase the scope of the patch, as libpq needs to poll the token
> > endpoint, setup different API calls, e.t.c.
> >- That flow should canonically only be used for clients which can't
> > invoke browsers. If it is the only flow to be implemented, it can be
> > used in the context when it's not expected by the OAUTH protocol.
>
> I'm not understanding the concern in the final point -- providers
> generally require you to opt into device authorization

Re: Removing redundant grouping columns

2023-01-17 Thread Richard Guo
On Wed, Jan 18, 2023 at 6:51 AM Tom Lane  wrote:

> I wrote:
> > Yeah, sideswiped by 3c6fc5820 apparently.  No substantive change needed.
>
> And immediately sideswiped by da5800d5f.


Yeah, I noticed this too yesterday.  I reviewed through these two
patches yesterday and I think they are in good shape now.

Thanks
Richard


Re: Removing redundant grouping columns

2023-01-17 Thread David Rowley
On Wed, 18 Jan 2023 at 14:55, Richard Guo  wrote:
> Yeah, I noticed this too yesterday.  I reviewed through these two
> patches yesterday and I think they are in good shape now.

I'm currently reviewing the two patches.

David




Re: recovery modules

2023-01-17 Thread Michael Paquier
On Tue, Jan 17, 2023 at 10:23:56AM -0800, Nathan Bossart wrote:
> Yeah, this seems cleaner.  I removed BuildRestoreCommand() in v8.

if (*sp == *lp)
{
-   if (val)
-   {
-   appendStringInfoString(&result, val);
-   found = true;
-   }
-   /* If val is NULL, we will report an error. */
+   appendStringInfoString(&result, val);
+   found = true;

In 0002, this code block has been removed as an effect of the removal of
BuildRestoreCommand(), because RestoreArchivedFile() needs to handle two
flags with two values.  The current design also has the advantage of
warning extension developers about an unexpected manipulation, so I have
kept the logic in percentrepl.c as-is.

I was also wondering whether ExecuteRecoveryCommand() should use a bits32 for
its two boolean flags, but did not bother, as it is static in shell_restore.c
so the ABI does not matter, even though there are three callers covering 75%
of the possible combinations (only false/true is unused).
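
For the record, the bits32 variant would look roughly like this (flag names
invented for illustration, not a committed interface):

/* hypothetical flag bits replacing the two boolean arguments */
#define RECOVERY_COMMAND_FAIL_ON_SIGNAL     0x01
#define RECOVERY_COMMAND_EXIT_ON_SIGTERM    0x02

static void ExecuteRecoveryCommand(const char *command,
                                   const char *commandName,
                                   bits32 flags);

/* and a hypothetical caller, setting only one of the flags: */
ExecuteRecoveryCommand(recoveryEndCommand, "recovery_end_command",
                       RECOVERY_COMMAND_FAIL_ON_SIGNAL);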

And 0002 is applied.
--
Michael




Re: Perform streaming logical transactions by background workers and parallel apply

2023-01-17 Thread Amit Kapila
On Tue, Jan 17, 2023 at 8:07 PM Masahiko Sawada  wrote:
>
> On Tue, Jan 17, 2023 at 6:14 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Tuesday, January 17, 2023 2:46 PM Masahiko Sawada 
> >  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 12:37 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > I'm slightly concerned that there could be overhead of executing
> > > GetLeaderApplyWorkerPid () for every backend process except for parallel
> > > query workers. The number of such backends could be large and
> > > GetLeaderApplyWorkerPid() acquires the lwlock. For example, does it make
> > > sense to check (st_backendType == B_BG_WORKER) before calling
> > > GetLeaderApplyWorkerPid()? Or it might not be a problem since it's
> > > LogicalRepWorkerLock which is not likely to be contended.
> >
> > Thanks for the comment and I think your suggestion makes sense.
> > I have added the check before getting the leader pid. Here is the new 
> > version patch.
>
> Thank you for updating the patch. Looks good to me.
>

Pushed.

-- 
With Regards,
Amit Kapila.



