Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-20 Thread Bharath Rupireddy
On Sat, Mar 19, 2022 at 4:39 AM Andres Freund  wrote:
>
> before we further change this (as done in this patch) we should deduplicate
> these huge statements in a separate patch / commit.

Something like attached
v6-0001-Deduplicate-checkpoint-restartpoint-complete-mess.patch?

> As discussed in a bunch of other threads, we typically prefer to cast to long
> long and use %ll instead of using UINT64_FORMAT these days.

Changed.

> > + if (CheckpointStats.repl_map_files_rmvd_cnt ||
> > + CheckpointStats.repl_map_files_syncd_cnt > 0)
> > + {
> > + long t_msecs;
> > +
> > + t_msecs = TimestampDifferenceMilliseconds(CheckpointStats.repl_snap_start_t,
> > +   CheckpointStats.repl_snap_end_t);
> > +
> > + appendStringInfo(&logmsg,
> > + _("; logical decoding rewrite mapping file(s) removed=" UINT64_FORMAT ", synced=" UINT64_FORMAT ", time=%ld.%03d s, cutoff LSN=%X/%X"),
> > +   CheckpointStats.repl_map_files_rmvd_cnt,
> > +   CheckpointStats.repl_map_files_syncd_cnt,
> > +   t_msecs / 1000, (int) (t_msecs % 1000),
> > +   LSN_FORMAT_ARGS(CheckpointStats.repl_map_cutoff_lsn));
> > + }
> > +
> > + ereport(LOG, errmsg_internal("%s", logmsg.data));
> > + pfree(logmsg.data);
> >  }
>
> This practically doubles the size of the log message - doesn't that seem a bit
> disproportionate? Can we make this more dense? "logical decoding rewrite
> mapping file(s) removed=" and "logical decoding snapshot file(s) removed=" is
> quite long.

Do you suggest something like below? Or some other better wording like
"logical decoding rewrite map files" and "logical decoding snapshot
files" or "logical rewrite map files" and "logical snapshot files" or
just "rewrite mapping files" or "snapshot files"  ?

if (repl_snap_files_rmvd_cnt > 0 && (repl_map_files_rmvd_cnt > 0 ||
    repl_map_files_syncd_cnt > 0))
    appendStringInfo(&logmsg,
        _("; logical decoding snapshot file(s) removed=%lu, time=%ld.%03d s, cutoff LSN=%X/%X; rewrite mapping file(s) removed=%lu, synced=%lu, time=%ld.%03d s, cutoff LSN=%X/%X"), ...);
else if (repl_snap_files_rmvd_cnt > 0)
    appendStringInfo(&logmsg,
        _("; logical decoding snapshot file(s) removed=%lu, time=%ld.%03d s, cutoff LSN=%X/%X"), ...);
else if (repl_map_files_rmvd_cnt > 0 || repl_map_files_syncd_cnt > 0)
    appendStringInfo(&logmsg,
        _("; logical decoding rewrite mapping file(s) removed=%lu, synced=%lu, time=%ld.%03d s, cutoff LSN=%X/%X"), ...);

On Sat, Mar 19, 2022 at 3:16 AM Nathan Bossart  wrote:
> /* buffer stats */
> appendStringInfo(&logmsg, "wrote %d buffers (%.1f%%); ",
>   CheckpointStats.ckpt_bufs_written,
>   (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers);
>
> /* WAL file stats */
> appendStringInfo(&logmsg, "%d WAL file(s) added, %d removed, %d recycled; ",
>   CheckpointStats.ckpt_segs_added,
>   CheckpointStats.ckpt_segs_removed,
>   CheckpointStats.ckpt_segs_recycled);

Do we actually need to granularize the message like this? I don't
think so, because we print the stats even when they are 0 (buffers
written is 0, WAL segments added is 0, and so on).

Regards,
Bharath Rupireddy.


v6-0001-Deduplicate-checkpoint-restartpoint-complete-mess.patch
Description: Binary data


v6-0001-Add-checkpoint-stats-of-snapshot-and-mapping-file.patch
Description: Binary data


Re: Logical replication timeout problem

2022-03-20 Thread Amit Kapila
On Fri, Mar 18, 2022 at 4:20 PM Amit Kapila  wrote:
>
> On Fri, Mar 18, 2022 at 10:43 AM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Thu, Mar 17, 2022 at 7:52 PM Masahiko Sawada  
> > wrote:
> > >
> >
> > Attach the new patch.
> >
>
> *
>   case REORDER_BUFFER_CHANGE_INVALIDATION:
> - /* Execute the invalidation messages locally */
> - ReorderBufferExecuteInvalidations(
> -   change->data.inval.ninvalidations,
> -   change->data.inval.invalidations);
> - break;
> + {
> + LogicalDecodingContext *ctx = rb->private_data;
> +
> + /* Try to send a keepalive message. */
> + UpdateProgress(ctx, true);
>
> Calling UpdateProgress() here appears adhoc to me especially because
> it calls OutputPluginUpdateProgress which appears to be called only
> from plugin API. Am, I missing something? Also why the same handling
> is missed in other similar messages like
> REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID where we don't call any
> plug-in API?
>
> I am not sure what is a good way to achieve this but one idea that
> occurred to me was shall we invent a new callback
> ReorderBufferSkipChangeCB similar to ReorderBufferApplyChangeCB and
> then pgoutput can register its API where we can have the logic similar
> to what you have in UpdateProgress()? If we do so, then all the
> current callers of UpdateProgress in pgoutput can also call that API.
> What do you think?
>

Another idea could be that we leave the DDL case for now as anyway
there is very little chance of a timeout while skipping DDLs, and we may
later need to even backpatch this bug-fix which would be another
reason to not make such invasive changes. We can handle the DDL case
if required separately.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-20 Thread Thomas Munro
On Mon, Mar 21, 2022 at 4:36 PM Thomas Munro  wrote:
> On Sat, Feb 26, 2022 at 7:58 AM David Christensen
>  wrote:
> > Attached is V2 with additional feedback from this email, as well as the 
> > specification of the
> > ForkNumber and FPW as specifiable options.
>
> Trivial fixup needed after commit 3f1ce973.

[04:30:50.630] pg_waldump.c:963:26: error: format ‘%u’ expects
argument of type ‘unsigned int *’, but argument 3 has type ‘ForkNumber
*’ [-Werror=format=]
[04:30:50.630] 963 | if (sscanf(optarg, "%u",
_by_relation_forknum) != 1 ||
[04:30:50.630] | ~^ ~~
[04:30:50.630] | | |
[04:30:50.630] | | ForkNumber *
[04:30:50.630] | unsigned int *

And now that this gets to the CompilerWarnings CI task, it looks like
GCC doesn't like an enum as a scanf %u destination (I didn't see that
warning locally when I compiled the above fixup because clearly Clang
is cool with it...).  Probably needs a temporary unsigned int to
sscanf into first.




Re: [PATCH] add relation and block-level filtering to pg_waldump

2022-03-20 Thread Thomas Munro
On Sat, Feb 26, 2022 at 7:58 AM David Christensen
 wrote:
> Attached is V2 with additional feedback from this email, as well as the 
> specification of the
> ForkNumber and FPW as specifiable options.

Trivial fixup needed after commit 3f1ce973.
From 39e683756e718904b3a8651508cb1e2198a852e4 Mon Sep 17 00:00:00 2001
From: David Christensen 
Date: Fri, 25 Feb 2022 12:52:56 -0600
Subject: [PATCH v3 1/2] Add additional filtering options to pg_waldump

This feature allows you to output only records that target a specific RelFileNode and,
optionally, a BlockNumber within that relation, while specifying which ForkNum you want to filter to.

We also add the independent ability to filter by full-page writes.
---
 doc/src/sgml/ref/pg_waldump.sgml |  48 
 src/bin/pg_waldump/pg_waldump.c  | 128 ++-
 2 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 5735a161ce..f157175764 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -100,6 +100,44 @@ PostgreSQL documentation
   
  
 
+ 
+  -k block
+  --block=block
+  
+   
+Display only records touching the given block. (Requires also
+providing the relation via --relation.)
+   
+  
+ 
+
+ 
+  --fork=fork
+  
+   
+When using the --relation filter, output only records
+from the given fork.  The valid values here are 0
+for the main fork, 1 for the Free Space
+Map, 2 for the Visibility Map,
+and 3 for the Init fork.  If unspecified, defaults
+to the main fork.
+   
+  
+ 
+
+ 
+  -l tbl/db/rel
+  --relation=tbl/db/rel
+  
+   
+Display only records touching the given relation.  The relation is
+specified via tablespace OID, database OID, and relfilenode separated
+by slashes, for instance, 1234/12345/12345.  This
+is the same format used for relations in the WAL dump output.
+   
+  
+ 
+
  
   -n limit
   --limit=limit
@@ -183,6 +221,16 @@ PostgreSQL documentation

  
 
+ 
+   -w
+   --fullpage
+   
+   
+   Filter records to only those which have full page writes.
+   
+   
+ 
+
  
   -x xid
   --xid=xid
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index fc081adfb8..430df85480 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -55,6 +55,12 @@ typedef struct XLogDumpConfig
 	bool		filter_by_rmgr_enabled;
 	TransactionId filter_by_xid;
 	bool		filter_by_xid_enabled;
+	RelFileNode filter_by_relation;
+	bool		filter_by_relation_enabled;
+	BlockNumber	filter_by_relation_block;
+	bool		filter_by_relation_block_enabled;
+	ForkNumber	filter_by_relation_forknum;
+	bool		filter_by_fpw;
 } XLogDumpConfig;
 
 typedef struct Stats
@@ -391,6 +397,56 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	return count;
 }
 
+/*
+ * Return whether the given WAL record matches a specific relation and optional block
+ */
+static bool
+XLogRecordMatchesRelationBlock(XLogReaderState *record, RelFileNode matchRnode, BlockNumber matchBlock, ForkNumber matchFork)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		RelFileNode rnode;
+		ForkNumber	forknum;
+		BlockNumber blk;
+
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		XLogRecGetBlockTag(record, block_id, , , );
+
+		if (forknum == matchFork &&
+			matchRnode.spcNode == rnode.spcNode &&
+			matchRnode.dbNode == rnode.dbNode &&
+			matchRnode.relNode == rnode.relNode &&
+			(matchBlock == InvalidBlockNumber || matchBlock == blk))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Return whether the given WAL record contains a full-page write
+ */
+static bool
+XLogRecordHasFPW(XLogReaderState *record)
+{
+	int			block_id;
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		if (!XLogRecHasBlockRef(record, block_id))
+			continue;
+
+		if (XLogRecHasBlockImage(record, block_id))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Calculate the size of a record, split into !FPI and FPI parts.
  */
@@ -765,6 +821,10 @@ usage(void)
 	printf(_("  -b, --bkp-details  output detailed information about backup blocks\n"));
 	printf(_("  -e, --end=RECPTR   stop reading at WAL location RECPTR\n"));
 	printf(_("  -f, --follow   keep retrying after reaching end of WAL\n"));
+	printf(_("  -k, --block=N  with --relation, only show records matching this block\n"));
+	printf(_("  --fork=N   with --relation, only show records matching this fork\n"
+			 " (defaults to 0, the main fork)\n"));
+	printf(_("  -l, --relation=N/N/N   only show records that touch a specific 

Summary Sort workers Stats in EXPLAIN ANALYZE

2022-03-20 Thread Jian Guo

In the current EXPLAIN ANALYZE implementation, the Sort node stats from each 
worker are not summarized: 
https://github.com/postgres/postgres/blob/d4ba8b51c76300f06cc23f4d8a41d9f7210c4866/src/backend/commands/explain.c#L2762

When the number of workers is large, this prints out a huge amount of node 
detail in the plan. I have created this patch to summarize the tuplesort 
stats as AverageSpaceUsed / PeakSpaceUsed, making it behave just like in 
`show_incremental_sort_group_info()`: 
https://github.com/postgres/postgres/blob/d4ba8b51c76300f06cc23f4d8a41d9f7210c4866/src/backend/commands/explain.c#L2890

From 8970f3d0f4a4535457dbe7f625081e592ebc1901 Mon Sep 17 00:00:00 2001
From: Jian Guo 
Date: Mon, 21 Mar 2022 11:19:46 +0800
Subject: [PATCH] Summary Sort workers Stats.

Signed-off-by: Jian Guo 
---
 src/backend/commands/explain.c | 48 +++---
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9f632285b6..5561231d7d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2758,41 +2758,47 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 	if (sortstate->shared_info != NULL)
 	{
 		int			n;
+		const char *sortMethod = NULL;
+		const char *spaceType = NULL;
+		int64		peakSpaceUsed = 0;
+		int64		totalSpaceUsed = 0;
 
 		for (n = 0; n < sortstate->shared_info->num_workers; n++)
 		{
 			TuplesortInstrumentation *sinstrument;
-			const char *sortMethod;
-			const char *spaceType;
-			int64		spaceUsed;
 
 			sinstrument = >shared_info->sinstrument[n];
 			if (sinstrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
 continue;		/* ignore any unfilled slots */
 			sortMethod = tuplesort_method_name(sinstrument->sortMethod);
 			spaceType = tuplesort_space_type_name(sinstrument->spaceType);
-			spaceUsed = sinstrument->spaceUsed;
+			peakSpaceUsed = Max(peakSpaceUsed, sinstrument->spaceUsed);
+			totalSpaceUsed += sinstrument->spaceUsed;
+		}
 
-			if (es->workers_state)
-ExplainOpenWorker(n, es);
+		int64 avgSpaceUsed = sortstate->shared_info->num_workers > 0 ?
+totalSpaceUsed / sortstate->shared_info->num_workers : 0;
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-ExplainIndentText(es);
-appendStringInfo(es->str,
- "Sort Method: %s  %s: " INT64_FORMAT "kB\n",
- sortMethod, spaceType, spaceUsed);
-			}
-			else
-			{
-ExplainPropertyText("Sort Method", sortMethod, es);
-ExplainPropertyInteger("Sort Space Used", "kB", spaceUsed, es);
-ExplainPropertyText("Sort Space Type", spaceType, es);
-			}
+		if (es->workers_state)
+			ExplainOpenWorker(n, es);
 
-			if (es->workers_state)
-ExplainCloseWorker(n, es);
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str,
+			 "Sort Method: %s  %s: average=" INT64_FORMAT "kB  peak=" INT64_FORMAT "kB\n",
+			 sortMethod, spaceType, avgSpaceUsed, peakSpaceUsed);
+		}
+		else
+		{
+			ExplainPropertyText("Sort Method", sortMethod, es);
+			ExplainPropertyInteger("Average Sort Space Used", "kB", avgSpaceUsed, es);
+			ExplainPropertyInteger("Peak Sort Space Used", "kB", peakSpaceUsed, es);
+			ExplainPropertyText("Sort Space Type", spaceType, es);
 		}
+
+		if (es->workers_state)
+			ExplainCloseWorker(n, es);
 	}
 }
 
-- 
2.35.1



Re: Skipping logical replication transactions on subscriber side

2022-03-20 Thread Amit Kapila
On Mon, Mar 21, 2022 at 7:09 AM Euler Taveira  wrote:
>
> On Thu, Mar 17, 2022, at 3:03 AM, Amit Kapila wrote:
>
> On Wed, Mar 16, 2022 at 1:53 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated version patch.
> >
>
> The patch LGTM. I have made minor changes in comments and docs in the
> attached patch. Kindly let me know what you think of the attached?
>
> I am planning to commit this early next week (on Monday) unless there
> are more comments/suggestions.
>
> I reviewed this last version and I have a few comments.
>
> +* If the user set subskiplsn, we do a sanity check to make
> +* sure that the specified LSN is a probable value.
>
> ... user *sets*...
>
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +errmsg("skip WAL location (LSN) must be 
> greater than origin LSN %X/%X",
> +   LSN_FORMAT_ARGS(remote_lsn;
>
> Shouldn't we add the LSN to be skipped in the "(LSN)"?
>
> +* Start a new transaction to clear the subskipxid, if not started
> +* yet.
>
> It seems it means subskiplsn.
>
> + * subskipxid in order to inform users for cases e.g., where the user 
> mistakenly
> + * specified the wrong subskiplsn.
>
> It seems it means subskiplsn.
>
> +sub test_skip_xact
> +{
>
> It seems this function should be named test_skip_lsn. Unless the intention is
> to cover other skip options in the future.
>

I have fixed all the above comments as per your suggestion in the
attached. Do let me know if something is missed?

> src/test/subscription/t/029_disable_on_error.pl |  94 --
> src/test/subscription/t/029_on_error.pl | 183 +++
>
> It seems you are removing a test for 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33.
> I should also name 029_on_error.pl to something else such as 030_skip_lsn.pl 
> or
> a generic name 030_skip_option.pl.
>

As explained in my previous email, I don't think any change is
required for this comment but do let me know if you still think so?

-- 
With Regards,
Amit Kapila.


v17-0001-Add-ALTER-SUBSCRIPTION-.-SKIP.patch
Description: Binary data


Re: Probable CF bot degradation

2022-03-20 Thread Thomas Munro
On Mon, Mar 21, 2022 at 3:11 PM Andres Freund  wrote:
> On 2022-03-21 14:44:55 +1300, Thomas Munro wrote:
> > Those tooltips come from the "name" elements of the .cirrus.yml file
> > where tasks are defined, with Cirrus's task status appended.  If we
> > had a set of monochrome green and red icons with a Linux penguin,
> > FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
> > config file could map task names to icons, and fall back to
> > ticks/crosses for anything unknown/new, including the
> > "CompilerWarnings" one that doesn't have an obvious icon.  Another
> > thing to think about is the 'solid' and 'hollow' variants, the former
> > indicating a recent change.  So we'd need 4 variants of each logo.
> > Also I believe there is a proposal to add NetBSD and OpenBSD in the
> > works.
>
> Might even be sufficient to add just the first letter of the task inside the
> circle, instead of the "check" and x. Right now the letters are unique.

Nice idea, because it retains the information density.  If someone
with web skills would like to pull down the cfbot page and hack up one
of the rows to show an example of a pass, fail, recent-pass,
recent-fail as a circle with a letter in it, and also an "in progress"
symbol that occupies the same amount of space, I'd be keen to try
that.  (The current "in progress" blue circle was originally supposed
to be a pie filling up slowly according to a prediction of finished
time based on past performance, but I never got to that... it's stuck
at 1/4 :-))




Re: Skipping logical replication transactions on subscriber side

2022-03-20 Thread Amit Kapila
On Mon, Mar 21, 2022 at 7:09 AM Euler Taveira  wrote:
>
> src/test/subscription/t/029_disable_on_error.pl |  94 --
> src/test/subscription/t/029_on_error.pl | 183 +++
>
> It seems you are removing a test for 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33.
>

We have covered the same test in the new test file. See "CREATE
SUBSCRIPTION sub CONNECTION '$publisher_connstr' PUBLICATION pub WITH
(disable_on_error = true, ...". This will test the cases we were
earlier testing via 'disable_on_error'.

> I should also name 029_on_error.pl to something else such as 030_skip_lsn.pl 
> or
> a generic name 030_skip_option.pl.
>

The reason to keep the name 'on_error' is that it has tests for both
'disable_on_error' option and 'skip_lsn'. The other option could be
'on_error_action' or something like that. Now, does this make sense to
you?

-- 
With Regards,
Amit Kapila.




Re: Probable CF bot degradation

2022-03-20 Thread Andres Freund
Hi,

On 2022-03-21 14:44:55 +1300, Thomas Munro wrote:
> Those tooltips come from the "name" elements of the .cirrus.yml file
> where tasks are defined, with Cirrus's task status appended.  If we
> had a set of monochrome green and red icons with a Linux penguin,
> FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
> config file could map task names to icons, and fall back to
> ticks/crosses for anything unknown/new, including the
> "CompilerWarnings" one that doesn't have an obvious icon.  Another
> thing to think about is the 'solid' and 'hollow' variants, the former
> indicating a recent change.  So we'd need 4 variants of each logo.
> Also I believe there is a proposal to add NetBSD and OpenBSD in the
> works.

Might even be sufficient to add just the first letter of the task inside the
circle, instead of the "check" and x. Right now the letters are unique.

Greetings,

Andres Freund




Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Robert Haas
On Sun, Mar 20, 2022 at 9:32 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I think I'm guilty of verbal inexactitude here but not bad coding.
> > Checking for *endptr != '\0', as I did, is not sufficient to detect
> > "whether an error occurred," as I alleged. But, in the part of my
> > response you didn't quote, I believe I made it clear that I only need
> > to detect garbage, not out-of-range values. And I think *endptr !=
> > '\0' will do that.
>
> Hmm ... do you consider an empty string to be valid input?

No, and I thought I had checked properly for that condition before
reaching the point in the code where I call strtol(), but it turns out
I have not, which I guess is what Justin has been trying to tell me
for a few emails now.

I'll send an updated patch tomorrow after looking this all over more carefully.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Probable CF bot degradation

2022-03-20 Thread Peter Geoghegan
On Sun, Mar 20, 2022 at 6:45 PM Thomas Munro  wrote:
> Nice idea, if someone with graphics skills is interested in looking into it...

The logo thing wasn't really the point for me. I'd just like to have
the information be more visible, sooner.

I was hoping that there might be a very simple method of making the
same information more visible, that you could implement in only a few
minutes. Perhaps that was optimistic.

-- 
Peter Geoghegan




Re: Probable CF bot degradation

2022-03-20 Thread Thomas Munro
On Mon, Mar 21, 2022 at 1:41 PM Peter Geoghegan  wrote:
> BTW, I think that the usability of the CFBot website would be improved
> if there was a better visual indicator of what each "green tick inside
> a circle" link actually indicates -- what are we testing for each
> green tick/red X shown?
>
> I already see tooltips which show a descriptive string (for example a
> tooltip that says "FreeBSD - 13: COMPLETED" which comes from
>  tags), which is something. But seeing these tooltips
> requires several seconds of mouseover on my browser (Chrome). I'd be
> quite happy if I could see similar tooltips immediately on mouseover
> (which isn't actually possible with standard generic tooltips IIUC),
> or something equivalent. Any kind of visual feedback on the nature of
> the thing tested by a particular CI run that the user can drill down
> to (you know, a Debian logo next to the tick, that kind of thing).

Nice idea, if someone with graphics skills is interested in looking into it...

Those tooltips come from the "name" elements of the .cirrus.yml file
where tasks are defined, with Cirrus's task status appended.  If we
had a set of monochrome green and red icons with a Linux penguin,
FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
config file could map task names to icons, and fall back to
ticks/crosses for anything unknown/new, including the
"CompilerWarnings" one that doesn't have an obvious icon.  Another
thing to think about is the 'solid' and 'hollow' variants, the former
indicating a recent change.  So we'd need 4 variants of each logo.
Also I believe there is a proposal to add NetBSD and OpenBSD in the
works.




Re: Skipping logical replication transactions on subscriber side

2022-03-20 Thread Euler Taveira
On Thu, Mar 17, 2022, at 3:03 AM, Amit Kapila wrote:
> On Wed, Mar 16, 2022 at 1:53 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated version patch.
> >
> 
> The patch LGTM. I have made minor changes in comments and docs in the
> attached patch. Kindly let me know what you think of the attached?
> 
> I am planning to commit this early next week (on Monday) unless there
> are more comments/suggestions.
I reviewed this last version and I have a few comments.

+* If the user set subskiplsn, we do a sanity check to make
+* sure that the specified LSN is a probable value.

... user *sets*...

+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("skip WAL location (LSN) must be 
greater than origin LSN %X/%X",
+   LSN_FORMAT_ARGS(remote_lsn;

Shouldn't we add the LSN to be skipped in the "(LSN)"?

+* Start a new transaction to clear the subskipxid, if not started
+* yet.

It seems it means subskiplsn.

+ * subskipxid in order to inform users for cases e.g., where the user 
mistakenly
+ * specified the wrong subskiplsn.

It seems it means subskiplsn.

+sub test_skip_xact
+{

It seems this function should be named test_skip_lsn. Unless the intention is
to cover other skip options in the future.

src/test/subscription/t/029_disable_on_error.pl |  94 --
src/test/subscription/t/029_on_error.pl | 183 +++

It seems you are removing a test for 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33.
I should also name 029_on_error.pl to something else such as 030_skip_lsn.pl or
a generic name 030_skip_option.pl.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Robert Haas
On Sun, Mar 20, 2022 at 3:40 PM Justin Pryzby  wrote:
> The user-facing docs are already standardized using "compression method", with
> 2 exceptions, of which one is contrib/ and the other is what I'm suggesting to
> make consistent here.
>
> $ git grep 'compression algorithm' doc
> doc/src/sgml/pgcrypto.sgml:Which compression algorithm to use.  Only 
> available if
> doc/src/sgml/ref/pg_basebackup.sgml:compression algorithm is 
> selected, or if server-side compression

Well, if you just count the number of occurrences of each string in
the documentation, sure. But all of the ones that are talking about a
compression method seem to have to do with configurable TOAST
compression, and the fact that the documentation for that feature is
more extensive than for the pre-existing feature that refers to a
compression algorithm does not, at least in my view, turn it into a
project standard from which no deviation is permitted.

> > Did the latter. The former would need to be fixed in a bunch of places
> > and while I'm happy to accept an expert opinion on exactly what needs
> > to be done here, I don't want to try to do it and do it wrong. Better
> > to let someone with good knowledge of the subject matter patch it up
> > later than do a crummy job now.
>
> I believe it just needs _("foo")
> See git grep '= _('

Hmm. Maybe.

> I mentioned another issue off-list:
> pg_basebackup.c:2741:10: warning: suggest parentheses around assignment used 
> as truth value [-Wparentheses]
>  2741 |   Assert(compressloc = COMPRESS_LOCATION_SERVER);
>   |  ^~~
> pg_basebackup.c:2741:3: note: in expansion of macro ‘Assert’
>  2741 |   Assert(compressloc = COMPRESS_LOCATION_SERVER);
>
> This crashes the server using your v2 patch:
>
> src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
> --no-sync --no-manifest --compress=server-zstd:level, |wc -c

Well that's unfortunate. Will fix.

> I wonder whether the syntax should really use both ":" and ",".
> Maybe ":" isn't needed at all.

I don't think we should treat the compression method name in the same
way as a compression algorithm option.

> This patch also needs to update the other user-facing docs.

Which ones exactly?

> typo: contain a an

OK, will fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Tom Lane
Robert Haas  writes:
> I think I'm guilty of verbal inexactitude here but not bad coding.
> Checking for *endptr != '\0', as I did, is not sufficient to detect
> "whether an error occurred," as I alleged. But, in the part of my
> response you didn't quote, I believe I made it clear that I only need
> to detect garbage, not out-of-range values. And I think *endptr !=
> '\0' will do that.

Hmm ... do you consider an empty string to be valid input?

regards, tom lane




Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Robert Haas
On Sun, Mar 20, 2022 at 3:11 PM Tom Lane  wrote:
> > Even after reading the man page for strtol, it's not clear to me that
> > this is needed. That page represents checking *endptr != '\0' as
> > sufficient to tell whether an error occurred.
>
> I'm not sure whose man page you looked at, but the POSIX standard [1]
> has a pretty clear opinion about this:
>
> Since 0, {LONG_MIN} or {LLONG_MIN}, and {LONG_MAX} or {LLONG_MAX} are
> returned on error and are also valid returns on success, an
> application wishing to check for error situations should set errno to
> 0, then call strtol() or strtoll(), then check errno.
>
> Checking *endptr != '\0' is for detecting whether there is trailing
> garbage after the number; which may be an error case or not as you
> choose, but it's a different matter.

I think I'm guilty of verbal inexactitude here but not bad coding.
Checking for *endptr != '\0', as I did, is not sufficient to detect
"whether an error occurred," as I alleged. But, in the part of my
response you didn't quote, I believe I made it clear that I only need
to detect garbage, not out-of-range values. And I think *endptr !=
'\0' will do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Probable CF bot degradation

2022-03-20 Thread Peter Geoghegan
On Sun, Mar 20, 2022 at 4:23 PM Thomas Munro  wrote:
> On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
>  wrote:
> > Would you know how long the expected bitrot re-check period for CF
> > entries that haven't been updated is, or could the bitrot-checking
> > queue be displayed somewhere to indicate the position of a patch in
> > this queue?
>
> I see that your patches were eventually retested.

What about just seeing if the patch still applies cleanly against HEAD
much more frequently? Obviously that would be way cheaper than running
all of the tests again.

Perhaps Cirrus provides a way of taking advantage of that? (Or maybe
that happens already, in which case please enlighten me.)

BTW, I think that the usability of the CFBot website would be improved
if there was a better visual indicator of what each "green tick inside
a circle" link actually indicates -- what are we testing for each
green tick/red X shown?

I already see tooltips which show a descriptive string (for example a
tooltip that says "FreeBSD - 13: COMPLETED" which comes from
 tags), which is something. But seeing these tooltips
requires several seconds of mouseover on my browser (Chrome). I'd be
quite happy if I could see similar tooltips immediately on mouseover
(which isn't actually possible with standard generic tooltips IIUC),
or something equivalent. Any kind of visual feedback on the nature of
the thing tested by a particular CI run that the user can drill down
to (you know, a Debian logo next to the tick, that kind of thing).

> I had been meaning to stump up the USD$10/month it costs to double the
> CPU limits from the basic free Cirrus account, and I've just now done
> that and told cfbot it's allowed to test 4 branches at once and to try
> to test every branch every 24 hours.  Let's see how that goes.

Extravagance!

-- 
Peter Geoghegan




Re: logical replication restrictions

2022-03-20 Thread Euler Taveira
On Mon, Feb 28, 2022, at 9:18 PM, Euler Taveira wrote:
> Long time, no patch. Here it is. I will provide documentation in the next
> version. I would appreciate some feedback.
This patch is broken since commit 705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33. I
rebased it.

I added documentation that explains how this parameter works. I decided to
rename the parameter from apply_delay to min_apply_delay to use the same
terminology from the physical replication. IMO the new name seems clear that
there isn't a guarantee that we are always x ms behind the publisher. Indeed,
due to processing/transferring the delay might be higher than the specified
interval.

I refactored the way the delay is applied. The previous patch covered only
regular transactions; this new one also covers prepared transactions. The
current design intercepts the transaction at its first change (at the time
it is about to start the transaction to apply the changes) and applies the
delay before effectively starting the transaction. The previous patch used
begin_replication_step() as this point. However, to support prepared
transactions I changed the apply_delay signature to accept a timestamp
parameter (because we use another variable to calculate the delay for prepared
transactions -- prepare_time). Hence, the apply_delay() calls moved to other
places -- apply_handle_begin() and apply_handle_begin_prepare().

The new code does not apply the delay in 2 situations:

* STREAM START: streamed transactions might not have commit_time or
  prepare_time set. I'm afraid it is not possible to use the referred variables
  because at STREAM START time we don't have a transaction commit time. The
  protocol could provide a timestamp that indicates when it starts streaming
  the transaction then we could use it to apply the delay. Unfortunately, we
  don't have it. Having said that, this new patch does not apply the delay
  for streamed transactions.
* non-transactional messages: the delay could be applied to non-transactional
  messages too, since they are sent independently of any containing
  transaction. Because logical replication does not send such messages to the
  subscriber, this is not an issue. However, consumers that use pgoutput and
  want to implement a delay will require it.

I'm still looking for a way to support streamed transactions without much
surgery into the logical replication protocol.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From 9ad783c1259aed9bef81877265c648aae84ed5d8 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Sat, 6 Nov 2021 11:31:10 -0300
Subject: [PATCH v2] Time-delayed logical replication subscriber

Similar to physical replication, a time-delayed copy of the data for
logical replication is useful for some scenarios (specially to fix
errors that might cause data loss).

If the subscriber sets min_apply_delay parameter, the logical
replication worker will delay the transaction commit for min_apply_delay
milliseconds.

The delay is calculated between the WAL time stamp and the current time
on the subscriber.

The delay occurs only on WAL records for transaction begins. Regular and
prepared transactions are covered. Streamed transactions are not
delayed. It should be implemented in future releases.

Author: Euler Taveira
Discussion: https://postgr.es/m/CAB-JLwYOYwL=xtyaxkih5ctm_vm8kjkh7aaitckvmch4rzr...@mail.gmail.com
---
 doc/src/sgml/ref/alter_subscription.sgml   |   5 +-
 doc/src/sgml/ref/create_subscription.sgml  |  33 +
 src/backend/catalog/pg_subscription.c  |   1 +
 src/backend/catalog/system_views.sql   |   4 +-
 src/backend/commands/subscriptioncmds.c|  46 ++-
 src/backend/replication/logical/worker.c   |  61 +
 src/backend/utils/adt/timestamp.c  |   8 ++
 src/bin/pg_dump/pg_dump.c  |  13 +-
 src/bin/pg_dump/pg_dump.h  |   1 +
 src/bin/psql/describe.c|  11 +-
 src/include/catalog/pg_subscription.h  |   2 +
 src/include/utils/timestamp.h  |   2 +
 src/test/regress/expected/subscription.out | 141 +
 src/test/regress/sql/subscription.sql  |  20 +++
 src/test/subscription/t/030_apply_delay.pl |  76 +++
 15 files changed, 358 insertions(+), 66 deletions(-)
 create mode 100644 src/test/subscription/t/030_apply_delay.pl

diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 58b78a94ea..b764f8eba8 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -204,8 +204,9 @@ ALTER SUBSCRIPTION name RENAME TO <
   information.  The parameters that can be altered
   are slot_name,
   synchronous_commit,
-  binary, streaming, and
-  disable_on_error.
+  binary, streaming,
+  disable_on_error, and
+  min_apply_delay.
  
 

diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index b701752fc9..2bc8deb132 100644
--- 

Re: Probable CF bot degradation

2022-03-20 Thread Andres Freund
Hi,

On 2022-03-21 12:23:02 +1300, Thomas Munro wrote:
> or actually a shade fewer because it's pretty dumb and only wakes up once a
> minute to decide what to do

Might be worth using https://cirrus-ci.org/api/#webhooks to trigger a run of
the scheduler. Probably still want to have the timeout based "scheduling
iterations", but perhaps at a lower frequency?
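The hybrid suggested here, webhook-driven wakeups plus a slower timeout fallback, can be sketched with a condition-variable-style event. This is illustrative Python only; `SchedulerWake` and its methods are invented names, not cfbot code, and the real Cirrus webhook payload handling is omitted:

```python
import threading


class SchedulerWake:
    """Wake the scheduling loop on a webhook event, with a timer fallback.

    The scheduler blocks in wait_for_work(); an HTTP handler calls
    on_webhook() when Cirrus posts a build event, waking it immediately.
    If no webhook arrives, the wait still expires after fallback_seconds,
    preserving the timeout-based "scheduling iterations" at a lower rate.
    """

    def __init__(self, fallback_seconds: float):
        self._event = threading.Event()
        self._fallback = fallback_seconds

    def on_webhook(self) -> None:
        # Called from the webhook HTTP handler thread.
        self._event.set()

    def wait_for_work(self) -> bool:
        # True if woken by a webhook, False if the fallback timer expired.
        # (Simplified: a set() racing with clear() is acceptable here, the
        # scheduler re-scans everything on each iteration anyway.)
        fired = self._event.wait(timeout=self._fallback)
        self._event.clear()
        return fired
```

The scheduler loop then runs one iteration per wakeup regardless of which mechanism fired.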

Greetings,

Andres Freund




Re: Probable CF bot degradation

2022-03-20 Thread Andres Freund
Hi,

On 2022-03-21 12:23:02 +1300, Thomas Munro wrote:
> It was set to try to recheck every ~48 hours, though it couldn't quite
> always achieve that when the total number of eligible submissions is
> too large.  In this case it had stalled for too long after the github
> outage, which I'm going to try to improve.  The reason for the 48+
> hour cycle is the Windows tests now take ~25 minutes (since we started
> actually running all the tests on that platform)

I see 26-28 minutes regularly :(. And that doesn't even include the "boot
time" of the test of around 3-4 min, which is quite a bit higher for Windows
than for the other OSs.


> and we could only
> have two Windows tasts running at a time in practice, because the
> limit for Windows was 8 CPUs, and we use 4 for each task, which means
> we could only test ~115 branches per day, or actually a shade fewer
> because it's pretty dumb and only wakes up once a minute to decide
> what to do, and we currently have 242 submissions (though some don't
> apply, those are free, so the number varies over time...).  There are
> limits on the Unixes too but they are more generous, and the Unix
> tests only take 4-10 minutes, so we can ignore that for now, it's all
> down to Windows.

I wonder if it's worth using the number of concurrently running Windows tasks
as the limit, rather than the number of commits being tested
concurrently. It's not rare for Windows to fail more quickly than the other
OSs. But probably the 4 concurrent tests are good enough for now...

I'd love to merge the patch adding mingw CI testing, which'd increase the
pressure substantially :/


> I had been meaning to stump up the USD$10/month it costs to double the
> CPU limits from the basic free Cirrus account, and I've just now done
> that and told cfbot it's allowed to test 4 branches at once and to try
> to test every branch every 24 hours.  Let's see how that goes.

Yay.


> Here's hoping we can cut down the time it takes to run the tests on
> Windows... there's some really dumb stuff happening there.  Top items
> I'm aware of:  (1) general lack of test concurrency, (2) exec'ing new
> backends is glacially slow on that OS but we do it for every SQL
> statement in the TAP tests and every regression test script (I have
> some patches for this to share after the code freeze).

3) build is quite slow and has no caching


With meson the difference of 1, 3 is quite visible. Look at
https://cirrus-ci.com/build/5265480968568832

current buildsystem: 28:07 min
meson w/ msbuild: 22:21 min
meson w/ ninja: 19:24 min

meson runs quite a few tests that the "current buildsystem" doesn't, so the
win is actually bigger than the time difference indicates...


Greetings,

Andres Freund




Re: Tab completion for SET TimeZone

2022-03-20 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> After thinking a bit harder, I realized that the SchemaQuery
> infrastructure has no way to deal with the case of the input text not
> being a prefix of what we want the output to be, so it can't do something
> comparable to Query_for_list_of_timezone_names_quoted_out.  Maybe someday
> we'll feel like adding that, but COMPLETE_WITH_ENUM_VALUE isn't a
> compelling enough reason in current usage.  So I just tweaked the comment
> a bit.

Fair enough.

> Pushed that way.

Thanks!

>   regards, tom lane

- ilmari




Re: Probable CF bot degradation

2022-03-20 Thread Thomas Munro
On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro  wrote:
> On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
>  wrote:
> > Would you know how long the expected bitrot re-check period for CF
> > entries that haven't been updated is, or could the bitrot-checking
> > queue be displayed somewhere to indicate the position of a patch in
> > this queue?

Also, as for the show-me-the-queue page, yeah that's a good idea and
quite feasible.  I'll look into that in a bit.

> > Additionally, are there plans to validate commits of the main branch
> > before using them as a base for CF entries, so that "bad" commits on
> > master won't impact CFbot results as easy?
>
> How do you see this working?

[Now with more coffee on board]  Oh, right, I see, you're probably
thinking that we could look at
https://github.com/postgres/postgres/commits/master and take the most
recent passing commit as a base.  Hmm, interesting idea.




Re: Probable CF bot degradation

2022-03-20 Thread Thomas Munro
On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
 wrote:
> Would you know how long the expected bitrot re-check period for CF
> entries that haven't been updated is, or could the bitrot-checking
> queue be displayed somewhere to indicate the position of a patch in
> this queue?

I see that your patches were eventually retested.

It was set to try to recheck every ~48 hours, though it couldn't quite
always achieve that when the total number of eligible submissions is
too large.  In this case it had stalled for too long after the github
outage, which I'm going to try to improve.  The reason for the 48+
hour cycle is the Windows tests now take ~25 minutes (since we started
actually running all the tests on that platform), and we could only
have two Windows tasks running at a time in practice, because the
limit for Windows was 8 CPUs, and we use 4 for each task, which means
we could only test ~115 branches per day, or actually a shade fewer
because it's pretty dumb and only wakes up once a minute to decide
what to do, and we currently have 242 submissions (though some don't
apply, those are free, so the number varies over time...).  There are
limits on the Unixes too but they are more generous, and the Unix
tests only take 4-10 minutes, so we can ignore that for now, it's all
down to Windows.

I had been meaning to stump up the USD$10/month it costs to double the
CPU limits from the basic free Cirrus account, and I've just now done
that and told cfbot it's allowed to test 4 branches at once and to try
to test every branch every 24 hours.  Let's see how that goes.

Here's hoping we can cut down the time it takes to run the tests on
Windows... there's some really dumb stuff happening there.  Top items
I'm aware of:  (1) general lack of test concurrency, (2) exec'ing new
backends is glacially slow on that OS but we do it for every SQL
statement in the TAP tests and every regression test script (I have
some patches for this to share after the code freeze).

> Additionally, are there plans to validate commits of the main branch
> before using them as a base for CF entries, so that "bad" commits on
> master won't impact CFbot results as easy?

How do you see this working?

I have wondered about some kind of way to click a button to say "do
this one again now", but I guess that sort of user interaction should
ideally happen after merging this thing into the Commitfest app,
because it already has auth, and interactive Python/Django web stuff.




Re: [PATCH] pg_statio_all_tables: several rows per table due to invalid TOAST index

2022-03-20 Thread Tom Lane
I wrote:
> ... We need to sum separately over the
> table indexes and toast indexes, and I don't immediately see how
> to do that without creating an optimization fence.

After a bit of further fooling, I found that we could make that
work with LEFT JOIN LATERAL.  This formulation has a different
problem, which is that if you do want most or all of the output,
computing each sub-aggregation separately is probably less
efficient than it could be.  But this is probably the better way
to go unless someone has an even better idea.

regards, tom lane

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index bb1ac30cd1..e319b99a9d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -717,22 +717,31 @@ CREATE VIEW pg_statio_all_tables AS
 pg_stat_get_blocks_fetched(C.oid) -
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_read,
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_hit,
-sum(pg_stat_get_blocks_fetched(I.indexrelid) -
-pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_read,
-sum(pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_hit,
+I.idx_blks_read AS idx_blks_read,
+I.idx_blks_hit AS idx_blks_hit,
 pg_stat_get_blocks_fetched(T.oid) -
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_read,
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_hit,
-pg_stat_get_blocks_fetched(X.indexrelid) -
-pg_stat_get_blocks_hit(X.indexrelid) AS tidx_blks_read,
-pg_stat_get_blocks_hit(X.indexrelid) AS tidx_blks_hit
+X.idx_blks_read AS tidx_blks_read,
+X.idx_blks_hit AS tidx_blks_hit
 FROM pg_class C LEFT JOIN
-pg_index I ON C.oid = I.indrelid LEFT JOIN
-pg_class T ON C.reltoastrelid = T.oid LEFT JOIN
-pg_index X ON T.oid = X.indrelid
+pg_class T ON C.reltoastrelid = T.oid
 LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-WHERE C.relkind IN ('r', 't', 'm')
-GROUP BY C.oid, N.nspname, C.relname, T.oid, X.indexrelid;
+LEFT JOIN LATERAL (
+  SELECT sum(pg_stat_get_blocks_fetched(indexrelid) -
+ pg_stat_get_blocks_hit(indexrelid))::bigint
+ AS idx_blks_read,
+ sum(pg_stat_get_blocks_hit(indexrelid))::bigint
+ AS idx_blks_hit
+  FROM pg_index WHERE indrelid = C.oid ) I ON true
+LEFT JOIN LATERAL (
+  SELECT sum(pg_stat_get_blocks_fetched(indexrelid) -
+ pg_stat_get_blocks_hit(indexrelid))::bigint
+ AS idx_blks_read,
+ sum(pg_stat_get_blocks_hit(indexrelid))::bigint
+ AS idx_blks_hit
+  FROM pg_index WHERE indrelid = T.oid ) X ON true
+WHERE C.relkind IN ('r', 't', 'm');
 
 CREATE VIEW pg_statio_sys_tables AS
 SELECT * FROM pg_statio_all_tables
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..dd54d71098 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2261,19 +2261,24 @@ pg_statio_all_tables| SELECT c.oid AS relid,
 c.relname,
 (pg_stat_get_blocks_fetched(c.oid) - pg_stat_get_blocks_hit(c.oid)) AS heap_blks_read,
 pg_stat_get_blocks_hit(c.oid) AS heap_blks_hit,
-(sum((pg_stat_get_blocks_fetched(i.indexrelid) - pg_stat_get_blocks_hit(i.indexrelid::bigint AS idx_blks_read,
-(sum(pg_stat_get_blocks_hit(i.indexrelid)))::bigint AS idx_blks_hit,
+i.idx_blks_read,
+i.idx_blks_hit,
 (pg_stat_get_blocks_fetched(t.oid) - pg_stat_get_blocks_hit(t.oid)) AS toast_blks_read,
 pg_stat_get_blocks_hit(t.oid) AS toast_blks_hit,
-(pg_stat_get_blocks_fetched(x.indexrelid) - pg_stat_get_blocks_hit(x.indexrelid)) AS tidx_blks_read,
-pg_stat_get_blocks_hit(x.indexrelid) AS tidx_blks_hit
+x.idx_blks_read AS tidx_blks_read,
+x.idx_blks_hit AS tidx_blks_hit
FROM pg_class c
- LEFT JOIN pg_index i ON ((c.oid = i.indrelid)))
  LEFT JOIN pg_class t ON ((c.reltoastrelid = t.oid)))
- LEFT JOIN pg_index x ON ((t.oid = x.indrelid)))
  LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
-  WHERE (c.relkind = ANY (ARRAY['r'::"char", 't'::"char", 'm'::"char"]))
-  GROUP BY c.oid, n.nspname, c.relname, t.oid, x.indexrelid;
+ LEFT JOIN LATERAL ( SELECT (sum((pg_stat_get_blocks_fetched(pg_index.indexrelid) - pg_stat_get_blocks_hit(pg_index.indexrelid::bigint AS idx_blks_read,
+(sum(pg_stat_get_blocks_hit(pg_index.indexrelid)))::bigint AS idx_blks_hit
+   FROM pg_index
+  WHERE (pg_index.indrelid = c.oid)) i ON (true))
+ LEFT JOIN LATERAL ( SELECT (sum((pg_stat_get_blocks_fetched(pg_index.indexrelid) - 

Re: [PATCH] pg_statio_all_tables: several rows per table due to invalid TOAST index

2022-03-20 Thread Tom Lane
Andrei Zubkov  writes:
> On Tue, 2021-11-30 at 17:29 +0900, Michael Paquier wrote:
>> Hmm.  Why should we care about invalid indexes at all, including
>> pg_statio_all_indexes?

> I think we should care about them at least because they exist and
> can consume resources. For example, an invalid index still has to be
> updated by DML operations.
> Of course we can exclude such indexes from the view using the isvalid,
> isready, and islive fields. But in that case we should mention this in
> the docs, and more importantly, new index states could appear in the
> future, causing changes in the view definition. Counting all indexes
> regardless of state seems more reasonable to me.

Yeah, I agree, especially since we do it like that for the table's
own indexes.  I have a couple of comments though:

1. There's a silly typo in the view definition (it outputs tidx_blks_read
twice).  Fixed in the attached v2.

2. Historically, if you put any constraints on the view output, like
select * from pg_statio_all_tables where relname like 'foo%';
you'd get a commensurate reduction in the amount of work done.  With
this version, you don't: the CTE will get computed in its entirety
even if we just need one row of its result.  This seems pretty bad,
especially for installations with many tables --- I suspect many
users would think this cure is worse than the disease.

I'm not quite sure what to do about #2.  I thought of just removing
X.indexrelid from the GROUP BY clause and summing over the toast
index(es) as we do for the table's index(es).  But that doesn't
work: if there are N > 1 table indexes, then the counts for
the toast index(es) will be multiplied by N, and conversely if
there are multiple toast indexes then the counts for the table
indexes will be scaled up.  We need to sum separately over the
table indexes and toast indexes, and I don't immediately see how
to do that without creating an optimization fence.
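The multiplication effect described here can be seen with a toy model of the flat join: summing both columns over the rows of (table index x toast index) multiplies each side's totals by the number of rows contributed by the other side. Illustrative Python, not SQL:

```python
from itertools import product


def summed_over_cross_join(table_idx_counts, toast_idx_counts):
    """Model GROUP BY-ing a join that pairs every table index row with
    every toast index row, then summing both count columns."""
    rows = list(product(table_idx_counts, toast_idx_counts))
    idx_sum = sum(t for t, _ in rows)    # table-index column
    tidx_sum = sum(x for _, x in rows)   # toast-index column
    return idx_sum, tidx_sum


# Two table indexes (10, 20) and one toast index (5): the table total
# comes out right (30) only because M=1, but the toast total is 10,
# i.e. scaled by N=2. With two toast indexes (5, 7) both totals are
# doubled: (60, 24) instead of the correct (30, 12).
```

This is why the table-index and toast-index sums have to be computed in separate aggregations (the CTE or LATERAL approach) rather than in one grouped join.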

regards, tom lane

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index bb1ac30cd1..a977efd3c5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -710,6 +710,18 @@ CREATE VIEW pg_stat_xact_user_tables AS
   schemaname !~ '^pg_toast';
 
 CREATE VIEW pg_statio_all_tables AS
+WITH indstat AS (
+SELECT
+indrelid,
+sum(pg_stat_get_blocks_fetched(indexrelid) -
+pg_stat_get_blocks_hit(indexrelid))::bigint
+AS idx_blks_read,
+sum(pg_stat_get_blocks_hit(indexrelid))::bigint
+AS idx_blks_hit
+FROM
+pg_index
+GROUP BY indrelid
+)
 SELECT
 C.oid AS relid,
 N.nspname AS schemaname,
@@ -717,22 +729,19 @@ CREATE VIEW pg_statio_all_tables AS
 pg_stat_get_blocks_fetched(C.oid) -
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_read,
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_hit,
-sum(pg_stat_get_blocks_fetched(I.indexrelid) -
-pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_read,
-sum(pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_hit,
+I.idx_blks_read AS idx_blks_read,
+I.idx_blks_hit AS idx_blks_hit,
 pg_stat_get_blocks_fetched(T.oid) -
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_read,
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_hit,
-pg_stat_get_blocks_fetched(X.indexrelid) -
-pg_stat_get_blocks_hit(X.indexrelid) AS tidx_blks_read,
-pg_stat_get_blocks_hit(X.indexrelid) AS tidx_blks_hit
+X.idx_blks_read AS tidx_blks_read,
+X.idx_blks_hit AS tidx_blks_hit
 FROM pg_class C LEFT JOIN
-pg_index I ON C.oid = I.indrelid LEFT JOIN
+indstat I ON C.oid = I.indrelid LEFT JOIN
 pg_class T ON C.reltoastrelid = T.oid LEFT JOIN
-pg_index X ON T.oid = X.indrelid
+indstat X ON T.oid = X.indrelid
 LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-WHERE C.relkind IN ('r', 't', 'm')
-GROUP BY C.oid, N.nspname, C.relname, T.oid, X.indexrelid;
+WHERE C.relkind IN ('r', 't', 'm');
 
 CREATE VIEW pg_statio_sys_tables AS
 SELECT * FROM pg_statio_all_tables
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ac468568a1..df1a3c1297 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2256,24 +2256,30 @@ pg_statio_all_sequences| SELECT c.oid AS relid,
FROM (pg_class c
  LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
   WHERE (c.relkind = 'S'::"char");
-pg_statio_all_tables| SELECT c.oid AS relid,
+pg_statio_all_tables| WITH indstat AS (
+ SELECT pg_index.indrelid,
+(sum((pg_stat_get_blocks_fetched(pg_index.indexrelid) - 

Re: shared-memory based stats collector - v66

2022-03-20 Thread Melanie Plageman
On Sun, Mar 20, 2022 at 12:58 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
> > Attached is a TAP test to check that stats are cleaned up on a physical
> > replica after the objects they concern are dropped on the primary.
>
> Thanks!
>
>
> > I'm not sure that the extra force next flush on standby is needed after
> > drop on primary since drop should report stats and I wait for catchup.
>
> A drop doesn't force stats in other sessions to be flushed immediately, so
> unless I misunderstand, yes, it's needed.
>
>
> > Also, I don't think the tests with DROP SCHEMA actually exercise another
> > code path, so it might be worth cutting those.
>
> > +/*
> > + * Checks for presence of stats for object with provided object oid of kind
> > + * specified in the type string in database of provided database oid.
> > + *
> > + * For subscription stats, only the objoid will be used. For database 
> > stats,
> > + * only the dboid will be used. The value passed in for the unused 
> > parameter is
> > + * discarded.
> > + * TODO: should it be 'pg_stat_stats_present' instead of 
> > 'pg_stat_stats_exist'?
> > + */
> > +Datum
> > +pg_stat_stats_exist(PG_FUNCTION_ARGS)
>
> Should we revoke stats for this one from PUBLIC (similar to the reset 
> functions)?
>
>
> > +# Set track_functions to all on standby
> > +$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");
>
> That should already be set, cloning from the primary includes the
> configuration from that point in time.
>
> > +$node_standby->restart;
>
> FWIW, it'd also only require a reload
>

Addressed all of these points in
v2-0001-add-replica-cleanup-tests.patch

also added a new test file in
v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
testing correct behavior after a crash and when stats file is invalid

- Melanie
From c521ba1dcb13ba236181adce06893c42ec439877 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 17 Mar 2022 21:54:16 -0400
Subject: [PATCH v2 1/2] add replica cleanup tests

---
 src/backend/catalog/system_functions.sql  |   2 +
 src/backend/utils/activity/pgstat.c   |  32 +++
 src/backend/utils/adt/pgstatfuncs.c   |  22 ++
 src/include/catalog/pg_proc.dat   |   6 +
 src/include/pgstat.h  |   3 +
 .../recovery/t/029_stats_cleanup_replica.pl   | 228 ++
 6 files changed, 293 insertions(+)
 create mode 100644 src/test/recovery/t/029_stats_cleanup_replica.pl

diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 81bac6f581..4f2c80b72e 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -639,6 +639,8 @@ REVOKE EXECUTE ON FUNCTION pg_stat_reset_single_function_counters(oid) FROM publ
 
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_replication_slot(text) FROM public;
 
+REVOKE EXECUTE ON FUNCTION pg_stat_stats_exist(oid, oid, text) FROM public;
+
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_subscription_stats(oid) FROM public;
 
 REVOKE EXECUTE ON FUNCTION lo_import(text) FROM public;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index e9fab35a46..4c0fea62b4 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -159,6 +159,7 @@ static void pgstat_shared_reset_timestamp(PgStatKind kind, PgStatShm_StatEntryHe
 static void pgstat_pending_delete(PgStatSharedRef *shared_ref);
 static bool pgstat_pending_flush_stats(bool nowait);
 
+static PgStatKind pgstat_kind_for(char *kind_str);
 static inline const PgStatKindInfo *pgstat_kind_info_for(PgStatKind kind);
 
 
@@ -1315,6 +1316,23 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 	return 0;
 }
 
+bool
+pgstat_shared_stat_exists(char *kind_str, Oid dboid, Oid objoid)
+{
+	PgStatKind kind = pgstat_kind_for(kind_str);
+
+	/*
+	 * Ignore dboid or objoid for subscription and db stats respectively.
+	 */
+	if (kind == PGSTAT_KIND_SUBSCRIPTION)
+		dboid = InvalidOid;
+
+	if (kind == PGSTAT_KIND_DB)
+		objoid = InvalidOid;
+
+	return (bool) pgstat_fetch_entry(kind, dboid, objoid);
+}
+
 
 /* 
  * Helper functions
@@ -1343,6 +1361,20 @@ pgstat_setup_memcxt(void)
   ALLOCSET_SMALL_SIZES);
 }
 
+static PgStatKind
+pgstat_kind_for(char *kind_str)
+{
+	for (int kind = 0; kind <= PGSTAT_KIND_LAST; kind++)
+	{
+		if (pg_strcasecmp(kind_str, pgstat_kind_infos[kind].name) == 0)
+			return kind;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid statistic kind: \"%s\"", kind_str)));
+}
+
 static inline const PgStatKindInfo *
 pgstat_kind_info_for(PgStatKind kind)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5d843cde24..62b2df856f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2420,3 +2420,25 

Re: A test for replay of regression tests

2022-03-20 Thread Thomas Munro
On Mon, Mar 21, 2022 at 2:34 AM Andrew Dunstan  wrote:
> On 3/20/22 05:36, Thomas Munro wrote:
> > On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro  wrote:
> >> I'll try to come up with the perl needed to see the regression.diffs
> >> next time...
> > Here's my proposed change to achieve that.
>
> I think that's OK.

Thanks for looking!  Pushed.




Re: Patch proposal - parameter to limit amount of FPW because of hint bits per second

2022-03-20 Thread Peter Geoghegan
On Sun, Mar 20, 2022 at 12:44 PM Michail Nikolaev
 wrote:
> So, I was thinking about a way to avoid such downtimes. What is about
> a patch to add parameters to limit the number of FPW caused by LP_DEAD
> bits per second? It is always possible to skip the setting of LP_DEAD
> for future time. Such a parameter will make it possible to spread all
> additional WAL traffic over time by some Mbit/s.
>
> Does it look worth its implementation?

The following approach seems like it might fix the problem in the way
that you hope for:

* Add code to _bt_killitems() that detects if it has generated an FPI,
just to set some LP_DEAD bits.

* Instead of avoiding the FPI when this happens, proactively call
_bt_simpledel_pass() just before _bt_killitems() returns. Accept the
immediate cost of setting an LP_DEAD bit, just like today, but avoid
repeated FPIs.

The idea here is to take advantage of the enhancements to LP_DEAD
index tuple deletion (or "simple deletion") in Postgres 14.
_bt_simpledel_pass() will now do a good job of deleting "extra" heap
TIDs in practice, with many workloads. So in your scenario it's likely
that the proactive index tuple deletions will be able to delete many
"extra" nearby index tuples whose TIDs point to the same heap page.

This will be useful to you because it cuts down on repeated FPIs for
the same leaf page. You still get the FPIs, but in practice you may
get far fewer of them by triggering these proactive deletions, that
can easily delete many TIDs in batch. I think that it's better to
pursue an approach like this because it's more general.

It would perhaps also make sense to not set LP_DEAD bits in
_bt_killitems() when we see that doing so right now generates an FPI,
*and* we also see that existing LP_DEAD markings are enough to make
_bt_simpledel_pass() delete the index tuple that we want to mark
LP_DEAD now, anyway (because it'll definitely visit the same heap
block later on). That does mean that we pay a small cost, but at least
we won't miss out on deleting any index tuples as a result of avoiding
an FPI. This second idea is also much more general than simply
avoiding FPIs in general.
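This second idea boils down to a small policy decision, sketched below. This is hypothetical pseudologic in Python for clarity only; the real change would live in C inside _bt_killitems(), and both predicate names are invented:

```python
def should_set_lp_dead(would_generate_fpi: bool,
                       existing_dead_bits_suffice: bool) -> bool:
    """Decide whether to set an LP_DEAD hint bit now.

    Skip the write only when both conditions hold: it would force a
    full-page image, *and* LP_DEAD bits already set on the page are
    enough that a later simple-deletion pass will visit the same heap
    block and delete this index tuple anyway. In every other case the
    hint bit is still worth setting immediately.
    """
    return not (would_generate_fpi and existing_dead_bits_suffice)
```

The point is that the only skipped writes are ones whose benefit is already guaranteed by another path, so no index tuple deletion is lost by avoiding the FPI.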

-- 
Peter Geoghegan




Re: Hardening heap pruning code (was: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum)

2022-03-20 Thread Matthias van de Meent
On Sun, 20 Mar 2022 at 04:48, Peter Geoghegan  wrote:
>
> Attached is v6, which goes a bit further than v5 in using local state
> that we build up-front to describe the state of the page being pruned
> (rather than rereading the page itself).

I didn't test the code, so these comments reflect my general impression of
this patch, based on visual analysis only.

> > > We now do "3 passes" over the page. The first is the aforementioned
> > > "precalculate HTSV" pass, the second is for determining the extent of
> > > HOT chains, and the third is for any remaining disconnected/orphaned
> > > HOT chains. I suppose that that's okay, since the amount of work has
> > > hardly increased in proportion to this "extra pass over the page". Not
> > > 100% sure about everything myself right now, though. I guess that "3
> > > passes" might be considered excessive.

The passes don't all have a very clear explanation of what they do, or of
what they leave for the next one to process. It can be distilled from
the code comments at the respective phases and various functions, but
only after careful review of all code comments; there doesn't seem to
be a clear overview.


> @@ -295,30 +305,25 @@ heap_page_prune(Relation relation, Buffer buffer,
> [...]
> + * prstate for later passes.  Scan the page backwards (in reverse item
> + * offset number order).
> - [...]
> + * This approach is good for performance.  Most commonly tuples within a
> + * page are stored at decreasing offsets (while the items are stored at
> + * increasing offsets).

A reference to why this is commonly the case (i.e.
PageRepairFragmentation, compactify_tuples, natural insertion order)
would probably help make the case, as it is not immediately obvious at
first sight that this order is common.

> @@ -350,28 +370,41 @@ heap_page_prune(Relation relation, Buffer buffer,
> +/* Now scan the page a second time to process each root item */

This second pass also processes the HOT chain of each root item; but
that is not clear from the comment on the loop. I'd suggest a comment
more along these lines:

/*
 * Now scan the page a second time to process each valid HOT chain;
 * i.e. each non-HOT tuple or redirect line pointer and the HOT tuples in
 * the trailing chain, if any.
 * heap_prune_from_root marks all items in the HOT chain as visited;
 * so that phase 3 knows to skip those items.
 */

Apart from changes in comments for extra clarity; I think this
materially improves pruning, so thanks for working on this.

-Matthias




Re: Tab completion for SET TimeZone

2022-03-20 Thread Tom Lane
=?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
> Tom Lane  writes:
>> I think the reason the COMPLETE_WITH_ENUM_VALUE macro doesn't look
>> similar is that it hasn't made an attempt to work with input that
>> the user didn't quote --- that is, if you type
>> alter type planets rename value ur
>> it just fails to match anything, instead of providing "'uranus'".
>> Should we upgrade that likewise?

> The comment says it will add the quote before text if it's not there, so
> maybe we should adjust that to say that it will only add the quote if
> the user hasn't typed anything?

After thinking a bit harder, I realized that the SchemaQuery
infrastructure has no way to deal with the case of the input text not
being a prefix of what we want the output to be, so it can't do something
comparable to Query_for_list_of_timezone_names_quoted_out.  Maybe someday
we'll feel like adding that, but COMPLETE_WITH_ENUM_VALUE isn't a
compelling enough reason in current usage.  So I just tweaked the comment
a bit.

Pushed that way.

regards, tom lane




Re: Tab completion for SET TimeZone

2022-03-20 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> =?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?=  writes:
>> I just realised there's no point in the subselect when I'm not applying
>> the same function in the WHERE and the SELECT, so here's an updated
>> version that simplifies that.  It also fixes a typo in the commit
>> message.
>
> This doesn't work right for me under libedit -- it will correctly
> complete "am" to "'America/", but then it fails to complete
> anything past that.  The reason seems to be that once we have a
> leading single quote, libedit will include that in the text passed
> to future completion attempts, while readline won't.  I ended up
> needing three query variants, as attached (bikeshedding welcome).

Well spotted, I must have just tested that it completed something on the
first tab. The fix looks reasonable to me, and I have no better ideas
for the names of the query string #defines.

> I think the reason the COMPLETE_WITH_ENUM_VALUE macro doesn't look
> similar is that it hasn't made an attempt to work with input that
> the user didn't quote --- that is, if you type
>   alter type planets rename value ur
> it just fails to match anything, instead of providing "'uranus'".
> Should we upgrade that likewise?

The comment says it will add the quote before text if it's not there, so
maybe we should adjust that to say that it will only add the quote if
the user hasn't typed anything?

> Not sure it's worth the trouble though; I think
> COMPLETE_WITH_ENUM_VALUE is there more as a finger exercise than
> because people use it regularly.

I agree, I mostly implemented that for completeness after adding ALTER
TYPE … RENAME VALUE. Also, enum values always need quoting, while SET
TIMEZONE doesn't if the zone name follows identifier syntax, and people
might start typing it without quotes if they intend to set one of those,
so the extra effort to make that work (even though we can't suggest a
mix of quoted and non-quoted names) is worth it.
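For what it's worth, "follows identifier syntax" here comes down to a simple character-class test; a rough standalone sketch of that test (the helper name is invented, and real SQL identifiers additionally allow '$' and case-folding, which this ignores):

```c
#include <ctype.h>
#include <stdbool.h>

/* Invented helper: could "name" be typed unquoted?  "UTC" can;
 * "America/New_York" cannot, because of the '/'. */
static bool
zone_name_is_identifier(const char *name)
{
	if (name[0] == '\0')
		return false;
	if (!isalpha((unsigned char) name[0]) && name[0] != '_')
		return false;
	for (const char *p = name + 1; *p != '\0'; p++)
	{
		if (!isalnum((unsigned char) *p) && *p != '_')
			return false;
	}
	return true;
}
```

By this test, zone names like UTC or GMT could be typed unquoted, which is why the completion has to cope with both quoted and unquoted input.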

> I added a regression test case too.

Good idea, I keep forgetting that we actually have tests for tab
completion, since most cases are so simple and obviously correct that we
don't bother with tests for them.

> ... btw, I forgot to mention that I don't see any problem with
> the patch's behavior for DEFAULT.  What I see with both readline
> and libedit is that if you type something reasonable, like
> set timezone to d
> then it will correctly complete "efault ", without any extra
> quotes.  Now, if you're silly and do
> set timezone to 'd
> then readline gives you "efault' " and libedit gives you nothing.
> But I think that's your own, um, fault.

I agree, it was just a quirk noticed while testing. Also, it helps that
there aren't any actual time zone names starting with 'd'.

> We don't have any other places where tab completion is so aggressive
> as to remove quotes from a keyword.

I noticed afterwards that other config variables behave the same, so even
if we wanted to improve that, it's outwith the scope of this patch.

>   regards, tom lane

- ilmari




Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Justin Pryzby
On Sun, Mar 20, 2022 at 03:05:28PM -0400, Robert Haas wrote:
> On Thu, Mar 17, 2022 at 3:41 PM Justin Pryzby  wrote:
> > -errmsg("unrecognized 
> > compression algorithm: \"%s\"",
> > +errmsg("unrecognized 
> > compression algorithm \"%s\"",
> >
> > Most other places seem to say "compression method".  So I'd suggest to 
> > change
> > that here, and in doc/src/sgml/ref/pg_basebackup.sgml.
> 
> I'm not sure that's really better, and I don't think this patch is
> introducing an altogether novel usage. I think I would probably try to
> standardize on algorithm rather than method if I were standardizing
> the whole source tree, but I think we can leave that discussion for
> another time.

The user-facing docs are already standardized using "compression method", with
2 exceptions, of which one is contrib/ and the other is what I'm suggesting to
make consistent here.

$ git grep 'compression algorithm' doc
doc/src/sgml/pgcrypto.sgml:Which compression algorithm to use.  Only 
available if
doc/src/sgml/ref/pg_basebackup.sgml:compression algorithm is selected, 
or if server-side compression

> > +   result->parse_error =
> > +   pstrdup("found empty string where a 
> > compression option was expected");
> >
> > Needs to be localized with _() ?
> > Also, document that it's pstrdup'd.
> 
> Did the latter. The former would need to be fixed in a bunch of places
> and while I'm happy to accept an expert opinion on exactly what needs
> to be done here, I don't want to try to do it and do it wrong. Better
> to let someone with good knowledge of the subject matter patch it up
> later than do a crummy job now.

I believe it just needs _("foo")
See git grep '= _('

I mentioned another issue off-list:
pg_basebackup.c:2741:10: warning: suggest parentheses around assignment used as 
truth value [-Wparentheses]
 2741 |   Assert(compressloc = COMPRESS_LOCATION_SERVER);
  |  ^~~
pg_basebackup.c:2741:3: note: in expansion of macro ‘Assert’
 2741 |   Assert(compressloc = COMPRESS_LOCATION_SERVER);
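For what it's worth, the warning is pointing at a real hazard: with `=` the Assert can never fire, and it mutates the variable as a side effect (which would then also silently vanish in NDEBUG builds). A minimal standalone illustration, with an invented function name and a plain int standing in for the real enum:

```c
#include <assert.h>

/* Not the pg_basebackup code -- just the shape of the bug. */
static int
buggy_assert_example(void)
{
	int			compressloc = 0;	/* pretend: the "wrong" value */

	/* '=' assigns 1, and the assert then tests that nonzero result,
	 * so this can never fail; '==' is what was meant. */
	assert(compressloc = 1);

	return compressloc;			/* now 1, even though it started as 0 */
}
```

So the check is silently lost, and the variable comes back changed.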

This crashes the server using your v2 patch:

src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp 
--no-sync --no-manifest --compress=server-zstd:level, |wc -c

I wonder whether the syntax should really use both ":" and ",".
Maybe ":" isn't needed at all.

This patch also needs to update the other user-facing docs.

typo: contain a an

-- 
Justin




Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Tom Lane
Robert Haas  writes:
>> Should this also set/check errno ?
>> And check if value != ivalue_endp ?
>> See strtol(3)

> Even after reading the man page for strtol, it's not clear to me that
> this is needed. That page represents checking *endptr != '\0' as
> sufficient to tell whether an error occurred.

I'm not sure whose man page you looked at, but the POSIX standard [1]
has a pretty clear opinion about this:

Since 0, {LONG_MIN} or {LLONG_MIN}, and {LONG_MAX} or {LLONG_MAX} are
returned on error and are also valid returns on success, an
application wishing to check for error situations should set errno to
0, then call strtol() or strtoll(), then check errno.

Checking *endptr != '\0' is for detecting whether there is trailing
garbage after the number, which may be an error case or not as you
choose, but it's a different matter.
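So catching an out-of-range value really does take the errno dance in addition to the endptr test; a hedged sketch combining both checks (the helper name is invented here, not PostgreSQL code):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Succeed only if "str" is a complete, in-range decimal integer. */
static bool
parse_long(const char *str, long *result)
{
	char	   *endp;
	long		val;

	errno = 0;
	val = strtol(str, &endp, 10);
	if (endp == str || *endp != '\0')
		return false;			/* no digits at all, or trailing garbage */
	if (errno == ERANGE)
		return false;			/* out of range: val clamped to LONG_MIN/LONG_MAX */
	*result = val;
	return true;
}
```

With only the *endptr check, an input like "99999999999999999999" would silently come back as LONG_MAX.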

regards, tom lane

[1] https://pubs.opengroup.org/onlinepubs/9699919799/




Re: refactoring basebackup.c (zstd workers)

2022-03-20 Thread Robert Haas
On Thu, Mar 17, 2022 at 3:41 PM Justin Pryzby  wrote:
> gzip comma

I think it's fine the way it's written. If we made that change, then
we'd have a comma for gzip and not for the other two algorithms. Also,
I'm just moving that sentence, so any change that there is to be made
here is a job for some other patch.

> alphabetical

Fixed.

> -errmsg("unrecognized 
> compression algorithm: \"%s\"",
> +errmsg("unrecognized 
> compression algorithm \"%s\"",
>
> Most other places seem to say "compression method".  So I'd suggest to change
> that here, and in doc/src/sgml/ref/pg_basebackup.sgml.

I'm not sure that's really better, and I don't think this patch is
introducing an altogether novel usage. I think I would probably try to
standardize on algorithm rather than method if I were standardizing
the whole source tree, but I think we can leave that discussion for
another time.

> -   if (o_compression_level && !o_compression)
> +   if (o_compression_detail && !o_compression)
> ereport(ERROR,
> (errcode(ERRCODE_SYNTAX_ERROR),
>  errmsg("compression level requires 
> compression")));
>
> s/level/detail/

Fixed.


> It'd be great if this were re-usable for wal_compression, which I hope in 
> pg16 will
> support at least level=N.  And eventually pg_dump.  But those clients 
> shouldn't
> accept a client/server prefix.  Maybe the way to handle that is for those 
> tools
> to check locationres and reject it if it was specified.

One thing I forgot to mention in my previous response is that I think
the parsing code is actually well set up for this the way I have it.
server- and client- gets parsed off in a different place than we
interpret the rest, which fits well with your observation that other
cases wouldn't have a client or server prefix.

> sp: sandwich

Fixed.

> star

Fixed.

> should be const ?

OK.

>
> +   /* As a special case, the specification can be a bare integer. */
> +   bare_level = strtol(specification, &bare_level_endp, 10);
>
> Should this call expect_integer_value()?
> See below.

I don't think that would be useful. We have no keyword to pass for the
error message, nor would we use the error message if one got
constructed.

> +   result->parse_error =
> +   pstrdup("found empty string where a 
> compression option was expected");
>
> Needs to be localized with _() ?
> Also, document that it's pstrdup'd.

Did the latter. The former would need to be fixed in a bunch of places
and while I'm happy to accept an expert opinion on exactly what needs
to be done here, I don't want to try to do it and do it wrong. Better
to let someone with good knowledge of the subject matter patch it up
later than do a crummy job now.

> -1 isn't great, since it's also an integer, and, also a valid compression 
> level
> for zstd (did you see my message about that?).  Maybe INT_MIN is ok.

It really doesn't matter. Could just return 42. The client shouldn't
use the value if there's an error.

> +{
> +   int ivalue;
> +   char   *ivalue_endp;
> +
> +   ivalue = strtol(value, &ivalue_endp, 10);
>
> Should this also set/check errno ?
> And check if value != ivalue_endp ?
> See strtol(3)

Even after reading the man page for strtol, it's not clear to me that
this is needed. That page represents checking *endptr != '\0' as
sufficient to tell whether an error occurred. Maybe it wouldn't catch
an out of range value, but in practice all of the algorithms we
support now and any we support in the future are going to catch
something clamped to LONG_MIN or LONG_MAX as out of range and display
the correct error message. What's your specific thinking here?

> +   unsignedoptions;/* OR of 
> BACKUP_COMPRESSION_OPTION constants */
>
> Should be "unsigned int" or "bits32" ?

I do not see why either of those would be better.

> The server crashes if I send an unknown option - you should hit that in the
> regression tests.

Turns out I was testing this on the client side but not the server
side. Fixed and added more tests.

v2 attached.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Replace-BASE_BACKUP-COMPRESSION_LEVEL-option-with.patch
Description: Binary data


Re: Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-03-20 Thread Tom Lane
Jimmy Yih  writes:
> Tom Lane  wrote:
>> Actually though, maybe you *don't* want to do this in
>> RangeVarCallbackForDropRelation.  Because of point 2, it might be
>> better to run find_all_inheritors after we've successfully
>> identified and locked the direct target relation, ie do it back
>> in RemoveRelations.  I've not thought hard about that, but it's
>> attractive if only because it'd mean you don't have to fix point 1.

> We think that RangeVarCallbackForDropRelation is probably the most
> correct function to place the fix. It would look a bit out-of-place
> being in RemoveRelations seeing how there's already relative DROP
> INDEX code in RangeVarCallbackForDropRelation.

I really think you made the wrong choice here.  Doing the locking in
RemoveRelations leads to an extremely simple patch, as I demonstrate
below.  Moreover, since RemoveRelations also has special-case code
for partitioned indexes, it's hard to argue that it mustn't cover
this case too.

Also, I think the proposed test case isn't very good, because when
I run it without applying the code patch, it fails to demonstrate
any deadlock.  The query output is different, but not obviously
wrong.

> Fixed in attached patch. Added another local variable
> is_partitioned_index to store the classform value. The main reason we
> need the classform is because the existing relkind and
> expected_relkind local variables would only show RELKIND_INDEX whereas
> we needed exactly RELKIND_PARTITIONED_INDEX.

Yeah.  As I looked at that I realized that it was totally confusing:
at least one previous author thought that "relkind" stored the rel's
actual relkind, which it doesn't as the code stands.  In particular,
in this bit:

if ((relkind == RELKIND_INDEX || relkind == RELKIND_PARTITIONED_INDEX) &&
relOid != oldRelOid)
{
state->heapOid = IndexGetRelation(relOid, true);

the test for relkind == RELKIND_PARTITIONED_INDEX is dead code
because relkind will never be that.  It accidentally works anyway
because the other half of the || does the right thing, but somebody
was confused, and so will readers be.

Hence, I propose the attached.  0001 is pure refactoring: it hopefully
clears up the confusion about which "relkind" is which, and it also
saves a couple of redundant syscache fetches in RemoveRelations.
Then 0002 adds the actual bug fix as well as a test case that does
deadlock with unpatched code.

regards, tom lane

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dc5872f988..26ccd7f481 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -295,12 +295,18 @@ static const struct dropmsgstrings dropmsgstringarray[] = {
 	{'\0', 0, NULL, NULL, NULL, NULL}
 };
 
+/* communication between RemoveRelations and RangeVarCallbackForDropRelation */
 struct DropRelationCallbackState
 {
-	char		relkind;
+	/* These fields are set by RemoveRelations: */
+	char		expected_relkind;
+	bool		concurrent;
+	/* These fields are state to track which subsidiary locks are held: */
 	Oid			heapOid;
 	Oid			partParentOid;
-	bool		concurrent;
+	/* These fields are passed back by RangeVarCallbackForDropRelation: */
+	char		actual_relkind;
+	char		actual_relpersistence;
 };
 
 /* Alter table target-type flags for ATSimplePermissions */
@@ -1416,10 +1422,12 @@ RemoveRelations(DropStmt *drop)
 		AcceptInvalidationMessages();
 
 		/* Look up the appropriate relation using namespace search. */
-		state.relkind = relkind;
+		state.expected_relkind = relkind;
+		state.concurrent = drop->concurrent;
+		/* We must initialize these fields to show that no locks are held: */
 		state.heapOid = InvalidOid;
 		state.partParentOid = InvalidOid;
-		state.concurrent = drop->concurrent;
+
 		relOid = RangeVarGetRelidExtended(rel, lockmode, RVR_MISSING_OK,
 		  RangeVarCallbackForDropRelation,
 		  (void *) &state);
@@ -1433,10 +1441,10 @@ RemoveRelations(DropStmt *drop)
 
 		/*
 		 * Decide if concurrent mode needs to be used here or not.  The
-		 * relation persistence cannot be known without its OID.
+		 * callback retrieved the rel's persistence for us.
 		 */
 		if (drop->concurrent &&
-			get_rel_persistence(relOid) != RELPERSISTENCE_TEMP)
+			state.actual_relpersistence != RELPERSISTENCE_TEMP)
 		{
 			Assert(list_length(drop->objects) == 1 &&
    drop->removeType == OBJECT_INDEX);
@@ -1448,7 +1456,7 @@ RemoveRelations(DropStmt *drop)
 		 * either.
 		 */
 		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
-			get_rel_relkind(relOid) == RELKIND_PARTITIONED_INDEX)
+			state.actual_relkind == RELKIND_PARTITIONED_INDEX)
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("cannot drop partitioned index \"%s\" concurrently",
@@ -1479,7 +1487,6 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
 {
 	HeapTuple	tuple;
 	struct DropRelationCallbackState *state;
-	char		relkind;
 	char		expected_relkind;
 	

Re: a misbehavior of partition row movement (?)

2022-03-20 Thread Alvaro Herrera
On 2022-Mar-20, Amit Langote wrote:

> On Sun, Mar 20, 2022 at 5:13 AM Alvaro Herrera  
> wrote:
> > On 2022-Mar-18, Zhihong Yu wrote:

> > > +   if (!partRel->rd_rel->relispartition)
> > > +   elog(ERROR, "cannot find ancestors of a non-partition result
> > > relation");
> > >
> > > It would be better to include the relation name in the error message.
> >
> > I don't think it matters.  We don't really expect to hit this.
> 
> I tend to think maybe showing at least the OID in the error message
> doesn't hurt, but maybe we don't need to.

Since we don't even know of a situation in which this error message
would be raised, I'm hardly bothered by failing to print the OID.  If
any users complain, we can add more detail.

> I've fixed the comment as:
> 
> -   /* Ignore the root ancestor, because ...?? */
> +   /* Root ancestor's triggers will be processed. */

Okay, included that.

> A description of what we're looking for with this code is in the
> comment above the loop:
> 
> /*
>  * For any foreign keys that point directly into a non-root ancestors of
>  * the source partition,...
> 
> So finding triggers in those non-root ancestors whose function is
> RI_TRIGGER_PK tells us that those relations have foreign keys pointing
> into it or that it is the PK table in that relationship.  Other than
> the comment, the code itself seems self-documenting with regard to
> what's being done (given the function/macro/variable naming), so maybe
> we're better off without additional commentary here.

Yeah, WFM.

> I've attached a delta patch on v16 for the above comment and a couple
> of other changes.

Merged that in, and pushed.  I made a couple of wording changes in
comments here and there as well.

I lament the fact that this fix is not going to hit Postgres 12-14, but
ratio of effort to reward seems a bit too high.  I think we could
backpatch the two involved commits if someone is motivated enough to
verify everything and come up with solutions for the necessary ABI
changes.

Thank you, Amit, for your perseverance in getting this bug fixed!  

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"




Re: Possible fails in pg_stat_statements test

2022-03-20 Thread Andres Freund
Hi,

On 2022-01-14 11:11:07 +0300, Anton A. Melnikov wrote:
> This patch introduces an additional counter of wal records not related to
> the query being executed.

They're not unrelated though.


> Due to this counter pg_stat_statement finds out the number of wal records
> that are not relevant to the query and does not include them in the per
> query statistic.

-many. For read-only queries the generated WAL due to on-access pruning can be
a significant factor in performance. Removing that information makes
pg_stat_statements *less* useful.


> This removes the possibility of the error described above.
> 
> There is a way to reproduce this error when patch is not applied:
> 1) start server with "shared_preload_libraries = 'pg_stat_statements'"
> string in the postgresql.conf;
> 2) replace makefile in contrib/pg_stat_statements with attached one;
> 3) replace test file contrib/pg_stat_statements/sql/pg_stat_statements.sql
> and expected results
> contrib/pg_stat_statements/expected/pg_stat_statements.out
> with shorter versions from attached files;
> 4) copy test.sh to contrib/pg_stat_statements and make sure that PGHOME
> point to your server;
> 5) cd to contrib/pg_stat_statements and execute:
> export ITER=1 && while ./start.sh || break; export ITER=$(($ITER+1)); do :;
> done
> 
> Usually 100-200 iterations will be enough.
> To catch the error more faster one can add wal_records column to SELECT
> in line 26 of contrib/pg_stat_statements/sql/pg_stat_statements.sql as
> followes:
> SELECT query, calls, rows, wal_records,
> and replace the contrib/pg_stat_statements/expected/pg_stat_statements.out
> with attached pg_stat_statements-fast.out

Can the test failures be encountered without such an elaborate setup? If not,
then I don't really see why we need to do anything here?

Greetings,

Andres Freund




Re: Possible fails in pg_stat_statements test

2022-03-20 Thread Anton A. Melnikov

Hello!

Here is the second version of the patch rebased onto the current master. 
No logical changes.

All other files attached to the previous letter are still current.

With best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 04ce779eb25fec3364c216202b7d7dbd3ed79819
Author: Anton A. Melnikov 
Date:   Sun Mar 20 19:34:58 2022 +0300

Fix possible fails in pg_stat_statements test via taking into account non-query wal records.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9e525a6ad3..56ed7e0fde 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1370,7 +1370,7 @@ pgss_store(const char *query, uint64 queryId,
 		e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
 		e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
 		e->counters.usage += USAGE_EXEC(total_time);
-		e->counters.wal_records += walusage->wal_records;
+		e->counters.wal_records += (walusage->wal_records - walusage->non_query_wal_recs);
 		e->counters.wal_fpi += walusage->wal_fpi;
 		e->counters.wal_bytes += walusage->wal_bytes;
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 4656f1b3db..f09abba04e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -21,6 +21,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -209,6 +210,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 			ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
 	   limited_ts, &nnewlpdead, NULL);
 
+			/* Take into account that heap_page_prune() just generated a new
+			 * wal record with zero xl_xid that is not related to current query.
+			 */
+			pgWalUsage.non_query_wal_recs++;
+
 			/*
 			 * Report the number of tuples reclaimed to pgstats.  This is
 			 * ndeleted minus the number of newly-LP_DEAD-set items.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3d9088a704..60bc6c7542 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,6 +38,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -956,6 +957,12 @@ WriteZeroPageXlogRec(int pageno)
 	XLogBeginInsert();
 	XLogRegisterData((char *) (), sizeof(int));
 	(void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE);
+
+	/*
+	 * Consider that a new unrelated to current query wal record
+	 * with zero xl_xid has just been created.
+	 */
+	pgWalUsage.non_query_wal_recs++;
 }
 
 /*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c5ff02a842..214fb3cc45 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -268,6 +268,7 @@ WalUsageAdd(WalUsage *dst, WalUsage *add)
 	dst->wal_bytes += add->wal_bytes;
 	dst->wal_records += add->wal_records;
 	dst->wal_fpi += add->wal_fpi;
+	dst->non_query_wal_recs += add->non_query_wal_recs;
 }
 
 void
@@ -276,4 +277,5 @@ WalUsageAccumDiff(WalUsage *dst, const WalUsage *add, const WalUsage *sub)
 	dst->wal_bytes += add->wal_bytes - sub->wal_bytes;
 	dst->wal_records += add->wal_records - sub->wal_records;
 	dst->wal_fpi += add->wal_fpi - sub->wal_fpi;
+	dst->non_query_wal_recs += add->non_query_wal_recs - sub->non_query_wal_recs;
 }
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1b7157bdd1..0d83f37a3c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -49,6 +49,11 @@ typedef struct WalUsage
 	int64		wal_records;	/* # of WAL records produced */
 	int64		wal_fpi;		/* # of WAL full page images produced */
 	uint64		wal_bytes;		/* size of WAL records produced */
+	/*
+	 * Number of WAL records unrelated to current query. In particular due to
+	 * heap_page_prune_opt() or WriteZeroPageXlogRec().
+	 */
+	int64		non_query_wal_recs;
 } WalUsage;
 
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
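Stepping back, the patch piggybacks on the snapshot-and-diff accounting that WalUsage already uses: callers snapshot pgWalUsage around a query and charge the query with the difference. A trimmed-down standalone sketch of that pattern with the new field (names mirror the patch, but this is not the PostgreSQL source):

```c
/* Stand-in for the real struct. */
typedef struct WalUsage
{
	long		wal_records;		/* # of WAL records produced */
	long		non_query_wal_recs; /* records unrelated to the query */
} WalUsage;

/* dst += (add - sub): accumulate only what accrued between snapshots */
static void
WalUsageAccumDiff(WalUsage *dst, const WalUsage *add, const WalUsage *sub)
{
	dst->wal_records += add->wal_records - sub->wal_records;
	dst->non_query_wal_recs += add->non_query_wal_recs - sub->non_query_wal_recs;
}

/* What pg_stat_statements would then attribute to the query itself. */
static long
query_wal_records(const WalUsage *delta)
{
	return delta->wal_records - delta->non_query_wal_recs;
}
```

For instance, if pruning emitted 2 of the 7 records generated while a query ran, the per-query statistic comes out as 5.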


Re: Problem with moderation of messages with patched attached.

2022-03-20 Thread Pavel Borisov
>
> оf course we could get complaints no matter what level we set the limit
> at. I think raising it to 2Mb would be a reasonable experiment. If no
> observable evil ensues then leave it that way. If it does then roll it
> back. I agree that plain uncompressed patches are easier to deal with in
> general.
>
Thanks, Andrew! I think it will be more comfortable now.

Pavel.

>


Re: shared-memory based stats collector - v66

2022-03-20 Thread Andres Freund
Hi,

On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
> Attached is a TAP test to check that stats are cleaned up on a physical
> replica after the objects they concern are dropped on the primary.

Thanks!


> I'm not sure that the extra force next flush on standby is needed after
> drop on primary since drop should report stats and I wait for catchup.

A drop doesn't force stats in other sessions to be flushed immediately, so
unless I misunderstand, yes, it's needed.


> Also, I don't think the tests with DROP SCHEMA actually exercise another
> code path, so it might be worth cutting those.

> +/*
> + * Checks for presence of stats for object with provided object oid of kind
> + * specified in the type string in database of provided database oid.
> + *
> + * For subscription stats, only the objoid will be used. For database stats,
> + * only the dboid will be used. The value passed in for the unused parameter 
> is
> + * discarded.
> + * TODO: should it be 'pg_stat_stats_present' instead of 
> 'pg_stat_stats_exist'?
> + */
> +Datum
> +pg_stat_stats_exist(PG_FUNCTION_ARGS)

Should we revoke stats for this one from PUBLIC (similar to the reset 
functions)?


> +# Set track_functions to all on standby
> +$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");

That should already be set, cloning from the primary includes the
configuration from that point in time.

> +$node_standby->restart;

FWIW, it'd also only require a reload

Greetings,

Andres Freund




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-03-20 Thread Stephen Frost
Greetings,

On Sun, Mar 20, 2022 at 18:31 Joshua Brindle 
wrote:

> On Sun, Mar 20, 2022 at 12:27 PM Joe Conway  wrote:
> >
> > On 3/3/22 11:26, Joshua Brindle wrote:
> > > On Thu, Feb 10, 2022 at 2:37 PM Joe Conway  wrote:
> > >>
> > >> On 2/10/22 14:28, Nathan Bossart wrote:
> > >> > On Wed, Feb 09, 2022 at 04:39:11PM -0500, Joe Conway wrote:
> > >> >> On 2/9/22 13:13, Nathan Bossart wrote:
> > >> >>> I do wonder if users find the differences between predefined
> roles and role
> > >> >>> attributes confusing.  INHERIT doesn't govern role attributes,
> but it will
> > >> >>> govern predefined roles when this patch is applied.  Maybe the
> role
> > >> >>> attribute system should eventually be deprecated in favor of using
> > >> >>> predefined roles for everything.  Or perhaps the predefined roles
> should be
> > >> >>> converted to role attributes.
> > >> >>
> > >> >> Yep, I was suggesting that the latter would have been preferable
> to me while
> > >> >> Robert seemed to prefer the former. Honestly I could be happy with
> either of
> > >> >> those solutions, but as I alluded to that is probably a discussion
> for the
> > >> >> next development cycle since I don't see us doing that big a
> change in this
> > >> >> one.
> > >> >
> > >> > I agree.  I still think Joshua's proposed patch is a worthwhile
> improvement
> > >> > for v15.
> > >>
> > >> +1
> > >>
> > >> I am planning to get into it in detail this weekend. So far I have
> > >> really just ensured it merges cleanly and passes make world.
> > >
> > > Rebased patch to apply to master attached.
> >
> > Well longer than I planned, but finally took a closer look.
> >
> > I made one minor editorial fix to Joshua's patch, rebased to current
> > master, and added two missing call sites that presumably are related to
> > recent commits for pg_basebackup.
>
> FWIW I pinged Stephen when I saw the basebackup changes included
> is_member_of and he didn't think they necessarily needed to be changed
> since they aren't accessible to human and you can't SET ROLE on a
> replication connection in order to access the role where inheritance
> was blocked.


Yeah, though that should really be very clearly explained in comments
around that code, as anything that uses is_member should really be viewed
as an exception.  I also wouldn't be against using has_privs there anyway,
and saying that you shouldn't be trying to connect as one role on a
replication connection and use the privs of another when you don't
automatically inherit those rights; but on the assumption that the
committer intended is_member there because SET ROLE isn't something one
does on replication connections, I'm alright with it being as is.

Kind Regards,

Stephen P Frost

>


Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-03-20 Thread Joshua Brindle
On Sun, Mar 20, 2022 at 12:34 PM Joe Conway  wrote:
>
> On 3/20/22 12:31, Joshua Brindle wrote:
> > On Sun, Mar 20, 2022 at 12:27 PM Joe Conway  wrote:
> >>
> >> On 3/3/22 11:26, Joshua Brindle wrote:
> >> > On Thu, Feb 10, 2022 at 2:37 PM Joe Conway  wrote:
> >> >>
> >> >> On 2/10/22 14:28, Nathan Bossart wrote:
> >> >> > On Wed, Feb 09, 2022 at 04:39:11PM -0500, Joe Conway wrote:
> >> >> >> On 2/9/22 13:13, Nathan Bossart wrote:
> >> >> >>> I do wonder if users find the differences between predefined roles 
> >> >> >>> and role
> >> >> >>> attributes confusing.  INHERIT doesn't govern role attributes, but 
> >> >> >>> it will
> >> >> >>> govern predefined roles when this patch is applied.  Maybe the role
> >> >> >>> attribute system should eventually be deprecated in favor of using
> >> >> >>> predefined roles for everything.  Or perhaps the predefined roles 
> >> >> >>> should be
> >> >> >>> converted to role attributes.
> >> >> >>
> >> >> >> Yep, I was suggesting that the latter would have been preferable to 
> >> >> >> me while
> >> >> >> Robert seemed to prefer the former. Honestly I could be happy with 
> >> >> >> either of
> >> >> >> those solutions, but as I alluded to that is probably a discussion 
> >> >> >> for the
> >> >> >> next development cycle since I don't see us doing that big a change 
> >> >> >> in this
> >> >> >> one.
> >> >> >
> >> >> > I agree.  I still think Joshua's proposed patch is a worthwhile 
> >> >> > improvement
> >> >> > for v15.
> >> >>
> >> >> +1
> >> >>
> >> >> I am planning to get into it in detail this weekend. So far I have
> >> >> really just ensured it merges cleanly and passes make world.
> >> >
> >> > Rebased patch to apply to master attached.
> >>
> >> Well longer than I planned, but finally took a closer look.
> >>
> >> I made one minor editorial fix to Joshua's patch, rebased to current
> >> master, and added two missing call sites that presumably are related to
> >> recent commits for pg_basebackup.
> >
> > FWIW I pinged Stephen when I saw the basebackup changes included
> > is_member_of and he didn't think they necessarily needed to be changed
> > since they aren't accessible to human and you can't SET ROLE on a
> > replication connection in order to access the role where inheritance
> > was blocked.
>
> Maybe worth a discussion, but it seems strange to me to treat them
> differently when the whole purpose of this patch is to make things
> consistent ¯\_(ツ)_/¯
>

I don't have a strong opinion either way, just noting that it's a
possible exception area due to how it is used.

Thanks for committing this.




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-03-20 Thread Joe Conway

On 3/20/22 12:31, Joshua Brindle wrote:

On Sun, Mar 20, 2022 at 12:27 PM Joe Conway  wrote:


On 3/3/22 11:26, Joshua Brindle wrote:
> On Thu, Feb 10, 2022 at 2:37 PM Joe Conway  wrote:
>>
>> On 2/10/22 14:28, Nathan Bossart wrote:
>> > On Wed, Feb 09, 2022 at 04:39:11PM -0500, Joe Conway wrote:
>> >> On 2/9/22 13:13, Nathan Bossart wrote:
>> >>> I do wonder if users find the differences between predefined roles and
>> >>> role attributes confusing.  INHERIT doesn't govern role attributes, but
>> >>> it will govern predefined roles when this patch is applied.  Maybe the
>> >>> role attribute system should eventually be deprecated in favor of using
>> >>> predefined roles for everything.  Or perhaps the predefined roles should
>> >>> be converted to role attributes.
>> >>
>> >> Yep, I was suggesting that the latter would have been preferable to me
>> >> while Robert seemed to prefer the former. Honestly I could be happy with
>> >> either of those solutions, but as I alluded to that is probably a
>> >> discussion for the next development cycle since I don't see us doing
>> >> that big a change in this one.
>> >
>> > I agree.  I still think Joshua's proposed patch is a worthwhile improvement
>> > for v15.
>>
>> +1
>>
>> I am planning to get into it in detail this weekend. So far I have
>> really just ensured it merges cleanly and passes make world.
>
> Rebased patch to apply to master attached.

Well longer than I planned, but finally took a closer look.

I made one minor editorial fix to Joshua's patch, rebased to current
master, and added two missing call sites that presumably are related to
recent commits for pg_basebackup.


FWIW I pinged Stephen when I saw the basebackup changes included
is_member_of and he didn't think they necessarily needed to be changed
since they aren't accessible to human and you can't SET ROLE on a
replication connection in order to access the role where inheritance
was blocked.


Maybe worth a discussion, but it seems strange to me to treat them 
differently when the whole purpose of this patch is to make things 
consistent ¯\_(ツ)_/¯


Joe




Re: shared-memory based stats collector - v66

2022-03-20 Thread Melanie Plageman
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund  wrote:
>
> Starting to feel more optimistic about this! There's loads more to do, but now
> the TODOs just seem to require elbow grease, rather than deep thinking.
>
> The biggest todos are:
> - Address all the remaining AFIXMEs and XXXs
>
> - add longer explanation of architecture to pgstat.c (or a README)
>
> - Further improve our stats test coverage - there's a crapton not covered,
>   despite 0017:
>   - test WAL replay with stats (stats for dropped tables are removed etc)

Attached is a TAP test to check that stats are cleaned up on a physical
replica after the objects they concern are dropped on the primary.

I'm not sure that the extra force next flush on standby is needed after
drop on primary since drop should report stats and I wait for catchup.
Also, I don't think the tests with DROP SCHEMA actually exercise another
code path, so it might be worth cutting those.

- Melanie
From 2cb108fadf656de9cc998c0b2123a184881ef4e0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 17 Mar 2022 21:54:16 -0400
Subject: [PATCH] add replica cleanup tests

---
 src/backend/utils/activity/pgstat.c   |  32 +++
 src/backend/utils/adt/pgstatfuncs.c   |  22 ++
 src/include/catalog/pg_proc.dat   |   6 +
 src/include/pgstat.h  |   3 +
 .../recovery/t/029_stats_cleanup_replica.pl   | 238 ++
 5 files changed, 301 insertions(+)
 create mode 100644 src/test/recovery/t/029_stats_cleanup_replica.pl

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index e9fab35a46..4c0fea62b4 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -159,6 +159,7 @@ static void pgstat_shared_reset_timestamp(PgStatKind kind, PgStatShm_StatEntryHe
 static void pgstat_pending_delete(PgStatSharedRef *shared_ref);
 static bool pgstat_pending_flush_stats(bool nowait);
 
+static PgStatKind pgstat_kind_for(char *kind_str);
 static inline const PgStatKindInfo *pgstat_kind_info_for(PgStatKind kind);
 
 
@@ -1315,6 +1316,23 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 	return 0;
 }
 
+bool
+pgstat_shared_stat_exists(char *kind_str, Oid dboid, Oid objoid)
+{
+	PgStatKind kind = pgstat_kind_for(kind_str);
+
+	/*
+	 * Ignore dboid or objoid for subscription and db stats respectively.
+	 */
+	if (kind == PGSTAT_KIND_SUBSCRIPTION)
+		dboid = InvalidOid;
+
+	if (kind == PGSTAT_KIND_DB)
+		objoid = InvalidOid;
+
+	return (bool) pgstat_fetch_entry(kind, dboid, objoid);
+}
+
 
 /* 
  * Helper functions
@@ -1343,6 +1361,20 @@ pgstat_setup_memcxt(void)
   ALLOCSET_SMALL_SIZES);
 }
 
+static PgStatKind
+pgstat_kind_for(char *kind_str)
+{
+	for (int kind = 0; kind <= PGSTAT_KIND_LAST; kind++)
+	{
+		if (pg_strcasecmp(kind_str, pgstat_kind_infos[kind].name) == 0)
+			return kind;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid statistic kind: \"%s\"", kind_str)));
+}
+
 static inline const PgStatKindInfo *
 pgstat_kind_info_for(PgStatKind kind)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5d843cde24..62b2df856f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2420,3 +2420,25 @@ pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
+
+
+/*
+ * Checks for presence of stats for object with provided object oid of kind
+ * specified in the type string in database of provided database oid.
+ *
+ * For subscription stats, only the objoid will be used. For database stats,
+ * only the dboid will be used. The value passed in for the unused parameter is
+ * discarded.
+ * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
+ */
+Datum
+pg_stat_stats_exist(PG_FUNCTION_ARGS)
+{
+	Oid dboid = PG_GETARG_OID(0);
+	Oid	objoid = PG_GETARG_OID(1);
+	char *stats_type = text_to_cstring(PG_GETARG_TEXT_P(2));
+
+	PG_RETURN_BOOL((bool) pgstat_shared_stat_exists(stats_type, dboid,
+objoid));
+
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 75dae94c49..3f3c4e0427 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5376,6 +5376,12 @@
   proargmodes => '{i,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{slot_name,slot_name,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes,total_txns,total_bytes,stats_reset}',
   prosrc => 'pg_stat_get_replication_slot' },
+
+{ oid => '8384', descr => 'statistics: checks if stats exist for provided object of provided type in provided database',
+  proname => 'pg_stat_stats_exist', provolatile => 's', proparallel => 'r',
+  prorettype => 'bool', proargtypes => 'oid oid text',
+  prosrc => 'pg_stat_stats_exist' },
+
 { 

Re: pgsql: Add option to use ICU as global locale provider

2022-03-20 Thread Andres Freund
Hi,

On 2022-03-20 11:03:38 +0100, Peter Eisentraut wrote:
> committed

Thanks. Rebasing over that fixed the meson Centos 7 build in my meson
tree. https://cirrus-ci.com/build/5265480968568832

Greetings,

Andres Freund




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-03-20 Thread Joshua Brindle
On Sun, Mar 20, 2022 at 12:27 PM Joe Conway  wrote:
>
> On 3/3/22 11:26, Joshua Brindle wrote:
> > On Thu, Feb 10, 2022 at 2:37 PM Joe Conway  wrote:
> >>
> >> On 2/10/22 14:28, Nathan Bossart wrote:
> >> > On Wed, Feb 09, 2022 at 04:39:11PM -0500, Joe Conway wrote:
> >> >> On 2/9/22 13:13, Nathan Bossart wrote:
> >> >>> I do wonder if users find the differences between predefined roles and 
> >> >>> role
> >> >>> attributes confusing.  INHERIT doesn't govern role attributes, but it 
> >> >>> will
> >> >>> govern predefined roles when this patch is applied.  Maybe the role
> >> >>> attribute system should eventually be deprecated in favor of using
> >> >>> predefined roles for everything.  Or perhaps the predefined roles 
> >> >>> should be
> >> >>> converted to role attributes.
> >> >>
> >> >> Yep, I was suggesting that the latter would have been preferable to me 
> >> >> while
> >> >> Robert seemed to prefer the former. Honestly I could be happy with 
> >> >> either of
> >> >> those solutions, but as I alluded to that is probably a discussion for 
> >> >> the
> >> >> next development cycle since I don't see us doing that big a change in 
> >> >> this
> >> >> one.
> >> >
> >> > I agree.  I still think Joshua's proposed patch is a worthwhile 
> >> > improvement
> >> > for v15.
> >>
> >> +1
> >>
> >> I am planning to get into it in detail this weekend. So far I have
> >> really just ensured it merges cleanly and passes make world.
> >
> > Rebased patch to apply to master attached.
>
> Well longer than I planned, but finally took a closer look.
>
> I made one minor editorial fix to Joshua's patch, rebased to current
> master, and added two missing call sites that presumably are related to
> recent commits for pg_basebackup.

FWIW I pinged Stephen when I saw the basebackup changes included
is_member_of and he didn't think they necessarily needed to be changed
since they aren't accessible to human and you can't SET ROLE on a
replication connection in order to access the role where inheritance
was blocked.

> On that last note, I did not find basebackup_to_shell.required_role
> documented at all, and did not attempt to fix that.
>
> When this thread petered out, it seemed that Robert was at least neutral
> on the patch, and everyone else was +1 on applying it to master for pg15.
>
> As such, if there are any other issues, complaints, etc., please speak
> real soon now...
>
> Thanks,
>
> Joe




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-03-20 Thread Joe Conway

On 3/3/22 11:26, Joshua Brindle wrote:

On Thu, Feb 10, 2022 at 2:37 PM Joe Conway  wrote:


On 2/10/22 14:28, Nathan Bossart wrote:
> On Wed, Feb 09, 2022 at 04:39:11PM -0500, Joe Conway wrote:
>> On 2/9/22 13:13, Nathan Bossart wrote:
>>> I do wonder if users find the differences between predefined roles and role
>>> attributes confusing.  INHERIT doesn't govern role attributes, but it will
>>> govern predefined roles when this patch is applied.  Maybe the role
>>> attribute system should eventually be deprecated in favor of using
>>> predefined roles for everything.  Or perhaps the predefined roles should be
>>> converted to role attributes.
>>
>> Yep, I was suggesting that the latter would have been preferable to me while
>> Robert seemed to prefer the former. Honestly I could be happy with either of
>> those solutions, but as I alluded to that is probably a discussion for the
>> next development cycle since I don't see us doing that big a change in this
>> one.
>
> I agree.  I still think Joshua's proposed patch is a worthwhile improvement
> for v15.

+1

I am planning to get into it in detail this weekend. So far I have
really just ensured it merges cleanly and passes make world.


Rebased patch to apply to master attached.


Well longer than I planned, but finally took a closer look.

I made one minor editorial fix to Joshua's patch, rebased to current 
master, and added two missing call sites that presumably are related to 
recent commits for pg_basebackup.


On that last note, I did not find basebackup_to_shell.required_role 
documented at all, and did not attempt to fix that.


When this thread petered out, it seemed that Robert was at least neutral 
on the patch, and everyone else was +1 on applying it to master for pg15.


As such, if there are any other issues, complaints, etc., please speak 
real soon now...


Thanks,

Joe

Use has_privs_of_role for predefined role checks

Generally, if a role is granted membership in another role with NOINHERIT, it must use SET ROLE to access the privileges of that role; with predefined roles, however, membership and privilege were conflated. Fix that by replacing is_member_of_role with has_privs_of_role for predefined roles. The patch does not remove is_member_of_role from acl.h, but it does add a warning not to use that function for privilege checking. Not backpatched, based on the hackers list discussion.

Author: Joshua Brindle
Reviewed-by: Stephen Frost, Nathan Bossart, Joe Conway
Discussion: https://postgr.es/m/flat/cagb+vh4zv_tvkt2tv3qns6tum_f_9icmuj0zjywwcgvi4pa...@mail.gmail.com
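
The distinction the commit message draws — membership alone versus inherited privileges — can be illustrated with a toy model (this is not PostgreSQL code; the role names, data structure, and functions are invented here purely to mirror the semantics being discussed):

```python
# Toy model: is_member_of_role() walks every membership edge, while
# has_privs_of_role() follows only INHERIT edges, so a NOINHERIT grant
# confers membership but not privileges (until SET ROLE is used).
memberships = {
    # member -> list of (granted_role, inherit_flag)
    "alice": [("pg_write_server_files", False)],  # granted NOINHERIT
    "bob":   [("pg_write_server_files", True)],   # granted with INHERIT
}

def _reachable(user, role, inherit_only):
    seen, todo = set(), [user]
    while todo:
        cur = todo.pop()
        if cur == role:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        for granted, inh in memberships.get(cur, []):
            if inh or not inherit_only:
                todo.append(granted)
    return False

def is_member_of_role(user, role):
    """Membership through any edge, INHERIT or not."""
    return _reachable(user, role, inherit_only=False)

def has_privs_of_role(user, role):
    """Privileges only through INHERIT edges (barring SET ROLE)."""
    return _reachable(user, role, inherit_only=True)

# alice is a member, but without SET ROLE she lacks the privileges:
assert is_member_of_role("alice", "pg_write_server_files")
assert not has_privs_of_role("alice", "pg_write_server_files")
# bob inherits, so both checks pass:
assert has_privs_of_role("bob", "pg_write_server_files")
```

In this sketch, the patch amounts to switching predefined-role checks from the first function to the second, so that NOINHERIT behaves for predefined roles the way it does for ordinary role privileges.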

diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index d7d84d0..03addf1 100644
*** a/contrib/adminpack/adminpack.c
--- b/contrib/adminpack/adminpack.c
*** convert_and_check_filename(text *arg)
*** 79,85 
  	 * files on the server as the PG user, so no need to do any further checks
  	 * here.
  	 */
! 	if (is_member_of_role(GetUserId(), ROLE_PG_WRITE_SERVER_FILES))
  		return filename;
  
  	/*
--- 79,85 
  	 * files on the server as the PG user, so no need to do any further checks
  	 * here.
  	 */
! 	if (has_privs_of_role(GetUserId(), ROLE_PG_WRITE_SERVER_FILES))
  		return filename;
  
  	/*
diff --git a/contrib/basebackup_to_shell/basebackup_to_shell.c b/contrib/basebackup_to_shell/basebackup_to_shell.c
index d82cb6d..f0ddef1 100644
*** a/contrib/basebackup_to_shell/basebackup_to_shell.c
--- b/contrib/basebackup_to_shell/basebackup_to_shell.c
*** _PG_init(void)
*** 90,96 
  }
  
  /*
!  * We choose to defer sanity sanity checking until shell_get_sink(), and so
   * just pass the target detail through without doing anything. However, we do
   * permissions checks here, before any real work has been done.
   */
--- 90,96 
  }
  
  /*
!  * We choose to defer sanity checking until shell_get_sink(), and so
   * just pass the target detail through without doing anything. However, we do
   * permissions checks here, before any real work has been done.
   */
*** shell_check_detail(char *target, char *t
*** 103,109 
  
  		StartTransactionCommand();
  		roleid = get_role_oid(shell_required_role, true);
! 		if (!is_member_of_role(GetUserId(), roleid))
  			ereport(ERROR,
  	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  	 errmsg("permission denied to use basebackup_to_shell")));
--- 103,109 
  
  		StartTransactionCommand();
  		roleid = get_role_oid(shell_required_role, true);
! 		if (!has_privs_of_role(GetUserId(), roleid))
  			ereport(ERROR,
  	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  	 errmsg("permission denied to use basebackup_to_shell")));
diff --git a/contrib/file_fdw/expected/file_fdw.out b/contrib/file_fdw/expected/file_fdw.out
index 0ac6e4e..14acdb2 100644
*** a/contrib/file_fdw/expected/file_fdw.out
--- b/contrib/file_fdw/expected/file_fdw.out
*** ALTER FOREIGN TABLE agg_text OWNER TO re
*** 459,465 
  ALTER FOREIGN TABLE agg_text OPTIONS (SET format 

Re: Problem with moderation of messages with patched attached.

2022-03-20 Thread Andrew Dunstan


On 3/19/22 14:48, Andres Freund wrote:
> Hi,
>
> On 2022-03-03 13:37:35 +, Dave Page wrote:
>> On Thu, 3 Mar 2022 at 13:28, Pavel Borisov  wrote:
>>
>>>> The mail system doesn't have the capability to apply different moderation
>>>> rules for people in that way I'm afraid.
>>>
>>> Maybe then 2MB for everyone? Otherwise it's not so convenient. Lead to
>>> answers before the questions in the thread [1], seems weird.
>>>
>> Then someone will complain if their patch is 2.1MB! How often are messages
>> legitimately over 1MB anyway, even with a patch? I don't usually moderate
>> -hackers, so I don't know if this is a common thing or not.
> I don't think it's actually that rare. But most contributors writing that
> large patchsets know about the limit and work around it - I gzip patches when
> I see the email getting too large. But it's more annoying to work with for
> reviewers.
>
> It's somewhat annoying. If you e.g. append a few graphs of performance changes
> and a patch it's pretty easy to get into the range where compressing won't
> help anymore.
>
> And sure, any limit may be hit by somebody. But 1MB across the whole email
> seems pretty low these days.
>

Of course we could get complaints no matter what level we set the limit
at. I think raising it to 2MB would be a reasonable experiment. If no
observable evil ensues then leave it that way. If it does then roll it
back. I agree that plain uncompressed patches are easier to deal with in
general.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: A test for replay of regression tests

2022-03-20 Thread Andrew Dunstan


On 3/20/22 05:36, Thomas Munro wrote:
> On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro  wrote:
>> Another failure under 027_stream_regress.pl:
>>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05
>>
>>  vacuum   ... FAILED 3463 ms
>>
>> I'll try to come up with the perl needed to see the regression.diffs
>> next time...
> Here's my proposed change to achieve that.


I think that's OK.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Probable CF bot degradation

2022-03-20 Thread Julien Rouhaud
On Sun, Mar 20, 2022 at 01:58:01PM +0100, Matthias van de Meent wrote:
>
> I noticed that two of my patches (37/3543 and 37/3542) both failed due
> to a bad commit on master (076f4d9). The issue was fixed an hour later
> with b61e6214; but the pipeline for these patches hasn't run since.
> Because doing a no-op update would only clutter people's inboxes, I
> was waiting for CFBot to do its regular bitrot check; but that hasn't
> happened yet after 4 days.
> I understand that this is probably due to the high rate of new patch
> revisions that get priority in the queue; but that doesn't quite
> fulfill my want for information in this case.

Just in case, if you only want to know whether the cfbot would be happy with
your patches you can run the exact same checks using a personal github repo, as
documented at src/tools/ci/README.

You could also send the URL of a successful run on the related threads, or as
an annotation on the cf entries to let possible reviewers know that the patch
is still in a good shape even if the cfbot is currently still broken.




Re: Probable CF bot degradation

2022-03-20 Thread Matthias van de Meent
On Fri, 18 Mar 2022 at 19:52, Thomas Munro  wrote:
> Unfortunately cfbot didn't handle that failure very well and it was
> waiting for a long timeout before scheduling more jobs.  It's going
> again now, and I'll try to make it more resilient against that type of
> failure...

I noticed that two of my patches (37/3543 and 37/3542) both failed due
to a bad commit on master (076f4d9). The issue was fixed an hour later
with b61e6214; but the pipeline for these patches hasn't run since.
Because doing a no-op update would only clutter people's inboxes, I
was waiting for CFBot to do its regular bitrot check; but that hasn't
happened yet after 4 days.
I understand that this is probably due to the high rate of new patch
revisions that get priority in the queue; but that doesn't quite
fulfill my want for information in this case.

Would you know how long the expected bitrot re-check period for CF
entries that haven't been updated is, or could the bitrot-checking
queue be displayed somewhere to indicate the position of a patch in
this queue?
Additionally, are there plans to validate commits of the main branch
before using them as a base for CF entries, so that "bad" commits on
master won't impact CFbot results as easily?

Kind regards,

Matthias van de Meent




Re: pgsql: Add option to use ICU as global locale provider

2022-03-20 Thread Julien Rouhaud
On Sun, Mar 20, 2022 at 11:03:38AM +0100, Peter Eisentraut wrote:
> On 19.03.22 05:14, Julien Rouhaud wrote:
> > On Fri, Mar 18, 2022 at 03:09:59PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2022-03-18 20:28:58 +0100, Peter Eisentraut wrote:
> > > > Why does your patch introduce a function check_icu_locale() that is only
> > > > called once?  Did you have further plans for that?
> > > 
> > > I like that it moves ICU code out of dbcommands.c
> > 
> > Yes, it seemed cleaner this way.  But more importantly code outside 
> > pg_locale.c
> > really shouldn't have to deal with ICU specific version code.
> > 
> > I'm attaching a v2, addressing Peter and Tom comments to not duplicate the 
> > old
> > ICU versions attribute function.  I removed the ICU locale check entirely 
> > (for
> > consistency across ICU version) thus removing any need for ucol.h include in
> > initdb.
> 
> committed

Thanks!




Re: Column Filtering in Logical Replication

2022-03-20 Thread Tomas Vondra



On 3/20/22 07:23, Amit Kapila wrote:
> On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila  wrote:
>>
>> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
>>  wrote:
>>
>>> So the question is why those two sync workers never complete - I guess
>>> there's some sort of lock wait (deadlock?) or infinite loop.
>>>
>>
>> It would be a bit tricky to reproduce this even if the above theory is
>> correct but I'll try it today or tomorrow.
>>
> 
> I am able to reproduce it with the help of a debugger. Firstly, I have
> added the LOG message and some While (true) loops to debug sync and
> apply workers. Test setup
> 
> Node-1:
> create table t1(c1 int);
> create table t2(c1 int);
> insert into t1 values(1);
> create publication pub1 for table t1;
> create publication pub2;
> 
> Node-2:
> change max_sync_workers_per_subscription to 1 in postgresql.conf
> create table t1(c1 int);
> create table t2(c1 int);
> create subscription sub1 connection 'dbname = postgres' publication pub1;
> 
> Till this point, just allow debuggers in both workers just continue.
> 
> Node-1:
> alter publication pub1 add table t2;
> insert into t1 values(2);
> 
> Here, we have to debug the apply worker such that when it tries to
> apply the insert, stop the debugger in function apply_handle_insert()
> after doing begin_replication_step().
> 
> Node-2:
> alter subscription sub1 set publication pub1, pub2;
> 
> Now, continue the debugger of apply worker, it should first start the
> sync worker and then exit because of parameter change. All of these
> debugging steps are to just ensure the point that it should first
> start the sync worker and then exit. After this point, table sync
> worker never finishes and log is filled with messages: "reached
> max_sync_workers_per_subscription limit" (a newly added message by me
> in the attached debug patch).
> 
> Now, it is not completely clear to me how exactly '013_partition.pl'
> leads to this situation but there is a possibility based on the LOGs
> it shows.
> 

Thanks, I'll take a look later. From the description it seems this is an
issue that existed before any of the patches, right? It might be more
likely to hit due to some test changes, but the root cause is older.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Add option to use ICU as global locale provider

2022-03-20 Thread Peter Eisentraut

On 19.03.22 18:53, Christoph Berg wrote:

Re: Peter Eisentraut

Since some (legacy) code still uses the libc locale facilities
directly, we still need to set the libc global locale settings even if
ICU is otherwise selected.  So pg_database now has three
locale-related fields: the existing datcollate and datctype, which are
always set, and a new daticulocale, which is only set if ICU is
selected.  A similar change is made in pg_collation for consistency,
but in that case, only the libc-related fields or the ICU-related
field is set, never both.


Since the intended usage seems to be that databases should either be
using libc, or the ICU locales, but probably not both at the same
time, does it make sense to clutter the already very wide `psql -l`
output with two new extra columns?


Good point, let me think about that.




Re: pgsql: Add option to use ICU as global locale provider

2022-03-20 Thread Peter Eisentraut

On 19.03.22 05:14, Julien Rouhaud wrote:

On Fri, Mar 18, 2022 at 03:09:59PM -0700, Andres Freund wrote:

Hi,

On 2022-03-18 20:28:58 +0100, Peter Eisentraut wrote:

Why does your patch introduce a function check_icu_locale() that is only
called once?  Did you have further plans for that?


I like that it moves ICU code out of dbcommands.c


Yes, it seemed cleaner this way.  But more importantly code outside pg_locale.c
really shouldn't have to deal with ICU specific version code.

I'm attaching a v2, addressing Peter and Tom comments to not duplicate the old
ICU versions attribute function.  I removed the ICU locale check entirely (for
consistency across ICU version) thus removing any need for ucol.h include in
initdb.


committed




Re: BufferAlloc: don't take two simultaneous locks

2022-03-20 Thread Yura Sokolov
On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov  
> wrote in 
> > On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov 
> > >  wrote in 
> > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > > returns the stashed)" case the stashed element is returned to its
> > > original partition.  But it is not what I mentioned.
> > > 
> > > On the other hand, once the stashed element is reused by HASH_ENTER,
> > > it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
> > > from old partition) case.  I suspect that the frequent freelist
> > > starvation comes from the latter case.
> > 
> > Doubtfully. Due to probability theory, a single partition will
> > doubtfully be too overflowed. Therefore, freelist.
> 
> Yeah.  I think so generally.
> 
> > But! With 128kB shared buffers there are just 32 buffers. With 32 entries
> > for 32 freelist partitions, some freelist partition will certainly
> > have 0 entries even if all entries are in freelists.
> 
> Anyway, it's an extreme condition and the starvation happens only at a
> negligible ratio.
> 
> > > RETURNED: 2
> > > ALLOCED: 0
> > > BORROWED: 435
> > > REUSED: 495444
> > > ASSIGNED: 495467 (-23)
> > > 
> > > Now "BORROWED" happens 0.8% of REUSED
> > 
> > 0.08% actually :)
> 
> Mmm.  Doesn't matter:p
> 
> > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > > ...
> > > > > > Strange thing: both master and patched version has higher
> > > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > > than in first october version [1]. But lower tps at higher
> > > > > > connections number (>= 191 clients).
> > > > > > I'll try to bisect on master this unfortunate change.
> ...
> > I've checked. Looks like something had changed on the server, since
> > old master commit behaves now same to new one (and differently to
> > how it behaved in October).
> > I remember maintainance downtime of the server in november/december.
> > Probably, kernel were upgraded or some system settings were changed.
> 
> One thing I have a little concern is that numbers shows 1-2% of
> degradation steadily for connection numbers < 17.
> 
> I think there are two possible cause of the degradation.
> 
> 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER.
>   This might cause degradation for memory-contended use.
> 
> 2. nallocs operation might cause degradation on non-shared dynahashes?
>   I believe doesn't but I'm not sure.
> 
>   On a simple benchmarking with pgbench on a laptop, dynahash
>   allocation (including shared and non-shared) happened about 50
>   times per second with 10 processes and 200 with 100 processes.
> 
> > > I don't think nalloced needs to be the same width to long.  For the
> > > platforms with 32-bit long, anyway the possible degradation if any by
> > > 64-bit atomic there doesn't matter.  So don't we always define the
> > > atomic as 64bit and use the pg_atomic_* functions directly?
> > 
> > Some 32bit platforms has no native 64bit atomics. Then they are
> > emulated with locks.
> > 
> > Well, and for 32bit platform long is just enough. Why spend other
> > 4 bytes per each dynahash?
> 
> I don't think the additional bytes matter, but emulated atomic
> operations can matter. However I'm not sure which platforms use the
> fallback implementations.  (x86 seems to have __sync_fetch_and_add()
> since P4).
> 
> My opinion in the previous mail is that if that level of degradation
> caused by emulated atomic operations matters, we shouldn't use atomics
> there at all, since atomic operations on modern platforms are also not
> free.
> 
> In relation to 2 above, if we observe that the degradation disappears
> when we (tentatively) use non-atomic operations for nalloced, we should go
> back to the previous per-freelist nalloced.

Here is version with nalloced being union of appropriate atomic and
long.

--

regards
Yura Sokolov
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both locks simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.

Tags: lwlock_numa
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git 
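The reordering described in the commit message — pin first, drop the old partition lock before taking the new one — can be sketched with ordinary locks (a toy model, not the bufmgr code; the lock and state names are invented here, and the pin counter stands in for the buffer pin that makes the brief unmapped window safe):

```python
# Toy model: relocating a buffer between hash partitions without ever
# holding two partition locks at once.
import threading

old_partition = threading.Lock()
new_partition = threading.Lock()

buffer_state = {"pinned": 0, "partition": "old"}

def relocate():
    buffer_state["pinned"] += 1           # pin first: eviction is blocked
    with old_partition:
        buffer_state["partition"] = None  # clear tag, remove old mapping
    # Window here: the buffer is in neither mapping, but it is pinned,
    # so a concurrent lookup simply misses and cannot evict it.
    with new_partition:
        buffer_state["partition"] = "new" # insert into new mapping
    buffer_state["pinned"] -= 1

relocate()
assert buffer_state == {"pinned": 0, "partition": "new"}
```

Because each lock is taken and released on its own, no thread can participate in a two-lock dependency chain, which is the contention pattern the patch removes.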

Re: A test for replay of regression tests

2022-03-20 Thread Thomas Munro
On Sun, Mar 20, 2022 at 5:20 PM Thomas Munro  wrote:
> Another failure under 027_stream_regress.pl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2022-03-16%2005%3A58%3A05
>
>  vacuum   ... FAILED 3463 ms
>
> I'll try to come up with the perl needed to see the regression.diffs
> next time...

Here's my proposed change to achieve that.

Here's an example of where it shows up if it fails (from my
deliberately sabotaged CI run
https://cirrus-ci.com/build/6730380228165632 where I was verifying
that it also works on Windows):

Unix: 
https://api.cirrus-ci.com/v1/artifact/task/5421419923243008/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress
Windows: 
https://api.cirrus-ci.com/v1/artifact/task/4717732481466368/log/src/test/recovery/tmp_check/log/regress_log_027_stream_regress
From 5f6257c4b4a44750e85d77b01c20d6cd39c5c795 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 20 Mar 2022 21:20:42 +1300
Subject: [PATCH] Log regression.diffs in 027_stream_regress.pl.

To help diagnose the reasons for a regression test failure inside this
TAP test, dump the contents of regression.diffs to the log.  While the
CI scripts show it automatically, the build farm client does not.

Discussion: https://postgr.es/m/CA%2BhUKGK-q9utvxorA8sCwF3S4SzBmxxDgneP4rLqeHWdZxM4Gg%40mail.gmail.com
---
 src/test/recovery/t/027_stream_regress.pl | 30 ---
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl
index c40951b7ba..aa972f8958 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -53,15 +53,27 @@ my $outputdir = $PostgreSQL::Test::Utils::tmp_check;
 
 # Run the regression tests against the primary.
 my $extra_opts = $ENV{EXTRA_REGRESS_OPTS} || "";
-system_or_bail($ENV{PG_REGRESS} . " $extra_opts " .
-			   "--dlpath=\"$dlpath\" " .
-			   "--bindir= " .
-			   "--host=" . $node_primary->host . " " .
-			   "--port=" . $node_primary->port . " " .
-			   "--schedule=../regress/parallel_schedule " .
-			   "--max-concurrent-tests=20 " .
-			   "--inputdir=../regress " .
-			   "--outputdir=\"$outputdir\"");
+my $rc = system($ENV{PG_REGRESS} . " $extra_opts " .
+			"--dlpath=\"$dlpath\" " .
+			"--bindir= " .
+			"--host=" . $node_primary->host . " " .
+			"--port=" . $node_primary->port . " " .
+			"--schedule=../regress/parallel_schedule " .
+			"--max-concurrent-tests=20 " .
+			"--inputdir=../regress " .
+			"--outputdir=\"$outputdir\"");
+if ($rc != 0)
+{
+	# Dump out the regression diffs file, if there is one
+	my $diffs = "$outputdir/regression.diffs";
+	if (-e $diffs)
+	{
+		print "=== dumping $diffs ===\n";
+		print slurp_file($diffs);
+		print "=== EOF ===\n";
+	}
+}
+is($rc, 0, 'regression tests pass');
 
 # Clobber all sequences with their next value, so that we don't have
 # differences between nodes due to caching.
-- 
2.30.2
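For readers not familiar with the Perl TAP framework, the failure-handling pattern the patch introduces can be sketched in Python (a sketch only, not the actual test code; the command and paths are placeholders):

```python
# Sketch of the pattern from the patch above: run the regression driver,
# and if it fails, dump regression.diffs into the log before reporting
# the failure, so build-farm logs contain the diffs.
import subprocess
from pathlib import Path

def run_regress(cmd, outputdir):
    """Run the test command; on failure, print regression.diffs if present."""
    rc = subprocess.call(cmd)
    if rc != 0:
        diffs = Path(outputdir) / "regression.diffs"
        if diffs.exists():
            print(f"=== dumping {diffs} ===")
            print(diffs.read_text())
            print("=== EOF ===")
    return rc
```

The point of the change is that the exit code is checked and reported *after* the diffs are dumped, instead of bailing out immediately as `system_or_bail` does.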



Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-20 Thread Bharath Rupireddy
On Fri, Mar 18, 2022 at 8:07 PM Nitin Jadhav
 wrote:
>
> Hi Bharath,
>
> Due to recent commits on master, the pg_walinpect module is not
> compiling. Kindly update the patch.

Thanks Nitin. Here's an updated v12 patch set. I will respond to
Andres' comments in a while.

Regards,
Bharath Rupireddy.


v12-0001-pg_walinspect.patch
Description: Binary data


v12-0001-pg_walinspect-tests.patch
Description: Binary data


v12-0001-pg_walinspect-docs.patch
Description: Binary data


Re: WIP: WAL prefetch (another approach)

2022-03-20 Thread Thomas Munro
On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro  wrote:
> Clearly 018_wal_optimize.pl is flapping

Correction, 019_replslot_limit.pl, discussed at
https://www.postgresql.org/message-id/flat/83b46e5f-2a52-86aa-fa6c-8174908174b8%40iki.fi
.




Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2022-03-20 Thread Tatsuo Ishii
> Hi Yugo,
> 
> I tested with serialization error scenario by setting:
> default_transaction_isolation = 'repeatable read'
> The result was:
> 
> $ pgbench -t 10 -c 10 --max-tries=10 test
> transaction type: 
> scaling factor: 10
> query mode: simple
> number of clients: 10
> number of threads: 1
> maximum number of tries: 10
> number of transactions per client: 10
> number of transactions actually processed: 100/100
> number of failed transactions: 0 (0.000%)
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> latency average = 5.306 ms
> initial connection time = 15.575 ms
> tps = 1884.516810 (without initial connection time)
> 
> I had hard time to understand what those numbers mean:
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> 
> It seems "total number of retries" matches with the number of ERRORs
> reported in PostgreSQL. Good. What I am not sure is "number of
> transactions retried". What does this mean?

Oh, ok. I see it now. It turns out that "number of transactions
retried" does not actually mean the number of transactions
retried. Suppose pgbench executes the following in a session:

BEGIN;  -- transaction A starts
:
(ERROR)
ROLLBACK; -- transaction A aborts

(retry)

BEGIN;  -- transaction B starts
:
(ERROR)
ROLLBACK; -- transaction B aborts

(retry)

BEGIN;  -- transaction C starts
:
END;  -- finally succeeds

In this case "total number of retries:" = 2 and "number of
transactions retried:" = 1. In this patch, transactions A, B and C are
regarded as the "same" transaction, so the retried transaction count
becomes 1. But it's confusing to use the word "transaction" here
because A, B and C are different transactions. I think it would be
better to use a different word instead of "transaction", something
like "cycle"? i.e.

number of cycles retried: 35 (35.000%)
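To make the distinction between the two counters concrete, here is a small sketch (not pgbench's actual implementation); each element of `attempts_per_cycle` is how many attempts one transaction cycle needed before its final, successful transaction committed:

```python
# Illustrative sketch of the two pgbench counters being discussed; this
# is not pgbench's implementation. attempts_per_cycle[i] is how many
# attempts cycle i took before it finally committed.
def retry_stats(attempts_per_cycle):
    # A cycle that needed N attempts was retried N - 1 times.
    total_retries = sum(n - 1 for n in attempts_per_cycle)
    # "number of transactions retried" counts cycles retried at least once.
    cycles_retried = sum(1 for n in attempts_per_cycle if n > 1)
    return total_retries, cycles_retried

# The A/B/C example above: one cycle, three attempts (A and B aborted,
# C succeeded).
print(retry_stats([3]))  # -> (2, 1)
```

So the reported "total number of retries" counts every aborted attempt, while "number of transactions retried" counts distinct cycles, which is why the two numbers differ (74 vs. 35 in the run above).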

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Column Filtering in Logical Replication

2022-03-20 Thread Amit Kapila
On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila  wrote:
>
> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
>  wrote:
>
> > So the question is why those two sync workers never complete - I guess
> > there's some sort of lock wait (deadlock?) or infinite loop.
> >
>
> It would be a bit tricky to reproduce this even if the above theory is
> correct but I'll try it today or tomorrow.
>

I am able to reproduce it with the help of a debugger. First, I added
a LOG message and some while(true) loops to debug the sync and apply
workers. Test setup:

Node-1:
create table t1(c1 int);
create table t2(c1 int);
insert into t1 values(1);
create publication pub1 for table t1;
create publication pub2;

Node-2:
set max_sync_workers_per_subscription to 1 in postgresql.conf
create table t1(c1 int);
create table t2(c1 int);
create subscription sub1 connection 'dbname = postgres' publication pub1;

Up to this point, just let the debuggers attached to both workers continue.

Node-1:
alter publication pub1 add table t2;
insert into t1 values(2);

Here, debug the apply worker so that when it tries to apply the
insert, it stops in apply_handle_insert() just after
begin_replication_step().

Node-2:
alter subscription sub1 set publication pub1, pub2;

Now continue the apply worker in the debugger; it should first start
the sync worker and then exit because of the parameter change. All of
these debugging steps just ensure that it starts the sync worker
before exiting. After this point, the table sync worker never finishes
and the log fills with the message "reached
max_sync_workers_per_subscription limit" (a message newly added by me
in the attached debug patch).

Now, it is not completely clear to me how exactly '013_partition.pl'
leads to this situation, but based on the logs it shows, that is a
possibility.

-- 
With Regards,
Amit Kapila.


debug_sub_workers_1.patch
Description: Binary data