Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Amit Kapila
On Tue, Aug 17, 2021 at 4:34 PM Andres Freund  wrote:
>
> On 2021-08-17 10:54:30 +0530, Amit Kapila wrote:
> > 5. How can we provide a strict mechanism to not allow to use dsm APIs
> > for non-dsm FileSet? One idea could be that we can have a variable
> > (probably bool) in SharedFileSet structure which will be initialized
> > in SharedFileSetInit based on whether the caller has provided dsm
> > segment. Then in other DSM-based APIs, we can check if it is used for
> > the wrong type.
>
> Well, isn't the issue here that it's not a shared file set in case you
> explicitly don't want to share it? ISTM that the proper way to address
> this would be to split out a FileSet from SharedFileSet that's then used
> for worker.c and sharedfileset.c.
>

Okay, but note that to accomplish the same, we need to tweak the
BufFile (buffile.c) APIs as well so that they can work with FileSet.
As per the initial analysis, there doesn't seem to be any problem with
that though.
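
To illustrate, the kind of split being talked about might look roughly like
this (a sketch only -- the struct layout and BufFile entry points below are
hypothetical, not an actual patch):

typedef struct FileSet
{
	pid_t		creator_pid;	/* PID of the creating process */
	uint32		number;			/* per-PID identifier for temp file names */
	Oid			tablespaces[8];	/* OIDs of tablespaces to use */
	int			ntablespaces;	/* number of tablespaces to use */
} FileSet;

typedef struct SharedFileSet
{
	FileSet		fs;				/* the plain fileset */
	slock_t		mutex;			/* protects the reference count */
	int			refcnt;			/* number of attached backends */
} SharedFileSet;

/* buffile.c would then need FileSet-based entry points, e.g.: */
extern BufFile *BufFileCreateFileSet(FileSet *fileset, const char *name);
extern BufFile *BufFileOpenFileSet(FileSet *fileset, const char *name, int mode);

That way worker.c could use a bare FileSet without ever touching the dsm_*
APIs, while sharedfileset.c keeps the DSM attach/detach logic on top of it.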

-- 
With Regards,
Amit Kapila.




pg_verifybackup can fail with a valid backup for TLI > 1

2021-08-17 Thread Kyotaro Horiguchi
Hello.

pg_verifybackup can fail with a valid backup when the backup was taken
from TLI > 1.

=
# initdb
$ pg_ctl start (just to make sure)
$ pg_ctl stop
$ touch $PGDATA/standby.signal
$ pg_ctl start
$ pg_ctl promote
$ psql
postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 0/14FE340
(1 row)
postgres=# checkpoint;
postgres=# ^D
$ pg_basebackup -Fp -h /tmp -D tmpbk
$ pg_verifybackup tmpbk
(thinking.. thinking.. zzz.. for several seconds)
pg_waldump: fatal: could not find file "00020001": No such file 
or directory
pg_verifybackup: error: WAL parsing failed for timeline 2
=

This is because backup_manifest has the wrong values, as shown below.

> "WAL-Ranges": [
> { "Timeline": 2, "Start-LSN": "0/14FE248", "End-LSN": "0/3000100" }
> ],

The "Start-LSN" above is the beginning of timeline 2, not the
backup-start LSN. The segment had been removed by the checkpoint.


The comment for AddWALInfoToBackupManifest() says:
> * Add information about the WAL that will need to be replayed when restoring
> * this backup to the manifest.

So I concluded that it's a thinko.

Please see the attached.  It needs to be back-patched to 13 but 13
doesn't accept the patch as is due to wording changes in the TAP tests.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c9c299091ccae694a9c1d58c616407e55be53f6f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 18 Aug 2021 11:43:53 +0900
Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to hold the WAL range
to apply at recovery, it actually starts from the beginning of the
oldest timeline if the starting TLI > 1.  As a result,
pg_verifybackup may report a false failure for a valid
backup. Adjust the logic to make WAL-Ranges start at the actual
backup start LSN.

Backpatch to PG13, where this feature was introduced.
---
 src/backend/replication/backup_manifest.c | 10 +-
 src/bin/pg_verifybackup/t/007_wal.pl  | 18 +-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 8882444025..d7a8584124 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -255,11 +255,11 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
 	errmsg("expected end timeline %u but found timeline %u",
 		   starttli, entry->tli));
 
-		if (!XLogRecPtrIsInvalid(entry->begin))
-			tl_beginptr = entry->begin;
-		else
-		{
+		if (starttli == entry->tli)
 			tl_beginptr = startptr;
+		else
+		{
+			tl_beginptr = entry->begin;
 
 			/*
 			 * If we reach a TLI that has no valid beginning LSN, there can't
@@ -267,7 +267,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
 			 * better have arrived at the expected starting TLI. If not,
 			 * something's gone horribly wrong.
 			 */
-			if (starttli != entry->tli)
+			if (XLogRecPtrIsInvalid(entry->begin))
 ereport(ERROR,
 		errmsg("expected start timeline %u but found timeline %u",
 			   starttli, entry->tli));
diff --git a/src/bin/pg_verifybackup/t/007_wal.pl b/src/bin/pg_verifybackup/t/007_wal.pl
index adf60fb755..bc52f0d1d9 100644
--- a/src/bin/pg_verifybackup/t/007_wal.pl
+++ b/src/bin/pg_verifybackup/t/007_wal.pl
@@ -10,7 +10,7 @@ use Config;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 7;
+use Test::More tests => 9;
 
 # Start up the server and take a backup.
 my $primary = PostgresNode->new('primary');
@@ -59,3 +59,19 @@ command_fails_like(
 	[ 'pg_verifybackup', $backup_path ],
 	qr/WAL parsing failed for timeline 1/,
 	'corrupt WAL file causes failure');
+
+#
+# Check that WAL-Ranges has the correct values when TLI > 1
+$primary->stop;
+$primary->append_conf('standby.signal');
+$primary->start;
+$primary->promote;
+# base-backup runs a checkpoint but just to make sure.
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal(); CHECKPOINT;');
+
+my $backup_path2 = $primary->backup_dir . '/test_tli';
+$primary->command_ok([ 'pg_basebackup', '-D', $backup_path2, '--no-sync' ],
+	"base backup 2 ok");
+# Should pass this case
+command_ok([ 'pg_verifybackup', $backup_path2 ],
+	'verifybackup should pass for a valid backup on timeline > 1');
-- 
2.27.0

>From 59371c3a1fd6bb504362b00fc3511a791f7d1fb4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 18 Aug 2021 13:58:15 +0900
Subject: [PATCH v1] Fix basebackup to generate correct WAL-Ranges info.

While WAL-Ranges in backup_manifest is supposed to hold the WAL range
to apply at recovery, it starts from the beginning of the oldest
timeline if the starting TLI > 1.  As a result, pg_verifybackup may
report a false failure for a valid backup. Adjust the logic to make
the WAL-Ranges start at the actual backup start LSN.

Backpatch to PG13, where this feature was introduced.

Re: Skipping logical replication transactions on subscriber side

2021-08-17 Thread Masahiko Sawada
On Wed, Aug 18, 2021 at 12:02 PM Amit Kapila  wrote:
>
> On Wed, Aug 18, 2021 at 6:53 AM Masahiko Sawada  wrote:
> >
> > On Tue, Aug 17, 2021 at 2:35 PM Amit Kapila  wrote:
> > >
> > > > It's right that we use "STREAM STOP" rather than "STREAM END" in many
> > > > places such as elog messages, a callback name, and source code
> > > > comments. As far as I have found there are two places where we’re
> > > > using "STREAM STOP": LOGICAL_REP_MSG_STREAM_END and a description in
> > > > doc/src/sgml/protocol.sgml. Isn't it better to fix these
> > > > inconsistencies in the first place? I think “STREAM STOP” would be
> > > > more appropriate.
> > > >
> > >
> > > I think keeping STREAM_END in the enum 'LOGICAL_REP_MSG_STREAM_END'
> > > seems to be a bit better because of the value 'E' we use for it.
> >
> > But I think we don't care about the actual value of
> > LOGICAL_REP_MSG_STREAM_END since we use the enum value rather than
> > 'E'?
> >
>
> True, but here we are trying to be consistent with other enum values
> where we try to use the first letter of the last word (which is E in
> this case). I can see there are other cases where we are not
> consistent so it won't be a big deal if we won't be consistent here. I
> am neutral on this one, so, if you feel using STREAM_STOP would be
> better from a code readability perspective then that is fine.

In addition to code readability, there is a description in the doc
that mentions "Stream End" while the later description uses "Stream
Stop", which seems like a bug in the doc to me:

The following messages (Stream Start, Stream End, Stream Commit, and
Stream Abort) are available since protocol version 2.

(snip)

Stream Stop

Perhaps it's better to hear other opinions too, but I've attached the
patch. Please review it.
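
For reference, the byte values behind this naming question look roughly like
this in src/include/replication/logicalproto.h (excerpt, other message types
omitted):

typedef enum LogicalRepMsgType
{
	/* ... other message types ... */
	LOGICAL_REP_MSG_STREAM_START = 'S',
	LOGICAL_REP_MSG_STREAM_END = 'E',	/* described as "Stream Stop" elsewhere */
	LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
	LOGICAL_REP_MSG_STREAM_ABORT = 'A'
} LogicalRepMsgType;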

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


0001-Rename-LOGICAL_REP_MSG_STREAM_END-to-LOGICAL_REP_MSG.patch
Description: Binary data


Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Dilip Kumar
On Wed, Aug 18, 2021 at 9:30 AM Amit Kapila  wrote:
>
> On Wed, Aug 18, 2021 at 8:23 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Wed, Aug 18, 2021 9:17 AM houzj.f...@fujitsu.com wrote:
> > > Hi,
> > >
> > > I took a quick look at the v2 patch and noticed a typo.
> > >
> > > + * backends and render it read-only.  If missing_ok is true then it will 
> > > return
> > > + * NULL if file doesn not exist otherwise error.
> > >   */
> > > doesn not=> doesn't
> > >
> >
> > Here are some other comments:
> >
> > 1).
> > +BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode,
> > + bool missing_ok)
> >  {
> > BufFile*file;
> > charsegment_name[MAXPGPATH];
> > ...
> > files = palloc(sizeof(File) * capacity);
> > ...
> > @@ -318,10 +320,15 @@ BufFileOpenShared(SharedFileSet *fileset, const char 
> > *name, int mode)
> >  * name.
> >  */
> > if (nfiles == 0)
> > +   {
> > +   if (missing_ok)
> > +   return NULL;
> > +
> >
> > I think it might be better to pfree(files) when return NULL.
> >
> >
> > 2).
> > /* Delete the subxact file and release the memory, if it exist */
> > -   if (ent->subxact_fileset)
> > -   {
> > -   subxact_filename(path, subid, xid);
> > -   SharedFileSetDeleteAll(ent->subxact_fileset);
> > -   pfree(ent->subxact_fileset);
> > -   ent->subxact_fileset = NULL;
> > -   }
> > -
> > -   /* Remove the xid entry from the stream xid hash */
> > -   hash_search(xidhash, (void *) &xid, HASH_REMOVE, NULL);
> > +   subxact_filename(path, subid, xid);
> > +   SharedFileSetDelete(xidfileset, path, true);
> >
> > Without the patch it doesn't throw an error if the file does not exist,
> > but with the patch it passes error_on_failure=true to SharedFileSetDelete().
> >
>
> Don't we need to use BufFileDeleteShared instead of
> SharedFileSetDelete as you have used to remove the changes file?

Yeah, it should be BufFileDeleteShared.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Amit Kapila
On Wed, Aug 18, 2021 at 8:23 AM houzj.f...@fujitsu.com
 wrote:
>
> On Wed, Aug 18, 2021 9:17 AM houzj.f...@fujitsu.com wrote:
> > Hi,
> >
> > I took a quick look at the v2 patch and noticed a typo.
> >
> > + * backends and render it read-only.  If missing_ok is true then it will 
> > return
> > + * NULL if file doesn not exist otherwise error.
> >   */
> > doesn not=> doesn't
> >
>
> Here are some other comments:
>
> 1).
> +BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode,
> + bool missing_ok)
>  {
> BufFile*file;
> charsegment_name[MAXPGPATH];
> ...
> files = palloc(sizeof(File) * capacity);
> ...
> @@ -318,10 +320,15 @@ BufFileOpenShared(SharedFileSet *fileset, const char 
> *name, int mode)
>  * name.
>  */
> if (nfiles == 0)
> +   {
> +   if (missing_ok)
> +   return NULL;
> +
>
> I think it might be better to pfree(files) when return NULL.
>
>
> 2).
> /* Delete the subxact file and release the memory, if it exist */
> -   if (ent->subxact_fileset)
> -   {
> -   subxact_filename(path, subid, xid);
> -   SharedFileSetDeleteAll(ent->subxact_fileset);
> -   pfree(ent->subxact_fileset);
> -   ent->subxact_fileset = NULL;
> -   }
> -
> -   /* Remove the xid entry from the stream xid hash */
> -   hash_search(xidhash, (void *) &xid, HASH_REMOVE, NULL);
> +   subxact_filename(path, subid, xid);
> +   SharedFileSetDelete(xidfileset, path, true);
>
> Without the patch it doesn't throw an error if the file does not exist,
> but with the patch it passes error_on_failure=true to SharedFileSetDelete().
>

Don't we need to use BufFileDeleteShared instead of
SharedFileSetDelete as you have used to remove the changes file?

-- 
With Regards,
Amit Kapila.




Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead

2021-08-17 Thread Laurenz Albe
On Tue, 2021-08-17 at 02:14 -0700, Andres Freund wrote:
> On 2021-08-17 10:44:51 +0200, Laurenz Albe wrote:
> > On Sun, 2021-08-01 at 13:55 -0700, Andres Freund wrote:
> > > We maintain last_report as GetCurrentTransactionStopTimestamp(), but then 
> > > use
> > > a separate timestamp in pgstat_send_connstats() to compute the difference 
> > > from
> > > last_report, which is then set to GetCurrentTransactionStopTimestamp()'s
> > > return value.
> > 
> > Makes sense to me.  How about passing "now", which was just initialized from
> > GetCurrentTransactionStopTimestamp(), as additional parameter to
> > pgstat_send_connstats() and use that value instead of taking the current 
> > time?
> 
> Yes.

Here is a patch for that.

> > > I'm also not all that happy with sending yet another UDP packet for this.
> > 
> > Are you suggesting that connection statistics should be shoehorned into
> > some other statistics message?  That would reduce the number of UDP packets,
> > but it sounds ugly and confusing to me.
> 
> That ship already has sailed. Look at struct PgStat_MsgTabstat
> 
> Given that we transport the number of commits/rollbacks and block read/write
> time, adding the time the connection was active/inactive doesn't really seem
> like it makes things meaningfully worse?

Point taken.

I looked at the other statistics sent in pgstat_report_stat(), and I see
none that are sent unconditionally.  Are you thinking of this:

/*
 * Send partial messages.  Make sure that any pending xact commit/abort
 * gets counted, even if there are no table stats to send.
 */
if (regular_msg.m_nentries > 0 ||
pgStatXactCommit > 0 || pgStatXactRollback > 0)
pgstat_send_tabstat(&regular_msg);
if (shared_msg.m_nentries > 0)
pgstat_send_tabstat(&shared_msg);

I can't think of a way to hack this up that wouldn't make my stomach turn.

> > > Alternatively we could just not send an update to connection stats every 
> > > 500ms
> > > for every active connection, but only do so at connection end. The 
> > > database
> > > stats would only contain stats for disconnected sessions, while the stats 
> > > for
> > > "live" connections are maintained via backend_status.c.
> > 
> > That was my original take, but if I remember right, Magnus convinced me that
> > it would be more useful to have statistics for open sessions as well.
> > With a connection pool, connections can stay open for a very long time,
> > and the accuracy and usefulness of the statistics would become questionable.
> 
> That's not a contradiction to what I propose. Having the data available via
> backend_status.c allows to sum up the data for active connections and the data
> for past connections.
> 
> I think it's also just cleaner to not constantly report changing preliminary
> data as pgstat_send_connstats() does.

Currently, the data are kept in static variables in the backend process.
That would have to change for such an approach, right?
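
(For reference, the counters in question currently live in backend-local
static variables in pgstat.c, roughly:

	/* accumulated active / idle-in-transaction time of this backend */
	static PgStat_Counter pgStatActiveTime = 0;
	static PgStat_Counter pgStatTransactionIdleTime = 0;

so any "report only at disconnect" scheme would have to move or expose that
state somehow.)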

> Doubling the number of UDP messages in common workloads seems also problematic
> enough that it should be addressed for 14.

Ok, but I don't know how to go about it.

Yours,
Laurenz Albe
From 951820a8e312584f283ef4b5bde006c7f41e0d9d Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 18 Aug 2021 04:52:57 +0200
Subject: [PATCH] Improve performance and accuracy of session statistics

Rather than calling GetCurrentTimestamp() in pgstat_send_connstats(),
which causes an additional system call, reuse the value just
calculated in pgstat_report_stat().

Since the "now" value in pgstat_report_stat() will become the next
"last_report" value, this will also improve the accuracy of the
statistics, since the timestamp difference calculated in
pgstat_send_connstats() will become the correct one.

This patch does not address the issue that yet another UDP packet
is sent to the statistics collector for session statistics.

Discussion: https://postgr.es/m/20210801205501.nyxzxoelqoo4x2qc%40alap3.anarazel.de
---
 src/backend/postmaster/pgstat.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index cecc30..77d734a0ba 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -324,7 +324,7 @@ static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
 static void pgstat_send_slru(void);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-static void pgstat_send_connstats(bool disconnect, TimestampTz last_report);
+static void pgstat_send_connstats(bool disconnect, TimestampTz last_report, TimestampTz now);
 
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
@@ -879,7 +879,7 @@ pgstat_report_stat(bool disconnect)
 
 	/* for backends, send connection statistics */
 	if (MyBackendType == B_BACKEND)
-		pgstat_send_connstats(disconnect, last_report);
+		pgstat_send_connstats(disconnect, last_report, now);
 
 	last_report = now;

Re: PoC/WIP: Extended statistics on expressions

2021-08-17 Thread Justin Pryzby
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this
> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple
> column reference.

> From: Tomas Vondra 
> Date: Mon, 16 Aug 2021 17:19:33 +0200
> Subject: [PATCH 2/2] fix: identify single-attribute references

> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl 
> b/src/bin/pg_dump/t/002_pg_dump.pl
> index a4ee54d516..be1f3a5175 100644
> --- a/src/bin/pg_dump/t/002_pg_dump.pl
> +++ b/src/bin/pg_dump/t/002_pg_dump.pl
> @@ -2811,7 +2811,7 @@ my %tests = (
>   create_sql   => 'CREATE STATISTICS dump_test.test_ext_stats_expr
>   ON (2 * col1) FROM 
> dump_test.test_fifth_table',
>   regexp => qr/^
> - \QCREATE STATISTICS dump_test.test_ext_stats_expr ON 
> ((2 * col1)) FROM dump_test.test_fifth_table;\E
> + \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 
> * col1) FROM dump_test.test_fifth_table;\E


This hunk should be in 0001, no ?

> But I'm not sure 0002 is something we can do without catversion bump. What
> if someone created such "bogus" statistics? It's mostly harmless, because
> the statistics is useless anyway (AFAICS we'll just use the regular one we
> have for the column), but if they do pg_dump, that'll fail because of this
> new restriction.

I think it's okay if pg_dump throws an error, since the fix is as easy as
dropping the stx object.  (It wouldn't be okay if it silently misbehaved.)

Andres concluded similarly with the reverted autovacuum patch:
https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2m...@alap3.anarazel.de

+RMT in case someone wants to argue otherwise.

-- 
Justin




Re: Skipping logical replication transactions on subscriber side

2021-08-17 Thread Amit Kapila
On Wed, Aug 18, 2021 at 6:53 AM Masahiko Sawada  wrote:
>
> On Tue, Aug 17, 2021 at 2:35 PM Amit Kapila  wrote:
> >
> > > It's right that we use "STREAM STOP" rather than "STREAM END" in many
> > > places such as elog messages, a callback name, and source code
> > > comments. As far as I have found there are two places where we’re
> > > using "STREAM STOP": LOGICAL_REP_MSG_STREAM_END and a description in
> > > doc/src/sgml/protocol.sgml. Isn't it better to fix these
> > > inconsistencies in the first place? I think “STREAM STOP” would be
> > > more appropriate.
> > >
> >
> > I think keeping STREAM_END in the enum 'LOGICAL_REP_MSG_STREAM_END'
> > seems to be a bit better because of the value 'E' we use for it.
>
> But I think we don't care about the actual value of
> LOGICAL_REP_MSG_STREAM_END since we use the enum value rather than
> 'E'?
>

True, but here we are trying to be consistent with other enum values
where we try to use the first letter of the last word (which is E in
this case). I can see there are other cases where we are not
consistent so it won't be a big deal if we won't be consistent here. I
am neutral on this one, so, if you feel using STREAM_STOP would be
better from a code readability perspective then that is fine.

-- 
With Regards,
Amit Kapila.




RE: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread houzj.f...@fujitsu.com
On Wed, Aug 18, 2021 9:17 AM houzj.f...@fujitsu.com wrote:
> Hi,
> 
> I took a quick look at the v2 patch and noticed a typo.
> 
> + * backends and render it read-only.  If missing_ok is true then it will 
> return
> + * NULL if file doesn not exist otherwise error.
>   */
> doesn not=> doesn't
>

Here are some other comments:

1).
+BufFileOpenShared(SharedFileSet *fileset, const char *name, int mode,
+ bool missing_ok)
 {
BufFile*file;
charsegment_name[MAXPGPATH];
...
files = palloc(sizeof(File) * capacity);
...
@@ -318,10 +320,15 @@ BufFileOpenShared(SharedFileSet *fileset, const char 
*name, int mode)
 * name.
 */
if (nfiles == 0)
+   {
+   if (missing_ok)
+   return NULL;
+

I think it might be better to pfree(files) when return NULL.


2).
/* Delete the subxact file and release the memory, if it exist */
-   if (ent->subxact_fileset)
-   {
-   subxact_filename(path, subid, xid);
-   SharedFileSetDeleteAll(ent->subxact_fileset);
-   pfree(ent->subxact_fileset);
-   ent->subxact_fileset = NULL;
-   }
-
-   /* Remove the xid entry from the stream xid hash */
-   hash_search(xidhash, (void *) &xid, HASH_REMOVE, NULL);
+   subxact_filename(path, subid, xid);
+   SharedFileSetDelete(xidfileset, path, true);

Without the patch it doesn't throw an error if the file does not exist,
but with the patch it passes error_on_failure=true to SharedFileSetDelete().
Was it intentional?

Best regards,
Houzj




Re: Allow composite foreign keys to reference a superset of unique constraint columns?

2021-08-17 Thread Laurenz Albe
On Tue, 2021-08-17 at 10:45 -0700, Paul Martinez wrote:
> On Tue, Aug 17, 2021 at 8:41 AM Jack Christensen  wrote:
> > The only way to ensure a user can only be a member of a group in the same
> > tenant is for user_group_memberships.tenant_id to be part of the foreign key.
> > And
> > that will only work with a unique key on id and tenant_id in both users and
> > user_groups. It's a bit inelegant to create multiple extra indexes to ensure
> > consistency when existing indexes are enough to ensure uniqueness.
> 
> You could accomplish this by using composite primary keys on the users and
> user_groups tables:
> 
> CREATE TABLE users (
>   id serial,
>   tenant_id int REFERENCES tenants(id),
>   PRIMARY KEY (tenant_id, id)
> );

That is not a proper solution, because it does not guarantee uniqueness of
the "id" column, which is typically what you want.

So I think Jack's example illustrates the benefit of this proposal well.

On the other hand, the SQL standard requires that a foreign key references
a unique constraint, see chapter 11.8 <referential constraint definition>,
Syntax Rules 3) a):

 "If the  specifies a ,
  then there shall be a one-to-one correspondence between the set of
  s contained in that  and the set of
  s contained in the  of a unique constraint
  of the referenced table such that corresponding s are 
equivalent."

So while I personally agree that the proposed feature is useful, I am not
sure if it is useful enough to break the standard in a way that may be
incompatible with future extensions of the standard.

Yours,
Laurenz Albe





Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Peter Geoghegan
On Tue, Aug 17, 2021 at 12:11 PM John Naylor
 wrote:
> I'm not sure it's essential to have "far" fewer categories, if the 
> closed-to-open transition is made less granular through some other mechanism.

See my remarks to Andres about FSM_CATEGORIES from earlier. (It may
not matter if you don't believe that it helps us to have such high
granularity -- that means it won't hurt to get rid of it, which might
be a good enough reason on its own.)

> > I now wonder if the FSM is fundamentally doing the wrong thing by
> > keeping track of all "free space" in every page over time. Why
> > wouldn't we just have a free space management strategy that is
> > generally only concerned with recent events?

> The second paragraph here is an interesting idea and makes a great deal of 
> sense. It would lead to smaller FSMs that are navigated more quickly and 
> locked for shorter durations.

Most importantly, the FSM/whatever doesn't have to be locked at all if
you don't go to it. And so going to the FSM much less often with space
leases or whatever seems like it might be a big win for that reason
(not just the locality stuff that I care about).

> Implicit "closure" seems riskier in my view if you want to bring VM qualities 
> into it, however.

I didn't mean to imply that we'd be changing that data structure. As I
clarified earlier, "merging the FSM and the VM" was meant in a
conceptual way. Not the best choice of words on my part.

> This makes sense as well. Shared memory for more recent / highly contended 
> state, and disk for less recent / less contended / stickier state.

I would like to also make it WAL-logged, which is easier this way.
That isn't my main goal, but why not do it? The way that the FSM works
right now within VACUUM REDO routines is a problem.

-- 
Peter Geoghegan




Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Peter Geoghegan
On Tue, Aug 17, 2021 at 11:24 AM Bruce Momjian  wrote:
> OK, I am trying to think of something simple we could test to see the
> benefit, with few downsides.  I assume the case you are considering is
> that you have a 10 8k-page table, and one page is 80% full and the
> others are 81% full, and if several backends start adding rows at the
> same time, they will all choose the 80%-full page.

That's one problem (there are so many). They start with that same page,
and then fight each other over other pages again and again. This
emergent behavior is a consequence of the heuristic that has the FSM
look for the heap page with the least space that still satisfies a
given request. I mentioned this problem earlier.

> For example:
>
> 1.  page with most freespace is 95% free
> 2.  pages 2,4,6,8,10 have between 86%-95% free
> 3.  five pages
> 4.  proc id 14293 % 5 = 3 so use the third page from #2, page 6

Right now my main concern is giving backends/XIDs their own space, in
bulk. Mostly for concurrent bulk inserts, just so we can say that
we're getting the easy cases right -- that is a good starting point
for my prototype to target, I think. But I am also thinking about
stuff like this, as one way to address contention.

> This should spread out page usage to be more even, but still favor pages
> with more freespace.  Yes, this is simplistic, but it would seem to have
> few downsides and I would be interested to see how much it helps.

I thought about having whole free lists that were owned by XIDs, as
well as shared free lists that are determined by hashing MyProcPid --
or something like that.  So I agree that it might very well make
sense.
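
Something along the lines of Bruce's scheme could be as small as this
(hypothetical helper, purely illustrative):

static BlockNumber
fsm_pick_candidate_page(const BlockNumber *candidates, int ncandidates)
{
	/*
	 * All candidates already satisfy the space request; spread concurrent
	 * backends across them by pid instead of having everyone converge on
	 * the single "best fit" page.
	 */
	return candidates[MyProcPid % ncandidates];
}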

-- 
Peter Geoghegan




Re: Skipping logical replication transactions on subscriber side

2021-08-17 Thread Masahiko Sawada
On Tue, Aug 17, 2021 at 2:35 PM Amit Kapila  wrote:
>
> On Tue, Aug 17, 2021 at 10:46 AM Masahiko Sawada  
> wrote:
> >
> > On Mon, Aug 16, 2021 at 5:30 PM Greg Nancarrow  wrote:
> > >
> > > On Mon, Aug 16, 2021 at 5:54 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > Here is another comment:
> > > >
> > > > +char *
> > > > +logicalrep_message_type(LogicalRepMsgType action)
> > > > +{
> > > > ...
> > > > +   case LOGICAL_REP_MSG_STREAM_END:
> > > > +   return "STREAM END";
> > > > ...
> > > >
> > > > I think most the existing code use "STREAM STOP" to describe the
> > > > LOGICAL_REP_MSG_STREAM_END message, is it better to return "STREAM 
> > > > STOP" in
> > > > function logicalrep_message_type() too ?
> > > >
> > >
> > > +1
> > > I think you're right, it should be "STREAM STOP" in that case.
> >
> > It's right that we use "STREAM STOP" rather than "STREAM END" in many
> > places such as elog messages, a callback name, and source code
> > comments. As far as I have found there are two places where we’re
> > using "STREAM STOP": LOGICAL_REP_MSG_STREAM_END and a description in
> > doc/src/sgml/protocol.sgml. Isn't it better to fix these
> > inconsistencies in the first place? I think “STREAM STOP” would be
> > more appropriate.
> >
>
> I think keeping STREAM_END in the enum 'LOGICAL_REP_MSG_STREAM_END'
> seems to be a bit better because of the value 'E' we use for it.

But I think we don't care about the actual value of
LOGICAL_REP_MSG_STREAM_END since we use the enum value rather than
'E'?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: ECPG bug fix: DECLARE STATEMENT and DEALLOCATE, DESCRIBE

2021-08-17 Thread Michael Paquier
On Tue, Aug 17, 2021 at 03:34:28PM +0900, Michael Paquier wrote:
> On Mon, Aug 16, 2021 at 12:06:16PM +0200, Michael Meskes wrote:
>> Your patch removes the warning but by doing that also removes the
>> feature that is being tested.
> 
> Oops.  If kept this way, this test scenario is going to need a comment
> to explain exactly that.

Michael has adjusted that as of f576de1, so I am closing this open
item.
--
Michael


signature.asc
Description: PGP signature


RE: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread houzj.f...@fujitsu.com
Hi,

I took a quick look at the v2 patch and noticed a typo.

+ * backends and render it read-only.  If missing_ok is true then it will return
+ * NULL if file doesn not exist otherwise error.
  */
doesn not=> doesn't

Best regards,
Houzj




Re: [PATCH]Remove obsolete macro CHECKFLOATVAL in btree_gist

2021-08-17 Thread Michael Paquier
On Tue, Aug 17, 2021 at 11:08:59AM +, tanghy.f...@fujitsu.com wrote:
> On Friday, August 6, 2021 11:14 PM, tanghy.f...@fujitsu.com 
>  wrote:
>> Commit 6bf0bc842 replaced float.c's CHECKFLOATVAL() macro with static inline 
>> subroutines, 
>> but the fix introduced a performance regression as Tom Lane pointed out and 
>> fixed at 607f8ce74.
>>
>> Found obsolete CHECKFLOATVAL usage in contrib/btree_gist, and tried to fix 
>> it according to 607f8ce74.
> 
> Added above patch in commit fest:
> https://commitfest.postgresql.org/34/3287/

Yes, that does not seem wise on performance grounds.  The case of
!zero_is_valid is never reached, so it seems like this code was just a
copy-paste from the float code in the backend.  Your patch looks right
to me.
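
For anyone following along, the inline pattern from 607f8ce74 that the patch
adopts looks roughly like this (fragment only; arg1/arg2/result stand for the
operands and result of some float operation):

	result = arg1 + arg2;
	if (unlikely(isinf(result)) && !isinf(arg1) && !isinf(arg2))
		float_overflow_error();
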
--
Michael


signature.asc
Description: PGP signature


Re: Two patches to speed up pg_rewind.

2021-08-17 Thread Michael Paquier
On Tue, Aug 17, 2021 at 04:47:44PM +0900, Michael Paquier wrote:
> One argument
> against this approach is that if pg_rewind fails in the middle of its
> operation then we would have done a set of fsync() for nothing, with
> the data folder still unusable.

I was skimming through the patch this morning, and that argument does
not hold much water as the flushes happen in the same place.  Seems
like I got confused, sorry about that.

> I would be curious to see some
> numbers to see how much it matters with many physical files (say cases
> with thousands of small relations?).

For this one, one simple idea would be to create a lot of fake
relation files with a pre-determined size and check how things
change.
--
Michael


signature.asc
Description: PGP signature


Re: Support for NSS as a libpq TLS backend

2021-08-17 Thread Michael Paquier
On Wed, Aug 18, 2021 at 12:06:59AM +, Jacob Champion wrote:
> I have a local test suite that I've been writing against libpq. With
> the new ssldatabase connection option, one tricky aspect is figuring
> out whether it's supported or not. It doesn't look like there's any way
> to tell, from a client application, whether NSS or OpenSSL (or neither)
> is in use.

That's about guessing which library libpq is compiled with, so yes
that's a problem.

> so that you don't have to have an actual connection first in order to
> figure out what connection options you need to supply. Clients that
> support multiple libpq versions would need to know whether that call is
> reliable (older versions of libpq will always return NULL, whether SSL
> is compiled in or not), so maybe we could add a feature macro at the
> same time?

Still, the problem is wider than that, no?  One cannot know either if
a version of libpq is able to work with GSSAPI until they attempt a
connection with gssencmode.  It seems to me that we should work on the
larger picture here.

> We could also add a new API (say, PQsslLibrary()) but I don't know if
> that gives us anything in practice. Thoughts?

Knowing that the GSSAPI stuff is part of fe-secure.c, we may want
instead a call that returns a list of supported secure libraries.
--
Michael


signature.asc
Description: PGP signature


Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Peter Geoghegan
On Tue, Aug 17, 2021 at 9:48 AM Robert Haas  wrote:
> I don't know what the right degree of stickiness is, but I can easily
> believe that we want to have more than none. Part of Peter's point, at
> least as I understand it, is that if a page has 100 tuples and you
> delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
> created with 1 or 2 or 3 new tuples. You should instead hope that
> space will optimize future HOT updates, or else wait for the page to
> lose some larger number of tuples and then fill up all the space at
> the same time with tuples that are, hopefully, related to each other
> in some useful way.

That is a big part of it. But it gets worse than that. It is truly
insidious, once you follow the chain of problems over time -- maybe
over hours and hours. BenchmarkSQL makes this a lot easier.

Right now when we set heap fill factor to 85 (which is what
BenchmarkSQL does), we effectively have this chunk of special reserved
free space left behind on each page. The remaining 15% of tuple space
is of course reserved for successor versions of the same logical rows
-- those rows from the time that the page first fills (presumably they
all come from straight inserts). That's our starting point -- simple
enough.

The approach we take effectively gives up on the *whole entire page*
at literally the first sign of trouble -- the first time we fail to
hold a successor version on that same page. Even though it might just as
well have been a once-off perturbation. What we might call bad luck, as
opposed to an inevitability. For example, maybe ANALYZE ran that
second, which made the crucial difference for that one heap page that
one time. That's kind of enough to make the implementation assume all
bets are off, for the entire page -- it's as if we pessimistically
assume that there is no chance of keeping anything like the original
rows on the heap page, so we might as well let it all get mixed up
even more. This is way too pessimistic -- it's black and white
thinking.

While in general that level of pessimism might turn out to be
justified (anything is possible), we should hold the line and find out
for sure how bad it is. I happen to know that our problematic
BenchmarkSQL/TPC-C tables just don't have any skew, so this pessimism
is particularly wrong-headed there.

Bear in mind that the new unrelated tuples that ultimately replace our
original couldn't-fit tuples (probably after the next VACUUM) are
probably not "average tuples". Maybe they're newly inserted tuples,
which is bad in the BenchmarkSQL case, because they're all
*guaranteed* to be updated in the future (owing to the details of the
workload). With some other typical workload they might often be
successor versions from updates instead -- most likely for rows that
are disproportionately likely to be updated many times. It's truly
pernicious.

If we only accepted the original row migration, and didn't throw good
money after bad, then the cycle would be broken. To me this is a bit
like certain heuristic algorithms, where you technically accept a
suboptimal outcome to avoid getting stuck in a local maxima.

> Now what's the threshold? 20 out of 100? 50? 80?

I'm not going to pretend to know the answer. But I will point out that
one DB system whose heap fill factor defaults to 90 seems to have a
symmetric setting for the "open up page again" point -- the default
for that is 40. Not sure that that really matters to us, but that does
seem pretty low to me. It's very sticky indeed.

> Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
> much value for its own sake. But it might be useful as a way of
> accomplishing some other goal. For example if we decided that we need
> a bit per page to track "open" vs. "closed" status or something of
> that sort, I don't think we'd lose much by having fewer categories.

This is probably one of my less important points. I addressed this
point in my remarks on this to Andres from earlier.

> I don't have a well-formed opinion on this point yet. I am also
> skeptical of the idea of merging the FSM and VM, because it seems to
> me that which pages are all-visible and which pages are full are two
> different things.

As I clarified in my email to Andres, this was just a bad choice of
words. I meant that they're at least conceptually much closer,
provided you're willing to accept my general view of things.

> However, there's something to Peter's point too. An
> empty page is all-visible but still valid as an insertion target, but
> this is not as much of a contradiction to the idea of merging them as
> it might seem, because marking the empty page as all-visible is not
> nearly as useful as marking a full page all-visible. An index-only
> scan can't care about the all-visible status of a page that doesn't
> contain any tuples, and we're also likely to make the page non-empty
> in the near future, in which case the work we did to set the
> all-visible bit and log the change was wasted. The only 

Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

2021-08-17 Thread Michael Paquier
On Tue, Aug 17, 2021 at 09:39:30AM -0400, Tom Lane wrote:
> OK, but the commit message should explain why they're getting reverted.

Sure.  aef8948 goes away because of the performance impact.  ccf4e27
was a cleanup following up on aef8948, so it loses its meaning.  And
c3826f8 cannot be left alone because of the reasons why aef8948 was
introduced, as it has no safety net for out-of-bounds handling in the
allocated result buffer.
--
Michael


signature.asc
Description: PGP signature


Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

2021-08-17 Thread Michael Paquier
On Wed, Aug 18, 2021 at 12:34:45AM +0800, Julien Rouhaud wrote:
> On Tue, Aug 17, 2021 at 11:26 PM Bruce Momjian  wrote:
>> Uh, I don't see those commits, e.g.:
>>
>> $ git diff 0d70d30
>> fatal: ambiguous argument '0d70d30': unknown revision or path not in 
>> the working tree.
>> Use '--' to separate paths from revisions, like this:
>> 'git  [...] -- [...]'
>>
>> $ git diff 5c33ba5
>> fatal: ambiguous argument '5c33ba5': unknown revision or path not in 
>> the working tree.
>> Use '--' to separate paths from revisions, like this:
>> 'git  [...] -- [...]'
>>
>> $ git diff 92436a7
>> fatal: ambiguous argument '92436a7': unknown revision or path not in 
>> the working tree.
>> Use '--' to separate paths from revisions, like this:
>> 'git  [...] -- [...]'
> 
> Same here.  I'm assuming that the real commits are the one mentioned
> in the patch, which are c3826f8,  aef8948 and ccf4e27.

Oops, incorrect copy-paste from my side.  Yes, the commits impacted
here are aef8948, ccf4e27 and c3826f8.  The small-ish commit message
of the patch got that right.
--
Michael


signature.asc
Description: PGP signature


Re: Support for NSS as a libpq TLS backend

2021-08-17 Thread Jacob Champion
On Tue, 2021-08-10 at 19:22 +0200, Daniel Gustafsson wrote:
> Another rebase to work around the recent changes in the ssl Makefile.

I have a local test suite that I've been writing against libpq. With
the new ssldatabase connection option, one tricky aspect is figuring
out whether it's supported or not. It doesn't look like there's any way
to tell, from a client application, whether NSS or OpenSSL (or neither)
is in use.

You'd mentioned that perhaps we should support a call like

PQsslAttribute(NULL, "library"); /* returns "NSS", "OpenSSL", or NULL */

so that you don't have to have an actual connection first in order to
figure out what connection options you need to supply. Clients that
support multiple libpq versions would need to know whether that call is
reliable (older versions of libpq will always return NULL, whether SSL
is compiled in or not), so maybe we could add a feature macro at the
same time?
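
To sketch what a client might then do (assuming both the NULL-connection
behaviour and a feature macro; neither exists today and the macro name below
is made up):

#ifdef LIBPQ_HAS_SSL_LIBRARY_QUERY
	const char *ssl_lib = PQsslAttribute(NULL, "library");

	if (ssl_lib == NULL)
		;					/* built without SSL support */
	else if (strcmp(ssl_lib, "NSS") == 0)
		;					/* supply ssldatabase */
	else if (strcmp(ssl_lib, "OpenSSL") == 0)
		;					/* supply sslcert, sslkey, sslrootcert, ... */
#endif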

We could also add a new API (say, PQsslLibrary()) but I don't know if
that gives us anything in practice. Thoughts?

--Jacob


Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:
> On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
>
> > So, why isn't it that we call Register in XLogInsertRecord, and
> > Notify in XLogWrite?
> 
> We do.  However, we also call NotifySegmentsReadyForArchive() in
> XLogInsertRecord() to handle the probably-unlikely scenario that the
> flush pointer has already advanced past the to-be-registered boundary.
> This ensures that the .ready files are created as soon as possible.

I see.

I have two thoughts on that.  First, why not do it outside the block
that tests for crossing a segment boundary?  If that's a good thing to
do, then we should do it always.

However, why do it in a WAL-producing client-connected backend?  It
strikes me as a bad thing to do, because you are possibly causing delays
for client-connected backends.  I suggest that we should give this task
to the WAL writer process -- say, have XLogBackgroundFlush do it.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
 (George Orwell's The Lord of the Rings)




Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Peter Geoghegan
On Tue, Aug 17, 2021 at 6:18 AM Andres Freund  wrote:
> I suspect that one other fairly fundamental issue is that we are very
> reticent updating the FSM with information about free space. As long as
> that's the case a smarter placement logic would often not have a better
> place to choose from.

I think so too. But that is made a lot harder by the supposed need to
represent the amount of freespace in FSM_CATEGORIES (255) distinct
increments. As opposed to maybe 4 distinct increments per page, like
in other systems. You only have to update the externally stored
metadata when a threshold is crossed.

(This isn't the main reason why I don't like that; more on that later.)
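
The kind of coarse bucketing meant here would be on the order of the
following (illustrative only, not existing code):

static inline int
coarse_fsm_category(Size avail)
{
	/* a handful of buckets instead of FSM_CATEGORIES = 255 */
	if (avail < BLCKSZ / 4)
		return 0;				/* effectively "closed" */
	if (avail < BLCKSZ / 2)
		return 1;
	if (avail < (3 * BLCKSZ) / 4)
		return 2;
	return 3;					/* wide open */
}

The externally stored value then only has to change when a page crosses one
of those boundaries, not on every tuple-sized change in free space.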

> I think the design of the current bulk extension mechanism is quite backward
> (it only takes contention into account not other needs for bulk work, it does
> expensive stuff like finding victim buffers under lock, it doesn't release
> locks until all blocks are extended, ...).  But: I don't think a bulk
> extension mechanism should be tasked with doing grouping or anything like
> that.

Maybe I'm just drawing the boundaries differently in my head, which
doesn't seem like a real difference of opinion.

> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold.

> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

Maybe. It would be very different to our current behavior, which would
undoubtedly have many consequences. Maybe this will be a big problem,
maybe it barely matters. I make no specific claims about it, either
way. Not just yet.

> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Again, it's all about *systemic* effects. A FSM request is basically a
choice *among* blocks that could in principle satisfy the request --
often the number of basically-satisfactory blocks is huge (way too
large, even). You have to bear in mind that concurrent requests are in
competition with each other in much of the time. They are usually all
looking for exactly the same thing (e.g., same free space
requirement), or close to the same thing -- tuples within a given
table tend to all be about as wide as each other.

I can clearly sometimes see very high contention cases, where backends
do way too much spinning inside the RelationGetBufferForTuple() loop.
They're all more or less chasing the same scraps of free space, in a
highly inefficient way. Even though there is probably an abundance of
free space. Right now the heap pages that have the most free space are
also the ones that are least likely to be used.

Though I think that these backends should cooperate more, some amount
of competition is probably necessary. If FSM_CATEGORIES used 3 bits
instead of 8, then the backends could fall back on caring about some
other thing that distinguished heap pages from each other that
actually matters. Leading to less contention, and maybe better space
utilization.

I have presented this FSM_CATEGORIES contention issue as a distinct
problem to the first one (the locality problem), though honestly that
was a bit arbitrary -- it's just blurry. Thank you for putting up with
that.

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).

I explained this part badly. I don't think that we should literally
unite the VM and FSM data structures. I was really just pointing out
that a logical-ish notion of a page being full (to use the term of art
I've seen, a "closed" page) is almost the same thing as a page that is
marked all-visible/all-frozen. Presumably VACUUM does that with the
optimistic assumption that it will probably never again need to touch
the same page. And while that can never be guaranteed, it certainly
seems like we might want to weigh things in favor of that happening.

I'm pretty sure that carelessly remembering 200 bytes of free space in
the FSM for an all-frozen page is practically guaranteed to be a bad
idea. While I don't have a great idea of where the break-even point is
just yet, I am prepared to say that I'm sure that it's not 0 bytes,
which is how it works today. A squishy logical definition of "page is
full" seems pretty natural to me.

> But perhaps I'm just missing 

Re: archive status ".ready" files may be created too early

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 2:13 PM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-17, Bossart, Nathan wrote:
>
>> The main reason for registering the boundaries in XLogInsertRecord()
>> is that it has the required information about the WAL record
>> boundaries.  I do not think that XLogWrite() has this information.
>
> Doh, of course.  So, why isn't it that we call Register in
> XLogInsertRecord, and Notify in XLogWrite?

We do.  However, we also call NotifySegmentsReadyForArchive() in
XLogInsertRecord() to handle the probably-unlikely scenario that the
flush pointer has already advanced past the to-be-registered boundary.
This ensures that the .ready files are created as soon as possible.
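
Simplified, the ordering concern is (sketch only, not the exact patch code):

	/* in XLogInsertRecord(), once a record crosses a segment boundary: */
	RegisterSegmentBoundaryEntry(seg, endpos);

	/*
	 * A concurrent flush may already have passed endpos, in which case no
	 * later XLogWrite() will cross this boundary for us, so notify here too.
	 */
	if (GetFlushRecPtr() >= endpos)
		NotifySegmentsReadyForArchive(GetFlushRecPtr());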

Nathan



Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:

> The main reason for registering the boundaries in XLogInsertRecord()
> is that it has the required information about the WAL record
> boundaries.  I do not think that XLogWrite() has this information.

Doh, of course.  So, why isn't it that we call Register in
XLogInsertRecord, and Notify in XLogWrite?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: Allow parallel DISTINCT

2021-08-17 Thread Zhihong Yu
On Tue, Aug 17, 2021 at 1:47 PM David Rowley  wrote:

> On Wed, 18 Aug 2021 at 02:42, Zhihong Yu  wrote:
> > Since create_partial_distinct_paths() calls
> create_final_distinct_paths(), I wonder if numDistinctRows can be passed to
> create_final_distinct_paths() so that the latter doesn't need to call
> estimate_num_groups().
>
> That can't be done. The two calls to estimate_num_groups() are passing
> in a different number of input rows.  In
> create_partial_distinct_paths() the number of rows is the number of
> expected input rows from a partial path.  In
> create_final_distinct_paths() when called to complete the final
> distinct step, that's the number of distinct values multiplied by the
> number of workers.
>
> It might be more possible to do something like cache the value of
> distinctExprs, but I just don't feel the need.  If there are partial
> paths in the input_rel then it's most likely that planning time is not
> going to dominate much between planning and execution. Also, if we
> were to calculate the value of distinctExprs in create_distinct_paths
> always, then we might end up calculating it for nothing as
> create_final_distinct_paths() does not always need it. I don't feel
> the need to clutter up the code by doing any lazy calculating of it
> either.
>
> David
>
Hi,
Thanks for your explanation.

The patch is good from my point of view.


Re: Allow parallel DISTINCT

2021-08-17 Thread David Rowley
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu  wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I 
> wonder if numDistinctRows can be passed to create_final_distinct_paths() so 
> that the latter doesn't need to call estimate_num_groups().

That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows.  In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path.  In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.

It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need.  If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.

David




Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
The thing I still don't understand about this patch is why we call
RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in
XLogInsertRecord.

My model concept of this would have these routines called only in
XLogFlush / XLogWrite, which are conceptually about transferring data
from in-memory WAL buffers into the on-disk WAL store (pg_xlog files).
As I understand, XLogInsertRecord is about copying bytes from the
high-level operation (heap insert etc) into WAL buffers.  So doing the
register/notify dance in both places should be redundant and
unnecessary.


In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite,
we use flushed=InvalidXLogRecPtr.  Why is that?  Surely we can use
LogwrtResult.Flush, just like in the other callsite there?  (If we're
covering for somebody advancing FlushRecPtr concurrently, I think we should
add a comment to explain that.  But why do we need to do that in the
first place?)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)




Re: .ready and .done files considered harmful

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 12:11 PM, "Bossart, Nathan"  wrote:
> On 8/17/21, 11:28 AM, "Robert Haas"  wrote:
>> I can't actually see that there's any kind of hard synchronization
>> requirement here at all. What we're trying to do is guarantee that if
>> the timeline changes, we'll pick up the timeline history for the new
>> timeline next, and that if files are archived out of order, we'll
>> switch to archiving the oldest file that is now present rather than
>> continuing with consecutive files. But suppose we just use an
>> unsynchronized bool. The worst case is that we'll archive one extra
>> file proceeding in order before we jump to the file that we were
>> supposed to archive next. It's not evident to me that this is all that
>> bad. The same thing would have happened if the previous file had been
>> archived slightly faster than it actually was, so that we began
>> archiving the next file just before, rather than just after, the
>> notification was sent. And if it is bad, wrapping an LWLock around the
>> accesses to the flag variable, or using an atomic, does nothing to
>> stop it.
>
> I am inclined to agree.  The archiver only ever reads the flag and
> sets it to false (if we are doing a directory scan).  Others only ever
> set the flag to true.  The only case I can think of where we might
> miss the timeline switch or out-of-order .ready file is when the
> archiver sets the flag to false and then ReadDir() fails.  However,
> that seems to cause the archiver process to restart, and we always
> start with a directory scan at first.

Thinking further, I think the most important thing to ensure is that
resetting the flag happens before we begin the directory scan.
Consider the following scenario in which a timeline history file would
potentially be lost:

1. Archiver completes directory scan.
2. A timeline history file is created and the flag is set.
3. Archiver resets the flag.

I don't think there's any problem with the archiver reading a stale
value for the flag.  It should eventually be updated and route us to
the directory scan code path.
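
In code form, the ordering I mean is roughly this (names hypothetical):

	if (force_dir_scan)
	{
		/* clear the flag BEFORE reading the directory ... */
		force_dir_scan = false;

		/*
		 * ... so that a notification arriving while we scan is never lost;
		 * at worst it causes one extra directory scan.
		 */
		next_file = scan_archive_status_dir();
	}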

I'd also note that we're depending on the directory scan logic for
picking up all timeline history files and out-of-order .ready files
that may have been created each time the flag is set.  AFAICT that is
safe since we prioritize timeline history files and reset the archiver
state anytime we do a directory scan.  We'll first discover timeline
history files via directory scans, and then we'll move on to .ready
files, starting at the one with the lowest segment number.  If a new
timeline history file or out-of-order .ready file is created, the
archiver is notified, and we start over.

Nathan



Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread John Naylor
On Mon, Aug 16, 2021 at 1:36 PM Peter Geoghegan  wrote:
>
> Open and closed pages
> -

> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.

I'm not sure it's essential to have "far" fewer categories, if the
closed-to-open transition is made less granular through some other
mechanism. We can certainly get by with fewer categories, freeing up bits
-- it seems we'd need at least one bit to track a block's open-close state.

> Visibility map
> --
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
>
> Currently VACUUM doesn't care about how its FSM behavior interacts
> with how it sets all-frozen/all-visible bits for the same heap page.
> To me this seems completely unreasonable -- they're *obviously*
> related! We're probably only gaining a minimal amount of free space on
> one occasion by ignoring the VM/FSM relationship, for which we pay a
> high cost. Worst of all we're *perpetuating the cycle* of dirtying and
> redirtying the same pages over time. Maybe we should go as far as
> merging the FSM and VM, even -- that seems like a natural consequence
> of logical-ish/qualitative definitions of "page is full".

> [...]

> I now wonder if the FSM is fundamentally doing the wrong thing by
> keeping track of all "free space" in every page over time. Why
> wouldn't we just have a free space management strategy that is
> generally only concerned with recent events? If we find a way to make
> almost all newly allocated heap pages become "closed" quickly (maybe
> they're marked all-frozen quickly too), and make sure that that
> condition is sticky, then this can work out well. We may not even need
> to store explicit freespace information for most heap pages in the
> database -- a page being "closed" can be made implicit by the FSM
> and/or VM. Making heap pages age-out like this (and mostly stay that
> way over time) has obvious benefits in many different areas.

The second paragraph here is an interesting idea and makes a great deal of
sense. It would lead to smaller FSMs that are navigated more quickly and
locked for shorter durations.

Implicit "closure" seems riskier in my view if you want to bring VM
qualities into it, however. Currently, setting an all-visible or all-frozen
flag must be correct and crash-safe, but clearing those is just a lost
optimization. If either of those qualities are implicit by lack of
reference, it seems more vulnerable to bugs.

On Tue, Aug 17, 2021 at 12:48 PM Robert Haas  wrote:
>
> On Tue, Aug 17, 2021 at 9:18 AM Andres Freund  wrote:
>
> > To me this seems like it'd be better addressed by a shared,
> > per-relfilenode, in-memory datastructure. Thomas Munro has been working
> > on keeping accurate per-relfilenode relation size information. ISTM that
> > that'd be a better place to hook in for this.
>
> +1. I had this same thought reading Peter's email. I'm not sure if it
> makes sense to hook that into Thomas's work, but I think it makes a
> ton of sense to use shared memory to coordinate transient state like
> "hey, right now I'm inserting into this block, you guys leave it
> alone" while using the disk for durable state like "here's how much
> space this block has left."

This makes sense as well. Shared memory for more recent / highly contended
state, and disk for less recent / less contended / stickier state. This
also might have the advantage of smaller, more focused projects from a
coding standpoint.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: .ready and .done files considered harmful

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 11:28 AM, "Robert Haas"  wrote:
> I can't actually see that there's any kind of hard synchronization
> requirement here at all. What we're trying to do is guarantee that if
> the timeline changes, we'll pick up the timeline history for the new
> timeline next, and that if files are archived out of order, we'll
> switch to archiving the oldest file that is now present rather than
> continuing with consecutive files. But suppose we just use an
> unsynchronized bool. The worst case is that we'll archive one extra
> file proceeding in order before we jump to the file that we were
> supposed to archive next. It's not evident to me that this is all that
> bad. The same thing would have happened if the previous file had been
> archived slightly faster than it actually was, so that we began
> archiving the next file just before, rather than just after, the
> notification was sent. And if it is bad, wrapping an LWLock around the
> accesses to the flag variable, or using an atomic, does nothing to
> stop it.

I am inclined to agree.  The archiver only ever reads the flag and
sets it to false (if we are doing a directory scan).  Others only ever
set the flag to true.  The only case I can think of where we might
miss the timeline switch or out-of-order .ready file is when the
archiver sets the flag to false and then ReadDir() fails.  However,
that seems to cause the archiver process to restart, and we always
start with a directory scan at first.

Nathan



dependency between extension and roles

2021-08-17 Thread Pavel Stehule
Hi

I created some roles in extension. I will try to implement some security
checks.  When I dropped this extension, the roles were not dropped.

Is it expected behaviour?

Regards

Pavel


Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-17 Thread Robert Haas
On Fri, Aug 13, 2021 at 2:52 AM Greg Nancarrow  wrote:
> With your proposed approach, what I'm seeing is that the worker calls
> GetTransactionSnapshot() at some point, which then builds a new
> snapshot, and results in increasing TransactionXmin (probably because
> another concurrent transaction has since completed). This snapshot is
> thus later than the snapshot used in the execution state of the query
> being executed. This causes the Assert in
> SubTransGetTopmostTransaction() to fire because the xid doesn't follow
> or equal the TransactionXmin value.

Ah ha! Thank you. So I think what I was missing here is that even
though the transaction snapshot is not a well-defined concept when
!IsolationUsesXactSnapshot(), we still need TransactionXmin to be set
to a value that's earlier than any XID we might inquire about. So the
proposal to install the leader's active snapshot as the worker's
transaction snapshot is really just a way of making that happen. Now
that I understand better, that seems OK to me when
!IsolationUsesXactSnapshot(), but otherwise I think we need to
serialize and restore the actual transaction snapshot. Do you agree?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Robert Haas
On Tue, Aug 17, 2021 at 1:54 PM Tom Lane  wrote:
> Right.  If pg_upgrade explicitly ignores template0 then its OID
> need not be stable ... at least, not unless there's a chance it
> could conflict with some other database OID, which would become
> a live possibility if we let users get at "WITH OID = n".

Well, that might be a good reason not to let them do that, then, at
least for n<64k.

> > In fact, I'd probably go so far as to
> > hardcode that in such a way that even if you drop those databases and
> > recreate them, they get recreated with the same hard-coded OID.
>
> Less sure that this is a good idea, though.  In particular, I do not
> think that you can make it work in the face of
> alter database template1 rename to oops;
> create database template1;

That is a really good point. If we can't categorically force the OID
of those databases to have a particular, fixed value, and based on
this example that seems to be impossible, then there's always a
possibility that we might find a value in the old cluster that doesn't
happen to match what is present in the new cluster. Seen from that
angle, the problem is really with databases that are pre-existent in
the new cluster but whose contents still need to be dumped. Maybe we
could (optionally? conditionally?) drop those databases from the new
cluster and then recreate them with the OID that we want them to have.

> > The only hard requirement for this feature is if we
> > use the database OID for some kind of encryption or integrity checking
> > or checksum type feature.
>
> It's fairly unclear to me why that is so important as to justify the
> amount of klugery that this line of thought seems to be bringing.

Well, I think it would make sense to figure out how small we can make
the kludge first, and then decide whether it's larger than we can
tolerate. From my point of view, I completely understand why people to
whom those kinds of features are important want to include all the
fields that make up a buffer tag in the checksum or other integrity
check. Right now, if somebody copies a page from one place to another,
or if the operating system fumbles things and switches some pages
around, we have no direct way of detecting that anything bad has
happened. This is not the only problem that would need to be solved in
order to fix that, but it's one of them, and I don't particularly see
why it's not a valid goal. It's not as if a 16-bit checksum that is
computed in exactly the same way for every page in the cluster is such
state-of-the-art technology that only fools question its surpassing
excellence.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Bruce Momjian
On Tue, Aug 17, 2021 at 11:56:30AM -0400, Robert Haas wrote:
> On Wed, Aug 11, 2021 at 3:41 AM Shruthi Gowda  wrote:
> > I have fixed all the issues and now the patch is working as expected.
> 
> Hi,
> 
> I'm changing the subject line since the patch does something which was
> discussed on that thread but isn't really related to the old email
> subject. In general, I think this patch is uncontroversial and in
> reasonably good shape. However, there's one part that I'm not too sure
> about. If Tom Lane happens to be paying attention to this thread, I
> think his feedback would be particularly useful, since he knows a lot
> about the inner workings of pg_dump. Opinions from anybody else would
> be great, too. Anyway, here's the hunk that worries me:

What is the value of preserving db/ts/relfilenode OIDs?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: .ready and .done files considered harmful

2021-08-17 Thread Robert Haas
On Tue, Aug 17, 2021 at 12:33 PM Bossart, Nathan  wrote:
> Sorry, I think my note was not very clear.  I agree that a flag should
> be used for this purpose, but I think we should just use a regular
> bool protected by a spinlock or LWLock instead of an atomic.  The file
> atomics.h has the following note:
>
>  * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
>  * whenever possible. Writing correct code using these facilities is hard.
>
> IOW I don't think the extra complexity is necessary.  From a
> performance standpoint, contention seems unlikely.  We only need to
> read the flag roughly once per WAL segment, and we only ever set it in
> uncommon scenarios such as a timeline switch or the creation of an
> out-of-order .ready file.

In the interest of full disclosure, I think that I was probably the
one who suggested to Dipesh that he should look into using atomics,
although I can't quite remember right now why I thought we might want
to do that.

I do not on general principle very much like code that does
LWLockAcquire(whatever);
exactly-one-assignment-statement-that-modifies-a-1-2-or-4-byte-quantity;
LWLockRelease(whatever). If you had two assignments in there, then you
know why you have a lock: it's to make those behave as an atomic,
indivisible unit. But when you only have one, what are you protecting
against? You're certainly not making anything atomic that would not
have been anyway, so you must be using the LWLock as a memory barrier.
But then you really kind of have to think about memory barriers
anyway: why do you need one at all, and what things need to be
separated? It's not clear that spelling pg_memory_barrier() as
LWLockAcquire() and/or LWLockRelease() is actually saving you anything
in terms of notional complexity.

In this patch, it appears to me that the atomic flag is only ever
being read unlocked, so I think that we're actually getting no benefit
at all from the use of pg_atomic_flag here. We're not making anything
atomic, because there's only one bit of shared state, and we're not
getting any memory barrier semantics, because it looks to me like the
flag is only ever tested using pg_atomic_unlocked_test_flag, which is
documented not to have barrier semantics. So as far as I can see,
there's no point in using either an LWLock or atomics here. We could
just use bool with no lock and the code would do exactly what it does
now. So I guess the question is whether that's correct or whether we
need some kind of synchronization and, if so, of what sort.

I can't actually see that there's any kind of hard synchronization
requirement here at all. What we're trying to do is guarantee that if
the timeline changes, we'll pick up the timeline history for the new
timeline next, and that if files are archived out of order, we'll
switch to archiving the oldest file that is now present rather than
continuing with consecutive files. But suppose we just use an
unsynchronized bool. The worst case is that we'll archive one extra
file proceeding in order before we jump to the file that we were
supposed to archive next. It's not evident to me that this is all that
bad. The same thing would have happened if the previous file had been
archived slightly faster than it actually was, so that we began
archiving the next file just before, rather than just after, the
notification was sent. And if it is bad, wrapping an LWLock around the
accesses to the flag variable, or using an atomic, does nothing to
stop it.
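
To make the pattern concrete, here is a minimal standalone sketch of the
unsynchronized flag being described (all names are invented for illustration
and are not taken from the patch); the one ordering that matters is clearing
the flag before the directory scan starts, so a notification that arrives
during the scan is never lost:

#include <stdbool.h>

/* Lives in shared memory: writers set it, only the archiver clears it. */
typedef struct ArchNotifyState
{
    bool    force_dir_scan;     /* a timeline history file or out-of-order
                                 * .ready file was just created */
} ArchNotifyState;

static ArchNotifyState *arch_state;     /* would point into shared memory */

/* Called wherever a history file or out-of-order .ready file is created. */
static void
NotifyArchiverOfNewFile(void)
{
    arch_state->force_dir_scan = true;
}

/* Archiver, once per cycle: fall back to a directory scan if notified. */
static bool
ArchiverWantsDirectoryScan(void)
{
    if (arch_state->force_dir_scan)
    {
        /* Clear *before* scanning; worst case is one extra scan later. */
        arch_state->force_dir_scan = false;
        return true;
    }
    return false;               /* keep archiving consecutive segments */
}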

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Bruce Momjian
On Mon, Aug 16, 2021 at 05:15:36PM -0700, Peter Geoghegan wrote:
> It doesn't make sense to have a local cache for a shared resource --
> that's the problem. You actually need some kind of basic locking or
> lease system, so that 10 backends don't all decide at the same time
> that one particular heap block is fully empty, and therefore a good
> target block for that backend alone. It's as if the backends believe
> that they're special little snowflakes, and that no other backend
> could possibly be thinking the same thing at the same time about the
> same heap page. And so when TPC-C does its initial bulk insert,
> distinct orders are already shuffled together in a hodge-podge, just
> because concurrent bulk inserters all insert on the same heap pages.

OK, I am trying to think of something simple we could test to see the
benefit, with few downsides.  I assume the case you are considering is
that you have a 10 8k-page table, and one page is 80% full and the
others are 81% full, and if several backends start adding rows at the
same time, they will all choose the 80%-full page.

What if we change how we select pages with this:

1.  find the page with the most free space
2.  find all pages with up to 10% less free space than page #1
3.  count the number of pages in #2
4.  compute the proc_id modulus step #3 and use that page's offset from
step #2

For example:

1.  page with most freespace is 95% free
2.  pages 2,4,6,8,10 have between 86%-95% free
3.  five pages
4.  proc id 14293 % 5 = 3 so use the third page from #2, page 6

This should spread out page usage to be more even, but still favor pages
with more freespace.  Yes, this is simplistic, but it would seem to have
few downsides and I would be interested to see how much it helps.
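
A self-contained sketch of that selection rule, just to pin down the
arithmetic (the array layout, names, and the 10% window are illustrative,
not a proposal for the actual FSM code):

#include <stddef.h>

#define NEAR_BEST_WINDOW 0.10   /* "up to 10% less free space than page #1" */

static size_t
choose_target_page(const double *free_frac, size_t npages, int proc_id)
{
    double  best = 0.0;
    size_t  i, ncandidates = 0, pick;

    if (npages == 0)
        return 0;

    /* Step 1: page with the most free space. */
    for (i = 0; i < npages; i++)
        if (free_frac[i] > best)
            best = free_frac[i];

    /* Steps 2-3: count pages within the window below the best page. */
    for (i = 0; i < npages; i++)
        if (free_frac[i] >= best - NEAR_BEST_WINDOW)
            ncandidates++;

    /* Step 4: spread backends across the candidates by proc id. */
    pick = (size_t) proc_id % ncandidates;
    for (i = 0; i < npages; i++)
    {
        if (free_frac[i] >= best - NEAR_BEST_WINDOW)
        {
            if (pick == 0)
                return i;
            pick--;
        }
    }
    return 0;                   /* not reached when npages > 0 */
}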

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:

> I think we are in agreement.  If we assume that the flush pointer
> jumps along record boundaries and segment boundaries, the solution
> would be to avoid using the flush pointer when it points to a segment
> boundary (given that the segment boundary is not also a record
> boundary).  Instead, we'd only send up to the start position of the
> last record in the segment to standbys.

Agreed.

An implementation for that would be to test the flush pointer for it
being a segment boundary, and in that case we (acquire segment boundary
lock and) test for presence in the segment boundary map.  If present,
then retreat the pointer to the record's start address.
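
A rough sketch of that check (the map layout and names are invented here; in
the real thing the map would be a shared hash table protected by the segment
boundary lock):

typedef unsigned long long LSN;         /* stand-in for XLogRecPtr */
#define WAL_SEG_SIZE (16ULL * 1024 * 1024)

/* One entry per WAL record that crosses a segment boundary. */
typedef struct SegBoundaryEntry
{
    LSN     seg_boundary;       /* the boundary the record crosses */
    LSN     rec_start;          /* start address of that record */
} SegBoundaryEntry;

static SegBoundaryEntry boundary_map[64];
static unsigned         boundary_map_len;

/* Position to report to standbys instead of the raw flush pointer. */
static LSN
sendable_flush_ptr(LSN flushptr)
{
    unsigned    i;

    /* Common case: not on a segment boundary, no lock needed at all. */
    if (flushptr % WAL_SEG_SIZE != 0)
        return flushptr;

    /* Rare case: boundary -- the segment boundary lock would be taken here. */
    for (i = 0; i < boundary_map_len; i++)
    {
        if (boundary_map[i].seg_boundary == flushptr)
            return boundary_map[i].rec_start;   /* retreat to record start */
    }

    /* The boundary coincides with a record boundary: safe to send as is. */
    return flushptr;
}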

This means that we acquire the segment boundary lock rarely.  I was
concerned that we'd need to acquire it every time we read the flush
pointer, which would have been a disaster.

Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Shruthi Gowda
On Tue, Aug 17, 2021 at 11:07 PM Robert Haas  wrote:
>
> On Tue, Aug 17, 2021 at 12:42 PM Tom Lane  wrote:
> > Actually though ... I've not read the patch, but what does it do about
> > the fact that the postgres and template0 DBs do not have stable OIDs?
> > I cannot imagine any way to force those to match across PG versions
> > that would not be an unsustainable crock.
>
> Well, it's interesting that you mention that, because there's a
> comment in the patch that probably has to do with this:
>
> +/*
> + * Make sure that pg_upgrade does not change database OID. Don't care
> + * about "postgres" database, backend will assign it fixed OID anyway.
> + * ("template1" has fixed OID too but the value 1 should not collide with
> + * any other OID so backend pays no attention to it.)
> + */
>
In the original patch the author intended to avoid dumping the
postgres DB OID like below:
+ if (dopt->binary_upgrade && dbCatId.oid != PostgresDbOid)

Since the postgres OID is not hardcoded/fixed, I removed the check.
My bad, I missed updating the comment section. Sorry for the confusion.

Regards,
Shruthi KC
EnterpriseDB: http://www.enterprisedb.com




Re: archive status ".ready" files may be created too early

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 10:44 AM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-17, Bossart, Nathan wrote:
>> I've been thinking about the next steps for this one, too.  ISTM we'll
>> need to basically assume that the flush pointer jumps along record
>> boundaries except for the cross-segment records.  I don't know if that
>> is the safest assumption, but I think the alternative involves
>> recording every record boundary in the map.
>
> I'm not sure I understand your idea correctly.  Perhaps another solution
> is to assume that the flush pointer jumps along record boundaries
> *including* for cross-segment records.  The problem stems precisely from
> the fact that we set the flush pointer at segment boundaries, even when
> they aren't record boundary.

I think we are in agreement.  If we assume that the flush pointer
jumps along record boundaries and segment boundaries, the solution
would be to avoid using the flush pointer when it points to a segment
boundary (given that the segment boundary is not also a record
boundary).  Instead, we'd only send up to the start position of the
last record in the segment to standbys.

Nathan



Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> The only hard requirement for this feature is if we
>> use the database OID for some kind of encryption or integrity checking
>> or checksum type feature.

> It's fairly unclear to me why that is so important as to justify the
> amount of klugery that this line of thought seems to be bringing.

And, not to put too fine a point on it, how will you possibly do
that without entirely breaking CREATE DATABASE?

regards, tom lane




Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Tom Lane
Robert Haas  writes:
> I wasn't able to properly understand that comment, and to be honest
> I'm not sure I precisely understand your concern either. I don't quite
> see why the template0 database matters. I think that database isn't
> going to be dumped, or restored, so as far as pg_upgrade is concerned
> it might as well not exist in either cluster, and I don't see why
> pg_upgrade can't therefore just ignore it completely. But template1
> and postgres are another matter. If I understand correctly, those
> databases are going to be created in the new cluster by initdb, but
> then pg_upgrade is going to populate them with data - including
> relation files - from the old cluster.

Right.  If pg_upgrade explicitly ignores template0 then its OID
need not be stable ... at least, not unless there's a chance it
could conflict with some other database OID, which would become
a live possibility if we let users get at "WITH OID = n".

(Having said that, I'm not sure that pg_upgrade special-cases
template0, or that it should do so.)

> To be honest, what I'd be inclined to do about that is just nail down
> those OIDs for future releases.

Yeah, I was thinking along similar lines.

> In fact, I'd probably go so far as to
> hardcode that in such a way that even if you drop those databases and
> recreate them, they get recreated with the same hard-coded OID.

Less sure that this is a good idea, though.  In particular, I do not
think that you can make it work in the face of
alter database template1 rename to oops;
create database template1;

> The only hard requirement for this feature is if we
> use the database OID for some kind of encryption or integrity checking
> or checksum type feature.

It's fairly unclear to me why that is so important as to justify the
amount of klugery that this line of thought seems to be bringing.

regards, tom lane




Re: Allow composite foreign keys to reference a superset of unique constraint columns?

2021-08-17 Thread Paul Martinez
On Tue, Aug 17, 2021 at 8:41 AM Jack Christensen  wrote:
>
> The only way to ensure a user can only be a member of a group in the same
> tenant is to make user_group_memberships.tenant_id part of the foreign key. And
> that will only work with a unique key on id and tenant_id in both users and
> user_groups. It's a bit inelegant to create multiple extra indexes to ensure
> consistency when existing indexes are enough to ensure uniqueness.
>
> Jack

You could accomplish this by using composite primary keys on the users and
user_groups tables:

CREATE TABLE users (
  id serial,
  tenant_id int REFERENCES tenants(id),
  PRIMARY KEY (tenant_id, id)
);

This approach works pretty well for multi-tenant databases, because then your
indexes all start with tenant_id, which should help with performance, and, in
theory, would make your database easier to shard. But then it requires
including a tenant_id in *every* query (and subquery!), which may be difficult
to enforce in a codebase.

One downside of the composite primary/foreign key approach is that ON DELETE
SET NULL foreign keys no longer work properly because they try to set both
columns to NULL, the true foreign key id, AND the shared tenant_id that is part
of the referencing table's primary key. I have a patch [1] out to add new
functionality to solve this problem though.

- Paul

[1]: 
https://www.postgresql.org/message-id/flat/CACqFVBZQyMYJV%3DnjbSMxf%2BrbDHpx%3DW%3DB7AEaMKn8dWn9OZJY7w%40mail.gmail.com




Re: archive status ".ready" files may be created too early

2021-08-17 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-17, Bossart, Nathan wrote:

> On 8/16/21, 5:09 PM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > The reason for the latter is that I suspect the segment boundary map
> > will also have a use to fix the streaming replication issue I mentioned
> > earlier in the thread.  This also makes me think that we'll want the wal
> > record *start* address to be in the hash table too, not just its *end*
> > address.  So we'll use the start-1 as position to send, rather than the
> > end-of-segment when GetFlushRecPtr() returns that.
> 
> I've been thinking about the next steps for this one, too.  ISTM we'll
> need to basically assume that the flush pointer jumps along record
> boundaries except for the cross-segment records.  I don't know if that
> is the safest assumption, but I think the alternative involves
> recording every record boundary in the map.

I'm not sure I understand your idea correctly.  Perhaps another solution
is to assume that the flush pointer jumps along record boundaries
*including* for cross-segment records.  The problem stems precisely from
the fact that we set the flush pointer at segment boundaries, even when
they aren't record boundary.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)




Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Robert Haas
On Tue, Aug 17, 2021 at 12:42 PM Tom Lane  wrote:
> Actually though ... I've not read the patch, but what does it do about
> the fact that the postgres and template0 DBs do not have stable OIDs?
> I cannot imagine any way to force those to match across PG versions
> that would not be an unsustainable crock.

Well, it's interesting that you mention that, because there's a
comment in the patch that probably has to do with this:

+/*
+ * Make sure that pg_upgrade does not change database OID. Don't care
+ * about "postgres" database, backend will assign it fixed OID anyway.
+ * ("template1" has fixed OID too but the value 1 should not collide with
+ * any other OID so backend pays no attention to it.)
+ */

I wasn't able to properly understand that comment, and to be honest
I'm not sure I precisely understand your concern either. I don't quite
see why the template0 database matters. I think that database isn't
going to be dumped, or restored, so as far as pg_upgrade is concerned
it might as well not exist in either cluster, and I don't see why
pg_upgrade can't therefore just ignore it completely. But template1
and postgres are another matter. If I understand correctly, those
databases are going to be created in the new cluster by initdb, but
then pg_upgrade is going to populate them with data - including
relation files - from the old cluster. And, yeah, I don't see how we
could make those database OIDs match, which is not great.

To be honest, what I'd be inclined to do about that is just nail down
those OIDs for future releases. In fact, I'd probably go so far as to
hardcode that in such a way that even if you drop those databases and
recreate them, they get recreated with the same hard-coded OID. Now
that doesn't do anything to create stability when people upgrade from
an old release to a current one, but I don't really see that as an
enormous problem. The only hard requirement for this feature is if we
use the database OID for some kind of encryption or integrity checking
or checksum type feature. Then, you want to avoid having the database
OID change when you upgrade, so that the encryption or integrity check
or checksum in question does not have to be recomputed for every page
as part of pg_upgrade. But, that only matters if you're going between
two releases that support that feature, which will not be the case if
you're upgrading from some old release. Apart from that kind of
feature, it still seems like a nice-to-have to keep database OIDs the
same, but if those cases end up as exceptions, oh well.

Does that seem reasonable, or am I missing something big?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Robert Haas
On Tue, Aug 17, 2021 at 9:18 AM Andres Freund  wrote:
> > Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> > ever end up inserting *any* new logical rows on a heap page after the
> > page first fills -- it is initially "open" when first allocated, and
> > then quickly becomes "closed" to inserts of new logical rows, once it
> > crosses a certain almost-empty space threshold. The heap page usually
> > stays "closed" forever. While the design does not *completely* ignore
> > the possibility that the page won't eventually get so empty that reuse
> > really does look like a good idea, it makes it rare. A heap page needs
> > to have rather a lot of the original logical rows deleted before that
> > becomes possible again. It's rare for it to go backwards because the
> > threshold that makes a closed page become open again is much lower
> > (i.e. involves much more free space) than the threshold that initially
> > made a (newly allocated) heap page closed. This one-way stickiness
> > quality is important.
>
> I suspect that we'd get a *LOT* of complaints if we introduced that degree of
> stickiness.

I don't know what the right degree of stickiness is, but I can easily
believe that we want to have more than none. Part of Peter's point, at
least as I understand it, is that if a page has 100 tuples and you
delete 1 or 2 or 3 of them, it is not smart to fill up the space thus
created with 1 or 2 or 3 new tuples. You should instead hope that
space will optimize future HOT updates, or else wait for the page to
lose some larger number of tuples and then fill up all the space at
the same time with tuples that are, hopefully, related to each other
in some useful way. Now what's the threshold? 20 out of 100? 50? 80?
That's not really clear to me. But it's probably not 1 out of 100.
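
For concreteness, a toy version of the one-way stickiness being discussed:
two thresholds, where reopening a closed page requires far more free space
than it took to close it. The numbers here are placeholders, not a
recommendation:

#include <stdbool.h>

#define CLOSE_BELOW_FREE   0.10     /* close once <10% of the page is free */
#define REOPEN_ABOVE_FREE  0.50     /* reopen only when >50% becomes free */

static bool
page_open_for_inserts(bool currently_open, double free_fraction)
{
    if (currently_open)
        return free_fraction >= CLOSE_BELOW_FREE;   /* easy to stay open */
    else
        return free_fraction >= REOPEN_ABOVE_FREE;  /* hard to reopen */
}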

> > This stickiness concept is called "hysteresis" by some DB researchers,
> > often when discussing UNDO stuff [8]. Having *far less* granularity
> > than FSM_CATEGORIES/255 seems essential to make that work as intended.
> > Pages need to be able to settle without being disturbed by noise-level
> > differences. That's all that super fine grained values buy you: more
> > noise, more confusion.
>
> Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
> from a fairly granular FSM_CATEGORIES, while still achieving better
> locality. You could e.g. search for free space for an out-of-page update in
> nearby pages with a lower free-percentage threshold, while having a different
> threshold and selection criteria for placement of inserts.

Yeah, I don't think that reducing FSM_CATEGORIES is likely to have
much value for its own sake. But it might be useful as a way of
accomplishing some other goal. For example if we decided that we need
a bit per page to track "open" vs. "closed" status or something of
that sort, I don't think we'd lose much by having fewer categories.
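
If it came to that, the bookkeeping could be as simple as stealing one bit
from the existing one-byte-per-page FSM entry. Purely illustrative, not
actual or proposed PostgreSQL code:

#include <stdbool.h>
#include <stdint.h>

#define FSM_CLOSED_BIT  0x80    /* page is "closed" to new insertions */
#define FSM_CAT_MASK    0x7F    /* 128 free-space categories instead of 256 */

static inline uint8_t
fsm_make_byte(uint8_t category, bool closed)
{
    return (uint8_t) ((category & FSM_CAT_MASK) | (closed ? FSM_CLOSED_BIT : 0));
}

static inline bool
fsm_page_is_closed(uint8_t fsm_byte)
{
    return (fsm_byte & FSM_CLOSED_BIT) != 0;
}

static inline uint8_t
fsm_page_category(uint8_t fsm_byte)
{
    return (uint8_t) (fsm_byte & FSM_CAT_MASK);
}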

> The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
> merging FSM and VM is a good direction to go to. Additionally we need to make
> sure that the VM is accurately durable, which isn't the case for the FSM
> (which I think we should use to maintain it more accurately).
>
> But perhaps I'm just missing something here. What is the strong interlink
> between FSM and all-frozen/all-visible you see on the side of VACUUM? ISTM
> that the main linkage is on the side of inserts/update that choose a target
> page from the FSM. Isn't it better to make that side smarter?

I don't have a well-formed opinion on this point yet. I am also
skeptical of the idea of merging the FSM and VM, because it seems to
me that which pages are all-visible and which pages are full are two
different things. However, there's something to Peter's point too. An
empty page is all-visible but still valid as an insertion target, but
this is not as much of a contradiction to the idea of merging them as
it might seem, because marking the empty page as all-visible is not
nearly as useful as marking a full page all-visible. An index-only
scan can't care about the all-visible status of a page that doesn't
contain any tuples, and we're also likely to make the page non-empty
in the near future, in which case the work we did to set the
all-visible bit and log the change was wasted. The only thing we're
accomplishing by making the page all-visible is letting VACUUM skip
it, which will work out to a win if it stays empty for multiple VACUUM
cycles, but not otherwise.

Conceptually, you might think of the merged data structure as a
"quiescence map." Pages that aren't being actively updated are
probably all-visible and are probably not great insertion targets.
Those that are being actively updated are probably not all-visible and
have better chances of being good insertion targets. However, there
are a lot of gremlins buried in the word "probably." A page could get
full but still not be all-visible for a long time, due to long-running
transactions, and it 

Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Tom Lane
Stephen Frost  writes:
> Also agreed on this, though I wonder- do we actually need to explicitly
> make CREATE DATABASE q WITH OID = 1234; only work during binary upgrade
> mode in the backend?  That strikes me as perhaps doing more work than we
> really need to while also preventing something that users might actually
> like to do.

There should be adequate defenses against a duplicate OID already,
so +1 --- no reason to insist this only be used during binary upgrade.

Actually though ... I've not read the patch, but what does it do about
the fact that the postgres and template0 DBs do not have stable OIDs?
I cannot imagine any way to force those to match across PG versions
that would not be an unsustainable crock.

regards, tom lane




Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

2021-08-17 Thread Julien Rouhaud
On Tue, Aug 17, 2021 at 11:26 PM Bruce Momjian  wrote:
>
> On Tue, Aug 17, 2021 at 09:39:30AM -0400, Tom Lane wrote:
> > Michael Paquier  writes:
> > > In short, I am planning to address this regression with the attached,
> > > for a combined revert of 0d70d30, 5c33ba5 and 92436a7.
> >
> > OK, but the commit message should explain why they're getting reverted.
>
> Uh, I don't see those commits, e.g.:
>
> $ git diff 0d70d30
> fatal: ambiguous argument '0d70d30': unknown revision or path not in 
> the working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
>
> $ git diff 5c33ba5
> fatal: ambiguous argument '5c33ba5': unknown revision or path not in 
> the working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'
>
> $ git diff 92436a7
> fatal: ambiguous argument '92436a7': unknown revision or path not in 
> the working tree.
> Use '--' to separate paths from revisions, like this:
> 'git <command> [<revision>...] -- [<file>...]'

Same here.  I'm assuming that the real commits are the one mentioned
in the patch, which are c3826f8,  aef8948 and ccf4e27.




Re: .ready and .done files considered harmful

2021-08-17 Thread Bossart, Nathan
On 8/17/21, 5:53 AM, "Dipesh Pandit"  wrote:
>> I personally don't think it's necessary to use an atomic here.  A
>> spinlock or LWLock would probably work just fine, as contention seems
>> unlikely.  If we use a lock, we also don't have to worry about memory
>> barriers.
>
> History file should be archived as soon as it gets created. The atomic flag
> here will make sure that there is no reordering of read/write instructions
> while accessing the flag in shared memory. Archiver needs to read this flag
> at the beginning of each cycle. Write to atomic flag is synchronized and it
> provides a lockless read. I think an atomic flag here is an efficient choice
> unless I am missing something.

Sorry, I think my note was not very clear.  I agree that a flag should
be used for this purpose, but I think we should just use a regular
bool protected by a spinlock or LWLock instead of an atomic.  The file
atomics.h has the following note:

 * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
 * whenever possible. Writing correct code using these facilities is hard.

IOW I don't think the extra complexity is necessary.  From a
performance standpoint, contention seems unlikely.  We only need to
read the flag roughly once per WAL segment, and we only ever set it in
uncommon scenarios such as a timeline switch or the creation of an
out-of-order .ready file.

Nathan



Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Robert Haas  writes:
> > To me, adding a separate TOC entry for a thing that is not really a
> > separate object seems like a scary hack that might come back to bite
> > us. Unfortunately, I don't know enough about pg_dump to say exactly
> > how it might come back to bite us, which leaves wide open the
> > possibility that I am completely wrong. I just think it's the
> > intention that archive entries correspond to actual objects in the
> > database, not commands that we want executed in some particular order.
> 
> I agree, this seems like a moderately bad idea.  It could get broken
> either by executing only one of the TOC entries during restore, or
> by executing them in the wrong order.  The latter possibility could
> be forestalled by adding a dependency, which I do not see this hunk
> doing, which is clearly a bug.  The former possibility would require
> user intervention, so maybe it's in the category of "if you break
> this you get to keep both pieces".  Still, it's ugly.

Yeah, agreed.

> > If that criticism is indeed correct, then my proposal would be to
> > instead add a WITH OID = nnn option to CREATE DATABASE and allow it to
> > be used only in binary upgrade mode.
> 
> If it's not too complicated to implement, that seems like an OK idea
> from here.  I don't have any great love for the way we handle OID
> preservation in binary upgrade mode, so not doing it exactly the same
> way for databases doesn't seem like a disadvantage.

Also agreed on this, though I wonder- do we actually need to explicitly
make CREATE DATABASE q WITH OID = 1234; only work during binary upgrade
mode in the backend?  That strikes me as perhaps doing more work than we
really need to while also preventing something that users might actually
like to do.

Either way, we'll need to check that the OID given to us can be used for
the database, I'd think.

Having pg_dump only include it in binary upgrade mode is fine though.

Thanks,

Stephen




Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Tom Lane
Robert Haas  writes:
> To me, adding a separate TOC entry for a thing that is not really a
> separate object seems like a scary hack that might come back to bite
> us. Unfortunately, I don't know enough about pg_dump to say exactly
> how it might come back to bite us, which leaves wide open the
> possibility that I am completely wrong. I just think it's the
> intention that archive entries correspond to actual objects in the
> database, not commands that we want executed in some particular order.

I agree, this seems like a moderately bad idea.  It could get broken
either by executing only one of the TOC entries during restore, or
by executing them in the wrong order.  The latter possibility could
be forestalled by adding a dependency, which I do not see this hunk
doing, which is clearly a bug.  The former possibility would require
user intervention, so maybe it's in the category of "if you break
this you get to keep both pieces".  Still, it's ugly.

> If that criticism is indeed correct, then my proposal would be to
> instead add a WITH OID = nnn option to CREATE DATABASE and allow it to
> be used only in binary upgrade mode.

If it's not too complicated to implement, that seems like an OK idea
from here.  I don't have any great love for the way we handle OID
preservation in binary upgrade mode, so not doing it exactly the same
way for databases doesn't seem like a disadvantage.

regards, tom lane




preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)

2021-08-17 Thread Robert Haas
On Wed, Aug 11, 2021 at 3:41 AM Shruthi Gowda  wrote:
> I have fixed all the issues and now the patch is working as expected.

Hi,

I'm changing the subject line since the patch does something which was
discussed on that thread but isn't really related to the old email
subject. In general, I think this patch is uncontroversial and in
reasonably good shape. However, there's one part that I'm not too sure
about. If Tom Lane happens to be paying attention to this thread, I
think his feedback would be particularly useful, since he knows a lot
about the inner workings of pg_dump. Opinions from anybody else would
be great, too. Anyway, here's the hunk that worries me:

+
+   /*
+    * Need a separate entry, otherwise the command will be run in the
+    * same transaction as the CREATE DATABASE command, which is not
+    * allowed.
+    */
+   ArchiveEntry(fout,
+                dbCatId,       /* catalog ID */
+                dbDumpId,      /* dump ID */
+                ARCHIVE_OPTS(.tag = datname,
+                             .owner = dba,
+                             .description = "SET_DB_OID",
+                             .section = SECTION_PRE_DATA,
+                             .createStmt = setDBIdQry->data,
+                             .dropStmt = NULL));
+

To me, adding a separate TOC entry for a thing that is not really a
separate object seems like a scary hack that might come back to bite
us. Unfortunately, I don't know enough about pg_dump to say exactly
how it might come back to bite us, which leaves wide open the
possibility that I am completely wrong. I just think it's the
intention that archive entries correspond to actual objects in the
database, not commands that we want executed in some particular order.
If that criticism is indeed correct, then my proposal would be to
instead add a WITH OID = nnn option to CREATE DATABASE and allow it to
be used only in binary upgrade mode. That has the disadvantage of
being inconsistent with the way that we preserve OIDs everywhere else,
but the only other alternatives are (1) do something like the above,
(2) remove the requirement that CREATE DATABASE run in its own
transaction, and (3) give up. (2) sounds hard and (3) is unappealing.

The rest of this email will be detailed review comments on the patch
as presented, and thus probably only interesting to someone actually
working on the patch. Feel free to skip if that's not you.

- I suggest splitting the patch into one portion that deals with
database OID and another portion that deals with tablespace OID and
relfilenode OID, or maybe splitting it all the way into three separate
patches, one for each. This could allow the uncontroversial parts to
get committed first while we're wondering what to do about the problem
described above.

- There are two places in the patch, one in dumpDatabase() and one in
generate_old_dump() where blank lines are removed with no other
changes. Please avoid whitespace-only hunks.

- If possible, please try to pgindent the new code. It's pretty good
what you did, but e.g. the declaration of
binary_upgrade_next_pg_tablespace_oid probably has less whitespace
than pgindent is going to want.

- The comments in dumpDatabase() claim that "postgres" and "template1"
are handled specially in some way, but there seems to be no code that
matches those comments.

- heap_create()'s logic around setting create_storage looks slightly
redundant. I'm not positive what would be better, but ... suppose you
just took the part that's currently gated by if (!IsBinaryUpgrade) and
did it unconditionally. Then put if (IsBinaryUpgrade) around the else
clause, but delete the last bit from there that sets create_storage.
Maybe we'd still want a comment saying that it's intentional that
create_storage = true even though it will be overwritten later, but
then, I think, we wouldn't need to set create_storage in two different
places. Maybe I'm wrong.

- If we're not going to do that, then I think you should swap the if
and else clauses and reverse the sense of the test. In createdb(),
CreateTableSpace(), and a bunch of existing places, we do if
(IsBinaryUpgrade) { ... } else { ... } so I don't think it makes sense
for this one to instead do if (!IsBinaryUpgrade) { ... } else { ... }.

- I'm not sure that I'd bother renaming
binary_upgrade_set_pg_class_oids_and_relfilenodes(). It's such a long
name, and a relfilenode is kind of an OID, so the current name isn't
even really wrong. I'd probably drop the header comment too, since it
seems rather obvious. But both of these things are judgement calls.

- Inside that function, there is a comment that says "Indexes cannot
have toast tables, so we need not make this probe in the index code
path." However, you have moved the code from someplace where it didn't
happen for indexes to someplace where it happens for both tables and
indexes. Therefore the comment, which was true when the code was 

Re: Allow composite foreign keys to reference a superset of unique constraint columns?

2021-08-17 Thread Jack Christensen
On Mon, Aug 16, 2021 at 7:01 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Mon, Aug 16, 2021 at 4:37 PM Paul Martinez  wrote:
>
>>
>> It seems like a somewhat useful feature. If people think it would be
>> useful to
>> implement, I might take a stab at it when I have time.
>>
>>
> This doesn't seem useful enough for us to be the only implementation to go
> above and beyond the SQL Standard's specification for the references
> feature (I assume that is what this proposal suggests).
>
> This example does a good job of explaining but its assumptions aren't that
> impactful and thus isn't that good at inducing desirability.
>
>

I have no opinion on the broader concerns about this proposed feature, but
speaking simply as a user I have wanted this on multiple occasions. In my
case, it is usually because of the need to maintain consistency in a
diamond table relationship. For example:

create table tenants (
  id serial primary key
);

create table users (
  id serial primary key,
  tenant_id int references tenants
);

create table user_groups (
  id serial primary key,
  tenant_id int references tenants
);

create table user_group_memberships (
  tenant_id int,
  user_id int,
  user_group_id int,
  primary key (user_id, user_group_id),
  foreign key (user_id, tenant_id) references users (id, tenant_id),
  foreign key (user_group_id, tenant_id) references user_groups (id, tenant_id)
);

The only way to ensure a user can only be a member of a group in the same
tenant is to make user_group_memberships.tenant_id part of the foreign key.
And that will only work with a unique key on id and tenant_id in both users
and user_groups. It's a bit inelegant to create multiple extra indexes to
ensure consistency when existing indexes are enough to ensure uniqueness.

Jack


Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

2021-08-17 Thread Bruce Momjian
On Tue, Aug 17, 2021 at 09:39:30AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > In short, I am planning to address this regression with the attached,
> > for a combined revert of 0d70d30, 5c33ba5 and 92436a7.
> 
> OK, but the commit message should explain why they're getting reverted.

Uh, I don't see those commits, e.g.:

$ git diff 0d70d30
fatal: ambiguous argument '0d70d30': unknown revision or path not in 
the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

$ git diff 5c33ba5
fatal: ambiguous argument '5c33ba5': unknown revision or path not in 
the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

$ git diff 92436a7
fatal: ambiguous argument '92436a7': unknown revision or path not in 
the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [PATCH] Allow multiple recursive self-references

2021-08-17 Thread Zhihong Yu
On Tue, Aug 17, 2021 at 5:58 AM Denis Hirn 
wrote:

> > The tests fail when you build with assertions enabled (configure
> > --enable-cassert).
>
> Thank you for pointing that out. The new version of this patch fixes that.
> The tests are working properly now. All style related issues are fixed as
> well.
>
> Best wishes,
>   -- Denis
>
>
Hi,
+   selfrefcountL = cstate->selfrefcount;
+   cstate->selfrefcount = selfrefcount;

Maybe the variable  selfrefcountL can be renamed slightly (e.g.
curr_selfrefcount) so that the code is easier to read.

Cheers


Re: Allow parallel DISTINCT

2021-08-17 Thread Zhihong Yu
On Tue, Aug 17, 2021 at 3:59 AM David Rowley  wrote:

> On Tue, 17 Aug 2021 at 20:07, Zhihong Yu  wrote:
> > Can you attach updated patch so that we know more detail about the two
> new functions; create_final_distinct_paths and
> > create_partial_distinct_paths ?
>
> Must've fallen off in transit :)
>
> David
>
Hi,
Since create_partial_distinct_paths() calls create_final_distinct_paths(),
I wonder if numDistinctRows can be passed to create_final_distinct_paths()
so that the latter doesn't need to call estimate_num_groups().

Cheers


Re: automatically generating node support functions

2021-08-17 Thread Peter Eisentraut
Here is another set of preparatory patches that clean up various special 
cases and similar in the node support.


0001-Remove-T_Expr.patch

Removes unneeded T_Expr.

0002-Add-COPY_ARRAY_FIELD-and-COMPARE_ARRAY_FIELD.patch
0003-Add-WRITE_INDEX_ARRAY.patch

These add macros to handle a few cases that were previously hand-coded.

0004-Make-node-output-prefix-match-node-structure-name.patch

Some nodes' output/read functions use a label that is slightly different 
from their node name, e.g., "NOTIFY" instead of "NOTIFYSTMT".  This 
cleans that up so that an automated approach doesn't have to deal with 
these special cases.


0005-Add-Cardinality-typedef.patch

Adds a typedef Cardinality for double fields that store an estimated row 
or other count.  Works alongside Cost and Selectivity.


This is useful because it appears that the serialization format for 
these float fields depends on their intent: Cardinality => %.0f, Cost => 
%.2f, Selectivity => %.4f.  The only remaining exception is allvisfrac, 
which uses %.6f.  Maybe that could also be a Selectivity, but I left it 
as is.  I think this improves the clarity in this area.
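
A tiny self-contained demonstration of those per-intent precisions (Cost and
Selectivity are existing typedefs in the tree; the rest of this is just a
demo, not patch code):

#include <stdio.h>

typedef double Cost;            /* existing */
typedef double Selectivity;     /* existing */
typedef double Cardinality;     /* the new typedef */

int
main(void)
{
    Cardinality rows = 12345.678;
    Cost        cost = 12345.678;
    Selectivity sel = 0.123456;

    /* Cardinality => %.0f, Cost => %.2f, Selectivity => %.4f */
    printf("rows %.0f, cost %.2f, selectivity %.4f\n", rows, cost, sel);
    return 0;
}
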
From 4a9cc84d667f64bdcdffb1df8c89c4e73b02c1fb Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 17 Aug 2021 15:51:45 +0200
Subject: [PATCH 1/5] Remove T_Expr

This is an abstract node that shouldn't have a node tag defined.
---
 src/include/nodes/nodes.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 6a4d82f0a8..5ac1996738 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,7 +153,6 @@ typedef enum NodeTag
T_Alias,
T_RangeVar,
T_TableFunc,
-   T_Expr,
T_Var,
T_Const,
T_Param,
-- 
2.32.0

From a6f0e9031071ce695153fea600b5cb07a6507e11 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 17 Aug 2021 15:51:45 +0200
Subject: [PATCH 2/5] Add COPY_ARRAY_FIELD and COMPARE_ARRAY_FIELD

These handle node fields that are inline arrays (as opposed to
dynamically allocated arrays handled by COPY_POINTER_FIELD and
COMPARE_POINTER_FIELD).  These cases were hand-coded until now.
---
 src/backend/nodes/copyfuncs.c  | 11 +++
 src/backend/nodes/equalfuncs.c |  7 +++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 38251c2b8e..05c8ca080c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -53,6 +53,10 @@
 #define COPY_STRING_FIELD(fldname) \
(newnode->fldname = from->fldname ? pstrdup(from->fldname) : (char *) 
NULL)
 
+/* Copy a field that is an inline array */
+#define COPY_ARRAY_FIELD(fldname) \
+   memcpy(newnode->fldname, from->fldname, sizeof(newnode->fldname))
+
 /* Copy a field that is a pointer to a simple palloc'd object of size sz */
 #define COPY_POINTER_FIELD(fldname, sz) \
do { \
@@ -4931,10 +4935,9 @@ _copyForeignKeyCacheInfo(const ForeignKeyCacheInfo *from)
COPY_SCALAR_FIELD(conrelid);
COPY_SCALAR_FIELD(confrelid);
COPY_SCALAR_FIELD(nkeys);
-   /* COPY_SCALAR_FIELD might work for these, but let's not assume that */
-   memcpy(newnode->conkey, from->conkey, sizeof(newnode->conkey));
-   memcpy(newnode->confkey, from->confkey, sizeof(newnode->confkey));
-   memcpy(newnode->conpfeqop, from->conpfeqop, sizeof(newnode->conpfeqop));
+   COPY_ARRAY_FIELD(conkey);
+   COPY_ARRAY_FIELD(confkey);
+   COPY_ARRAY_FIELD(conpfeqop);
 
return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 8a1762000c..6f656fab97 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -74,6 +74,13 @@
 #define equalstr(a, b) \
(((a) != NULL && (b) != NULL) ? (strcmp(a, b) == 0) : (a) == (b))
 
+/* Compare a field that is an inline array */
+#define COMPARE_ARRAY_FIELD(fldname) \
+   do { \
+   if (memcmp(a->fldname, b->fldname, sizeof(a->fldname)) != 0) \
+   return false; \
+   } while (0)
+
 /* Compare a field that is a pointer to a simple palloc'd object of size sz */
 #define COMPARE_POINTER_FIELD(fldname, sz) \
do { \
-- 
2.32.0

From 7d1d7d6697d03b0a4ad3baaf37f315252f2b70c4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 17 Aug 2021 15:51:45 +0200
Subject: [PATCH 3/5] Add WRITE_INDEX_ARRAY

We have a few WRITE_{name of type}_ARRAY macros, but the one case
using the Index type was hand-coded.  Wrap it into a macro as well.

This also changes the behavior slightly: Before, the field name was
skipped if the length was zero.  Now it prints the field name even in
that case.  This is more consistent with how other array fields are
handled.
---
 src/backend/nodes/outfuncs.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 

Re: Fix uninitialized variable access (src/backend/utils/mmgr/freepage.c)

2021-08-17 Thread Ranier Vilela
On Tue, Aug 17, 2021 at 10:22 AM Greg Nancarrow  wrote:

> On Tue, Aug 17, 2021 at 9:13 PM Ranier Vilela  wrote:
> >
> > If that's conditions happen, all *result.index* touches are garbage.
> >
>
> The patch looks valid to me, as the "index" member is not set in the
> "btp == NULL" case, and so has a junk value in the caller, and it's
> being used to index an array,
> BUT - isn't it also necessary to set the "split_pages" member to 0,
> because it also is not currently being set, and so too will have a
> junk value in this case (and it's possible for it to be referenced by
> the caller in this case).
>
I agree.

Attached new version (v1).

regards,
Ranier Vilela


fix_unitialized_var_index_freepage-v1.patch
Description: Binary data


Re: A problem in ExecModifyTable

2021-08-17 Thread Tom Lane
David Rowley  writes:
> I'd been looking at the code in ExecInitModifyTable() that's the same
> as what you pasted, thinking you meant that.  I think the check for
> partitioned tables in ExecModifyTable() is likely just dead
> code.  It seems to be causing a bit of confusion though, so might be
> worth doing something about. Copied in Tom to see what he thinks as
> it's one of his.

Yeah, it's dead code in the sense that we should never reach these
if-blocks with a partitioned table.  I believe the reason I made it
like that is that the logic is concerned with which style of row identity
applies to a particular relkind, and in the planner we treat partitioned
tables as requiring this style of row identity, so that the right things
happen for their children.  So the answer is basically "for consistency
with add_row_identity_columns".

Maybe we could instead have add_row_identity_columns do nothing for
a partitioned table, but I'm unconvinced of that.

regards, tom lane




Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

2021-08-17 Thread Tom Lane
Michael Paquier  writes:
> In short, I am planning to address this regression with the attached,
> for a combined revert of 0d70d30, 5c33ba5 and 92436a7.

OK, but the commit message should explain why they're getting reverted.

regards, tom lane




Re: ALTER TYPE vs extension membership (was Re: BUG #17144: Upgrade from v13 to v14 with the cube extension failed)

2021-08-17 Thread Tom Lane
Andres Freund  writes:
> On 2021-08-16 15:36:59 -0400, Tom Lane wrote:
>> There's a policy question here, which is when does an operation on
>> a pre-existing object within an extension script cause the object
>> to be absorbed into the extension?  You might naively say "never",
>> but that's not our historical behavior, and I think it'd clearly
>> be the wrong thing in some cases.  For example, consider
>> 
>> CREATE TYPE cube;   -- make a shell type
>> -- do something that requires type cube to exist
>> CREATE EXTENSION cube;
>> 
>> We don't want this to fail, because it might be necessary to do
>> things that way to get out of circular dependencies.

> In what scenario would that be a good way to address a circular
> dependency? ISTM that whatever depends on $ext should be using shell
> types or such to avoid the circular dependency, then create $ext. Rather
> than creating shell types for objects in $ext and then creating $ext.

Perhaps.  But that does work today (I tested), and it's been the
intentional behavior ever since we invented extensions, to judge
by the fact that the existing comments on GenerateTypeDependencies
describe the current behavior explicitly.  (Admittedly, that's only
talking about types, not other sorts of objects, which is why I moved
that comment in the proposed patch.)  So I'm loath to get rid of it
in a back-patched bug fix.

Maybe there is a case for becoming stricter in HEAD, but I'm not
really clear on what would be an improvement.  I do not think that
"don't adopt the object into the extension" could be sane.  The only
other alternative is "throw an error", which goes against the clear
intention and definition of CREATE OR REPLACE.

The fact that you don't need to say CREATE OR REPLACE TYPE to replace
a shell type muddies the waters here a bit, but I think we should still
consider that behavior to be an instance of CREATE OR REPLACE.

regards, tom lane




Re: Added schema level support for publication.

2021-08-17 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Aug 17, 2021 at 6:40 AM Peter Smith  wrote:
>> On Mon, Aug 16, 2021 at 11:31 PM Tom Lane  wrote:
>>> Abstractly it'd be
>>> 
>>> createpub := CREATE PUBLICATION pubname FOR cpitem [, ... ] [ WITH ... ]
>>> 
>>> cpitem := ALL TABLES |
>>> TABLE name |
>>> ALL TABLES IN SCHEMA name |
>>> ALL SEQUENCES |
>>> SEQUENCE name |
>>> ALL SEQUENCES IN SCHEMA name |
>>> name
>>> 
>>> The grammar output would need some post-analysis to attribute the
>>> right type to bare "name" items, but that doesn't seem difficult.

>> That last bare "name" cpitem. looks like it would permit the following 
>> syntax:
>> CREATE PUBLICATION pub FOR a,b,c;
>> Was that intentional?

> I think so.

I had supposed that we could throw an error at the post-processing stage,
or alternatively default to assuming that such names are tables.
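
For illustration, that post-processing step could look something like this
(untested sketch; the enum and helper name are invented here, they are not
existing code):

static PubObjType
classify_bare_publication_name(RangeVar *rv)
{
	Oid			relid = RangeVarGetRelid(rv, AccessShareLock, false);
	char		relkind = get_rel_relkind(relid);

	if (relkind == RELKIND_SEQUENCE)
		return PUBOBJ_SEQUENCE;
	if (relkind == RELKIND_RELATION ||
		relkind == RELKIND_PARTITIONED_TABLE)
		return PUBOBJ_TABLE;

	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("\"%s\" is not a table or sequence", rv->relname)));
	return PUBOBJ_TABLE;		/* keep compiler quiet */
}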

Now you could instead make the grammar work like

cpitem := ALL TABLES |
  TABLE name [, ...] |
  ALL TABLES IN SCHEMA name [, ...] |
  ALL SEQUENCES |
  SEQUENCE name [, ...] |
  ALL SEQUENCES IN SCHEMA name [, ...]

which would result in a two-level-list data structure.  I'm not sure
that this is better, as any sort of mistake would result in a very
uninformative generic "syntax error" from Bison.  Errors out of a
post-processing stage could be more specific than that.

(Perhaps, though, we should *document* it like the latter way,
even if the actual implementation is more like the first way.)

regards, tom lane




Re: Fix uninitialized variable access (src/backend/utils/mmgr/freepage.c)

2021-08-17 Thread Greg Nancarrow
On Tue, Aug 17, 2021 at 9:13 PM Ranier Vilela  wrote:
>
> If those conditions happen, all uses of *result.index* are garbage.
>

The patch looks valid to me, as the "index" member is not set in the
"btp == NULL" case, and so has a junk value in the caller, and it's
being used to index an array,
BUT - isn't it also necessary to set the "split_pages" member to 0,
because it also is not currently being set, and so too will have a
junk value in this case (and it's possible for it to be referenced by
the caller in this case).
The "btp == NULL" case is not hit by any existing test cases, and does
seem to be a rare case.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: The Free Space Map: Problems and Opportunities

2021-08-17 Thread Andres Freund
Hi,

On 2021-08-16 10:35:45 -0700, Peter Geoghegan wrote:
> Problems
> 
>
> The FSM gives out space without considering the passage of time, or
> the likelihood that a particular transaction or client will consume
> its requested free space in a more or less predictable and steady way
> over time. It has no memory. The FSM therefore fails to capture
> naturally occurring locality, or at least doesn't specifically care
> about it. This is the central problem, that all other problems with
> the FSM seem to stem from in one way or another.

I suspect that one other fairly fundamental issue is that we are very
reticent about updating the FSM with information about free space. As long as
that's the case, a smarter placement logic would often not have a better
place to choose from.


> Background
> --
>
> To recap, the FSM tracks how much free space is in every heap page.
> Most FSM maintenance takes place during VACUUM, though ordinary
> connection backends occasionally help to keep the FSM current, via
> calls to RecordAndGetPageWithFreeSpace() made from hio.c.
>
> There is also a bulk extension mechanism added by commit 719c84c1be,
> which is used when we detect multiple concurrent inserters. This bulk
> extension mechanism does change things a little (it gives some thought
> to systematic effects), though it still has the same basic issues: it
> doesn't do nearly enough to group likely-related rows together. I
> suspect that improving the bulk extension mechanism will be an
> important part of improving the overall picture.

I think the design of the current bulk extension mechanism is quite backward
(it only takes contention into account not other needs for bulk work, it does
expensive stuff like finding victim buffers under lock, it doesn't release
locks until all blocks are extended, ...).  But: I don't think a bulk
extension mechanism should be tasked with doing grouping or anything like
that.


> Take DB2's version of the FSM, FSIP [7]. This design usually doesn't
> ever end up inserting *any* new logical rows on a heap page after the
> page first fills -- it is initially "open" when first allocated, and
> then quickly becomes "closed" to inserts of new logical rows, once it
> crosses a certain almost-empty space threshold. The heap page usually
> stays "closed" forever. While the design does not *completely* ignore
> the possibility that the page won't eventually get so empty that reuse
> really does look like a good idea, it makes it rare. A heap page needs
> to have rather a lot of the original logical rows deleted before that
> becomes possible again. It's rare for it to go backwards because the
> threshold that makes a closed page become open again is much lower
> (i.e. involves much more free space) than the threshold that initially
> made a (newly allocated) heap page closed. This one-way stickiness
> quality is important.

I suspect that we'd get a *LOT* of complaints if we introduced that degree of
stickiness.


> This stickiness concept is called "hysteresis" by some DB researchers,
> often when discussing UNDO stuff [8]. Having *far less* granularity
> than FSM_CATEGORIES/255 seems essential to make that work as intended.
> Pages need to be able to settle without being disturbed by noise-level
> differences. That's all that super fine grained values buy you: more
> noise, more confusion.

Why is that directly tied to FSM_CATEGORIES? ISTM that there's ways to benefit
from a fairly granular FSM_CATEGORIES, while still achieving better
locality. You could e.g. search for free space for an out-of-page update in
nearby pages with a lower free-percentage threshold, while having a different
threshold and selection criteria for placement of inserts.
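
As a crude illustration of the sort of thing I mean (the helper name and the
"nearby window" heuristic are invented, this is not a concrete proposal):

static BlockNumber
choose_update_target_block(Relation rel, BlockNumber old_blk, Size needed)
{
	BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
	BlockNumber start = (old_blk >= 4) ? old_blk - 4 : 0;
	BlockNumber end = Min(nblocks, old_blk + 4);
	BlockNumber blk;

	/* First look at pages close to the old tuple, with a lax threshold. */
	for (blk = start; blk < end; blk++)
	{
		if (GetRecordedFreeSpace(rel, blk) >= needed)
			return blk;
	}

	/*
	 * Otherwise fall back to the regular FSM search, which could apply a
	 * stricter "page must be fairly empty" criterion for fresh inserts.
	 */
	return GetPageWithFreeSpace(rel, needed);
}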


> Visibility map
> --
>
> If the logical database and natural locality are important to the FSM,
> then what about the visibility map? And what about the relationship
> between the FSM and the visibility map, in light of all this?
>
> Currently VACUUM doesn't care about how its FSM behavior interacts
> with how it sets all-frozen/all-visible bits for the same heap page.
> To me this seems completely unreasonable -- they're *obviously*
> related! We're probably only gaining a minimal amount of free space on
> one occasion by ignoring the VM/FSM relationship, for which we pay a
> high cost. Worst of all we're *perpetuating the cycle* of dirtying and
> redirtying the same pages over time. Maybe we should go as far as
> merging the FSM and VM, even -- that seems like a natural consequence
> of logical-ish/qualitative definitions of "page is full".

The density of the VM is fairly crucial for efficient IOS, so I'm doubtful
merging FSM and VM is a good direction to go to. Additionally we need to make
sure that the VM is accurately durable, which isn't the case for the FSM
(which I think we should use to maintain it more accurately).

But perhaps I'm just missing something here. What is the strong interlink
between FSM and VM?

Re: logical replication empty transactions

2021-08-17 Thread Ajin Cherian
On Mon, Aug 16, 2021 at 4:44 PM Peter Smith  wrote:

> I have reviewed the v13-0001 patch.
>
> Apply / build / test was all OK
>
> Below are my code review comments.
>
> //
>
> Comments for v13-0001
> =
>
> 1. Patch comment
>
> =>
>
> Probably this comment should include some description for the new
> "keepalive" logic as well.

Added.

>
> --
>
> 2. src/backend/replication/syncrep.c - new function
>
> @@ -330,6 +330,18 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
>  }
>
>  /*
> + * Check if Sync Rep is enabled
> + */
> +bool
> +SyncRepEnabled(void)
> +{
> + if (SyncRepRequested() && ((volatile WalSndCtlData *)
> WalSndCtl)->sync_standbys_defined)
> + return true;
> + else
> + return false;
> +}
> +
>
> 2a. Function comment =>
>
> Why abbreviations in the comment? Why not say "synchronous
> replication" instead of "Sync Rep".
>

Changed.

> ~~
>
> 2b. if/else =>
>
> Remove the if/else. e.g.
>
> return SyncRepRequested() && ((volatile WalSndCtlData *)
> WalSndCtl)->sync_standbys_defined;
>
> ~~

Changed.

>
> 2c. Call the new function =>
>
> There is some existing similar code in SyncRepWaitForLSN(), e.g.
>
> if (!SyncRepRequested() ||
> !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
> return;
>
> Now that you have a new function you maybe can call it from here, e.g.
>
> if (!SyncRepEnabled())
> return;
>

Updated.

> --
>
> 3. src/backend/replication/walsender.c - whitespace
>
> + if (send_keep_alive)
> + force_keep_alive_syncrep = true;
> +
> +
>
> =>
>
> Extra blank line?

Removed.

>
> --
>
> 4. src/backend/replication/walsender.c - call keepalive
>
>   if (MyWalSnd->flush < sentPtr &&
>   MyWalSnd->write < sentPtr &&
>   !waiting_for_ping_response)
> + {
>   WalSndKeepalive(false);
> + }
> + else
> + {
> + if (force_keep_alive_syncrep && SyncRepEnabled())
> + WalSndKeepalive(false);
> + }
>
>
> 4a. Move the SynRepEnabled() call =>
>
> I think it is not necessary to call the SynRepEnabled() here. Instead,
> it might be better if this is called back when you assign the
> force_keep_alive_syncrep flag. So change the WalSndUpdateProgress,
> e.g.
>
> BEFORE
> if (send_keep_alive)
>   force_keep_alive_syncrep = true;
> AFTER
> force_keep_alive_syncrep = send_keep_alive && SyncRepEnabled();
>
> Note: Also, that assignment also deserves a big comment to say what it is 
> doing.
>
> ~~

changed.

>
> 4b. Change the if/else =>
>
> If you make the change for 4a. then perhaps the keepalive if/else is
> overkill and could be changed.e.g.
>
> if (force_keep_alive_syncrep ||
>   MyWalSnd->flush < sentPtr &&
>   MyWalSnd->write < sentPtr &&
>   !waiting_for_ping_response)
>   WalSndKeepalive(false);
>

Changed.

regards,
Ajin Cherian
Fujitsu Australia


v14-0001-Skip-empty-transactions-for-logical-replication.patch
Description: Binary data


Re: [PATCH] Allow multiple recursive self-references

2021-08-17 Thread Denis Hirn
> The tests fail when you build with assertions enabled (configure 
> --enable-cassert).

Thank you for pointing that out. The new version of this patch fixes that.
The tests are working properly now. All style related issues are fixed as well.

Best wishes,
  -- Denis



0005-Allow-multiple-recursive-self-references.patch
Description: Binary data



Re: .ready and .done files considered harmful

2021-08-17 Thread Dipesh Pandit
Thanks for the feedback.

> +   StatusFilePath(archiveStatusPath, xlog, ".ready");
> +   if (stat(archiveStatusPath, &stat_buf) == 0)
> +   PgArchEnableDirScan();

> We may want to call PgArchWakeup() after setting the flag.

Yes, added a call to wake up archiver.

> > +  * - The next anticipated log segment is not available.
> >
> > I wonder if we really need to perform a directory scan in this case.
> > Unless there are other cases where the .ready files are created out of
> > order, I think this just causes an unnecessary directory scan every
> > time the archiver catches up.

> Thinking further, I suppose this is necessary for when lastSegNo gets
> reset after processing an out-of-order .ready file.

Also, this is necessary when lastTLI gets reset after switching to a new
timeline.

> +   pg_atomic_flag dirScan;

> I personally don't think it's necessary to use an atomic here.  A
> spinlock or LWLock would probably work just fine, as contention seems
> unlikely.  If we use a lock, we also don't have to worry about memory
> barriers.

A history file should be archived as soon as it gets created. The atomic flag
here makes sure that there is no reordering of read/write instructions while
accessing the flag in shared memory. The archiver needs to read this flag at
the beginning of each cycle. Writes to the atomic flag are synchronized, and
reads are lockless. I think an atomic flag is an efficient choice here unless
I am missing something.
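
To be concrete, the consumer side is roughly the following (simplified
sketch, not the exact code in the attached patch; it assumes the shared
struct pointer is named PgArch, as elsewhere in pgarch.c):

	bool		full_dir_scan = false;

	/* checked once at the start of each archiver cycle */
	if (!pg_atomic_unlocked_test_flag(&PgArch->dirScan))
	{
		/* a full directory scan was requested; consume the request */
		pg_atomic_clear_flag(&PgArch->dirScan);
		full_dir_scan = true;
	}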

Please find the attached patch v7.

Thanks,
Dipesh
From 55c42f851176a75881a55b1c75d624248169b876 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of current file which is being
archived and incrementing it by '1' to get the next WAL file.
Archiver can check the availability of next file and in case if the
file is not available then it should fall-back to directory scan to
get the oldest WAL file.

If there is a timeline switch then archiver performs a full directory
scan to make sure that archiving history file takes precedence over
archiving WAL files on older timeline.
---
 src/backend/access/transam/xlog.c|  15 +++
 src/backend/access/transam/xlogarchive.c |  12 +++
 src/backend/postmaster/pgarch.c  | 163 ---
 src/include/postmaster/pgarch.h  |   1 +
 4 files changed, 178 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb..088ab43 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -7555,6 +7556,13 @@ StartupXLOG(void)
 	 */
 	if (AllowCascadeReplication())
 		WalSndWakeup();
+
+	/*
+	 * Switched to a new timeline, notify archiver to enable
+	 * directory scan.
+	 */
+	if (XLogArchivingActive())
+		PgArchEnableDirScan();
 }
 
 /* Exit loop if we reached inclusive recovery target */
@@ -7797,6 +7805,13 @@ StartupXLOG(void)
 			 EndRecPtr, reason);
 
 		/*
+		 * Switched to a new timeline, notify archiver to enable directory
+		 * scan.
+		 */
+		if (XLogArchivingActive())
+			PgArchEnableDirScan();
+
+		/*
 		 * Since there might be a partial WAL segment named RECOVERYXLOG, get
 		 * rid of it.
 		 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..94c74f8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -609,6 +609,18 @@ XLogArchiveCheckDone(const char *xlog)
 
 	/* Retry creation of the .ready file */
 	XLogArchiveNotify(xlog);
+
+	/*
+	 * This .ready file is created out of order, notify archiver to perform
+	 * a full directory scan to archive corresponding WAL file.
+	 */
+	StatusFilePath(archiveStatusPath, xlog, ".ready");
+	if (stat(archiveStatusPath, &stat_buf) == 0)
+	{
+		PgArchEnableDirScan();
+		PgArchWakeup();
+	}
+
 	return false;
 }
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..e5ea7a6 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,23 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disable directory scan. If this flag is set then it
+	 * forces archiver to perform a full directory scan to get the next log
+	 * segment.
+	 */
+	pg_atomic_flag dirScan;
 } PgArchData;
 
+/*
+ * Segment number and timeline ID to identify the next file in a WAL 

Re: [PATCH] OpenSSL: Mark underlying BIO with the appropriate type flags

2021-08-17 Thread Daniel Gustafsson
> On 12 Aug 2021, at 15:32, Daniel Gustafsson  wrote:

> Barring objections I will go ahead with your proposed patch on HEAD and
> backpatch it all the way, once I've done more testing on it.

I’ve tested this with old and new OpenSSL versions, and have now applied it
backpatched to 9.6 as it’s been wrong for a long time.  Thanks!

--
Daniel Gustafsson   https://vmware.com/





Re: [PATCH] Add extra statistics to explain for Nested Loop

2021-08-17 Thread Ekaterina Sokolova

Hi, hackers.

Here is the new version of the patch that adds printing of min, max, and total
statistics for time and rows across all loops to EXPLAIN ANALYSE.

1) Please add VERBOSE to display extra statistics.
2) Format of extra statistics is:

  a) FORMAT TEXT


Loop min_time: N  max_time: N  min_rows: N  max_rows: N  total_rows: N
Output: ...


   b) FORMAT JSON 


...
"Actual Total Time": N,
"Loop Min Time": N,
"Loop Max Time": N,
"Actual Rows": N,
"Loop Min Rows": N,
"Loop Max Rows": N,
"Loop Total Rows": N,
"Actual Loops": N,
...


I hope you find this patch useful.
Please don't hesitate to share any thoughts on this topic!
--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Author: Ekaterina Sokolova 

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac44..9c71819deb2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1611,6 +1611,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
 		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
 		double		rows = planstate->instrument->ntuples / nloops;
+		double		loop_total_rows = planstate->instrument->ntuples;
+		double		loop_min_r = planstate->instrument->min_tuples;
+		double		loop_max_r = planstate->instrument->max_tuples;
+		double		loop_min_t_ms = 1000.0 * planstate->instrument->min_t;
+		double		loop_max_t_ms = 1000.0 * planstate->instrument->max_t;
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
@@ -1622,6 +1627,19 @@ ExplainNode(PlanState *planstate, List *ancestors,
 appendStringInfo(es->str,
  " (actual rows=%.0f loops=%.0f)",
  rows, nloops);
+			if (nloops > 1 && es->verbose)
+			{
+appendStringInfo(es->str, "\n");
+ExplainIndentText(es);
+if (es->timing)
+	appendStringInfo(es->str,
+	 "Loop min_time: %.3f  max_time: %.3f  min_rows: %.0f  max_rows: %.0f  total_rows: %.0f",
+	 loop_min_t_ms, loop_max_t_ms, loop_min_r, loop_max_r, loop_total_rows);
+else
+	appendStringInfo(es->str,
+	 "Loop min_rows: %.0f  max_rows: %.0f  total_rows: %.0f",
+	 loop_min_r, loop_max_r, loop_total_rows);
+			}
 		}
 		else
 		{
@@ -1631,8 +1649,21 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	 3, es);
 ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
 	 3, es);
+if (nloops > 1 && es->verbose)
+{
+	ExplainPropertyFloat("Loop Min Time", "s", loop_min_t_ms,
+		 3, es);
+	ExplainPropertyFloat("Loop Max Time", "s", loop_max_t_ms,
+		 3, es);
+}
 			}
 			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+			if (nloops > 1 && es->verbose)
+			{
+ExplainPropertyFloat("Loop Min Rows", NULL, loop_min_r, 0, es);
+ExplainPropertyFloat("Loop Max Rows", NULL, loop_max_r, 0, es);
+ExplainPropertyFloat("Loop Total Rows", NULL, loop_total_rows, 0, es);
+			}
 			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
 		}
 	}
@@ -1642,6 +1673,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			appendStringInfoString(es->str, " (never executed)");
 		else
 		{
+			/* without min and max values because actual result is 0 */
 			if (es->timing)
 			{
 ExplainPropertyFloat("Actual Startup Time", "ms", 0.0, 3, es);
@@ -1668,12 +1700,22 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			double		startup_ms;
 			double		total_ms;
 			double		rows;
+			double		loop_min_t_ms;
+			double		loop_max_t_ms;
+			double		loop_min_r;
+			double		loop_max_r;
+			double		loop_total_rows;
 
 			if (nloops <= 0)
 continue;
 			startup_ms = 1000.0 * instrument->startup / nloops;
 			total_ms = 1000.0 * instrument->total / nloops;
 			rows = instrument->ntuples / nloops;
+			loop_min_t_ms = 1000.0 * instrument->min_t;
+			loop_max_t_ms = 1000.0 * instrument->max_t;
+			loop_min_r = instrument->min_tuples;
+			loop_max_r = instrument->max_tuples;
+			loop_total_rows = instrument->ntuples;
 
 			ExplainOpenWorker(n, es);
 
@@ -1688,6 +1730,19 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	appendStringInfo(es->str,
 	 "actual rows=%.0f loops=%.0f\n",
 	 rows, nloops);
+if (nloops > 1)
+{
+	appendStringInfo(es->str, "\n");
+	ExplainIndentText(es);
+	if (es->timing)
+		appendStringInfo(es->str,
+		 "Loop min_time: %.3f  max_time: %.3f  min_rows: %.0f  max_rows: %.0f  total_rows: %.0f",
+		 loop_min_t_ms, loop_max_t_ms, loop_min_r, loop_max_r, loop_total_rows);
+	else
+		appendStringInfo(es->str,
+		 "Loop min_rows: %.0f  max_rows: %.0f  total_rows: %.0f",
+		 loop_min_r, loop_max_r, loop_total_rows);
+}
 			}
 			else
 			{
@@ -1697,8 +1752,21 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		 startup_ms, 3, es);
 	ExplainPropertyFloat("Actual Total Time", "ms",
 		 total_ms, 3, es);
+			

Re: A problem in ExecModifyTable

2021-08-17 Thread David Rowley
On Tue, 17 Aug 2021 at 19:06, 李杰(慎追)  wrote:
> Your answer explains that we still need the ModifyTable node without leaf
> partitions.
> You are right about this.
>
>  But you can review the source code again,

I'd been looking at the code in ExecInitModifyTable() that's the same
as what you pasted thinking you meant that.  I think for the check for
partitioned tables in ExecModifyTable() then it's likely just dead
code.  It seems to be causing a bit of confusion though, so might be
worth doing something about. Copied in Tom to see what he thinks as
it's one of his.

David




Re: Autovacuum on partitioned table (autoanalyze)

2021-08-17 Thread Justin Pryzby
On Mon, Aug 16, 2021 at 05:28:10PM -0500, Justin Pryzby wrote:
> On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote:
> > On 2021-Aug-16, Álvaro Herrera wrote:
> > 
> > > Here's the reversal patch for the 14 branch.  (It applies cleanly to
> > > master, but the unused member of PgStat_StatTabEntry needs to be
> > > removed and catversion bumped).
> > 
> > I have pushed this to both branches.  (I did not remove the item from
> > the release notes in the 14 branch.)
> 
> |I retained the addition of relkind 'p' to tables included by
> |pg_stat_user_tables, because reverting that would require a catversion
> |bump.
> 
> Right now, on v15dev, it shows 0, which is misleading.
> Shouldn't it be null ?
> 
> analyze_count   | 0
> 
> Note that having analyze_count and last_analyze would be an independently
> useful change.  Since parent tables aren't analyzed automatically, I have a
> script to periodically process them if they weren't processed recently.  Right
> now, for partitioned tables, the best I could find is to check its partitions:
> | MIN(last_analyzed) FROM pg_stat_all_tables psat JOIN pg_inherits i ON 
> psat.relid=i.inhrelid
> 
> In 20200418050815.ge26...@telsasoft.com I wrote:
> |This patch includes partitioned tables in pg_stat_*_tables, which is great; I
> |complained awhile ago that they were missing [0].  It might be useful if that
> |part was split out into a separate 0001 patch (?).
> | [0] 
> https://www.postgresql.org/message-id/20180601221428.GU5164%40telsasoft.com

I suggest the attached (which partially reverts the revert), to allow showing
correct data for analyze_count and last_analyzed.

Arguably these should be reported as null in v14 for partitioned tables, since
they're not "known to be zero", but rather "currently unpopulated".

n_mod_since_analyze | 0
n_ins_since_vacuum  | 0

Justin
>From 0d0e149727d89115803b4528e15f5b3c04bd816b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Aug 2021 22:55:06 -0500
Subject: [PATCH] Report last_analyze and analyze_count of partitioned tables..

In v14, partitioned tables are included, but these fields are being reported as
zero, which is misleading.
---
 src/backend/commands/analyze.c  | 36 ++---
 src/backend/postmaster/pgstat.c | 27 +
 2 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8d7b38d170..0050df08f6 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -626,8 +626,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
  PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE);
 
 	/*
-	 * Update pages/tuples stats in pg_class, and report ANALYZE to the stats
-	 * collector ... but not if we're doing inherited stats.
+	 * Update pages/tuples stats in pg_class ... but not if we're doing
+	 * inherited stats.
 	 *
 	 * We assume that VACUUM hasn't set pg_class.reltuples already, even
 	 * during a VACUUM ANALYZE.  Although VACUUM often updates pg_class,
@@ -668,20 +668,32 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 InvalidMultiXactId,
 in_outer_xact);
 		}
-
+	}
+	else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
 		/*
-		 * Now report ANALYZE to the stats collector.
-		 *
-		 * We deliberately don't report to the stats collector when doing
-		 * inherited stats, because the stats collector only tracks per-table
-		 * stats.
-		 *
-		 * Reset the changes_since_analyze counter only if we analyzed all
-		 * columns; otherwise, there is still work for auto-analyze to do.
+		 * Partitioned tables don't have storage, so we don't set any fields
+		 * in their pg_class entries except for reltuples, which is necessary
+		 * for auto-analyze to work properly, and relhasindex.
 		 */
+		vac_update_relstats(onerel, -1, totalrows,
+			0, hasindex, InvalidTransactionId,
+			InvalidMultiXactId,
+			in_outer_xact);
+	}
+
+	/*
+	 * Now report ANALYZE to the stats collector.  For regular tables, we do
+	 * it only if not doing inherited stats.  For partitioned tables, we only
+	 * do it for inherited stats. (We're never called for not-inherited stats
+	 * on partitioned tables anyway.)
+	 *
+	 * Reset the changes_since_analyze counter only if we analyzed all
+	 * columns; otherwise, there is still work for auto-analyze to do.
+	 */
+	if (!inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 			  (va_cols == NIL));
-	}
 
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a3c35bdf60..2a9673154b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1632,21 +1632,31 @@ pgstat_report_analyze(Relation rel,
 	 * be double-counted after commit.  (This approach also ensures that the
 	 * collector 

Re: ALTER TYPE vs extension membership (was Re: BUG #17144: Upgrade from v13 to v14 with the cube extension failed)

2021-08-17 Thread Andres Freund
Hi,

On 2021-08-16 15:36:59 -0400, Tom Lane wrote:
> There's a policy question here, which is when does an operation on
> a pre-existing object within an extension script cause the object
> to be absorbed into the extension?  You might naively say "never",
> but that's not our historical behavior, and I think it'd clearly
> be the wrong thing in some cases.  For example, consider
> 
> CREATE TYPE cube;   -- make a shell type
> -- do something that requires type cube to exist
> CREATE EXTENSION cube;
> 
> We don't want this to fail, because it might be necessary to do
> things that way to get out of circular dependencies.

In what scenario would that be a good way to address a circular
dependency? ISTM that whatever depends on $ext should be using shell
types or such to avoid the circular dependency, then create $ext. Rather
than creating shell types for object in $ext and then create $ext.


> Another point that perhaps deserves discussion is whether it's
> okay to change the signature of GenerateTypeDependencies in
> stable branches (we need to fix this in v13 not only v14/HEAD).
> I can't see a good reason for extensions to be calling that,
> and codesearch.debian.net knows of no outside callers, so I'm
> inclined to just change it.  If anyone thinks that's too risky,
> we could do something with a wrapper function in v13.

Seems OK to not bother with a wrapper to me - I can't see a reason
either.

Greetings,

Andres Freund




Re: Fix uninitialized variable access (src/backend/utils/mmgr/freepage.c)

2021-08-17 Thread Ranier Vilela
On Tue, Aug 17, 2021 at 05:04, Michael Paquier 
wrote:

> On Fri, Jul 02, 2021 at 06:22:56PM -0300, Ranier Vilela wrote:
> > On Thu, Jul 1, 2021 at 17:20, Mahendra Singh Thalor <
> > mahi6...@gmail.com> wrote:
> >> Please can we try to hit this rare condition by any test case. If you
> have
> >> any test cases, please share.
>
> Yeah, this needs to be proved.

Due to the absolute lack of reports, I believe that this particular case
never happened.

> Are you sure that this change is
> actually right?

Yes, I am sure.

> The bottom of FreePageManagerPutInternal() has
> assumptions that a page may not be found during a btree search, with
> an index value used.
>
Assert assumptions only hold in debug builds.
If those conditions happen, all uses of *result.index* are garbage.

regards,
Ranier Vilela


RE: [PATCH]Remove obsolete macro CHECKFLOATVAL in btree_gist

2021-08-17 Thread tanghy.f...@fujitsu.com
On Friday, August 6, 2021 11:14 PM, tanghy.f...@fujitsu.com 
 wrote:
>Commit 6bf0bc842 replaced float.c's CHECKFLOATVAL() macro with static inline 
>subroutines, 
>but the fix introduced a performance regression as Tom Lane pointed out and 
>fixed at 607f8ce74.
>
>Found obsolete CHECKFLOATVAL usage in contrib/btree_gist, and tried to fix it 
>according to 607f8ce74.

Added above patch in commit fest:
https://commitfest.postgresql.org/34/3287/

Regards,
Tang




Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Andres Freund
Hi,

On 2021-08-17 10:54:30 +0530, Amit Kapila wrote:
> 5. How can we provide a strict mechanism to not allow to use dsm APIs
> for non-dsm FileSet? One idea could be that we can have a variable
> (probably bool) in SharedFileSet structure which will be initialized
> in SharedFileSetInit based on whether the caller has provided dsm
> segment. Then in other DSM-based APIs, we can check if it is used for
> the wrong type.

Well, isn't the issue here that it's not a shared file set in case you
explicitly don't want to share it? ISTM that the proper way to address
this would be to split out a FileSet from SharedFileSet that's then used
for worker.c and sharedfileset.c. Rather than making sharedfileset.c
support a non-shared mode.
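
Roughly what I have in mind, with the field layout only sketched from the
existing SharedFileSet:

/* what buffile.c really needs, shareable or not */
typedef struct FileSet
{
	pid_t		creator_pid;	/* PID of the creating process */
	uint32		number;			/* per-PID identifier */
	int			ntablespaces;	/* number of tablespaces to use */
	Oid			tablespaces[8]; /* OIDs of tablespaces to use */
} FileSet;

/* the DSM-aware wrapper, used only when the set really is shared */
typedef struct SharedFileSet
{
	FileSet		fs;
	slock_t		mutex;			/* protects the reference count */
	int			refcnt;			/* number of attached backends */
} SharedFileSet;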

Greetings,

Andres Freund




Re: PG14: Avoid checking output-buffer-length for every encoded byte during pg_hex_encode

2021-08-17 Thread Ranier Vilela
On Tue, Aug 17, 2021 at 00:43, Michael Paquier 
wrote:

> On Mon, Aug 16, 2021 at 02:06:31PM -0300, Ranier Vilela wrote:
> > uint64 and size_t in 64 bits are equivalents.
> > uint64 and size_t in 32 bits can be weird, but anyway size_t at 32 bits
> can
> > handle 1GB.
>
> There is usually a reason why things are done as they are, so before
> suggesting changing something I would advise to read the threads that
> created those changes more carefully because they could be mentioned.
> In this case, we are talking about aef8948, and this part of its
> related thread:
> https://www.postgresql.org/message-id/20210105034739.gg7...@momjian.us

Just because things have always been done a certain way doesn't mean it's right.
Using int to address strings is a mistake.
Even with an artificial limitation to only deal with 1GB, the correct type
to use would be size_t.


>
> > base64.c can be made in another patch.
>
> This file is a set of code paths used by authentication and SCRAM, it
> will never get as hot as the code paths that Hans has reported as
> these are one-time operations.  Please note as well cfc40d3 for the
> reasons why things are handled this way.  We absolutely have to use
> safe routines here.
>
I have no plans to touch base64.c

regards,
Ranier Vilela


Re: Allow parallel DISTINCT

2021-08-17 Thread David Rowley
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu  wrote:
> Can you attach updated patch so that we know more detail about the two new 
> functions; create_final_distinct_paths and
> create_partial_distinct_paths ?

Must've fallen off in transit :)

David


parallel_distinct_v2.patch
Description: Binary data


Re: Autovacuum on partitioned table (autoanalyze)

2021-08-17 Thread Andres Freund
Hi,

On 2021-08-16 13:13:55 -0400, Álvaro Herrera wrote:
> Another possible problem is that before the revert, we accept
> ALTER TABLE some_partitioned_table SET (autovacuum_enabled=on/off);
> (also autovacuum_analyze_scale_factor and autovacuum_analyze_threshold)
> but after the revert this is will throw a syntax error.  What do people
> think we should do about that?
> 
> 1. Do nothing.  If somebody finds in that situation, they can use
>   ALTER TABLE .. RESET ...
>   to remove the settings.
> 
> 2. Silently accept the option and do nothing.
> 3. Accept the option and throw a warning that it's a no-op.
> 4. Something else

1) seems OK to me.

Greetings,

Andres Freund




Re: Autovacuum on partitioned table (autoanalyze)

2021-08-17 Thread Andres Freund
Hi,

On 2021-08-16 17:42:48 -0400, Álvaro Herrera wrote:
> On 2021-Aug-16, Álvaro Herrera wrote:
>
> > Here's the reversal patch for the 14 branch.  (It applies cleanly to
> > master, but the unused member of PgStat_StatTabEntry needs to be
> > removed and catversion bumped).
>
> I have pushed this to both branches.  (I did not remove the item from
> the release notes in the 14 branch.)
>
> It upsets me to have reverted it, but after spending so much time trying
> to correct the problems, I believe it just wasn't salvageable within the
> beta-period code freeze constraints.

:(


> I described the issues I ran into
> in earlier messages; I think a good starting point to re-develop this is
> to revert the reversal commit, then apply my patch at
> https://postgr.es/m/0794d7ca-5183-486b-9c5e-6d434867c...@www.fastmail.com
> then do something about the remaining problems that were complained
> about.  (Maybe: add an "ancestor OID" member to PgStat_StatTabEntry so
> that the collector knows to propagate counts from children to ancestors
> when the upd/ins/del counts are received.

My suspicion is that it'd be a lot easier to implement this efficiently if
there were no propagation done outside of actually analyzing tables. I.e. have
do_autovacuum() build a hashtable of (parent_table_id, count) and use that to
make the analyze decisions. And then only propagate the counts up to a table's
parents when a child is analyzed (and thus loses its changes_since_analyze
value). Then we can use hashtable_value + changes_since_analyze for the
analyze decisions on partitioned tables.
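
In code terms the bookkeeping could look vaguely like this (condensed
sketch, not the attached prototype; the entry type, hash table name and the
per-child counter lookup are invented):

typedef struct AvParentEntry
{
	Oid			parentoid;		/* hash key */
	PgStat_Counter changes;		/* summed up over live children */
} AvParentEntry;

static HTAB *
build_parent_change_counts(List *child_oids)
{
	HASHCTL		ctl;
	HTAB	   *htab;
	ListCell   *lc;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(AvParentEntry);
	htab = hash_create("parent change counts", 256, &ctl,
					   HASH_ELEM | HASH_BLOBS);

	foreach(lc, child_oids)
	{
		Oid			childoid = lfirst_oid(lc);
		List	   *ancestors = get_partition_ancestors(childoid);
		ListCell   *alc;

		foreach(alc, ancestors)
		{
			Oid			parentoid = lfirst_oid(alc);
			bool		found;
			AvParentEntry *entry;

			entry = hash_search(htab, &parentoid, HASH_ENTER, &found);
			if (!found)
				entry->changes = 0;
			/* child_changes_since_analyze() is hand-waved here */
			entry->changes += child_changes_since_analyze(childoid);
		}
		list_free(ancestors);
	}

	return htab;
}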

I've prototyped this, and it does seem to make do_autovacuum() cheaper. I've
attached that prototype, but note it's in a rough state.

However, unless we change the way inheritance parents are stored, it still
requires repetitive get_partition_ancestors() (or get_partition_parent())
calls in do_autovacuum(), which I think is problematic due to the index scans
you pointed out as well.  The obvious way to address that would be to store
parent oids in pg_class - I suspect duplicating parents in pg_class is the
best way out, but pretty it is not.


> However, consider developing it as follow-up to Horiguchi-san's shmem
> pgstat rather than current pgstat implementation.)

+1


It might be worth first tackling the reuse of samples from a relation's children
when building inheritance stats. Either by storing the samples somewhere (not
cheap) and reusing them, or by at least updating a partition's stats when
analyzing the parent.

Greetings,

Andres Freund
commit ec796bd8ee2970e2eae3b3839e1bb96696393dc7
Author: Andres Freund 
Date:   2021-07-30 17:20:21 -0700

tmp

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0c9591415e4..df021215281 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -320,12 +320,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	PgStat_Counter startwritetime = 0;
 
 	if (inh)
-		ereport(elevel,
+		ereport(LOG,
 (errmsg("analyzing \"%s.%s\" inheritance tree",
 		get_namespace_name(RelationGetNamespace(onerel)),
 		RelationGetRelationName(onerel;
 	else
-		ereport(elevel,
+		ereport(LOG,
 (errmsg("analyzing \"%s.%s\"",
 		get_namespace_name(RelationGetNamespace(onerel)),
 		RelationGetRelationName(onerel;
@@ -682,6 +682,18 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			in_outer_xact);
 	}
 
+	/*
+	 * Let the collector know about the ancestor tables of this partition.
+	 */
+	if (!inh &&
+		(va_cols == NIL) &&
+		onerel->rd_rel->relispartition &&
+		onerel->rd_rel->relkind == RELKIND_RELATION &&
+		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+	{
+		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
+	}
+
 	/*
 	 * Now report ANALYZE to the stats collector.  For regular tables, we do
 	 * it only if not doing inherited stats.  For partitioned tables, we only
@@ -695,22 +707,6 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		pgstat_report_analyze(onerel, totalrows, totaldeadrows,
 			  (va_cols == NIL));
 
-	/*
-	 * If this is a manual analyze of all columns of a permanent leaf
-	 * partition, and not doing inherited stats, also let the collector know
-	 * about the ancestor tables of this partition.  Autovacuum does the
-	 * equivalent of this at the start of its run, so there's no reason to do
-	 * it there.
-	 */
-	if (!inh && !IsAutoVacuumWorkerProcess() &&
-		(va_cols == NIL) &&
-		onerel->rd_rel->relispartition &&
-		onerel->rd_rel->relkind == RELKIND_RELATION &&
-		onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
-	{
-		pgstat_report_anl_ancestors(RelationGetRelid(onerel));
-	}
-
 	/*
 	 * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
 	 *
@@ -1183,6 +1179,8 @@ acquire_sample_rows(Relation onerel, int elevel,
 	BlockSamplerData prefetch_bs;
 #endif
 
+	

Re: Added schema level support for publication.

2021-08-17 Thread Amit Kapila
On Tue, Aug 17, 2021 at 6:40 AM Peter Smith  wrote:
>
> On Mon, Aug 16, 2021 at 11:31 PM Tom Lane  wrote:
> >
> >
> > CREATE PUBLICATION pub1 FOR
> >   TABLE t1,t2,t3, ALL TABLES IN SCHEMA s1,s2,
> >   SEQUENCE seq1,seq2, ALL SEQUENCES IN SCHEMA s3,s4;
> >
> > Abstractly it'd be
> >
> > createpub := CREATE PUBLICATION pubname FOR cpitem [, ... ] [ WITH ... ]
> >
> > cpitem := ALL TABLES |
> >   TABLE name |
> >   ALL TABLES IN SCHEMA name |
> >   ALL SEQUENCES |
> >   SEQUENCE name |
> >   ALL SEQUENCES IN SCHEMA name |
> >   name
> >
> > The grammar output would need some post-analysis to attribute the
> > right type to bare "name" items, but that doesn't seem difficult.
>
> That last bare "name" cpitem. looks like it would permit the following syntax:
>
> CREATE PUBLICATION pub FOR a,b,c;
>
> Was that intentional?
>

I think so. IIUC, the idea is that after parsing we find out whether
the given name is table, sequence, or any other object. Here, I think
the name could be either of table or sequence because, for schema, we
won't be knowing whether to include tables, sequences, or both in the
schema. Also, we can have the same name for schema and table so it
might be tricky to distinguish among those unless we give priority to
one of those.

-- 
With Regards,
Amit Kapila.




Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead

2021-08-17 Thread Andres Freund
Hi,

On 2021-08-17 10:44:51 +0200, Laurenz Albe wrote:
> On Sun, 2021-08-01 at 13:55 -0700, Andres Freund wrote:
> > We maintain last_report as GetCurrentTransactionStopTimestamp(), but then 
> > use
> > a separate timestamp in pgstat_send_connstats() to compute the difference 
> > from
> > last_report, which is then set to GetCurrentTransactionStopTimestamp()'s
> > return value.
> 
> Makes sense to me.  How about passing "now", which was just initialized from
> GetCurrentTransactionStopTimestamp(), as additional parameter to
> pgstat_send_connstats() and use that value instead of taking the current time?

Yes.


> > I'm also not all that happy with sending yet another UDP packet for this. 
> > This
> > doubles the UDP packets to the stats collector in the common cases (unless
> > more than TABSTAT_QUANTUM tables have stats to report, or shared tables have
> > been accessed). We already send plenty of "summary" information via
> > PgStat_MsgTabstat, one of which is sent unconditionally, why do we need a
> > separate message for connection stats?
> 
> Are you suggesting that connection statistics should be shoehorned into
> some other statistics message?  That would reduce the number of UDP packets,
> but it sounds ugly and confusing to me.

That ship already has sailed. Look at struct PgStat_MsgTabstat

typedef struct PgStat_MsgTabstat
{
PgStat_MsgHdr m_hdr;
Oid m_databaseid;
int m_nentries;
int m_xact_commit;
int m_xact_rollback;
PgStat_Counter m_block_read_time;   /* times in microseconds */
PgStat_Counter m_block_write_time;
PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
} PgStat_MsgTabstat;

Given that we transport the number of commits/rollbacks and block read/write
times, adding the time the connection was active/inactive doesn't really seem
like it makes things meaningfully worse?
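
Concretely, something like this is all I mean (the two new member names are
invented placeholders):

typedef struct PgStat_MsgTabstat
{
	PgStat_MsgHdr m_hdr;
	Oid			m_databaseid;
	int			m_nentries;
	int			m_xact_commit;
	int			m_xact_rollback;
	PgStat_Counter m_block_read_time;	/* times in microseconds */
	PgStat_Counter m_block_write_time;
	PgStat_Counter m_session_time;		/* new: session time since last report */
	PgStat_Counter m_active_time;		/* new: time spent executing statements */
	PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
} PgStat_MsgTabstat;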


> > Alternatively we could just not send an update to connection stats every 
> > 500ms
> > for every active connection, but only do so at connection end. The database
> > stats would only contain stats for disconnected sessions, while the stats 
> > for
> > "live" connections are maintained via backend_status.c.  That'd give us 
> > *more*
> > information for less costs, because we then could see idle/active times for
> > individual connections.
> 
> That was my original take, but if I remember right, Magnus convinced me that
> it would be more useful to have statistics for open sessions as well.
> With a connection pool, connections can stay open for a very long time,
> and the accuracy and usefulness of the statistics would become questionable.

That's not a contradiction to what I propose. Having the data available via
backend_status.c allows summing up the data for active connections and the data
for past connections.

I think it's also just cleaner to not constantly report changing preliminary
data as pgstat_send_connstats() does.


> > That'd likely be more change than what we would want to do at this point in
> > the release cycle though. But I think we ought to do something about the
> > increased overhead...
> 
> If you are talking about the extra GetCurrentTimestamp() call, and my first
> suggestion is acceptable, that should be simple.

Doubling the number of UDP messages in common workloads also seems problematic
enough that it should be addressed for 14. It increases the likelihood of
dropping stats messages on busy systems, which can have downstream impacts.

Greetings,

Andres Freund




Re: [PATCH] Allow multiple recursive self-references

2021-08-17 Thread Peter Eisentraut

On 20.07.21 13:15, Denis Hirn wrote:

>> In the next version of the patch, would you be so kind as to include
>> updated documentation of the feature and at least one regression test
>> of same?


> As requested, this new version of the patch contains regression tests and
> documentation.


The tests fail when you build with assertions enabled (configure 
--enable-cassert).


Some quick style comments:

DocBook content should use 1-space indentation (not 2 spaces).

Typo?: /* we'll see this later */ -> "set"

Opening brace after "if" should be on new line.  (See existing code.)

Also, there should be a space after "if".

These casts appear to be unnecessary:

if((Node *) stmt->larg != NULL && (Node *) stmt->rarg != NULL)




Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Amit Kapila
On Tue, Aug 17, 2021 at 1:30 PM Dilip Kumar  wrote:
>
> On Tue, Aug 17, 2021 at 10:54 AM Amit Kapila  wrote:
> >
>
> > 5. How can we provide a strict mechanism to not allow to use dsm APIs
> > for non-dsm FileSet? One idea could be that we can have a variable
> > (probably bool) in SharedFileSet structure which will be initialized
> > in SharedFileSetInit based on whether the caller has provided dsm
> > segment. Then in other DSM-based APIs, we can check if it is used for
> > the wrong type.
>
> Yeah, we can do something like that, but can't we just use an existing
> variable instead of adding a new one? E.g. refcnt is required only when
> multiple processes are attached, so maybe if the dsm segment is not passed
> then we can keep refcnt as 0, and based on that we can give an error.  For
> example, if we try to call SharedFileSetAttach for a SharedFileSet
> which has refcnt 0 then we error out?
>

But as of now, we treat refcnt as 0 for SharedFileSet that is already
destroyed. See SharedFileSetAttach.

-- 
With Regards,
Amit Kapila.




Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead

2021-08-17 Thread Laurenz Albe
On Sun, 2021-08-01 at 13:55 -0700, Andres Freund wrote:
> Since
> 
> commit 960869da0803427d14335bba24393f414b476e2c
> Author: Magnus Hagander 
> Date:   2021-01-17 13:34:09 +0100
> 
> Add pg_stat_database counters for sessions and session time
> 
> pgstat_report_stat() does another timestamp computation via
> pgstat_send_connstats(), despite typically having computed one just a few
> lines before.
> 
> Given that timestamp computation isn't all that cheap, that's not great. Even
> more, that additional timestamp computation makes things *less* accurate:
> 
> void
> pgstat_report_stat(bool disconnect)
> ...
>   /*
>* Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
>* msec since we last sent one, or the backend is about to exit.
>*/
>   now = GetCurrentTransactionStopTimestamp();
>   if (!disconnect &&
>   !TimestampDifferenceExceeds(last_report, now, 
> PGSTAT_STAT_INTERVAL))
>   return;
> 
>   /* for backends, send connection statistics */
>   if (MyBackendType == B_BACKEND)
>   pgstat_send_connstats(disconnect, last_report);
> 
>   last_report = now;
> 
> and then pgstat_send_connstats() does:
>   /* session time since the last report */
>   TimestampDifference(((last_report == 0) ? MyStartTimestamp : 
> last_report),
>   GetCurrentTimestamp(),
>   &secs, &usecs);
> msg.m_session_time = secs * 1000000 + usecs;
> 
> We maintain last_report as GetCurrentTransactionStopTimestamp(), but then use
> a separate timestamp in pgstat_send_connstats() to compute the difference from
> last_report, which is then set to GetCurrentTransactionStopTimestamp()'s
> return value.

Makes sense to me.  How about passing "now", which was just initialized from
GetCurrentTransactionStopTimestamp(), as additional parameter to
pgstat_send_connstats() and use that value instead of taking the current time?

> I'm also not all that happy with sending yet another UDP packet for this. This
> doubles the UDP packets to the stats collector in the common cases (unless
> more than TABSTAT_QUANTUM tables have stats to report, or shared tables have
> been accessed). We already send plenty of "summary" information via
> PgStat_MsgTabstat, one of which is sent unconditionally, why do we need a
> separate message for connection stats?

Are you suggesting that connection statistics should be shoehorned into
some other statistics message?  That would reduce the number of UDP packets,
but it sounds ugly and confusing to me.

> Alternatively we could just not send an update to connection stats every 500ms
> for every active connection, but only do so at connection end. The database
> stats would only contain stats for disconnected sessions, while the stats for
> "live" connections are maintained via backend_status.c.  That'd give us *more*
> information for less costs, because we then could see idle/active times for
> individual connections.

That was my original take, but if I remember right, Magnus convinced me that
it would be more useful to have statistics for open sessions as well.
With a connection pool, connections can stay open for a very long time,
and the accuracy and usefulness of the statistics would become questionable.

> That'd likely be more change than what we would want to do at this point in
> the release cycle though. But I think we ought to do something about the
> increased overhead...

If you are talking about the extra GetCurrentTimestamp() call, and my first
suggestion is acceptable, that should be simple.

Yours,
Laurenz Albe





RE: Skipping logical replication transactions on subscriber side

2021-08-17 Thread tanghy.f...@fujitsu.com
On Thursday, August 12, 2021 1:53 PM Masahiko Sawada  
wrote:
>
> I've attached the updated patches. FYI I've included the patch
> (v8-0005) that fixes the assertion failure during shared fileset
> cleanup to make cfbot tests happy.


Hi

Thanks for your patch. I ran into a problem when using it. The log is not what I
expected in some cases, but in streaming mode it works well.

For example:
--publisher--
create table test (a int primary key, b varchar);
create publication pub for table test;

--subscriber--
create table test (a int primary key, b varchar);
insert into test values (1);
create subscription sub connection 'dbname=postgres port=5432' publication pub 
with(streaming=on);

--publisher--
insert into test values (1);

Subscriber log:
2021-08-17 14:24:43.415 CST [3630341] ERROR:  duplicate key value violates 
unique constraint "test_pkey"
2021-08-17 14:24:43.415 CST [3630341] DETAIL:  Key (a)=(1) already exists.

It didn't give the additional context info generated by the apply_error_callback function.

In streaming mode (which worked as I expected):
--publisher--
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1, 1) s(i);

Subscriber log:
2021-08-17 14:26:26.521 CST [3630510] ERROR:  duplicate key value violates 
unique constraint "test_pkey"
2021-08-17 14:26:26.521 CST [3630510] DETAIL:  Key (a)=(1) already exists.
2021-08-17 14:26:26.521 CST [3630510] CONTEXT:  processing remote data during 
"INSERT" for replication target relation "public.test" in transaction id 710 
with commit timestamp 2021-08-17 14:26:26.403214+08

I looked into it briefly and think it is related to some code in the
apply_dispatch function. It sets the callback when apply_error_callback_arg.command
is 0, and resets the callback at the end of the function. But
apply_error_callback_arg.command is not reset to 0, so the callback won't be set
the next time apply_dispatch is called.

I tried to fix it with the following change, thoughts?

@@ -2455,7 +2455,10 @@ apply_dispatch(StringInfo s)

/* Pop the error context stack */
if (set_callback)
+   {
error_context_stack = errcallback.previous;
+   apply_error_callback_arg.command = 0;
+   }
 }

Besides, if we make the changes like this, do we still need to reset
apply_error_callback_arg.command in reset_apply_error_context_info function?

Regards
Tang


Re: Allow parallel DISTINCT

2021-08-17 Thread Zhihong Yu
On Mon, Aug 16, 2021 at 10:07 PM David Rowley  wrote:

> On Wed, 11 Aug 2021 at 16:51, David Rowley  wrote:
> > The patch is just some plumbing work to connect all the correct paths
> > up to make it work. It's all fairly trivial.
>
> I looked at this patch again and realise that it could be done a bit
> better. For example, the previous version set the distinct_rel's FDW
> fields twice, once when making the serial paths and once when
> finalizing the partial paths.
>
> I've now added two new functions; create_final_distinct_paths and
> create_partial_distinct_paths.  The responsibility of
> create_distinct_paths has changed. Instead of it creating the
> non-parallel DISTINCT paths, it calls the two new functions and also
> takes charge of calling the create_upper_paths_hook for
> UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call.   I think
> this is nicer as I'd previously added a new parameter to
> create_distinct_paths() so I could tell it not to call the hook as I
> didn't want to call that twice on the same relation as it would no
> doubt result in some plugin just creating the same paths again.
>
> I've also changed my mind about the previous choice I'd made not to
> call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT.  I now
> think that's ok.
>
> I think this is a fairly trivial patch that just does a bit of wiring
> up of paths.  Unless anyone has anything to say about it in the next
> few days, I'll be looking at it again with intensions to push it.
>
> David
>
>
Hi David,

Can you attach the updated patch so that we can see more detail about the two
new functions, create_final_distinct_paths and create_partial_distinct_paths?

Thanks


Re: Fix uninitialized variable access (src/backend/utils/mmgr/freepage.c)

2021-08-17 Thread Michael Paquier
On Fri, Jul 02, 2021 at 06:22:56PM -0300, Ranier Vilela wrote:
> On Thu, Jul 1, 2021 at 17:20, Mahendra Singh Thalor <
> mahi6...@gmail.com> wrote:
>> Please can we try to hit this rare condition by any test case. If you have
>> any test cases, please share.

Yeah, this needs to be proved.  Are you sure that this change is
actually right?  The bottom of FreePageManagerPutInternal() has
assumptions that a page may not be found during a btree search, with
an index value used.
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Dilip Kumar
On Tue, Aug 17, 2021 at 10:54 AM Amit Kapila  wrote:
>
> On Mon, Aug 16, 2021 at 8:18 PM Dilip Kumar  wrote:
> >
> > On Fri, Aug 13, 2021 at 9:29 PM Andres Freund  wrote:
> >
> > > > I think we can extend this API but I guess it is better to then do it
> > > > for dsm-based as well so that these get tracked via resowner.
> > >
> > > DSM segments are resowner managed already, so it's not obvious that 
> > > that'd buy
> > > us much? Although I guess there could be a few edge cases that'd look 
> > > cleaner,
> > > because we could reliably trigger cleanup in the leader instead of 
> > > relying on
> > > dsm detach hooks + refcounts to manage when a set is physically deleted?
> > >
> >
> > In an off-list discussion with Thomas and Amit, we tried to discuss
> > how to clean up the shared files set in the current use case.  Thomas
> > suggested that instead of using individual shared fileset for storing
> > the data for each XID why don't we just create a single shared fileset
> > for complete worker lifetime and when the worker is exiting we can
> > just remove that shared fileset.  And for storing XID data, we can
> > just create the files under the same shared fileset and delete those
> > files when we no longer need them.  I like this idea and it looks much
> > cleaner, after this, we can get rid of the special cleanup mechanism
> > using 'filesetlist'.  I have attached a patch for the same.
> >
>
> It seems to me that this idea would obviate any need for resource
> owners as we will have only one fileset now. I have a few initial
> comments on the patch:
>
> 1.
> + /* do cleanup on worker exit (e.g. after DROP SUBSCRIPTION) */
> + on_shmem_exit(worker_cleanup, (Datum) 0);
>
> It should be registered with before_shmem_exit() hook to allow sending
> stats for file removal.

Done

> 2. After these changes, the comments atop stream_open_file and
> SharedFileSetInit need to be changed.

Done

> 3. In function subxact_info_write(), we are computing subxact file
> path twice which doesn't seem to be required.

Fixed

> 4.
> + if (missing_ok)
> + return NULL;
>   ereport(ERROR,
>   (errcode_for_file_access(),
> - errmsg("could not open temporary file \"%s\" from BufFile \"%s\": %m",
> + errmsg("could not open temporary file \"%s\" from BufFile \"%s\": %m",
>   segment_name, name)));
>
> There seems to be a formatting issue with errmsg. Also, it is better
> to keep an empty line before ereport.

Done

> 5. How can we provide a strict mechanism to not allow to use dsm APIs
> for non-dsm FileSet? One idea could be that we can have a variable
> (probably bool) in SharedFileSet structure which will be initialized
> in SharedFileSetInit based on whether the caller has provided dsm
> segment. Then in other DSM-based APIs, we can check if it is used for
> the wrong type.

Yeah, we can do something like that, but can't we just use an existing
variable instead of adding a new one?  refcnt is required only when
multiple processes are attached, so if no dsm segment is passed we can
keep refcnt at 0 and raise the error based on that.  For example, if we
try to call SharedFileSetAttach for a SharedFileSet whose refcnt is 0,
we error out.
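
To make that a bit more concrete, here is a rough sketch of the idea (not
the actual implementation; the error wording and the exact placement of the
check are only illustrative, and it assumes the existing refcnt/mutex
fields of SharedFileSet):

	/* In SharedFileSetInit(): only a DSM-backed set participates in
	 * reference counting, so a set created without a segment keeps
	 * refcnt at 0. */
	fileset->refcnt = (seg != NULL) ? 1 : 0;

	/* In SharedFileSetAttach(): refuse to attach to a set that was never
	 * meant to be shared.  Check the flag only after releasing the
	 * spinlock so that we never ereport() while holding it. */
	bool		attachable;

	SpinLockAcquire(&fileset->mutex);
	attachable = (fileset->refcnt > 0);
	if (attachable)
		fileset->refcnt++;
	SpinLockRelease(&fileset->mutex);

	if (!attachable)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot attach to a FileSet that was initialized without a DSM segment")));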

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Better-usage-of-sharedfileset-in-apply-worker.patch
Description: Binary data


Re: Two patches to speed up pg_rewind.

2021-08-17 Thread Michael Paquier
On Thu, Aug 05, 2021 at 06:18:03PM +0800, Paul Guo wrote:
> I modified the copy_file_range() patch using the below logic:
> 
> If the first call of copy_file_range() fails with errno EXDEV or ENOTSUP,
> pg_rewind no longer uses copy_file_range() in the rest of the code, and if
> copy_file_range() fails we fall back to the previous read()+write() logic
> for the file.

I have looked at 0001, and I don't really like it.  One argument
against this approach is that if pg_rewind fails in the middle of its
operation then we would have done a set of fsync() for nothing, with
the data folder still unusable.  I would be curious to see some numbers
showing how much it matters with many physical files (say, cases with
thousands of small relations).

+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif !defined(WIN32) && defined(MS_ASYNC)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1

This is wrong for the frontend code on platforms that may finish by
using MS_ASYNC, no?  There is no such implementation in file_utils.c
but there is one in fd.c.
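
If I read the frontend code right, the define there should only advertise
the branches that file_utils.c actually implements, something like this
(just a sketch of the point, not a tested patch):

/* Frontend code has no MS_ASYNC fallback, so do not claim it works. */
#if defined(HAVE_SYNC_FILE_RANGE)
#define PG_FLUSH_DATA_WORKS 1
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
#define PG_FLUSH_DATA_WORKS 1
#endif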

+   fsync_fname("global/pg_control", false);
+   fsync_fname("backup_label", false);
+   if (access("recovery.conf", F_OK) == 0)
+   fsync_fname("recovery.conf", false);
+   if (access("postgresql.auto.conf", F_OK) == 0)
+   fsync_fname("postgresql.auto.conf", false);

This list is wrong in various respects, no?  It would miss custom
configuration files, or included files.

-   if (showprogress)
-   pg_log_info("syncing target data directory");
-   sync_target_dir();
-
/* Also update the standby configuration, if requested. */
if (writerecoveryconf && !dry_run)
WriteRecoveryConfig(conn, datadir_target,
GenerateRecoveryConfig(conn, NULL));

+   if (showprogress)
+   pg_log_info("syncing target data directory");
+   perform_sync(filemap);

Why invert the order here?

+   * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
+   * (return EXDEV). Some fs do not support copy_file_range() (return
+   * ENOTSUP). Here we explicitly disable copy_file_range() for the
+   * two scenarios. For other failures we still allow subsequent
+   * copy_file_range() try.
+   */
+   if (errno == ENOTSUP || errno == EXDEV)
+   copy_file_range_support = false;
Are you sure that it is right to always cancel the support of
copy_file_range() after it does not work once?  Couldn't it be
possible that things work properly depending on the tablespace being
worked on by pg_rewind?

Having the facility for copy_file_range() inside pg_rewind is not nice in
the end, and we are going to need a run-time check to fall back
dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}.
Hmm.  What about locating all that in file_utils.c instead, with a
brand new routine name (pg_copy_file_range would be one choice)?  We
would still need the ./configure check, except that the conditions for
using the fallback implementation would live in this routine, i.e. fall
back on EXDEV, ENOTSUP or !HAVE_COPY_FILE_RANGE.  The backend may need
to solve this problem at some point, but its logging and fd handling
will likely live in fd.c, so having one routine for the purpose of all
frontends looks like a step in the right direction.
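
As a strawman, such a routine could look roughly like the following; the
signature, buffer size and error handling are only illustrative (short
writes are not retried here), not a final design:

ssize_t
pg_copy_file_range(int src_fd, int dst_fd, size_t nbytes)
{
#ifdef HAVE_COPY_FILE_RANGE
	ssize_t		rc = copy_file_range(src_fd, NULL, dst_fd, NULL, nbytes, 0);

	if (rc >= 0)
		return rc;

	/*
	 * EXDEV (cross-filesystem copy on kernels older than Linux 5.3) and
	 * ENOTSUP (filesystem without support) mean that we should fall back
	 * to a plain read()/write() loop; any other error goes back to the
	 * caller.
	 */
	if (errno != EXDEV && errno != ENOTSUP)
		return -1;
#endif

	/* Fallback implementation, copying from the current file offsets. */
	{
		char		buffer[65536];
		ssize_t		copied = 0;

		while ((size_t) copied < nbytes)
		{
			ssize_t		nread = read(src_fd, buffer,
									 Min(sizeof(buffer), nbytes - copied));

			if (nread < 0)
				return -1;
			if (nread == 0)
				break;			/* EOF */
			if (write(dst_fd, buffer, nread) != nread)
				return -1;
			copied += nread;
		}

		return copied;
	}
}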
--
Michael




Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Dilip Kumar
On Tue, Aug 17, 2021 at 12:06 PM Amit Kapila  wrote:

> One more comment:
> @@ -2976,39 +2952,17 @@ subxact_info_write(Oid subid, TransactionId xid)
> ..
> + /* Try to open the subxact file, if it doesn't exist then create it */
> + fd = BufFileOpenShared(xidfileset, path, O_RDWR, true);
> + if (fd == NULL)
> + fd = BufFileCreateShared(xidfileset, path);
> ..
>
> Instead of trying to create the file here based on whether it exists
> or not, can't we create it in subxact_info_add, where we allocate
> memory for subxacts for the first time? If that works, then in the
> above code the file should always exist.

One problem with this approach is that, for now, we delay creating the
subxact file until the end of the stream, and if all the subtransactions
got aborted within the same stream, we don't even create that file.  But
with this suggestion, we will always create the file as soon as we get
the first subtransaction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: A problem in ExecModifyTable

2021-08-17 Thread 李杰(慎追)
Hi, David

Your answer explains that we still need the ModifyTable node even when
there are no leaf partitions.
You are right about this.

 But please look at the source code again:
  ```
	/*
	 * Fetch rows from subplan, and execute the required table modification
	 * for each row.
	 */
	for (;;)
	{
		...

		/* No more tuples to process? */
		if (TupIsNull(planSlot))
			break;

		/*
		 * For UPDATE/DELETE, fetch the row identity info for the tuple to be
		 * updated/deleted.  For a heap relation, that's a TID; otherwise we
		 * may have a wholerow junk attr that carries the old tuple in toto.
		 * Keep this in step with the part of ExecInitModifyTable that sets up
		 * ri_RowIdAttNo.
		 */
		if (operation == CMD_UPDATE || operation == CMD_DELETE)
		{
			char		relkind;
			Datum		datum;
			bool		isNull;

			relkind = resultRelInfo->ri_RelationDesc->rd_rel->relkind;
			if (relkind == RELKIND_RELATION ||
				relkind == RELKIND_MATVIEW ||
				relkind == RELKIND_PARTITIONED_TABLE)

			...
		}
```
 We can see that if the scanned tuple is NULL, the loop breaks.
 Therefore, the later step that extracts the ctid is never reached.

>We'll need some sort of ResultRelInfo in the case that all partitions
>are pruned.  

 In this case, the tuple returned should be null. 

>Using the one for the partitioned table is convenient.  I
>believe that's required if there are any statement-level triggers on
>the partitioned table or if there's a RETURNING clause.

 If the tuple to be modified is null, there is no need to process the
RETURNING clause.
 Statement-level triggers cannot use the tuple, and triggers on partitioned
tables cannot have transition tables.
 I set Assert(relkind != RELKIND_PARTITIONED_TABLE); in the code, but it is
never hit.

 Perhaps I did not understand your point clearly?
 Could you provide me with a case so that I can verify it?

 Thank you very much for your answer.
 
 Regards & Thanks Adger
 

 


Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

2021-08-17 Thread Amit Kapila
On Tue, Aug 17, 2021 at 10:54 AM Amit Kapila  wrote:
>
> On Mon, Aug 16, 2021 at 8:18 PM Dilip Kumar  wrote:
> >
> > On Fri, Aug 13, 2021 at 9:29 PM Andres Freund  wrote:
> >
> > > > I think we can extend this API but I guess it is better to then do it
> > > > for dsm-based as well so that these get tracked via resowner.
> > >
> > > DSM segments are resowner managed already, so it's not obvious that 
> > > that'd buy
> > > us much? Although I guess there could be a few edge cases that'd look 
> > > cleaner,
> > > because we could reliably trigger cleanup in the leader instead of 
> > > relying on
> > > dsm detach hooks + refcounts to manage when a set is physically deleted?
> > >
> >
> > In an off-list discussion with Thomas and Amit, we tried to discuss
> > how to clean up the shared files set in the current use case.  Thomas
> > suggested that instead of using individual shared fileset for storing
> > the data for each XID why don't we just create a single shared fileset
> > for complete worker lifetime and when the worker is exiting we can
> > just remove that shared fileset.  And for storing XID data, we can
> > just create the files under the same shared fileset and delete those
> > files when we no longer need them.  I like this idea and it looks much
> > cleaner, after this, we can get rid of the special cleanup mechanism
> > using 'filesetlist'.  I have attached a patch for the same.
> >
>
> It seems to me that this idea would obviate any need for resource
> owners as we will have only one fileset now. I have a few initial
> comments on the patch:
>

One more comment:
@@ -2976,39 +2952,17 @@ subxact_info_write(Oid subid, TransactionId xid)
..
+ /* Try to open the subxact file, if it doesn't exist then create it */
+ fd = BufFileOpenShared(xidfileset, path, O_RDWR, true);
+ if (fd == NULL)
+ fd = BufFileCreateShared(xidfileset, path);
..

Instead of trying to create the file here based on whether it exists
or not, can't we create it in subxact_info_add, where we allocate
memory for subxacts for the first time? If that works, then in the
above code the file should always exist.
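
For what it's worth, a rough sketch of what I mean, reusing the names from
your patch and the existing helpers in worker.c (the details are only
illustrative):

	/* In subxact_info_add(), when allocating the subxacts array for the
	 * first time, create the subxact file as well, so that a later
	 * subxact_info_write() can assume it exists. */
	if (subxact_data.subxacts == NULL)
	{
		char		path[MAXPGPATH];
		BufFile    *fd;

		subxact_filename(path, MyLogicalRepWorker->subid, xid);
		fd = BufFileCreateShared(xidfileset, path);
		BufFileClose(fd);

		...
	}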

-- 
With Regards,
Amit Kapila.




Re: ECPG bug fix: DECLARE STATEMENT and DEALLOCATE, DESCRIBE

2021-08-17 Thread Michael Paquier
On Mon, Aug 16, 2021 at 12:06:16PM +0200, Michael Meskes wrote:
> Your patch removes the warning, but by doing that it also removes the
> feature that is being tested.

Oops.  If kept this way, this test scenario is going to need a comment
to explain exactly that.

> I'm not sure what the best way to go about it is. Shall we accept not
> testing this particular feature and remove the warning? After all, this is
> not the way the statement should be used, hence the warning. Or should
> we keep it in and redirect the warning? In that case, we would also
> lose other warnings that are not planned, though.

FWIW, I would tend to drop the warning here.  I am not sure that this
use case is interesting enough.  My 2c.
--
Michael




Re: Diagnostic comment in LogicalIncreaseXminForSlot

2021-08-17 Thread Ashutosh Bapat
Hi Amit and Andres,
Here's the updated patch.

On Mon, Aug 9, 2021 at 11:14 AM Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

>
>
> On Sat, Aug 7, 2021 at 11:40 AM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2021-07-12 17:28:15 +0530, Ashutosh Bapat wrote:
>> > On Mon, Jul 12, 2021 at 8:39 AM Amit Kapila 
>> wrote:
>> >
>> > > On Mon, Jul 5, 2021 at 12:54 PM Masahiko Sawada <
>> sawada.m...@gmail.com>
>> > > Today, again looking at this patch, I don't think doing elog inside
>> > > spinlock is a good idea. We can either release spinlock before it or
>> > > use some variable to remember that we need to write such an elog and
>> > > do it after releasing the lock. What do you think?
>> >
>> >
>> > The elog will be effective only under DEBUG1; otherwise it will be
>> > almost a NOOP. I am wondering whether it's worth adding a bool
>> > assignment and moving the elog out only for DEBUG1. Anyway, will defer
>> > it to you.
>>
>> It's *definitely* not ok to do an elog() while holding a spinlock.
>> Consider
>> what happens if the elog tries to format the message and runs out of
>> memory. Or if elog detects the stack depth is too deep. An ERROR would be
>> thrown, and we'd lose track of the held spinlock.
>>
>
> Thanks for the clarification.
>
> Amit,
> I will provide the patch changed accordingly.
>
> --
> Best Wishes,
> Ashutosh
>


-- 
--
Best Wishes,
Ashutosh
From 0a2f2c1f530d72219d9db7bd1ff77c1ae8c4ffe4 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 13 May 2021 13:40:33 +0530
Subject: [PATCH] Report new catalog_xmin candidate in
 LogicalIncreaseXminForSlot()

Similar to LogicalIncreaseRestartDecodingForSlot(), add a debug message
to LogicalIncreaseXminForSlot() reporting a new catalog_xmin candidate.
---
 src/backend/replication/logical/logical.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 64b8280c13..6c8c387e73 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1565,6 +1565,7 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 {
 	bool		updated_xmin = false;
 	ReplicationSlot *slot;
+	bool		got_new_xmin = false;
 
 	slot = MyReplicationSlot;
 
@@ -1602,12 +1603,22 @@ LogicalIncreaseXminForSlot(XLogRecPtr current_lsn, TransactionId xmin)
 	{
 		slot->candidate_catalog_xmin = xmin;
 		slot->candidate_xmin_lsn = current_lsn;
+
+		/*
+		 * Add a line about new xmin in server logs at an appropriate log level
+		 * after releasing the spinlock.
+		 */
+		got_new_xmin = true;
 	}
 	SpinLockRelease(&slot->mutex);
 
 	/* candidate already valid with the current flush position, apply */
 	if (updated_xmin)
 		LogicalConfirmReceivedLocation(slot->data.confirmed_flush);
+
+	if (got_new_xmin)
+		elog(DEBUG1, "got new catalog_xmin %u at %X/%X", xmin,
+			 LSN_FORMAT_ARGS(current_lsn));
 }
 
 /*
-- 
2.25.1