Re: Tablesync early exit

2022-04-04 Thread Amit Kapila
On Tue, Apr 5, 2022 at 9:37 AM Peter Smith  wrote:
>
> On Sat, Apr 2, 2022 at 5:17 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 1, 2022 at 1:52 PM Peter Smith  wrote:
> > >
> > > On Wed, Mar 16, 2022 at 4:07 PM Amit Kapila  wrote:
> > >
> > > I think the STATE_CATCHUP guarantees the apply worker must have
> > > received (or tried to receive) a message. See the previous answer.
> > >
> >
> > Sorry, I intended to say until the sync worker has received any message.
> > The point is that the LSN up to which the copy has finished might
> > actually be later than some of the in-progress transactions on the
> > server. It may not be a good idea to blindly skip those changes if the
> > apply worker has already received those changes (say via a 'streaming'
> > mode). Today, all such changes would be written to a file and applied
> > at commit time, but tomorrow we could have an implementation where we
> > apply such changes (via some background worker) while skipping changes
> > related to the table for which the table-sync worker is in progress.
> > Now, in such a scenario, unless we allow the table sync worker to
> > process more messages, we will end up losing some changes for that
> > particular table.
> >
> > As per my understanding, this is safe as per the current code, but it
> > can't be guaranteed for future implementations, and the extra work is
> > only what is needed to receive the messages for one transaction. I
> > still don't think that it is a good idea to pursue this patch.
>
> IIUC you are saying that my patch is good today, but it may cause
> problems in a hypothetical future if the rest of the replication logic
> is implemented differently.
>

The approach I have alluded to above was already proposed earlier on
-hackers [1] to make streaming transactions perform better. So, it is
not completely hypothetical.

[1] - https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences

2022-04-04 Thread Amit Kapila
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra  wrote:
>
> On 4/1/22 17:02, Tomas Vondra wrote:
>
> So, I investigated this a bit more, and I wrote a couple test_decoding
> isolation tests (patch attached) demonstrating the issue. Actually, I
> should say "issues" because it's a bit worse than you described ...
>
> The whole problem is in this chunk of code in sequence_decode():
>
>
>   /* Skip the change if already processed (per the snapshot). */
>   if (transactional &&
>   !SnapBuildProcessChange(builder, xid, buf->origptr))
>   return;
>   else if (!transactional &&
>(SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
> SnapBuildXactNeedsSkip(builder, buf->origptr)))
>   return;
>
>   /* Queue the increment (or send immediately if not transactional). */
>   snapshot = SnapBuildGetOrBuildSnapshot(builder, xid);
>   ReorderBufferQueueSequence(ctx->reorder, xid, snapshot, buf->endptr,
>  origin_id, target_node, transactional,
>  xlrec->created, tuplebuf);
>
> With the script you described, the increment is non-transactional, so we
> end up in the second branch, return and thus discard the increment.
>
> But it's also possible the change is transactional, which can only
> trigger the first branch. But it does not, so we start building the
> snapshot. But the first thing SnapBuildGetOrBuildSnapshot does is
>
>Assert(builder->state == SNAPBUILD_CONSISTENT);
>
> and we're still not in a consistent snapshot, so it just crashes and
> burns :-(
>
> The sequences.spec file has two definitions of s2restart step, one empty
> (resulting in non-transactional change), one with ALTER SEQUENCE (which
> means the change will be transactional).
>
>
> The really "funny" thing is this is not new code - this is an exact copy
> from logicalmsg_decode(), and logical messages have all those issues
> too. We may discard some messages, trigger the same Assert, etc. There's
> a messages2.spec demonstrating this (s2message step defines whether the
> message is transactional or not).
>

It seems to me that the Assert in SnapBuildGetOrBuildSnapshot() is
wrong. It is required only for non-transactional logical messages. For
transactional messages, we decide at commit time whether the snapshot
has reached a consistent state and then decide whether to skip the
entire transaction. So, a possible fix for the Assert could be to pass
an additional parameter 'transactional' to SnapBuildGetOrBuildSnapshot()
and then assert only when it is false. I have also checked the
development thread for this work, and the Assert appears to have been
introduced for the non-transactional case only. See email [1]: this new
function and its Assert were for the non-transactional case, and the
problem was introduced later while the code was being rearranged.

Now, for the non-transactional cases, I am not sure there is a
one-to-one mapping with the sequence case. The way sequences are dealt
with on the subscriber side (first we copy the initial data and then
replicate the incremental changes) looks more like the way we deal with
a table and its incremental changes. There is some commonality with
non-transactional messages w.r.t. the case where we want sequence
changes to be sent even on rollback unless some DDL has happened for
them, but looking at the overall solution it doesn't appear that we can
treat sequences the same way as messages. I think this is the reason we
are facing the other problems w.r.t. syncing sequences to subscribers,
including the problem reported by Sawada-san yesterday.

Now, the particular case where we won't send a non-transactional
logical message unless the snapshot is consistent could be considered
part of its defined behavior and perhaps documented better. I am not
very sure about this, as there is no in-core example of how sync for
these messages happens, but if someone outside core wants a different
behavior and presents the case then we can probably try to enhance it. I
feel the same is not true for sequences, because there it can cause the
replica (subscriber) to go out of sync with the master (publisher).

> So I guess we need to fix both places, perhaps in a similar way.
>

It depends, but I think for logical messages we should do the minimal
fix required for the Asserts and probably document the current behavior
a bit better, unless we think there is a case for making them behave
similarly to sequences.


[1] - https://www.postgresql.org/message-id/56D4B3AD.5000207%402ndquadrant.com

-- 
With Regards,
Amit Kapila.




Re: Tablesync early exit

2022-04-04 Thread Peter Smith
On Sat, Apr 2, 2022 at 5:17 PM Amit Kapila  wrote:
>
> On Fri, Apr 1, 2022 at 1:52 PM Peter Smith  wrote:
> >
> > On Wed, Mar 16, 2022 at 4:07 PM Amit Kapila  wrote:
> >
> > I think the STATE_CATCHUP guarantees the apply worker must have
> > received (or tried to receive) a message. See the previous answer.
> >
>
> Sorry, I intended to say until the sync worker has received any message.
> The point is that the LSN up to which the copy has finished might
> actually be later than some of the in-progress transactions on the
> server. It may not be a good idea to blindly skip those changes if the
> apply worker has already received those changes (say via a 'streaming'
> mode). Today, all such changes would be written to a file and applied
> at commit time, but tomorrow we could have an implementation where we
> apply such changes (via some background worker) while skipping changes
> related to the table for which the table-sync worker is in progress.
> Now, in such a scenario, unless we allow the table sync worker to
> process more messages, we will end up losing some changes for that
> particular table.
>
> As per my understanding, this is safe as per the current code, but it
> can't be guaranteed for future implementations, and the extra work is
> only what is needed to receive the messages for one transaction. I
> still don't think that it is a good idea to pursue this patch.

IIUC you are saying that my patch is good today, but it may cause
problems in a hypothetical future if the rest of the replication logic
is implemented differently.

Anyway, it seems there is no chance of this getting committed, so it
is time for me to stop flogging this dead horse.

I will remove this from the CF.

--
Kind Regards,
Peter Smith
Fujitsu Australia




Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-04 Thread Julien Rouhaud
On Tue, Apr 05, 2022 at 10:40:04AM +0900, Masahiko Sawada wrote:
> On Tue, Apr 5, 2022 at 1:31 AM Julien Rouhaud  wrote:
> >
> > Yes.  In normal circumstances it shouldn't need a lot of time to do that,
> > but I'm not so sure with e.g. network filesystems.  I'm not strongly in
> > favor of counting it, especially since smgrextend doesn't either.
> 
> Good point. I think that adding a new place to track I/O timing can be
> a separate patch so probably we can work on it for PG16 or later.

Agreed.

> I've attached updated patches, please review them.

It looks good to me, just one minor thing in 002:

@@ -183,8 +184,10 @@ typedef struct Counters
 int64   local_blks_written; /* # of local disk blocks written */
 int64   temp_blks_read; /* # of temp blocks read */
 int64   temp_blks_written;  /* # of temp blocks written */
-   double  blk_read_time;  /* time spent reading, in msec */
-   double  blk_write_time; /* time spent writing, in msec */
+   double  blk_read_time;  /* time spent reading blocks, in msec */
+   double  blk_write_time; /* time spent writing blocks, in msec */
+   double  temp_blk_read_time; /* time spent reading temp blocks, in msec */
+   double  temp_blk_write_time; /* time spent writing temp blocks, in msec */

maybe the comments should respectively be data file blocks and temp file
blocks.

This is a minor detail and the rest of the patch looks good to me, so I'm
marking the patch as Ready for Committer!




Re: Skipping logical replication transactions on subscriber side

2022-04-04 Thread Masahiko Sawada
On Tue, Apr 5, 2022 at 10:46 AM Noah Misch  wrote:
>
> On Tue, Apr 05, 2022 at 10:13:06AM +0900, Masahiko Sawada wrote:
> > On Tue, Apr 5, 2022 at 9:21 AM Noah Misch  wrote:
> > > On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote:
> > > > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch  wrote:
> > > > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote:
> > > > > > How about a comment like: "It has to be kept at 8-byte alignment
> > > > > > boundary so as to be accessed directly via C struct as it uses
> > > > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms
> > > > > > like AIX."? Can you please suggest a better comment if you don't like this one?
> > > > >
> > > > > I'd write it like this, though I'm not sure it's an improvement on your words:
> > > > >
> > > > >   When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
> > > > >   some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
> > > > >   catalog C struct layout matches catalog tuple layout, arrange for the tuple
> > > > >   offset of each fixed-width, attalign='d' catalog column to be divisible by 8
> > > > >   unconditionally.  Keep such columns before the first NameData column of the
> > > > >   catalog, since packagers can override NAMEDATALEN to an odd number.
> > > >
> > > > Thanks!
> > > >
> > > > >
> > > > > The best place for such a comment would be in one of
> > > > > src/test/regress/sql/*sanity*.sql, next to a test written to detect new
> > > > > violations.
> > > >
> > > > Agreed.
> > > >
> > > > IIUC in the new test, we would need a new SQL function to calculate
> > > > the offset of catalog columns including padding, is that right? Or do
> > > > you have an idea to do that by using existing functionality?
> > >
> > > Something like this:
> > >
> > > select
> > >   attrelid::regclass,
> > >   attname,
> > >   array(select typname
> > >         from pg_type t join pg_attribute pa on t.oid = pa.atttypid
> > >         where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum
> > >         order by pa.attnum) AS types_before,
> > >   (select sum(attlen)
> > >    from pg_type t join pg_attribute pa on t.oid = pa.atttypid
> > >    where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum) AS len_before
> > > from pg_attribute a
> > > join pg_class c on c.oid = attrelid
> > > where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1
> > > order by attrelid::regclass::text, attnum;
> > >     attrelid     │   attname    │                types_before                 │ len_before
> > > ─────────────────┼──────────────┼─────────────────────────────────────────────┼────────────
> > >  pg_sequence     │ seqstart     │ {oid,oid}                                   │          8
> > >  pg_sequence     │ seqincrement │ {oid,oid,int8}                              │         16
> > >  pg_sequence     │ seqmax       │ {oid,oid,int8,int8}                         │         24
> > >  pg_sequence     │ seqmin       │ {oid,oid,int8,int8,int8}                    │         32
> > >  pg_sequence     │ seqcache     │ {oid,oid,int8,int8,int8,int8}               │         40
> > >  pg_subscription │ subskiplsn   │ {oid,oid,name,oid,bool,bool,bool,char,bool} │         81
> > > (6 rows)
> > >
> > > That doesn't count padding, but hazardous column changes will cause a diff
> > > in the output.
> >
> > Yes, in this case, we can detect the violated column order even
> > without considering padding. On the other hand, I think this
> > calculation cannot detect some column orders. For instance, suppose
> > the column order is {oid, bool, bool, oid, bool, bool, oid, int8};
> > the len_before is 16, but the offset of the int8 column including
> > padding is 20 in an ALIGNOF_DOUBLE==4 environment.
>
> Correct.  Feel free to make it more precise.  If you do want to add a
> function, it could be a regress.c function rather than an always-installed
> part of PostgreSQL.  Again, getting the buildfarm green is a priority; we can
> always add tests later.

Agreed. I'll update and submit the patch as soon as possible.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-04 Thread Peter Geoghegan
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund  wrote:
> The remaining patch is the warnings in vac_update_relstats(), correct?  I
> guess one could argue they should be LOG rather than WARNING, but I find the
> project stance on that pretty impractical. So warning's ok with me.

Right. The reason I used WARNINGs was that it matches vaguely
related WARNINGs in vac_update_relstats()'s sibling function,
vacuum_set_xid_limits().

> Not sure why you used errmsg_internal()?

The usual reason for using errmsg_internal(), I suppose. I tend to do
that with corruption-related messages on the grounds that they're
usually highly obscure issues that are (by definition) never supposed
to happen. The only thing that a user can be expected to do with the
information from the message is to report it to -bugs, or find some
other similar report.

-- 
Peter Geoghegan




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote:
> On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan  wrote:
> > I also refined the WARNING patch in v15. It now actually issues
> > WARNINGs (rather than PANICs, which were just a temporary debugging
> > measure in v14).
> 
> Going to commit this remaining patch tomorrow, barring objections.

The remaining patch is the warnings in vac_update_relstats(), correct?  I
guess one could argue they should be LOG rather than WARNING, but I find the
project stance on that pretty impractical. So warning's ok with me.

Not sure why you used errmsg_internal()?

Otherwise LGTM.

Greetings,

Andres Freund




Re: Window Function "Run Conditions"

2022-04-04 Thread Andres Freund
On 2022-04-05 12:04:18 +1200, David Rowley wrote:
> > This is afaics slightly cheaper than referencing a variable in a slot.
> 
> I guess you must mean cheaper because it means there will be no
> EEOP_*_FETCHSOME step?  Otherwise it seems a fairly similar amount of
> work.

That, and slightly fewer indirections for accessing values IIRC.




Re: Handle infinite recursion in logical replication setup

2022-04-04 Thread Peter Smith
On Tue, Apr 5, 2022 at 12:36 PM Amit Kapila  wrote:
>
> On Tue, Apr 5, 2022 at 6:21 AM Peter Smith  wrote:
> >
> > Here are my comments for the latest patch v6-0001.
> >
> > (I will post my v6-0002 review comments separately)
> >
> > PATCH v6-0001 comments
> > ==
> >
> > 1.1 General - Option name
> >
> > I still feel like the option name is not ideal. Unfortunately, this is
> > important because any name change would impact lots of these patch
> > files and docs, struct members etc.
> >
> > It was originally called "local_only", but I thought that as a
> > SUBSCRIPTION option that was confusing because "local" means local to
> > what? Really it is local to the publisher, not local to the
> > subscriber, so that name seemed misleading.
> >
> > Then I suggested "publish_local_only". Although that resolved the
> > ambiguity problem, other people thought it seemed odd to have the
> > "publish" prefix for a subscription-side option.
> >
> > So now it is changed again to "subscribe_local_only" -- It's getting
> > better, but it is still implied that "local" means local to the
> > publisher, yet there is nothing in the option name to convey that
> > meaning. IMO we here all understand the meaning of this option
> > mostly by familiarity with this discussion thread, but I think a user
> > coming to this for the first time will still be confused by the name.
> >
> > Below are some other name ideas. Maybe they are not improvements, but
> > it might help other people to come up with something better.
> >
> > subscribe_publocal_only = true/false
> > origin = pub_only/any
> > adjacent_data_only = true/false
> > direct_subscriptions_only = true/false
> > ...
> >
>
> I think local_only with a description should be okay considering other
> current options. Some of the options like slot_name, binary, etc. are
> publisher-specific which becomes clear only after reading the
> description. I am not sure adding pub*/sub* to it is much helpful. So,
> +1 to a simple boolean like 'local_only', 'data_local', or something
> like that.

+1 to use a simple boolean option.

Thanks for the examples of the other options which are also really
publisher-specific. Seen in that light, and with a clear description,
probably the original "local_only" was fine after all.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Temporary tables versus wraparound... again

2022-04-04 Thread Greg Stark
So here's an updated patch.

I had to add a public method to multixact.c to expose the locally
calculated OldestMultiXactId. It's possible we could use something
even tighter (like the current next mxid since we're about to commit)
but I didn't see a point in going further and it would have become
more complex.

I also added a branch in heapam_handler.c in ..._set_new_filenode() for
temporary tables. It feels like a free win and it's more consistent.

I'm not 100% on the tableam abstraction -- it's possible all of this
change should have happened in heapam_handler somewhere? I don't think
so but it does feel weird to be touching it and also doing the same
thing elsewhere.

I think this has addressed all the questions now.
From 2e5d2c47288d27a620af09214c82fd66f61fb605 Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Thu, 31 Mar 2022 15:48:38 -0400
Subject: [PATCH v6 1/3] Add warnings when old temporary tables are found to
 still be in use during autovacuum. Long lived sessions using temporary tables
 are required to vacuum them themselves.

For the warning to be useful, modify checkTempNamespaceStatus to
return the pid of the backend using the namespace so that we can
inform the superuser which pid to terminate. Otherwise it's quite
tricky to determine as a user. Rename the function to avoid an
incompatible ABI break.
---
 src/backend/access/transam/varsup.c | 12 ---
 src/backend/catalog/namespace.c |  9 +++--
 src/backend/postmaster/autovacuum.c | 52 ++---
 src/include/catalog/namespace.h |  4 +--
 4 files changed, 57 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 748120a012..8b29573e9f 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -129,14 +129,16 @@ GetNewTransactionId(bool isSubXact)
 		 errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
 oldest_datname),
 		 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 			else
 ereport(ERROR,
 		(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 		 errmsg("database is not accepting commands to avoid wraparound data loss in database with OID %u",
 oldest_datoid),
 		 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 		}
 		else if (TransactionIdFollowsOrEquals(xid, xidWarnLimit))
 		{
@@ -149,14 +151,16 @@ GetNewTransactionId(bool isSubXact)
 oldest_datname,
 xidWrapLimit - xid),
 		 errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 			else
 ereport(WARNING,
 		(errmsg("database with OID %u must be vacuumed within %u transactions",
 oldest_datoid,
 xidWrapLimit - xid),
 		 errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
- "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+ "You might also need to commit or roll back old prepared transactions,\n"
+ "drop temporary tables, or drop stale replication slots.")));
 		}
 
 		/* Re-acquire lock and start over */
diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index fafb9349cc..c1fd3ced95 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -3272,15 +3272,18 @@ isOtherTempNamespace(Oid namespaceId)
 
 /*
  * checkTempNamespaceStatus - is the given namespace owned and actively used
- * by a backend?
+ * by a backend? Optionally return the pid of the owning backend if there is
+ * one. Returned pid is only meaningful when TEMP_NAMESPACE_IN_USE but note
+ * below about race conditions.
  *
  * Note: this can be used while scanning relations in pg_class to detect
  * orphaned temporary tables or namespaces with a backend connected to a
  * given database.  The result may be out of date quickly, so the caller
  * must be careful how to handle this information.
+ *
  */
 TempNamespaceStatus
-checkTempNamespaceStatus(Oid namespaceId)
+checkTempNamespaceStatusAndPid(Oid 

Re: Handle infinite recursion in logical replication setup

2022-04-04 Thread Peter Smith
Here are my comments for the latest patch v6-0002.

PATCH v6-0002 comments
==

2.1 General - should this be an independent patch?

In many ways, I think most of this patch is unrelated to the other
"local_only" patch (v6-0001).

For example, IIUC even in the current HEAD, we could consider it to be
a user error if multiple SUBSCRIPTIONS or multiple PUBLICATIONS of the
same SUBSCRIPTION are replicating to the same TABLE on the same node
and using "copy_data = on".

So I think it would be ok to throw an ERROR if such a copy_data clash
is detected, and then the user will have to change to use "copy_data =
off" for some/all of them to avoid data duplication.

The "local_only" option only adds some small logic to this new ERROR,
but it's not really a prerequisite at all.

e.g. this whole ERROR part of the patch can be a separate thread.

~~~

2.2 General - can we remove the "force" enum?

Now, because I consider the clashing "copy_data = on" ERROR to be a
user error, I think that is something that the user can already take
care of themselves just using the copy_data = off.

I did not really like the modifying of the "copy_data" option from
just boolean to some kind of hybrid boolean + "force".

a) It smells a bit off to me. IMO replication is supposed to end up
with the same (replicated) data on the standby machine but this
"force" mode seems to be just helping the user to break that concept
and say - "I know what I'm doing, and I don't care if I get lots of
duplicated data in the replica table - just let me do it"...

b) It also somehow feels like the new "force" was introduced mostly to
make the ERROR-handling implementation simpler, rather than to make
the life of the end-user better. Yes, if "force" is removed the
copy-clash detection code may need to be internally more complex than
it is now, but that is how it should be, instead of putting an extra
burden on the user (e.g. by complicating the PG docs and giving them
yet more alternatives to configure). I think any clashing copy_data
options really are a user error, but also I think the current boolean
copy_data true/false already gives the user a way to fix it.

c) Actually, your new error hint messages are similar to my
perspective: They already say "Use CREATE/ALTER SUBSCRIPTION with
copy_data = off or force". All I am saying is remove the "force", and
the user can still fix the problem just by using "copy_data = off"
appropriately.

==

So (from above) I am not much in favour of the copy_data becoming a
hybrid enum and using "force", yet that is what most of this patch is
implementing. Anyway, the remainder of my review comments here are for
the code in its current form. Maybe if "force" can be removed most of
the following comments end up being redundant.

==

2.3 Commit message - wording

This message is difficult to understand.

I think that the long sentence "Now when user is trying..." can be
broken into more manageable parts.

This part "and throw an error so that user can handle the initial copy
data." also seemed a bit vague.

~~~

2.4 Commit message - more functions

"This patch does couple of things:"

IIUC, there seems to be a third thing implemented by this patch but not
described by the comment. I think it also adds support for ALTER
SUBSCRIPTION SET PUBLICATION WITH (subscribe_local_only).

~~~

2.5 doc/src/sgml/ref/create_subscription.sgml - wording

@@ -161,6 +161,13 @@ CREATE SUBSCRIPTION subscription_namefalse.
  
+ 
+  If the tables in the publication were also subscribing to the data in
+  the publisher from other publishers, it will affect the
+  CREATE SUBSCRIPTION based on the value specified
+  for copy_data option. Refer to the
+   for details.
+ 

Is there a simpler way to express all that?

SUGGESTION
There is some interaction between the option "subscribe_local_only" and
the option "copy_data". Refer to the  for details.

~~~

2.6 doc/src/sgml/ref/create_subscription.sgml - whitespace

@@ -213,18 +220,28 @@ CREATE SUBSCRIPTION subscription_name


-copy_data (boolean)
+copy_data (enum)
 
  
   Specifies whether to copy pre-existing data in the publications
-  that are being subscribed to when the replication starts.
-  The default is true.
+  that are being subscribed to when the replication starts. This
+  parameter may be either true,
+  false or force. The default is
+  true.

That last line has trailing whitespace.

~~~

2.7 doc/src/sgml/ref/create_subscription.sgml - wording

+ 
+  If the tables in the publication were also subscribing to the data in
+  the publisher from other publishers, it will affect the
+  CREATE SUBSCRIPTION based on the value specified
+  for subscribe_local_only option. Refer to the
+   for details.
+ 

This is similar to review comment #2.5 which I 

Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 19:03:13 -0700, David G. Johnston wrote:
> > > (if this is true...but given this is an optimization category I'm
> > thinking
> > > maybe it doesn't actually matter...)
> >
> > It is true. Not sure what you mean with "optimization category"?
> >
> >
> I mean that distinguishing between stats that are fixed and those that are
> variable implies that fixed kinds have a better performance (speed, memory)
> characteristic than variable kinds (at least in part due to the presence of
> changecount).  If fixed kinds did not have a performance benefit then
> having the variable kind implementation simply handle fixed kinds as well
> (using the common struct header and storage in a hash table) would make the
> implementation simpler since all statistics would report through the same
> API.

Yes, fixed-numbered stats are faster.



> Coming back to this:
> """
> + /* cluster-scoped object stats having a variable number of entries */
> + PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
> + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
> + PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
> spot to be closer to the database-scoped options)
> +
> + /* database-scoped object stats having a variable number of entries */
> + PGSTAT_KIND_RELATION, /* per-table statistics */
> + PGSTAT_KIND_FUNCTION, /* per-function statistics */
> +
> + /* cluster-scoped stats having a fixed number of entries */ (maybe these
> should go first, the variable following?)
> + PGSTAT_KIND_ARCHIVER,
> + PGSTAT_KIND_BGWRITER,
> + PGSTAT_KIND_CHECKPOINTER,
> + PGSTAT_KIND_SLRU,
> + PGSTAT_KIND_WAL,
> """
> 
> I see three "KIND_GROUP" categories here:
> PGSTAT_KIND_CLUSTER (open to a different word here though...)
> PGSTAT_KIND_DATABASE (we seem to agree on this above)
> PGSTAT_KIND_GLOBAL (already used in the code)
> 
> This single enum can replace the two booleans that, in combination, would
> define 4 unique groups (of which only three are interesting -
> database+fixed doesn't seem interesting and so is not given a name/value
> here).

The more I think about it, the less I think a split like that makes sense. The
difference between PGSTAT_KIND_CLUSTER / PGSTAT_KIND_DATABASE is tiny. Nearly
all code just deals with both together.

I think all this is going to achieve is making the code more complicated. There
is a *single* non-assert use of accessed_across_databases and now a single
assertion involving it.

What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?

Greetings,

Andres Freund




Re: Handle infinite recursion in logical replication setup

2022-04-04 Thread Amit Kapila
On Tue, Apr 5, 2022 at 6:21 AM Peter Smith  wrote:
>
> Here are my comments for the latest patch v6-0001.
>
> (I will post my v6-0002 review comments separately)
>
> PATCH v6-0001 comments
> ==
>
> 1.1 General - Option name
>
> I still feel like the option name is not ideal. Unfortunately, this is
> important because any name change would impact lots of these patch
> files and docs, struct members etc.
>
> It was originally called "local_only", but I thought that as a
> SUBSCRIPTION option that was confusing because "local" means local to
> what? Really it is local to the publisher, not local to the
> subscriber, so that name seemed misleading.
>
> Then I suggested "publish_local_only". Although that resolved the
> ambiguity problem, other people thought it seemed odd to have the
> "publish" prefix for a subscription-side option.
>
> So now it is changed again to "subscribe_local_only" -- It's getting
> better, but it is still implied that "local" means local to the
> publisher, yet there is nothing in the option name to convey that
> meaning. IMO we here all understand the meaning of this option
> mostly by familiarity with this discussion thread, but I think a user
> coming to this for the first time will still be confused by the name.
>
> Below are some other name ideas. Maybe they are not improvements, but
> it might help other people to come up with something better.
>
> subscribe_publocal_only = true/false
> origin = pub_only/any
> adjacent_data_only = true/false
> direct_subscriptions_only = true/false
> ...
>

I think local_only with a description should be okay considering other
current options. Some of the options like slot_name, binary, etc. are
publisher-specific which becomes clear only after reading the
description. I am not sure adding pub*/sub* to it is much helpful. So,
+1 to a simple boolean like 'local_only', 'data_local', or something
like that.

-- 
With Regards,
Amit Kapila.




Re: Window Function "Run Conditions"

2022-04-04 Thread Andy Fan
>
>
> I'm pretty happy with this now. If anyone wants to have a look at
> this, can they do so or let me know they're going to within the next
> 24 hours.  Otherwise I plan to move into commit mode with it.
>
>
I just came to the office today to double check this patch.  I probably can
finish it very soon. But if you are willing to commit it sooner, I am totally
fine with it.


-- 
Best Regards
Andy Fan


Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-04-04 Thread Peter Geoghegan
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan  wrote:
> I also refined the WARNING patch in v15. It now actually issues
> WARNINGs (rather than PANICs, which were just a temporary debugging
> measure in v14).

Going to commit this remaining patch tomorrow, barring objections.

-- 
Peter Geoghegan




Re: generic plans and "initial" pruning

2022-04-04 Thread Amit Langote
On Mon, Apr 4, 2022 at 9:55 PM Amit Langote  wrote:
> On Sun, Apr 3, 2022 at 8:33 PM Alvaro Herrera  wrote:
> > I think the names ExecCreatePartitionPruneState and
> > ExecInitPartitionPruning are too confusingly similar.  Maybe the former
> > should be renamed to somehow make it clear that it is a subroutine for
> > the former.
>
> Ah, yes.  I've taken out the "Exec" from the former.

While at it, maybe it's better to rename ExecInitPruningContext() to
InitPartitionPruneContext(), which I've done in the attached updated
patch.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v10-0001-Some-refactoring-of-runtime-pruning-code.patch
Description: Binary data


Re: Lowering the ever-growing heap->pd_lower

2022-04-04 Thread Peter Geoghegan
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent
 wrote:
> # Truncating lp_array during pruning
> ===
>
> The following adversarial load grows the heap relation, but with the
> patch the relation keeps its size. The point being that HOT updates
> can temporarily inflate the LP array significantly, and this patch can
> actively combat that issue while we're waiting for the 2nd pass of
> vacuum to arrive.

I am sympathetic to the idea that giving the system a more accurate
picture of how much free space is available on each heap page is an
intrinsic good. This might help us in a few different areas. For
example, the FSM cares about relatively small differences in available
free space *among* heap pages that are "in competition" in
RelationGetBufferForTuple(). Plus we have a heuristic based on
PageGetHeapFreeSpace() in heap_page_prune_opt() to consider.

We should definitely increase MaxHeapTuplesPerPage before too long,
for a variety of reasons that I have talked about in the past. Its
current value is 291 on all mainstream platforms, a value that's
derived from accidental historic details -- which predate HOT.
Obviously an increase in MaxHeapTuplesPerPage is likely to make the
problem that the patch proposes to solve worse. I lean towards
committing the patch now as work in that direction, in fact.

It helps that this patch now seems relatively low risk.

--
Peter Geoghegan




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thread Thomas Munro
On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro  wrote:
> On Tue, Apr 5, 2022 at 2:18 AM Robert Haas  wrote:
> > I'm not sure that it really matters, but with the idea that I proposed
> > it's possible to "save" a pending writeback, if we notice that we're
> > accessing the relation with a proper lock after the barrier arrives
> > and before we actually try to do the writeback. With this approach we
> > throw them out immediately, so they're just gone. I don't mind that
> > because I don't think this will happen often enough to matter, and I
> > don't think it will be very expensive when it does, but it's something
> > to think about.
>
> The checkpointer never takes heavyweight locks, so the opportunity
> you're describing can't arise.

Hmm, oh, you probably meant the buffer interlocking
in SyncOneBuffer().  It's true that my most recent patch throws away
more requests than it could, by doing the level check at the end of
the loop over all buffers instead of adding some kind of
DropPendingWritebacks() in the barrier handler.  I guess I could find
a way to improve that, basically checking the level more often instead
of at the end, but I don't know if it's worth it; we're still throwing
out an arbitrary percentage of writeback requests.




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-04-04 Thread Kyotaro Horiguchi
At Mon, 4 Apr 2022 21:14:27 +0530, Dilip Kumar  wrote in 
> On Mon, Apr 4, 2022 at 2:25 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in
> > > I haven't found how the patch caused creation of a relation file that
> > > is to be removed soon.  However, I find that v19 patch fails by maybe
> > > due to some change in Cluster.pm.  It takes a bit more time to check
> > > that..
> >
> > I was a bit away, of course the wal-logged create database interfares
> > with the patch here. But I haven't found that why it stops creating
> > database directory under pg_tblspc.
> 
> I did not understand what is the exact problem here, but the database
> directory and the version file are created under the default
> tablespace of the target database.  However, other than the default
> tablespace of the database, the database directory will be created
> along with the smgrcreate() so that we do not create an unnecessary
> directory under the tablespace where we do not have any data to be
> copied.

Thanks. Yeah, I suspected something like that, but I didn't find a
difference in the code I suspected to be related, and that was wrong.
I took the wrong steps trying to reproduce that state and faced the
wrong error message. With the correct steps, I could see that
Storage/CREATE creates pg_tblspc/.

So, if we create a missing tablespace directory, we have no way other
than creating it directly in pg_tblspc, which violates the rule that
there shouldn't be a real directory in pg_tblspc (when
allow_in_place_tablespaces is false).

So, I have the following points in my mind for now.

- We create the directory "since we know it is just tentative state".

- Then, check that no directory in pg_tblspc when reaching consistency
  when allow_in_place_tablespaces is false.

- Leave the log_invalid_page() mechanism alone, as applying a
  differential WAL record to a newly created page that should already
  have existed always results in a corrupt page.

However, while working on it, I found that recovery faces missing
tablespace directories *after* reaching consistency.  I'm examining
that further.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Michael Paquier
On Mon, Apr 04, 2022 at 05:39:58PM -0700, Andres Freund wrote:
> On my workstation it takes about 2.39s to run pg_amcheck on a regression
> database with all thoroughness options enabled. With -j4 it's 0.62s.
> 
> Without more thorough checking it's 1.24s and 0.30s with -j4.

Okay.  That sounds like an argument to enable that by default, with
parallelism.
--
Michael


signature.asc
Description: PGP signature


Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
> On Apr 4, 2022, at 1:47 PM, Tom Lane  wrote:
>> Perhaps libpq should be trying harder to make those cases look alike, but
>> this test is about server behavior not libpq behavior, so I'm inclined
>> to just make it lax.

> +1.

> I've gotten this test failure only a few times in perhaps the last six 
> months, so if we narrow the opportunity for test failure without closing it 
> entirely, we're just making the test failures that much harder to diagnose.

Done that way.

regards, tom lane




Re: PATCH: add "--config-file=" option to pg_rewind

2022-04-04 Thread Michael Paquier
On Sun, Apr 03, 2022 at 09:31:32PM +0200, Michael Banck wrote:
> The patch applies, make is ok, make check is ok, pg_rewind TAP tests
> are ok.

+   appendPQExpBufferStr(postgres_cmd, " --config_file=");
+   appendShellString(postgres_cmd, config_file);
Nit.  This is a valid option for the postgres command, but I'd rather
use a -c switch instead, mostly as a matter of self-documentation.

In the tests, the only difference between the modes "archive_cli" and
"archive" is the extra option given to the pg_rewind command, and
that's a simple redirect to what pg_rewind would choose by default
anyway.  A more elegant solution would be to have an extra option in
RewindTest::run_pg_rewind(), where any caller of the routine can feed
extra options to the pg_rewind command.  Now, in this case, we are
not going to lose any coverage if the existing "archive" mode is
changed so as we move away postgresql.conf from the target data folder
and just use --config-file by default, no?  I would make the choice of
simplicity, by giving up on "archive_cli" and using
primary-postgresql.conf.tmp as value for --config-file.  There could
be an argument for using --config-file for all the modes, as well.
--
Michael


signature.asc
Description: PGP signature


Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Mon, Apr 4, 2022 at 3:44 PM Andres Freund  wrote:

> Hi,
>
> On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
> > Replacing the existing assert(!kind->fixed_amount) with
> > assert(!kind->accessed_across_databases) produces the same result as the
> > later presently implies the former.
>
> I wasn't proposing to replace, but to add...
>

Right, but it seems redundant to have both when one implies the other.  But
I'm not hard set against it either, though my idea below makes them both
obsolete.

>
> > Now I start to dislike the behavioral aspect of the attribute and would
> > rather just name it: kind->is_cluster_scoped (or something else that is
> > descriptive of the stat category itself, not how it is used)
>
> I'm not in love with the name either. But cluster is just a badly
> overloaded
> word :(.
>
> system_wide? Or invert it and say: database_scoped? I think I like the
> latter.
>
>
I like database_scoped as well...but see my idea below that makes this
obsolete.

>
> > Then reorganize the Kind documentation to note and emphasize these two
> > primary descriptors:
> > variable, which can be cluster or database scoped
> > fixed, which are cluster scoped by definition
>
> Hm. There's not actually that much difference between cluster/non-cluster
> wide
> scope for most of the system. I'm not strongly against, but I'm also not
> really seeing the benefit.
>

Not married to it myself, something to come back to when the dust settles.


> > (if this is true...but given this is an optimization category I'm
> thinking
> > maybe it doesn't actually matter...)
>
> It is true. Not sure what you mean with "optimization category"?
>
>
I mean that distinguishing between stats that are fixed and those that are
variable implies that fixed kinds have a better performance (speed, memory)
characteristic than variable kinds (at least in part due to the presence of
changecount).  If fixed kinds did not have a performance benefit then
having the variable kind implementation simply handle fixed kinds as well
(using the common struct header and storage in a hash table) would make the
implementation simpler since all statistics would report through the same
API.  In that world, variability is simply a possibility that not every
actual reporter has to use.  That improved performance characteristic is
what I meant by "optimization category".  I question whether we should be
publishing "fixed" and "variable" as concrete properties.  I'm not
presently against the current choice to do so, but as you say above, I'm
also not really seeing the benefit.

(goes and looks at all the places that use the fixed_amount
field...sparking an idea)

Coming back to this:
"""
+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,
"""

I see three "KIND_GROUP" categories here:
PGSTAT_KIND_CLUSTER (open to a different word here though...)
PGSTAT_KIND_DATABASE (we seem to agree on this above)
PGSTAT_KIND_GLOBAL (already used in the code)

This single enum can replace the two booleans that, in combination, would
define 4 unique groups (of which only three are interesting -
database+fixed doesn't seem interesting and so is not given a name/value
here).

While the succinctness of the booleans has appeal, the need for half of the
booleans to end up being negated quickly tarnishes it.  With the three
groups, every assertion is positive in nature indicating which of the three
groups are handled by the function.  While that is probably a few more
characters it seems like an easier read and is less complicated as it has
fewer independent parts.  At most you OR two kinds together which is
succinct enough I would think.  There are no gaps relative to the existing
implementation that defines fixed_amount and accessed_across_databases -
every call site using either of them can be transformed mechanically.

David J.


Re: Handle infinite recursion in logical replication setup

2022-04-04 Thread vignesh C
On Mon, Apr 4, 2022 at 6:50 PM Ashutosh Bapat
 wrote:
>
> I didn't find this in https://commitfest.postgresql.org/37/. Is this
> somewhere in the commitfest?

I missed adding it earlier, I have added it to commitfest today:
https://commitfest.postgresql.org/38/3610/

Regards,
Vignesh




Re: Skipping logical replication transactions on subscriber side

2022-04-04 Thread Noah Misch
On Tue, Apr 05, 2022 at 10:13:06AM +0900, Masahiko Sawada wrote:
> On Tue, Apr 5, 2022 at 9:21 AM Noah Misch  wrote:
> > On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote:
> > > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch  wrote:
> > > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote:
> > > > > How about a comment like: "It has to be kept at 8-byte alignment
> > > > > boundary so as to be accessed directly via C struct as it uses
> > > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms
> > > > > like AIX."? Can you please suggest a better comment if you don't like
> > > > > this one?
> > > >
> > > > I'd write it like this, though I'm not sure it's an improvement on your 
> > > > words:
> > > >
> > > >   When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
> > > >   some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
> > > >   catalog C struct layout matches catalog tuple layout, arrange for the tuple
> > > >   offset of each fixed-width, attalign='d' catalog column to be divisible by 8
> > > >   unconditionally.  Keep such columns before the first NameData column of the
> > > >   catalog, since packagers can override NAMEDATALEN to an odd number.
> > >
> > > Thanks!
> > >
> > > >
> > > > The best place for such a comment would be in one of
> > > > src/test/regress/sql/*sanity*.sql, next to a test written to detect new
> > > > violations.
> > >
> > > Agreed.
> > >
> > > IIUC in the new test, we would need a new SQL function to calculate
> > > the offset of catalog columns including padding, is that right? Or do
> > > you have an idea to do that by using existing functionality?
> >
> > Something like this:
> >
> > select
> >   attrelid::regclass,
> >   attname,
> >   array(select typname
> >         from pg_type t join pg_attribute pa on t.oid = pa.atttypid
> >         where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum
> >         order by pa.attnum) AS types_before,
> >   (select sum(attlen)
> >    from pg_type t join pg_attribute pa on t.oid = pa.atttypid
> >    where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum) AS len_before
> > from pg_attribute a
> > join pg_class c on c.oid = attrelid
> > where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1
> > order by attrelid::regclass::text, attnum;
> >
> >     attrelid     │   attname    │                types_before                 │ len_before
> > ─────────────────┼──────────────┼─────────────────────────────────────────────┼────────────
> >  pg_sequence     │ seqstart     │ {oid,oid}                                   │          8
> >  pg_sequence     │ seqincrement │ {oid,oid,int8}                              │         16
> >  pg_sequence     │ seqmax       │ {oid,oid,int8,int8}                         │         24
> >  pg_sequence     │ seqmin       │ {oid,oid,int8,int8,int8}                    │         32
> >  pg_sequence     │ seqcache     │ {oid,oid,int8,int8,int8,int8}               │         40
> >  pg_subscription │ subskiplsn   │ {oid,oid,name,oid,bool,bool,bool,char,bool} │         81
> > (6 rows)
> >
> > That doesn't count padding, but hazardous column changes will cause a diff in
> > the output.
> 
> Yes, in this case, we can detect the violated column order even
> without considering padding. On the other hand, I think this
> calculation could not detect some patterns of order. For instance,
> suppose the column order is {oid, bool, bool, oid, bool, bool, oid,
> int8}, the len_before is 16 but offset of int8 column including
> padding is 20 on ALIGNOF_DOUBLE==4 environment.

Correct.  Feel free to make it more precise.  If you do want to add a
function, it could be a regress.c function rather than an always-installed
part of PostgreSQL.  Again, getting the buildfarm green is a priority; we can
always add tests later.




Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-04 Thread Masahiko Sawada
On Tue, Apr 5, 2022 at 1:31 AM Julien Rouhaud  wrote:
>
> On Tue, Apr 05, 2022 at 12:51:12AM +0900, Masahiko Sawada wrote:
> > On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud  wrote:
> > >
> > > Hmm, but AFAICS the json format would be stable as the counters are always
> > > shown even if zero.  So just doing the json format first and then the text
> > > format should also work.  Although if you're really unlucky there could 
> > > be a
> > > cache invalidation in between so we could just ignore the text format.  
> > > But I
> > > think we should at least keep a regression test with the json format, 
> > > with a
> > > comment explain why only this one is tested.
> >
> > Fair point. By commit 7e12256b478 we disabled track_io_timing, but
> > probably we can temporarily enable it and run one query with "buffers"
> > and "format json" options.
>
> Yes, enabling it for just this query.  It can't really find any problem with
> the values themselves but at least the new code path would be partially
> executed.
>
> > >
> > > - not really your patch fault I guess, but I see that extendBufFile() 
> > > isn't
> > >   handled.  There shouldn't be much activity there so maybe it's ok.
> > >   This is likely because smgr_extend is also not handled, but this one 
> > > seems
> > >   much more likely to take quite some time, and therefore should bump the
> > >   timing counters.
> >
> > You mean we should include the time for opening files as write time?
>
> Yes.  In normal circumstances it shouldn't need a lot of time to do that, but
> I'm not so sure with e.g. network filesystems.  I'm not strongly in favor of
> counting it, especially since smgrextend doesn't either.

Good point. I think that adding a new place to track I/O timing can be
a separate patch so probably we can work on it for PG16 or later.

I've attached updated patches; please review them.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v4-0001-Track-I-O-timing-for-temp-buffers.patch
Description: Binary data


v4-0002-pg_stat_statements-Track-I-O-timing-for-temp-bloc.patch
Description: Binary data


Re: Add LZ4 compression in pg_dump

2022-04-04 Thread Michael Paquier
On Fri, Apr 01, 2022 at 03:06:40PM +, gkokola...@pm.me wrote:
> I understand the itch. Indeed when LZ4 is added as compression method, this
> block changes slightly. I went with the minimum amount changed. Please find
> in 0001 of the attached this variable renamed as $gzip_program_exist. I 
> thought
> that as prefix it will match better the already used $ENV{GZIP_PROGRAM}.

Hmm.  I have spent some time on that, and upon review I really think
that we should skip the tests marked as dedicated to the gzip
compression entirely if the build is not compiled with this option,
rather than letting the code run a dump for nothing in some cases,
relying on the default to uncompress the contents in others.  In the
latter case, it happens that we already have some checks like
defaults_custom_format, but you already mentioned that.

We should also skip the later parts of the tests that rely on the
compression program when that command does not exist.  This will count
for LZ4.

> I can see the overlap case. Yet, I understand the test_key as serving a
> different purpose, as it is a key of %tests and %full_runs. I do not expect
> the database content of the generated dump to change based on which
> compression method is used.

Contrary to the current LZ4 tests in pg_dump, what we have here is a
check for a command-level run and not a data-level check.  So what's
introduced is a new concept, and we need a new way to control if the
tests should be entirely skipped or not, particularly if we finish by
not using test_key to make the difference.  Perhaps the best way to
address that is to have a new keyword in the $runs structure.  The
attached defines a new compile_option, which can be extended later for
new compression methods introduced in the tests.  So the idea is to
mark all the tests related to compression with the same test_key, and
the tests can be skipped depending on what compile_option requires.

> In the attached version, I propose that the compression_cmd is converted into
> a hash. It contains two keys, the program and the arguments. Maybe it is 
> easier
> to read than before or than simply grabbing the first element of the array.

Splitting the program and its arguments makes sense.

At the end I am finishing with the attached.  I also saw an overlap
with the addition of --jobs for the directory format vs not using the
option, so I have removed the case where --jobs was not used in the
directory format.
--
Michael
From b705f77862c9e41a149ce078df638032903bce3d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 5 Apr 2022 10:30:06 +0900
Subject: [PATCH v6] Extend compression coverage for pg_dump, pg_restore

---
 src/bin/pg_dump/Makefile |  2 +
 src/bin/pg_dump/t/002_pg_dump.pl | 93 +++-
 2 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index 302f7e02d6..2f524b09bf 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -16,6 +16,8 @@ subdir = src/bin/pg_dump
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
+export GZIP_PROGRAM=$(GZIP)
+
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index af5d6fa5a3..dda3649373 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -20,12 +20,22 @@ my $tempdir   = PostgreSQL::Test::Utils::tempdir;
 # test_key indicates that a given run should simply use the same
 # set of like/unlike tests as another run, and which run that is.
 #
+# compile_option indicates if the commands run depend on a compilation
+# option, if any.  This controls if tests are skipped if a dependency
+# is not satisfied.
+#
 # dump_cmd is the pg_dump command to run, which is an array of
 # the full command and arguments to run.  Note that this is run
 # using $node->command_ok(), so the port does not need to be
 # specified and is pulled from $PGPORT, which is set by the
 # PostgreSQL::Test::Cluster system.
 #
+# compress_cmd is the utility command for (de)compression, if any.
+# Note that this should generally be used on pg_dump's output
+# either to generate a text file to run the through the tests, or
+# to test pg_restore's ability to parse manually compressed files
+# that otherwise pg_dump does not compress on its own (e.g. *.toc).
+#
 # restore_cmd is the pg_restore command to run, if any.  Note
 # that this should generally be used when the pg_dump goes to
 # a non-text file and that the restore can then be used to
@@ -54,6 +64,58 @@ my %pgdump_runs = (
 			"$tempdir/binary_upgrade.dump",
 		],
 	},
+
+	# Do not use --no-sync to give test coverage for data sync.
+	compression_gzip_custom => {
+		test_key => 'compression',
+		compile_option => 'gzip',
+		dump_cmd => [
+			'pg_dump',  '--format=custom',
+			

Re: Skipping logical replication transactions on subscriber side

2022-04-04 Thread Masahiko Sawada
On Tue, Apr 5, 2022 at 9:21 AM Noah Misch  wrote:
>
> On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote:
> > On Mon, Apr 4, 2022 at 3:26 PM Noah Misch  wrote:
> > > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote:
> > > > How about a comment like: "It has to be kept at 8-byte alignment
> > > > boundary so as to be accessed directly via C struct as it uses
> > > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms
> > > > like AIX."? Can you please suggest a better comment if you don't like
> > > > this one?
> > >
> > > I'd write it like this, though I'm not sure it's an improvement on your 
> > > words:
> > >
> > >   When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
> > >   some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
> > >   catalog C struct layout matches catalog tuple layout, arrange for the tuple
> > >   offset of each fixed-width, attalign='d' catalog column to be divisible by 8
> > >   unconditionally.  Keep such columns before the first NameData column of the
> > >   catalog, since packagers can override NAMEDATALEN to an odd number.
> >
> > Thanks!
> >
> > >
> > > The best place for such a comment would be in one of
> > > src/test/regress/sql/*sanity*.sql, next to a test written to detect new
> > > violations.
> >
> > Agreed.
> >
> > IIUC in the new test, we would need a new SQL function to calculate
> > the offset of catalog columns including padding, is that right? Or do
> > you have an idea to do that by using existing functionality?
>
> Something like this:
>
> select
>   attrelid::regclass,
>   attname,
>   array(select typname
>         from pg_type t join pg_attribute pa on t.oid = pa.atttypid
>         where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum
>         order by pa.attnum) AS types_before,
>   (select sum(attlen)
>    from pg_type t join pg_attribute pa on t.oid = pa.atttypid
>    where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum) AS len_before
> from pg_attribute a
> join pg_class c on c.oid = attrelid
> where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1
> order by attrelid::regclass::text, attnum;
>
>     attrelid     │   attname    │                types_before                 │ len_before
> ─────────────────┼──────────────┼─────────────────────────────────────────────┼────────────
>  pg_sequence     │ seqstart     │ {oid,oid}                                   │          8
>  pg_sequence     │ seqincrement │ {oid,oid,int8}                              │         16
>  pg_sequence     │ seqmax       │ {oid,oid,int8,int8}                         │         24
>  pg_sequence     │ seqmin       │ {oid,oid,int8,int8,int8}                    │         32
>  pg_sequence     │ seqcache     │ {oid,oid,int8,int8,int8,int8}               │         40
>  pg_subscription │ subskiplsn   │ {oid,oid,name,oid,bool,bool,bool,char,bool} │         81
> (6 rows)
>
> That doesn't count padding, but hazardous column changes will cause a diff in
> the output.

Yes, in this case, we can detect the violated column order even
without considering padding. On the other hand, I think this
calculation could not detect some patterns of order. For instance,
suppose the column order is {oid, bool, bool, oid, bool, bool, oid,
int8}, the len_before is 16 but offset of int8 column including
padding is 20 on ALIGNOF_DOUBLE==4 environment.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Handle infinite recursion in logical replication setup

2022-04-04 Thread Peter Smith
Here are my comments for the latest patch v6-0001.

(I will post my v6-0002 review comments separately)

PATCH v6-0001 comments
==

1.1 General - Option name

I still feel like the option name is not ideal. Unfortunately, this is
important because any name change would impact lots of these patch
files and docs, struct members etc.

It was originally called "local_only", but I thought that as a
SUBSCRIPTION option that was confusing because "local" means local to
what? Really it is local to the publisher, not local to the
subscriber, so that name seemed misleading.

Then I suggested "publish_local_only". Although that resolved the
ambiguity problem, other people thought it seemed odd to have the
"publish" prefix for a subscription-side option.

So now it is changed again to "subscribe_local_only" -- It's getting
better but still, it is implied that the "local" means local to the
publisher except there is nothing in the option name really to convey
that meaning. IMO we here all understand the meaning of this option
mostly by familiarity with this discussion thread, but I think a user
coming to this for the first time will still be confused by the name.

Below are some other name ideas. Maybe they are not improvements, but
it might help other people to come up with something better.

subscribe_publocal_only = true/false
origin = pub_only/any
adjacent_data_only = true/false
direct_subscriptions_only = true/false
...


(FYI, the remainder of these review comments will assume the option is
still called "subscribe_local_only")

~~~

1.2 General - inconsistent members and args

IMO the struct members and args should also be named for close
consistency with whatever the option name is.

Currently the option is called "subscribe_local_only".  So I think
the members/args would be better to be called "local_only" instead of
"only_local".

~~~

1.3 Commit message - wrong option name

The commit message refers to the option name as "publish_local_only"
instead of the option name that is currently implemented.

~~~

1.4 Commit message - wording

The wording seems a bit off. Below is suggested simpler wording which
AFAIK conveys the same information.

BEFORE
Add an option publish_local_only which will subscribe only to the locally
generated data in the publisher node. If subscriber is created with this
option, publisher will skip publishing the data that was subscribed
from other nodes. It can be created using following syntax:
ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port='
PUBLICATION pub1 with (publish_local_only = on);

SUGGESTION
This patch adds a new SUBSCRIPTION boolean option
"subscribe_local_only". The default is false. When a SUBSCRIPTION is
created with this option enabled, the publisher will only publish data
that originated at the publisher node.
Usage:
CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port='
PUBLICATION pub1 with (subscribe_local_only = true);

~~~

1.5 doc/src/sgml/ref/create_subscription.sgml - "generated" changes.

+ 
+  Specifies whether the subscription will request the publisher to send
+  locally generated changes or both the locally generated changes and
+  the replicated changes that was generated from other nodes. The
+  default is false.
+ 

For some reason, it seemed a bit strange to me to use the term
"generated" changes. Maybe better to refer to the origin of changes?

SUGGESTION
Specifies whether the publisher should send only changes that
originated locally at the publisher node, or send any publisher node
changes regardless of their origin. The default is false.

~~~

1.6 src/backend/replication/pgoutput/pgoutput.c -
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM

@@ -496,6 +509,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
  else
  ctx->twophase_opt_given = true;

+ if (data->only_local && data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("requested proto_version=%d does not support subscribe_local_only, need %d or higher",
+ data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));

I thought this code should not be using
LOGICALREP_PROTO_TWOPHASE_VERSION_NUM. Shouldn't there be some newly
introduced constant like LOGICALREP_PROTO_LOCALONLY_VERSION_NUM which
you will use here?

~~~

1.7 src/bin/pg_dump/pg_dump.c - 15

@@ -4451,11 +4452,13 @@ getSubscriptions(Archive *fout)
  if (fout->remoteVersion >= 15)
  appendPQExpBufferStr(query,
  " s.subtwophasestate,\n"
- " s.subdisableonerr\n");
+ " s.subdisableonerr,\n"
+ " s.sublocal\n");
  else
  appendPQExpBuffer(query,
" '%c' AS subtwophasestate,\n"
-   " false AS subdisableonerr\n",
+   " false AS subdisableonerr,\n"
+   " false AS s.sublocal\n",
LOGICALREP_TWOPHASE_STATE_DISABLED);

I think this local_only feature is unlikely to get into the PG15
release, so this code should be split out into 

Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-05 08:46:06 +0900, Michael Paquier wrote:
> On Sun, Apr 03, 2022 at 11:53:03AM -0700, Andres Freund wrote:
> > It seems $subject would have a chance of catching some of these bugs, as 
> > well
> > as exposing amcheck to a database with a bit more varied content?
> 
> Makes sense to me to extend that.
> 
> > Depending on the cost it might make sense to do this optionally, via
> > PG_TEST_EXTRA?
> 
> Yes, it would be good to check the difference in run-time before
> introducing more.  A logical dump of the regression database is no
> more than 15MB if I recall correctly, so my guess is that most of the
> runtime is still going to be eaten by the run of pg_regress.

On my workstation it takes about 2.39s to run pg_amcheck on a regression
database with all thoroughness options enabled. With -j4 it's 0.62s.

Without more thorough checking it's 1.24s and 0.30s with -j4.

Greetings,

Andres Freund




Re: Extensible Rmgr for Table AMs

2022-04-04 Thread Jeff Davis
On Mon, 2022-04-04 at 11:15 +0530, Bharath Rupireddy wrote:
> 1) Why can't rmid be chosen by the extensions in sequential order
> from
> (129 till 255), say, to start with a columnar extension can choose
> 129, another extension can register 130 and so on right?

I'm not sure what you mean by "chosen by the extensions in sequential
order". If you mean:

(a) that it would depend on the library load order, and that postgres
would assign the ID; then that won't work because the rmgr IDs need to
be stable across restarts and config changes

(b) that there should be a convention where people reserve IDs on the
wiki page in sequential order; then that's fine with me

> 2) RM_MAX_ID is now UINT8_MAX - do we need to make it configurable?
> or
> do we think that only a few (127) custom rmgrs can exist?

127 should be plenty for quite a long time. If we need more, we can
come up with a different solution (which is OK because the WAL format
changes in new major versions).

> 3) Do we need to talk about extensible rmgrs and
> https://wiki.postgresql.org/wiki/ExtensibleRmgr in documentation
> somewhere? This will help extension developers to refer when needed.

We don't have user-facing documentation for every extension hook, and I
don't think it would be a good idea to document this one (at least in
the user-facing docs). Otherwise, we'd have to document the whole
resource manager interface, and that seems like way too much of the
internals.

> 4) Do we need to change this description?
>
> The default value of this setting is the empty string, which
> disables
> the feature.  It can be set to all to
> check all
> records, or to a comma-separated list of resource managers to
> check
> only records originating from those resource
> managers.  Currently,
> the supported resource managers are heap,
> heap2, btree,
> hash,
> gin, gist,
> sequence,
> spgist, brin, and
generic. Only
superusers can change this setting.

Yes, you're right, I updated the docs for pg_waldump and the
wal_consistency_checking GUC.

> 5) Do we need to add a PGDLLIMPORT qualifier to RegisterCustomRmgr()
> (possibly for RmgrStartup, RmgrTable, RmgrCleanup as well?)  so that
> the Windows versions of extensions can use this feature?

That seems to only be required for variables, not functions.

I don't think we want to mark RmgrTable this way, because it's not
intended to be accessed by extensions directly. It's accessed by
GetRmgr(), which is 'static inline', but that's also not intended to be
used by extensions.

> 6) Should the below strcmp pg_strcmp? The custom rmgrs can still use
> rm_name with different cases and below code fail to catch it?
> + if (!strcmp(existing_rmgr->rm_name, rmgr->rm_name))
> + ereport(PANIC,
> + (errmsg("custom rmgr \"%s\" has the same name as builtin rmgr",
> + existing_rmgr->rm_name)));

There are already a lot of places where users can choose identifiers
that differ only in uppercase/lowercase. I don't see a reason to make
an exception here.

> 7) What's the intention of the below code? It seems like below it
> checks if there's already a rmgr with the given name (RM_MAX_ID). Had
> it been RM_MAX_BUILTIN_ID instead of  RM_MAX_ID, the error message
> does make sense. Is the intention here to not have duplicate rmgrs in
> the entire RM_MAX_ID(both builtin and custom)?

Thank you. I redid a lot of the error messages and checked them out to
make sure they are helpful.

> 8) Per https://wiki.postgresql.org/wiki/ExtensibleRmgr, 128 is
> reserved, can we have it as a macro? Or something like
> (RM_MAX_ID/2+1)

It's already defined as RM_EXPERIMENTAL_ID. Is that what you meant or
did I misunderstand?

I don't see the point in defining it as RM_MAX_ID/2+1.

> 9) Thinking if there's a way to test the code, maybe exposing
> RegisterCustomRmgr as a function? I think we need to have a dummy
> extension that will be used for testing this kind of patches/features
> but that's a separate discussion IMO.

It's already exposed as a C function in xlog_internal.h.

You mean as a SQL function? I don't think that makes sense. It can only
be called while processing shared_preload_libraries (on server startup)
anyway.

Other changes this version:

 * implemented Andres's proposed performance change[1]
 * fix wal_consistency_checking
 * general cleanup

Regards,
Jeff Davis

[1] 
https://postgr.es/m/20220404043337.ocjnni7hknjsi...@alap3.anarazel.de
From 7f6acaf93d279baa464ae3ba2be173a01d969402 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 6 Nov 2021 13:01:38 -0700
Subject: [PATCH] Extensible rmgr.

Allow extensions to specify a new custom rmgr, which allows
specialized WAL. This is meant to be used by a custom Table Access
Method, which would not otherwise be able to offer logical
decoding/replication. It may also be used by new Index Access Methods.

Prior to this commit, only Generic WAL was available, which offers
support for recovery and 

Re: Skipping logical replication transactions on subscriber side

2022-04-04 Thread Noah Misch
On Mon, Apr 04, 2022 at 06:55:45PM +0900, Masahiko Sawada wrote:
> On Mon, Apr 4, 2022 at 3:26 PM Noah Misch  wrote:
> > On Mon, Apr 04, 2022 at 08:20:08AM +0530, Amit Kapila wrote:
> > > How about a comment like: "It has to be kept at 8-byte alignment
> > > boundary so as to be accessed directly via C struct as it uses
> > > TYPALIGN_DOUBLE for storage which has 4-byte alignment on platforms
> > > like AIX."? Can you please suggest a better comment if you don't like
> > > this one?
> >
> > I'd write it like this, though I'm not sure it's an improvement on your words:
> >
> >   When ALIGNOF_DOUBLE==4 (e.g. AIX), the C ABI may impose 8-byte alignment on
> >   some of the C types that correspond to TYPALIGN_DOUBLE SQL types.  To ensure
> >   catalog C struct layout matches catalog tuple layout, arrange for the tuple
> >   offset of each fixed-width, attalign='d' catalog column to be divisible by 8
> >   unconditionally.  Keep such columns before the first NameData column of the
> >   catalog, since packagers can override NAMEDATALEN to an odd number.
> 
> Thanks!
> 
> >
> > The best place for such a comment would be in one of
> > src/test/regress/sql/*sanity*.sql, next to a test written to detect new
> > violations.
> 
> Agreed.
> 
> IIUC in the new test, we would need a new SQL function to calculate
> the offset of catalog columns including padding, is that right? Or do
> you have an idea to do that by using existing functionality?

Something like this:

select
  attrelid::regclass,
  attname,
  array(select typname
        from pg_type t join pg_attribute pa on t.oid = pa.atttypid
        where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum
        order by pa.attnum) AS types_before,
  (select sum(attlen)
   from pg_type t join pg_attribute pa on t.oid = pa.atttypid
   where pa.attrelid = a.attrelid and pa.attnum > 0 and pa.attnum < a.attnum) AS len_before
from pg_attribute a
join pg_class c on c.oid = attrelid
where attalign = 'd' and relkind = 'r' and attnotnull and attlen <> -1
order by attrelid::regclass::text, attnum;
    attrelid     │   attname    │                types_before                 │ len_before
─────────────────┼──────────────┼─────────────────────────────────────────────┼────────────
 pg_sequence     │ seqstart     │ {oid,oid}                                   │          8
 pg_sequence     │ seqincrement │ {oid,oid,int8}                              │         16
 pg_sequence     │ seqmax       │ {oid,oid,int8,int8}                         │         24
 pg_sequence     │ seqmin       │ {oid,oid,int8,int8,int8}                    │         32
 pg_sequence     │ seqcache     │ {oid,oid,int8,int8,int8,int8}               │         40
 pg_subscription │ subskiplsn   │ {oid,oid,name,oid,bool,bool,bool,char,bool} │         81
(6 rows)

That doesn't count padding, but hazardous column changes will cause a diff in
the output.




Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 5:12 PM, Tom Lane  wrote:
> 
> Wrote it already, no need for you to do it.

Thanks!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
> On Apr 4, 2022, at 2:26 PM, Tom Lane  wrote:
>> So I think that instead of what you've got here, we should
>> (1) apply the map_old_guc_names[] mapping, which is constant
>> (for any one PG release anyway)
>> (2) smash to lower case
>> (3) verify validity per valid_variable_name.
>> 
>> This also simplifies life on the lookup side, where it's sufficient
>> to do steps (1) and (2) before performing a catalog search.
>> 
>> Thoughts?

> That sounds right.  Do you already have something like that coded, or would 
> you like me to post a patch?

Wrote it already, no need for you to do it.

regards, tom lane




Re: Extensible Rmgr for Table AMs

2022-04-04 Thread Jeff Davis
On Sun, 2022-04-03 at 21:33 -0700, Andres Freund wrote:
> I still think the easiest and fastest would be to just make RmgrTable
> longer,
> and not const. When registering, copy the caller provided struct into
> the
> respective RmgrData element.  Yes, we'd waste a bit of space at the
> end of the
> array, but it's typically not going to be touched and thus not be
> backed by
> "actual" memory.

Sounds good to me. I tried to break down the performance between these
approaches and didn't get a clear signal, so I'll go with your
intuition here.

Posting new patch in response to Bharath's review, which will include
this change.

Note that GetRmgr() also has an unlikely branch where it tests for the
validity of the rmgr before using it, so that it can throw a nice error
message if someone forgot to include the module in
shared_preload_libraries. I expect this will be highly predictable and
therefore not a problem.

Regards,
Jeff Davis






Re: Window Function "Run Conditions"

2022-04-04 Thread David Rowley
Thanks for having a look at this.

On Wed, 30 Mar 2022 at 11:16, Andres Freund  wrote:
> On 2022-03-29 15:11:52 +1300, David Rowley wrote:
> > One thing which I'm not sure about with the patch is how I'm handling
> > the evaluation of the runcondition in nodeWindowAgg.c.  Instead of
> > having ExecQual() evaluate an OpExpr such as "row_number() over (...)
> > <= 10", I'm replacing the WindowFunc with the Var in the targetlist
> > that corresponds to the given WindowFunc.  This saves having to double
> > evaluate the WindowFunc. Instead, the value of the Var can be taken
> > directly from the slot.  I don't know of anywhere else we do things
> > quite like that.  The runcondition is slightly similar to HAVING
> > clauses, but HAVING clauses don't work this way.
>
> Don't HAVING clauses actually work pretty similar? Yes, they don't have a Var,
> but for expression evaluation purposes an Aggref is nearly the same as a plain
> Var:
>
> EEO_CASE(EEOP_INNER_VAR)
> {
> int attnum = op->d.var.attnum;
>
> /*
>  * Since we already extracted all referenced columns from the
>  * tuple with a FETCHSOME step, we can just grab the value
>  * directly out of the slot's decomposed-data arrays.  But let's
>  * have an Assert to check that that did happen.
>  */
> Assert(attnum >= 0 && attnum < innerslot->tts_nvalid);
> *op->resvalue = innerslot->tts_values[attnum];
> *op->resnull = innerslot->tts_isnull[attnum];
>
> EEO_NEXT();
> }
> vs
> EEO_CASE(EEOP_AGGREF)
> {
> /*
>  * Returns a Datum whose value is the precomputed aggregate value
>  * found in the given expression context.
>  */
> int aggno = op->d.aggref.aggno;
>
> Assert(econtext->ecxt_aggvalues != NULL);
>
> *op->resvalue = econtext->ecxt_aggvalues[aggno];
> *op->resnull = econtext->ecxt_aggnulls[aggno];
>
> EEO_NEXT();
> }
>
> specifically we don't re-evaluate expressions?

Thanks for highlighting the similarities. I'm feeling better about the
choice now.

I've made another pass over the patch and updated a few comments and
made a small code change to delay the initialisation of a variable.

I'm pretty happy with this now. If anyone wants to have a look at
this, please do so, or let me know that you intend to, within the
next 24 hours.  Otherwise I plan to move into commit mode with it.

> This is afaics slightly cheaper than referencing a variable in a slot.

I guess you must mean cheaper because it means there will be no
EEOP_*_FETCHSOME step?  Otherwise it seems a fairly similar amount of
work.

David
From f5dab05872f0f06c05fae1ee2024285b89de8c44 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Tue, 5 Apr 2022 12:00:40 +1200
Subject: [PATCH v5] Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the one previously in any given partition.  If a query were to only
request the first few row numbers, then traditionally we would continue
evaluating the WindowAgg node until all tuples are exhausted.  However, it
is possible if someone, say only wanted all row numbers <= 10, then we
could just stop once we get number 11.

Here we implement means to do just that.  This is done by way of adding a
pg_proc.prosupport function to various of the built-in window functions
and adding supporting code to allow the support function to inform the
planner if the function is monotonically increasing, monotonically
decreasing, both or neither.  The planner is then able to make use of that
information and possibly allow the executor to short-circuit execution by
way of adding a "run condition" to the WindowAgg to instruct it to run
while this condition is true and stop when it becomes false.

When there are multiple WindowAgg nodes to evaluate then this complicates
the situation as if we were to stop execution on a lower-level WindowAgg
then an upper-level WindowAgg may not receive all of the tuples it should.
This may lead to incorrect query results.  To get around this problem all
non-top-level WindowAggs go into "passthrough" mode when their
runcondition is no longer true.  This means that they continue to pull
tuples from their subnode but no longer evaluate their window functions.
Only the top-level WindowAgg node may stop when the runcondition is no
longer true.

Here we add prosupport functions to allow this to work for; row_number(),
rank(), dense_rank(), count(*) and count(expr).  It appears technically
possible to do the same for min() and max(), however, it seems unlikely to
be useful enough, so that's not done here.

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: 
https://postgr.es/m/caaphdvqvp3at8++yf8ij06sdcoo1s_b2yoat9d4nf+mobzs...@mail.gmail.com
---
 

Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 2:26 PM, Tom Lane  wrote:
> 
> Thanks.  As I'm working through this, I'm kind of inclined to drop
> the has_parameter_privilege() variants that take an OID as object
> identifier.  This gets back to the fact that I don't think
> pg_parameter_acl OIDs have any outside use; we wouldn't even have
> them except that we need a way to track their role dependencies
> in pg_shdepend.  AFAICS users will only ever be interested in
> looking up a GUC by name.  Any objections?

None.

> Another thought here is that I see you're expending some code
> to store the canonical name of a GUC in pg_parameter_acl, but
> I think that's probably going too far.  In particular, for the
> legacy mixed-case names like "DateStyle", what ends up in the
> table is the mixed-case name, which seems problematic.  It would
> definitely be problematic if an extension used such a name,
> because we might or might not be aware of the idiosyncratic
> casing at the time a GRANT is issued.  I'm thinking that we
> really want to avoid looking up custom GUCs at all during GRANT,
> because that can't do anything except create hazards.

Yikes.  It took a few tries to see what you mean.  Yes, if the GRANT happens 
before the LOAD, that can have bad consequences.  So I agree something should 
be changed.

> So I think that instead of what you've got here, we should
> (1) apply the map_old_guc_names[] mapping, which is constant
>(for any one PG release anyway)
> (2) smash to lower case
> (3) verify validity per valid_variable_name.
> 
> This also simplifies life on the lookup side, where it's sufficient
> to do steps (1) and (2) before performing a catalog search.
> 
> Thoughts?

That sounds right.  Do you already have something like that coded, or would you 
like me to post a patch?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Michael Paquier
On Sun, Apr 03, 2022 at 11:53:03AM -0700, Andres Freund wrote:
> It seems $subject would have a chance of catching some of these bugs, as well
> as exposing amcheck to a database with a bit more varied content?

Makes sense to me to extend that.

> Depending on the cost it might make sense to do this optionally, via
> PG_TEST_EXTRA?

Yes, it would be good to check the difference in run-time before
introducing more.  A logical dump of the regression database is no
more than 15MB if I recall correctly, so my guess is that most of the
runtime is still going to be eaten by the run of pg_regress.
--
Michael




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2022-04-04 Thread Tatsuo Ishii
Hi Fabien,

> Hello Tatsuo-san,
> 
>> interval_start num_transactions sum_latency sum_latency_2 min_latency
>> max_latency
>> sum_lag sum_lag_2 min_lag max_lag skipped
>> failures serialization_failures deadlock_failures retried retries
> 
> I would suggest to reorder the last chunk to:
> 
>... retried retries failures serfail dlfail
> 
> because I intend to add connection failures handling at some point,
> and it would make more sense to add the corresponding count at the end
> with other fails.

Ok, I have adjusted the patch. V2 patch attached.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ebdb4b3f46..d1818ff316 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -2401,7 +2401,9 @@ END;
format is used for the log files:
 
 
-interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures }  sum_lag sum_lag_2 min_lag max_lag  skippedretried retries 
+interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
+sum_lag sum_lag_2 min_lag max_lag skipped
+retried retries failures serialization_failures deadlock_failures
 
 
where
@@ -2417,41 +2419,55 @@ END;
and
max_latency is the maximum latency within the interval,
failures is the number of transactions that ended
-   with a failed SQL command within the interval. If you use option
-   --failures-detailed, instead of the sum of all failed
-   transactions you will get more detailed statistics for the failed
-   transactions grouped by the following types:
-   serialization_failures is the number of
-   transactions that got a serialization error and were not retried after this,
-   deadlock_failures is the number of transactions
-   that got a deadlock error and were not retried after this.
+   with a failed SQL command within the interval.
+  
+  
The next fields,
sum_lag, sum_lag_2, min_lag,
-   and max_lag, are only present if the --rate
-   option is used.
+   and max_lag, only meaningful if the --rate
+   option is used. Otherwise, they are all 0.0.
They provide statistics about the time each transaction had to wait for the
previous one to finish, i.e., the difference between each transaction's
scheduled start time and the time it actually started.
The next field, skipped,
-   is only present if the --latency-limit option is used, too.
+   is only meaningful if the --latency-limit option is used, too. Otherwise it is 0.
It counts the number of transactions skipped because they would have
started too late.
-   The retried and retries
-   fields are present only if the --max-tries option is not
-   equal to 1. They report the number of retried transactions and the sum of all
-   retries after serialization or deadlock errors within the interval.
-   Each transaction is counted in the interval when it was committed.
+  
+  
+   The retried
+   and retries fields are only meaningful if
+   the --max-tries option is not equal to 1. Otherwise they
+   are 0. They report the number of retried transactions and the sum of all
+   retries after serialization or deadlock errors within the interval.  Each
+   transaction is counted in the interval when it was committed.
+  
+  
+   failures is the sum of all failed transactions.
+   If --failures-detailed is specified, instead of the sum of
+   all failed transactions you will get more detailed statistics for the
+   failed transactions grouped by the following types:
+   serialization_failures is the number of
+   transactions that got a serialization error and were not retried after this,
+   deadlock_failures is the number of transactions
+   that got a deadlock error and were not retried after this.
+   If --failures-detailed is not
+   specified, serialization_failures
+   and deadlock_failures are always 0.
   
 
   
-   Here is some example output:
+   Here is some example output with following options:
 
-1345828501 5601 1542744 483552416 61 2573 0
-1345828503 7884 1979812 565806736 60 1479 0
-1345828505 7208 1979422 567277552 59 1391 0
-1345828507 7685 1980268 569784714 60 1398 0
-1345828509 7073 1979779 573489941 236 1411 0
-
+pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000
+--latency-limit=10 --failures-detailed --max-tries=10 test
+
+
+
+1649114136 5815 27552565 177846919143 1078 21716 2756787 7264696105 0 9661 0 7854 31472 4022 4022 0
+1649114146 5958 28460110 182785513108 1083 20391 2539395 6411761497 0 7268 0 8127 32595 4101 4101 0
+
+  
 
   
Notice that while the plain (unaggregated) log file shows which script
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index acf3e56413..4d4b979e4f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4494,6 +4494,17 @@ doLog(TState *thread, 

Re: shared-memory based stats collector - v67

2022-04-04 Thread Andres Freund
Hi,

On 2022-03-23 10:42:03 -0700, Andres Freund wrote:
> On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
> > +   /*
> > +* When starting with crash recovery, reset pgstat data - it might not 
> > be
> > +* valid. Otherwise restore pgstat data. It's safe to do this here,
> > +* because postmaster will not yet have started any other processes
> > +*
> > +* TODO: With a bit of extra work we could just start with a pgstat file
> > +* associated with the checkpoint redo location we're starting from.
> > +*/
> > +   if (ControlFile->state == DB_SHUTDOWNED ||
> > +   ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
> > +   pgstat_restore_stats();
> > +   else
> > +   pgstat_discard_stats();
> > +
> >
> > Before there, InitWalRecovery changes the state to
> > DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
> > DB_IN_PRODUCTION. So the stat seems like always discarded on standby.
>
> Hm. I though it worked at some point. I guess there's a reason this commit is
> a separate commit marked WIP ;)

FWIW, it had gotten broken by

commit be1c00ab13a7c2c9299d60cb5a9d285c40e2506c
Author: Heikki Linnakangas 
Date:   2022-02-16 09:22:44 +0200

Move code around in StartupXLOG().

because that moved the spot where
ControlFile->state = DB_IN_CRASH_RECOVERY
is set to an earlier location.

Greetings,

Andres Freund




Re: pushdown of joinquals beyond group by/distinct on

2022-04-04 Thread David Rowley
On Tue, 5 Apr 2022 at 07:40, Arne Roland  wrote:
> can someone point out to me, why we don't consider pushdowns of the joinqual 
> for these queries beyond the distinct on?
>
> When the qual matches the distinct clause, it should be possible to generate 
> both parametrized and non parametrized subplans for the same query. The same 
> should hold true for aggregates, if the group by clause matches. Is there any 
> specific reason we aren't doing that already?

Your example shows that it's not always beneficial to pushdown such
quals.  In all cases where we currently consider qual pushdowns, we do
so without any costing. This is done fairly early in planning before
we have any visibility as to if it would be useful or not.

With your example case, if we unconditionally rewrote the subquery to
be laterally joined and pushed the condition into the subquery then we
could slow down a bunch of cases as the planner would be forced into
using a parameterized nested loop.

I don't really see how we could properly cost this short of performing
the join search twice. The join search is often the most costly part
of planning.  When you consider that there might be many quals to push
and/or many subqueries to do this to, the number of times we'd need to
perform the join search might explode fairly quickly.  That wouldn't
be great for queries where there are many join-levels to search.

It might be possible if we could come up with some heuristics earlier
in planning to determine if it's going to be a useful transformation
to make.  However, that seems fairly difficult in the absence of any
cardinality estimations.

David




Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
> Replacing the existing assert(!kind->fixed_amount) with
> assert(!kind->accessed_across_databases) produces the same result as the
> later presently implies the former.

I wasn't proposing to replace, but to add...


> Now I start to dislike the behavioral aspect of the attribute and would
> rather just name it: kind->is_cluster_scoped (or something else that is
> descriptive of the stat category itself, not how it is used)

I'm not in love with the name either. But cluster is just a badly overloaded
word :(.

system_wide? Or invert it and say: database_scoped? I think I like the latter.



> Then reorganize the Kind documentation to note and emphasize these two
> primary descriptors:
> variable, which can be cluster or database scoped
> fixed, which are cluster scoped by definition

Hm. There's not actually that much difference between cluster/non-cluster wide
scope for most of the system. I'm not strongly against, but I'm also not
really seeing the benefit.


> (if this is true...but given this is an optimization category I'm thinking
> maybe it doesn't actually matter...)

It is true. Not sure what you mean with "optimization category"?


Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thread Thomas Munro
On Tue, Apr 5, 2022 at 2:18 AM Robert Haas  wrote:
> On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro  wrote:
> > Another idea would be to call a new function DropPendingWritebacks(),
> > and also tell all the SMgrRelation objects to close all their internal
> > state (ie the fds + per-segment objects) but not free the main
> > SMgrRelationData object, and for good measure also invalidate the
> > small amount of cached state (which hasn't been mentioned in this
> > thread, but it kinda bothers me that that state currently survives, so
> > it was one unspoken reason I liked the smgrcloseall() idea).  Since
> > DropPendingWritebacks() is in a rather different module, perhaps if we
> > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> > something else, because it's not really a SMGR-only thing anymore.
>
> I'm not sure that it really matters, but with the idea that I proposed
> it's possible to "save" a pending writeback, if we notice that we're
> accessing the relation with a proper lock after the barrier arrives
> and before we actually try to do the writeback. With this approach we
> throw them out immediately, so they're just gone. I don't mind that
> because I don't think this will happen often enough to matter, and I
> don't think it will be very expensive when it does, but it's something
> to think about.

The checkpointer never takes heavyweight locks, so the opportunity
you're describing can't arise.

This stuff happens entirely within the scope of a call to
BufferSync(), which is called by CheckPointBuffer().  BufferSync()
does WritebackContextInit() to set up the object that accumulates the
writeback requests, and then runs for a while performing a checkpoint
(usually spread out over time), and then at the end does
IssuePendingWritebacks().  A CFI() can be processed any time up until
the IssuePendingWritebacks(), but not during IssuePendingWritebacks().
That's the size of the gap we need to cover.

I still like the patch I posted most recently.  Note that it's not
quite as I described above: there is no DropPendingWritebacks(),
because that turned out to be impossible to implement in a way that
the barrier handler could call -- it needs access to the writeback
context.  Instead I had to expose a counter so that
IssuePendingWritebacks() could detect whether any smgrreleaseall()
calls had happened while the writeback context was being filled up
with writeback requests, by comparing with the level recorded therein.




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Mon, Apr 4, 2022 at 2:54 PM Andres Freund  wrote:

>
> > As the existing function only handles functions and relations why not
> just
> > perform a specific Kind check for them?  Generalizing to assert on
> whether
> > or not the function works on fixed or variable Kinds seems beyond its
> > present state.  Or could it be used, as-is, for databases, replication
> > slots, and subscriptions today, and we just haven't migrated those areas
> to
> > use the now generalized function?
>
> It couldn't quite be used for those, because it really only makes sense for
> objects "within a database", because it wants to reset the timestamp of the
> pg_stat_database row too (I don't like that behaviour as-is, but that's the
> topic of another thread as you know...).
>
> It will work for other per-database stats though, once we have them.
>
>
> > Even then, unless we do expand the
> > definition of this publicly facing function it seems better to
> > precisely define what it requires as an input Kind by checking for
> RELATION
> > or FUNCTION specifically.
>
> I don't see a benefit in adding a restriction on it that we'd just have to
> lift again?
>
> How about adding a
> Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)
>
> and extending the function comment to say that it's used for per-database
> stats and that it resets both the passed-in stats object as well as
> pg_stat_database?
>
>
I could live with adding that, but...

Replacing the existing assert(!kind->fixed_amount) with
assert(!kind->accessed_across_databases) produces the same result as the
latter presently implies the former.

Now I start to dislike the behavioral aspect of the attribute and would
rather just name it: kind->is_cluster_scoped (or something else that is
descriptive of the stat category itself, not how it is used)

Then reorganize the Kind documentation to note and emphasize these two
primary descriptors:
variable, which can be cluster or database scoped
fixed, which are cluster scoped by definition (if this is true...but given
this is an optimization category I'm thinking maybe it doesn't actually
matter...)

+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,

David J.


Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Andrew Dunstan


On 4/4/22 17:26, Tom Lane wrote:
> Mark Dilger  writes:
>> [ v15 patch ]
> Thanks.  As I'm working through this, I'm kind of inclined to drop
> the has_parameter_privilege() variants that take an OID as object
> identifier.  This gets back to the fact that I don't think
> pg_parameter_acl OIDs have any outside use; we wouldn't even have
> them except that we need a way to track their role dependencies
> in pg_shdepend.  AFAICS users will only ever be interested in
> looking up a GUC by name.  Any objections?


Not from me


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 14:25:57 -0700, David G. Johnston wrote:
> > You mentioned this as a restriction above - I'm not seeing it as such?  I'd
> > like to write out stats more often in the future (e.g. in the
> > checkpointer),
> > but then it'd not be written out with this function...
> >
> >
> Yeah, the idea only really works if you can implement "last one out, shut
> off the lights".  I think I was subconsciously wanting this to work that
> way, but the existing process is good.

Preserving stats more than we do today (the patch doesn't really affect that)
will require a good chunk more work. My idea for it is that we'd write the
file out as part of a checkpoint / restartpoint, with a name including the
redo-lsn. Then when recovery starts, it can use the stats file associated with
that to start from.  Then we'd lose at most 1 checkpoint's worth of stats
during a crash, not more.

There are a few non-trivial corner cases to solve, around stats objects getting
dropped concurrently with creating that serialized snapshot. Solvable, but not
trivial.


> > > + * I also am unsure, off the top of my head, whether both replication
> > > slots and subscriptions,
> > > + * which are fixed, can be reset singly (today, and/or whether this
> > patch
> > > enables that capability)
> > > + */
> >
> > FWIW, neither are implemented as fixed amount stats.
> 
> 
> That was a typo, I meant to write variable.  My point was that of these 5
> kinds that will pass the assertion test only 2 of them are actually handled
> by the function today.
> 
> + PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
> + PGSTAT_KIND_RELATION, /* per-table statistics */
> + PGSTAT_KIND_FUNCTION, /* per-function statistics */
> + PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
> + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */

> As the existing function only handles functions and relations why not just
> perform a specific Kind check for them?  Generalizing to assert on whether
> or not the function works on fixed or variable Kinds seems beyond its
> present state.  Or could it be used, as-is, for databases, replication
> slots, and subscriptions today, and we just haven't migrated those areas to
> use the now generalized function?

It couldn't quite be used for those, because it really only makes sense for
objects "within a database", because it wants to reset the timestamp of the
pg_stat_database row too (I don't like that behaviour as-is, but that's the
topic of another thread as you know...).

It will work for other per-database stats though, once we have them.


> Even then, unless we do expand the
> definition of this publicly facing function it seems better to
> precisely define what it requires as an input Kind by checking for RELATION
> or FUNCTION specifically.

I don't see a benefit in adding a restriction on it that we'd just have to
lift again?

How about adding a
Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)

and extending the function comment to say that it's used for per-database
stats and that it resets both the passed-in stats object as well as
pg_stat_database?

Greetings,

Andres Freund




Re: psql - add SHOW_ALL_RESULTS option

2022-04-04 Thread Peter Eisentraut

On 02.04.22 15:26, Fabien COELHO wrote:



Again, after the SendQuery refactoring extraction.


I'm doing this locally, so don't feel obliged to send more of these. ;-)


Good for me :-)


This has been committed.

I reduced some of your stylistic changes in order to keep the surface 
area of this complicated patch small.  We can apply some of those later 
if you are interested.  Right now, let's let it settle a bit.






Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
> [ v15 patch ]

Thanks.  As I'm working through this, I'm kind of inclined to drop
the has_parameter_privilege() variants that take an OID as object
identifier.  This gets back to the fact that I don't think
pg_parameter_acl OIDs have any outside use; we wouldn't even have
them except that we need a way to track their role dependencies
in pg_shdepend.  AFAICS users will only ever be interested in
looking up a GUC by name.  Any objections?

Another thought here is that I see you're expending some code
to store the canonical name of a GUC in pg_parameter_acl, but
I think that's probably going too far.  In particular, for the
legacy mixed-case names like "DateStyle", what ends up in the
table is the mixed-case name, which seems problematic.  It would
definitely be problematic if an extension used such a name,
because we might or might not be aware of the idiosyncratic
casing at the time a GRANT is issued.  I'm thinking that we
really want to avoid looking up custom GUCs at all during GRANT,
because that can't do anything except create hazards.

So I think that instead of what you've got here, we should
(1) apply the map_old_guc_names[] mapping, which is constant
(for any one PG release anyway)
(2) smash to lower case
(3) verify validity per valid_variable_name.

This also simplifies life on the lookup side, where it's sufficient
to do steps (1) and (2) before performing a catalog search.

Thoughts?

regards, tom lane




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Mon, Apr 4, 2022 at 2:06 PM Andres Freund  wrote:

>
> > My first encounter with pg_stat_exists_stat() didn't draw my attention as
> > being problematic so I'd say we just stick with it.  As a SQL user
> reading:
> > WHERE exists (...) is somewhat natural; using "have" or back-to-back
> > stat_stat is less appealing.
>
> There are a number of other *_exists functions, albeit not within
> pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?
>
>
Works for me.


> A way to get back to the old behaviour seems good,
> and the function idea doesn't provide that.
>

Makes sense.

> (merged the typos that I hadn't already fixed based on Justin / Thomas'
> feedback)
>
>
> > @@ -371,7 +371,9 @@ pgstat_discard_stats(void)
> >  /*
> >   * pgstat_before_server_shutdown() needs to be called by exactly one
> > process
> >   * during regular server shutdowns. Otherwise all stats will be lost.
> > - *
> > + * XXX: What bad things happen if this is invoked by more than one
> process?
> > + *   I'd presume stats are not actually lost in that case.  Can we just
> > 'no-op'
> > + *   subsequent calls and say "at least once at shutdown, as late as
> > possible"
>
> What's the reason behind this question? There really shouldn't be a second
> call (and there's only a single callsite). As is you'd get an assertion
> failure about things already having been shutdown.
>

Mostly OCD, I guess: "exactly one" has two failure modes - zero, and > 1;
and the "Otherwise" only covers the zero mode.


> I don't think we want to relax that, because in all the realistic scenarios
> that I can think of that'd open us up to losing stats that were generated
> after the first writeout of the stats data.


> You mentioned this as a restriction above - I'm not seeing it as such?  I'd
> like to write out stats more often in the future (e.g. in the
> checkpointer),
> but then it'd not be written out with this function...
>
>
Yeah, the idea only really works if you can implement "last one out, shut
off the lights".  I think I was subconsciously wanting this to work that
way, but the existing process is good.


>
> > @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
> > objoid)
> >
> >   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> >
> > + /*
> > + * More of a conceptual observation here - the fact that something is
> > fixed does not imply
> > + * that it is not fixed at a value greater than zero and thus could have
> > single subentries
> > + * that could be addressed.
>
> pgstat_reset_single_counter() is a pre-existing function (with a
> pre-existing
> name, but adapted signature in the patch), it's currently only used for
> functions and relation stats.
>
>
> > + * I also am unsure, off the top of my head, whether both replication
> > slots and subscriptions,
> > + * which are fixed, can be reset singly (today, and/or whether this
> patch
> > enables that capability)
> > + */
>
> FWIW, neither are implemented as fixed amount stats.


That was a typo, I meant to write variable.  My point was that of these 5
kinds that will pass the assertion test only 2 of them are actually handled
by the function today.

+ PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+ PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */


> There's one example of fixed amount stats that can be reset more
> granularly,
> namely slru. That can be done via pg_stat_reset_slru().
>
>
Right, hence the conceptual disconnect.  It doesn't affect the
implementation, everything is working just fine, but is something to ponder
for future maintainers getting up to speed here.

As the existing function only handles functions and relations why not just
perform a specific Kind check for them?  Generalizing to assert on whether
or not the function works on fixed or variable Kinds seems beyond its
present state.  Or could it be used, as-is, for databases, replication
slots, and subscriptions today, and we just haven't migrated those areas to
use the now generalized function?  Even then, unless we do expand the
definition of this publicly facing function it seems better to
precisely define what it requires as an input Kind by checking for RELATION
or FUNCTION specifically.

David J.


Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 13:45:40 -0700, David G. Johnston wrote:
> I didn't take the time to fixup all the various odd typos in the general
> code comments; none of them reduced comprehension appreciably.  I may do so
> when/if I do another pass.

Cool.


> My first encounter with pg_stat_exists_stat() didn't draw my attention as
> being problematic so I'd say we just stick with it.  As a SQL user reading:
> WHERE exists (...) is somewhat natural; using "have" or back-to-back
> stat_stat is less appealing.

There are a number of other *_exists functions, albeit not within
pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?


> I would suggest we do away with stats_fetch_consistency "snapshot" mode and
> instead add a function that can be called that would accomplish the same
> thing but in "cache" mode.  Future iterations of that function could accept
> patterns, allowing for something between "one" and "everything".

I don't want to do that. We had a lot of discussion around what consistency
model we want, and Tom was adamant that there needs to be a mode that behaves
like the current consistency model (which is what snapshot behaves like, with
very minor differences).  A way to get back to the old behaviour seems good,
and the function idea doesn't provide that.


(merged the typos that I hadn't already fixed based on Justin / Thomas'
feedback)


> @@ -371,7 +371,9 @@ pgstat_discard_stats(void)
>  /*
>   * pgstat_before_server_shutdown() needs to be called by exactly one
> process
>   * during regular server shutdowns. Otherwise all stats will be lost.
> - *
> + * XXX: What bad things happen if this is invoked by more than one process?
> + *   I'd presume stats are not actually lost in that case.  Can we just
> 'no-op'
> + *   subsequent calls and say "at least once at shutdown, as late as
> possible"

What's the reason behind this question? There really shouldn't be a second
call (and there's only a single callsite). As is you'd get an assertion
failure about things already having been shutdown.

I don't think we want to relax that, because in all the realistic scenarios
that I can think of that'd open us up to losing stats that were generated
after the first writeout of the stats data.

You mentioned this as a restriction above - I'm not seeing it as such?  I'd
like to write out stats more often in the future (e.g. in the checkpointer),
but then it'd not be written out with this function...


> @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
> objoid)
> 
>   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> 
> + /*
> + * More of a conceptual observation here - the fact that something is
> fixed does not imply
> + * that it is not fixed at a value greater than zero and thus could have
> single subentries
> + * that could be addressed.

pgstat_reset_single_counter() is a pre-existing function (with a pre-existing
name, but adapted signature in the patch), it's currently only used for
functions and relation stats.


> + * I also am unsure, off the top of my head, whether both replication
> slots and subscriptions,
> + * which are fixed, can be reset singly (today, and/or whether this patch
> enables that capability)
> + */

FWIW, neither are implemented as fixed amount stats. There's afaics no limit
at all for the number of existing subscriptions (although some would either
need to be disabled or you'd get errors). While there is a limit on the number
of slots, that's a configurable limit. So replication slot stats are also
implemented as variable amount stats (that used to be different, wasn't nice).

There's one example of fixed amount stats that can be reset more granularly,
namely slru. That can be done via pg_stat_reset_slru().

Thanks,

Andres




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 1:47 PM, Tom Lane  wrote:
> 
> Yeah, it's plausible to get a failure on either the write or read side
> depending on timing.
> 
> Perhaps libpq should be trying harder to make those cases look alike, but
> this test is about server behavior not libpq behavior, so I'm inclined
> to just make it lax.

+1.

I've gotten this test failure only a few times in perhaps the last six months, 
so if we narrow the opportunity for test failure without closing it entirely, 
we're just making the test failures that much harder to diagnose.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
>> On Apr 4, 2022, at 12:05 PM, Tom Lane  wrote:
>> So scratch that.  Maybe we'd better add "could not send data to server"
>> to the regex?

> If it fails in pqsecure_raw_write(), you get either "server closed the 
> connection unexpectedly" or "could not send data to server".  Do we need to 
> support pgtls_write() or pg_GSS_write(), which have different error messages?

Don't see why, since this test sets up a new cluster in which neither
is enabled.

> Is it possible that pgFlush will call pqSendSome which calls pqReadData 
> before trying to write anything, and get back a "could not receive data from 
> server" from pqsecure_raw_read()?

Yeah, it's plausible to get a failure on either the write or read side
depending on timing.

Perhaps libpq should be trying harder to make those cases look alike, but
this test is about server behavior not libpq behavior, so I'm inclined
to just make it lax.

regards, tom lane




Re: shared-memory based stats collector - v68

2022-04-04 Thread David G. Johnston
On Sun, Apr 3, 2022 at 9:16 PM Andres Freund  wrote:

>
> Please take a look!
>
>
I didn't take the time to fixup all the various odd typos in the general
code comments; none of them reduced comprehension appreciably.  I may do so
when/if I do another pass.

I did skim over the entire patch set and, FWIW, found it to be quite
understandable.  I don't have the experience to comment on the lower-level
details like locking and such but the medium picture stuff makes sense to
me both as a user and a developer.  I did leave a couple of comments about
parts that at least piqued my interest (reset single stats) or seemed like
an undesirable restriction that was under-addressed (before server shutdown
called exactly once).

I agree with Thomas's observation regarding PGSTAT_KIND_LAST.  I also think
that leaving it starting at 1 makes sense - maybe just fix the name and
comment to better reflect its actual usage in core.

I concur also with changing usages of " / " to ", or"

My first encounter with pg_stat_exists_stat() didn't draw my attention as
being problematic so I'd say we just stick with it.  As a SQL user reading:
WHERE exists (...) is somewhat natural; using "have" or back-to-back
stat_stat is less appealing.

I would suggest we do away with stats_fetch_consistency "snapshot" mode and
instead add a function that can be called that would accomplish the same
thing but in "cache" mode.  Future iterations of that function could accept
patterns, allowing for something between "one" and "everything".

I'm also not an immediate fan of "fetch_consistency"; with the function
suggestion it is basically "cache" and "no-cache" so maybe:
stats_use_transaction_cache ? (haven't thought hard or long on this one...)


 diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 22d0a1e491..e889c11d9e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2123,7 +2123,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss
11:34   0:00 postgres: ser
  
  
   PgStatsData
-  Waiting fo shared memory stats data access
+  Waiting for shared memory stats data access
  
  
   SerializableXactHash
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2689d0962c..bc7bdf8064 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4469,7 +4469,7 @@ PostgresMain(const char *dbname, const char *username)

  /*
  * (4) turn off the idle-in-transaction, idle-session and
- * idle-state-update timeouts if active.  We do this before step (5) so
+ * idle-stats-update timeouts if active.  We do this before step (5) so
  * that any last-moment timeout is certain to be detected in step (5).
  *
  * At most one of these timeouts will be active, so there's no need to
diff --git a/src/backend/utils/activity/pgstat.c
b/src/backend/utils/activity/pgstat.c
index dbd55a065d..370638b33b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -5,7 +5,7 @@
  * Provides the infrastructure to collect and access cumulative statistics,
  * e.g. per-table access statistics, of all backends in shared memory.
  *
- * Most statistics updates are first first accumulated locally in each
process
+ * Most statistics updates are first accumulated locally in each process
  * as pending entries, then later flushed to shared memory (just after
commit,
  * or by idle-timeout).
  *
@@ -371,7 +371,9 @@ pgstat_discard_stats(void)
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one
process
  * during regular server shutdowns. Otherwise all stats will be lost.
- *
+ * XXX: What bad things happen if this is invoked by more than one process?
+ *   I'd presume stats are not actually lost in that case.  Can we just
'no-op'
+ *   subsequent calls and say "at least once at shutdown, as late as
possible"
  * We currently only write out stats for proc_exit(0). We might want to
change
  * that at some point... But right now pgstat_discard_stats() would be
called
  * during the start after a disorderly shutdown, anyway.
@@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
objoid)

  Assert(!pgstat_kind_info_for(kind)->fixed_amount);

+ /*
+ * More of a conceptual observation here - the fact that something is
fixed does not imply
+ * that it is not fixed at a value greater than zero and thus could have
single subentries
+ * that could be addressed.
+ * I also am unsure, off the top of my head, whether both replication
slots and subscriptions,
+ * which are fixed, can be reset singly (today, and/or whether this patch
enables that capability)
+ */
+
  /* Set the reset timestamp for the whole database */
  pgstat_reset_database_timestamp(MyDatabaseId, ts);


David J.


Re: Mingw task for Cirrus CI

2022-04-04 Thread Andres Freund
Hi,

On 2022-03-30 17:26:18 -0700, Andres Freund wrote:
> Hi,
>
> On 2022-03-22 19:00:42 +0300, Melih Mutlu wrote:
> > Rebased it.
> > I also removed the temp installation task and
> > used NoDefaultCurrentDirectoryInExePath env variable instead.
>
> Hm. But you're still using a separate build directory, from what I can see?
> The NoDefaultCurrentDirectoryInExePath thing should only have an effect when
> not using a separate build directory, no?

Melih chatted with me about making this work. Turns out it doesn't work
readily - pg_ctl still fails.


The reason that NoDefaultCurrentDirectoryInExePath doesn't suffice to get a
working in-tree build, is that we break it ourselves:

int
find_my_exec(const char *argv0, char *retpath)
...
#ifdef WIN32
/* Win32 checks the current directory first for names without slashes */
join_path_components(retpath, cwd, argv0);
if (validate_exec(retpath) == 0)
return resolve_symlinks(retpath);
#endif

So even if windows doesn't actually use the current path, and the current
pg_ctl process isn't the one from the current directory, we *still* return
that.

Gah.


Maybe we should just use GetModuleFileName()?


But even after that the tests don't work. Commands started via IPC::Run do,
but when using system() they don't. Looks like perl parses the path itself :(.

- Andres




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 12:05 PM, Tom Lane  wrote:
> 
> I wrote:
>> The "terminating connection" warning absolutely should get through,
> 
> ... oh, no, that's not guaranteed at all, since it's sent from quickdie().
> So scratch that.  Maybe we'd better add "could not send data to server"
> to the regex?

If it fails in pqsecure_raw_write(), you get either "server closed the 
connection unexpectedly" or "could not send data to server".  Do we need to 
support pgtls_write() or pg_GSS_write(), which have different error messages?  
Can anybody run the tests with TLS or GSS enabled?  I assume the test framework 
prevents this, but I didn't check too closely.

Is it possible that pgFlush will call pqSendSome which calls pqReadData before 
trying to write anything, and get back a "could not receive data from server" 
from pqsecure_raw_read()?

It's a bit hard to prove to myself which paths might be followed through this 
code.  Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Pluggable toaster

2022-04-04 Thread Robert Haas
On Mon, Apr 4, 2022 at 4:05 PM Nikita Malakhov  wrote:
> - Is 'git apply' not a valid way to apply such patches?

I have found that it never works. This case is no exception:

[rhaas pgsql]$ git apply ~/Downloads/1_toaster_interface_v4.patch
/Users/rhaas/Downloads/1_toaster_interface_v4.patch:253: trailing whitespace.
toasterapi.o
/Users/rhaas/Downloads/1_toaster_interface_v4.patch:1276: trailing whitespace.
{
/Users/rhaas/Downloads/1_toaster_interface_v4.patch:1294: trailing whitespace.
 * CREATE TOASTER name HANDLER handler_name
/Users/rhaas/Downloads/1_toaster_interface_v4.patch:2261: trailing whitespace.
 * va_toasterdata could contain varatt_external structure for old Toast
/Users/rhaas/Downloads/1_toaster_interface_v4.patch:3047: trailing whitespace.
SELECT attnum, attname, atttypid, attstorage, tsrname
error: patch failed: src/backend/commands/tablecmds.c:42
error: src/backend/commands/tablecmds.c: patch does not apply
error: patch failed: src/backend/commands/tablecmds.c:943
error: src/backend/commands/tablecmds.c: patch does not apply
error: patch failed: src/backend/commands/tablecmds.c:973
error: src/backend/commands/tablecmds.c: patch does not apply
error: patch failed: src/backend/commands/tablecmds.c:44
error: src/backend/commands/tablecmds.c: patch does not apply

I would really encourage you to use 'git format-patch' to generate a
stack of patches. But there is no point in reposting 30+ patches that
haven't been properly refactored into separate chunks. You need to
maintain a branch, periodically rebased over master, with some
probably-small number of patches on it, each of which is a logically
independent patch with its own commit message, its own clear purpose,
etc. And then generate patches to post from there using 'git
format-patch'. Look into using 'git rebase -i --autosquash' and 'git
commit --fixup' to maintain the branch, if you're not already familiar
with those things.

Also, it is a really good idea when you post the patch set to include
in the email a clear description of the overall purpose of the patch
set and what each patch does toward that goal. e.g. "The overall goal
of this patch set is to support faster-than-light travel. Currently,
PostgreSQL does not know anything about the speed of light, so 0001
adds some code for speed-of-light detection. Building on this, 0002
adds general support for disabling physical laws of the universe.
Then, 0003 makes use of this support to disable specifically the speed
of light." Perhaps you want a little more text than that for each
patch, depending on the situation, but this gives you the idea, I
hope.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pluggable toaster

2022-04-04 Thread Nikita Malakhov
Hi,
Thank you very much for your review; I would have liked to get it much
earlier. I'm currently working on cleaning up the code, and will try to
comment as much as possible.
This patch set is really a large set of functionality; it was divided into
4 logically complete parts, but these parts still contain a lot of changes
themselves.
- Yes, you're right, new syntax is added. I'm also working on the
documentation part for it;
- Patches actually consist of a lot of minor commits. As I see it, we have
to squash them to provide each part as 2-3 main commits without any
unnecessary garbage;
- Is 'git apply' not a valid way to apply such patches?

We'll try to straighten these issues out asap.

Thank you!

On Mon, Apr 4, 2022 at 7:40 PM Robert Haas  wrote:

> On Mon, Apr 4, 2022 at 12:13 PM Nikita Malakhov  wrote:
> > I'm really sorry. Messed up some code while merging rebased branches
> with previous (v1)
> > patches issued in December and haven't noticed that seg fault because of
> corrupted code
> > while running check-world.
> > I've fixed messed code in Dummy Toaster package and checked twice -
> all's correct now,
> > patches are applied correctly and tests are clean.
> > Thank you for pointing out the error and for your patience!
>
> Hi,
>
> This patch set doesn't seem anywhere close to committable to me. For
> example:
>
> - It apparently adds a new command called CREATE TOASTER but that
> command doesn't seem to be documented anywhere.
>
> - contrib/dummy_toaster is added in patch 1 with no implementation
> and code comments that say "Bloom index utilities" and then those
> comments are fixed and an implementation is added in later patches.
>
> - What is supposedly patch 1 is actually 32 patch files concatenated
> together. It doesn't apply properly either with 'patch' or 'git am' so
> I don't understand what we would even do with this.
>
> - None of these patches have appropriate commit messages.
>
> - Many of these patches add 'XXX' comments or errors or other
> obviously unfinished bits of code. Some of this may be cleaned up by
> later patches but it's hard to tell because right now one can't even
> apply the patch set properly. Even if that were possible, it's the job
> of the person submitting the patch to organize the patch into
> independent, separately committable chunks that are self-contained and
> have good comments and good commit messages for each one.
>
> I don't think based on the status of this patch set that it's even
> possible to provide useful feedback on the design at this point, never
> mind getting anything committed.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


pushdown of joinquals beyond group by/distinct on

2022-04-04 Thread Arne Roland
Hi,


Can someone point out to me why we don't consider pushing down the joinquals
for these queries beyond the DISTINCT ON?

When the qual matches the DISTINCT clause, it should be possible to generate
both parametrized and non-parametrized subplans for the same query. The same
should hold true for aggregates if the GROUP BY clause matches. Is there any
specific reason we aren't doing that already?


Regards

Arne



distinct_on_lateral.sql
Description: distinct_on_lateral.sql


Re: Higher level questions around shared memory stats

2022-04-04 Thread Andres Freund
Hi,

On 2022-03-30 14:08:41 -0700, Andres Freund wrote:
> On 2022-03-30 14:42:23 -0400, Robert Haas wrote:
> > On Tue, Mar 29, 2022 at 5:01 PM Andres Freund  wrote:
> > > I can think of these different times:
> > >
> > > - Last time stats were removed due to starting up in crash recovery
> > > - Last time stats were created from scratch, because no stats data file 
> > > was
> > >   present at startup
> > > - Last time stats were thrown away due to corruption
> > > - Last time a subset of stats were reset using one of the pg_reset* 
> > > functions
> > >
> > > Makes sense?
> > 
> > Yes. Possibly that last could be broken in to two: when all stats were
> > last reset, when some stats were last reset.
> 
> Believe it or not, we don't currently have a function to reset all stats. We
> should definitely add that though, because the invocation to reset all stats
> gets more ridiculous^Wcomplicated with each release.

I assume we'd want all of these to be persistent (until the next time stats
are lost, of course)?

We currently have the following SQL functions showing reset times:
  pg_stat_get_bgwriter_stat_reset_time
  pg_stat_get_db_stat_reset_time
and kind of related:
  pg_stat_get_snapshot_timestamp

There's also a few functions showing the last time something has happened,
like pg_stat_get_last_analyze_time().

Trying to fit those patterns we could add functions like:

pg_stat_get_stats_last_cleared_recovery_time
pg_stat_get_stats_last_cleared_corrupted_time
pg_stat_get_stats_last_not_present_time
pg_stat_get_stats_last_partially_reset_time

Not great, but I don't immediately have a better idea.

Maybe it'd look better as a view / SRF? pg_stat_stats / pg_stat_get_stats()?
Potential column names:
- cleared_recovery_time
- cleared_corrupted_time
- not_present_time
- partially_reset_time

It's not a lot of work to code these up. However, it's late in the cycle, and
they're not suddenly needed due to shared memory stats, so I have a few doubts
about adding them now.

Greetings,

Andres Freund




Re: Pluggable toaster

2022-04-04 Thread Greg Stark
Thanks for the review, Robert. I think that gives some pretty
actionable advice on how to improve the patch, and it doesn't seem
likely to get much further in this cycle.

I'll mark the patch Returned with Feedback. Hope to see it come back
with improvements in the next release.




Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-03 21:15:16 -0700, Andres Freund wrote:
> - collect who reviewed earlier revisions

I found reviews by
- Tomas Vondra 
- Arthur Zakirov 
- Antonin Houska 

There's also reviews by Fujii and Alvaro, but afaics just for parts that were
separately committed.

Greetings,

Andres Freund




Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
I wrote:
> The "terminating connection" warning absolutely should get through,

... oh, no, that's not guaranteed at all, since it's sent from quickdie().
So scratch that.  Maybe we'd better add "could not send data to server"
to the regex?

regards, tom lane




Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
Mark Dilger  writes:

> # Running: pg_ctl kill QUIT 2083
> ok 4 - killed process with SIGQUIT
> # pump_until: process terminated unexpectedly when searching for 
> "(?^m:WARNING:  terminating connection because of crash of another server 
> process|server closed the connection unexpectedly|connection to server was 
> lost)" with stream: "psql::9: WARNING:  terminating connection because 
> of unexpected SIGQUIT signal
> # psql::9: could not send data to server: Socket is not connected
> # "
> not ok 5 - psql query died successfully after SIGQUIT

And there we have it: the test wasn't updated for the new backend message
spelling, and we're seeing a different frontend behavior.  Evidently the
backend is dying before we're able to send the "SELECT 1;" to it.

I'm not quite sure whether it's a libpq bug that it doesn't produce the
"connection to server was lost" message here, but in any case I suspect
that we shouldn't be checking for the second and third regex alternatives.
The "terminating connection" warning absolutely should get through, and
if it doesn't we want to know about it.  So my proposal for a fix is
to change the regex to be just "WARNING:  terminating connection because
of unexpected SIGQUIT signal".
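For what it's worth, the tightened pattern does match the stderr stream shown
in the failure report above. A quick standalone check (Python used purely for
illustration; the stream text is copied from the report):

```python
import re

# stderr stream captured in the failing run reported upthread
stream = (
    "psql::9: WARNING:  terminating connection because of "
    "unexpected SIGQUIT signal\n"
    "psql::9: could not send data to server: Socket is not connected\n"
)

# The proposed narrower pattern: match only the backend's own warning,
# and nothing about how libpq happened to report the connection loss.
pattern = re.compile(
    r"WARNING:  terminating connection because of unexpected SIGQUIT signal",
    re.M,
)

match_found = pattern.search(stream) is not None
```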

regards, tom lane




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 10:44 AM, Mark Dilger  wrote:
> 
> I'm writing a parallel test just for this.  Will get back to you.

Ok, that experiment didn't accomplish anything, beyond refreshing my memory 
regarding Cluster.pm preferring sockets over ports.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Robert Haas
On Mon, Apr 4, 2022 at 2:16 PM Andres Freund  wrote:
> On 2022-04-04 10:02:37 -0400, Robert Haas wrote:
> > It does a good job, I think, checking all the things that a human being
> > could potentially spot just by looking at an individual page.
>
> I think there's a few more things that'd be good to check. For example amcheck
> doesn't verify that HOT chains are reasonable, which can often be spotted
> looking at an individual page. Which is a bit unfortunate, given how many bugs
> we had in that area.
>
> Stuff to check around that:
> - target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set
> - In a valid ctid chain within a page (i.e. xmax = xmin):
>   - tuples have HEAP_UPDATED set
>   - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chain elements
>
> I think it'd also be good to check for things like visible tuples following
> invisible ones.

Interesting.

*takes notes*

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: JSON constructors and window functions

2022-04-04 Thread Andrew Dunstan


On 4/4/22 11:43, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 4/3/22 22:46, Andrew Dunstan wrote:
>>> On 4/3/22 20:11, Andres Freund wrote:
 I don't think you're allowed to free stuff in a finalfunc - we might reuse 
 the
 transition state for further calls to the aggregate.
>>> Doh! Of course! I'll fix it in the morning. Thanks.
>> I've committed a fix for this. I didn't find something to clean out the
>> hash table, so I just removed the 'hash_destroy' and left it at that.
>> All the test I did came back with expected results.
>> Maybe a hash_reset() is something worth having assuming it's possible? I
>> note that simplehash has a reset function.
> But removing the hash entries would be just as much of a problem
> wouldn't it?
>
>   


Yes, quite possibly. It looks from some experimentation as though,
unlike my naive preconception, it doesn't process each frame again from
the beginning, so losing the hash entries could indeed be an issue here.
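The underlying hazard can be modeled abstractly: in a window-function context
the executor may invoke the final function repeatedly against the same
transition state, so a final function must not destroy that state. Below is a
standalone sketch of that calling pattern (plain Python, not the actual
PostgreSQL aggregate API):

```python
# Transition state: accumulate key/value pairs (a stand-in for the JSON
# aggregate's hash table and output buffer).
def transfn(state, k, v):
    state.append((k, v))
    return state

# A well-behaved final function only *reads* the transition state.
def finalfn(state):
    return {k: v for k, v in state}

# Window-function-style evaluation: one state, finalized once per row.
state = []
results = []
for k, v in [(1, 1), (2, 2)]:
    state = transfn(state, k, v)
    results.append(finalfn(state))   # the same state is finalized repeatedly

# Had finalfn cleared or freed the state (as the hash_destroy call did),
# the second finalization would have operated on destroyed state.
```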


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: JSON constructors and window functions

2022-04-04 Thread Andrew Dunstan


On 4/4/22 12:33, Andres Freund wrote:
> Hi,
>
> On 2022-04-04 11:54:23 -0400, Greg Stark wrote:
>> Are we missing regression tests using these functions as window functions?
> So far, yes.
>
> ISTM that 4eb97988796 should have at least included the crashing statement as
> a test... The statement can be simpler too:
>
> SELECT json_objectagg(k : v with unique keys) OVER (ORDER BY k) FROM (VALUES 
> (1,1), (2,2)) a(k,v);
>
> is sufficient to trigger the crash for me, without even using asan (after
> reverting the bugfix, of course).
>


I will add some regression tests.


>> Hm. I suppose it's possible to write a general purpose regression test
>> that loops over all aggregate functions and runs them as window
>> functions and aggregates over the same data sets and compares results.
>> At least for the case of aggregate functions with a single parameter
>> belonging to a chosen set of data types.
> I was wondering about that too. Hardest part would be to come up with values
> to pass to the aggregates.
>
> I don't think it'd help in this case though, since it depends on special case
> grammar stuff to even be reached. json_objectagg(k : v with unique
> keys). "Normal" use of aggregates can't even reach the problematic path
> afaics.
>

It can, as Jaime's original post showed.

But on further consideration, I'm thinking this area needs some rework. ISTM
it might be a whole lot simpler and more comprehensible to generate the JSON
first, without bothering about null values or duplicate keys, and then check
in the finalizer for null values to be skipped and for duplicate keys. That
way we would need to keep far less state in the aggregation functions,
although it might be marginally less efficient. Thoughts?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 10:02:37 -0400, Robert Haas wrote:
> It does a good job, I think, checking all the things that a human being
> could potentially spot just by looking at an individual page.

I think there's a few more things that'd be good to check. For example amcheck
doesn't verify that HOT chains are reasonable, which can often be spotted
looking at an individual page. Which is a bit unfortunate, given how many bugs
we had in that area.

Stuff to check around that:
- target of redirect has HEAP_ONLY_TUPLE, HEAP_UPDATED set
- In a valid ctid chain within a page (i.e. xmax = xmin):
  - tuples have HEAP_UPDATED set
  - HEAP_ONLY_TUPLE / HEAP_HOT_UPDATED matches across chain elements

I think it'd also be good to check for things like visible tuples following
invisible ones.
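The invariants listed above can be sketched in a small standalone model
(Python purely for illustration; the flag names mirror the heap infomask
bits, but the layout, values, and function names here are invented):

```python
# Mock infomask bits; the values are illustrative, not PostgreSQL's.
HEAP_UPDATED = 0x1
HEAP_ONLY_TUPLE = 0x2
HEAP_HOT_UPDATED = 0x4

# A tuple is modeled as (xmin, xmax, flags, next_index); next_index is the
# ctid successor on the same page, or None at the end of the chain.

def redirect_target_ok(tup):
    """A redirect must point at a heap-only tuple created by an UPDATE."""
    _, _, flags, _ = tup
    return bool(flags & HEAP_ONLY_TUPLE) and bool(flags & HEAP_UPDATED)

def hot_chain_ok(page, start):
    """Check the intra-page ctid chain invariants described above."""
    i = start
    while page[i][3] is not None:
        cur, nxt = page[i], page[page[i][3]]
        if cur[1] == nxt[0]:              # valid link: xmax == successor xmin
            if not nxt[2] & HEAP_UPDATED:
                return False
            if cur[2] & HEAP_HOT_UPDATED and not nxt[2] & HEAP_ONLY_TUPLE:
                return False
        i = page[i][3]
    return True

# A well-formed HOT update chain...
good = [
    (100, 101, HEAP_HOT_UPDATED, 1),
    (101, 0, HEAP_UPDATED | HEAP_ONLY_TUPLE, None),
]
# ...and one whose successor lacks HEAP_ONLY_TUPLE despite a
# HEAP_HOT_UPDATED predecessor.
bad = [
    (100, 101, HEAP_HOT_UPDATED, 1),
    (101, 0, HEAP_UPDATED, None),
]
```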

Greetings,

Andres Freund




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger


> On Apr 4, 2022, at 11:07 AM, Tom Lane  wrote:
> 
> I was hoping to see regress_log_013_crash_restart, though.



regress_log_013_crash_restart
Description: Binary data


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
> The test logs are attached.

Server log looks as-expected:

2022-04-04 09:11:51.087 PDT [2084] 013_crash_restart.pl LOG:  statement: SELECT 
pg_sleep(3600);
2022-04-04 09:11:51.094 PDT [2083] 013_crash_restart.pl WARNING:  terminating 
connection because of unexpected SIGQUIT signal
2022-04-04 09:11:51.095 PDT [2070] LOG:  server process (PID 2083) exited with 
exit code 2
2022-04-04 09:11:51.095 PDT [2070] DETAIL:  Failed process was running: INSERT 
INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status;
2022-04-04 09:11:51.095 PDT [2070] LOG:  terminating any other active server 
processes

I was hoping to see regress_log_013_crash_restart, though.

regards, tom lane




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger


> On Apr 4, 2022, at 10:51 AM, Tom Lane  wrote:
> 
> Oh, I'm barking up the wrong tree.  This test must have been run
> against HEAD between 6da65a3f9 (23 Feb) and 2beb4acff (31 Mar), when
> pump_until indeed didn't print this essential information :-(
> 
> If you just got this failure, could you look in the log to
> see if there's a pump_until report?

I was running `make -j12 check-world` against my local patched version of 
master:

commit 80399fa5f208c4acd4ec194c47e534ba8dd3ae7c (HEAD -> 0001)
Author: Mark Dilger 
Date:   Mon Mar 28 13:35:11 2022 -0700

Allow grant and revoke of privileges on parameters

Add new SET and ALTER SYSTEM privileges for configuration parameters
(GUCs), and a new catalog, pg_parameter_acl, for tracking grants of
these privileges.

The privilege to SET a parameter marked USERSET is inherent in that
parameter's marking and cannot be revoked.  This saves cycles when
performing SET operations, as looking up privileges in the catalog
can be skipped.  If we find that administrators need to revoke SET
privilege on a particular variable from public, that variable can be
redefined in future releases as SUSET with a default grant of SET to
PUBLIC issued.

commit 4eb9798879680dcc0e3ebb301cf6f925dfa69422 (origin/master, origin/HEAD, 
master)
Author: Andrew Dunstan 
Date:   Mon Apr 4 10:12:30 2022 -0400

Avoid freeing objects during json aggregate finalization

Commit f4fb45d15c tried to free memory during aggregate finalization.
This cause issues, particularly when used as a window function, so stop
doing that.

Per complaint by Jaime Casanova and diagnosis by Andres Freund

Discussion: https://postgr.es/m/YkfeMNYRCGhySKyg@ahch-to


The test logs are attached.



013_crash_restart_primary.log
Description: Binary data

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: shared-memory based stats collector - v68

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-05 01:16:04 +1200, Thomas Munro wrote:
> On Mon, Apr 4, 2022 at 4:16 PM Andres Freund  wrote:
> > Please take a look!
> 
> A few superficial comments:
> 
> > [PATCH v68 01/31] pgstat: consistent function header formatting.
> > [PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.
> 
> +1

Planning to commit these after making another coffee and proofreading them
some more.


> > [PATCH v68 03/31] dshash: revise sequential scan support.
> 
> Logic looks good.  That is,
> lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense.  Just
> some comment trivia:
> 
> + * dshash_seq_term needs to be called when a scan finished.  The caller may
> + * delete returned elements midst of a scan by using dshash_delete_current()
> + * if exclusive = true.
> 
> s/scan finished/scan is finished/
> s/midst of/during/ (or /in the middle of/, ...)
> 
> > [PATCH v68 04/31] dsm: allow use in single user mode.
> 
> LGTM.


> +   Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

> (Not this patch's fault, but I wish we had a more explicit way to say "am
> single user".)

Agreed.


> > [PATCH v68 05/31] pgstat: stats collector references in comments
> 
> LGTM.  I could think of some alternative suggested names for this subsystem,
> but don't think it would be helpful at this juncture so I will refrain :-)

Heh. I did start a thread about it a while ago :)


> > [PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.
> 
> +#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
> +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
> +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
> 
> It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of 
> kinds,
> because there is no kind 0.  For the two users of it... maybe just use
> pgstat_kind_infos[] = {...}, and
> global_valid[PGSTAT_KIND_LAST + 1]?

Maybe the whole justification for not defining an invalid kind is moot
now. There's not a single switch covering all kinds of stats left, and I hope
that we don't introduce one again...
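Thomas's observation can be made concrete with a toy model: with kinds
numbered from 1 and slot 0 implicitly invalid, an array indexed directly by
kind needs PGSTAT_KIND_LAST + 1 entries, so a macro named PGSTAT_NUM_KINDS is
one more than the actual number of kinds. A standalone sketch with
illustrative values (Python constants standing in for the C macros, not the
real pgstat code):

```python
# Kinds numbered from 1; 0 is implicitly "invalid", as in the patch.
PGSTAT_KIND_DATABASE = 1
PGSTAT_KIND_WAL = 5                       # illustrative last kind

PGSTAT_KIND_FIRST = PGSTAT_KIND_DATABASE
PGSTAT_KIND_LAST = PGSTAT_KIND_WAL
PGSTAT_NUM_KINDS = PGSTAT_KIND_LAST + 1   # an array size, not a count!

actual_kind_count = PGSTAT_KIND_LAST - PGSTAT_KIND_FIRST + 1

# An array indexed directly by kind must have PGSTAT_KIND_LAST + 1 slots,
# wasting slot 0 -- which is why "NUM_KINDS" is a confusing name for it.
global_valid = [False] * (PGSTAT_KIND_LAST + 1)
```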


> > [PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / 
> > drop.
> 
> +   /*
> +* Dropping the statistics for objects that dropped transactionally itself
> +* needs to be transactional. ...
> 
> Hard to parse.  How about:  "Objects are dropped transactionally, so
> related statistics need to be dropped transactionally too."

Not all objects are dropped transactionally, but I agree it reads awkwardly.
Incorporating feedback from Justin as well, I have now rephrased it to:

/*
 * Statistics for transactionally dropped objects need to be
 * transactionally dropped as well. Collect the stats dropped in the
 * current (sub-)transaction and only execute the stats drop when we 
know
 * if the transaction commits/aborts. To handle replicas and crashes,
 * stats drops are included in commit / abort records.
 */

A few too many "drop"s in there, but maybe that's unavoidable.



> +/*
> + * Whenever the for a dropped stats entry could not be freed (because
> + * backends still have references), this is incremented, causing backends
> + * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
> + */
> +pg_atomic_uint64 gc_count;
> 
> Whenever the ...?

 * Whenever statistics for dropped objects could not be freed - because
 * backends still have references - the dropping backend calls
 * pgstat_request_entry_refs_gc() incrementing this counter. Eventually
 * that causes backends to run pgstat_gc_entry_refs(), allowing memory 
to
 * be reclaimed.


> Would it be better to call this variable gc_request_count?

Agreed.


> + * Initialize refcount to 1, marking it as valid / not tdroped. The entry
> 
> s/tdroped/dropped/
> 
> + * further if a longer lived references is needed.
> 
> s/references/reference/

Fixed.


> +/*
> + * There are legitimate cases where the old stats entry might not
> + * yet have been dropped by the time it's reused. The easiest 
> case
> + * are replication slot stats. But oid wraparound can lead to
> + * other cases as well. We just reset the stats to their plain
> + * state.
> + */
> +shheader = pgstat_reinit_entry(kind, shhashent);
> 
> This whole comment is repeated in pgstat_reinit_entry and its caller.

I guess I felt as indecisive about where to place it between the two locations
when I wrote it as I do now. Left it at the callsite for now.


> +/*
> + * XXX: Might be worth adding some frobbing of the allocation before
> + * freeing, to make it easier to detect use-after-free style bugs.
> + */
> +dsa_free(pgStatLocal.dsa, pdsa);
> 
> FWIW dsa_free() clobbers memory in assert builds, just like pfree().

Oh. I could swear I saw that not being the case a while ago. But clearly it is
the case. Removed.


> +static Size

Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
I wrote:
> Is our CI setup failing to capture stderr from TAP tests??

Oh, I'm barking up the wrong tree.  This test must have been run
against HEAD between 6da65a3f9 (23 Feb) and 2beb4acff (31 Mar), when
pump_until indeed didn't print this essential information :-(

If you just got this failure, could you look in the log to
see if there's a pump_until report?

regards, tom lane




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 10:41 AM, Tom Lane  wrote:
> 
> Is our CI setup failing to capture stderr from TAP tests??

I'm looking into the way our TAP test infrastructure assigns port numbers to 
nodes, and whether that is reasonable during parallel test runs with nodes 
stopping and starting again.  On casual inspection, that doesn't seem ok, 
because the Cluster.pm logic to make sure nodes get unique ports doesn't seem 
to be thinking about other parallel tests running.  It will notice if another 
node is already bound to the port, but if another node has been killed and not 
yet restarted, won't things go off the rails?

I'm writing a parallel test just for this.  Will get back to you.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
I wrote:
> Usually the test would succeed anyway because of matching the
> second or third regex alternative, but I wonder if there is
> some other spelling of libpq's complaint that shows up
> occasionally.  It'd be nice if we could see the contents of
> $killme_stderr upon failure.

OK, now I'm confused, because pump_until is very clearly
*trying* to report exactly that:

if (not $proc->pumpable())
{
diag("pump_until: process terminated unexpectedly when searching 
for \"$until\" with stream: \"$$stream\"");
return 0;
}

and if I intentionally break the regex then I do see this
output when running the test by hand:

# Running: pg_ctl kill QUIT 1922645
ok 4 - killed process with SIGQUIT
# pump_until: process terminated unexpectedly when searching for "(?^m:WARNING: 
 terminating connection because of crash of another server process|server 
closed the connection foounexpectedly|connection to server was lost)" with 
stream: "psql::9: WARNING:  terminating connection because of unexpected 
SIGQUIT signal
# psql::9: server closed the connection unexpectedly
#   This probably means the server terminated abnormally
#   before or while processing the request.
# "
not ok 5 - psql query died successfully after SIGQUIT

Is our CI setup failing to capture stderr from TAP tests??

regards, tom lane




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-04 Thread David Steele

On 4/4/22 11:42 AM, Nathan Bossart wrote:

I noticed a couple of other things that can be removed.  Since we no longer
wait on exclusive backup mode during smart shutdown, we can change
connsAllowed (in postmaster.c) to a boolean and remove CAC_SUPERUSER.  We
can also remove a couple of related notes in the documentation.  I've done
all this in the attached patch.


These changes look good to me. IMV it is a real bonus how much the state 
machine has been simplified.


I've also run this patch through the pgbackrest regression tests without 
any problems.


Regards,
--
-David
da...@pgmasters.net




Re: New Object Access Type hooks

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
> I just got a crash in this test again.  Are you still interested?  I still 
> have the logs.  No core file appears to have been generated.
> The test failure is
> not ok 5 - psql query died successfully after SIGQUIT

Hmm ... I can see one problem with that test:

ok( pump_until(
$killme,
$psql_timeout,
\$killme_stderr,
qr/WARNING:  terminating connection because of crash of another 
server process|server closed the connection unexpectedly|connection to server 
was lost/m
),
"psql query died successfully after SIGQUIT");

It's been a little while since that message looked like that.
Nowadays you get

WARNING:  terminating connection because of unexpected SIGQUIT signal
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Usually the test would succeed anyway because of matching the
second or third regex alternative, but I wonder if there is
some other spelling of libpq's complaint that shows up
occasionally.  It'd be nice if we could see the contents of
$killme_stderr upon failure.

regards, tom lane




Re: A test for replay of regression tests

2022-04-04 Thread Tom Lane
I wrote:
> I think the fundamental instability here is that this TAP test is
> setting shared_buffers small enough to allow the syncscan logic
> to kick in where it does not in normal testing.  Maybe we should
> just disable syncscan in this test script?

Did that, we'll see how much it helps.

regards, tom lane




Re: Pluggable toaster

2022-04-04 Thread Robert Haas
On Mon, Apr 4, 2022 at 12:13 PM Nikita Malakhov  wrote:
> I'm really sorry. I messed up some code while merging rebased branches with
> the previous (v1) patches issued in December and hadn't noticed that seg
> fault, caused by the corrupted code, while running check-world.
> I've fixed the messed-up code in the Dummy Toaster package and checked
> twice - everything is correct now, the patches apply correctly and the
> tests are clean.
> Thank you for pointing out the error and for your patience!

Hi,

This patch set doesn't seem anywhere close to committable to me. For example:

- It apparently adds a new command called CREATE TOASTER but that
command doesn't seem to be documented anywhere.

- contrib/dummy_toaster is added in patch 1 with no implementation
and code comments that say "Bloom index utilities" and then those
comments are fixed and an implementation is added in later patches.

- What is supposedly patch 1 is actually 32 patch files concatenated
together. It doesn't apply properly either with 'patch' or 'git am' so
I don't understand what we would even do with this.

- None of these patches have appropriate commit messages.

- Many of these patches add 'XXX' comments or errors or other
obviously unfinished bits of code. Some of this may be cleaned up by
later patches but it's hard to tell because right now one can't even
apply the patch set properly. Even if that were possible, it's the job
of the person submitting the patch to organize the patch into
independent, separately committable chunks that are self-contained and
have good comments and good commit messages for each one.

I don't think based on the status of this patch set that it's even
possible to provide useful feedback on the design at this point, never
mind getting anything committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: JSON constructors and window functions

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 11:54:23 -0400, Greg Stark wrote:
> Are we missing regression tests using these functions as window functions?

So far, yes.

ISTM that 4eb97988796 should have at least included the crashing statement as
a test... The statement can be simpler too:

SELECT json_objectagg(k : v with unique keys) OVER (ORDER BY k) FROM (VALUES 
(1,1), (2,2)) a(k,v);

is sufficient to trigger the crash for me, without even using asan (after
reverting the bugfix, of course).


> Hm. I suppose it's possible to write a general purpose regression test
> that loops over all aggregate functions and runs them as window
> functions and aggregates over the same data sets and compares results.
> At least for the case of aggregate functions with a single parameter
> belonging to a chosen set of data types.

I was wondering about that too. Hardest part would be to come up with values
to pass to the aggregates.

I don't think it'd help in this case though, since it depends on special case
grammar stuff to even be reached. json_objectagg(k : v with unique
keys). "Normal" use of aggregates can't even reach the problematic path
afaics.

Greetings,

Andres Freund




Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-04 Thread Julien Rouhaud
On Tue, Apr 05, 2022 at 12:51:12AM +0900, Masahiko Sawada wrote:
> On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud  wrote:
> >
> > Hmm, but AFAICS the json format would be stable as the counters are always
> > shown even if zero.  So just doing the json format first and then the text
> > format should also work.  Although if you're really unlucky there could be a
> > cache invalidation in between so we could just ignore the text format.  But 
> > I
> > think we should at least keep a regression test with the json format, with a
> > comment explain why only this one is tested.
> 
> Fair point. By commit 7e12256b478 we disabled track_io_timing, but
> probably we can temporarily enable it and run one query with "buffers"
> and "format json" options.

Yes, enabling it for just this query.  It can't really find any problems with
the values themselves, but at least the new code path would be partially
executed.

> >
> > - not really your patch fault I guess, but I see that extendBufFile() isn't
> >   handled.  There shouldn't be much activity there so maybe it's ok.
> >   This is likely because smgr_extend is also not handled, but this one seems
> >   much more likely to take quite some time, and therefore should bump the
> >   timing counters.
> 
> You mean we should include the time for opening files as write time?

Yes.  In normal circumstances it shouldn't need a lot of time to do that, but
I'm not so sure with e.g. network filesystems.  I'm not strongly in favor of
counting it, especially since smgrextend doesn't either.

> IIUC smgrextend() writes data while extending the file, whereas
> extendBufFile() doesn't do that but just opens a new file.

Note that smgrextend can also call register_dirty_segment(), which can also
take some time, so we can't just assume that all the time there is IO time.




Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 9:27 AM, Peter Geoghegan  wrote:
> 
> I'd really like it if amcheck had HOT chain verification. That's the
> other area where catching bugs passively with assertions and whatnot
> is clearly not good enough.

I agree, and was hoping to get around to this in the postgres 15 development 
cycle.  Alas, that did not happen.  Worse, I have several other projects that 
will keep me busy for the next few months, at least.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Peter Geoghegan
On Mon, Apr 4, 2022 at 7:02 AM Robert Haas  wrote:
> Yeah, I was very excited about verify_heapam(). There is a lot more
> stuff that we could check, but a lot of those things would be much
> more expensive to check.

If anything, I understated the value of verify_heapam() for this kind
of work before. Better to show just how valuable it is with an
example.

Let's introduce a fairly blatant bug to lazyvacuum.c. This change
makes VACUUM fail to account for the fact that skipping a skippable
range with an all-visible page makes it unsafe to advance
relfrozenxid:

--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1371,8 +1371,6 @@ lazy_scan_skip(LVRelState *vacrel, Buffer
*vmbuffer, BlockNumber next_block,
 else
 {
 *skipping_current_range = true;
-if (skipsallvis)
-vacrel->skippedallvis = true;
 }

 return next_unskippable_block;

If I run "make check-world", the tests all pass! But when I run pg_amcheck
against an affected "regression" database, it will complain about
relfrozenxid-related corruption in several different tables.

> It does a good job, I think, checking all the
> things that a human being could potentially spot just by looking at an
> individual page. I love the idea of using it in regression testing in
> more places. It might find bugs in amcheck, which would be good, but I
> think it's even more likely to help us find bugs in other code.

I'd really like it if amcheck had HOT chain verification. That's the
other area where catching bugs passively with assertions and whatnot
is clearly not good enough.


--
Peter Geoghegan




Re: JSON constructors and window functions

2022-04-04 Thread Andres Freund
Hi,

On 2022-04-04 11:43:31 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 4/3/22 22:46, Andrew Dunstan wrote:
> >> On 4/3/22 20:11, Andres Freund wrote:
> >>> I don't think you're allowed to free stuff in a finalfunc - we might 
> >>> reuse the
> >>> transition state for further calls to the aggregate.
> 
> >> Doh! Of course! I'll fix it in the morning. Thanks.
> 
> > I've committed a fix for this. I didn't find something to clean out the
> > hash table, so I just removed the 'hash_destroy' and left it at that.
> > All the test I did came back with expected results.
> > Maybe a hash_reset() is something worth having assuming it's possible? I
> > note that simplehash has a reset function.
> 
> But removing the hash entries would be just as much of a problem
> wouldn't it?

I think so. I guess we could mark it as FINALFUNC_MODIFY = READ_WRITE. But I
don't see a reason why it'd be needed here.

Is it a problem that skipped_keys is reset in the finalfunc? I don't know how
these functions work; so far I don't understand why
JsonUniqueBuilderState->skipped_keys is long-lived...
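The hazard of freeing state in a finalfunc (quoted above) can be sketched in miniature. This is plain Python, not the executor's actual code: when an aggregate runs as a window function, the final function is invoked repeatedly on the same transition state as the frame grows, so destroying state there (as the hash_destroy did) silently loses input.

```python
# Toy model of an aggregate's transition/final functions. The buggy
# final function clears the state (analogous to hash_destroy()), which
# is fine for a one-shot aggregate but loses input when the executor
# re-finalizes the same state for every row of a window frame.
def transfunc(state, value):
    state.setdefault("items", []).append(value)
    return state

def finalfunc_ok(state):
    return list(state.get("items", []))

def finalfunc_buggy(state):
    result = list(state.get("items", []))
    state.clear()  # must not destroy state the executor will reuse
    return result

def window_agg(rows, finalfunc):
    state, out = {}, []
    for value in rows:          # running frame: finalize after each row
        transfunc(state, value)
        out.append(finalfunc(state))
    return out

assert window_agg(["a", "b"], finalfunc_ok) == [["a"], ["a", "b"]]
assert window_agg(["a", "b"], finalfunc_buggy) == [["a"], ["b"]]  # input lost
```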

Greetings,

Andres Freund




Re: New Object Access Type hooks

2022-04-04 Thread Mark Dilger



> On Mar 21, 2022, at 10:03 PM, Thomas Munro  wrote:
> 
> On Fri, Mar 18, 2022 at 4:22 PM Mark Dilger
>  wrote:
>> (FYI, I got a test failure from src/test/recovery/t/013_crash_restart.pl 
>> when testing v1-0001.  I'm not sure yet what that is about.)
> 
> Doesn't look like 0001 has anything to do with that...   Are you on a
> Mac?  Did it look like this recent failure from CI?
> 
> https://cirrus-ci.com/task/4686108033286144
> https://api.cirrus-ci.com/v1/artifact/task/4686108033286144/log/src/test/recovery/tmp_check/log/regress_log_013_crash_restart
> https://api.cirrus-ci.com/v1/artifact/task/4686108033286144/log/src/test/recovery/tmp_check/log/013_crash_restart_primary.log
> 
> I have no idea what is going on there, but searching for discussion
> brought me here...

I just got a crash in this test again.  Are you still interested?  I still have 
the logs.  No core file appears to have been generated.

The test failure is

not ok 5 - psql query died successfully after SIGQUIT

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: JSON constructors and window functions

2022-04-04 Thread Greg Stark
Are we missing regression tests using these functions as window functions?

Hm. I suppose it's possible to write a general-purpose regression test
that loops over all aggregate functions, runs each both as a window
function and as a plain aggregate over the same data sets, and compares
the results. At least for aggregate functions with a single parameter
belonging to a chosen set of data types.
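A sketch of that cross-check in plain Python (built-ins stand in for the catalog's aggregates; a real test would enumerate pg_aggregate in SQL): the last row of a running-window computation must match the plain aggregate over the same data set.

```python
# For any aggregate, applying it as a window function over a growing
# frame should, on the final row, reproduce the plain aggregate result.
aggregates = {"sum": sum, "min": min, "max": max, "count": len}

def as_window(func, rows):
    # one result per row, over the frame of rows seen so far
    return [func(rows[:i + 1]) for i in range(len(rows))]

datasets = [[3, 1, 4, 1, 5], [-2, 0, 7]]
for name, func in aggregates.items():
    for rows in datasets:
        assert as_window(func, rows)[-1] == func(rows), name

assert as_window(sum, [3, 1, 4, 1, 5]) == [3, 4, 8, 9, 14]
```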




Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-04 Thread Masahiko Sawada
On Mon, Apr 4, 2022 at 1:30 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Tue, Mar 01, 2022 at 11:35:32AM +0900, Masahiko Sawada wrote:
> > On Wed, Jan 19, 2022 at 5:52 PM Julien Rouhaud  wrote:
> > >
> > > It seems that the regression tests aren't entirely stable, per cfbot:
> > > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3298.
> > >
> > > The failures look like:
> > >
> > > diff -U3 
> > > /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out 
> > > /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out
> > > --- 
> > > /tmp/cirrus-ci-build/src/test/recovery/../regress/expected/explain.out
> > >   2022-01-19 03:50:37.087931908 +
> > > +++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/explain.out  
> > >   2022-01-19 03:57:41.013616137 +
> > > @@ -512,9 +512,10 @@
> > > I/O Timings: temp read=N.N write=N.N
> > > ->  Function Scan on generate_series  (cost=N.N..N.N rows=N width=N) 
> > > (actual time=N.N..N.N rows=N loops=N)
> > >   I/O Timings: temp read=N.N write=N.N
> > > +   I/O Timings: shared/local read=N.N
> > >   Planning Time: N.N ms
> > >   Execution Time: N.N ms
> > > -(6 rows)
> > > +(7 rows)
> > >
> > >  select explain_filter('explain (analyze, buffers, format json) select 
> > > count(*) from generate_series(1, 10)');
> > >  explain_filter
> > >
> > >
> > > I don't see any obvious error in the code, so I'm wondering if it's simply
> > > the result of populating the cache with generate_series() info.
> >
> > Thank you for reporting.
> >
> > You're right, the regression test with track_io_timing = on is not
> > entirely stable because we may read catalog data during planning time,
> > resulting in an additional line in the EXPLAIN output. I've removed
> > the regression tests.
>
> Hmm, but AFAICS the json format would be stable as the counters are always
> shown even if zero.  So just doing the json format first and then the text
> format should also work.  Although if you're really unlucky there could be a
> cache invalidation in between so we could just ignore the text format.  But I
> think we should at least keep a regression test with the json format, with a
> > comment explaining why only this one is tested.

Fair point. We disabled track_io_timing by commit 7e12256b478, but we can
probably temporarily enable it and run one query with the "buffers"
and "format json" options.
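For illustration, the normalization these tests depend on can be sketched as follows (a rough Python analogue of the suite's explain_filter() helper, not its actual implementation): volatile numbers are masked so only the structure of the EXPLAIN output is compared.

```python
# Replace volatile numbers in EXPLAIN output with placeholders, the way
# the regression suite's explain_filter() does, so output is stable
# across runs regardless of actual costs and timings.
import re

def explain_filter(line):
    line = re.sub(r"\d+\.\d+", "N.N", line)  # costs and timings
    line = re.sub(r"\d+", "N", line)         # rows, widths, loops
    return line

assert explain_filter("I/O Timings: temp read=0.036 write=1.248") == \
       "I/O Timings: temp read=N.N write=N.N"
assert explain_filter("(6 rows)") == "(N rows)"
```

This masking hides the timing values, but it cannot hide an entire extra line appearing from catalog reads during planning, which is why only the JSON format (where the counters are always present) is stable.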

>
> > I've attached updated patches. I've included an improvement of
> > pg_stat_statement support to support temp I/O timing.
>
> Great!
>
> Some other considerations:
>
> - should we update pg_stat_statements documentation (and comments) for
>   blk_(read|write)_time to mention that it's for *data file* blocks rather 
> than
>   simply blocks?  It seems obvious now that we'd have
>   temp_blk_(read|write)_time, but still.  This part should probably be
>   backpatched.

Agreed.

>
> - not really your patch's fault I guess, but I see that extendBufFile() isn't
>   handled.  There shouldn't be much activity there so maybe it's ok.
>   This is likely because smgr_extend is also not handled, but this one seems
>   much more likely to take quite some time, and therefore should bump the
>   timing counters.

You mean we should include the time for opening files as write time?
IIUC smgrextend() writes data while extending the file, whereas
extendBufFile() doesn't do that but just opens a new file.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Mark Dilger



> On Apr 4, 2022, at 8:36 AM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>> If we want to backtrack to v8, that's fine.  I can rebase that, port
>> some of the other changes from v14 to it, and repost it as v15.
> 
> Are you working on that?  I've set aside time this week to hopefully
> get this over the finish line, but I don't want to find out that
> I've been duplicating your effort.

Yes, I expect to post the latest version in maybe an hour. I believe the patch 
I'm preparing to post (I'm just reviewing it, adjusting code comments, etc.) 
has all the changes we've discussed, aside from your parameterId renaming.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-04-04 Thread Dilip Kumar
On Mon, Apr 4, 2022 at 2:25 PM Kyotaro Horiguchi
 wrote:
>
> At Mon, 04 Apr 2022 17:29:48 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > I haven't found how the patch caused creation of a relation file that
> > is to be removed soon.  However, I find that the v19 patch fails, maybe
> > due to some change in Cluster.pm.  It will take a bit more time to check
> > that..
>
> I was away for a bit; of course the WAL-logged create database interferes
> with the patch here. But I haven't found why it stops creating the
> database directory under pg_tblspc.

I did not understand what the exact problem is here, but the database
directory and the version file are created under the default
tablespace of the target database.  For tablespaces other than the
database's default one, the database directory is created as part of
the smgrcreate() call, so that we do not create an unnecessary
directory under a tablespace where there is no data to be copied.
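A toy sketch of that on-demand behaviour (plain Python with hypothetical OIDs and paths, not the server's real smgrcreate()): the database directory under a non-default tablespace only appears when the first relation is actually created there.

```python
# Model of on-demand directory creation: nothing is created under a
# tablespace until a relation actually needs to be written there.
import os
import tempfile

def smgrcreate(tblspc_root, dboid, relfilenode):
    dbdir = os.path.join(tblspc_root, str(dboid))
    os.makedirs(dbdir, exist_ok=True)  # directory created on demand
    open(os.path.join(dbdir, str(relfilenode)), "w").close()

with tempfile.TemporaryDirectory() as tblspc:
    assert os.listdir(tblspc) == []    # no data to copy, no directory
    smgrcreate(tblspc, 16384, 16385)
    assert os.path.isfile(os.path.join(tblspc, "16384", "16385"))
```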

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: JSON constructors and window functions

2022-04-04 Thread Tom Lane
Andrew Dunstan  writes:
> On 4/3/22 22:46, Andrew Dunstan wrote:
>> On 4/3/22 20:11, Andres Freund wrote:
>>> I don't think you're allowed to free stuff in a finalfunc - we might reuse 
>>> the
>>> transition state for further calls to the aggregate.

>> Doh! Of course! I'll fix it in the morning. Thanks.

> I've committed a fix for this. I didn't find something to clean out the
> hash table, so I just removed the 'hash_destroy' and left it at that.
> All the tests I did came back with expected results.
> Maybe a hash_reset() is something worth having, assuming it's possible? I
> note that simplehash has a reset function.

But removing the hash entries would be just as much of a problem
wouldn't it?

regards, tom lane




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-04 Thread Nathan Bossart
I noticed a couple of other things that can be removed.  Since we no longer
wait on exclusive backup mode during smart shutdown, we can change
connsAllowed (in postmaster.c) to a boolean and remove CAC_SUPERUSER.  We
can also remove a couple of related notes in the documentation.  I've done
all this in the attached patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 5d79d79472252207eea5b7dcc52010736da10296 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v10 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 230 +---
 doc/src/sgml/func.sgml| 111 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pg_ctl-ref.sgml  |   6 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 doc/src/sgml/runtime.sgml |   8 +-
 src/backend/access/transam/xlog.c | 493 ++
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  75 +--
 src/backend/replication/basebackup.c  |  20 +-
 src/backend/utils/init/postinit.c |  18 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |  30 --
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   2 +-
 src/include/catalog/pg_proc.dat   |  28 +-
 src/include/libpq/libpq-be.h  |   3 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 24 files changed, 213 insertions(+), 1177 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..5b7139c7df 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -881,23 +873,23 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0

 
  Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a user who has been granted
+ rights to run pg_backup_start (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_backup_start(label => 'label', fast => false);
 
  where label is any string you want to use to uniquely
  identify this backup operation. The connection
- calling pg_start_backup must be maintained until the end of
+ calling pg_backup_start must be maintained until the end of
  the backup, or the backup will be automatically aborted.
 
 
 
- By default, pg_start_backup can take a long time to finish.
- This is because it performs a checkpoint, and the I/O
- required for the checkpoint will be spread out over a significant
- period of time, by default half your inter-checkpoint interval
- (see the configuration parameter
+ By default, pg_backup_start can take a long time to finish.
+ This is because it waits for the next checkpoint to complete, and the I/O
+ required for the checkpoint might be spread out over a significant
+ period of time (see the configuration parameters
+  and
  ).  This is
  usually what you want, because it minimizes the impact on query
  processing.  If you want to start the backup as soon as
@@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false);
  issue an immediate checkpoint using as much I/O as available.
 
 
-
- The third parameter being false tells
- pg_start_backup to initiate a non-exclusive base backup.
-


 
@@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false);
 
  In the same connection as before, issue the command:
 
-SELECT * FROM pg_stop_backup(false, true);
+SELECT * FROM pg_backup_stop(wait_for_archive => true);
 
  This terminates backup mode. On a primary, it 

Re: PATCH: add "--config-file=" option to pg_rewind

2022-04-04 Thread Greg Stark
Just reposting the previous patches to straighten out the cfbot.
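The patches reposted here thread a --config-file value through to the postgres command lines pg_rewind builds. A rough sketch of that command assembly in Python (illustrative paths; shlex.quote stands in for appendShellString()):

```python
# Build the postgres invocation the way pg_rewind's getRestoreCommand()
# does, appending --config_file only when the user supplied one, with
# shell quoting for paths that contain special characters.
import shlex

def build_postgres_cmd(postgres_bin, datadir, config_file=None):
    parts = [postgres_bin, "-D", datadir]
    if config_file is not None:        # add the switch only if requested
        parts.append("--config_file=" + config_file)
    parts += ["-C", "restore_command"]
    return " ".join(shlex.quote(p) for p in parts)

cmd = build_postgres_cmd("/usr/bin/postgres", "/var/lib/pg data",
                         config_file="/etc/pg/postgresql.conf")
assert "--config_file=/etc/pg/postgresql.conf" in cmd
assert "'/var/lib/pg data'" in cmd     # embedded space gets quoted
```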
--- /usr/lib/python3/dist-packages/patroni/postgresql/__init__.py	2022-02-18 13:16:15.0 +
+++ __init__.py	2022-04-03 19:17:29.952665383 +
@@ -798,7 +798,8 @@
         return True
 
     def get_guc_value(self, name):
-        cmd = [self.pgcommand('postgres'), '-D', self._data_dir, '-C', name]
+        cmd = [self.pgcommand('postgres'), '-D', self._data_dir, '-C', name,
+               '--config-file={}'.format(self.config.postgresql_conf)]
         try:
             data = subprocess.check_output(cmd)
             if data:
--- /usr/lib/python3/dist-packages/patroni/postgresql/rewind.py	2022-02-18 13:16:15.0 +
+++ rewind.py	2022-04-03 19:21:14.479726127 +
@@ -314,6 +314,7 @@
         cmd = [self._postgresql.pgcommand('pg_rewind')]
         if self._postgresql.major_version >= 13 and restore_command:
             cmd.append('--restore-target-wal')
+            cmd.append('--config-file={0}/postgresql.conf'.format(self._postgresql.config._config_dir))
         cmd.extend(['-D', self._postgresql.data_dir, '--source-server', dsn])
 
         while True:
 doc/src/sgml/ref/pg_rewind.sgml   |  12 +
 src/bin/pg_rewind/pg_rewind.c |  24 -
 src/bin/pg_rewind/t/001_basic.pl  |   1 +
 src/bin/pg_rewind/t/RewindTest.pm | 105 +++---
 4 files changed, 101 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 33e6bb64ad..62fcb71825 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -241,6 +241,18 @@ PostgreSQL documentation
   
  
 
+ 
+  --config-file=filepath
+  
+   
+When using the -c / --restore-target-wal option, the restore_command
+is extracted from the configuration of the target cluster. If the postgresql.conf
+of that cluster is not in the target data directory (or if you want to use an alternative config),
+use this option to provide a (relative or absolute) path to the postgresql.conf to be used.
+   
+  
+ 
+
  
   --debug
   
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b39b5c1aac..2e83c7ee8e 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -61,6 +61,7 @@ char	   *datadir_target = NULL;
 char	   *datadir_source = NULL;
 char	   *connstr_source = NULL;
 char	   *restore_command = NULL;
+char	   *config_file = NULL;
 
 static bool debug = false;
 bool		showprogress = false;
@@ -87,6 +88,8 @@ usage(const char *progname)
 	printf(_("Options:\n"));
 	printf(_("  -c, --restore-target-wal   use restore_command in target configuration to\n"
 			 " retrieve WAL files from archives\n"));
+	printf(_("  --config-file=FILE path to postgresql.conf if it resides outside\n"));
+	printf(_(" the target data directory (for -c)\n"));
 	printf(_("  -D, --target-pgdata=DIRECTORY  existing data directory to modify\n"));
 	printf(_("  --source-pgdata=DIRECTORY  source data directory to synchronize with\n"));
 	printf(_("  --source-server=CONNSTRsource server to synchronize with\n"));
@@ -114,13 +117,14 @@ main(int argc, char **argv)
 		{"write-recovery-conf", no_argument, NULL, 'R'},
 		{"source-pgdata", required_argument, NULL, 1},
 		{"source-server", required_argument, NULL, 2},
+		{"debug", no_argument, NULL, 3},
 		{"no-ensure-shutdown", no_argument, NULL, 4},
+		{"config-file", required_argument, NULL, 5},
 		{"version", no_argument, NULL, 'V'},
 		{"restore-target-wal", no_argument, NULL, 'c'},
 		{"dry-run", no_argument, NULL, 'n'},
 		{"no-sync", no_argument, NULL, 'N'},
 		{"progress", no_argument, NULL, 'P'},
-		{"debug", no_argument, NULL, 3},
 		{NULL, 0, NULL, 0}
 	};
 	int			option_index;
@@ -205,6 +209,10 @@ main(int argc, char **argv)
 			case 4:
 no_ensure_shutdown = true;
 break;
+
+			case 5:
+config_file = pg_strdup(optarg);
+break;
 		}
 	}
 
@@ -1061,6 +1069,13 @@ getRestoreCommand(const char *argv0)
 	appendPQExpBufferStr(postgres_cmd, " -D ");
 	appendShellString(postgres_cmd, datadir_target);
 
+	/* add --config_file switch only if requested */
+	if (config_file != NULL)
+	{
+		appendPQExpBufferStr(postgres_cmd, " --config_file=");
+		appendShellString(postgres_cmd, config_file);
+	}
+
 	/* add -C switch, for restore_command */
 	appendPQExpBufferStr(postgres_cmd, " -C restore_command");
 
@@ -1139,6 +1154,13 @@ ensureCleanShutdown(const char *argv0)
 	appendPQExpBufferStr(postgres_cmd, " --single -F -D ");
 	appendShellString(postgres_cmd, datadir_target);
 
+	/* add --config_file switch only if requested */
+	if (config_file != NULL)
+	{
+		appendPQExpBufferStr(postgres_cmd, " --config_file=");
+		appendShellString(postgres_cmd, config_file);
+	}
+
 	/* finish with the database name, and a properly quoted 

Re: Granting SET and ALTER SYSTEM privileges for GUCs

2022-04-04 Thread Tom Lane
Mark Dilger  writes:
> If we want to backtrack to v8, that's fine.  I can rebase that, port
> some of the other changes from v14 to it, and repost it as v15.

Are you working on that?  I've set aside time this week to hopefully
get this over the finish line, but I don't want to find out that
I've been duplicating your effort.

regards, tom lane




Re: Add non-blocking version of PQcancel

2022-04-04 Thread Jelte Fennema
Here is what I consider the final version of this patch. I don't have any
further changes planned myself (except for ones that come up during review). 
Things that changed since the previous iteration:
1. postgres_fdw now uses the non-blocking cancellation API (including test).
2. Added some extra sleeps to the cancellation test, to remove random failures 
on FreeBSD.


0002-Add-non-blocking-version-of-PQcancel.patch
Description: 0002-Add-non-blocking-version-of-PQcancel.patch


Re: JSON constructors and window functions

2022-04-04 Thread Andrew Dunstan


On 4/3/22 22:46, Andrew Dunstan wrote:
> On 4/3/22 20:11, Andres Freund wrote:
>> Hi,
>>
>> On 2022-04-03 18:56:39 -0400, Andrew Dunstan wrote:
>>> Haven't found the issue yet :-( It happens on the second call for the
>>> partition to  json_check_unique_key().
>>>
>>> Here's a more idiomatic and self-contained query that triggers the problem.
>>>
>>>
>>> select json_objectagg('10' : ref_0.level2 with unique keys)
>>>     over (partition by ref_0.parent_no order by ref_0.level2)
>>> from (values (1::int,1::int),(1,2),(2,1),(2,2)) as ref_0(parent_no,level2);
>> The hash was created in a context that's already freed.
>>
> [...]
>>
>> I don't think you're allowed to free stuff in a finalfunc - we might reuse 
>> the
>> transition state for further calls to the aggregate.
>>
>
> Doh! Of course! I'll fix it in the morning. Thanks.
>
>


I've committed a fix for this. I didn't find something to clean out the
hash table, so I just removed the 'hash_destroy' and left it at that.
All the tests I did came back with expected results.

Maybe a hash_reset() is something worth having, assuming it's possible? I
note that simplehash has a reset function.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: basebackup/lz4 crash

2022-04-04 Thread Robert Haas
On Thu, Mar 31, 2022 at 3:19 AM Dipesh Pandit  wrote:
> Patch attached.

Committed. Thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: use has_privs_of_role() for pg_hba.conf

2022-04-04 Thread Nathan Bossart
On Mon, Apr 04, 2022 at 09:36:13AM -0400, Joshua Brindle wrote:
> Good catch, I think this is a logical followup to the previous
> has_privs_of_role patch.
> 
> Reviewed and +1

Thanks!  I created a commitfest entry for this:

https://commitfest.postgresql.org/38/3609/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file

2022-04-04 Thread Nathan Bossart
On Mon, Apr 04, 2022 at 09:56:26AM -0400, David Steele wrote:
> Minor typo in the docs:
> 
> +  * capable of doing an online backup, but exclude then just in case.
> 
> Should be:
> 
> capable of doing an online backup, but exclude them just in case.

fixed

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 8960a15630290e49af4906a6951eced563efce7d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Dec 2021 23:50:49 +
Subject: [PATCH v9 1/1] remove exclusive backup mode

---
 doc/src/sgml/backup.sgml  | 230 +---
 doc/src/sgml/func.sgml| 111 +---
 doc/src/sgml/high-availability.sgml   |   6 +-
 doc/src/sgml/monitoring.sgml  |   4 +-
 doc/src/sgml/ref/pgupgrade.sgml   |   2 +-
 src/backend/access/transam/xlog.c | 493 ++
 src/backend/access/transam/xlogfuncs.c| 253 ++---
 src/backend/access/transam/xlogrecovery.c |   2 +-
 src/backend/catalog/system_functions.sql  |  18 +-
 src/backend/postmaster/postmaster.c   |  46 +-
 src/backend/replication/basebackup.c  |  20 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |   4 +
 src/bin/pg_ctl/pg_ctl.c   |  30 --
 src/bin/pg_rewind/filemap.c   |   6 +-
 src/include/access/xlog.h |   7 +-
 src/include/catalog/pg_control.h  |   2 +-
 src/include/catalog/pg_proc.dat   |  28 +-
 src/include/miscadmin.h   |   4 -
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  56 +-
 .../t/010_logical_decoding_timelines.pl   |   4 +-
 20 files changed, 199 insertions(+), 1127 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0d69851bb1..5b7139c7df 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,16 +857,8 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0
 sequence, and that the success of a step is verified before
 proceeding to the next step.

-   
-Low level base backups can be made in a non-exclusive or an exclusive
-way. The non-exclusive method is recommended and the exclusive one is
-deprecated and will eventually be removed.
-   
-
-   
-Making a Non-Exclusive Low-Level Backup
 
- A non-exclusive low level backup is one that allows other
+ A low level backup allows other
  concurrent backups to be running (both those started using
  the same backup API and those started using
  ).
@@ -881,23 +873,23 @@ test ! -f /mnt/server/archivedir/000100A90065  cp pg_wal/0

 
  Connect to the server (it does not matter which database) as a user with
- rights to run pg_start_backup (superuser, or a user who has been granted
+ rights to run pg_backup_start (superuser, or a user who has been granted
  EXECUTE on the function) and issue the command:
 
-SELECT pg_start_backup('label', false, false);
+SELECT pg_backup_start(label => 'label', fast => false);
 
  where label is any string you want to use to uniquely
  identify this backup operation. The connection
- calling pg_start_backup must be maintained until the end of
+ calling pg_backup_start must be maintained until the end of
  the backup, or the backup will be automatically aborted.
 
 
 
- By default, pg_start_backup can take a long time to finish.
- This is because it performs a checkpoint, and the I/O
- required for the checkpoint will be spread out over a significant
- period of time, by default half your inter-checkpoint interval
- (see the configuration parameter
+ By default, pg_backup_start can take a long time to finish.
+ This is because it waits for the next checkpoint to complete, and the I/O
+ required for the checkpoint might be spread out over a significant
+ period of time (see the configuration parameters
+  and
  ).  This is
  usually what you want, because it minimizes the impact on query
  processing.  If you want to start the backup as soon as
@@ -905,10 +897,6 @@ SELECT pg_start_backup('label', false, false);
  issue an immediate checkpoint using as much I/O as available.
 
 
-
- The third parameter being false tells
- pg_start_backup to initiate a non-exclusive base backup.
-


 
@@ -926,7 +914,7 @@ SELECT pg_start_backup('label', false, false);
 
  In the same connection as before, issue the command:
 
-SELECT * FROM pg_stop_backup(false, true);
+SELECT * FROM pg_backup_stop(wait_for_archive => true);
 
  This terminates backup mode. On a primary, it also performs an automatic
  switch to the next WAL segment.  On a standby, it is not possible to
@@ -937,7 +925,7 @@ SELECT * FROM pg_stop_backup(false, true);
  ready to archive.
 
 
- The pg_stop_backup will return one row with three
+ pg_backup_stop will return 

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thread Robert Haas
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro  wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea).  Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

I'm not sure that it really matters, but with the idea that I proposed
it's possible to "save" a pending writeback, if we notice that we're
accessing the relation with a proper lock after the barrier arrives
and before we actually try to do the writeback. With this approach we
throw them out immediately, so they're just gone. I don't mind that
because I don't think this will happen often enough to matter, and I
don't think it will be very expensive when it does, but it's something
to think about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Run pg_amcheck in 002_pg_upgrade.pl and 027_stream_regress.pl?

2022-04-04 Thread Robert Haas
On Sun, Apr 3, 2022 at 10:10 PM Peter Geoghegan  wrote:
> I meant to tell the authors of verify_heapam() (also CC'd) that it
> really helped with my recent VACUUM project. While the assertions that
> I wrote in vacuumlazy.c might catch certain bugs like this,
> verify_heapam() is much more effective in practice.

Yeah, I was very excited about verify_heapam(). There is a lot more
stuff that we could check, but a lot of those things would be much
more expensive to check. It does a good job, I think, checking all the
things that a human being could potentially spot just by looking at an
individual page. I love the idea of using it in regression testing in
more places. It might find bugs in amcheck, which would be good, but I
think it's even more likely to help us find bugs in other code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



