Re: PG DOCS - logical replication filtering

2022-04-07 Thread Peter Smith
PSA patch v6 to address some of Amit's review comments [1].

--
[1] 
https://www.postgresql.org/message-id/CAA4eK1JdwQQsxa%2BzpsBW5rCxEfXopYx381nwcCyeXk6mpF8ChQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v6-0001-PG-DOCS-page-for-row-filters.patch
Description: Binary data


Re: generic plans and "initial" pruning

2022-04-07 Thread Amit Langote
On Thu, Apr 7, 2022 at 9:41 PM David Rowley  wrote:
> On Thu, 7 Apr 2022 at 20:28, Amit Langote  wrote:
> > Here's an updated version.  In particular, I removed
> > part_prune_results list from PortalData, in favor of anything that
> > needs to look at the list can instead get it from the CachedPlan
> > (PortalData.cplan).  This makes things better in 2 ways:
>
> Thanks for making those changes.
>
> I'm not overly familiar with the data structures we use for planning
> around plans between the planner and executor, but storing the pruning
> results in CachedPlan seems pretty bad. I see you've stashed it in
> there and invented a new memory context to stop leaks into the cache
> memory.
>
> Since I'm not overly familiar with these structures, I'm trying to
> imagine why you made that choice and the best I can come up with was
> that it was the most convenient thing you had to hand inside
> CheckCachedPlan().

Yeah, it's that way because it felt convenient, though I have wondered
if a simpler scheme that doesn't require any changes to the CachedPlan
data structure might be better after all.  Your pointing it out has
made me think a bit harder on that.

> I don't really have any great ideas right now on how to make this
> better. I wonder if GetCachedPlan() should be changed to return some
> struct that wraps up the CachedPlan with some sort of executor prep
> info struct that we can stash the list of PartitionPruneResults in,
> and perhaps something else one day.

I think what might be better to do now is just add an output List
parameter to GetCachedPlan() to collect the PartitionPruneResult nodes,
instead of stashing them into CachedPlan as now.  IMHO, we should
leave inventing a new generic struct to the next project that will
make it necessary to return more information from GetCachedPlan() to
its users.  I find it hard to convincingly describe what the new
generic struct really is if we invent it *now*, when it's going to
carry a single list whose purpose is pretty narrow.

So, I've implemented this by making the callers of GetCachedPlan()
pass a list to add the PartitionPruneResults that may be produced.
Most callers can put that into the Portal for passing that to other
modules, so I have reinstated PortalData.part_prune_results.  As for
its memory management, the list and the PartitionPruneResults therein
will be allocated in a context that holds the Portal itself.

> Some lesser important stuff that I think could be done better.
>
> * Are you also able to put meaningful comments on the
> PartitionPruneResult struct in execnodes.h?
>
> * In create_append_plan() and create_merge_append_plan() you have the
> same code to set the part_prune_index. Why not just move all that code
> into make_partition_pruneinfo() and have make_partition_pruneinfo()
> return the index and append to the PlannerInfo.partPruneInfos List?

That sounds better, so done.

> * Why not forboth() here?
>
> i = 0;
> foreach(stmtlist_item, portal->stmts)
> {
> PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item);
> PartitionPruneResult *part_prune_result = part_prune_results ?
>   list_nth(part_prune_results, i) :
>   NULL;
>
> i++;

Because the PartitionPruneResult list may not always be available.  To
wit, it's only available when it was GetCachedPlan() that gave the
portal its plan.  I know this is a bit ugly, but it seems better than
making all users of Portal build a dummy list, and it isn't totally
avoidable even in the current implementation.

> * It would be good if ReleaseExecutorLocks() already knew the RTIs
> that were locked. Maybe the executor prep info struct I mentioned
> above could also store the RTIs that have been locked already and
> allow ReleaseExecutorLocks() to just iterate over those to release the
> locks.

Rewrote this such that ReleaseExecutorLocks() just receives a list of
per-PlannedStmt bitmapsets containing the RT indexes of only the
locked entries in that plan.

Attached updated patch with these changes.



--
Amit Langote
EDB: http://www.enterprisedb.com


v13-0001-Optimize-AcquireExecutorLocks-to-skip-pruned-par.patch
Description: Binary data


Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-07 Thread Masahiko Sawada
On Fri, Apr 8, 2022 at 1:37 PM Michael Paquier  wrote:
>
> On Fri, Apr 08, 2022 at 11:34:17AM +0900, Michael Paquier wrote:
> > Now, 0002 is straight-forward but I need more coffee and lunch..
>
> Done this one as well, as of 76cbf7e with few tweaks.  1.9 and 1.10
> changed the definition of pg_stat_statements, so I have added two
> extra queries for those upgrade paths in oldextversions.sql.

Thank you for committing both patches. Agreed with these changes.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Logical replication timeout problem

2022-04-07 Thread wangw.f...@fujitsu.com
On Wed, Apr 7, 2022 at 1:34 PM Amit Kapila   wrote:
>
Thanks for your comments.

> One comment:
> +static void
> +update_progress(LogicalDecodingContext *ctx, bool skipped_xact, bool
> end_xact)
> +{
> + static int changes_count = 0;
> +
> + if (end_xact)
> + {
> + /* Update progress tracking at xact end. */
> + OutputPluginUpdateProgress(ctx, skipped_xact);
> + changes_count = 0;
> + }
> + /*
> + * After continuously processing CHANGES_THRESHOLD changes, update
> progress
> + * which will also try to send a keepalive message if required.
> 
> I think you can simply return after making changes_count = 0. There
> should be an empty line before starting the next comment.
Improved as suggested.
BTW, there is a conflict in current HEAD when applying v12 because of
commit 2c7ea57, so I also rebased it.

Attached the new patch.
1. Made some improvements to the new function update_progress. [suggested by 
Amit-San]
2. Rebased the patch because of commit 2c7ea57 in current HEAD.

Regards,
Wang wei


v13-0001-Fix-the-logical-replication-timeout-during-large.patch
Description:  v13-0001-Fix-the-logical-replication-timeout-during-large.patch


Re: shared-memory based stats collector - v70

2022-04-07 Thread Kyotaro Horiguchi
At Thu, 7 Apr 2022 20:59:21 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote:
> > I can read it. But I'm not sure that the difference is obvious for
> > average users between "starting a standby from a basebackup" and
> > "starting a standby after a normal shutdown"..
> 
> Yea, that's what I was concerned about. How about:
> 
>   
>Cumulative statistics are collected in shared memory. Every
>PostgreSQL process collects statistics locally
>then updates the shared data at appropriate intervals.  When a server,
>including a physical replica, shuts down cleanly, a permanent copy of the
>statistics data is stored in the pg_stat subdirectory,
>so that statistics can be retained across server restarts.  In contrast,
>when starting from an unclean shutdown (e.g., after an immediate shutdown,
>a server crash, starting from a base backup, and point-in-time recovery),
>all statistics counters are reset.
>   

Looks perfect generally, and especially in regard to the concern.

> I think I like my version above a bit better?

Quite a bit.  Mine didn't address the concern.

> > > 2)
> > > The edit is not a problem, but it's hard to understand what the existing
> > > paragraph actually means?
> > > 
> > > diff --git a/doc/src/sgml/high-availability.sgml 
> > > b/doc/src/sgml/high-availability.sgml
> > > index 3247e056663..8bfb584b752 100644
> > > --- a/doc/src/sgml/high-availability.sgml
> > > +++ b/doc/src/sgml/high-availability.sgml
> > > @@ -,17 +,17 @@ HINT:  You can then restart the server after 
> > > making the necessary configuration
> > > ...
> > > 
> > > -The statistics collector is active during recovery. All scans, 
> > > reads, blocks,
> > > +The cumulative statistics system is active during recovery. All 
> > > scans, reads, blocks,
> > >  index usage, etc., will be recorded normally on the standby. Replayed
> > >  actions will not duplicate their effects on primary, so replaying an
> > >  insert will not increment the Inserts column of pg_stat_user_tables.
> > >  The stats file is deleted at the start of recovery, so stats from 
> > > primary
> > >  and standby will differ; this is considered a feature, not a bug.
> > > 
> > > 
> > > 
> > 
> > Agreed partially. It's too detailed.  It might not need to mention WAL
> > replay.
> 
> My concern is more that it seems halfway nonsensical. "Replayed actions will
> not duplicate their effects on primary" - I can guess what that means, but not
> more. There's no "Inserts" column of pg_stat_user_tables.
> 
> 
>
> The cumulative statistics system is active during recovery. All scans,
> reads, blocks, index usage, etc., will be recorded normally on the
> standby. However, WAL replay will not increment relation and database
> specific counters. I.e. replay will not increment pg_stat_all_tables
> columns (like n_tup_ins), nor will reads or writes performed by the
> startup process be tracked in the pg_statio views, nor will associated
> pg_stat_database columns be incremented.
>

Looks clearer, since it mentions user-facing interfaces with concrete
example columns.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Windows now has fdatasync()

2022-04-07 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Apr 08, 2022 at 02:56:15PM +1200, Thomas Munro wrote:
>> I propose that we drop support for Windows versions older than
>> 10/Server 2016 in the PostgreSQL 16 cycle,

Do we have any data on what people are actually using?

> Do you think that we could raise the minimum C standard on WIN32 to
> C11, at least for MSVC?

As long as the C11-isms are in MSVC-only code, it seems like this is
exactly equivalent to setting a minimum MSVC version.  I don't see
an objection-in-principle there, it's just a practical question of
how far back is reasonable to support MSVC versions.  (That's very
distinct from how far back we need the built code to run.)

regards, tom lane




Re: shared-memory based stats collector - v70

2022-04-07 Thread David G. Johnston
On Thu, Apr 7, 2022 at 8:59 PM Andres Freund  wrote:

>
>   
>Cumulative statistics are collected in shared memory. Every
>PostgreSQL process collects statistics
> locally
>then updates the shared data at appropriate intervals.  When a server,
>including a physical replica, shuts down cleanly, a permanent copy of
> the
>statistics data is stored in the pg_stat
> subdirectory,
>so that statistics can be retained across server restarts.  In contrast,
>when starting from an unclean shutdown (e.g., after an immediate
> shutdown,
>a server crash, starting from a base backup, and point-in-time
> recovery),
>all statistics counters are reset.
>   
>

I like this.  My comment regarding using "i.e.," here stands though.


>
>
> The cumulative statistics system is active during recovery. All scans,
> reads, blocks, index usage, etc., will be recorded normally on the
> standby. However, WAL replay will not increment relation and database
> specific counters. I.e. replay will not increment pg_stat_all_tables
> columns (like n_tup_ins), nor will reads or writes performed by the
> startup process be tracked in the pg_statio views, nor will associated
> pg_stat_database columns be incremented.
>
>
>
I like this too.  The second part with three nors is a bit rough.  Maybe:

... specific counters.  In particular, replay will not increment
pg_stat_database or pg_stat_all_tables columns, and the startup process
will not report reads and writes for the pg_statio views.

It would be helpful to give at least one specific example of what is being
recorded normally, especially since we give three of what is not.

David J.


Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
On 2022-04-07 16:37:51 -0700, Andres Freund wrote:
> On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > I've gotten through the main commits (and then a fix for the apparently
> > inevitable bug that's immediately highlighted by the buildfarm), and the 
> > first
> > test. I'll call it a night now, and work on the other tests & docs tomorrow.
> 
> I've gotten through the tests now. There's one known, not yet addressed, issue
> with the stats isolation test, see [1].

That has since been fixed, in d6c0db14836cd843d589372d909c73aab68c7a24




Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-07 Thread Michael Paquier
On Fri, Apr 08, 2022 at 11:34:17AM +0900, Michael Paquier wrote:
> Now, 0002 is straight-forward but I need more coffee and lunch..

Done this one as well, as of 76cbf7e with few tweaks.  1.9 and 1.10
changed the definition of pg_stat_statements, so I have added two
extra queries for those upgrade paths in oldextversions.sql.
--
Michael


signature.asc
Description: PGP signature


Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 20:59:21 -0700, Andres Freund wrote:
> 
>   
>Cumulative statistics are collected in shared memory. Every
>PostgreSQL process collects statistics locally
>then updates the shared data at appropriate intervals.  When a server,
>including a physical replica, shuts down cleanly, a permanent copy of the
>statistics data is stored in the pg_stat subdirectory,
>so that statistics can be retained across server restarts.  In contrast,
>when starting from an unclean shutdown (e.g., after an immediate shutdown,
>a server crash, starting from a base backup, and point-in-time recovery),
>all statistics counters are reset.
>   
> ...
>
> The cumulative statistics system is active during recovery. All scans,
> reads, blocks, index usage, etc., will be recorded normally on the
> standby. However, WAL replay will not increment relation and database
> specific counters. I.e. replay will not increment pg_stat_all_tables
> columns (like n_tup_ins), nor will reads or writes performed by the
> startup process be tracked in the pg_statio views, nor will associated
> pg_stat_database columns be incremented.
>

I went with these for now. My guess is that there's further improvements in
them, and in surrounding areas...


With that, I'll close this CF entry. It's been a while.

Greetings,

Andres Freund




Re: Windows now has fdatasync()

2022-04-07 Thread Michael Paquier
On Fri, Apr 08, 2022 at 02:56:15PM +1200, Thomas Munro wrote:
> I propose that we drop support for Windows versions older than
> 10/Server 2016 in the PostgreSQL 16 cycle, because the OS patches for
> everything older come to an end in October next year[1], and we have a
> lot of patches relating to modern Windows features that stall on
> details about old systems that no one actually has.
> 
> [1] https://en.wikipedia.org/wiki/List_of_Microsoft_Windows_versions

Do you think that we could raise the minimum C standard on WIN32 to
C11, at least for MSVC?  There is a patch floating around to add
pg_attribute_aligned() and perhaps pg_attribute_noreturn() for MSVC:
https://www.postgresql.org/message-id/yk6ugcglzkuxr...@paquier.xyz

noreturn() needed at least C11:
https://docs.microsoft.com/en-us/cpp/c-language/noreturn?view=msvc-140

Perhaps we'd better also bump up the minimum version of MSVC
supported..
--
Michael


signature.asc
Description: PGP signature


Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-07 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Jacob Champion (pchamp...@vmware.com) wrote:
> > On Fri, 2022-03-11 at 19:39 -0500, Stephen Frost wrote:
> > > Even so, I’m not against adding an option… but exactly how would that
> > > option be configured?  Server level?  On the HBA line?  role level..?
> > 
> > In the OPTIONS for CREATE SERVER, maybe? At least for the FDW case.
> 
> I'm a bit confused on this.  The option to allow or not allow delegated
> credentials couldn't be something that's in the CREATE SERVER for FDWs
> as it applies to more than just FDWs but also dblink and anything else
> where we reach out from PG to contact some other system.

Thinking it through further, it seems like the right place to allow an
administrator to control if credentials are allowed to be delegated is
through a pg_hba option.  Attached patch adds such an option.

> > > > Also, we're
> > > > globally ignoring whatever ccache was set by an administrator. Can't
> > > > two postgres_fdw connections from the same backend process require
> > > > different settings?
> > > 
> > > Settings..? Perhaps, but delegated credentials aren’t really 
> > > settings, so not really sure what you’re suggesting here.
> > 
> > I mean that one backend server might require delegated credentials, and
> > another might require whatever the admin has already set up in the
> > ccache, and the user might want to use tables from both servers in the
> > same session.
> 
> That an admin might have a credential cache that's picked up and used
> for connections from a regular user backend to another system strikes me
> as an altogether concerning idea.  Even so, in such a case, the admin
> would have had to set up the user mapping with 'password required =
> false' or it wouldn't have worked for a non-superuser anyway, so I'm not
> sure that I'm too worried about this case.

To address this, I also added a new GUC which allows an administrator to
control what the credential cache is set to for user-authenticated
backends, with a default of MEMORY:, which should generally be safe and
won't cause a user backend to pick up on a file-based credential cache
which might exist on the server somewhere.  This gives the administrator
the option to set it to more-or-less whatever they'd like though, so if
they want to set it to a file-based credential cache, then they can do
so (I did put some caveats about doing that into the documentation as I
don't think it's generally a good idea to do...).
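
Based on the description above, the configuration surface might look something like this (the GUC name and default are from this message; the file-based example value is purely illustrative):

```
# Default: user-authenticated backends use a private in-memory
# credential cache, so they never pick up a file-based cache that
# happens to exist on the server.
krb_user_ccache = 'MEMORY:'

# An administrator could instead point at a file-based cache, with the
# caveats noted in the patch's documentation (path is hypothetical):
#krb_user_ccache = 'FILE:/var/lib/pgsql/krb5cc'
```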

> > > > I notice that gss_store_cred_into() has a companion,
> > > > gss_acquire_cred_from(). Is it possible to use that to pull out our
> > > > delegated credential explicitly by name, instead of stomping on the
> > > > global setup?
> > > 
> > > Not really sure what is meant here by global setup..?  Feeling like
> > > this is a follow on confusion from maybe mixing server vs client
> > > libpq?
> > 
> > By my reading, the gss_store_cred_into() call followed by
> > the setenv("KRB5CCNAME", ...) is effectively performing global
> > configuration for the process. Any KRB5CCNAME already set up by the
> > server admin is going to be ignored from that point onward. Is that
> > accurate?
> 
> The process, yes, but I guess I disagree on that being 'global'- it's
> just for that PG backend process.

The new krb_user_ccache is a lot closer to 'global', though it's
specifically for user-authenticated backends (allowing the postmaster
and other things like replication connections to use whatever the
credential cache is set to by the administrator on startup), but that
seems like it makes sense to me; generally you're not going to want
regular user backends to be accessing the credential cache of the
'postgres' unix account on the server.

> Attached is an updated patch which adds the gss_release_creds call, a
> function in libpq to allow checking if the libpq connection was made
> using GSS, changes to dblink to have it check for password-or-gss when
> connecting to a remote system, and tests for dblink and postgres_fdw to
> make sure that this all works correctly.

I've added a couple more tests to address the new options too, along
with documentation for them.  This is starting to feel reasonably decent
to me, at least as a first pass at supporting kerberos credential
delegation, which is definitely a feature I've been hoping we would get
into PG for quite a while.  Would certainly appreciate some feedback on
this (from anyone who'd like to comment), though I know we're getting
into the last few hours before feature freeze ends.

Updated patch attached.

Thanks!

Stephen
From 83b2979e252866ce86a9bfb3d0b631db2f5b8404 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 7 Apr 2022 15:34:39 -0400
Subject: [PATCH] Add support for Kerberos credential delegation

Accept GSSAPI/Kerberos delegated credentials.  With this, a user could
authenticate to PostgreSQL using Kerberos credentials, delegate
credentials to the PostgreSQL server, and then the PostgreSQL server

Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 20:51:10 -0700, David G. Johnston wrote:
> On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi 
> wrote:
> > I can read it. But I'm not sure that the difference is obvious for
> > average users between "starting a standby from a basebackup" and
> > "starting a standby after a normal shutdown"..
> >
> > Other than that, it might be easier to read if the additional part
> > were moved out to the end of the paragraph, prefixing with "Note:
> > ". For example,
> >
> > ...
> > statistics can be retained across server restarts.  When crash recovery is
> > performed at server start (e.g., after immediate shutdown, server crash,
> > and point-in-time recovery), all statistics counters are reset. Note that
> > crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.
> >
> >
> Maybe:
> When the server shuts down cleanly a permanent copy of the statistics data
> is stored in the pg_stat subdirectory so that
> statistics can be retained across server restarts.  However, if crash
> recovery is performed (i.e., after immediate shutdown, server crash, or
> point-in-time recovery), all statistics counters are reset.  For any
> standby server, the initial startup to get the cluster initialized is a
> point-in-time crash recovery startup. For all subsequent startups it
> behaves like any other server.  For a hot standby server, statistics are
> retained during a failover promotion.
> 
> I'm pretty sure i.e., is correct since those three situations are not
> examples but rather the complete set.

I don't think the "initial startup ..." bit is quite correct. A standby can be
created from a shut-down server, and IIRC there are some differences in how PITR
is handled too.


> Is crash recovery ever performed other than at server start?  If so I
> choose to remove the redundancy.

No.


> I feel like some of this detail about standby servers is/should be covered
> elsewhere and we are at least missing a cross-reference chance even if we
> leave the material coverage as-is.

I didn't find anything good to reference...


> > 2)
> > > The edit is not a problem, but it's hard to understand what the existing
> > > paragraph actually means?
> > >
> > > diff --git a/doc/src/sgml/high-availability.sgml
> > b/doc/src/sgml/high-availability.sgml
> > > index 3247e056663..8bfb584b752 100644
> > > --- a/doc/src/sgml/high-availability.sgml
> > > +++ b/doc/src/sgml/high-availability.sgml
> > > @@ -,17 +,17 @@ HINT:  You can then restart the server after
> > making the necessary configuration
> > > ...
> > > 
> > > -The statistics collector is active during recovery. All scans,
> > reads, blocks,
> > > +The cumulative statistics system is active during recovery. All
> > scans, reads, blocks,
> > >  index usage, etc., will be recorded normally on the standby.
> > Replayed
> > >  actions will not duplicate their effects on primary, so replaying an
> > >  insert will not increment the Inserts column of pg_stat_user_tables.
> > >  The stats file is deleted at the start of recovery, so stats from
> > primary
> > >  and standby will differ; this is considered a feature, not a bug.
> > > 
> > >
> > > 
> >
> > Agreed partially. It's too detailed.  It might not need to mention WAL
> > replay.
> >
> >
> The insert example seems like a poor one...IIUC cumulative statistics are
> not WAL logged and while in recovery INSERT is prohibited, so how would
> replaying the insert in the WAL result in a duplicated effect on
> pg_stat_user_tables.inserts?

I agree, the sentence doesn't make much sense.

It doesn't really matter that stats aren't WAL logged, one could infer them
from the WAL at a decent level of accuracy. However, we can't actually
associate those actions to relations, we just know the relfilenode... And the
startup process can't read the catalog to figure the mapping out.


I also have no idea what "on primary" is supposed to mean in the
fragment "Replayed actions will not duplicate their effects on
primary...".

I think it's trying to say "will not duplicate the effect on
pg_stat_user_tables they had on the primary".


> I would like to write the following but I don't believe it is sufficiently
> true:
> 
> "The cumulative statistics system records only the locally generated
> activity of the cluster, including while in recovery.  Activity happening
> only due to the replay of WAL is not considered local."
> 
> But to apply the WAL we have to fetch blocks from the local filesystem and
> write them back out.  That is local activity happening due to the replay of
> WAL which sounds like it is and should be reported ("All...blocks...", and
> the example given being logical DDL oriented).

That's not true today - the startup processes reads / writes aren't reflected
in pg_statio, pg_stat_database or whatnot.


> I cannot think of a better paragraph at the moment, the minimal change is
> good, and the detail it 

Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote:
> At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund  wrote 
> in 
> > Hi,
> > 
> > On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > > I've gotten through the main commits (and then a fix for the apparently
> > > inevitable bug that's immediately highlighted by the buildfarm), and the 
> > > first
> > > test. I'll call it a night now, and work on the other tests & docs 
> > > tomorrow.
> > 
> > I've gotten through the tests now. There's one known, not yet addressed, 
> > issue
> > with the stats isolation test, see [1].
> > 
> > 
> > Working on the docs. Found a few things worth raising:
> > 
> > 1)
> > Existing text:
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so that
> >statistics can be retained across server restarts.  When recovery is
> >performed at server start (e.g., after immediate shutdown, server crash,
> >and point-in-time recovery), all statistics counters are reset.
> > 
> > The existing docs patch hadn't updated yet. My current edit is
> > 
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so that
> >statistics can be retained across server restarts.  When crash recovery 
> > is
> >performed at server start (e.g., after immediate shutdown, server crash,
> >and point-in-time recovery, but not when starting a standby that was shut
> >down normally), all statistics counters are reset.
> > 
> > but I'm not sure the parenthetical is easy enough to understand?
> 
> I can read it. But I'm not sure that the difference is obvious for
> average users between "starting a standby from a basebackup" and
> "starting a standby after a normal shutdown"..

Yea, that's what I was concerned about. How about:

  
   Cumulative statistics are collected in shared memory. Every
   PostgreSQL process collects statistics locally
   then updates the shared data at appropriate intervals.  When a server,
   including a physical replica, shuts down cleanly, a permanent copy of the
   statistics data is stored in the pg_stat subdirectory,
   so that statistics can be retained across server restarts.  In contrast,
   when starting from an unclean shutdown (e.g., after an immediate shutdown,
   a server crash, starting from a base backup, and point-in-time recovery),
   all statistics counters are reset.
  


> Other than that, it might be easier to read if the additional part
> were moved out to the end of the paragraph, prefixing with "Note:
> ". For example,
> 
> ...
> statistics can be retained across server restarts.  When crash recovery is
> performed at server start (e.g., after immediate shutdown, server crash,
> and point-in-time recovery), all statistics counters are reset. Note that
> crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.

I think I like my version above a bit better?


> > 2)
> > The edit is not a problem, but it's hard to understand what the existing
> > paragraph actually means?
> > 
> > diff --git a/doc/src/sgml/high-availability.sgml 
> > b/doc/src/sgml/high-availability.sgml
> > index 3247e056663..8bfb584b752 100644
> > --- a/doc/src/sgml/high-availability.sgml
> > +++ b/doc/src/sgml/high-availability.sgml
> > @@ -,17 +,17 @@ HINT:  You can then restart the server after making 
> > the necessary configuration
> > ...
> > 
> > -The statistics collector is active during recovery. All scans, reads, 
> > blocks,
> > +The cumulative statistics system is active during recovery. All scans, 
> > reads, blocks,
> >  index usage, etc., will be recorded normally on the standby. Replayed
> >  actions will not duplicate their effects on primary, so replaying an
> >  insert will not increment the Inserts column of pg_stat_user_tables.
> >  The stats file is deleted at the start of recovery, so stats from 
> > primary
> >  and standby will differ; this is considered a feature, not a bug.
> > 
> > 
> > 
> 
> Agreed partially. It's too detailed.  It might not need to mention WAL
> replay.

My concern is more that it seems halfway nonsensical. "Replayed actions will
not duplicate their effects on primary" - I can guess what that means, but not
more. There's no "Inserts" column of pg_stat_user_tables.


   
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
   



Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-04-07 Thread David G. Johnston
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi 
wrote:

> At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund 
> wrote in
> > Hi,
> >
> > On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > > I've gotten through the main commits (and then a fix for the apparently
> > > inevitable bug that's immediately highlighted by the buildfarm), and
> the first
> > > test. I'll call it a night now, and work on the other tests & docs
> tomorrow.
> >
> > I've gotten through the tests now. There's one known, not yet addressed,
> issue
> > with the stats isolation test, see [1].
> >
> >
> > Working on the docs. Found a few things worth raising:
> >
> > 1)
> > Existing text:
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so
> that
> >statistics can be retained across server restarts.  When recovery is
> >performed at server start (e.g., after immediate shutdown, server
> crash,
> >and point-in-time recovery), all statistics counters are reset.
> >
> > The existing docs patch hadn't updated yet. My current edit is
> >
> >When the server shuts down cleanly, a permanent copy of the statistics
> >data is stored in the pg_stat subdirectory, so
> that
> >statistics can be retained across server restarts.  When crash
> recovery is
> >performed at server start (e.g., after immediate shutdown, server
> crash,
> >and point-in-time recovery, but not when starting a standby that was
> shut
> >down normally), all statistics counters are reset.
> >
> > but I'm not sure the parenthetical is easy enough to understand?
>
> I can read it. But I'm not sure that the difference is obvious for
> average users between "starting a standby from a basebackup" and
> "starting a standby after a normal shutdown"..
>
> Other than that, it might be easier to read if the additional part
> were moved out to the end of the paragraph, prefixing with "Note:
> ". For example,
>
> ...
> statistics can be retained across server restarts.  When crash recovery is
> performed at server start (e.g., after immediate shutdown, server crash,
> and point-in-time recovery), all statistics counters are reset. Note that
> crash recovery is not performed when starting a standby that was shut
> down normally, in which case all counters are retained.
>
>
Maybe:
When the server shuts down cleanly a permanent copy of the statistics data
is stored in the pg_stat subdirectory so that
statistics can be retained across server restarts.  However, if crash
recovery is performed (i.e., after immediate shutdown, server crash, or
point-in-time recovery), all statistics counters are reset.  For any
standby server, the initial startup to get the cluster initialized is a
point-in-time crash recovery startup. For all subsequent startups it
behaves like any other server.  For a hot standby server, statistics are
retained during a failover promotion.

I'm pretty sure i.e., is correct since those three situations are not
examples but rather the complete set.

Is crash recovery ever performed other than at server start?  If so I
choose to remove the redundancy.

I feel like some of this detail about standby servers is/should be covered
elsewhere and we are at least missing a cross-reference chance even if we
leave the material coverage as-is.

> 2)
> > The edit is not a problem, but it's hard to understand what the existing
> > paragraph actually means?
> >
> > diff --git a/doc/src/sgml/high-availability.sgml
> b/doc/src/sgml/high-availability.sgml
> > index 3247e056663..8bfb584b752 100644
> > --- a/doc/src/sgml/high-availability.sgml
> > +++ b/doc/src/sgml/high-availability.sgml
> > @@ -,17 +,17 @@ HINT:  You can then restart the server after
> making the necessary configuration
> > ...
> > 
> > -The statistics collector is active during recovery. All scans,
> reads, blocks,
> > +The cumulative statistics system is active during recovery. All
> scans, reads, blocks,
> >  index usage, etc., will be recorded normally on the standby.
> Replayed
> >  actions will not duplicate their effects on primary, so replaying an
> >  insert will not increment the Inserts column of pg_stat_user_tables.
> >  The stats file is deleted at the start of recovery, so stats from
> primary
> >  and standby will differ; this is considered a feature, not a bug.
> > 
> >
> > 
>
> Agreed partially. It's too detailed.  It might not need to mention WAL
> replay.
>
>
The insert example seems like a poor one...IIUC cumulative statistics are
not WAL logged and while in recovery INSERT is prohibited, so how would
replaying the insert in the WAL result in a duplicated effect on
pg_stat_user_tables.inserts?

I also have no idea what, in the fragment "Replayed actions will not
duplicate their effects on primary...", "on primary" is supposed to mean.

I would like to write the following but I don't believe it is sufficiently
true:

"The cumulative statistics system 

Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Greg Stark
On Thu, 7 Apr 2022 at 22:32, Robert Haas  wrote:
>
> On Thu, Apr 7, 2022 at 7:41 PM Tom Lane  wrote:
>
> > Possibly a better idea is to add an enum argument telling the function
> > what to do (parse the whole thing as one name regardless of dots,
> > parse as two names if there's a dot, throw error if there's a dot,
> > etc etc as needed by existing call sites).  Perhaps some of the
> > existing arguments could be merged into such an enum, too.
>
> AFAICS there's not much to be done about the fact
> that one caller wants pg_log_error + exit(2), another wants fatal(), a
> third PQfinish(conn) and exit_nicely(), and the last termPQExpBuffer()
> and return false.

That doesn't seem to be entirely inconsistent with what Tom describes.
Instead of "throw an error" the function would return an error and
possibly some extra info which the caller would use to handle the
error appropriately.

It still has the nice property that the decision that it is in fact an
error would be made inside the parsing function based on the enum
declaring what's intended. And it wouldn't return a possibly bogus
parsing with information the caller might use to infer it isn't what
was desired (or fail to).

-- 
greg




Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Justin Pryzby
On Thu, Apr 07, 2022 at 10:26:18PM -0400, Robert Haas wrote:
> +   pg_log_error("improper relation name (too many dotted names): 
> %s", pattern);
> 
> Come to think of it, maybe the error text there could stand some
> bikeshedding, but AFAICS

AFAICT the error text deliberately matches this, which I mentioned seems to me
the strongest argument for supporting \d datname.nspname.relname

ts=# SELECT 'a.a.a.a'::regclass;
ERROR:  42601: improper relation name (too many dotted names): a.a.a.a
LINE 1: SELECT 'a.a.a.a'::regclass;
   ^
LOCATION:  makeRangeVarFromNameList, namespace.c:3129




Re: Windows now has fdatasync()

2022-04-07 Thread Thomas Munro
I've bumped this to the next cycle, so I can hopefully skip the
missing version detection stuff that I have no way to test (no CI, no
build farm, and I have zero interest in dumpster diving for Windows 7
or whatever installations).

I propose that we drop support for Windows versions older than
10/Server 2016 in the PostgreSQL 16 cycle, because the OS patches for
everything older come to an end in October next year[1], and we have a
lot of patches relating to modern Windows features that stall on
details about old systems that no one actually has.

[1] https://en.wikipedia.org/wiki/List_of_Microsoft_Windows_versions




Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

2022-04-07 Thread Michael Paquier
On Thu, Apr 07, 2022 at 09:48:02PM +0900, Masahiko Sawada wrote:
> Oops, the results are opposite:
> 
> HEAD: 5367.234 ms
> Patched: 5418.869 ms

I have been playing with external sorts & friends after running an
instance on scissors (fsync=off, PGDATA on tmpfs, etc.), even forcing
a compilation of the code with gettimeofday(), and I can see a
tendency of a ~1% impact with the patch in this configuration.  So
there is a tendency, while this is also rather close to the usual
noise range.  Anyway, this is going to be helpful for debugging when
temp file I/O is a bottleneck, so..  Applied 0001 with some tweaks to
the docs and some indentation fixes.

Now, 0002 is straight-forward but I need more coffee and lunch..
--
Michael


signature.asc
Description: PGP signature


Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Robert Haas
On Thu, Apr 7, 2022 at 7:41 PM Tom Lane  wrote:
> Mark Dilger  writes:
> > The patch submitted changes processSQLNamePattern() to return a dot count 
> > by reference.  It's up to the caller to decide whether to raise an error.  
> > If you pass in no schemavar, and you get back dotcnt=2, you know it parsed 
> > it as a two part pattern, and you can pg_fatal(...) or ereport(ERROR, ...) 
> > or whatever.
>
> Well, I'm not telling Robert what to do, but I wouldn't accept that
> API.  It requires duplicative error-handling code at every call site
> and is an open invitation to omitting necessary error checks.
>
> Possibly a better idea is to add an enum argument telling the function
> what to do (parse the whole thing as one name regardless of dots,
> parse as two names if there's a dot, throw error if there's a dot,
> etc etc as needed by existing call sites).  Perhaps some of the
> existing arguments could be merged into such an enum, too.

I hadn't considered that approach, but I don't think it works very
well, because front-end error handling is so inconsistent. From the
patch:

+   pg_log_error("improper relation name (too many dotted
names): %s", pattern);
+   exit(2);

+   fatal("improper qualified name (too many
dotted names): %s",
+ cell->val);

+   pg_log_error("improper qualified name (too
many dotted names): %s",
+cell->val);
+   PQfinish(conn);
+   exit_nicely(1);

+   pg_log_error("improper qualified name (too many dotted
names): %s",
+pattern);
+   termPQExpBuffer();
+   return false;

Come to think of it, maybe the error text there could stand some
bikeshedding, but AFAICS there's not much to be done about the fact
that one caller wants pg_log_error + exit(2), another wants fatal(), a
third PQfinish(conn) and exit_nicely(), and the last termPQExpBuffer()
and return false.
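For what it's worth, the enum idea can be made concrete with a small self-contained sketch (every name below is invented for illustration; this is not the patch's actual API, and real pattern parsing would also have to honor double quotes). The caller declares up front how dots are to be treated, and the function reports failure through its return value rather than exiting itself, so each call site keeps its own pg_log_error/fatal/exit_nicely convention:

```c
/* Hypothetical: how the caller wants dotted names handled. */
typedef enum DottedNameMode
{
	DOTTED_NAME_SINGLE,			/* whole pattern is one name, dots and all */
	DOTTED_NAME_SPLIT,			/* split on the dot into schema.name */
	DOTTED_NAME_FORBID			/* any dot is an error */
} DottedNameMode;

typedef enum DottedNameResult
{
	DOTTED_NAME_OK,
	DOTTED_NAME_TOO_MANY_DOTS
} DottedNameResult;

/*
 * Count the dots and decide, per the requested mode, whether the
 * pattern is acceptable.  The dot count is always reported, but the
 * error decision is made here, based on the caller's declared mode.
 */
DottedNameResult
check_dotted_name(const char *pattern, DottedNameMode mode, int *dotcnt)
{
	int			n = 0;

	for (const char *p = pattern; *p; p++)
		if (*p == '.')
			n++;
	*dotcnt = n;

	switch (mode)
	{
		case DOTTED_NAME_SINGLE:
			return DOTTED_NAME_OK;
		case DOTTED_NAME_SPLIT:
			return n <= 1 ? DOTTED_NAME_OK : DOTTED_NAME_TOO_MANY_DOTS;
		case DOTTED_NAME_FORBID:
			return n == 0 ? DOTTED_NAME_OK : DOTTED_NAME_TOO_MANY_DOTS;
	}
	return DOTTED_NAME_OK;		/* not reached */
}
```

Each caller would then map DOTTED_NAME_TOO_MANY_DOTS onto whatever exit path it already uses, which also keeps the decision that a pattern is in fact an error inside the parsing function, as suggested earlier in the thread.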

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-07 Thread Robert Haas
On Thu, Apr 7, 2022 at 7:52 PM Tom Lane  wrote:
> Michael Paquier  writes:
> > On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> >> Here are patches for master and v14 to do things this way. Comments?
>
> > Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
> > the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> > HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> > bool argument to the existing routine.
>
> Isn't adding another argument an API break?  (If there's any outside
> code calling GetVirtualXIDsDelayingChkpt, which it seems like there
> might be.)

Yeah, that's exactly why I didn't do what Michael proposes. If we're
going to go to this trouble to avoid changing the layout of a PGPROC,
we must be doing that on the theory that extension code cares about
delayChkpt. And if that is so, it seems reasonable to suppose that it
might also want to call the associated functions.

Honestly, I wouldn't have thought that this mattered, because I
wouldn't have guessed that any non-core code cared about delayChkpt.
But I would have been wrong.
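To make the compatibility trade-off concrete, here is a toy sketch (all names, types, and signatures are invented stand-ins; the real PostgreSQL functions look different): folding a bool argument into an existing extension-visible function changes the signature extensions were compiled against, while adding a sibling entry point that shares a static helper keeps the old symbol intact, at the cost of the duplication being discussed:

```c
#include <stdbool.h>

/* Stand-in for the real vxid array; only the signatures matter here. */
typedef struct VxidList { int nxids; } VxidList;

static VxidList delaying_start = { 2 };
static VxidList delaying_end = { 1 };

/* One shared internal helper, so the (stubbed) logic lives in one place. */
static VxidList *
get_delaying_vxids_internal(bool chkpt_end)
{
	return chkpt_end ? &delaying_end : &delaying_start;
}

/*
 * The back-patch-friendly shape: the pre-existing, extension-visible
 * function keeps its exact signature, and the new case gets a sibling
 * entry point.  Adding a bool argument to the existing function instead
 * would break any extension already calling it.
 */
VxidList *
GetDelayingVxids(void)			/* hypothetical stand-in name */
{
	return get_delaying_vxids_internal(false);
}

VxidList *
GetDelayingVxidsEnd(void)		/* hypothetical stand-in name */
{
	return get_delaying_vxids_internal(true);
}
```

The static helper keeps the code duplication to the two thin wrappers, while nothing about the original entry point changes from an extension's point of view.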

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector - v70

2022-04-07 Thread Kyotaro Horiguchi
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > I've gotten through the main commits (and then a fix for the apparently
> > inevitable bug that's immediately highlighted by the buildfarm), and the 
> > first
> > test. I'll call it a night now, and work on the other tests & docs tomorrow.
> 
> I've gotten through the tests now. There's one known, not yet addressed, issue
> with the stats isolation test, see [1].
> 
> 
> Working on the docs. Found a few things worth raising:
> 
> 1)
> Existing text:
>When the server shuts down cleanly, a permanent copy of the statistics
>data is stored in the pg_stat subdirectory, so that
>statistics can be retained across server restarts.  When recovery is
>performed at server start (e.g., after immediate shutdown, server crash,
>and point-in-time recovery), all statistics counters are reset.
> 
> The existing docs patch hadn't updated yet. My current edit is
> 
>When the server shuts down cleanly, a permanent copy of the statistics
>data is stored in the pg_stat subdirectory, so that
>statistics can be retained across server restarts.  When crash recovery is
>performed at server start (e.g., after immediate shutdown, server crash,
>and point-in-time recovery, but not when starting a standby that was shut
>down normally), all statistics counters are reset.
> 
> but I'm not sure the parenthetical is easy enough to understand?

I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..

Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,

...
statistics can be retained across server restarts.  When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.

> 2)
> The edit is not a problem, but it's hard to understand what the existing
> paragraph actually means?
> 
> diff --git a/doc/src/sgml/high-availability.sgml 
> b/doc/src/sgml/high-availability.sgml
> index 3247e056663..8bfb584b752 100644
> --- a/doc/src/sgml/high-availability.sgml
> +++ b/doc/src/sgml/high-availability.sgml
> @@ -,17 +,17 @@ HINT:  You can then restart the server after making 
> the necessary configuration
> ...
> 
> -The statistics collector is active during recovery. All scans, reads, 
> blocks,
> +The cumulative statistics system is active during recovery. All scans, 
> reads, blocks,
>  index usage, etc., will be recorded normally on the standby. Replayed
>  actions will not duplicate their effects on primary, so replaying an
>  insert will not increment the Inserts column of pg_stat_user_tables.
>  The stats file is deleted at the start of recovery, so stats from primary
>  and standby will differ; this is considered a feature, not a bug.
> 
> 
> 

Agreed partially. It's too detailed.  It might not need to mention WAL
replay.

> I'll just commit the necessary bit, but we really ought to rephrase this.
> 
> 
> 
> 
> Greetings,
> 
> Andres Freund
> 
> [1] 
> https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: WIP: WAL prefetch (another approach)

2022-04-07 Thread Thomas Munro
On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby  wrote:
> The docs seem to be wrong about the default.
>
> +are not yet in the buffer pool, during recovery.  Valid values are
> +off (the default), on and
> +try.  The setting try enables

Fixed.

> +   concurrency and distance, respectively.  By default, it is set to
> +   try, which enabled the feature on systems where
> +   posix_fadvise is available.
>
> Should say "which enables".

Fixed.

> Curiously, I reported a similar issue last year.

Sorry.  I guess both times we only agreed on what the default should
be in the final review round before commit, and I let the docs get out
of sync (well, the default is mentioned in two places and I apparently
ended my search too soon, changing only one).  I also found another
recently obsoleted sentence: the one about showing nulls sometimes was
no longer true.  Removed.




Re: API stability

2022-04-07 Thread Kyotaro Horiguchi
(Mmm. My mailer automatically tore off the [was: ..] part from the
subject..)

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: API stability

2022-04-07 Thread Kyotaro Horiguchi
At Fri, 8 Apr 2022 08:47:42 +0900, Michael Paquier  wrote 
in 
> On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> > Here are patches for master and v14 to do things this way. Comments?
> 
> Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
> the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> bool argument to the existing routine.  The same kind of duplication
> happens with GetVirtualXIDsDelayingChkpt() and
> GetVirtualXIDsDelayingChkptEnd().

FWIW, it is done in [1].

[1] 
https://www.postgresql.org/message-id/20220406.164521.17171257901083417.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 18:31:35 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-04-07 09:57:09 -0700, Andres Freund wrote:
> >> Yea :(. I tested debug_discard_caches, but not -DRELCACHE_FORCE_RELEASE
> >> -DCATCACHE_FORCE_RELEASE.
> >>
> >> Not quite sure what to do about it - it's intentionally trying to test the
> >> case of no invalidations being processed, as that's an annoying edge case 
> >> with
> >> functions.  Perhaps wrapping the function call of the "already dropped"
> >> function in another function that catches the error would do the trick? 
> >> It'd
> >> be more easily silently broken, but still be better than not having the 
> >> test.
>
> > Anybody got a better idea?
>
> Maybe if the wrapper function checks for exactly the two expected
> behaviors, it'd be robust enough?

Seems to work. If I break the code it's trying to test, it still fails... Of
course only when none of debug_discard_caches, RELCACHE_FORCE_RELEASE,
CATCACHE_FORCE_RELEASE are used, but that seems unavoidable / harmless. Let's
see what the buildfarm thinks.

Greetings,

Andres Freund




Re: Documentation about PL transforms

2022-04-07 Thread chap

On 2022-02-22 12:59, Chapman Flack wrote:

It would have been painful to write documentation of get_func_trftypes
saying its result isn't what get_transform_{from.to}sql expect, so
patch 1 does add a get_call_trftypes that returns a List *.

Patch 2 then updates the docs as discussed in this thread. It turned out
plhandler.sgml was kind of a long monolith of text even before adding
transform information, so I broke it into sections first. This patch adds
the section markup without reindenting, so the changes aren't obscured.

The chapter had also fallen behind the introduction of procedures, so
I have changed many instances of 'function' to the umbrella term
'routine'.

Patch 3 simply reindents for the new section markup and rewraps.
Whitespace only.

Patch 4 updates src/test/modules/plsample to demonstrate handling of
transforms (and to add some more comments generally).


Here is a rebase.

From 9c726423d47a114a82ca703c0e03a62e3d74f1f6 Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Mon, 21 Feb 2022 20:56:28 -0500
Subject: [PATCH v2 1/4] Warmup: add a get_call_trftypes function

The existing get_func_trftypes function produces an Oid[], where
both existing get_transform_{from,to}sql functions that depend
on the result expect a List*.

Rather than writing documentation awkwardly describing functions
that won't play together, add a get_call_trftypes function that
returns List*. (The name get_call_... to distinguish from
get_func_... follows the naming used in funcapi.h for a function
returning information about either a function or a procedure.)
---
 src/backend/utils/cache/lsyscache.c | 18 ++
 src/include/funcapi.h   |  5 +
 src/include/utils/lsyscache.h   |  1 +
 3 files changed, 24 insertions(+)

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 1b7e11b..3cd94db 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -2117,6 +2117,24 @@ get_transform_tosql(Oid typid, Oid langid, List *trftypes)
 		return InvalidOid;
 }
 
+/*
+ * get_call_trftypes
+ *
+ *		A helper function that does not itself query the transform cache, but
+ *		constructs the transform-type List expected by the functions above.
+ */
+List *
+get_call_trftypes(HeapTuple procTup)
+{
+	Datum		protrftypes;
+	bool		isNull;
+
+	protrftypes = SysCacheGetAttr(PROCOID, procTup,
+  Anum_pg_proc_protrftypes,
+  &isNull);
+	return isNull ? NIL : oid_array_to_list(protrftypes);
+}
+
 
 /*-- TYPE CACHE --		 */
 
diff --git a/src/include/funcapi.h b/src/include/funcapi.h
index dc3d819..70c3e13 100644
--- a/src/include/funcapi.h
+++ b/src/include/funcapi.h
@@ -175,7 +175,12 @@ extern int	get_func_arg_info(HeapTuple procTup,
 extern int	get_func_input_arg_names(Datum proargnames, Datum proargmodes,
 	 char ***arg_names);
 
+/*
+ * A deprecated earlier version of get_call_trftypes (in lsyscache.h).
+ * That version produces a List, which is the form downstream functions expect.
+ */
 extern int	get_func_trftypes(HeapTuple procTup, Oid **p_trftypes);
+
 extern char *get_func_result_name(Oid functionId);
 
 extern TupleDesc build_function_result_tupdesc_d(char prokind,
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index b8dd27d..93b19e7 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -139,6 +139,7 @@ extern char get_rel_relkind(Oid relid);
 extern bool get_rel_relispartition(Oid relid);
 extern Oid	get_rel_tablespace(Oid relid);
 extern char get_rel_persistence(Oid relid);
+extern List *get_call_trftypes(HeapTuple procTup);
 extern Oid	get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
 extern Oid	get_transform_tosql(Oid typid, Oid langid, List *trftypes);
 extern bool get_typisdefined(Oid typid);
-- 
2.7.3

From 7fb3f8971c4c573903b1d08bcbeb126e31a6544c Mon Sep 17 00:00:00 2001
From: Chapman Flack 
Date: Mon, 21 Feb 2022 20:59:32 -0500
Subject: [PATCH v2 2/4] Update PL handler implementation docs

The original purpose was to add information on support for
CREATE TRANSFORM (which must be explicitly coded in any PL
implementation intending to support it). But the plhandler
section was about as long as a monolith of text ought to be,
even before adding transform information, so reorganized
first into sections.

Front-loaded with short descriptions of the three possible
functions (call handler, validator, inline handler) registered
with CREATE LANGUAGE. The latter two were afterthoughts in historical
sequence, but the docs don't need to present them that way.

The section had also fallen behind the introduction of procedures,
so updated to generally use the umbrella term 'routine' in place
of 'function'.

New section markup added here without indentation change, to avoid
obscuring changes. A follow-on commit will reindent and rewrap.
---
 doc/src/sgml/event-trigger.sgml|  16 ++
 doc/src/sgml/plhandler.sgml  

Re: Handle infinite recursion in logical replication setup

2022-04-07 Thread vignesh C
On Fri, Apr 8, 2022 at 4:38 AM Peter Smith  wrote:
>
> Hi Vignesh, FYI the patch is recently broken again and is failing on cfbot 
> [1].
>
> --
> [1] http://cfbot.cputube.org/patch_38_3610.log

I'm working on fixing a few review comments, I will be posting an
updated version to handle this today.

Regards,
Vignesh




GSoC: New and improved website for pgjdbc (JDBC)

2022-04-07 Thread Joseph Ho
Hello,

I am Joseph Ho, a senior at Dr Norman Bethune Collegiate Institute
interested in going into computer science. I am interested in working to
create and improve the website for pgjdbc during GSoC 2022.

I am wondering how the draft proposal should be made. Will I need to submit
a web design of the new and improved website or will I need to submit
something else? Also, am I able to use a web framework of my choice or is
there one that you prefer that we use?

Looking forward to hearing from you and working with you!

Regards,
Joseph.


Re: Should pg_dumpall dump ALTER SYSTEM settings?

2022-04-07 Thread Kyotaro Horiguchi
At Thu, 07 Apr 2022 12:38:43 +0200, Laurenz Albe  
wrote in 
> On Wed, 2022-04-06 at 21:39 -0400, Robert Haas wrote:
> > On Wed, Apr 6, 2022 at 2:26 PM Tom Lane  wrote:
> > > Thoughts?
> > 
> > I'm a little bit skeptical about this proposal, mostly because it
> > seems like it has the end result that values that are configured in
> > postgresql.conf and postgresql.auto.conf end up being handled
> > differently: one file has to be copied by hand, while the other file's
> > contents are propagated forward to the new version by pg_dump. I don't
> > think that's what people are going to be expecting...
> 
> "postgresql.auto.conf" is an implementation detail, and I would expect
> most users to distinguish between "parameters set in postgresql.conf"
> and "parameters set via the SQL statement ALTER SYSTEM".
> If that is the way you look at things, then it seems natural for the
> latter to be included in a dump, but not the former.
> 
> As another case in point, the Ubuntu/Debian packages split up the data
> directory so that the config files are under /etc, while the rest of
> the data directory is under /var/lib.  "postgresql.auto.conf" is *not*
> in /etc, but in /var/lib there.  So a user of these distributions would
> naturally think that the config files in /etc need to be handled manually,
> but "postgresql.auto.conf" need not.
> 
> I am +1 on Tom's idea.

I'm -0.2 if it is the default/implicit behavior. postgresql.conf and
ALTER SYSTEM SET work on the same set of settings.  If we include
.auto's settings in a dump, they override the intentional changes
in postgresql.conf.  I find that a bit surprising.

I'm +-0 if it is an optional behavior.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Proposal] vacuumdb --schema only

2022-04-07 Thread Justin Pryzby
On Wed, Apr 06, 2022 at 07:43:42PM +0200, Gilles Darold wrote:
> Thanks for the review, all these changes are available in new version v6
> of the patch and attached here.

This is failing in CI (except on macos, which is strangely passing).
http://cfbot.cputube.org/gilles-darold.html

https://api.cirrus-ci.com/v1/artifact/task/5379693443547136/log/src/bin/scripts/tmp_check/log/regress_log_100_vacuumdb

not ok 59 - vacuumdb --schema "Foo" postgres exit code 0

#   Failed test 'vacuumdb --schema "Foo" postgres exit code 0'
#   at t/100_vacuumdb.pl line 151.
not ok 60 - vacuumdb --schema schema only: SQL found in server log

#   Failed test 'vacuumdb --schema schema only: SQL found in server log'
#   at t/100_vacuumdb.pl line 151.
#   '2022-04-06 18:15:36.313 UTC [34857][not initialized] 
[[unknown]][:0] LOG:  connection received: host=[local]
# 2022-04-06 18:15:36.314 UTC [34857][client backend] [[unknown]][3/2801:0] 
LOG:  connection authorized: user=postgres database=postgres 
application_name=100_vacuumdb.pl
# 2022-04-06 18:15:36.318 UTC [34857][client backend] 
[100_vacuumdb.pl][3/2802:0] LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
# 2022-04-06 18:15:36.586 UTC [34857][client backend] [100_vacuumdb.pl][:0] 
LOG:  disconnection: session time: 0:00:00.273 user=postgres database=postgres 
host=[local]
# '
# doesn't match '(?^:VACUUM "Foo".bar)'




Re: Collecting statistics about contents of JSONB columns

2022-04-07 Thread Justin Pryzby
I noticed some typos.

diff --git a/src/backend/utils/adt/jsonb_selfuncs.c 
b/src/backend/utils/adt/jsonb_selfuncs.c
index f5520f88a1d..d98cd7020a1 100644
--- a/src/backend/utils/adt/jsonb_selfuncs.c
+++ b/src/backend/utils/adt/jsonb_selfuncs.c
@@ -1342,7 +1342,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
path->stats = jsonStatsFindPath(stats, 
pathstr.data,

pathstr.len);
 
-   /* Appeend path string entry for array 
elements, get stats. */
+   /* Append path string entry for array elements, 
get stats. */
                jsonPathAppendEntry(&pathstr, NULL);
pstats = jsonStatsFindPath(stats, pathstr.data, 
pathstr.len);
freq = jsonPathStatsGetFreq(pstats, 0.0);
@@ -1367,7 +1367,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
case WJB_END_ARRAY:
{
struct Path *p = path;
-   /* Absoulte selectivity of the path with its 
all subpaths */
+   /* Absolute selectivity of the path with its 
all subpaths */
Selectivity abs_sel = p->sel * p->freq;
 
/* Pop last path entry */
diff --git a/src/backend/utils/adt/jsonb_typanalyze.c 
b/src/backend/utils/adt/jsonb_typanalyze.c
index 7882db23a87..9a759aadafb 100644
--- a/src/backend/utils/adt/jsonb_typanalyze.c
+++ b/src/backend/utils/adt/jsonb_typanalyze.c
@@ -123,10 +123,9 @@ typedef struct JsonScalarStats
 /*
  * Statistics calculated for a set of values.
  *
- *
  * XXX This seems rather complicated and needs simplification. We're not
  * really using all the various JsonScalarStats bits, there's a lot of
- * duplication (e.g. each JsonScalarStats contains it's own array, which
+ * duplication (e.g. each JsonScalarStats contains its own array, which
  * has a copy of data from the one in "jsons").
  */
 typedef struct JsonValueStats
@@ -849,7 +848,7 @@ jsonAnalyzePathValues(JsonAnalyzeContext *ctx, 
JsonScalarStats *sstats,
stats->stanullfrac = (float4)(1.0 - freq);
 
/*
-* Similarly, we need to correct the MCV frequencies, becuse those are
+* Similarly, we need to correct the MCV frequencies, because those are
 * also calculated only from the non-null values. All we need to do is
 * simply multiply that with the non-NULL frequency.
 */
@@ -1015,7 +1014,7 @@ jsonAnalyzeBuildPathStats(JsonPathAnlStats *pstats)
 
/*
 * We keep array length stats here for queries like jsonpath '$.size() 
> 5'.
-* Object lengths stats can be useful for other query lanuages.
+* Object lengths stats can be useful for other query languages.
 */
if (vstats->arrlens.values.count)
jsonAnalyzeMakeScalarStats(, "array_length", 
>arrlens.stats);
@@ -1069,7 +1068,7 @@ jsonAnalyzeCalcPathFreq(JsonAnalyzeContext *ctx, 
JsonPathAnlStats *pstats,
  * We're done with accumulating values for this path, so calculate the
  * statistics for the various arrays.
  *
- * XXX I wonder if we could introduce some simple heuristict on which
+ * XXX I wonder if we could introduce some simple heuristic on which
  * paths to keep, similarly to what we do for MCV lists. For example a
  * path that occurred just once is not very interesting, so we could
  * decide to ignore it and not build the stats. Although that won't
@@ -1414,7 +1413,7 @@ compute_json_stats(VacAttrStats *stats, 
AnalyzeAttrFetchFunc fetchfunc,
 
/*
 * Collect and analyze JSON path values in single or multiple passes.
-* Sigle-pass collection is faster but consumes much more memory than
+* Single-pass collection is faster but consumes much more memory than
 * collecting and analyzing by the one path at pass.
 */
if (ctx.single_pass)




Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-07 Thread Michael Paquier
On Thu, Apr 07, 2022 at 05:29:36PM +0200, Guillaume Lelarge wrote:
> On Thu, Apr 7, 2022 at 15:44, Frédéric Yhuel wrote:
> écrit :
>> On 4/7/22 14:40, Justin Pryzby wrote:
>> Thank you Justin! I applied your fixes in the v2 patch (attached).
>
> v2 patch sounds good.

The location of the new sentence and its wording seem fine to me.  So
no objections from me to add what's suggested, as suggested.  I'll
wait for a couple of days first.

>> Indeed ;) That being said, REINDEX CONCURRENTLY could give you an
>> invalid index, so sometimes you may be tempted to go for a simpler
>> REINDEX, especially if you believe that the SELECTs won't be blocked.
> 
> Agreed.

There are many factors to take into account, one is more expensive
than the other in terms of resources and has downsides, downsides
compensated by the reduction in the lock requirements.  There are
cases where REINDEX is a must-have, as CONCURRENTLY does not support
catalog indexes, and these tend to be easily noticed when corruption
spreads around.
--
Michael




Re: logical decoding and replication of sequences

2022-04-07 Thread Justin Pryzby
Some typos I found before the patch was reverted.

diff --git a/doc/src/sgml/logicaldecoding.sgml 
b/doc/src/sgml/logicaldecoding.sgml
index a6ea6ff3fcf..d4bd8d41c4b 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -834,9 +834,8 @@ typedef void (*LogicalDecodeSequenceCB) (struct 
LogicalDecodingContext *ctx,
   non-transactional increments, the transaction may be either NULL or not
   NULL, depending on if the transaction already has an XID assigned.
   The sequence_lsn has the WAL location of the
-  sequence update. transactional says if the
-  sequence has to be replayed as part of the transaction or directly.
-
+  sequence update. transactional indicates whether
+  the sequence has to be replayed as part of the transaction or directly.
   The last_value, log_cnt and
   is_called parameters describe the sequence change.
  
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 60866431db3..b71122cce5d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -927,7 +927,7 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
  * Treat the sequence increment as transactional?
  *
  * The hash table tracks all sequences created in in-progress transactions,
- * so we simply do a lookup (the sequence is identified by relfilende). If
+ * so we simply do a lookup (the sequence is identified by relfilenode). If
  * we find a match, the increment should be handled as transactional.
  */
 bool
@@ -2255,7 +2255,7 @@ ReorderBufferApplySequence(ReorderBuffer *rb, ReorderBufferTXN *txn,
	tuple = &change->data.sequence.tuple->tuple;
seq = (Form_pg_sequence_data) GETSTRUCT(tuple);
 
-	/* Only ever called from ReorderBufferApplySequence, so transational. */
+	/* Only ever called from ReorderBufferApplySequence, so transactional. */
	if (streaming)
		rb->stream_sequence(rb, txn, change->lsn, relation, true,
							seq->last_value, seq->log_cnt, seq->is_called);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a15ce9edb13..8193bfe6515 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5601,7 +5601,7 @@ RelationBuildPublicationDesc(Relation relation, PublicationDesc *pubdesc)
 
GetSchemaPublications(schemaid, objType));
 
/*
-	 * If this is a partion (and thus a table), lookup all ancestors and track
+	 * If this is a partition (and thus a table), lookup all ancestors and track
 * all publications them too.
 */
if (relation->rd_rel->relispartition)




Re: why pg_walfile_name() cannot be executed during recovery?

2022-04-07 Thread Michael Paquier
On Thu, Apr 07, 2022 at 11:37:15AM -0400, Robert Haas wrote:
> It's also worth noting that there's a bit of a definitional problem
> here. If in the same situation, I ask for pg_walfile_name('11/0'),
> it's going to give me a filename based on TLI 2, but there's also a
> WAL file for that LSN with TLI 1. How do we know which one the user
> wants? Perhaps one idea would be to say that the relevant TLI is the
> one which was in effect at the time that LSN was replayed. If we do
> that, what about future LSNs? We could assume that for future LSNs,
> the TLI should be the same as the current TLI, but maybe that's also
> misleading, because recovery_target_timeline could be set.

FWIW, for future positions, I'd be rather on board with the concept of
using the TLI currently being replayed, but as you say that comes down
to the definition borders we want to use.  Another possibility would
be to return an error and kick the can down the road if we are unsure
of what the right behavior is.  For past positions, this should go
through a lookup of the timeline history file (the patch does not do
that at quick glance).

> I think it's really important to start by being precise about the
> question that we think pg_walfile_name() ought to be answering. If we
> don't know that, then we really can't say what TLI it should be using.
> It's not hard to make the function return SOME answer using SOME TLI,
> but then it's not clear that the answer is the right one for any
> particular purpose. And in that case the function is more dangerous
> than useful, because people will write code that uses it to do stuff,
> and then that stuff won't actually work correctly under all
> circumstances.

Agreed.
--
Michael




Re: Can we automatically add elapsed times to tap test log?

2022-04-07 Thread Andrew Dunstan


On 4/7/22 17:58, Andres Freund wrote:
> Hi,
>
> On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
 I too think that the elapsed time is useful.  I'm less convinced
 that the time-of-day marker is useful.
>>> I think it'd be quite useful if it had more precision - it's a pita to
>>> correlate regress_log_* output with server logs.
>> Fair point.  Maybe we could keep the timestamp (with ms precision
>> if possible) and then the parenthetical bit is time-since-last-line
>> (also with ms precision)?  I think that would more or less satisfy
>> both uses.
> Would work for me...
>

All doable. Time::HiRes gives us a higher resolution timer. I'll post a
new version in a day or two.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-07 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
>> Here are patches for master and v14 to do things this way. Comments?

> Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
> the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> bool argument to the existing routine.

Isn't adding another argument an API break?  (If there's any outside
code calling GetVirtualXIDsDelayingChkpt, which it seems like there
might be.)

regards, tom lane




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-07 Thread Michael Paquier
On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> Here are patches for master and v14 to do things this way. Comments?

Thanks for the patches.  They look correct.  For ~14, I'd rather avoid
the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
bool argument to the existing routine.  The same kind of duplication
happens with GetVirtualXIDsDelayingChkpt() and
GetVirtualXIDsDelayingChkptEnd().
--
Michael




Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Tom Lane
Mark Dilger  writes:
> The patch submitted changes processSQLNamePattern() to return a dot count by 
> reference.  It's up to the caller to decide whether to raise an error.  If 
> you pass in no schemavar, and you get back dotcnt=2, you know it parsed it as 
> a two part pattern, and you can pg_fatal(...) or ereport(ERROR, ...) or 
> whatever.

Well, I'm not telling Robert what to do, but I wouldn't accept that
API.  It requires duplicative error-handling code at every call site
and is an open invitation to omitting necessary error checks.

Possibly a better idea is to add an enum argument telling the function
what to do (parse the whole thing as one name regardless of dots,
parse as two names if there's a dot, throw error if there's a dot,
etc etc as needed by existing call sites).  Perhaps some of the
existing arguments could be merged into such an enum, too.

regards, tom lane




Re: shared-memory based stats collector - v70

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> I've gotten through the main commits (and then a fix for the apparently
> inevitable bug that's immediately highlighted by the buildfarm), and the first
> test. I'll call it a night now, and work on the other tests & docs tomorrow.

I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].


Working on the docs. Found a few things worth raising:

1)
Existing text:
   When the server shuts down cleanly, a permanent copy of the statistics
   data is stored in the pg_stat subdirectory, so that
   statistics can be retained across server restarts.  When recovery is
   performed at server start (e.g., after immediate shutdown, server crash,
   and point-in-time recovery), all statistics counters are reset.

The existing docs patch hadn't updated yet. My current edit is

   When the server shuts down cleanly, a permanent copy of the statistics
   data is stored in the pg_stat subdirectory, so that
   statistics can be retained across server restarts.  When crash recovery is
   performed at server start (e.g., after immediate shutdown, server crash,
   and point-in-time recovery, but not when starting a standby that was shut
   down normally), all statistics counters are reset.

but I'm not sure the parenthetical is easy enough to understand?


2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -,17 +,17 @@ HINT:  You can then restart the server after making the necessary configuration
...

-The statistics collector is active during recovery. All scans, reads, blocks,
+The cumulative statistics system is active during recovery. All scans, reads, blocks,
 index usage, etc., will be recorded normally on the standby. Replayed
 actions will not duplicate their effects on primary, so replaying an
 insert will not increment the Inserts column of pg_stat_user_tables.
 The stats file is deleted at the start of recovery, so stats from primary
 and standby will differ; this is considered a feature, not a bug.




I'll just commit the necessary bit, but we really ought to rephrase this.




Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de




Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Mark Dilger



> On Apr 7, 2022, at 3:37 PM, Tom Lane  wrote:
> 
>> 
>> I don't know whether that's a bug fix for the existing code or some
>> new bit of functionality that \dconfig requires and nothing else
>> needs.
> 
> Well, \dconfig needs it because it would like foo.bar to get processed
> as just a name.  But I think it's a bug fix because as things stood,
> if the caller doesn't provide a schemavar and the pattern contains a
> dot, the code just silently throws away the dot and all to the left.
> That doesn't seem very sane, even if it is a longstanding behavior.

The patch submitted changes processSQLNamePattern() to return a dot count by 
reference.  It's up to the caller to decide whether to raise an error.  If you 
pass in no schemavar, and you get back dotcnt=2, you know it parsed it as a two 
part pattern, and you can pg_fatal(...) or ereport(ERROR, ...) or whatever.

It looks like I'll need to post a new version of the patch with an argument 
telling the function to ignore dots, but I'm not prepared to say that for sure. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Add index scan progress to pg_stat_progress_vacuum

2022-04-07 Thread Greg Stark
It looks like this patch got feedback from Andres and Robert with some
significant design change recommendations. I'm marking the patch
Returned with Feedback. Feel free to add it back to a future
commitfest when a new version is ready.




Re: Handle infinite recursion in logical replication setup

2022-04-07 Thread Peter Smith
Hi Vignesh, FYI the patch has recently broken again and is failing on cfbot [1].

--
[1] http://cfbot.cputube.org/patch_38_3610.log

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Lowering the ever-growing heap->pd_lower

2022-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2022 at 4:01 PM Andres Freund  wrote:
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using
> several KB of stack at the current value already (*), but if it grows

No arguments here. There are probably quite a few places that won't
need to be fixed, because it just doesn't matter, but
lazy_scan_prune() will.

-- 
Peter Geoghegan




Re: Lowering the ever-growing heap->pd_lower

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> We should definitely increase MaxHeapTuplesPerPage before too long,
> for a variety of reasons that I have talked about in the past. Its
> current value is 291 on all mainstream platforms, a value that's
> derived from accidental historic details -- which predate HOT.

I'm on-board with that - but I think we should rewrite a bunch of places that
use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using
several KB of stack at the current value already (*), but if it grows
further...

Greetings,

Andres Freund


* lazy_scan_prune() has
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
which already works out to 4074 bytes.




Re: Lowering the ever-growing heap->pd_lower

2022-04-07 Thread Peter Geoghegan
On Mon, Apr 4, 2022 at 7:24 PM Peter Geoghegan  wrote:
> I am sympathetic to the idea that giving the system a more accurate
> picture of how much free space is available on each heap page is an
> intrinsic good. This might help us in a few different areas. For
> example, the FSM cares about relatively small differences in available
> free space *among* heap pages that are "in competition" in
> RelationGetBufferForTuple(). Plus we have a heuristic based on
> PageGetHeapFreeSpace() in heap_page_prune_opt() to consider.

Pushed a slightly revised version of this just now. Differences:

* Rewrote the comments, and adjusted related comments in vacuumlazy.c.
Mostly just made them shorter.

* I eventually decided that it was fine to just accept the issue with
maxoff in lazy_scan_prune (the pruning path used by VACUUM).

There seemed to be no need to reestablish a maxoff for the page here
following further reflection. I changed my mind.

Setting reclaimed line pointer array space to a pattern of 0x7F bytes
wasn't adding much here. Pruning either runs in VACUUM, or
opportunistically. When it runs in VACUUM things are highly
constrained already. Opportunistic pruning for heap_page_prune_opt()
callers doesn't even require that the caller start out with a buffer
lock. Pruning only goes ahead when we successfully acquire a cleanup
lock -- so callers can't be relying on maxoff not changing.

* Didn't keep the changes to PageTruncateLinePointerArray().

There is at least no reason to tie this question about VACUUM to how
pruning behaves. I still see some small value in avoiding creating a
new path that allows PageIsEmpty() pages to come into existence in a
new way, which is no less true with the patch I pushed.

Thanks
--
Peter Geoghegan




Re: Window Function "Run Conditions"

2022-04-07 Thread David Rowley
On Fri, 8 Apr 2022 at 02:11, David Rowley  wrote:
> Barring any objection, I'm planning to push this one in around 10 hours time.

Pushed. 9d9c02ccd

Thank you all for the reviews.

David




Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Tom Lane
Robert Haas  writes:
> I still have a vague feeling that there's probably some way of doing
> this better, but had more or less resolved to commit this patch as is
> anyway and had that all queued up. But then I had to go to a meeting
> and when I came out I discovered that Tom had done this:

Sorry, it didn't occur to me that that would impinge on what you
were doing over here ... though in retrospect I should have thought
of it.

> I don't know whether that's a bug fix for the existing code or some
> new bit of functionality that \dconfig requires and nothing else
> needs.

Well, \dconfig needs it because it would like foo.bar to get processed
as just a name.  But I think it's a bug fix because as things stood,
if the caller doesn't provide a schemavar and the pattern contains a
dot, the code just silently throws away the dot and all to the left.
That doesn't seem very sane, even if it is a longstanding behavior.

regards, tom lane




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 09:57:09 -0700, Andres Freund wrote:
>> Yea :(. I tested debug_discard_caches, but not -DRELCACHE_FORCE_RELEASE
>> -DCATCACHE_FORCE_RELEASE.
>> 
>> Not quite sure what to do about it - it's intentionally trying to test the
>> case of no invalidations being processed, as that's an annoying edge case 
>> with
>> functions.  Perhaps wrapping the function call of the "already dropped"
>> function in another function that catches the error would do the trick? It'd
>> be more easily silently broken, but still be better than not having the test.

> Anybody got a better idea?

Maybe if the wrapper function checks for exactly the two expected
behaviors, it'd be robust enough?

regards, tom lane




Re: trigger example for plsample

2022-04-07 Thread Tom Lane
Chapman Flack  writes:
> v4 looks good to me.

Pushed with very minor editorialization.  Mainly, I undid the
decision to stop printing the function source text, on the
grounds that (1) it falsified the comment immediately above,
and (2) if you have to print it anyway to avoid compiler warnings,
you're just creating confusing inconsistency between the two
handler functions.

regards, tom lane




Re: remove more archiving overhead

2022-04-07 Thread Nathan Bossart
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> Yes.  I found that a crash at an unfortunate moment can produce multiple
> links to the same file in pg_wal, which seemed bad independent of archival.
> By fixing that (i.e., switching from durable_rename_excl() to
> durable_rename()), we not only avoid this problem, but we also avoid trying
> to archive a file the server is concurrently writing.  Then, after a crash,
> the WAL file to archive should either not exist (which is handled by the
> archiver) or contain the same contents as any preexisting archives.

I moved the fix for this to a new thread [0] since I think it should be
back-patched.  I've attached a new patch that only contains the part
related to reducing archiving overhead.

[0] https://www.postgresql.org/message-id/20220407182954.GA1231544%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 6d8972412179fd3c82bb7e471966d4b7862bcbff Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 7 Apr 2022 14:11:59 -0700
Subject: [PATCH v3 1/1] Reduce overhead of renaming archive status files.

Presently, archive status files are durably renamed from .ready to
.done to indicate that a file has been archived.  Persisting this
rename to disk accounts for a significant amount of the overhead
associated with archiving.  While durably renaming the file
prevents re-archiving in most cases, archive commands and libraries
must already gracefully handle attempts to re-archive the last
archived file after a crash (e.g., a crash immediately after
archive_command exits but before the server renames the status
file).

This change reduces the amount of overhead associated with
archiving by using rename() instead of durable_rename() to rename
the archive status files.  As a consequence, the server is more
likely to attempt to re-archive files after a crash, but as noted
above, archive commands and modules are already expected to handle
this.  It is also possible that the server will attempt to re-
archive files that have been removed or recycled, but the archiver
already handles this, too.

Author: Nathan Bossart
---
 src/backend/postmaster/pgarch.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 0c8ca29f73..17a4151c01 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -746,7 +746,19 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	(void) durable_rename(rlogready, rlogdone, WARNING);
+
+	/*
+	 * To avoid extra overhead, we don't durably rename the .ready file to
+	 * .done.  Archive commands and libraries must gracefully handle attempts
+	 * to re-archive files (e.g., if the server crashes just before this
+	 * function is called), so it should be okay if the .ready file reappears
+	 * after a crash.
+	 */
+	if (rename(rlogready, rlogdone) < 0)
+		ereport(WARNING,
+(errcode_for_file_access(),
+ errmsg("could not rename file \"%s\" to \"%s\": %m",
+		rlogready, rlogdone)));
 }
 
 
-- 
2.25.1



Re: pg14 psql broke \d datname.nspname.relname

2022-04-07 Thread Robert Haas
On Wed, Apr 6, 2022 at 12:07 PM Mark Dilger wrote:
> I was able to clean up the "if (left && want_literal_dbname)" stuff, though.

I still have a vague feeling that there's probably some way of doing
this better, but had more or less resolved to commit this patch as is
anyway and had that all queued up. But then I had to go to a meeting
and when I came out I discovered that Tom had done this:

--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -918,8 +918,12 @@ processSQLNamePattern(PGconn *conn, PQExpBuffer buf, const char *pattern,
  * Convert shell-style 'pattern' into the regular expression(s) we want to
  * execute.  Quoting/escaping into SQL literal format will be done below
  * using appendStringLiteralConn().
+ *
+ * If the caller provided a schemavar, we want to split the pattern on
+ * ".", otherwise not.
  */
-patternToSQLRegex(PQclientEncoding(conn), NULL, &schemabuf, &namebuf,
+patternToSQLRegex(PQclientEncoding(conn), NULL,
+                  (schemavar ? &schemabuf : NULL), &namebuf,
   pattern, force_escape);

 /*

I don't know whether that's a bug fix for the existing code or some
new bit of functionality that \dconfig requires and nothing else
needs. A related point that I had noticed during review is that these
existing tests look pretty bogus:

if (namebuf.len > 2)

if (schemabuf.len > 2)

In the v13 code, these tests occur at a point where we've definitely
added ^( to the buffer, but possibly nothing else. But starting in v14
that's no longer the case. So probably this test should be changed
somehow. The proposed patch changes these to something like this:

+   if (schemavar && schemabuf.len > 2)

But that doesn't really seem like it's fixing the problem I'm talking about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Window Function "Run Conditions"

2022-04-07 Thread Zhihong Yu
On Thu, Apr 7, 2022 at 7:11 AM David Rowley  wrote:

> On Thu, 7 Apr 2022 at 19:01, David Rowley  wrote:
> >
> > On Thu, 7 Apr 2022 at 15:41, Zhihong Yu  wrote:
> > > +                * We must keep the original qual in place if there is a
> > > +                * PARTITION BY clause as the top-level WindowAgg remains in
> > > +                * pass-through mode and does nothing to filter out unwanted
> > > +                * tuples.
> > > +                */
> > > +               *keep_original = false;
> > >
> > > The comment talks about keeping original qual but the assignment uses
> the value false.
> > > Maybe the comment can be rephrased so that it matches the assignment.
> >
> > Thanks. I've just removed that comment locally now. You're right, it
> > was out of date.
>
> I've attached the updated patch with the fixed comment and a few other
> comments reworded slightly.
>
> I've also pgindented the patch.
>
> Barring any objection, I'm planning to push this one in around 10 hours
> time.
>
> David
>
Hi,

+   WINDOWAGG_PASSTHROUGH_STRICT    /* Pass-through plus don't store new
+* tuples during spool */

I think the comment in code is illustrative:

+                * STRICT pass-through mode is required for the top window
+                * when there is a PARTITION BY clause.  Otherwise we must
+                * ensure we store tuples that don't match the
+                * runcondition so they're available to WindowAggs above.

If you think the above is too long for the place where
WINDOWAGG_PASSTHROUGH_STRICT is defined, maybe point to the longer
version so that people can find it more easily.

Cheers


Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 09:57:09 -0700, Andres Freund wrote:
> Yea :(. I tested debug_discard_caches, but not -DRELCACHE_FORCE_RELEASE
> -DCATCACHE_FORCE_RELEASE.
> 
> Not quite sure what to do about it - it's intentionally trying to test the
> case of no invalidations being processed, as that's an annoying edge case with
> functions.  Perhaps wrapping the function call of the "already dropped"
> function in another function that catches the error would do the trick? It'd
> be more easily silently broken, but still be better than not having the test.

Anybody got a better idea?

- Andres




Re: Can we automatically add elapsed times to tap test log?

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
> >> I too think that the elapsed time is useful.  I'm less convinced
> >> that the time-of-day marker is useful.
> 
> > I think it'd be quite useful if it had more precision - it's a pita to
> > correlate regress_log_* output with server logs.
> 
> Fair point.  Maybe we could keep the timestamp (with ms precision
> if possible) and then the parenthetical bit is time-since-last-line
> (also with ms precision)?  I think that would more or less satisfy
> both uses.

Would work for me...

Greetings,

Andres Freund




Re: Last day of commitfest

2022-04-07 Thread Greg Stark
On Wed, 6 Apr 2022 at 21:59, Julien Rouhaud  wrote:
>
>
> FWIW I think that this 5 days threshold before closing a patch with RwF is way
> too short.  As far as I know we usually use something like 2/3 weeks.

Yeah, I haven't been enforcing a timeout like that during the
commitfest. But now that we're at the end...

-- 
greg




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Matthias van de Meent
On Thu, 7 Apr 2022 at 21:11, Robert Haas  wrote:
>
> On Thu, Apr 7, 2022 at 2:43 PM Peter Geoghegan  wrote:
> > But if we were in a green-field situation we'd probably not want to
> > use up several bytes for a nonce anyway. You said so yourself.
>
> I don't know what statement of mine you're talking about here, and
> while I don't love using up space for a nonce, it seems to be the way
> this encryption stuff works. I don't see that there's a reasonable
> alternative, green field or no.
>
> > > I do understand that there are significant challenges and performance
> > > concerns around having these kinds of initdb-controlled page layout
> > > changes, so the future of that patch is unclear.
> >
> > Why does it need to be at initdb time?
> >
> > Though I cannot prove it, I suspect that the original intent of the
> > special area was to support an additional (though typically small)
> > variable length array, that works a little like the current line
> > pointer array. This array would have to grow backwards (newer items
> > get appended at earlier physical offsets), unlike our line pointer
> > array (which gets appended to at the end, in the simple and obvious
> > way). Growing backwards like this happens with DB systems, that store
> > their line pointer array at the end of the page(the traditional
> > approach from the System R days, I believe).
> >
> > Supporting a variable-length special area array like this would mean
> > that any time you add a new item to the variable-sized array in the
> > special area, the page's entire tuple space has to be memmove()'d
> > backwards by a couple of bytes to create the required space. And so
> > the relevant bufpage.c routine would have to adjust the whole line
> > pointer array such that each lp_off received a compensating
> > adjustment. The array might only be for some kind of page-level
> > transaction metadata, something like that -- shifting it around is
> > pretty expensive (reusing existing slots isn't too expensive, though).
> >
> > Why can't it work like that? You don't really need to build the full
> > set of bufpage.c facilities (though it might not be a bad idea to
> > fully support these variable-length arrays, which seem like they might
> > come in handy). That seems perfectly compatible with what Matthias
> > wants to do, provided we're willing to deem the special area struct
> > (e.g. BTOpaque) as always coming "first" (which is essentially the
> > same as his current proposal anyway). You can even do the same thing
> > yourself for the nonce (use a fixed, known offset), with relatively
> > modest effort. You'd need to have AM-specific knowledge (it would
> > stack right on top of Matthias's technique), but that doesn't seem all
> > that hard. There are plenty of remaining status bits in BTOpaque, and
> > probably all other index AM special areas.
>
> I'm not really following any of this. You seem to be arguing about
> whether it's possible to change the length of the special space
> *later* than initdb time. I agree that might have some use for some
> purpose, but for encryption it's not necessarily all that helpful
> because you have to be able to find the nonce on the page before
> you've decrypted it. If you don't know whether there's a nonce or
> where it's located, you can't do that. What Matthias and I were
> discussing is whether you have to make a decision about appending
> stuff to the special space *earlier* than initdb-time i.e. at compile
> time.
>
> My position is that if we need some space in every page to put a
> nonce, the best place to put it is at the very end of the page, within
> the special space and after anything else that is stored in the
> special space. Code that only manipulates the line pointer array and
> tuple data won't care, because pd_special will just be a bit smaller
> than it would otherwise have been, and none of that code looks at any
> byte offset >= pd_special. Code that looks at the special space won't
> care either, as long as it uses PageGetSpecialPointer to find the
> data, and doesn't examine how large the special space actually is.
> That corresponds pretty well to how existing users of the special
> space work, so it seems pretty good.

Except that reserving space on each page requires recalculation of all
variables that depend on the amount of potential free space available
on a page (for some cases this is less important, for some it is
critical that the value is not wrong). If this is always done at
runtime then that can cause significant overhead.

> If we *didn't* put the nonce at the end of the page, where else would
> we put it? It has to be at a fixed offset, because otherwise you can't
> find it without decrypting the page first, which would be circular.

I think there's no specifically good reason why we'd need to put the
nonce in storage at the same place as where we reserve the space for
the nonce in the unencrypted in-memory format.

> You could put it at the beginning of the page, or 

Re: Can we automatically add elapsed times to tap test log?

2022-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
>> I too think that the elapsed time is useful.  I'm less convinced
>> that the time-of-day marker is useful.

> I think it'd be quite useful if it had more precision - it's a pita to
> correlate regress_log_* output with server logs.

Fair point.  Maybe we could keep the timestamp (with ms precision
if possible) and then the parenthetical bit is time-since-last-line
(also with ms precision)?  I think that would more or less satisfy
both uses.

regards, tom lane




Re: Can we automatically add elapsed times to tap test log?

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On 4/2/22 06:57, Andrew Dunstan wrote:
> >> Here's a version that actually works. It produces traces that look like
> >> this:
> >> andrew@emma:pg_upgrade $ grep '([0-9]*s)'
> >> tmp_check/log/regress_log_002_pg_upgrade
> >> [21:55:06](63s) ok 1 - dump before running pg_upgrade
> >> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance
> >> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
> >> [21:55:27](84s) 1..3
> 
> > I know there's a lot going on, but are people interested in this? It's a
> > pretty small patch to produce something that seems quite useful.

It's been 0 days since I last wanted this.


> I too think that the elapsed time is useful.  I'm less convinced
> that the time-of-day marker is useful.

I think it'd be quite useful if it had more precision - it's a pita to
correlate regress_log_* output with server logs.


> It also seems kind of odd that the elapsed time accumulates rather
> than being reset for each line.  As it stands one would be doing a lot
> of mental subtractions rather than being able to see directly how long
> each step takes.  I suppose that on fast machines where each step is
> under one second, accumulation would be more useful than printing a
> lot of zeroes --- but on the other hand, those aren't the cases where
> you're going to be terribly concerned about the runtime.

I like both - if you want to find where the slowdown among a lot of log lines
is, it's easier to look at the time accumulated elapsed time. If you actually
want to see how long individual things take, non-accumulated is more useful.

I've printed both in the past...

Any chance we could print higher res time?

Greetings,

Andres Freund




Re: trigger example for plsample

2022-04-07 Thread Chapman Flack
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

v4 looks good to me.

I don't think this requires any documentation change.
The patch simply adds trigger handling example code to plsample.c,
and plsample is already mentioned in the documentation on writing
a PL handler.

Regards,
-Chap

The new status of this patch is: Ready for Committer


Re: Can we automatically add elapsed times to tap test log?

2022-04-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 4/2/22 06:57, Andrew Dunstan wrote:
>> Here's a version that actually works. It produces traces that look like
>> this:
>> andrew@emma:pg_upgrade $ grep '([0-9]*s)'
>> tmp_check/log/regress_log_002_pg_upgrade
>> [21:55:06](63s) ok 1 - dump before running pg_upgrade
>> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance
>> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
>> [21:55:27](84s) 1..3

> I know there's a lot going on, but are people interested in this? It's a
> pretty small patch to produce something that seems quite useful.

I too think that the elapsed time is useful.  I'm less convinced
that the time-of-day marker is useful.

It also seems kind of odd that the elapsed time accumulates rather
than being reset for each line.  As it stands one would be doing a lot
of mental subtractions rather than being able to see directly how long
each step takes.  I suppose that on fast machines where each step is
under one second, accumulation would be more useful than printing a
lot of zeroes --- but on the other hand, those aren't the cases where
you're going to be terribly concerned about the runtime.

regards, tom lane




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread chap

On 2022-04-07 15:04, Andres Freund wrote:

And done. Chap, could you confirm this fixes the issue for you?


Looks good from here. One installcheck-world with no failures;
previously, it failed for me every time.

Regards,
-Chap




Re: Can we automatically add elapsed times to tap test log?

2022-04-07 Thread Andrew Dunstan


On 4/2/22 06:57, Andrew Dunstan wrote:
> Here's a version that actually works. It produces traces that look like
> this:
>
>
> andrew@emma:pg_upgrade $ grep '([0-9]*s)'
> tmp_check/log/regress_log_002_pg_upgrade
> [21:55:06](63s) ok 1 - dump before running pg_upgrade
> [21:55:22](79s) ok 2 - run of pg_upgrade for new instance
> [21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
> [21:55:27](84s) 1..3
>

I know there's a lot going on, but are people interested in this? It's a
pretty small patch to produce something that seems quite useful.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Mingw task for Cirrus CI

2022-04-07 Thread Andrew Dunstan


On 4/7/22 16:48, Andrew Dunstan wrote:
> On 4/7/22 13:10, Andres Freund wrote:
>> Hi,
>>
>> On 2022-04-06 11:03:37 -0400, Andrew Dunstan wrote:
>>> On 3/30/22 20:26, Andres Freund wrote:
 Could you try using dash to invoke configure here, and whether it makes 
 configure faster?
>>> I got weird failures re libxml/parser.h when I tried with dash. See
>>>  (It would be nice if we
>>> could see config.log on failure.)
>> Since dash won't help us to get the build time down sufficiently, and the
>> tests don't pass without a separate build tree, I looked at what makes
>> config/prep_buildtree so slow.
>>
>> It's largely just bad code. The slowest part are spawning one expr and mkdir
>> -p for each directory. One 'cmp' for each makefile doesn't help either.
>>
>> The expr can be replaced with
>>   subdir=${item#$sourcetree}
>> that's afaics posix syntax ([1]), not bash.
>>
>> Spawning one mkdir for each directory can be replaced by a single mkdir
>> invocation with all the directories. On my linux workstation that gets the
>> time for the first loop down from 1005ms to 38ms, really.
>>
>> That has the danger of the commandline getting too long. But since we rely on
>> the final link of the backend to be done in a single command, I don't think
>> it's making things worse? We could try to use xargs otherwise, iirc that's in
>> posix as well.
>>
>> Using parameter substitution in the second loop takes it down from 775ms to
>> 533ms. Not calling cmp when the file doesn't exist cuts it down to 337ms.
>>
>> I don't know of a way to batch the call to ln. The time with ln replaced with
>> : is 151ms, fwiw.
>
> AFAIK Msys2's 'ln -s' by default copies a non-directory rather than
> actually symlinking it. If we want real symlinks, then we need
> MSYS=|winsymlinks:nativestrict set. This will fail unless the calling
> user is an Administrator or has the SeCreateSymbolicLink privilege. See
> |


Sometimes I hate Thunderbird. Of course the | is spurious above, we
would need

MSYS=winsymlinks:nativestrict

set.


cheers

andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Mingw task for Cirrus CI

2022-04-07 Thread Andrew Dunstan


On 4/7/22 13:10, Andres Freund wrote:
> Hi,
>
> On 2022-04-06 11:03:37 -0400, Andrew Dunstan wrote:
>> On 3/30/22 20:26, Andres Freund wrote:
>>> Could you try using dash to invoke configure here, and whether it makes 
>>> configure faster?
>> I got weird failures re libxml/parser.h when I tried with dash. See
>>  (It would be nice if we
>> could see config.log on failure.)
> Since dash won't help us to get the build time down sufficiently, and the
> tests don't pass without a separate build tree, I looked at what makes
> config/prep_buildtree so slow.
>
> It's largely just bad code. The slowest part are spawning one expr and mkdir
> -p for each directory. One 'cmp' for each makefile doesn't help either.
>
> The expr can be replaced with
>   subdir=${item#$sourcetree}
> that's afaics posix syntax ([1]), not bash.
>
> Spawning one mkdir for each directory can be replaced by a single mkdir
> invocation with all the directories. On my linux workstation that gets the
> time for the first loop down from 1005ms to 38ms, really.
>
> That has the danger of the commandline getting too long. But since we rely on
> the final link of the backend to be done in a single command, I don't think
> it's making things worse? We could try to use xargs otherwise, iirc that's in
> posix as well.
>
> Using parameter substitution in the second loop takes it down from 775ms to
> 533ms. Not calling cmp when the file doesn't exist cuts it down to 337ms.
>
> I don't know of a way to batch the call to ln. The time with ln replaced with
> : is 151ms, fwiw.


AFAIK Msys2's 'ln -s' by default copies a non-directory rather than
actually symlinking it. If we want real symlinks, then we need
MSYS=|winsymlinks:nativestrict set. This will fail unless the calling
user is an Administrator or has the SeCreateSymbolicLink privilege. See
|

|
for more details.


> On windows that makes prep_buildtree go from 42.4s to 5.8s for me.


That's pretty good.


I think we can get rid of the CVS pruning, it's only 15 years or so
since we've had that in the tree.


+    if test ! -d "$buildtree/$subdir"; then
+    echo "$buildtree/$subdir"
+    fi


I would probably just write that as


test -d "$buildtree/$subdir" || echo "$buildtree/$subdir"


although it's really just a matter of taste.
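The two speedups Andres describes could be sketched roughly like this (POSIX sh; the directory names are made up purely for demonstration, this is not the actual prep_buildtree patch):

```shell
#!/bin/sh
# Sketch: POSIX parameter substitution instead of spawning one expr per
# directory, and a single batched mkdir instead of one mkdir per directory.
sourcetree=/tmp/prep_demo_src
buildtree=/tmp/prep_demo_build
mkdir -p "$sourcetree/src/backend/nodes" "$sourcetree/src/bin/psql"

missing=
for item in $(find "$sourcetree" -type d); do
    subdir=${item#"$sourcetree"}                 # no expr subprocess
    test -d "$buildtree$subdir" || missing="$missing $buildtree$subdir"
done
# one mkdir -p invocation for all missing directories
test -z "$missing" || mkdir -p $missing          # intentionally unquoted list
```

The unquoted $missing relies on word splitting to pass all paths to one mkdir; as noted upthread, a very long list could instead be fed through xargs.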


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





GSOC proposal for Improve pgarchives by Yedil

2022-04-07 Thread Yedil Serzhan
Dear Sir or Madam,

My name is Yedil Serzhan, a Computer Science MSc student at the University
of Freiburg. I have had several experiences with PostgreSQL for full-stack
development. Among all the databases I have used, PostgreSQL gives me the
most comfortable user experience. Recently I'm trying to contribute to open
source projects, and I was very happy to find PostgreSQL in GSOC this time.

In the list of ideas from your PostgreSQL organization, I think there are
several ideas I like, such as "Improve pgarchives", “Database Load Stress
Benchmark", and "New and improved website for pgjdbc"  and so on.

I used these days to get a preliminary understanding of the pgarchives
codebase, deployed the pgarchives project, and wrote a preliminary project
proposal based on my understanding. Please take a look, the link to Google
docs of the proposal is here,
and I've opened up the comments feature for those of you who want to leave
comments.

I'd appreciate it if you could give me some guidance. Thank you in advance!
Looking forward to your reply!

Best regards,
Yedil


Re: How about a psql backslash command to show GUCs?

2022-04-07 Thread Tom Lane
Pavel Stehule  writes:
> On Thu, Apr 7, 2022 at 7:04 PM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>> \dconfig[+] gets my vote.  I was going to say "conf" just isn't common
>> jargon to say or write; but the one place it is - file extensions - is
>> relevant and common.  But still, I would go with the non-jargon form.

> dconfig is better, because google can be confused - dconf is known project
> https://en.wikipedia.org/wiki/Dconf

Looks like the consensus has shifted to \dconfig.  I'll do it like that.

regards, tom lane




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2022 at 1:09 PM Stephen Frost  wrote:
> > > I got that much, of course. That will work, I suppose, but it'll be
> > > the first and last time that anybody gets to do that (unless we accept
> > > it being incompatible with encryption).
> >
> > Yeah.
>
> I don't know that I agree with this being the 'first and last time'..?
> If we have two options that could work together and each need some
> amount of per-page space, such as a nonce or authentication tag, we'd
> just need to be able to track which of those are enabled (eg: through
> pg_control) and then know which gets what space.

Sounds very messy.

> I don't see why we
> couldn't add something today and then add something else later on.

That's what I'm arguing in favor of, in part.

> I'm also doubtful about how well this would work, but the other question
> is- what would be the advantage to doing it this way?  If we could have
> it be run-time instead of initdb-time, that'd be great (imagine a
> database that's encrypted while another isn't in the same cluster, or
> even individual tables, which would all be very cool), but I don't think
> this approach would make that possible..?

That would be the main advantage, yes. But I also tend to doubt that
we should make it completely impossible to know anything at all about
the page without fully decrypting it.

It was just a suggestion. I will leave it at that.

-- 
Peter Geoghegan




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Stephen Frost
Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> On Thu, Apr 7, 2022 at 12:37 PM Robert Haas  wrote:
> > On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan  wrote:
> > > I just meant that it wouldn't be reasonable to impose a fixed cost on
> > > every user, even those not using the feature. Which you said yourself.
> >
> > Unfortunately, I think there's bound to be some cost. We can avoid
> > using the space in the page for every user, but anything that makes
> > the page layout variable is going to cost some number of CPU cycles
> > someplace. We have to measure that overhead and see if it's small
> > enough that we're OK with it.
> 
> I wouldn't go so far as to say that any non-zero overhead is not okay
> (that sounds really extreme). I would only say this much: wasting
> non-trivial amounts of space on every page must not happen. If the
> user opts-in to the feature then it's not just waste, so that's
> perfectly okay. (I imagine that we agree on this much already.)

Right, the *space* wouldn't ever be wasted- either the user opts-in and
the space is used for what the user asked it to be used for, or the
user doesn't select that option and we don't reserve that space.
Robert's point was that by allowing this to happen at initdb time, there
may be some CPU cycles that end up getting spent, even if the user
didn't opt-in, that would be saved if this decision was made at compile
time.  For my 2c, I'd think that would be our main concern with this,
but I also am hopeful and strongly suspect that the extra CPU cycles
aren't likely to be much and therefore would be acceptable.

A thought that's further down the line would be the question of if we
could get those CPU cycles back (and perhaps more..) by using JIT.

> > > Immediately before the special area proper (say BTOpaque), which would
> > > "logically come after" the special space under this scheme. You
> > > wouldn't have a simple constant offset into the page, but you'd have
> > > something not too far removed from such a constant. It could work as a
> > > constant with minimal context (just the AM type). Just like with
> > > Matthias' patch.
> >
> > I don't think this would work, because I don't think it would be
> > practical to always know the AM type. Think about applying an XLOG_FPI
> > record, for example.
> 
> There are already some pretty shaky heuristics that are used by tools
> like pg_filedump for this exact purpose. But they work! And they're
> even supported by Postgres, quasi-officially -- grep for "pg_filedump"
> to see what I mean.

Shaky heuristics feels like something appropriate for pg_filedump.. less
so for things like TDE.

> There are numerous reasons why we might want to put that on a formal
> footing, so that we can reliably detect the AM type starting from only
> a page image. I suspect that you're going to end up needing to account
> for this index AMs anyway, so going this way isn't necessarily making
> your life all that much harder. (I am not really sure about that,
> though, since I don't have enough background information.)

This seems like it'd be pretty difficult given that AMs can be loaded as
extensions..?  Also seems like a much bigger change and I'm still not
sure what the advantage is, at least in terms of what this particular
patch is doing.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan  wrote:
> > I just meant that it wouldn't be reasonable to impose a fixed cost on
> > every user, even those not using the feature. Which you said yourself.
> 
> Unfortunately, I think there's bound to be some cost. We can avoid
> using the space in the page for every user, but anything that makes
> the page layout variable is going to cost some number of CPU cycles
> someplace. We have to measure that overhead and see if it's small
> enough that we're OK with it.

Agreed that there may be some cost in CPU cycles for folks who don't
initialize the database with an option that needs this, but hopefully
that'll be quite small.

> > I got that much, of course. That will work, I suppose, but it'll be
> > the first and last time that anybody gets to do that (unless we accept
> > it being incompatible with encryption).
> 
> Yeah.

I don't know that I agree with this being the 'first and last time'..?
If we have two options that could work together and each need some
amount of per-page space, such as a nonce or authentication tag, we'd
just need to be able to track which of those are enabled (eg: through
pg_control) and then know which gets what space.  I don't see why we
couldn't add something today and then add something else later on.  I do
think there'll likely be cases where we have two options that don't make
sense together and we wouldn't wish to allow that (such as page-level
strong checksums and authenticated encryption, as the latter provides
the former inherently) but there could certainly be things that do work
together too.

> > > If we *didn't* put the nonce at the end of the page, where else would
> > > we put it? It has to be at a fixed offset, because otherwise you can't
> > > find it without decrypting the page first, which would be circular.
> >
> > Immediately before the special area proper (say BTOpaque), which would
> > "logically come after" the special space under this scheme. You
> > wouldn't have a simple constant offset into the page, but you'd have
> > something not too far removed from such a constant. It could work as a
> > constant with minimal context (just the AM type). Just like with
> > Matthias' patch.
> 
> I don't think this would work, because I don't think it would be
> practical to always know the AM type. Think about applying an XLOG_FPI
> record, for example.

I'm also doubtful about how well this would work, but the other question
is- what would be the advantage to doing it this way?  If we could have
it be run-time instead of initdb-time, that'd be great (imagine a
database that's encrypted while another isn't in the same cluster, or
even individual tables, which would all be very cool), but I don't think
this approach would make that possible..?

(sorry for only just seeing this thread getting traction, have been a
bit busy with other things ...)

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2022 at 12:37 PM Robert Haas  wrote:
> On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan  wrote:
> > I just meant that it wouldn't be reasonable to impose a fixed cost on
> > every user, even those not using the feature. Which you said yourself.
>
> Unfortunately, I think there's bound to be some cost. We can avoid
> using the space in the page for every user, but anything that makes
> the page layout variable is going to cost some number of CPU cycles
> someplace. We have to measure that overhead and see if it's small
> enough that we're OK with it.

I wouldn't go so far as to say that any non-zero overhead is not okay
(that sounds really extreme). I would only say this much: wasting
non-trivial amounts of space on every page must not happen. If the
user opts-in to the feature then it's not just waste, so that's
perfectly okay. (I imagine that we agree on this much already.)

> > Immediately before the special area proper (say BTOpaque), which would
> > "logically come after" the special space under this scheme. You
> > wouldn't have a simple constant offset into the page, but you'd have
> > something not too far removed from such a constant. It could work as a
> > constant with minimal context (just the AM type). Just like with
> > Matthias' patch.
>
> I don't think this would work, because I don't think it would be
> practical to always know the AM type. Think about applying an XLOG_FPI
> record, for example.

There are already some pretty shaky heuristics that are used by tools
like pg_filedump for this exact purpose. But they work! And they're
even supported by Postgres, quasi-officially -- grep for "pg_filedump"
to see what I mean.

There are numerous reasons why we might want to put that on a formal
footing, so that we can reliably detect the AM type starting from only
a page image. I suspect that you're going to end up needing to account
for this index AMs anyway, so going this way isn't necessarily making
your life all that much harder. (I am not really sure about that,
though, since I don't have enough background information.)


--
Peter Geoghegan




Re: Mingw task for Cirrus CI

2022-04-07 Thread Andrew Dunstan


On 4/6/22 12:34, Andres Freund wrote:
> Hi,
>
> On 2022-04-06 11:03:37 -0400, Andrew Dunstan wrote:
>> On 3/30/22 20:26, Andres Freund wrote:
>>> Could you try using dash to invoke configure here, and whether it makes 
>>> configure faster?
>> I got weird failures re libxml/parser.h when I tried with dash.
> Hm. Hadn't enabled that when I tried...
>
>
>> See  (It would be nice if we
>> could see config.log on failure.)
> All *.log files are preserved on failure. There's a file directory brower at
> the top to navigate around:
> https://api.cirrus-ci.com/v1/artifact/task/5963254039052288/log/build/config.log



Thanks.


I got it working with this added to the config settings:


--with-includes='/ucrt64/include/libxml2 /c/cirrus/src/include/port/win32'


I conclude tentatively that while bash translates Windows paths to msys
paths, dash does not.


see https://cirrus-ci.com/task/5968968560148480?logs=configure#L1


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Robert Haas
On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan  wrote:
> I just meant that it wouldn't be reasonable to impose a fixed cost on
> every user, even those not using the feature. Which you said yourself.

Unfortunately, I think there's bound to be some cost. We can avoid
using the space in the page for every user, but anything that makes
the page layout variable is going to cost some number of CPU cycles
someplace. We have to measure that overhead and see if it's small
enough that we're OK with it.

> I got that much, of course. That will work, I suppose, but it'll be
> the first and last time that anybody gets to do that (unless we accept
> it being incompatible with encryption).

Yeah.

> > If we *didn't* put the nonce at the end of the page, where else would
> > we put it? It has to be at a fixed offset, because otherwise you can't
> > find it without decrypting the page first, which would be circular.
>
> Immediately before the special area proper (say BTOpaque), which would
> "logically come after" the special space under this scheme. You
> wouldn't have a simple constant offset into the page, but you'd have
> something not too far removed from such a constant. It could work as a
> constant with minimal context (just the AM type). Just like with
> Matthias' patch.

I don't think this would work, because I don't think it would be
practical to always know the AM type. Think about applying an XLOG_FPI
record, for example.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2022 at 12:11 PM Robert Haas  wrote:
> I don't know what statement of mine you're talking about here, and
> while I don't love using up space for a nonce, it seems to be the way
> this encryption stuff works. I don't see that there's a reasonable
> alternative, green field or no.

I just meant that it wouldn't be reasonable to impose a fixed cost on
every user, even those not using the feature. Which you said yourself.

> I'm not really following any of this. You seem to be arguing about
> whether it's possible to change the length of the special space
> *later* than initdb time. I agree that might have some use for some
> purpose, but for encryption it's not necessarily all that helpful
> because you have to be able to find the nonce on the page before
> you've decrypted it. If you don't know whether there's a nonce or
> where it's located, you can't do that.

What if you just encrypt a significant subset of the page, and leave a
small amount of metadata that can be read without decryption? Is that
approach feasible?

I think that you could even encrypt the line pointer array (not just
the tuple space), without breaking anything.

> What Matthias and I were
> discussing is whether you have to make a decision about appending
> stuff to the special space *earlier* than initdb-time i.e. at compile
> time.

I got that much, of course. That will work, I suppose, but it'll be
the first and last time that anybody gets to do that (unless we accept
it being incompatible with encryption).

> That corresponds pretty well to how existing users of the special
> space work, so it seems pretty good.

I think that it still needs to be accounted for in a bunch of places.
In particular anywhere concerned with page-level space management. For
example the definition of "1/3 of a page" used to enforce the limit on
the maximum size of an nbtree index tuple will need to account for the
presence of a nonce, either way.

> If we *didn't* put the nonce at the end of the page, where else would
> we put it? It has to be at a fixed offset, because otherwise you can't
> find it without decrypting the page first, which would be circular.

Immediately before the special area proper (say BTOpaque), which would
"logically come after" the special space under this scheme. You
wouldn't have a simple constant offset into the page, but you'd have
something not too far removed from such a constant. It could work as a
constant with minimal context (just the AM type). Just like with
Matthias' patch.

-- 
Peter Geoghegan




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Robert Haas
On Thu, Apr 7, 2022 at 2:43 PM Peter Geoghegan  wrote:
> But if we were in a green-field situation we'd probably not want to
> use up several bytes for a nonce anyway. You said so yourself.

I don't know what statement of mine you're talking about here, and
while I don't love using up space for a nonce, it seems to be the way
this encryption stuff works. I don't see that there's a reasonable
alternative, green field or no.

> > I do understand that there are significant challenges and performance
> > concerns around having these kinds of initdb-controlled page layout
> > changes, so the future of that patch is unclear.
>
> Why does it need to be at initdb time?
>
> Though I cannot prove it, I suspect that the original intent of the
> special area was to support an additional (though typically small)
> variable length array, that works a little like the current line
> pointer array. This array would have to grow backwards (newer items
> get appended at earlier physical offsets), unlike our line pointer
> array (which gets appended to at the end, in the simple and obvious
> way). Growing backwards like this happens with DB systems that store
> their line pointer array at the end of the page (the traditional
> approach from the System R days, I believe).
>
> Supporting a variable-length special area array like this would mean
> that any time you add a new item to the variable-sized array in the
> special area, the page's entire tuple space has to be memmove()'d
> backwards by a couple of bytes to create the required space. And so
> the relevant bufpage.c routine would have to adjust the whole line
> pointer array such that each lp_off received a compensating
> adjustment. The array might only be for some kind of page-level
> transaction metadata, something like that -- shifting it around is
> pretty expensive (reusing existing slots isn't too expensive, though).
>
> Why can't it work like that? You don't really need to build the full
> set of bufpage.c facilities (though it might not be a bad idea to
> fully support these variable-length arrays, which seem like they might
> come in handy). That seems perfectly compatible with what Matthias
> wants to do, provided we're willing to deem the special area struct
> (e.g. BTOpaque) as always coming "first" (which is essentially the
> same as his current proposal anyway). You can even do the same thing
yourself for the nonce (use a fixed, known offset), with relatively
> modest effort. You'd need to have AM-specific knowledge (it would
> stack right on top of Matthias's technique), but that doesn't seem all
> that hard. There are plenty of remaining status bits in BTOpaque, and
> probably all other index AM special areas.

I'm not really following any of this. You seem to be arguing about
whether it's possible to change the length of the special space
*later* than initdb time. I agree that might have some use for some
purpose, but for encryption it's not necessarily all that helpful
because you have to be able to find the nonce on the page before
you've decrypted it. If you don't know whether there's a nonce or
where it's located, you can't do that. What Matthias and I were
discussing is whether you have to make a decision about appending
stuff to the special space *earlier* than initdb-time i.e. at compile
time.

My position is that if we need some space in every page to put a
nonce, the best place to put it is at the very end of the page, within
the special space and after anything else that is stored in the
special space. Code that only manipulates the line pointer array and
tuple data won't care, because pd_special will just be a bit smaller
than it would otherwise have been, and none of that code looks at any
byte offset >= pd_special. Code that looks at the special space won't
care either, as long as it uses PageGetSpecialPointer to find the
data, and doesn't examine how large the special space actually is.
That corresponds pretty well to how existing users of the special
space work, so it seems pretty good.

If we *didn't* put the nonce at the end of the page, where else would
we put it? It has to be at a fixed offset, because otherwise you can't
find it without decrypting the page first, which would be circular.
You could put it at the beginning of the page, or after the page
header and before the line pointer array, but either of those things
seem likely to affect a lot more code, because there's a lot more
stuff that accesses the line pointer array than the special space.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 11:54:08 -0700, Andres Freund wrote:
> I'll change it to use distinct payloads..

And done. Chap, could you confirm this fixes the issue for you?

Greetings,

Andres Freund




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 11:02:41 -0700, Andres Freund wrote:
> I've now reproduced this, albeit not reliably yet. Looking.

Caused by me misremembering when deduplication happens - somehow recalled that
deduplication didn't happen when payloads are used. So the statement that was supposed
to guarantee needing more than one page:
  SELECT pg_notify('stats_test_use', repeat('0', 
current_setting('block_size')::int / 2)) FROM generate_series(1, 3);

didn't actually guarantee that. It just failed to fail by chance.

When regression tests and isolation test run in sequence against the same
freshly started cluster, the offsets when starting out are just right to not
need another page in the first test.

I'll change it to use distinct payloads..

Greetings,

Andres Freund




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

2022-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2022 at 7:01 AM Robert Haas  wrote:
> Because there's no place to put them in the existing page format. We
> jammed checksums into the 2-byte field that had previously been set
> aside for the TLI, but that wasn't really an ideal solution because it
> meant we ended up with a checksum that is only 16 bits wide. However,
> the 2 bytes set aside for the TLI weren't really being used
> effectively anyway, so repurposing them was relatively easy, and a
> 16-bit checksum is better than nothing.

But if we were in a green-field situation we'd probably not want to
use up several bytes for a nonce anyway. You said so yourself.

> I do understand that there are significant challenges and performance
> concerns around having these kinds of initdb-controlled page layout
> changes, so the future of that patch is unclear.

Why does it need to be at initdb time?

Though I cannot prove it, I suspect that the original intent of the
special area was to support an additional (though typically small)
variable length array, that works a little like the current line
pointer array. This array would have to grow backwards (newer items
get appended at earlier physical offsets), unlike our line pointer
array (which gets appended to at the end, in the simple and obvious
way). Growing backwards like this happens in DB systems that store
their line pointer array at the end of the page (the traditional
approach from the System R days, I believe).

Supporting a variable-length special area array like this would mean
that any time you add a new item to the variable-sized array in the
special area, the page's entire tuple space has to be memmove()'d
backwards by a couple of bytes to create the required space. And so
the relevant bufpage.c routine would have to adjust the whole line
pointer array such that each lp_off received a compensating
adjustment. The array might only be for some kind of page-level
transaction metadata, something like that -- shifting it around is
pretty expensive (reusing existing slots isn't too expensive, though).

Why can't it work like that? You don't really need to build the full
set of bufpage.c facilities (though it might not be a bad idea to
fully support these variable-length arrays, which seem like they might
come in handy). That seems perfectly compatible with what Matthias
wants to do, provided we're willing to deem the special area struct
(e.g. BTOpaque) as always coming "first" (which is essentially the
same as his current proposal anyway). You can even do the same thing
yourself for the nonce (use a fixed, known offset), with relatively
modest effort. You'd need to have AM-specific knowledge (it would
stack right on top of Matthias's technique), but that doesn't seem all
that hard. There are plenty of remaining status bits in BTOpaque, and
probably all other index AM special areas.

-- 
Peter Geoghegan




avoid multiple hard links to same WAL file after a crash

2022-04-07 Thread Nathan Bossart
Hi hackers,

I am splitting this off of a previous thread aimed at reducing archiving
overhead [0], as I believe this fix might deserve back-patching.

Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1] demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling.  Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting.  This seems likely to lead to WAL corruption.
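For context, the intermediate state is easy to reproduce by hand, since
durable_rename_excl() boils down to link() followed by unlink(). A minimal
shell sketch of the race window (the segment names are hypothetical):

```shell
wal=$(mktemp -d)
cd "$wal"
echo 'recycled segment' > 000000010000000000000001

# Step 1 of durable_rename_excl(): link() the file to its new name.
ln 000000010000000000000001 000000010000000000000002
# <-- a crash at this point means the unlink() of the old name never runs

# Both directory entries now reference the same inode; the link count is 2:
links=$(ls -l 000000010000000000000001 | awk '{print $2}')
echo "link count: $links"    # link count: 2

# durable_rename() instead uses rename(), which atomically replaces the
# name, so the two-link intermediate state is never reachable.
cd /
rm -rf "$wal"
```

On restart, recovery sees both names and can recycle the same physical file
twice, which is how the corruption scenario above arises.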

The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling.  This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.

This patch also sets the stage for reducing archiving overhead (as
discussed in the other thread [0]).  The proposed change to reduce
archiving overhead will make it more likely that the server will attempt to
re-archive segments after a crash.  This might lead to archive corruption
if the server concurrently writes to the same file via the aforementioned
bug.

[0] https://www.postgresql.org/message-id/20220222011948.GA3850532%40nathanxps13
[1] https://www.postgresql.org/message-id/20220222173711.GA3852671%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 244726f6a78aca52c2fe6e70cef966f152057191 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 7 Apr 2022 10:07:42 -0700
Subject: [PATCH v1 1/1] Avoid multiple hard links to same WAL file after a
 crash.

Presently, WAL recycling uses durable_rename_excl(), which notes that a crash at
an unfortunate moment can result in two links to the same file.  My testing
demonstrated that it was possible to end up with two links to the same file in
pg_wal after a crash just before unlink() during WAL recycling.  Specifically,
the test produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting.  This
seems likely to lead to WAL corruption.

This change prevents this problem by using durable_rename() instead of
durable_rename_excl() for WAL recycling.  This removes the protection against
accidentally overwriting an existing WAL file, but there shouldn't be one.

Back-patch to all supported versions.

Author: Nathan Bossart
---
 src/backend/access/transam/xlog.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6770c3ddba..6ab5b2a622 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3324,13 +3324,15 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
+	 * Perform the rename.  Ideally, we'd use link() and unlink() to avoid
+	 * overwriting an existing file (there shouldn't be one).  However, that
+	 * approach opens up the possibility that pg_wal will contain multiple hard
+	 * links to the same WAL file after a crash.
 	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.25.1



Re: LogwrtResult contended spinlock

2022-04-07 Thread Alvaro Herrera
On 2022-Apr-05, Alvaro Herrera wrote:

> Apologies -- I selected the wrong commit to extract the commit message
> from.  Here it is again.  I also removed an obsolete /* XXX */ comment.

I spent a lot of time staring at this to understand the needs for memory
barriers in the interactions.  In the end I decided not to get this out
for this cycle because I don't want to create subtle bugs in WAL.  I'll
come back with this for pg16.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")




Re: [PATCH] Implement INSERT SET syntax

2022-04-07 Thread Marko Tiikkaja
On Wed, Mar 23, 2022 at 5:33 PM Tom Lane  wrote:
>
> Justin Pryzby  writes:
> > You have to either include the pre-requisite patches as 0001, and your 
> > patch as
> > 0002 (as I'm doing now), or name your patch something other than *.diff or
> > *.patch, so cfbot doesn't think it's a new version of the patch to be 
> > tested.
>
> This patch has been basically ignored for a full two years now.
> (Remarkably, it's still passing in the cfbot.)
>
> I have to think that that means there's just not enough interest
> to justify committing it.  Should we mark it rejected and move on?
> If not, what needs to happen to get it unstuck?

I can help with review and/or other work here.  Please give me a
couple of weeks.


.m




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
> Yeah, with only one instance it could just be cosmic rays or something.
> However, assuming it is real, I guess I wonder why we don't say
> CHECKPOINT_FORCE in standby mode too.

I guess it might partially be that restartpoints require a checkpoint to have
happened on the primary. If we used FORCE, we'd have to wait till the next
checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
CHECKPOINT before shutting the standby down.

Greetings,

Andres Freund




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-07 Thread Petr Jelinek


> On 7. 4. 2022, at 17:19, Robert Haas  wrote:
> 
> On Tue, Apr 5, 2022 at 10:17 AM Tom Lane  wrote:
>> What I think you need to do is:
>> 
>> 1. In the back branches, revert delayChkpt to its previous type and
>> semantics.  Squeeze a separate delayChkptEnd bool in somewhere
>> (you can't change the struct size either ...).
>> 
>> 2. In HEAD, rename the field to something like delayChkptFlags,
>> to ensure that any code touching it has to be inspected and updated.
> 
> Here are patches for master and v14 to do things this way. Comments?


Yeah I think this should do it (compilers should warn on master even without 
the rename, but who notices that, right? :) )

Thanks,
Petr





Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 10:29:10 -0700, Andres Freund wrote:
> On 2022-04-07 13:16:53 -0400, c...@anastigmatix.net wrote:
> > The command that I've just been reusing from my bash_history without
> > thinking about it for some years is:
> > 
> > configure --enable-cassert --enable-tap-tests \
> >  --with-libxml --enable-debug \
> >  CFLAGS='-ggdb -Og -g3 -fno-omit-frame-pointer'
> 
> Hm, that's similar to what I use without seeing the problem.
> 
> IIUC you ran installcheck - did you set any non-default config options in the
> postgres instance that runs against? Is anything else running on that
> instance?  Do you use any -j setting when running installcheck-world?
> 
> Is the failure reproducible, or a one-off?

I've now reproduced this, albeit not reliably yet. Looking.

Greetings,

Andres Freund




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
>> This test is sending a CHECKPOINT command to the standby and
>> expecting it to run the archive_cleanup_command, but it looks
>> like the standby did not actually run any checkpoint:
>> ...
>> I wondered if the recent pg_stats changes could have affected this, but I
>> don't really see how.

> I don't really see either. It's a bit more conceivable that the recovery
> prefetching changes could affect the timing sufficiently?

Oh, that's at least a little plausible.

> I guess we'll have to wait and see what the frequency of the problem is?

Yeah, with only one instance it could just be cosmic rays or something.
However, assuming it is real, I guess I wonder why we don't say
CHECKPOINT_FORCE in standby mode too.

regards, tom lane




Re: [PATCH] Add native windows on arm64 support

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 13:42:49 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-04-07 09:40:43 -0400, Tom Lane wrote:
> >> If it causes problems down the road, how will we debug it?
>
> > If what causes problems down the road? Afaics the patch doesn't change
> > anything outside of windows-on-arm, so it shouldn't cause any breakage we 
> > care
> > about until we get a buildfarm animal.
>
> Do we have a volunteer to run a buildfarm animal?  I don't see much
> point in thinking about this till there is one ready to go.

OP said that they might be able to:
https://www.postgresql.org/message-id/CAFPTBD_NZb%3Dq_6uE-QV%2BS%2Bpm%3DRc9sBKrg8CQ_Y3dc27Awrm7cQ%40mail.gmail.com

Niyas, any updates on that?

Greetings,

Andres Freund




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > Add TAP test for archive_cleanup_command and recovery_end_command
> 
> grassquit just showed a non-reproducible failure in this test [1]:

I was just staring at that as well.


> # Postmaster PID for node "standby" is 291160
> ok 1 - check content from archives
> not ok 2 - archive_cleanup_command executed on checkpoint
> 
> #   Failed test 'archive_cleanup_command executed on checkpoint'
> #   at t/002_archiving.pl line 74.
> 
> This test is sending a CHECKPOINT command to the standby and
> expecting it to run the archive_cleanup_command, but it looks
> like the standby did not actually run any checkpoint:
> 
> 2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG:  connection 
> received: host=[local]
> 2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG:  connection 
> authorized: user=bf database=postgres application_name=002_archiving.pl
> 2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG:  statement: 
> CHECKPOINT
> 2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG:  disconnection: 
> session time: 0:00:00.032 user=bf database=postgres host=[local]
> 
> I am suspicious that the reason is that ProcessUtility does not
> ask for a forced checkpoint when in recovery:
> 
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
>   (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
> 
> The trouble with this theory is that this test has been there for
> nearly six months and this is the first such failure (I scraped the
> buildfarm logs to be sure).  Seems like failures should be a lot
> more common than that.

> I wondered if the recent pg_stats changes could have affected this, but I
> don't really see how.

I don't really see either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?

It's also possible that it requires an animal of a certain speed to happen -
we didn't have an -fsanitize=address animal until recently.

I guess we'll have to wait and see what the frequency of the problem is?

Greetings,

Andres Freund




Re: [PATCH] Add native windows on arm64 support

2022-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 09:40:43 -0400, Tom Lane wrote:
>> If it causes problems down the road, how will we debug it?

> If what causes problems down the road? Afaics the patch doesn't change
> anything outside of windows-on-arm, so it shouldn't cause any breakage we care
> about until we get a buildfarm animal.

Do we have a volunteer to run a buildfarm animal?  I don't see much
point in thinking about this till there is one ready to go.

regards, tom lane




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-07 Thread Tom Lane
Michael Paquier  writes:
> Add TAP test for archive_cleanup_command and recovery_end_command

grassquit just showed a non-reproducible failure in this test [1]:

# Postmaster PID for node "standby" is 291160
ok 1 - check content from archives
not ok 2 - archive_cleanup_command executed on checkpoint

#   Failed test 'archive_cleanup_command executed on checkpoint'
#   at t/002_archiving.pl line 74.

This test is sending a CHECKPOINT command to the standby and
expecting it to run the archive_cleanup_command, but it looks
like the standby did not actually run any checkpoint:

2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG:  connection 
received: host=[local]
2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG:  connection 
authorized: user=bf database=postgres application_name=002_archiving.pl
2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG:  statement: 
CHECKPOINT
2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG:  disconnection: 
session time: 0:00:00.032 user=bf database=postgres host=[local]

I am suspicious that the reason is that ProcessUtility does not
ask for a forced checkpoint when in recovery:

RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
  (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));

The trouble with this theory is that this test has been there for
nearly six months and this is the first such failure (I scraped the
buildfarm logs to be sure).  Seems like failures should be a lot
more common than that.  I wondered if the recent pg_stats changes
could have affected this, but I don't really see how.

Thoughts?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-04-07%2015%3A45%3A48




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 13:16:53 -0400, c...@anastigmatix.net wrote:
> The command that I've just been reusing from my bash_history without
> thinking about it for some years is:
> 
> configure --enable-cassert --enable-tap-tests \
>  --with-libxml --enable-debug \
>  CFLAGS='-ggdb -Og -g3 -fno-omit-frame-pointer'

Hm, that's similar to what I use without seeing the problem.

IIUC you ran installcheck - did you set any non-default config options in the
postgres instance that runs against? Is anything else running on that
instance?  Do you use any -j setting when running installcheck-world?

Is the failure reproducible, or a one-off?

Greetings,

Andres Freund




Re: [PATCH] Add native windows on arm64 support

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 09:40:43 -0400, Tom Lane wrote:
> If it causes problems down the road, how will we debug it?

If what causes problems down the road? Afaics the patch doesn't change
anything outside of windows-on-arm, so it shouldn't cause any breakage we care
about until we get a buildfarm animal.

We've traditionally been somewhat relaxed about adding support for new
platforms, on similar notions. That said:

> So it seems like the sort of patch to put in at the beginning of a
> development cycle, not post-feature-freeze.

There doesn't seem to be a great urgency, and there's plenty stuff going on
right now. I can see us merging it post branching off, and then backpatching
it a bit later?

Greetings,

Andres Freund




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread chap

On 2022-04-07 12:49, Tom Lane wrote:

So what non-default build options are you using?


The command that I've just been reusing from my bash_history without
thinking about it for some years is:

configure --enable-cassert --enable-tap-tests \
 --with-libxml --enable-debug \
 CFLAGS='-ggdb -Og -g3 -fno-omit-frame-pointer'

Regards,
-Chap




Re: Mingw task for Cirrus CI

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-06 11:03:37 -0400, Andrew Dunstan wrote:
> On 3/30/22 20:26, Andres Freund wrote:
> > Could you try using dash to invoke configure here, and whether it makes 
> > configure faster?
> I got weird failures re libxml/parser.h when I tried with dash. See
>  (It would be nice if we
> could see config.log on failure.)

Since dash won't help us to get the build time down sufficiently, and the
tests don't pass without a separate build tree, I looked at what makes
config/prep_buildtree so slow.

It's largely just bad code. The slowest part are spawning one expr and mkdir
-p for each directory. One 'cmp' for each makefile doesn't help either.

The expr can be replaced with
  subdir=${item#$sourcetree}
that's afaics posix syntax ([1]), not bash.
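To illustrate the equivalence (the paths here are made up): POSIX `#`
parameter expansion strips the prefix inside the shell, where expr forks a
new process for every directory.

```shell
sourcetree=/home/build/postgres              # hypothetical source tree
item=$sourcetree/src/backend/access/transam

# What prep_buildtree does today: fork one expr per directory.
subdir_expr=$(expr "$item" : "$sourcetree\(.*\)")

# Proposed POSIX replacement: prefix removal in-process, no fork/exec.
subdir=${item#$sourcetree}

echo "$subdir_expr"    # /src/backend/access/transam
echo "$subdir"         # /src/backend/access/transam
```

The two produce the same string for these inputs; the saving is purely the
avoided fork/exec per iteration.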

Spawning one mkdir for each directory can be replaced by a single mkdir
invocation with all the directories. On my linux workstation that gets the
time for the first loop down from 1005ms to 38ms, really.

That has the danger of the commandline getting too long. But since we rely on
the final link of the backend to be done in a single command, I don't think
it's making things worse? We could try to use xargs otherwise, iirc that's in
posix as well.
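A sketch of both batching variants (the directory list is invented for the
example):

```shell
tree=$(mktemp -d)                        # stand-in for $buildtree
dirs="$tree/a/b $tree/a/c $tree/d"       # hypothetical subdirectory list

# One mkdir -p invocation for the whole list, instead of one per directory:
mkdir -p $dirs

# The xargs variant, safe if the list could exceed the command-line limit:
printf '%s\n' $dirs | xargs mkdir -p     # idempotent, same result

created=0
for d in $dirs; do
    [ -d "$d" ] && created=$((created + 1))
done
echo "created $created directories"      # created 3 directories
rm -rf "$tree"
```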

Using parameter substitution in the second loop takes it down from 775ms to
533ms. Not calling cmp when the file doesn't exist cuts it down to 337ms.

I don't know of a way to batch the call to ln. The time with ln replaced with
: is 151ms, fwiw.

On windows that makes prep_buildtree go from 42.4s to 5.8s for me.

Greetings,

Andres Freund

[1] https://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html
diff --git i/config/prep_buildtree w/config/prep_buildtree
index a0eabd3dee2..d26de429e61 100644
--- i/config/prep_buildtree
+++ w/config/prep_buildtree
@@ -26,19 +26,30 @@ buildtree=`cd ${2:-'.'} && pwd`
 # If we did, it would interfere with installation of prebuilt docs from
 # the source tree, if a VPATH build is done from a distribution tarball.
 # See bug #5595.
-for item in `find "$sourcetree" -type d \( \( -name CVS -prune \) -o \( -name .git -prune \) -o -print \) | grep -v "$sourcetree/doc/src/sgml/\+"`; do
-subdir=`expr "$item" : "$sourcetree\(.*\)"`
-if test ! -d "$buildtree/$subdir"; then
-mkdir -p "$buildtree/$subdir" || exit 1
-fi
-done
+
+mkdir -p $(
+for item in `find "$sourcetree" -type d \( \( -name CVS -prune \) -o \( -name .git -prune \) -o -print \) | grep -v "$sourcetree/doc/src/sgml/\+"`; do
+subdir=${item#$sourcetree}
+
+if test ! -d "$buildtree/$subdir"; then
+echo "$buildtree/$subdir"
+fi
+done
+)
 
 for item in `find "$sourcetree" -name Makefile -print -o -name GNUmakefile -print | grep -v "$sourcetree/doc/src/sgml/images/"`; do
-filename=`expr "$item" : "$sourcetree\(.*\)"`
-if test ! -f "${item}.in"; then
-if cmp "$item" "$buildtree/$filename" >/dev/null 2>&1; then : ; else
-ln -fs "$item" "$buildtree/$filename" || exit 1
-fi
+filename=${item#$sourcetree}
+
+if test -f "${item}.in"; then
+continue;
+fi
+
+if test ! -e "$buildtree/$filename"; then
+ln -s "$item" "$buildtree/$filename" || exit 1;
+elif cmp "$item" "$buildtree/$filename" >/dev/null 2>&1; then
+: ;
+else
+ln -fs "$item" "$buildtree/$filename" || exit 1
 fi
 done
 


Re: How about a psql backslash command to show GUCs?

2022-04-07 Thread Pavel Stehule
On Thu, Apr 7, 2022 at 7:04 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Thu, Apr 7, 2022 at 9:58 AM Joe Conway  wrote:
>
>> On 4/7/22 12:37, Tom Lane wrote:
>> > Mark Dilger  writes:
>> >>> On Apr 7, 2022, at 9:29 AM, Tom Lane  wrote:
>> >>> I wouldn't
>> >>> fight too hard if people want to lengthen it to \dconfig for
>> consistency
>> >>> with set_config().
>> >
>> >> I'd prefer \dconfig, but if the majority on this list view that as
>> pedantically forcing them to type more, I'm not going to kick up a fuss
>> about \dconf.
>> >
>> > Maybe I'm atypical, but I'm probably going to use tab completion
>> > either way, so it's not really more keystrokes.  The consistency
>> > point is a good one that I'd not considered before.
>>
>> Yeah I had thought about \dconfig too -- +1 to that, although I am fine
>> with \dconf too.
>>
>>
> \dconfig[+] gets my vote.  I was going to say "conf" just isn't common
> jargon to say or write; but the one place it is - file extensions - is
> relevant and common.  But still, I would go with the non-jargon form.
>

dconfig is better, because Google can be confused - dconf is a known project:
https://en.wikipedia.org/wiki/Dconf

The length is not too important when we have tab completion.

Regards

Pavel


> David J.
>


Re: How about a psql backslash command to show GUCs?

2022-04-07 Thread David G. Johnston
On Thu, Apr 7, 2022 at 9:58 AM Joe Conway  wrote:

> On 4/7/22 12:37, Tom Lane wrote:
> > Mark Dilger  writes:
> >>> On Apr 7, 2022, at 9:29 AM, Tom Lane  wrote:
> >>> I wouldn't
> >>> fight too hard if people want to lengthen it to \dconfig for
> consistency
> >>> with set_config().
> >
> >> I'd prefer \dconfig, but if the majority on this list view that as
> pedantically forcing them to type more, I'm not going to kick up a fuss
> about \dconf.
> >
> > Maybe I'm atypical, but I'm probably going to use tab completion
> > either way, so it's not really more keystrokes.  The consistency
> > point is a good one that I'd not considered before.
>
> Yeah I had thought about \dconfig too -- +1 to that, although I am fine
> with \dconf too.
>
>
\dconfig[+] gets my vote.  I was going to say "conf" just isn't common
jargon to say or write; but the one place it is - file extensions - is
relevant and common.  But still, I would go with the non-jargon form.

David J.


Re: How about a psql backslash command to show GUCs?

2022-04-07 Thread Joe Conway

On 4/7/22 12:37, Tom Lane wrote:

Mark Dilger  writes:

On Apr 7, 2022, at 9:29 AM, Tom Lane  wrote:
I wouldn't
fight too hard if people want to lengthen it to \dconfig for consistency
with set_config().



I'd prefer \dconfig, but if the majority on this list view that as pedantically 
forcing them to type more, I'm not going to kick up a fuss about \dconf.


Maybe I'm atypical, but I'm probably going to use tab completion
either way, so it's not really more keystrokes.  The consistency
point is a good one that I'd not considered before.


Yeah I had thought about \dconfig too -- +1 to that, although I am fine 
with \dconf too.


Joe




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Andres Freund
Hi,

On 2022-04-07 12:49:07 -0400, Tom Lane wrote:
> c...@anastigmatix.net writes:
> > Running installcheck-world on an unrelated patch, I noticed a failure
> > here in test/isolation/expected/stats_1.out (this is line 3102):
> 
> So what non-default build options are you using?
> 
> The only isolationcheck failure remaining in the buildfarm is
> prion's, which I can reproduce here by building with
> -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE as it does.
> Looking at the nature of the diffs, this is not too surprising;
> the expected output appears to rely on a cache flush not happening
> quickly in s2.

Yea :(. I tested debug_discard_caches, but not -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE.

Not quite sure what to do about it - it's intentionally trying to test the
case of no invalidations being processed, as that's an annoying edge case with
functions.  Perhaps wrapping the function call of the "already dropped"
function in another function that catches the error would do the trick? It'd
be more easily silently broken, but still be better than not having the test.

Greetings,

Andres Freund




Re: test/isolation/expected/stats_1.out broken for me

2022-04-07 Thread Tom Lane
c...@anastigmatix.net writes:
> Running installcheck-world on an unrelated patch, I noticed a failure
> here in test/isolation/expected/stats_1.out (this is line 3102):

So what non-default build options are you using?

The only isolationcheck failure remaining in the buildfarm is
prion's, which I can reproduce here by building with
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE as it does.
Looking at the nature of the diffs, this is not too surprising;
the expected output appears to rely on a cache flush not happening
quickly in s2.

regards, tom lane




Re: How about a psql backslash command to show GUCs?

2022-04-07 Thread Jonathan S. Katz

On 4/7/22 12:42 PM, Mark Dilger wrote:




On Apr 7, 2022, at 9:37 AM, Tom Lane  wrote:

Maybe I'm atypical, but I'm probably going to use tab completion
either way, so it's not really more keystrokes.


Same here, because after tab-completing \dcon\t\t into \dconfig, I'm likely to 
also tab-complete to get the list of parameters.  I frequently can't recall the 
exact spelling of them.


This seems like the only "\d" command that would require tab completion, 
given all the others are far less descriptive (\dt, \dv, etc.) And at 
least from my user perspective, if I ever need anything other than \dt, 
I typically have to go to \? to look it up.


I'm generally in favor of consistency, though in this case skewing towards 
what we expose to the user. If "\dconfig" gives a bit more across the 
board, I'm OK with that. "\dparam" could be a bit confusing to end users 
("parameter for what?") so I'm -1 on that.


Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: How about a psql backslash command to show GUCs?

2022-04-07 Thread Mark Dilger



> On Apr 7, 2022, at 9:37 AM, Tom Lane  wrote:
> 
> Maybe I'm atypical, but I'm probably going to use tab completion
> either way, so it's not really more keystrokes.

Same here, because after tab-completing \dcon\t\t into \dconfig, I'm likely to 
also tab-complete to get the list of parameters.  I frequently can't recall the 
exact spelling of them.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






