Re: RLS makes COPY TO process child tables

2023-02-01 Thread Antonin Houska
Yugo NAGATA  wrote:

> On Wed, 01 Feb 2023 12:45:57 +0100
> Antonin Houska  wrote:
> 
> > While working on [1] I noticed that if RLS gets enabled, the COPY TO command
> > includes the contents of child table into the result, although the
> > documentation says it should not:
> > 
> > "COPY TO can be used only with plain tables, not views, and does not
> > copy rows from child tables or child partitions. For example, COPY
> > table TO copies the same rows as SELECT * FROM ONLY table. The syntax
> > COPY (SELECT * FROM table) TO ... can be used to dump all of the rows
> > in an inheritance hierarchy, partitioned table, or view."
> > 
> > A test case is attached (rls.sql) as well as fix proposal
> > (copy_rls_no_inh.diff).
> 
> I think this is a bug because the current behaviour is different from
> the documentation. 
> 
> When RLS is enabled on a table in `COPY ... TO ...`, the query is converted
> to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS
> clauses. This causes the rows of child tables to be dumped as well.
> 
> The patch fixes this by setting "inh" of the table in the converted query
> to false. This seems reasonable and actually fixes the problem.
> 
> However, I think we would want a comment on the added line.

A short comment added, see the new patch version.

> Also, the attached test should be placed in the regression test.

Hm, I'm not sure it's necessary. It would effectively test whether the 'inh'
field works, but if it didn't, many other tests would fail. I discovered the
bug by reading the code, so I wanted to demonstrate (also to myself) that it
causes incorrect behavior from the user's perspective. That was the purpose of the
test.
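
For reference, a minimal sketch of the kind of scenario the attached rls.sql
exercises (the object and role names here are made up, not taken from the
attachment):

    CREATE ROLE rls_user LOGIN;
    CREATE TABLE parent (i int);
    CREATE TABLE child () INHERITS (parent);
    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (2);
    ALTER TABLE parent ENABLE ROW LEVEL SECURITY;
    CREATE POLICY p ON parent FOR SELECT USING (true);
    GRANT SELECT ON parent, child TO rls_user;
    SET ROLE rls_user;       -- RLS is not applied to the table owner
    COPY parent TO STDOUT;   -- documented behaviour: only row 1;
                             -- before the fix, row 2 is included as well
    RESET ROLE;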

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

From 3a4acb29901a7b53eb32767e30bdfce74b80df8b Mon Sep 17 00:00:00 2001
From: Antonin Houska 
Date: Thu, 2 Feb 2023 07:31:32 +0100
Subject: [PATCH] Make sure that COPY TO does not process child tables.

It's expected (and documented) that the child tables are not copied, however
the query generated for RLS didn't meet this expectation so far.
---
 src/backend/commands/copy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e34f583ea7..539fb682c2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -250,6 +250,9 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 pstrdup(RelationGetRelationName(rel)),
 -1);
 
+			/* COPY TO should not process child tables. */
+			from->inh = false;
+
 			/* Build query */
 			select = makeNode(SelectStmt);
 			select->targetList = targetList;
-- 
2.31.1



Re: RLS makes COPY TO process child tables

2023-02-01 Thread Yugo NAGATA
On Wed, 01 Feb 2023 11:47:23 -0500
Tom Lane  wrote:

> Yugo NAGATA  writes:
> > Antonin Houska  wrote:
> >> While working on [1] I noticed that if RLS gets enabled, the COPY TO 
> >> command
> >> includes the contents of child table into the result, although the
> >> documentation says it should not:
> 
> > I think this is a bug because the current behaviour is different from
> > the documentation. 
> 
> I agree, it shouldn't do that.
> 
> > When RLS is enabled on a table in `COPY ... TO ...`, the query is converted
> > to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS
> > clauses. This causes the rows of child tables to be dumped as well.
> 
> Do we actually say that in so many words, either in the code or docs?
> If so, it ought to read `COPY (SELECT * FROM ONLY ...) TO ...`
> instead.  (If we say that in the docs, then arguably the code *does*
> conform to the docs.  But I don't see it in the COPY ref page at least.)

The documentation does not say that, but the current code actually does that.
Also, there is the following comment in BeginCopyTo().

 * With row-level security and a user using "COPY relation TO", we
 * have to convert the "COPY relation TO" to a query-based COPY (eg:
 * "COPY (SELECT * FROM relation) TO"), to allow the rewriter to add
 * in any RLS clauses.

Maybe it would be better to change the description in the comment to
"COPY (SELECT * FROM ONLY relation) TO" when fixing the bug.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: run pgindent on a regular basis / scripted manner

2023-02-01 Thread Tom Lane
Noah Misch  writes:
> Regarding the concern about a pre-receive hook blocking an emergency push, the
> hook could approve every push where a string like "pgindent: no" appears in a
> commit message within the push.  You'd still want to make the tree clean
> sometime the same week or so.  It's cheap to provide a break-glass like that.

I think the real question here is whether we can get all (or at least
a solid majority of) committers to accept such draconian constraints.
I'd buy into it, and evidently so would you, but I can't help noting
that less than a quarter of active committers have bothered to
comment on this thread.  I suspect the other three-quarters would
be quite annoyed if we tried to institute such requirements.  That's
not manpower we can afford to drive away.

Maybe this should get taken up at the this-time-for-sure developer
meeting at PGCon?

regards, tom lane




RE: Deadlock between logrep apply worker and tablesync worker

2023-02-01 Thread houzj.f...@fujitsu.com
On Tuesday, January 31, 2023 1:07 AM vignesh C  wrote:
> On Mon, 30 Jan 2023 at 17:30, vignesh C  wrote:
> >
> > On Mon, 30 Jan 2023 at 13:00, houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Monday, January 30, 2023 2:32 PM Amit Kapila
>  wrote:
> > > >
> > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C 
> wrote:
> > > > >
> > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila 
> wrote:
> > > > > >
> > > > > > One thing that looks a bit odd is that we will anyway have a
> > > > > > similar check in replorigin_drop_guts() which is a static
> > > > > > function and called from only one place, so, will it be required to
> check at both places?
> > > > >
> > > > > There is a possibility that the initial check to verify if
> > > > > replication origin exists in replorigin_drop_by_name was
> > > > > successful but later one of either table sync worker or apply
> > > > > worker process might have dropped the replication origin,
> > > > >
> > > >
> > > > Won't locking on the particular origin prevent concurrent drops?
> > > > IIUC, the drop happens after the patch acquires the lock on the origin.
> > >
> > > Yes, I think the existence check in replorigin_drop_guts is
> > > unnecessary as we already lock the origin before that. I think the
> > > check in replorigin_drop_guts is a custom check after calling
> > > SearchSysCache1 to get the tuple, but the error should not happen as no
> concurrent drop can be performed.
> > >
> > > To make it simpler, one idea is to move the code that getting the
> > > tuple from system cache to the replorigin_drop_by_name(). After
> > > locking the origin, we can try to get the tuple and do the existence
> > > check, and we can reuse this tuple to perform origin delete. In this
> > > approach we only need to check origin existence once after locking.
> > > BTW, if we do this, then we'd better rename the
> > > replorigin_drop_guts() to something like replorigin_state_clear() as
> > > the function only clear the in-memory information after that.
> > >
> >
> > The attached updated patch has the changes to handle the same.
> 
> I had not merged one of the local changes that was present, please find the
> updated patch including that change. Sorry for missing that change.
> 

I also tried to measure the time of "src/test/subscription/t/002_types.pl"
before and after the patch (changing the lock level) and Tom's patch (splitting
the transaction), like what Vignesh has shared on -hackers.

I ran it about 100 times for each case. Tom's patch and the lock level patch
behave similarly on my machine[1].

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Even apart from the reduction in testing time, lowering the lock level and
locking the specific object also helps reduce lock contention, which benefits
users of the exposed function as well as the tablesync and apply workers.
So, I think pushing the patch to change the lock level makes sense.

And the patch looks good to me.

While on it, after pushing the patch, I think there is another case that might
also be worth improving: the tablesync and apply workers trying to drop the
same origin, which can cause some delay. This is a separate case (different from
the deadlock), so I feel we can try to improve it in another patch.

[1] CentOS 8.2, 128G RAM, 40 processors Intel(R) Xeon(R) Silver 4210 CPU @ 
2.20GHz

Best Regards,
Hou zj


Re: Add index scan progress to pg_stat_progress_vacuum

2023-02-01 Thread Masahiko Sawada
On Fri, Jan 20, 2023 at 11:38 PM Imseih (AWS), Sami  wrote:
>
> >Number of indexes that will be vacuumed or cleaned up. This counter only
> >advances when the phase is vacuuming indexes or cleaning up indexes.
>
> I agree, this reads better.
>
> ---
> -/* Report that we are now vacuuming indexes */
> -pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
> -
> PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
> +/*
> + * Report that we are now vacuuming indexes
> + * and the number of indexes to vacuum.
> + */
> +progress_start_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_INDEX;
> +progress_start_val[1] = vacrel->nindexes;
> +pgstat_progress_update_multi_param(2, progress_start_index,
> progress_start_val);
>
> >According to our code style guideline[1], we limit line lengths so
> >that the code is readable in an 80-column window. Some comments
> >updated in this patch seem too short.
>
> I will correct this.
>
> >I think it's better to define "void *arg".
>
> Agree
>
> >---
> >+/*
> >+ * A Leader process that receives this 
> > message
> >+ * must be ready to update progress.
> >+ */
> >+
> > Assert(pcxt->parallel_progress_callback);
> >+
> > Assert(pcxt->parallel_progress_callback_arg);
> >+
> >+/* Report progress */
> >+
> >pcxt->parallel_progress_callback(pcxt->parallel_progress_callback_arg);
>
> >I think the parallel query infra should not require
> >parallel_progress_callback_arg to always be set. I think it can be
> >NULL.
>
> This assertion is inside the new 'P' message type handling.
> If a leader is consuming this message, it must have a
> progress callback set. Right now we set the callback only
> in the parallel vacuum case, so not all leaders will be prepared
> to handle this case.
>
> Would you agree this is needed for safety?

I think it makes sense that we assume pcxt->parallel_progress_callback
is never NULL, but my point is that in the future one might want
to use this callback without the argument and we should allow it. If
parallel vacuum assumes pcxt->parallel_progress_callback_arg is not
NULL, I think we should add an assertion in
parallel_vacuum_update_progress(), but not in HandleParallelMessage().

>
> case 'P':   /* Parallel progress reporting */
> {
> /*
>  * A Leader process that receives this message
>  * must be ready to update progress.
>  */
> Assert(pcxt->parallel_progress_callback);
> Assert(pcxt->parallel_progress_callback_arg);
>
> ---
> >+void
> >+parallel_vacuum_update_progress(void *arg)
> >+{
> >+ParallelVacuumState *pvs = (ParallelVacuumState *)arg;
> >+
> >+Assert(!IsParallelWorker());
> >+
> >+if (pvs)
> >+
> > pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
> >+
> >   pg_atomic_add_fetch_u32(&(pvs->shared->nindexes_completed), 1));
> >+}
>
> >Since parallel vacuum always sets the arg, I think we don't need to 
> > check it.
>
> The arg is only set during parallel vacuum. During non-parallel vacuum,
> it's NULL. This check can be removed, but I realize now that we do need
> an Assert(pvs). Do you agree?

Agreed to add the assertion in this function.
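
For clarity, the agreed shape would be roughly the following (a sketch that
follows the names from the snippet quoted above, not the final patch):

    void
    parallel_vacuum_update_progress(void *arg)
    {
        ParallelVacuumState *pvs = (ParallelVacuumState *) arg;

        Assert(!IsParallelWorker());
        Assert(pvs);    /* parallel vacuum always passes its state */

        pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                     pg_atomic_add_fetch_u32(&(pvs->shared->nindexes_completed), 1));
    }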

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: run pgindent on a regular basis / scripted manner

2023-02-01 Thread Noah Misch
On Mon, Jan 30, 2023 at 03:42:09PM -0500, Bruce Momjian wrote:
> On Sat, Jan 28, 2023 at 05:06:03PM -0800, Noah Misch wrote:
> > On Tue, Jan 24, 2023 at 02:04:02PM -0500, Bruce Momjian wrote:
> > > On Tue, Jan 24, 2023 at 09:54:57AM -0500, Tom Lane wrote:
> > > > As another example, the mechanisms we use to create the typedefs list
> > > > in the first place are pretty squishy/leaky: they depend on which
> > > > buildfarm animals are running the typedef-generation step, and on
> > > > whether anything's broken lately in that code --- which happens on
> > > > a fairly regular basis (eg [1]).  Maybe that could be improved,
> > > > but I don't see an easy way to capture the set of system-defined
> > > > typedefs that are in use on platforms other than your own.  I
> > > > definitely do not want to go over to hand maintenance of that list.
> > > 
> > > As I now understand it, we would need to standardize on a typedef list
> > > at the beginning of each major development cycle, and then only allow
> > > additions,
> > 
> > Not to my knowledge.  There's no particular obstacle to updating the list 
> > more
> > frequently or removing entries.
> 
> We would need to re-pgindent the tree each time, I think, which would
> cause disruption if we did it too frequently.

More important than frequency is how much old code changes.  A new typedef
typically is an identifier not already appearing in the tree, so no old code
changes.  A removed typedef typically no longer appears in the tree, so again
no old code changes.  The tree can get those daily; they're harmless.

The push that adds or removes FooTypedef from the source code is in the best
position to react to any surprising indentation consequences of adding or
removing FooTypedef from typedefs.list.  (Reactions could include choosing a
different typedef name or renaming incidental matches in older code.)  Hence,
changing typedefs.list as frequently as it affects the code is less disruptive
than changing it once a year.  The same applies to challenges like pgindent
wrecking a non-"/*--" comment.  Such breakage is hard to miss when
it's part of the push that crafts the comment; it's easier to miss in a bulk,
end-of-cycle pgindent.

Regarding the concern about a pre-receive hook blocking an emergency push, the
hook could approve every push where a string like "pgindent: no" appears in a
commit message within the push.  You'd still want to make the tree clean
sometime the same week or so.  It's cheap to provide a break-glass like that.
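
To make the idea concrete, a hypothetical pre-receive hook could look roughly
like this (an illustrative sketch only; the actual cleanliness check is left
as site policy, and pushes that create new refs would need extra handling):

    #!/bin/sh
    while read oldrev newrev refname
    do
        # break-glass: accept the push if any commit message in the range
        # contains the agreed string
        if git log --format=%B "$oldrev..$newrev" | grep -qi 'pgindent: no'
        then
            continue
        fi
        # ... otherwise verify that $newrev is pgindent-clean and
        # "exit 1" to reject the push if it is not ...
    done
    exit 0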




Re: Add progress reporting to pg_verifybackup

2023-02-01 Thread Michael Paquier
On Thu, Feb 02, 2023 at 02:57:44PM +0900, Masahiko Sawada wrote:
> It seems that the --progress option doesn't work with command_like()
> since the progress information is written to stderr but command_like()
> doesn't expect any stderr output.

What about command_checks_all()?  It should check for stderr, stdout
as well as the expected error code.
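
For instance, something along these lines might do (just a sketch assuming the
usual TAP preamble of 004_options.pl; $backup_path stands for whatever path the
test already sets up, and the stderr pattern is a guess at the progress output
format):

    command_checks_all(
        [ 'pg_verifybackup', '--progress', $backup_path ],
        0,
        [ qr/backup successfully verified/ ],
        [ qr/\d+\/\d+ kB \(\d+%\) verified/ ],
        'progress information is written to stderr');
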
--
Michael


signature.asc
Description: PGP signature


Re: heapgettup refactoring

2023-02-01 Thread David Rowley
On Thu, 2 Feb 2023 at 10:12, Melanie Plageman  wrote:
> FWIW, I like setting rs_inited in heapgettup_initial_block() better in
> the final refactor, but I agree with you that in this patch on its own
> it is better in the body of heapgettup() and heapgettup_pagemode().

We can reconsider that when we get to that patch. It just felt a bit
ugly to add an InvalidBlockNumber check after calling
table_block_parallelscan_nextpage().

> I believe this else is superfluous since we returned above.

TBH, that's on purpose. I felt that it just looked better that way as
the code all fitted onto my screen.  I think if the function was
longer and people had to scroll down to read it, it can often be
better to return and reduce the nesting. This allows you to mentally
note that a certain case is handled above.  However, since all these
helper functions seem to fit onto a screen without too much trouble,
it just seems better (to me) if they all follow the format that I
mentioned earlier. I might live to regret that as we often see
get-rid-of-useless-else-clause patches coming up. I'm starting to
wonder if someone's got some alarm that goes off every time one gets
committed, but we'll see. I'd much rather have consistency between the
helper functions than save a few bytes on tab characters. It would be
different if the indentation were shifting things too far right, but
that's not going to happen in a function that all fits onto a screen
at once.

I've attached a version of the next patch in the series. I admit to
making a couple of adjustments. I couldn't bring myself to remove the
snapshot local variable in this commit. We can deal with that when we
come to whichever patch needs to change it.  The only
other thing I really did was question your use of rs_cindex to store
the last OffsetNumber.  I ended up adding a new field named rs_coffset,
which slots into the padding between a bool and a BlockNumber field, for
this purpose. I noticed what Andres wrote [1] earlier in this thread
about that, so I thought we should move away from looking at the last
tuple to get this number.

I've attached the rebased and updated patch.

David

[1] https://postgr.es/m/20221101010948.hsf33emgnwzvi...@awork3.anarazel.de
From 32652b312d0c802e3105c064c24ff5b99694399c Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Thu, 2 Feb 2023 15:07:27 +1300
Subject: [PATCH v9] Further refactor of heapgettup and heapgettup_pagemode

Backward and forward scans share much of the same page acquisition code.
Here we consolidate that code to reduce some duplication.

Additionally, add a new rs_coffset field to HeapScanDescData to track the
offset of the current tuple.  The new field fits nicely into the padding
between a bool and BlockNumber field and saves having to look at the last
returned tuple to figure out which offset we should be looking at for the
current tuple.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: 
https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_t8qeqzqol02rpi...@mail.gmail.com
---
 src/backend/access/heap/heapam.c | 200 ++-
 src/include/access/heapam.h  |   1 +
 2 files changed, 63 insertions(+), 138 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b34e20a71f..ec6b7962c5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -576,101 +576,67 @@ heapgettup(HeapScanDesc scan,
BlockNumber block;
boolfinished;
Pagepage;
-   int lines;
OffsetNumber lineoff;
int linesleft;
ItemId  lpp;
 
-   /*
-* calculate next starting lineoff, given scan direction
-*/
-   if (ScanDirectionIsForward(dir))
+   if (unlikely(!scan->rs_inited))
{
-   if (!scan->rs_inited)
-   {
-   block = heapgettup_initial_block(scan, dir);
+   block = heapgettup_initial_block(scan, dir);
 
-   /*
-* Check if we have reached the end of the scan 
already. This
-* could happen if the table is empty or if the 
parallel workers
-* have already finished the scan before we did 
anything ourselves
-*/
-   if (block == InvalidBlockNumber)
-   {
-   Assert(!BufferIsValid(scan->rs_cbuf));
-   tuple->t_data = NULL;
-   return;
-   }
-   heapgetpage((TableScanDesc) scan, block);
-   lineoff = FirstOffsetNumber;/* first offnum */
-   scan->rs_inited = true;
-   }
-   else
+   /*
+* Check if we have reached the end of the scan already. This 
could
+* 

Re: Add progress reporting to pg_verifybackup

2023-02-01 Thread Masahiko Sawada
On Wed, Feb 1, 2023 at 10:25 AM Michael Paquier  wrote:
>
> On Fri, Jan 06, 2023 at 04:28:42PM +0900, Masahiko Sawada wrote:
> > I've attached the simple patch to add the progress reporting option to
> > pg_verifybackup. The progress information is displayed with --progress
> > option only during the checksum verification, which is the most time
> > consuming task. It cannot be used together with --quiet option.
>
> That looks helpful, particularly when a backup has many relation
> files.  Calculating the total size when browsing the file list looks
> fine.
>
> +   /* Complain if the specified arguments conflict */
> +   if (show_progress && quiet)
> +   pg_fatal("cannot specify both --progress and --quiet");
>
> Nothing on HEAD proposes --progress and --quiet at the same time from
> what I can see, so just disabling the combination is fine by me.  For
> the error message, I would recommend to switch to a more generic
> pattern, as of:
> "cannot specify both %s and %s", "-P/--progress", "-q/--quiet"

Agreed.
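
That is, the check would become roughly:

    if (show_progress && quiet)
        pg_fatal("cannot specify both %s and %s", "-P/--progress", "-q/--quiet");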

>
> Could you add a check based on command_fails_like() in 004_options.pl,
> at least?

Agreed, done in v2 patch.

>   A second test to check after the output of --progress would
> be a nice bonus, for example by sticking --progress into one of the
> existing commands doing a command_like().

It seems that the --progress option doesn't work with command_like()
since the progress information is written to stderr but command_like()
doesn't expect any stderr output.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-Add-progress-reporting-to-pg_verifybackup.patch
Description: Binary data


Re: Exposing the lock manager's WaitForLockers() to SQL

2023-02-01 Thread Will Mortensen
Here is a first attempt at a WIP patch. Sorry about the MIME type.

It doesn't take any locks on the tables, but I'm not super confident
that that's safe, so any input would be appreciated.

I omitted view support for simplicity, but if that seems like a
requirement I'll see about adding it. I assume we would need to take
AccessShareLock on views (and release it, per above).

If the syntax and behavior seem roughly correct I'll work on updating the docs.

The commit message at the beginning of the .patch has slightly more commentary.

Thanks for any and all feedback!
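
For anyone skimming, the intended usage is roughly of this shape (the exact
grammar here is only a guess based on the patch title, not taken from the
patch itself):

    -- wait for current holders of conflicting locks to finish,
    -- without actually acquiring or holding the lock itself
    LOCK TABLE my_table IN SHARE ROW EXCLUSIVE MODE WAIT ONLY;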


0001-Add-WAIT-ONLY-option-to-LOCK.patch
Description: Binary data


Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Amit Kapila
On Thu, Feb 2, 2023 at 10:04 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila  
> wrote in
> > On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada  
> > wrote:
> > >
> > > Otherwise, we will end up terminating
> > > the WAL stream without the done message, which will lead to an error
> > > message "ERROR:  could not receive data from WAL stream: server closed
> > > the connection unexpectedly" on the subscriber even at a clean
> > > shutdown.
> > >
> >
> > But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
> > disallows new connections, then waits for all existing clients to
> > disconnect. If the server is in hot standby, recovery and streaming
> > replication will be terminated once all clients have disconnected.),
> > there is no such guarantee. I see that it is required for the
> > switchover in physical replication to ensure that all the WAL is sent
> > and replicated but we don't need that for logical replication.
>
> +1
>
> Since the publisher is not aware of the apply delay (with this patch), the
> publisher is in fact gone before sending EOS in that
> case. The error message correctly describes that situation.
>

This can happen even without the apply-delay patch. For example, when
the apply process is waiting on some lock.

-- 
With Regards,
Amit Kapila.




Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Amit Kapila
On Thu, Feb 2, 2023 at 10:48 AM Masahiko Sawada  wrote:
>
> On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila  wrote:
> >
> >
> > > In a case where pq_is_send_pending() doesn't become false
> > > for a long time, (e.g., the network socket buffer got full due to the
> > > apply worker waiting on a lock), I think users should unblock it by
> > > themselves. Or it might be practically better to shutdown the
> > > subscriber first in the logical replication case, unlike the physical
> > > replication case.
> > >
> >
> > Yeah, will users like such a dependency? And what will they gain by doing 
> > so?
>
> IIUC there is no difference between smart shutdown and fast shutdown
> in logical replication walsender, but reading the doc[1], it seems to
> me that in the smart shutdown mode, the server stops existing sessions
> > normally. For example, if the client is psql that gets stuck for some
> reason and the network buffer gets full, the smart shutdown waits for
> a backend process to send all results to the client. I think the
> logical replication walsender should follow this behavior for
> consistency. One idea is to distinguish smart shutdown and fast
> shutdown also in logical replication walsender so that we disconnect
> even without the done message in fast shutdown mode, but I'm not sure
> it's worthwhile.
>

The main problem we want to solve here is to avoid shutdown failing in
case walreceiver/applyworker is busy waiting for some lock or for some
other reason as shown in the email [1]. I haven't tested it but if
such a problem doesn't exist in smart shutdown mode then probably we
can allow the walsender to wait till all the data is sent. We can
investigate what it takes to introduce shutdown mode knowledge for
the logical walsender. OTOH, the docs for smart shutdown say "If the
server is in hot standby, recovery and streaming replication will be
terminated once all clients have disconnected." which to me indicates
that it is okay to terminate logical replication connections even in
smart mode.

[1] - 
https://www.postgresql.org/message-id/TYAPR01MB58669CB06F6657ABCEFE6555F5F29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: recovery modules

2023-02-01 Thread Michael Paquier
On Wed, Feb 01, 2023 at 01:23:26PM -0800, Nathan Bossart wrote:
> Yeah, that's nicer.  cfbot is complaining about a missing #include, so I
> need to send a new revision anyway.

Okay, the changes done here look straight-forward seen from here.  I
got one small-ish comment.

+basic_archive_startup(ArchiveModuleState *state)
+{
+   BasicArchiveData *data = palloc0(sizeof(BasicArchiveData));

Perhaps this should use MemoryContextAlloc() rather than a plain
palloc().  This should not matter based on the position where the
startup callback is called, still that may be a pattern worth
encouraging.
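
For instance, the allocation could become something like this (which memory
context to use is an assumption on my side; TopMemoryContext is shown only for
illustration):

    BasicArchiveData *data =
        MemoryContextAllocZero(TopMemoryContext, sizeof(BasicArchiveData));
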
--
Michael


signature.asc
Description: PGP signature


Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Masahiko Sawada
On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila  wrote:
>
> On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada  wrote:
> >
> > On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila  
> > > wrote:
> > > >
> > > > Let me try to summarize the discussion till now. The problem we are
> > > > trying to solve here is to allow a shutdown to complete when walsender
> > > > is not able to send the entire WAL. Currently, in such cases, the
> > > > shutdown fails. As per our current understanding, this can happen when
> > > > (a) walreceiver/walapply process is stuck (not able to receive more
> > > > WAL) due to locks or some other reason; (b) a long time delay has been
> > > > configured to apply the WAL (we don't yet have such a feature for
> > > > logical replication but the discussion for same is in progress).
> > > >
> > > > Both reasons mostly apply to logical replication because there is no
> > > > separate walreceiver process whose job is to just flush the WAL. In
> > > > logical replication, the process that receives the WAL also applies
> > > > it. So, while applying it can stuck for a long time waiting for some
> > > > heavy-weight lock to be released by some other long-running
> > > > transaction by the backend.
> > > >
> ...
> ...
> >
> > +1 to eliminate condition (b) for logical replication.
> >
> > Regarding (a), as Amit mentioned before[1], I think we should check if
> > pq_is_send_pending() is false.
> >
>
> Sorry, but your suggestion is not completely clear to me. Do you mean
> to say that for logical replication, we shouldn't wait for all the WAL
> to be successfully replicated but we should ensure to inform the
> subscriber that XLOG streaming is done (by ensuring
> pq_is_send_pending() is false and by calling EndCommand, pq_flush())?

Yes.

>
> > Otherwise, we will end up terminating
> > the WAL stream without the done message, which will lead to an error
> > message "ERROR:  could not receive data from WAL stream: server closed
> > the connection unexpectedly" on the subscriber even at a clean
> > shutdown.
> >
>
> But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
> disallows new connections, then waits for all existing clients to
> disconnect. If the server is in hot standby, recovery and streaming
> replication will be terminated once all clients have disconnected.),
> there is no such guarantee.

In the smart shutdown case, the walsender doesn't exit until it can flush
the done message, no?

> I see that it is required for the
> switchover in physical replication to ensure that all the WAL is sent
> and replicated but we don't need that for logical replication.

It won't be a problem in practice in terms of logical replication. But
I'm concerned that this error could confuse users. Is there any case
where the client gets such an error at the smart shutdown?

>
> > In a case where pq_is_send_pending() doesn't become false
> > for a long time, (e.g., the network socket buffer got full due to the
> > apply worker waiting on a lock), I think users should unblock it by
> > themselves. Or it might be practically better to shutdown the
> > subscriber first in the logical replication case, unlike the physical
> > replication case.
> >
>
> Yeah, will users like such a dependency? And what will they gain by doing so?

IIUC there is no difference between smart shutdown and fast shutdown
in logical replication walsender, but reading the doc[1], it seems to
me that in the smart shutdown mode, the server stops existing sessions
normally. For example, if the client is psql that gets stuck for some
reason and the network buffer gets full, the smart shutdown waits for
a backend process to send all results to the client. I think the
logical replication walsender should follow this behavior for
consistency. One idea is to distinguish smart shutdown and fast
shutdown also in logical replication walsender so that we disconnect
even without the done message in fast shutdown mode, but I'm not sure
it's worthwhile.
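
For reference, the two modes being compared are the ones selected with
pg_ctl's -m switch:

    pg_ctl stop -D "$PGDATA" -m smart   # wait for existing clients to disconnect
    pg_ctl stop -D "$PGDATA" -m fast    # terminate sessions, then shut down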

Regards,

[1] https://www.postgresql.org/docs/devel/server-shutdown.html

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl

2023-02-01 Thread Michael Paquier
On Wed, Feb 01, 2023 at 02:29:23PM +0900, Michael Paquier wrote:
> As also mentioned upthread by Tom, I am not sure that this combination
> makes much sense, actually, because I don't see why one would never
> want to know what is the effective value loaded for a parameter stored
> in a file when he/she has the permission to do so.  This could be
> changed as of ALTER SYSTEM, postgresql.conf or even an included file,
> and the value can only be read if permission to see it is given to the
> role querying SHOW or pg_settings.  This combination of flags is not a
> practice to encourage.

So, coming back quickly to this one, it seems to me that the tests in
guc.sql had better be adjusted down v15 where they have been
introduced, and that the extra check is worth doing on HEAD.  Any
thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: Exit walsender before confirming remote flush in logical replication

2023-02-01 Thread Kyotaro Horiguchi
At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila  wrote 
in 
> On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada  wrote:
> >
> > Otherwise, we will end up terminating
> > the WAL stream without the done message, which will lead to an error
> > message "ERROR:  could not receive data from WAL stream: server closed
> > the connection unexpectedly" on the subscriber even at a clean
> > shutdown.
> >
> 
> But will that be a problem? As per docs of shutdown [1] ( “Smart” mode
> disallows new connections, then waits for all existing clients to
> disconnect. If the server is in hot standby, recovery and streaming
> replication will be terminated once all clients have disconnected.),
> there is no such guarantee. I see that it is required for the
> switchover in physical replication to ensure that all the WAL is sent
> and replicated but we don't need that for logical replication.

+1

Since the publisher is not aware of the apply delay (with this patch), the
publisher is in fact gone before sending EOS in that
case. The error message correctly describes that situation.

> > In a case where pq_is_send_pending() doesn't become false
> > for a long time, (e.g., the network socket buffer got full due to the
> > apply worker waiting on a lock), I think users should unblock it by
> > themselves. Or it might be practically better to shutdown the
> > subscriber first in the logical replication case, unlike the physical
> > replication case.
> >
> 
> Yeah, will users like such a dependency? And what will they gain by doing so?

If PostgreSQL required that kind of special care about shutdown when
facing trouble in order to keep replication consistency, that would not be
acceptable. The current time-delayed logical replication can be seen
as a kind of intentional, continuously large network lag in this
respect. And I think consistency is guaranteed even in such cases.

On the other hand, I don't think most people care about the
exact progress when facing such trouble, as long as replication
consistency is maintained. IMHO that is also true for the
time-delayed logical replication case.

> [1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Michael Paquier
On Wed, Feb 01, 2023 at 09:34:44PM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> Hmm.  Isn't that something that we should also document in startup.c
>> where both routines are defined?  If we begin to use
>> PreRestoreCommand() and PostRestoreCommand() in more code paths in the
>> future, that could be again an issue.
> 
> I was vaguely wondering about removing both of those functions
> in favor of an integrated function that does a system() call
> with those things before and after it.

It seems to me that this is pretty much the same as storing
in_restore_command in shell_restore.c, and that for recovery modules
this comes down to the addition of an extra callback called in
startup.c to check if the flag is up or not.  Now the patch is doing 
things the opposite way: like on HEAD, store the flag in startup.c but
switch it at will with the routines in startup.c.  I find the approach
of the patch a bit more intuitive, TBH, as that makes the interface
simpler for other recovery modules that may want to switch the flag
back-and-forth, and I suspect that there may be cases in recovery
modules where we'd still want to switch the flag, but not necessarily
link it to system().
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread shveta malik
On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu  wrote:
>
> Hi Shveta,
>
> shveta malik wrote the following on Wed, 1 Feb 2023 at 15:01:
>>
>> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu  wrote:
>> 2) I found a crash in the previous patch (v9), but have not tested it
>> on the latest yet. Crash happens when all the replication slots are
>> consumed and we are trying to create new. I tweaked the settings like
>> below so that it can be reproduced easily:
>> max_sync_workers_per_subscription=3
>> max_replication_slots = 2
>> and then ran the test case shared by you. I think there is some memory
>> corruption happening. (I did test in debug mode, have not tried in
>> release mode). I tried to put some traces to identify the root-cause.
>> I observed that worker_1 keeps on moving from 1 table to another table
>> correctly, but at some point, it gets corrupted i.e. origin-name
>> obtained for it is wrong and it tries to advance that and since that
>> origin does not exist, it  asserts and then something else crashes.
>> From log: (new trace lines added by me are prefixed by shveta, also
>> tweaked code to have my comment 1 fixed to have clarity on worker-id).
>>
>> From the below traces, it is clear that worker_1 was moving from one
>> relation to another, always getting correct origin 'pg_16688_1', but
>> at the end it got 'pg_16688_49' which does not exist. Second part of
>> trace shows who updated 'pg_16688_49', it was done by worker_49 which
>> even did not get chance to create this origin due to max_rep_slot
>> reached.
>
>
> Thanks for investigating this error. I think it's the same error as the one 
> Shi reported earlier. [1]
> I couldn't reproduce it yet but will apply your tweaks and try again.
> Looking into this.
>
> [1] 
> https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com
>

Hi Melih,
I think I am able to identify the root cause. It is not memory
corruption, but rather that origin names are stored in the system catalog,
mapped to a particular relation ID, before they are even created.

After adding few more logs:

[4227] LOG:  shveta- LogicalRepSyncTableStart- worker_49 constructed
originname :pg_16684_49, relid:16540
[4227] LOG:  shveta- LogicalRepSyncTableStart- worker_49
updated-origin in system catalog:pg_16684_49,
slot:pg_16684_sync_49_7195149685251088378, relid:16540
[4227] LOG:  shveta- LogicalRepSyncTableStart- worker_49
get-origin-id2 originid:0, originname:pg_16684_49
[4227] ERROR:  could not create replication slot
"pg_16684_sync_49_7195149685251088378": ERROR:  all replication slots
are in use
HINT:  Free one or increase max_replication_slots.


[4428] LOG:  shveta- LogicalRepSyncTableStart- worker_148 constructed
originname :pg_16684_49, relid:16540
[4428] LOG:  could not drop replication slot
"pg_16684_sync_49_7195149685251088378" on publisher: ERROR:
replication slot "pg_16684_sync_49_7195149  685251088378" does not
exist
[4428] LOG:  shveta- LogicalRepSyncTableStart- worker_148 drop-origin
originname:pg_16684_49
[4428] LOG:  shveta- LogicalRepSyncTableStart- worker_148
updated-origin:pg_16684_49,
slot:pg_16684_sync_148_7195149685251088378, relid:16540

So from the above, worker_49 came and picked up relid 16540, constructed
the origin name and slot name and updated them in the system catalog, and then
tried to actually create that slot and origin; but since the max slot count
was reached, it failed and exited without cleaning up the system
catalog entries for that relid.

Then worker_148 came and also picked up table 16540, since it was not
completed/started by the previous worker. This time it found the
origin and slot entries present in the system catalog for this relid, so
it picked the same names and started processing; but since those do
not actually exist, it asserted while advancing the origin.

The assert is only reproduced when an already running worker (say
worker_1) that has 'created=true' set gets to sync the relid for which
a previously failed worker has updated the origin name without
creating it. In such a case worker_1 (with created=true) will try to
reuse the origin, and thus will try to advance it and will end up
asserting. That is why you might not always see the assertion: the
condition 'created' is set to true for that worker, and it goes on to
reuse the origin updated by the previous worker.

So to fix this, I think we should either update the origin and slot entries in
the system catalog only after creation has succeeded, or clean up the
system catalog in case of failure. What do you think?

thanks
Shveta




Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-02-01 Thread Amit Kapila
On Wed, Feb 1, 2023 at 3:10 PM shiy.f...@fujitsu.com
 wrote:
>
> On Mon, Jan 30, 2023 6:05 PM Takamichi Osumi (Fujitsu) 
>  wrote:
> >
> > Kindly have a look at the attached v25.
> >
>
> Thanks for your patch. Here are some comments.
>
> 1.
> +   /*
> +* The min_apply_delay parameter is ignored until all tablesync 
> workers
> +* have reached READY state. This is because if we allowed the delay
> +* during the catchup phase, then once we reached the limit of 
> tablesync
> +* workers it would impose a delay for each subsequent worker. That 
> would
> +* cause initial table synchronization completion to take a long time.
> +*/
> +   if (!AllTablesyncsReady())
> +   return;
>
> I saw that the new parameter becomes effective after all tables are in ready
> state, because the apply worker can't set the state to catchup during the 
> delay.
> But can we call process_syncing_tables() in the while-loop of
> maybe_apply_delay()? Then the tablesync can finish without delay. If we can't 
> do
> so, it might be better to add some comments for it.
>

I think the point here is that if the apply worker is ahead of
tablesync worker then to complete the catch-up, tablesync worker needs
to apply additional transactions, and delaying during that time will
cause initial table synchronization completion to take a long time. I
am not sure how much more details can be added to the existing
comments.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] New [relation] option engine

2023-02-01 Thread vignesh C
On Wed, 1 Feb 2023 at 14:49, Alvaro Herrera  wrote:
>
> On 2023-Jan-31, vignesh C wrote:
>
> > On Tue, 3 Jan 2023 at 18:38, vignesh C  wrote:
>
> > > The patch does not apply on top of HEAD as in [1], please post a rebased 
> > > patch:
> > > === Applying patches on top of PostgreSQL commit ID
> > > 92957ed98c5c565362ce665266132a7f08f6b0c0 ===
> > > === applying patch ./new_options_take_two_v03f.patch
> > > patching file src/include/access/reloptions.h
> > > Hunk #1 FAILED at 1.
> > > 1 out of 4 hunks FAILED -- saving rejects to file
> > > src/include/access/reloptions.h.rej
> >
> > There has been no updates on this thread for some time, so this has
> > been switched as Returned with Feedback. Feel free to open it in the
> > next commitfest if you plan to continue on this.
>
> Well, no feedback has been given, so I'm not sure this is a great
> outcome.  In the interest of keeping it alive, I've rebased it.  It
> turns out that the only conflict is with the 2022 -> 2023 copyright line
> update.
>
> I have not reviewed it.

Since there was no response to rebase the patch, I was not sure if the
author was planning to continue. Anyway, Nikolay Shaplov plans to work
on this soon, so I have added this to the next commitfest at [1].
[1] - https://commitfest.postgresql.org/41/3536/

Regards,
Vignesh




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Tom Lane
Michael Paquier  writes:
> Hmm.  Isn't that something that we should also document in startup.c
> where both routines are defined?  If we begin to use
> PreRestoreCommand() and PostRestoreCommand() in more code paths in the
> future, that could be again an issue.

I was vaguely wondering about removing both of those functions
in favor of an integrated function that does a system() call
with those things before and after it.
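
Something like this, perhaps (a rough sketch of the idea with a made-up
function name, not code from any of the patches):

    static int
    RunShellCommandInterruptibly(const char *command)
    {
        int     rc;

        /* allow the startup process SIGTERM handler to proc_exit() right away */
        PreRestoreCommand();

        rc = system(command);

        PostRestoreCommand();

        return rc;
    }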

regards, tom lane




Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?

2023-02-01 Thread Michael Paquier
On Thu, Feb 02, 2023 at 10:06:15AM +0900, Michael Paquier wrote:
> Thanks for confirming.  I am wondering what these animals may complain
> about next, but based on some tests on this buildfarm host with the
> same configuration, things are looking OK once this stuff is applied
> on 11~14.

Actually, I completely forgot to take into account that there is a
minor release planned for next week:
https://www.postgresql.org/developer/roadmap/

So I'll hold on a bit longer here, until the next versions get their
tags.
--
Michael


signature.asc
Description: PGP signature


Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Michael Paquier
On Wed, Feb 01, 2023 at 02:35:55PM -0800, Nathan Bossart wrote:
> Here is a first draft for the proposed stopgap fix.  If we want to proceed
> with this, I can provide patches for the back branches.

> + /*
> +  * PreRestoreCommand() is used to tell the SIGTERM handler for the 
> startup
> +  * process that it is okay to proc_exit() right away on SIGTERM.  This 
> is
> +  * done for the duration of the system() call because there isn't a good
> +  * way to break out while it is executing.  Since we might call 
> proc_exit()
> +  * in a signal handler here, it is extremely important that nothing but 
> the
> +  * system() call happens between the calls to PreRestoreCommand() and
> +  * PostRestoreCommand().  Any additional code must go before or after 
> this
> +  * section.
> +  */
> + if (exitOnSigterm)
> + PreRestoreCommand();
> +
>   rc = system(command);
> +
> + if (exitOnSigterm)
> + PostRestoreCommand();
> +
>   pgstat_report_wait_end();

Hmm.  Isn't that something that we should also document in startup.c
where both routines are defined?  If we begin to use
PreRestoreCommand() and PostRestoreCommand() in more code paths in the
future, that could be again an issue.  That looks enough to me to
reduce the window back to what it was before 9a740f8, as exitOnSigterm
is only used for restore_command.  There is a different approach
possible here: rely more on wait_event_info rather than failOnSignal
and exitOnSigterm to decide which code path should do what.

Andres Freund wrote:
> - the error message for a failed restore command seems to have gotten
>   worse:
>   could not restore file \"%s\" from archive: %s"
>   ->
>   "%s \"%s\": %s", commandName, command

IMO, we don't lose any context with this method: the command type and
the command string itself are the bits actually relevant.  Perhaps
something like that would be more intuitive?  One idea:
"could not execute command for %s: %s", commandName, command

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

shell_restore.c does not sound that bad to me, FWIW.  The parallel
with the archive counterparts is here.  My recent history is not that
good when it comes to naming, based on the feedback I received,
though.

> And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

Yeah, BuildRestoreCommand() was just a small wrapper on top of the new
percentrepl.c, making it rather irrelevant at this stage, IMO.  For
the two code paths where it was called.

> - The comment moved out of RestoreArchivedFile() doesn't seems less
>   useful at its new location

We are talking about that:
-   /*
-* Remember, we rollforward UNTIL the restore fails so failure here is
-* just part of the process... that makes it difficult to determine
-* whether the restore failed because there isn't an archive to restore,
-* or because the administrator has specified the restore program
-* incorrectly.  We have to assume the former.
-*
-* However, if the failure was due to any sort of signal, it's best to
-* punt and abort recovery.  (If we "return false" here, upper levels will
-* assume that recovery is complete and start up the database!) It's
-* essential to abort on child SIGINT and SIGQUIT, because per spec
-* system() ignores SIGINT and SIGQUIT while waiting; if we see one of
-* those it's a good bet we should have gotten it too.
-*
-* On SIGTERM, assume we have received a fast shutdown request, and exit
-* cleanly. It's pure chance whether we receive the SIGTERM first, or the
-* child process. If we receive it first, the signal handler will call
-* proc_exit, otherwise we do it here. If we or the child process received
-* SIGTERM for any other reason than a fast shutdown request, postmaster
-* will perform an immediate shutdown when it sees us exiting
-* unexpectedly.
-*
-* We treat hard shell errors such as "command not found" as fatal, too.
-*/

The failure processing is stuck within the way we build and handle the
command given down to system(), so keeping this in shell_restore.c (or
whatever name you think would be a better fit) makes sense to me.
Now, thinking a bit more about this, we could just push the description
down to ExecuteRecoveryCommand(), which actually does the work,
adapting the comment based on the refactored internals of the
routine.
--
Michael


signature.asc
Description: PGP signature


Fwd: pgsql: Remove over-optimistic Assert.

2023-02-01 Thread Richard Guo
Resend this email to -hackers.  Sorry for the noise.

Thanks
Richard

-- Forwarded message -
From: Richard Guo 
Date: Thu, Feb 2, 2023 at 9:51 AM
Subject: Re: pgsql: Remove over-optimistic Assert.
To: Tom Lane 
Cc: , PostgreSQL-development <
pgsql-hack...@postgresql.org>



On Thu, Feb 2, 2023 at 8:40 AM Tom Lane  wrote:

> Remove over-optimistic Assert.
>
> In commit 2489d76c4, I'd thought it'd be safe to assert that a
> PlaceHolderVar appearing in a scan-level expression has empty
> nullingrels.  However this is not so, as when we determine that a
> join relation is certainly empty we'll put its targetlist into a
> Result-with-constant-false-qual node, and nothing is done to adjust
> the nullingrels of the Vars or PHVs therein.  (Arguably, a Result
> used in this way isn't really a scan-level node, but it certainly
> isn't an upper node either ...)


It seems this is the only case we can have PlaceHolderVar with non-empty
nullingrels at scan level.  So I wonder if we can manually adjust the
nullingrels of PHVs in this special case, and keep the assertion about
phnullingrels being NULL in fix_scan_expr.  I think that assertion is
asserting the right thing in most cases.  It's a pity to lose it.

Currently for the tlist of a childless Result, we special-case ROWID_VAR
Vars in set_plan_refs and thus keep assertions about varno != ROWID_VAR
in fix_scan_expr.  Do you think we can special-case PHVs at the same
place by setting its phnullingrels to NULL?  I'm imagining something
like attached.

Thanks
Richard
From 41fc3b6855e5b6e525ec46a276a0892a321045e0 Mon Sep 17 00:00:00 2001
From: Richard Guo 
Date: Thu, 2 Feb 2023 09:06:53 +0800
Subject: [PATCH v1] Adjust phnullingrels for childless Result

---
 src/backend/optimizer/plan/setrefs.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 186fc8014b..f6d8eaf1df 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1045,16 +1045,32 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	 * expected to occur here, it seems safer to special-case
 	 * it here and keep the assertions that ROWID_VARs
 	 * shouldn't be seen by fix_scan_expr.
+	 *
+	 * The tlist of a childless Result could contain PHVs with
+	 * non-empty phnullingrels, in case it's a dummy plan
+	 * generated for empty join relation.  Special-case it here
+	 * by setting phnullingrels to NULL and keep the assertion
+	 * about phnullingrels being NULL in fix_scan_expr.
 	 */
 	foreach(l, splan->plan.targetlist)
 	{
 		TargetEntry *tle = (TargetEntry *) lfirst(l);
-		Var		   *var = (Var *) tle->expr;
 
-		if (var && IsA(var, Var) && var->varno == ROWID_VAR)
-			tle->expr = (Expr *) makeNullConst(var->vartype,
-			   var->vartypmod,
-			   var->varcollid);
+		if (tle->expr && IsA(tle->expr, Var))
+		{
+			Var		   *var = (Var *) tle->expr;
+
+			if (var->varno == ROWID_VAR)
+tle->expr = (Expr *) makeNullConst(var->vartype,
+   var->vartypmod,
+   var->varcollid);
+		}
+		else if (tle->expr && IsA(tle->expr, PlaceHolderVar))
+		{
+			PlaceHolderVar *phv = (PlaceHolderVar *) tle->expr;
+
+			phv->phnullingrels = NULL;
+		}
 	}
 
 	splan->plan.targetlist =
@@ -2205,7 +2221,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		/* At scan level, we should always just evaluate the contained expr */
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
-		/* XXX can we assert something about phnullingrels? */
+		Assert(phv->phnullingrels == NULL);
 		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
 	}
 	if (IsA(node, AlternativeSubPlan))
-- 
2.31.0



Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Michael Paquier
On Wed, Feb 01, 2023 at 10:18:27AM -0800, Andres Freund wrote:
> On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>> The main thing that system() brings to the table is platform-specific
>> knowledge of where the shell is.  I'm not very sure that we want to
>> wire in "/bin/sh".
> 
> We seem to be doing OK with using SHELLPROG in pg_regress, which just
> seems to be using $SHELL from the build environment.

It looks like this had better centralize a bit more of the logic from
pg_regress.c if that were to happen, to avoid more fuzzy logic with
WIN32.  That becomes invasive for a back-patch.

By the way, there is something that's itching me a bit here.  9a740f8
has greatly enlarged the window between PreRestoreCommand() and
PostRestoreCommand(); however, curculio has reported a failure on
REL_15_STABLE, where we only manipulate my_wait_event_info while the
flag is on.  Or am I getting it right that there is no way out of it
unless we remove the dependency on system() even in the back-branches?
Could there be an extra missing piece here?
--
Michael


signature.asc
Description: PGP signature


Re: CI and test improvements

2023-02-01 Thread Tom Lane
Thomas Munro  writes:
> Some observations:

> * macOS has a new release every year in June[1]
> * updates cease after three years[1]
> * thus three releases are in support (by that definition) at a time
> * we need an image on Cirrus; 13 appeared ~1 month later[2]
> * we need Homebrew support; 13 appeared ~3 months later[3]
> * we have 13 and 12 in the buildfarm, but no 11
> * it's common for developers but uncommon for servers/deployment

> So what should our policy be on when to roll the CI image forward?  I
> guess around New Year/now (~6 months after release) is a good time and
> we should just do it.  Anyone got a reason why we should wait?  Our
> other CI OSes have slower major version release cycles and longer
> lives, so it's not quite the same hamster wheel of upgrades.

I'd argue that developers are probably the kind of people who update
their OS sooner rather than later --- I've usually updated my laptop
and at least one BF animal to $latest macOS within a month or so of
the dot-zero release.  So waiting 6 months seems to me like CI will be
behind the users, which will be unhelpful.  I'd rather drop the oldest
release sooner, if we need to hold down the number of macOS revisions
under test.

regards, tom lane




Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?

2023-02-01 Thread Michael Paquier
On Wed, Feb 01, 2023 at 03:06:16PM +1300, Thomas Munro wrote:
> +1, go for it.  It shouldn't affect Unix build releases, and on
> Windows the function does nothing.

Thanks for confirming.  I am wondering what these animals may complain
about next, but based on some tests on this buildfarm host with the
same configuration, things are looking OK once this stuff is applied
on 11~14.
--
Michael


signature.asc
Description: PGP signature


Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
David Rowley  writes:
> Digging into the history a bit, I found [2] and particularly [3] that
> seem to indicate this option was thought about due to concerns about
> hash functions not returning consistent results on different
> architectures. I suspect it might have been defaulted to load into the
> leaf partitions for performance reasons, however. I mean, why else
> would you?
> [3] 
> https://www.postgresql.org/message-id/CA%2BTgmoZFn7TJ7QBsFatnuEE%3DGYGdZSNXqr9489n5JBsdy5rFfA%40mail.gmail.com

Hah ... so we went over almost exactly this same ground in 2017.
The consensus at that point seemed to be that actual problems
would be rare enough that we'd not need to impose the overhead
of --load-via-partition-root by default.  Now, AFAICS that was
based on exactly zero hard evidence, as to either the frequency
of actual problems or the cost of --load-via-partition-root.
Our optimism about the former seems to have been mostly borne out,
given the lack of complaints since then; but I still think our
pessimism about the latter is on shaky grounds.

Anyway, after re-reading the old thread I wonder if my first instinct
(force --load-via-partition-root for enum hash cases only) was the
best compromise after all.  I'm not sure how painful it is to get
pg_dump to detect such cases, but it's probably possible.
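
For reference, a minimal catalog query along these lines could identify the
affected tables (a sketch only; it ignores expression partition keys and
domains over enum types):

SELECT pt.partrelid::regclass AS hash_partitioned_on_enum
FROM pg_partitioned_table pt
JOIN pg_attribute a ON a.attrelid = pt.partrelid
                   AND a.attnum = ANY (pt.partattrs)
JOIN pg_type ty ON ty.oid = a.atttypid
WHERE pt.partstrat = 'h'    -- hash partitioning
  AND ty.typtype = 'e';     -- partition key column is an enum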

regards, tom lane




Re: CI and test improvements

2023-02-01 Thread Thomas Munro
On Fri, Dec 30, 2022 at 4:59 PM Thomas Munro  wrote:
> On Wed, Nov 23, 2022 at 11:57 AM Justin Pryzby  wrote:
> > [PATCH 03/10] cirrus/macos: update to macos ventura
>
> I don't know any reason not to push this one too, but it's not time critical.

Some observations:

* macOS has a new release every year in June[1]
* updates cease after three years[1]
* thus three releases are in support (by that definition) at a time
* we need an image on Cirrus; 13 appeared ~1 month later[2]
* we need Homebrew support; 13 appeared ~3 months later[3]
* we have 13 and 12 in the buildfarm, but no 11
* it's common for developers but uncommon for servers/deployment

So what should our policy be on when to roll the CI image forward?  I
guess around New Year/now (~6 months after release) is a good time and
we should just do it.  Anyone got a reason why we should wait?  Our
other CI OSes have slower major version release cycles and longer
lives, so it's not quite the same hamster wheel of upgrades.

[1] https://en.wikipedia.org/wiki/MacOS_version_history#Releases
[2] https://github.com/orgs/cirruslabs/packages?tab=packages=macos
[3] https://brew.sh/2022/09/07/homebrew-3.6.0/




Re: pg_dump versus hash partitioning

2023-02-01 Thread David Rowley
On Thu, 2 Feb 2023 at 11:38, Tom Lane  wrote:
>
> Peter Geoghegan  writes:
> > You mentioned "minor releases" here. Who said anything about that?
>
> I did: I'd like to back-patch the fix if possible.  I think changing
> the default --load-via-partition-root choice could be back-patchable.
>
> If Robert is resistant to that but would accept it in master,
> I'd settle for that in preference to having no fix.

I'm not sure it'll help the discussion any, but on master, the
performance gap between using --load-via-partition-root and not using
it should be somewhat closed due to 3592e0ff9. So using that is likely
not as terrible as it once was.

[1] does not show results for inserting directly into partitions, but
the benchmark results do show that performance is better than it was
without the caching. The order that pg_dump outputs the rows should
mean the cache is hit most of the time for RANGE partitioned tables,
at least, and likely more often than not for LIST. HASH partitioning
is not really affected by that commit. The idea there is that it's
probably as cheap to hash as it is to do an equality check with the
last Datum.

Digging into the history a bit, I found [2] and particularly [3] that
seem to indicate this option was thought about due to concerns about
hash functions not returning consistent results on different
architectures. I suspect it might have been defaulted to load into the
leaf partitions for performance reasons, however. I mean, why else
would you?

David

[1] 
https://www.postgresql.org/message-id/CAApHDvqFeW5hvQqprXOLuGMMJSf%2B1C%2BWk4w_L-M03sVduF3oYg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAGPqQf0C1he087bz9xRBOGZBuESYz9X=fp8ca_g+tfhgaff...@mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CA%2BTgmoZFn7TJ7QBsFatnuEE%3DGYGdZSNXqr9489n5JBsdy5rFfA%40mail.gmail.com




Re: Perform streaming logical transactions by background workers and parallel apply

2023-02-01 Thread Peter Smith
Some minor review comments for v91-0001

==
doc/src/sgml/config.sgml

1.

-Allows streaming or serializing changes immediately in
logical decoding.
-The allowed values of logical_replication_mode are
-buffered and immediate. When set
-to immediate, stream each change if
+The allowed values are buffered and
+immediate. The default is
buffered.
+This parameter is intended to be used to test logical decoding and
+replication of large transactions for which otherwise we need
to generate
+the changes till logical_decoding_work_mem is
+reached.  The effect of logical_replication_mode is
+different for the publisher and subscriber:
+   

The "for which otherwise..." part is only relevant for the
publisher-side. So it seemed slightly strange to give the reason why
to use the GUC for one side but not the other side.

Maybe we can just to remove that "for which otherwise..." part, since
the logical_decoding_work_mem gets mentioned later in the "On the
publisher side,..." paragraph anyway.

~~~

2.

-This parameter is intended to be used to test logical decoding and
-replication of large transactions for which otherwise we need to
-generate the changes till logical_decoding_work_mem
-is reached.
+On the subscriber side, if the streaming
option is set to
+parallel,
logical_replication_mode
+can be used to direct the leader apply worker to send changes to the
+shared memory queue or to serialize changes to the file.  When set to
+buffered, the leader sends changes to parallel apply
+workers via a shared memory queue.  When set to
+immediate, the leader serializes all
changes to files
+and notifies the parallel apply workers to read and apply them at the
+end of the transaction.


"or serialize changes to the file." --> "or serialize all changes to
files." (just to use same wording as later in this same paragraph, and
also same wording as the GUC hint text).
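
For illustration, a subscriber-side sketch using the GUC and option names
from the patch under review (so subject to change) might look like:

-- create a subscription that uses parallel apply for streamed transactions
CREATE SUBSCRIPTION test_sub
  CONNECTION 'host=publisher dbname=postgres'
  PUBLICATION test_pub
  WITH (streaming = parallel);

-- make the leader apply worker serialize streamed changes to files instead
-- of sending them to the parallel workers' shared memory queues
ALTER SYSTEM SET logical_replication_mode = 'immediate';
SELECT pg_reload_conf();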

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 2:49 PM Tom Lane  wrote:
> It's precisely because you want to analyze it in the same terms
> as range/list partitioning that we have these issues.  Or we could
> have built it on some other infrastructure than hash index opclasses
> ... but we didn't do that, and now we have a mess.  I don't see a way
> out other than relaxing the guarantees about how hash partitioning
> works compared to the other kinds.

I suspect that using hash index opclasses was the right design --
sticking to the same approach to hashing seems valuable. I agree with
your overall conclusion, though, since it doesn't seem sensible to
allow hashing behavior to ever be anything greater than an
implementation detail. On general principle. What happens when a hash
function is discovered to have a huge flaw, as happened a couple of
times before now?

It's just the same with collations, where a particular collation
shouldn't be expected to have perfectly stable behavior across a dump
and reload. While admitting that possibility does open the door to
problems, in particular problems when range partitioning is in use,
those problems at least make sense. And they probably won't come up
very often -- collation updates don't often contain enormous
gratuitous differences that are liable to create dump/reload hazards
with range partitioning. It is the least worst approach, overall. In
theory, and in practice.

-- 
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 4:12 PM Tom Lane  wrote:
>> That being the case, I don't think moving the goalposts for hash
>> function stability is going to lead to a workable solution.

> I don't see that there is any easy, clean way to solve this in
> released branches. The idea that I proposed could be implemented in
> master, and I think it is the right kind of fix, but it is not
> back-patchable.

You waved your arms about inventing some new hashing infrastructure,
but it was phrased in such a way that it wasn't clear to me if that
was actually a serious proposal or not.  But if it is: how will you
get around the fact that any change to hashing behavior will break
pg_upgrade of existing hash-partitioned tables?  New infrastructure
avails nothing if it has to be bug-compatible with the old.  So I'm
not sure how restricting the fix to master helps us.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread David G. Johnston
On Wed, Feb 1, 2023 at 3:38 PM Tom Lane  wrote:

> Peter Geoghegan  writes:
> > You mentioned "minor releases" here. Who said anything about that?
>
> I did: I'd like to back-patch the fix if possible.  I think changing
> the default --load-via-partition-root choice could be back-patchable.
>
> If Robert is resistant to that but would accept it in master,
> I'd settle for that in preference to having no fix.
>

 I'm for accepting --load-via-partition-root as the solution to this problem.
I think it is better than doing nothing and, at the moment, I don't see any
alternatives to pick from.

As evidenced from the current influx of collation problems related to
indexes, we would be foolish to discount the collation issues with just
plain text, so limiting this only to the enum case (which is a must-have)
doesn't seem wise.

pg_dump should be conservative in what it produces - which in this
situation means having minimal environmental dependencies and internal
volatility.

In the interest of compatibility, having an escape hatch to do things as
they are done today is something we should provide.  We got this one wrong
and that is going to cause some pain.  Though at least with the escape
hatch we shouldn't be dealing with as many unresolvable complaints as when
we back-patched removing the public schema from search_path.

In the worst case, being conservative, the user can always at minimum
restore their dump file into a local database and get access to their data
in a usable format with minimal hassle.  The few that would like "same
table name or bust" behavior because of external or application-specific
requirements can still get that behavior.

David J.


Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 5:08 PM Tom Lane  wrote:
>> I can agree with that argument for range or list partitioning, where
>> the partitions have some semantic meaning to the user.  I don't buy it
>> for hash partitioning.  It was an implementation artifact to begin
>> with that a given row ended up in partition 3 not partition 11, so why
>> would users care which partition it ends up in after a dump/reload?
>> If they think there is a difference between the partitions, they need
>> education.

> I see your point. I think it's somewhat valid. However, I also think
> it muddies the definition of what pg_dump is allowed to do in a way
> that I do not like. I think there's a difference between the CTID or
> XMAX of a tuple changing and it ending up in a totally different
> partition. It feels like it makes the definition of correctness
> subjective: we do think that people care about range and list
> partitions as individual entities, so we'll put the rows back where
> they were and complain if we can't, but we don't think they think
> about hash partitions that way, so we will err on the side of making
> the dump restoration succeed. That's a level of guessing what the user
> meant that I think is uncomfortable.

I see your point too, and to me it's evidence for the position that
we should never have done hash partitioning in the first place.
It's precisely because you want to analyze it in the same terms
as range/list partitioning that we have these issues.  Or we could
have built it on some other infrastructure than hash index opclasses
... but we didn't do that, and now we have a mess.  I don't see a way
out other than relaxing the guarantees about how hash partitioning
works compared to the other kinds.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Peter Geoghegan  writes:
> You mentioned "minor releases" here. Who said anything about that?

I did: I'd like to back-patch the fix if possible.  I think changing
the default --load-via-partition-root choice could be back-patchable.

If Robert is resistant to that but would accept it in master,
I'd settle for that in preference to having no fix.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 5:08 PM Tom Lane  wrote:
> > Here, you'd like to argue that it's perfectly
> > fine if we instead insert some of the rows into different tables than
> > where they were on the original system.
>
> I can agree with that argument for range or list partitioning, where
> the partitions have some semantic meaning to the user.  I don't buy it
> for hash partitioning.  It was an implementation artifact to begin
> with that a given row ended up in partition 3 not partition 11, so why
> would users care which partition it ends up in after a dump/reload?
> If they think there is a difference between the partitions, they need
> education.

I see your point. I think it's somewhat valid. However, I also think
it muddies the definition of what pg_dump is allowed to do in a way
that I do not like. I think there's a difference between the CTID or
XMAX of a tuple changing and it ending up in a totally different
partition. It feels like it makes the definition of correctness
subjective: we do think that people care about range and list
partitions as individual entities, so we'll put the rows back where
they were and complain if we can't, but we don't think they think
about hash partitions that way, so we will err on the side of making
the dump restoration succeed. That's a level of guessing what the user
meant that I think is uncomfortable. I feel like when somebody around
here discovers that sort of reasoning in some other software's code,
or in a proposed patch, it pretty often gets blasted on this mailing
list with considerable vigor.

I think you can construct plausible cases where it's not just
academic. For instance, suppose I intend to use some kind of logical
replication system, not necessarily the one built into PostgreSQL, to
replicate data between two systems. Before engaging that system, I
need to make the initial database contents match. The system is
oblivious to partitioning, and just replicates each table to a table
with a matching name. Well, if the partitions don't actually match up
1:1, I kind of need to know about that. In this use case, the rows
silently moving around doesn't meet my needs. Or, suppose I dump and
restore two databases. It works perfectly. I then run a comparison
tool of some sort that compares the two databases. EDB has such a
tool! I don't know whether it would perform the comparison via the
partition root or not, because I don't know how it works. But I find
it pretty plausible that some such tool would show differences between
the source and target databases. Now, if I had done the data migration
using --load-with-partition-root, I would expect that. I might even be
looking for it, to see what got moved around. But otherwise, it might
be unexpected.

Another subtle problem with this whole situation is: suppose that on
host A, I set up a table hash-partitioned by an enum column and make a
bunch of hash partitions. Then, on host B, I set up the same table
with a bunch of foreign table partitions, each corresponding to the
matching partition on the other node. I guess that just doesn't work,
whereas if the column were of any other data type, it would work. If
it were a collatable data type, it would need the collations and
collation definitions to match, too.
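
A sketch of that setup, with illustrative names (postgres_fdw assumed on
host B):

-- both hosts
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE t (id int, m mood) PARTITION BY HASH (m);

-- host A: local hash partitions
CREATE TABLE t_p0 PARTITION OF t FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES WITH (MODULUS 2, REMAINDER 1);

-- host B: foreign tables as the matching partitions (one shown)
CREATE FOREIGN TABLE t_p0 PARTITION OF t
  FOR VALUES WITH (MODULUS 2, REMAINDER 0)
  SERVER host_a OPTIONS (table_name 't_p0');
-- hashenum() hashes the enum value's OID, which need not be the same on the
-- two hosts, so host B can route a row to a different partition than host A.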

I know these things are subtle, and maybe these specific things will
never happen to anyone, or nobody will care. I don't know. I just have
a really hard time accepting a categorical statement that this
just can't ever matter to anyone.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 09:58:06AM -0800, Nathan Bossart wrote:
> On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
>> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>>> The fundamental issue is that we have no good way to break out
>>> of system(), and I think the original idea was that
>>> in_restore_command would be set *only* for the duration of the
>>> system() call.  That's clearly been lost sight of completely,
>>> but maybe as a stopgap we could try to get back to that.
>> 
>> We could push the functions setting in_restore_command down into
>> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
>> being right either - we'd now use the mechanism in places we previously
>> didn't (cleanup/end commands).
> 
> Right, we'd only want to set it for restore_command.  I think that's
> doable.

Here is a first draft for the proposed stopgap fix.  If we want to proceed
with this, I can provide patches for the back branches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c703a9f7ac6c43e65fc32117980495ac7980e2e3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 1 Feb 2023 14:32:02 -0800
Subject: [PATCH v1 1/1] stopgap fix for restore_command

---
 src/backend/access/transam/shell_restore.c | 22 --
 src/backend/access/transam/xlogarchive.c   |  7 ---
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/shell_restore.c b/src/backend/access/transam/shell_restore.c
index 8458209f49..abec023c1a 100644
--- a/src/backend/access/transam/shell_restore.c
+++ b/src/backend/access/transam/shell_restore.c
@@ -21,6 +21,7 @@
 #include "access/xlogarchive.h"
 #include "access/xlogrecovery.h"
 #include "common/percentrepl.h"
+#include "postmaster/startup.h"
 #include "storage/ipc.h"
 #include "utils/wait_event.h"
 
@@ -124,8 +125,7 @@ shell_recovery_end(const char *lastRestartPointFileName)
  * human-readable name describing the command emitted in the logs. If
  * 'failOnSignal' is true and the command is killed by a signal, a FATAL
  * error is thrown. Otherwise, 'fail_elevel' is used for the log message.
- * If 'exitOnSigterm' is true and the command is killed by SIGTERM, we exit
- * immediately.
+ * If 'exitOnSigterm' is true and SIGTERM is received, we exit immediately.
  *
  * Returns whether the command succeeded.
  */
@@ -146,7 +146,25 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	 */
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
+
+	/*
+	 * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
+	 * process that it is okay to proc_exit() right away on SIGTERM.  This is
+	 * done for the duration of the system() call because there isn't a good
+	 * way to break out while it is executing.  Since we might call proc_exit()
+	 * in a signal handler here, it is extremely important that nothing but the
+	 * system() call happens between the calls to PreRestoreCommand() and
+	 * PostRestoreCommand().  Any additional code must go before or after this
+	 * section.
+	 */
+	if (exitOnSigterm)
+		PreRestoreCommand();
+
 	rc = system(command);
+
+	if (exitOnSigterm)
+		PostRestoreCommand();
+
 	pgstat_report_wait_end();
 
 	if (rc != 0)
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 4b89addf97..66312c816b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -147,18 +147,11 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	else
 		XLogFileName(lastRestartPointFname, 0, 0L, wal_segment_size);
 
-	/*
-	 * Check signals before restore command and reset afterwards.
-	 */
-	PreRestoreCommand();
-
 	/*
 	 * Copy xlog from archival storage to XLOGDIR
 	 */
 	ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);
 
-	PostRestoreCommand();
-
 	if (ret)
 	{
 		/*
-- 
2.25.1



Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 2:12 PM Robert Haas  wrote:
> On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan  wrote:
> > This is a misrepresentation of Tom's words. It isn't actually
> > self-evident what "we end up with all of the same objects, each
> > defined in the same way, and that all of the tables end up with all
> > the same contents that they had before" actually means here, in
> > general. Tom's main concern seems to be just that -- the ambiguity
> > itself.
>
> As far as I can see, Tom didn't admit that there was any ambiguity.

Tom said:

"I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning.  You would notice an issue if you tried to do a
selective restore of just one partition --- but under what
circumstance would that be a useful thing to do?"

While the word ambiguity may not have actually been used, Tom very
clearly admitted some ambiguity. But even if he didn't, so what? It's
perfectly obvious that that's the major underlying issue, and that
this is a high level problem rather than a low level problem.

> He just said that I was advocating for wrong behavior for the sake of
> performance. I don't think that is what I was doing. I also don't
> really think Tom thinks that that is what I was doing. But it is what
> he said I was doing.

And I think that you're making a mountain out of a molehill.

> I do agree with you that the ambiguity is the root of the issue. I
> mean, if we can't put the rows back into the same partitions where
> they were before, does the user care about that, or do they only care
> that the rows end up in some partition of the toplevel partitioned
> table?

That was precisely the question that Tom posed to you, in the same
email as the one that you found objectionable.

> Tom, as I understand it, is arguing that the
> --load-via-partition-root behavior has negligible downsides and is
> almost categorically better than the current default behavior, and
> thus making that the new default in some or all situations in a minor
> release is totally fine.

I don't know why you seem to think that it was such an absolutist
position as that.

You mentioned "minor releases" here. Who said anything about that?

-- 
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> Tom, as I understand it, is arguing that the
> --load-via-partition-root behavior has negligible downsides and is
> almost categorically better than the current default behavior, and
> thus making that the new default in some or all situations in a minor
> release is totally fine.

I think it's categorically better than a failed restore.  I wouldn't
be proposing this if there were no such problem; but there is,
and I don't buy your apparent position that we should leave affected
users to cope as best they can.  Yes, it's clearly only a minority
of users that are affected, else we'd have heard complaints before.
But it could be absolutely catastrophic for an affected user,
if they're trying to restore their only backup.  I'd rather impose
an across-the-board cost on all users of hash partitioning than
risk such outcomes for a few.

Also, you've really offered no evidence for your apparent position
that --load-via-partition-root has unacceptable overhead.  We've
done enough work on partition routing over the last few years that
whatever measurements might've originally justified that idea
don't necessarily apply anymore.  Admittedly, I've not measured
it either.  But we don't tell people to avoid partitioning because
INSERT is unduly expensive.  Partition routing is just the cost of
doing business in that space.

regards, tom lane




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Aleksander Alekseev
Hi,

Here are my two cents.

> the minimum version appears to be newer than RHEL8's 1.8.2,
> which I find pretty unfortunate.  On RHEL8, it fails with

> $ ninja
> ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
> depslog; bring this up on the mailing list if it affects you

> [...]  I don't have hard data on which distros have which
> versions of ninja, but surely somebody checked that at some point?

I'm using three different systems at the moment and the minimum
version of Ninja that is known to work is 1.10.1.

> Normally the ninja version that's pulled in by meson should suffice.

There are several ways to install Meson one of which, if you want the
latest version, is just using PIP:

```
pip3 install --user meson
```

Naturally Ninja will not be pulled in this case.

-- 
Best regards,
Aleksander Alekseev




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 4:44 PM Peter Geoghegan  wrote:
> This is a misrepresentation of Tom's words. It isn't actually
> self-evident what "we end up with all of the same objects, each
> defined in the same way, and that all of the tables end up with all
> the same contents that they had before" actually means here, in
> general. Tom's main concern seems to be just that -- the ambiguity
> itself.

As far as I can see, Tom didn't admit that there was any ambiguity. He
just said that I was advocating for wrong behavior for the sake of
performance. I don't think that is what I was doing. I also don't
really think Tom thinks that that is what I was doing. But it is what
he said I was doing.

I do agree with you that the ambiguity is the root of the issue. I
mean, if we can't put the rows back into the same partitions where
they were before, does the user care about that, or do they only care
that the rows end up in some partition of the toplevel partitioned
table? I think that there's no single definition of correctness that
is the only defensible one here, and we can't know which one the user
wants a priori. I also think that they can care which one they're
getting, and thus I think that changing the default in a minor release
is a bad plan. Tom, as I understand it, is arguing that the
--load-via-partition-root behavior has negligible downsides and is
almost categorically better than the current default behavior, and
thus making that the new default in some or all situations in a minor
release is totally fine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> It seems to me that the job of pg_dump is to produce a dump that, when
> reloaded on another system, recreates the same database state. That
> means that we end up with all of the same objects, each defined in the
> same way, and that all of the tables end up with all the same contents
> that they had before.

No, its job is to produce *logically* the same state.  We don't expect
it to produce, say, the same CTID value that each row had before ---
if you want that, you use pg_upgrade or physical replication, not
pg_dump.  There are fairly strong constraints on how identical we
can make the results given that we're not replicating physical state.
And that's not even touching the point that frequently what the user
is after is not an identical copy anyway.

> Here, you'd like to argue that it's perfectly
> fine if we instead insert some of the rows into different tables than
> where they were on the original system.

I can agree with that argument for range or list partitioning, where
the partitions have some semantic meaning to the user.  I don't buy it
for hash partitioning.  It was an implementation artifact to begin
with that a given row ended up in partition 3 not partition 11, so why
would users care which partition it ends up in after a dump/reload?
If they think there is a difference between the partitions, they need
education.

> ... But just as we normally prioritize correctness over speed, so
> also do we normally throw errors when things aren't right instead of
> silently accepting bad input. The partitions in this scenario are
> tables that have constraints.

Again, for hash partitioning those constraints are implementation
artifacts not something that users should have any concern with.
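
To see the kind of constraint being discussed (partition name hypothetical):

SELECT pg_get_partition_constraintdef('t_p3'::regclass);
--   e.g.  satisfies_hash_partition('16384'::oid, 8, 3, m)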

> Keep in mind that there's no rule that a user can't
> query a partition directly.

Sure, and in the case of a hash partition, what he is going to
get is an implementation-dependent subset of the rows.  I use
"implementation-dependent" here in the same sense that the SQL
standard does, which is "there is a rule but the implementation
doesn't have to tell you what it is".  In particular, we are not
bound to make the subset be the same on different installations.
We already didn't promise that, because of issues like endianness.

regards, tom lane




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 4:12 PM Tom Lane  wrote:
> > I don't think the fact that our *traditional* standard for how stable
> > a hash function needs to be has been XYZ carries any water.
>
> Well, it wouldn't need to if we had a practical way of changing the
> behavior of an existing hash function, but guess what: we don't.
> Andrew's original proposal for fixing this was exactly to change the
> behavior of hashenum().  There were some problems with the idea of
> depending on enumsortorder instead of enum OID, but the really
> fundamental issue is that you can't change hashing behavior without
> breaking pg_upgrade completely.  Not only will your hash indexes be
> corrupt, but your hash-partitioned tables will be broken, in exactly
> the same way that we're trying to solve for dump/reload cases (which
> of course will *also* be broken by redefining the hash function, if
> you didn't use --load-via-partition-root).  Moreover, while we can
> always advise people to reindex, there's no similarly easy way to fix
> broken partitioning.
>
> That being the case, I don't think moving the goalposts for hash
> function stability is going to lead to a workable solution.

I don't see that there is any easy, clean way to solve this in
released branches. The idea that I proposed could be implemented in
master, and I think it is the right kind of fix, but it is not
back-patchable. However, I think your argument rests on the premise
that making --load-via-partition-root the default behavior in some or
all cases will not break anything for anyone, and I'm skeptical. I
think that's a significant behavior change and that some people will
notice, and some will find it an improvement while others will find it
worse than the current behavior. I also think that there must be a lot
more people using partitioning in general, and even hash partitioning
specifically, than there are people using hash partitioning on an enum
column.

Personally, I would rather disallow this case in the back-branches --
i.e. make pg_dump barf if it is encountered and block CREATE TABLE
from setting up any new situations of this type -- than foist
--load-via-partition-root on many people who aren't affected by the
issue. I'm not saying that's a great answer, but we have to pick from
the choices we have.

I also don't accept that if someone has hit this issue they are just
hosed and there's no way out. Yeah, it's not a lot of fun: you
probably have to use "CREATE TABLE unpartitioned AS SELECT * FROM
borked; DROP TABLE borked;" or so to rescue your data. But what would
we do if we discovered that the btree opclass sorts 1 before 0, or
something? Surely we wouldn't refuse to fix the opclass just because
some users have existing indexes on disk that would be out of order
with the new opclass definition. We'd just change it and people would
have to deal. People with indexes would need to reindex. People with
partitioning boundaries between 0 and 1 would need to repartition.
This case isn't the same because hashenum() isn't broken in general,
just for this particular purpose. But I think you're trying to argue
that we should fix this by changing something other than the thing
that is broken, and I don't agree with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 12:39 PM Robert Haas  wrote:
> I don't think the fact that our *traditional* standard for how stable
> a hash function needs to be has been XYZ carries any water. Needs
> change over time, and we adapt the code to meet the new needs. Since
> we have no system for type properties in PostgreSQL -- a design
> decision I find questionable -- we tie all such properties to operator
> classes.

Are you familiar with B-Tree opclass support function 4, equalimage?
It's used to determine whether a B-Tree index can use deduplication at
CREATE INDEX time. ISTM that the requirements are rather similar here
-- perhaps even identical.

See: https://www.postgresql.org/docs/devel/btree-support-funcs.html
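
For anyone following along, the opclasses that provide it are visible in the
catalogs, e.g.:

-- list B-Tree opclasses that supply support function 4 (BTEQUALIMAGE_PROC)
SELECT opc.opcname, amp.amproc
FROM pg_amproc amp
JOIN pg_opclass opc ON opc.opcfamily = amp.amprocfamily
                   AND opc.opcintype = amp.amproclefttype
JOIN pg_am am ON am.oid = opc.opcmethod
WHERE am.amname = 'btree'
  AND amp.amprocnum = 4;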

-- 
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 1:14 PM Robert Haas  wrote:
> It seems to me that the job of pg_dump is to produce a dump that, when
> reloaded on another system, recreates the same database state. That
> means that we end up with all of the same objects, each defined in the
> same way, and that all of the tables end up with all the same contents
> that they had before. Here, you'd like to argue that it's perfectly
> fine if we instead insert some of the rows into different tables than
> where they were on the original system. Under normal circumstances, of
> course, we wouldn't consider any such thing, because then we would not
> be faithfully replicating the database state, which would be
> incorrect. But here you want to argue that it's OK to create a
> different database state because trying to recreate the same one would
> produce an error and the user might not like getting an error so let's
> just do something else instead and not even bother telling them.

This is a misrepresentation of Tom's words. It isn't actually
self-evident what "we end up with all of the same objects, each
defined in the same way, and that all of the tables end up with all
the same contents that they had before" actually means here, in
general. Tom's main concern seems to be just that -- the ambiguity
itself.

If there was a fully worked out idea of what that would mean, then I
suspect it would be quite subtle and complicated -- it's an inherently
tricky area. You seem to be saying that the way that this stuff
currently works is correct by definition, except when it isn't.

-- 
Peter Geoghegan




Re: recovery modules

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 01:06:06PM -0800, Andres Freund wrote:
> On 2023-02-01 12:15:29 -0800, Nathan Bossart wrote:
>> Here's a new patch set in which I've attempted to address this feedback and
>> Michael's feedback.
> 
> Looks better!

Thanks!

>> @@ -25,12 +34,14 @@ extern PGDLLIMPORT char *XLogArchiveLibrary;
>>   * For more information about the purpose of each callback, refer to the
>>   * archive modules documentation.
>>   */
>> -typedef bool (*ArchiveCheckConfiguredCB) (void);
>> -typedef bool (*ArchiveFileCB) (const char *file, const char *path);
>> -typedef void (*ArchiveShutdownCB) (void);
>> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
>> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
>> +typedef bool (*ArchiveFileCB) (const char *file, const char *path, 
>> ArchiveModuleState *state);
>> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);
> 
> Personally I'd always pass ArchiveModuleState *state as the first arg,
> but it's not important.

Yeah, that's nicer.  cfbot is complaining about a missing #include, so I
need to send a new revision anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e54ed061b772e7f0859a37612ed1effebc5c6ba1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH v4 1/3] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3c714a79c6..dda6698509 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char	   *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;
 
 
 /*
@@ -406,8 +406,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();
 
 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+!ArchiveCallbacks.check_configured_cb())
 			{
 ereport(WARNING,
 		(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -508,7 +508,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);
 
-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -814,7 +814,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -827,7 +827,7 @@ LoadArchiveLibrary(void)
  errmsg("both archive_command and archive_library set"),
  errdetail("Only one of archive_command, archive_library may be set.")));
 
-	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
+	memset(&ArchiveCallbacks, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -844,9 +844,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 (errmsg("archive modules have to define the symbol %s", "_PG_archive_module_init")));
 
-	(*archive_init) (&ArchiveContext);
+	(*archive_init) (&ArchiveCallbacks);
 
-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 (errmsg("archive modules must register an archive callback")));
 
@@ -859,6 +859,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

>From 37e19c986b548f5d281190e196e01ce94f8c4391 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH v4 2/3] move archive module exports to dedicated headers

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c |  2 ++
 src/backend/postmaster/shell_archive.c  |  3 +-
 src/backend/utils/misc/guc_tables.c |  1 +
 src/include/postmaster/archive_module.h | 47 +
 src/include/postmaster/pgarch.h | 39 
 src/include/postmaster/shell_archive.h  | 24 +
 7 files changed, 77 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h
 create mode 100644 src/include/postmaster/shell_archive.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ 

Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 3:34 PM Tom Lane  wrote:
> I spent a bit more time thinking about that, and while I agree that
> it's an oddity, I don't see that it matters in the case of hash
> partitioning.  You would notice an issue if you tried to do a selective
> restore of just one partition --- but under what circumstance would
> that be a useful thing to do?  By definition, under hash partitioning
> there is no user-meaningful difference between different partitions.
> Moreover, in the case at hand you would get constraint failures without
> --load-via-partition-root, or tuple routing failures with it,
> so what's the difference?  (Unless you'd created all the partitions
> to start with and were doing a selective restore of just one partition's
> data, in which case the outcome is "fails" or "works" respectively.)

I guess I was worried that pg_dump's dependency ordering stuff might
do something weird in some case that I'm not smart enough to think up.

> > Also, and I think pretty
> > significantly, using --load-via-partition-root forces you to pay the
> > overhead of rerouting every tuple to the target partition whether you
> > need it or not, which is potentially a large unnecessary expense.
>
> Oddly, I always thought that we prioritize correctness over speed.

That's a bit rich.

It seems to me that the job of pg_dump is to produce a dump that, when
reloaded on another system, recreates the same database state. That
means that we end up with all of the same objects, each defined in the
same way, and that all of the tables end up with all the same contents
that they had before. Here, you'd like to argue that it's perfectly
fine if we instead insert some of the rows into different tables than
where they were on the original system. Under normal circumstances, of
course, we wouldn't consider any such thing, because then we would not
be faithfully replicating the database state, which would be
incorrect. But here you want to argue that it's OK to create a
different database state because trying to recreate the same one would
produce an error and the user might not like getting an error so let's
just do something else instead and not even bother telling them.

As you have quite rightly pointed out, the --load-via-partition-root
behavior is useful for working around a variety of unfortunate things
that can happen. If a user is willing to say that getting a row into
one partition of some table is just as good as getting it into another
partition of that same table and that you don't mind paying the cost
associated with that, then that is something that we can do for that
user. But just as we normally prioritize correctness over speed, so
also do we normally throw errors when things aren't right instead of
silently accepting bad input. The partitions in this scenario are
tables that have constraints. If a dump contains a row that doesn't
satisfy some constraint on the table into which it is being loaded,
that's an error. Keep in mind that there's no rule that a user can't
query a partition directly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_dump versus hash partitioning

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 12:34 PM Tom Lane  wrote:
> > Also, and I think pretty
> > significantly, using --load-via-partition-root forces you to pay the
> > overhead of rerouting every tuple to the target partition whether you
> > need it or not, which is potentially a large unnecessary expense.
>
> Oddly, I always thought that we prioritize correctness over speed.
> I don't mind having an option that allows people to select a less-safe
> way of doing this, but I do think it's unwise for less-safe to be the
> default, especially when it's something you can't fix after the fact.

+1

As you pointed out already, pg_dump's primary use-cases all involve
target systems that aren't identical to the source system. Anybody
using pg_dump is unlikely to be particularly concerned about
performance.

> What do you think of "--load-via-partition-root=on/off/auto", where
> auto means "not with hash partitions" or the like?  I'm still
> uncomfortable about the collation aspect, but I'm willing to concede
> that range partitioning is less likely to fail in this way than hash.

Currently, pg_dump ignores collation versions entirely, except when
run by pg_upgrade. So pg_dump already understands that sometimes it's
important that the collation behavior be completely identical when the
database is restored, and sometimes it's desirable to produce a dump
with any available "logically equivalent" collation. This is about the
high level requirements, which makes sense to me.

ISTM that range partitioning is missing a good high level model that
builds on that. What's really needed is a fully worked out abstraction
that recognizes how collations can be equivalent for some purposes,
but not other purposes. The indirection between "logical and physical
collations" is underdeveloped. There isn't even an official name for
that idea.

-- 
Peter Geoghegan




Re: heapgettup refactoring

2023-02-01 Thread Melanie Plageman
On Thu, Feb 02, 2023 at 12:21:20AM +1300, David Rowley wrote:
> On Tue, 31 Jan 2023 at 12:18, Melanie Plageman
>  wrote:
> > v7 attached
> 
> I've been looking over the v7-0002 patch today and I did make a few
> adjustments to heapgettup_initial_block() as I would prefer to see the
> branching of each of all these helper functions follow the pattern of:
> 
> if ()
> {
> if ()
> 
> else
> 
> }
> else
> {
> 
> }
> 
> which wasn't quite what the function was doing.

I'm fine with this. One code comment about the new version inline.

> Along the way, I noticed that 0002 has a subtle bug that does not seem
> to be present once the remaining patches are applied.  I think I'm
> happier to push these along the lines of how you have them in the
> patches, so I've held off pushing for now due to the bug and the
> change I had to make to fix it.
> 
> The problem is around the setting of scan->rs_inited = true; you've
> moved that into heapgettup_initial_block() and you've correctly not
> initialised the scan for empty tables when you return
> InvalidBlockNumber, however, you've not correctly considered the fact
> that table_block_parallelscan_nextpage() could also return
> InvalidBlockNumber if the parallel workers manage to grab all of the
> blocks before the current process gets the first block. I don't know
> for sure, but it looks like this could cause problems when
> heapgettup() or heapgettup_pagemode() got called again for a rescan.
> We'd have returned the NULL tuple to indicate that no further tuples
> exist, but we'll have left rs_inited set to true which looks like
> it'll cause issues.

Ah, yes. In the later patches in the series, I handle all end of scan
cases (regardless of whether or not there was a beginning) in a single
place at the end of the function. There I release the buffer and reset
all state -- including setting rs_inited to false. So, that made it okay
to set rs_inited to true in heapgettup_initial_block().

When splitting it up, I made a mistake and missed the case you
mentioned. Thanks for catching that!

FWIW, I like setting rs_inited in heapgettup_initial_block() better in
the final refactor, but I agree with you that in this patch on its own
it is better in the body of heapgettup() and heapgettup_pagemode().
 
> I wondered if it might be better to do the scan->rs_inited = true; in
> heapgettup() and heapgettup_pagemode() instead. The attached v8 patch
> does it this way. Despite this fixing that bug, I think this might be
> a slightly better division of duties.

LGTM.

> From cbd37463bdaa96afed4c7c739c8e91b770a9f8a7 Mon Sep 17 00:00:00 2001
> From: David Rowley 
> Date: Wed, 1 Feb 2023 19:35:16 +1300
> Subject: [PATCH v8] Refactor heapam.c adding heapgettup_initial_block function
> 
> Here we adjust heapgettup() and heapgettup_pagemode() to move the code
> that fetches the first block out into a helper function.  This removes
> some code duplication.
> 
> Author: Melanie Plageman
> Reviewed-by: David Rowley
> Discussion: 
> https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_t8qeqzqol02rpi...@mail.gmail.com
> ---
>  src/backend/access/heap/heapam.c | 225 ++-
>  1 file changed, 103 insertions(+), 122 deletions(-)
> 
> diff --git a/src/backend/access/heap/heapam.c 
> b/src/backend/access/heap/heapam.c
> index 0a8bac25f5..40168cc9ca 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -483,6 +483,67 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
>   scan->rs_ntuples = ntup;
>  }
>  
> +/*
> + * heapgettup_initial_block - return the first BlockNumber to scan
> + *
> + * Returns InvalidBlockNumber when there are no blocks to scan.  This can
> + * occur with empty tables and in parallel scans when parallel workers get 
> all
> + * of the pages before we can get a chance to get our first page.
> + */
> +static BlockNumber
> +heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
> +{
> + Assert(!scan->rs_inited);
> +
> + /* When there are no pages to scan, return InvalidBlockNumber */
> + if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
> + return InvalidBlockNumber;
> +
> + if (ScanDirectionIsForward(dir))
> + {
> + /* serial scan */
> + if (scan->rs_base.rs_parallel == NULL)
> + return scan->rs_startblock;

I believe this else is superfluous since we returned above.

> + else
> + {
> + /* parallel scan */
> + table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
> +  scan->rs_parallelworkerdata,
> +  (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
> +
> + /* may return InvalidBlockNumber 

Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 1:23 PM Tom Lane  wrote:
>> In the meantime, I think we need to recognize that hash values are
>> not very portable.  I do not think we do our users a service by
>> letting them discover the corner cases the hard way.

> I think you're not really engaging with the argument that "not
> completely portable" and "totally broken" are two different things,
> and I still think that's an important point here.

I don't buy that argument.  From the user's perspective, it's broken
if her use-case fails.  "It works for most people" is cold comfort,
most especially so if there's no convenient path to fixing it after
a failure.

> I don't think the fact that our *traditional* standard for how stable
> a hash function needs to be has been XYZ carries any water.

Well, it wouldn't need to if we had a practical way of changing the
behavior of an existing hash function, but guess what: we don't.
Andrew's original proposal for fixing this was exactly to change the
behavior of hashenum().  There were some problems with the idea of
depending on enumsortorder instead of enum OID, but the really
fundamental issue is that you can't change hashing behavior without
breaking pg_upgrade completely.  Not only will your hash indexes be
corrupt, but your hash-partitioned tables will be broken, in exactly
the same way that we're trying to solve for dump/reload cases (which
of course will *also* be broken by redefining the hash function, if
you didn't use --load-via-partition-root).  Moreover, while we can
always advise people to reindex, there's no similarly easy way to fix
broken partitioning.

That being the case, I don't think moving the goalposts for hash
function stability is going to lead to a workable solution.

> On the question of whether hash partitioning is a good feature in
> general, I can only say that I disagree with what seems to be your
> position, which as best as I can tell is "it sucks and we should kill
> it with fire".

As I said, I'm not prepared to litigate that case today ... but
I do have a sneaking suspicion that we will eventually reach that
conclusion.  In any case, if we don't want to reach that conclusion,
we need some practical solution to these dump/reload problems.
Have you got a better idea than --load-via-partition-root?

regards, tom lane




Re: recovery modules

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:15:29 -0800, Nathan Bossart wrote:
> Here's a new patch set in which I've attempted to address this feedback and
> Michael's feedback.

Looks better!


> @@ -25,12 +34,14 @@ extern PGDLLIMPORT char *XLogArchiveLibrary;
>   * For more information about the purpose of each callback, refer to the
>   * archive modules documentation.
>   */
> -typedef bool (*ArchiveCheckConfiguredCB) (void);
> -typedef bool (*ArchiveFileCB) (const char *file, const char *path);
> -typedef void (*ArchiveShutdownCB) (void);
> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
> +typedef bool (*ArchiveFileCB) (const char *file, const char *path, 
> ArchiveModuleState *state);
> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);

Personally I'd always pass ArchiveModuleState *state as the first arg,
but it's not important.

Greetings,

Andres Freund




Re: Non-superuser subscription owners

2023-02-01 Thread Andres Freund
Hi,

On 2023-01-30 15:32:34 -0500, Robert Haas wrote:
> I had a long think about what to do with ALTER SUBSCRIPTION ... OWNER
> TO in terms of permissions checks. The previous version required that
> the new owner have permissions of pg_create_subscription, but there
> seems to be no particular reason for that rule except that it happens
> to be what I made the code do. So I changed it to say that the current
> owner must have CREATE privilege on the database, and must be able to
> SET ROLE to the new owner. This matches the rule for CREATE SCHEMA.
> Possibly we should *additionally* require that the person performing
> the rename still have pg_create_subscription, but that shouldn't be
> the only requirement.

As long as owner and run-as are the same, I think it's strongly
preferable to *not* require pg_create_subscription.
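
As a sketch of the proposed rule (role and object names illustrative), the
current owner needs CREATE on the database and the ability to SET ROLE to
the new owner:

GRANT CREATE ON DATABASE mydb TO alice;
GRANT bob TO alice;                      -- lets alice SET ROLE to bob
-- executed as alice, the current owner of the subscription:
ALTER SUBSCRIPTION mysub OWNER TO bob;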


> There seems to be a good deal of inconsistency here. If you want to
> give someone a schema, YOU need CREATE on the database. But if you
> want to give someone a table, THEY need CREATE on the containing
> schema. It make sense that we check permissions on the containing
> object, which could be a database or a schema depending on what you're
> renaming, but it's unclear to me why we sometimes check on the person
> performing the ALTER command and at other times on the recipient. It's
> also somewhat unclear to me why we are checking CREATE in the first
> place, especially on the donor. It might make sense to have a rule
> that you can't own an object in a place where you couldn't have
> created it, but there is no such rule, because you can give someone
> CREATE on a schema, they can create an object, and they you can take
> CREATE a way and they still own an object there. So it kind of looks
> to me like we made it up as we went along and that the result isn't
> very consistent, but I'm inclined to follow CREATE SCHEMA here unless
> there's some reason to do otherwise.

Yuck. No idea what the best policy around this is.


> Another question around ALTER SUBSCRIPTION ... OWNER TO and also ALTER
> SUBSCRIPTION .. RENAME is whether they ought to fail if you're not a
> superuser and password_required false is set.

I don't really see a benefit in allowing it, so I'm inclined to go for
the more restrictive option. But this is a really weakly held opinion.



> > > If there is, I think we could fix it by moving the LockSharedObject call 
> > > up
> > > higher, above object_ownercheck. The only problem with that is it lets you
> > > lock an object on which you have no permissions: see
> > > 2ad36c4e44c8b513f6155656e1b7a8d26715bb94. To really fix that, we'd need an
> > > analogue of RangeVarGetRelidExtended.
> >
> > Yea, we really should have something like RangeVarGetRelidExtended() for 
> > other
> > kinds of objects. It'd take a fair bit of work / time to use it widely, but
> > it'll take even longer if we start in 5 years ;)
>
> We actually have something sort of like that in the form of
> get_object_address(). It doesn't allow for a callback, but it does
> have a retry loop.

Hm, sure looks like that code doesn't do any privilege checking...


> @@ -1269,13 +1270,19 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
>   
> slotname,
>   
> NAMEDATALEN);
>
> + /* Is the use of a password mandatory? */
> + must_use_password = MySubscription->passwordrequired &&
> + !superuser_arg(MySubscription->owner);

There's a few repetitions of this - perhaps worth putting into a helper?


> @@ -180,6 +181,13 @@ libpqrcv_connect(const char *conninfo, bool logical, 
> const char *appname,
>   if (PQstatus(conn->streamConn) != CONNECTION_OK)
>   goto bad_connection_errmsg;
>
> + if (must_use_password && !PQconnectionUsedPassword(conn->streamConn))
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
> +  errmsg("password is required"),
> +  errdetail("Non-superuser cannot connect if the 
> server does not request a password."),
> +  errhint("Target server's authentication method 
> must be changed. or set password_required=false in the subscription 
> attributes\
.")));
> +
>   if (logical)
>   {
>   PGresult   *res;

This still leaks the connection on error, no?

Greetings,

Andres Freund




Re: Add connection active, idle time to pg_stat_activity

2023-02-01 Thread Sergey Dudoladov
Hello hackers,

I've sketched the first version of a patch to add pg_stat_session.
Please review this early version.
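
A quick usage sketch of the proposed view (column names as in the patch
below):

SELECT pid,
       total_running_time, total_idle_time,
       total_transaction_idle_time, total_fastpath_time
FROM pg_stat_session
WHERE pid = pg_backend_pid();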

Regards,
Sergey.
From 31f781ecd69fc42aaadd9bcdbebaf8f72449946c Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Tue, 22 Nov 2022 09:23:32 +0100
Subject: [PATCH] Add pg_stat_session

Author: Sergey Dudoladov

Adds the pg_stat_session view to track statistics accumulated during
the lifetime of a session.

Reviewed-by: Aleksander Alekseev, Bertrand Drouvot, and Atsushi
Torikoshi

Discussion:
https://www.postgresql.org/message-id/flat/CA%2BFpmFcJF0vwi-SWW0wYO-c-FbhyawLq4tCpRDCJJ8Bq%3Dja-gA%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml| 153 
 src/backend/catalog/system_views.sql|  15 ++
 src/backend/utils/activity/backend_status.c |  58 ++--
 src/backend/utils/adt/pgstatfuncs.c |  70 +
 src/include/catalog/pg_proc.dat |   9 ++
 src/include/utils/backend_status.h  |  37 +
 src/test/regress/expected/rules.out |  12 ++
 7 files changed, 345 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..38cc29810a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -414,6 +414,20 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
See .
   
  
+
+ 
+  
+   pg_stat_session
+   pg_stat_session
+  
+  
+   One row per server process, showing information related to
+   the currently accumulated activity of that process, such as time spent in
+   a certain state.
+   See 
+   pg_stat_session for details.
+  
+ 
 

   
@@ -5315,6 +5329,129 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

   
 
+  
+   pg_stat_session View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   pid integer
+  
+  
+   Process ID of this backend.
+  
+ 
+
+ 
+  
+   total_running_time double precision
+  
+  
+   Time in milliseconds this backend spent in the running state.
+  
+ 
+
+
+  
+   total_running_count bigint
+  
+  
+   Number of times this backend switched to the running state.
+  
+ 
+
+ 
+  
+   total_idle_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle state.
+  
+ 
+
+ 
+  
+   total_idle_count bigint
+  
+  
+   Number of times this backend switched to the idle state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction
+   state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_aborted_time double precision
+  
+  
+   Time in milliseconds this backend spent in the idle in transaction (aborted) 
+   state.
+  
+ 
+
+ 
+  
+   total_transaction_idle_aborted_count bigint
+  
+  
+   Number of times this backend switched to the idle in transaction (aborted)
+   state.
+  
+ 
+
+ 
+  
+   total_fastpath_time double precision
+  
+  
+   Time in milliseconds this backend spent in the fastpath state.
+  
+ 
+
+ 
+  
+   total_fastpath_count bigint
+  
+  
+   Number of times this backend switched to the fastpath
+   state.
+  
+ 
+
+
+   
+  
+
  
 
  
@@ -5382,6 +5519,22 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

   
 
+  
+   
+
+ pg_stat_get_session
+
+pg_stat_get_session ( integer )
+setof record
+   
+   
+Returns a record of information about the backend with the specified
+process ID, or one record for each active backend in the system
+if NULL is specified.  The fields returned are a
+subset of those in the pg_stat_session view.
+   
+  
+
   

 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..8f68a6ea00 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -870,6 +870,21 @@ CREATE VIEW pg_stat_activity AS
 LEFT JOIN pg_database AS D ON (S.datid = D.oid)
 LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
 
+CREATE VIEW pg_stat_session AS
+SELECT
+s.pid,
+s.total_running_time,
+s.total_running_count,
+s.total_idle_time,
+

Re: [PATCH] CF app: add "Returned: Needs more interest"

2023-02-01 Thread Jacob Champion
On Wed, Jan 4, 2023 at 9:33 AM Jacob Champion  wrote:
> Is there a good way to remind people that, hey, this exists as a
> patchset? (Other than me pinging the list every so often.)

I've withdrawn this patchset for now, but if anyone has any ideas on
where and how I can better propose features for CF itself, I'm all
ears.

Thanks,
--Jacob




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 1:23 PM Tom Lane  wrote:
> Well, that was what I thought too to start with, but I now think that
> it is far too narrow-minded a view of the problem.  The real issue
> is something I said that you trimmed:
>
> >> In general, we've never thought that hash values are
> >> required to be consistent across platforms.
>
> hashenum() is doing something that is perfectly fine in the context
> of hash indexes, which have traditionally been our standard of how
> reproducible hash values need to be.  Hash partitioning has moved
> those goalposts, and if there was any discussion of the consequences
> of that, I didn't see it.

Right, I don't think it was ever discussed, and I think that was a
mistake on my part.

> In the meantime, I think we need to recognize that hash values are
> not very portable.  I do not think we do our users a service by
> letting them discover the corner cases the hard way.

I think you're not really engaging with the argument that "not
completely portable" and "totally broken" are two different things,
and I still think that's an important point here. One idea that I had
is to add a flag somewhere to indicate whether a particular opclass or
opfamily is suitable for hash partitioning, or perhaps better, an
alternative to opcdefault that sets the default for partitioning,
which could be different from the default for indexing. Then we could
either prohibit this case, or make it work. Of course we would have to
define what "suitable for hash partitioning" means, but "would be
likely to survive a dump and reload on the same machine without any
software changes" is probably a reasonable minimum standard.

I don't think the fact that our *traditional* standard for how stable
a hash function needs to be has been XYZ carries any water. Needs
change over time, and we adapt the code to meet the new needs. Since
we have no system for type properties in PostgreSQL -- a design
decision I find questionable -- we tie all such properties to operator
classes. That's why, for example, we have HASHEXTENDED_PROC, even
though hash indexes don't need 64-bit hash values or a seed. We added
that for hash partitioning, and it's now used in other places too,
because 32-bits aren't enough for everything just because they're
enough for hash indexes, and seeds are handy. That's also why we have
BTINRANGE_PROC, which doesn't exist to support btree indexes, but
rather window frames. The fact that a certain feature requires us to
graft some additional stuff into the operator class/family mechanism,
or that it doesn't quite work with everything that's already part of
that mechanism, isn't an argument against the feature. That's just how
we do things around here. Indeed, if somebody, instead of implementing
hash partitioning by tying it into hash opfamilies, were to make up
some completely new hashing infrastructure that had exactly the
properties they wanted for partitioning, that would be *totally*
unacceptable and surely a reason for rejecting such a feature
outright. The fact that it tries to make use of the existing
infrastructure is a good thing about that feature, not a bad thing,
even though it is turning out that there are some problems.

On the question of whether hash partitioning is a good feature in
general, I can only say that I disagree with what seems to be your
position, which as best as I can tell is "it sucks and we should kill
it with fire". I do think that partitioning in general leaves a lot to
be desired in PostgreSQL in general, and perhaps the issues are even
worse for hash partitioning than they are elsewhere. However, I think
that the solution to that is for people to keep putting more work into
making it better, not to give up and say "ah, partitioning (or hash
partitioning specifically) is a stupid feature that nobody wants". To
think that, you have to be living in a bubble. It's unfortunate that
with all the work that so many people have put into this area we don't
have better results to show for it, but AFAICS there's no help for
that but to keep hacking. Amit Langote's work on locking before
pruning, for example, will be hugely impactful for some kinds of
queries if it gets committed, and it's been a long time coming, partly
because so many other problems needed to be sorted out first. But you
can't run the simplest workload with any kind of partitioning, range,
hash, whatever, and not run into that problem immediately.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 09:43:39 -0500, Robert Haas wrote:
> On Tue, Jan 31, 2023 at 7:01 PM Andres Freund  wrote:
> > I don't really understand that - the run-as approach seems like a
> > necessary piece of improving the security model.
> >
> > I think it's perfectly reasonable to want to replicate from one system
> > in another, but to not want to allow logical replication to insert into
> > pg_class or whatnot. So not using superuser to execute the replication
> > makes sense.
> >
> > This is particularly the case if you're just replicating a small part of
> > the tables from one system to another. E.g. in a sharded setup, you may
> > want to replicate metadata too servers.
> 
> I don't think that a system catalog should be considered a valid
> replication target, no matter who owns the subscription, so ISTM that
> writing to pg_class should be blocked regardless.

The general point, IMO, is that in many setups you should use a
user with fewer privileges than a superuser.  It doesn't really matter
whether we have an ad-hoc restriction for system catalogs. More often
than not being able to modify other tables will give you a lot of
privileges too.


> The thing I'm struggling to understand is: if you only want to
> replicate into tables that Alice can write, why not just make Alice
> own the subscription?

Because it implies that the replication happens as a user that's
privileged enough to change the configuration of replication.


> Mark was postulating a scenario where the publisher and subscriber
> don't trust each other.

FWIW, I don't think this is mainly about "trust", but instead about
layering security / the principle of least privilege. The "run-as" user
(i.e. currently owner) is constantly performing work on behalf of a
remote node, including executing code (default clauses etc). To make it
harder to use such a cross-system connection to move from one system to
the next, it's a good idea to execute it in the least privileged context
possible. And I don't see why it'd need the permission to modify the
definition of the subscription and similar "admin" tasks.

It's not that such an extra layer would necessarily completely stop an
attacker. But it might delay them and make their attack more noisy.


Similarly, if I were to operate an important production environment
again, I'd not have relations owned by the [pseudo]superuser, but by a
user controlled by the [pseudo]superuser. That way somebody tricking the
superuser into a REINDEX or such only gets the ability to execute code
in a less privileged context.




> I was thinking a little bit more about that. I
> still maintain that the current system is poorly set up to make that
> work, but suppose we wanted to do better. We could add filtering on
> the subscriber side, like you list schemas or specific relations that
> you are or are not willing to replicate into.

Isn't that largely a duplication of the ACLs on relations etc?


> > I think we'll need two things to improve upon the current situation:
> >
> > 1) run-as user, to reduce the scope of potential danger
> >
> > 2) Option to run the database inserts as the owner of the table, with a
> >check that the run-as is actually allowed to perform work as the
> >owning role. That prevents escalation from table owner (who could add
> >default expressions etc) from gettng the privs of the
> >run-as/replication owner.
> 
> I'm not quite sure what we do here now, but I agree that trigger
> firing seems like a problem. It might be that we need to worry about
> the user on the origin server, too. If Alice inserts a row that causes
> a replicated table owned by Bob to fire a trigger or evaluate a
> default expression or whatever due the presence of a subscription
> owned by Charlie, there is a risk that Alice might try to attack
> either Bob or Charlie, or that Bob might try to attack Charlie.

The attack on Bob exists without logical replication too - a REINDEX or
such is executed as the owner of the relation and re-evaluates index
expressions, constraints etc.  Given our security model I don't think we
can protect the relation owner if they trust somebody to insert rows, so
I don't really know what we can do to protect Charlie against Bob.



> > > And if we suppose that that already works and is safe, well then
> > > what's the case where I do need a run-as user?
> >
> > It's not at all safe today, IMO. You need to trust that nothing bad will
> > be replicated, otherwise the owner of the subscription has to be
> > considered compromised.
> 
> What kinds of things are bad to replicate?

I think that's unfortunately going to be specific to a setup.

Greetings,

Andres Freund




Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security

2023-02-01 Thread Jacob Champion
On Mon, Jan 30, 2023 at 2:21 PM Robert Haas  wrote:
> On Mon, Jan 30, 2023 at 4:12 PM Jacob Champion  
> wrote:
> > For our case, assuming that connections have side effects, one
> > solution could be for the client to signal to the server that the
> > connection should use in-band authentication only; i.e. fail the
> > connection if the credentials provided aren't good enough by
> > themselves to authenticate the client. (This has some overlap with
> > SASL negotiation, maybe.)
>
> I'm not an expert on this stuff, but to me that feels like a weak and
> fuzzy concept. If the client is going to tell the server something,
> I'd much rather have it say something like "i'm proxying a request
> from my local user rhaas, who authenticated using such and such a
> method and connected from such and such an IP yadda yadda". That feels
> to me like really clear communication that the server can then be
> configured to something about via pg_hba.conf or similar. Saying "use
> in-band authentication only", to me, feels much murkier. As the
> recipient of that message, I don't know exactly what to do about it,
> and it feels like whatever heuristic I adopt might end up being wrong
> and something bad happens anyway.

Is it maybe just a matter of terminology? If a proxy tells the server,
"This user is logging in. Here's the password I have for them. DO NOT
authenticate using anything else," and the HBA says to use ident auth
for that user, then the server fails the connection. That's what I
mean by in-band -- the proxy says, "here are the credentials for this
connection." That's it.

Alternatively, if you really don't like making this server-side: any
future "connection side effects" we add, such as logon triggers, could
either be opted into by the client or explicitly invoked by the client
after it's happy with the authentication exchange. Or it could be
disabled at the server side for forms of ambient authn. (This is
getting closer to HTTP's method safety concept.)

> > I agree. (But for the record, I think that an outbound proxy filter is
> > also insufficient. Someone, somewhere, is going to want to safely
> > proxy through localhost _and_ have peer authentication set up.)
>
> Well then they're indeed going to need some way to distinguish a
> proxied connection from a non-proxied one. You can't send identical
> connection requests in different scenarios and get different
> results

Yeah. Most of these solutions require explicitly labelling things that
were implicit before.

> I think what we really need here is an example or three of a proposed
> configuration file syntax. I think it would be good if we could pick a
> syntax that doesn't require a super-complicated parser

Agreed. The danger from my end is, I'm trained on configuration
formats that have infinite bells and whistles. I don't really want to
go too crazy with it.

> and that maybe
> has something in common with our existing configuration file syntaxes.
> But if we have to invent something new, then we can do that.

Okay. Personally I'd like
- the ability to set options globally (so filters are optional)
- the ability to maintain many options for a specific scope (host? IP
range?) without making my config lines grow without bound
- the ability to audit a configuration without trusting its comments

But getting all of my wishlist into a sane configuration format that
handles all the use cases is the tricky part. I'll think about it.

--Jacob




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> ... I like the
> fact that we have --load-via-partition-root, but it is a bit of a
> hack. You don't get a single copy into the partition root, you get one
> per child table -- and those COPY statements are listed as data for
> the partitions where the data lives now, not for the parent table. I
> am not completely sure whether there is a scenario where that's an
> issue, but it's certainly an oddity.

I spent a bit more time thinking about that, and while I agree that
it's an oddity, I don't see that it matters in the case of hash
partitioning.  You would notice an issue if you tried to do a selective
restore of just one partition --- but under what circumstance would
that be a useful thing to do?  By definition, under hash partitioning
there is no user-meaningful difference between different partitions.
Moreover, in the case at hand you would get constraint failures without
--load-via-partition-root, or tuple routing failures with it,
so what's the difference?  (Unless you'd created all the partitions
to start with and were doing a selective restore of just one partition's
data, in which case the outcome is "fails" or "works" respectively.)

> Also, and I think pretty
> significantly, using --load-via-partition-root forces you to pay the
> overhead of rerouting every tuple to the target partition whether you
> need it or not, which is potentially a large unnecessary expense.

Oddly, I always thought that we prioritize correctness over speed.
I don't mind having an option that allows people to select a less-safe
way of doing this, but I do think it's unwise for less-safe to be the
default, especially when it's something you can't fix after the fact.

What do you think of "--load-via-partition-root=on/off/auto", where
auto means "not with hash partitions" or the like?  I'm still
uncomfortable about the collation aspect, but I'm willing to concede
that range partitioning is less likely to fail in this way than hash.

regards, tom lane




Re: recovery modules

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 03:54:26AM -0800, Andres Freund wrote:
> I'd make basic_archive's private data a struct, with a member for the
> context, but it's not that important.
> 
> I'd also be inclined to do the same for the private_state you're passing
> around for each module. Even if it's just to reduce the number of
> functions accepting void * - losing compiler type checking isn't great.
> 
> So maybe an ArchiveModuleState { void *private_data } that's passed to
> basic_archive_startup() and all the other callbacks.

Here's a new patch set in which I've attempted to address this feedback and
Michael's feedback.
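
To make the above concrete, the shape Andres describes is roughly:

typedef struct ArchiveModuleState
{
	void	   *private_data;	/* module-private state, e.g. a memory context */
} ArchiveModuleState;

/* the callbacks then all take the state, e.g. */
typedef bool (*ArchiveFileCB) (ArchiveModuleState *state,
							   const char *file, const char *path);

(only a sketch of the idea; see the attached patches for the actual
definitions).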

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From e54ed061b772e7f0859a37612ed1effebc5c6ba1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:01:22 -0800
Subject: [PATCH v3 1/3] s/ArchiveContext/ArchiveCallbacks

---
 src/backend/postmaster/pgarch.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3c714a79c6..dda6698509 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -97,7 +97,7 @@ char	   *XLogArchiveLibrary = "";
  */
 static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
-static ArchiveModuleCallbacks ArchiveContext;
+static ArchiveModuleCallbacks ArchiveCallbacks;
 
 
 /*
@@ -406,8 +406,8 @@ pgarch_ArchiverCopyLoop(void)
 			HandlePgArchInterrupts();
 
 			/* can't do anything if not configured ... */
-			if (ArchiveContext.check_configured_cb != NULL &&
-!ArchiveContext.check_configured_cb())
+			if (ArchiveCallbacks.check_configured_cb != NULL &&
+!ArchiveCallbacks.check_configured_cb())
 			{
 ereport(WARNING,
 		(errmsg("archive_mode enabled, yet archiving is not configured")));
@@ -508,7 +508,7 @@ pgarch_archiveXlog(char *xlog)
 	snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog);
 	set_ps_display(activitymsg);
 
-	ret = ArchiveContext.archive_file_cb(xlog, pathname);
+	ret = ArchiveCallbacks.archive_file_cb(xlog, pathname);
 	if (ret)
 		snprintf(activitymsg, sizeof(activitymsg), "last was %s", xlog);
 	else
@@ -814,7 +814,7 @@ HandlePgArchInterrupts(void)
 /*
  * LoadArchiveLibrary
  *
- * Loads the archiving callbacks into our local ArchiveContext.
+ * Loads the archiving callbacks into our local ArchiveCallbacks.
  */
 static void
 LoadArchiveLibrary(void)
@@ -827,7 +827,7 @@ LoadArchiveLibrary(void)
  errmsg("both archive_command and archive_library set"),
  errdetail("Only one of archive_command, archive_library may be set.")));
 
-	memset(&ArchiveContext, 0, sizeof(ArchiveModuleCallbacks));
+	memset(&ArchiveCallbacks, 0, sizeof(ArchiveModuleCallbacks));
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
@@ -844,9 +844,9 @@ LoadArchiveLibrary(void)
 		ereport(ERROR,
 (errmsg("archive modules have to define the symbol %s", "_PG_archive_module_init")));
 
-	(*archive_init) (&ArchiveContext);
+	(*archive_init) (&ArchiveCallbacks);
 
-	if (ArchiveContext.archive_file_cb == NULL)
+	if (ArchiveCallbacks.archive_file_cb == NULL)
 		ereport(ERROR,
 (errmsg("archive modules must register an archive callback")));
 
@@ -859,6 +859,6 @@ LoadArchiveLibrary(void)
 static void
 pgarch_call_module_shutdown_cb(int code, Datum arg)
 {
-	if (ArchiveContext.shutdown_cb != NULL)
-		ArchiveContext.shutdown_cb();
+	if (ArchiveCallbacks.shutdown_cb != NULL)
+		ArchiveCallbacks.shutdown_cb();
 }
-- 
2.25.1

>From 0bcfb2dc8ef1544dcd6a8cc07a784b36a38febad Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 27 Jan 2023 21:16:34 -0800
Subject: [PATCH v3 2/3] move archive module exports to dedicated headers

---
 contrib/basic_archive/basic_archive.c   |  2 +-
 src/backend/postmaster/pgarch.c |  2 ++
 src/backend/postmaster/shell_archive.c  |  3 +-
 src/backend/utils/misc/guc_tables.c |  1 +
 src/include/postmaster/archive_module.h | 47 +
 src/include/postmaster/pgarch.h | 39 
 src/include/postmaster/shell_archive.h  | 22 
 7 files changed, 75 insertions(+), 41 deletions(-)
 create mode 100644 src/include/postmaster/archive_module.h
 create mode 100644 src/include/postmaster/shell_archive.h

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 3d29711a31..87bbb2174d 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -32,7 +32,7 @@
 
 #include "common/int.h"
 #include "miscadmin.h"
-#include "postmaster/pgarch.h"
+#include "postmaster/archive_module.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "utils/guc.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index dda6698509..ec47b2cc20 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,8 +34,10 @@
 #include "lib/binaryheap.h"
 #include "libpq/pqsignal.h"
 #include "pgstat.h"

Re: Introduce "log_connection_stages" setting.

2023-02-01 Thread Sergey Dudoladov
Hi again,

Justin, thank you for the fast review. The new version is attached.

Regards,
Sergey Dudoladov
From 994a86e6ac3abb647d93bdaf0f42be76f42b83a8 Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Tue, 8 Nov 2022 18:56:26 +0100
Subject: [PATCH] Introduce 'log_connection_messages'

This patch removes 'log_connections' and 'log_disconnections'
in favor of 'log_connection_messages', thereby introducing incompatibility

Author: Sergey Dudoladov

Reviewed-by: Justin Pryzby

Discussion:
https://www.postgresql.org/message-id/flat/CAA8Fd-qCB96uwfgMKrzfNs32mqqysi53yZFNVaRNJ6xDthZEgA%40mail.gmail.com
---
 doc/src/sgml/config.sgml  | 89 +--
 src/backend/commands/variable.c   | 77 
 src/backend/libpq/auth.c  |  5 +-
 src/backend/postmaster/postmaster.c   |  5 +-
 src/backend/tcop/postgres.c   | 11 ++-
 src/backend/utils/init/postinit.c |  2 +-
 src/backend/utils/misc/guc_tables.c   | 30 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 +-
 src/include/postgres.h|  9 ++
 src/include/postmaster/postmaster.h   |  1 -
 src/include/utils/guc_hooks.h |  2 +
 src/test/authentication/t/001_password.pl |  2 +-
 src/test/authentication/t/003_peer.pl |  2 +-
 src/test/kerberos/t/001_auth.pl   |  2 +-
 src/test/ldap/t/001_auth.pl   |  2 +-
 src/test/ldap/t/002_bindpasswd.pl |  2 +-
 src/test/recovery/t/013_crash_restart.pl  |  2 +-
 src/test/recovery/t/022_crash_temp_files.pl   |  2 +-
 src/test/recovery/t/032_relfilenode_reuse.pl  |  2 +-
 src/test/ssl/t/SSL/Server.pm  |  2 +-
 src/tools/ci/pg_ci_base.conf  |  3 +-
 21 files changed, 182 insertions(+), 75 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1cf53c74ea..ccefe144c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -140,7 +140,7 @@
  An example of what this file might look like is:
 
 # This is a comment
-log_connections = yes
+log_connection_messages = all
 log_destination = 'syslog'
 search_path = '"$user", public'
 shared_buffers = 128MB
@@ -335,7 +335,7 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
passed to the postgres command via the
-c command-line parameter.  For example,
 
-postgres -c log_connections=yes -c log_destination='syslog'
+postgres -c log_connection_messages=all -c log_destination='syslog'
 
Settings provided in this way override those set via
postgresql.conf or ALTER SYSTEM,
@@ -7085,23 +7085,74 @@ local0.*/var/log/postgresql
   
  
 
- 
-  log_connections (boolean)
+ 
+  log_connection_messages (string)
   
-   log_connections configuration parameter
+   log_connection_messages configuration parameter
   
   
   

-Causes each attempted connection to the server to be logged,
-as well as successful completion of both client authentication (if
-necessary) and authorization.
+Causes the specified stages of each connection attempt to the server to be logged. Example: authorized,disconnected.
+The default is the empty string, which means that nothing is logged.
 Only superusers and users with the appropriate SET
 privilege can change this parameter at session start,
 and it cannot be changed at all within a session.
-The default is off.

 
+
+ Connection messages
+ 
+  
+  
+  
+   
+Name
+Description
+   
+  
+  
+   
+received
+Logs receipt of a connection. At this point a connection has been
+received, but no further work has been done: PostgreSQL is about to start
+interacting with a client.
+   
+
+   
+authenticated
+Logs the original identity used by an authentication method
+to identify a user. In most cases, the identity string matches the
+PostgreSQL username, but some third-party authentication methods may
+alter the original user identifier before the server stores it. Failed
+authentication is always logged regardless of the value of this setting.
+
+   
+
+   
+authorized
+Logs successful completion of authorization. At this point the
+connection has been fully established.
+
+   
+
+   
+disconnected
+Logs session termination. The log output provides information
+similar to authorized, plus the duration of the session.
+
+   
+
+   
+all
+A convenience alias. If present, it must be the only value in the
+list.
+   
+
+  
+ 
+
+

 
  Some client programs, like psql, attempt
@@ -7113,26 

Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 09:49:00 -0800, Andres Freund wrote:
>> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
>>> And the minimum version appears to be newer than RHEL8's 1.8.2, which
>>> I find pretty unfortunate.

> Unfortunately the test script accidentally pulled in ninja from epel,
> hence not noticing the issue.

Ah.  For myself, pulling the newer version from epel would not be a big
problem.  I think what we need to do is figure out what is the minimum
ninja version we want to support, and then see if we need to make any
of these changes.  I don't have hard data on which distros have which
versions of ninja, but surely somebody checked that at some point?

regards, tom lane




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 09:49:00 -0800, Andres Freund wrote:
> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
> > And the minimum version appears to be newer than RHEL8's 1.8.2, which
> > I find pretty unfortunate.  On RHEL8, it fails with
> > $ ninja
> > ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
> > depslog; bring this up on the mailing list if it affects you
> 
> What's in that line +- 2 lines?  And/or what are the steps that got you
> to that point?
> 
> I'll try building 1.8.2 and reproing.
> 
> 
> > I did manage to test this stuff on bleeding-edge Fedora,
> > but ...
> 
> Yea, I worked a fair bit to avoid requiring a too new version, I'll try
> to figure out what went wrong.  I did built on rhel8 not long ago, so I
> suspect it's a corner case somewhere.

Unfortunately the test script accidentally pulled in ninja from epel,
hence not noticing the issue.


There's three issues:

One is easy enough, albeit slightly annoying: 1.8.2 wants the
"depending" file to be named only once in a dependency file. Slightly
uglier code in snowball_create.pl, but whatever.

The second is one case of multiple outputs with a depfile:
create_help.pl creates both sql_help.c and sql_help.h. Not immediately
sure what a good solution here is. The brute force solution would be to
invoke it twice, but I don't like that at all.

The last case is the various man directories. That'd be easy enough to
avoid if we generated them inside a man/ directory.

Greetings,

Andres Freund




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
>> It's unlike what "make -C doc/src/sgml all" does in the Makefile
>> system, and I don't find it to be an improvement.

> Well, that'd actually build the manpages too, afaics :). But I get the
> point.

Ah, sorry, I too had forgotten that "all" isn't the default target
there.  I actually just go into that directory and type "make".

> I really have no opinion on what we should should build under what
> name. Happy to change what's included in 'docs', add additional targets,
> etc.

I think "docs" for just the html and "alldocs" for all supported
outputs is probably reasonable.  If we ever get to the point of
building distribution tarballs with meson, we might need another
target for html+man, but I suppose that's a long way off.

>> And the minimum version appears to be newer than RHEL8's 1.8.2, which
>> I find pretty unfortunate.  On RHEL8, it fails with
>> $ ninja
>> ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
>> depslog; bring this up on the mailing list if it affects you

> What's in that line +- 2 lines?  And/or what are the steps that got you
> to that point?

"meson setup build" is sufficient to see it --- apparently ninja
gets invoked at the end of that, and it's already unhappy.  But
it repeats after "cd build; ninja".

It seems to be unhappy about the stanza for building sql_help.c?
Line 6771 is the blank line after "description" in this bit:

build src/bin/psql/sql_help.c src/bin/psql/sql_help.h: CUSTOM_COMMAND_DEP  | 
../src/bin/psql/create_help.pl /usr/bin/perl
 DEPFILE = src/bin/psql/sql_help.dep
 DEPFILE_UNQUOTED = src/bin/psql/sql_help.dep
 COMMAND = /usr/bin/perl ../src/bin/psql/create_help.pl --docdir 
../doc/src/sgml/ref --depfile src/bin/psql/sql_help.dep --outdir src/bin/psql 
--basename sql_help
 description = Generating$ psql_help$ with$ a$ custom$ command

build src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o: c_COMPILER 
src/bin/psql/psqlscanslash.c || src/bin/psql/sql_help.h 
src/include/catalog/pg_aggregate_d.h src/include/catalog/pg_am_d.h 
src/include/catalog/pg_amop_d.h src/include/catalog/pg_amproc_d.h 
src/include/catalog/pg_attrdef_d.h src/include/catalog/pg_attribute_d.h 
src/include/catalog/pg_auth_members_d.h src/include/catalog/pg_authid_d.h 
src/include/catalog/pg_cast_d.h src/include/catalog/pg_class_d.h 
src/include/catalog/pg_collation_d.h src/include/catalog/pg_constraint_d.h 
src/include/catalog/pg_conversion_d.h src/include/catalog/pg_database_d.h 
src/include/catalog/pg_db_role_setting_d.h 
src/include/catalog/pg_default_acl_d.h src/include/catalog/pg_depend_d.h 
src/include/catalog/pg_description_d.h src/include/catalog/pg_enum_d.h 
src/include/catalog/pg_event_trigger_d.h src/include/catalog/pg_extension_d.h 
src/include/catalog/pg_foreign_data_wrapper_d.h 
src/include/catalog/pg_foreign_server_d.h 
src/include/catalog/pg_foreign_table_d.h src/include/catalog/pg_index_d.h 
src/include/catalog/pg_inherits_d.h src/include/catalog/pg_init_privs_d.h 
src/include/catalog/pg_language_d.h src/include/catalog/pg_largeobject_d.h 
src/include/catalog/pg_largeobject_metadata_d.h 
src/include/catalog/pg_namespace_d.h src/include/catalog/pg_opclass_d.h 
src/include/catalog/pg_operator_d.h src/include/catalog/pg_opfamily_d.h 
src/include/catalog/pg_parameter_acl_d.h 
src/include/catalog/pg_partitioned_table_d.h src/include/catalog/pg_policy_d.h 
src/include/catalog/pg_proc_d.h src/include/catalog/pg_publication_d.h 
src/include/catalog/pg_publication_namespace_d.h 
src/include/catalog/pg_publication_rel_d.h src/include/catalog/pg_range_d.h 
src/include/catalog/pg_replication_origin_d.h 
src/include/catalog/pg_rewrite_d.h src/include/catalog/pg_seclabel_d.h 
src/include/catalog/pg_sequence_d.h src/include/catalog/pg_shdepend_d.h 
src/include/catalog/pg_shdescription_d.h src/include/catalog/pg_shseclabel_d.h 
src/include/catalog/pg_statistic_d.h src/include/catalog/pg_statistic_ext_d.h 
src/include/catalog/pg_statistic_ext_data_d.h 
src/include/catalog/pg_subscription_d.h 
src/include/catalog/pg_subscription_rel_d.h 
src/include/catalog/pg_tablespace_d.h src/include/catalog/pg_transform_d.h 
src/include/catalog/pg_trigger_d.h src/include/catalog/pg_ts_config_d.h 
src/include/catalog/pg_ts_config_map_d.h src/include/catalog/pg_ts_dict_d.h 
src/include/catalog/pg_ts_parser_d.h src/include/catalog/pg_ts_template_d.h 
src/include/catalog/pg_type_d.h src/include/catalog/pg_user_mapping_d.h 
src/include/catalog/postgres.bki src/include/catalog/schemapg.h 
src/include/catalog/system_constraints.sql src/include/catalog/system_fk_info.h 
src/include/nodes/nodetags.h src/include/utils/errcodes.h
 DEPFILE = src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o.d
 DEPFILE_UNQUOTED = src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o.d
 ARGS = -Isrc/bin/psql/psql.p -Isrc/bin/psql -I../src/bin/psql -Isrc/include 
-I../src/include -Isrc/interfaces/libpq 

Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Matthias van de Meent
On Wed, 1 Feb 2023 at 18:51, Ilya Gladyshev  wrote:
>
> On 1 Feb 2023, at 20:27, Matthias van de Meent wrote:
>
>> In HEAD we set TOTAL to whatever number partitioned table we're
>> currently processing has - regardless of whether we're the top level
>> statement.
>> With the patch we instead add the number of child relations to that
>> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
>> posted by Ilya. Approximately immediately after updating that count we
>> recurse to the child relations, and that only returns once it is done
>> creating the indexes, so both TOTAL and DONE go up as we process more
>> partitions in the hierarchy.
>
>
> The TOTAL in the patch is set only when processing the top-level parent and 
> it is not updated when we recurse, so yes, it is constant. From v3:

Ugh, I misread the patch, more specifically count_leaf_partitions and
the !OidIsValid(parentIndexId) condition changes.

You are correct, sorry for the noise.

Kind regards,

Matthias van de Meent




Re: pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 1, 2023 at 11:18 AM Tom Lane  wrote:
>> Over at [1] we have a complaint that dump-and-restore fails for
>> hash-partitioned tables if a partitioning column is an enum,
>> because the enum values are unlikely to receive the same OIDs
>> in the destination database as they had in the source, and the
>> hash codes are dependent on those OIDs.

> It seems to me that this is the root of the problem.

Well, that was what I thought too to start with, but I now think that
it is far too narrow-minded a view of the problem.  The real issue
is something I said that you trimmed:

>> In general, we've never thought that hash values are
>> required to be consistent across platforms.

hashenum() is doing something that is perfectly fine in the context
of hash indexes, which have traditionally been our standard of how
reproducible hash values need to be.  Hash partitioning has moved
those goalposts, and if there was any discussion of the consequences
of that, I didn't see it.
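
(Quoting from memory, but hashenum() is essentially just

Datum
hashenum(PG_FUNCTION_ARGS)
{
	return hash_uint32((uint32) PG_GETARG_OID(0));
}

i.e. it hashes the enum value's OID, which is assigned when the type is
created and is therefore installation-specific.)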

>> Furthermore, I think we should make this happen in time for
>> next week's releases.  I can write the patch easily enough,
>> but we need a consensus PDQ that this is what to do.

> This seems extremely precipitous to me and I'm against it.

Yeah, it's been this way for awhile, so if there's debate then
we can let it go awhile longer.

> So I think we should be asking ourselves what we could do about the
> enum case specifically, rather than resorting to a bazooka.

My idea of solving this with a bazooka would be to deprecate hash
partitioning.  Quite frankly that's where I think we will end up
someday, but I'm not going to try to make that argument today.

In the meantime, I think we need to recognize that hash values are
not very portable.  I do not think we do our users a service by
letting them discover the corner cases the hard way.

I'd be willing to compromise on the intermediate idea of forcing
--load-via-partition-root just for hashed partitioning.

regards, tom lane




Re: Optimizing PostgreSQL with LLVM's PGO+LTO

2023-02-01 Thread João Paulo Labegalini de Carvalho
On Mon, Jan 30, 2023 at 10:47 AM Andres Freund  wrote:

> For some reason my notes for using LTO include changing RANLIB to point to
> gcc/llvm-ranlib of the appropriate version. Won't even be used on HEAD, but
> before that it can make a difference.
>

I will try that.


> Depending on how you built clang, it could be that the above recipe ends up
> using the system lld, which might be too old.
>

I double checked and I am using the lld that I built from source.


> What are the crashes you're getting?
>

When I run make check, the server starts up fine but the test queries seem
to not execute. I don't see any errors, the check step just quits after a
while.

2023-02-01 13:00:38.703 EST postmaster[28750] LOG:  starting PostgreSQL
14.5 on x86_64-pc-linux-gnu, compiled by clang version 15.0.6, 64-bit
2023-02-01 13:00:38.703 EST postmaster[28750] LOG:  listening on Unix
socket "/tmp/pg_regress-h8Fmqu/.s.PGSQL.58085"
2023-02-01 13:00:38.704 EST startup[28753] LOG:  database system was shut
down at 2023-02-01 13:00:38 EST
2023-02-01 13:00:38.705 EST postmaster[28750] LOG:  database system is
ready to accept connections

-- 
João Paulo L. de Carvalho
Ph.D Computer Science |  IC-UNICAMP | Campinas , SP - Brazil
Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada
joao.carva...@ic.unicamp.br
joao.carva...@ualberta.ca


Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:27:19 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> >> I like the idea of not relying on system(). In most respects, doing
> >> fork() + exec() ourselves seems superior. We can control where the
> >> output goes, what we do while waiting, etc. But system() runs the
> >> command through the shell, so that for example you don't have to
> >> invent your own way of splitting a string into words to be passed to
> >> exec[whatever](). I've never understood how you're supposed to get
> >> that behavior other than by calling system().
> 
> > We could just exec the shell in the forked process, using -c to invoke
> > the command. That should give us pretty much the same efficiency as
> > system(), with a lot more control.
> 
> The main thing that system() brings to the table is platform-specific
> knowledge of where the shell is.  I'm not very sure that we want to
> wire in "/bin/sh".

We seem to be doing OK with using SHELLPROG in pg_regress, which just
seems to be using $SHELL from the build environment.
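
Roughly the pattern under discussion, with the shell path left as the
open question -- only a sketch, error handling and latch handling
omitted:

	pid_t		pid = fork();

	if (pid == 0)
	{
		/* child: run the command through a shell, as system() would */
		execl(shell_path, "sh", "-c", command, (char *) NULL);
		_exit(127);				/* only reached if exec failed */
	}

	/* parent: wait for the child while staying responsive to latches */

where shell_path would come from SHELLPROG / $SHELL / "/bin/sh",
whichever way the portability question gets settled.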

Greetings,

Andres Freund




Re: Non-superuser subscription owners

2023-02-01 Thread Mark Dilger



> On Feb 1, 2023, at 6:43 AM, Robert Haas  wrote:

> The thing I'm
> struggling to understand is: if you only want to replicate into tables
> that Alice can write, why not just make Alice own the subscription?
> For a run-as user to make sense, you need a scenario where we want the
> replication to target only tables that Alice can touch, but we also
> don't want Alice herself to be able to touch the subscription, so you
> make Alice the run-as user and yourself the owner, or something like
> that. But I'm not sure what that scenario is exactly.

This "run-as" idea came about because we didn't want arbitrary roles to be able 
to change the subscription's connection string.  A competing idea was to have a 
server object rather than a string, with roles like Alice being able to use the 
server object if they have been granted usage privilege, and not otherwise.  So 
the "run-as" and "server" ideas were somewhat competing.

> Mark was postulating a scenario where the publisher and subscriber
> don't trust each other. I was thinking a little bit more about that. I
> still maintain that the current system is poorly set up to make that
> work, but suppose we wanted to do better. We could add filtering on
> the subscriber side, like you list schemas or specific relations that
> you are or are not willing to replicate into. Then you could, for
> example, connect your subscription to a certain remote publication,
> but with the restriction that you're only willing to replicate into
> the "headquarters" schema. Then we'll replicate whatever tables they
> send us, but if the dorks at headquarters mess up the publications on
> their end (intentionally or otherwise) and add some tables from the
> "locally_controlled_stuff" schema, we'll refuse to replicate that into
> our eponymous schema.

That example is good, though I don't see how "filters" are better than 
roles+privileges.  Care to elaborate?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 08:58:01AM -0800, Andres Freund wrote:
> On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
>> The fundamental issue is that we have no good way to break out
>> of system(), and I think the original idea was that
>> in_restore_command would be set *only* for the duration of the
>> system() call.  That's clearly been lost sight of completely,
>> but maybe as a stopgap we could try to get back to that.
> 
> We could push the functions setting in_restore_command down into
> ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
> being right either - we'd now use the mechanism in places we previously
> didn't (cleanup/end commands).

Right, we'd only want to set it for restore_command.  I think that's
doable.

> And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
> doesn't look right:
> - We now have two places open-coding what BuildRestoreCommand did

This was done because BuildRestoreCommand() had become a thin wrapper
around replace_percent_placeholders().  I can add it back if you don't
think this was the right decision.

> - I'm doubtful that the new shell_* functions are the base for a good
>   API to abstract restoring files

Why?

> - the error message for a failed restore command seems to have gotten
>   worse:
>   could not restore file \"%s\" from archive: %s"
>   ->
>   "%s \"%s\": %s", commandName, command

Okay, I'll work on improving this message.

> - shell_* imo is not a good namespace for something called from xlog.c,
>   xlogarchive.c. I realize the intention is that shell_archive.c is
>   going to be its own "restore module", but for now it imo looks odd

What do you propose instead?  FWIW this should go away with recovery
modules.  This is just an intermediate state to simplify those patches.

> - The comment moved out of RestoreArchivedFile() seems less
>   useful at its new location

Where do you think it should go?

> - explanation of why we use GetOldestRestartPoint() is halfway lost

Okay, I'll work on adding more context here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan  wrote:
> On Wed, Feb 1, 2023 at 5:20 AM Robert Haas  wrote:
> > If we're dumping a lot of details out of each WAL record, we might
> > want to switch to a multi-line format of some kind. No one enjoys a
> > 460-character wide line, let alone 46000.
>
> I generally prefer it when I can use psql without using expanded table
> format mode, and without having to use a pager. Of course that isn't
> always possible, but it often is. I just don't think that that's going
> to become feasible with pg_walinspect queries any time soon, since it
> really requires a comprehensive strategy to deal with the issue of
> verbosity.

Well, if we're thinking of making the output a lot more verbose, it
seems like we should at least do a bit of brainstorming about what
that strategy could be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Ilya Gladyshev


> On 1 Feb 2023, at 20:27, Matthias van de Meent wrote:
> 
>> On Wed, 1 Feb 2023 at 16:53, Justin Pryzby wrote:
>> 
>> On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
>>> On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  
>>> wrote:
> On 1 Feb 2023, at 16:01, Alvaro Herrera wrote:
> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> lookup for every single element therein ... this sounds slow.
> 
> In one of the callsites, we already have the partition descriptor
> available.  We could just scan partdesc->is_leaf[] and add one for each
> 'true' value we see there.
 
 The problem is that partdesc contains only direct children of the table 
 and we need all the children down the inheritance tree to count the total 
 number of leaf partitions in the first callsite.
 
> In the other callsite, we had the table open just a few lines before the
> place you call count_leaf_partitions.  Maybe we can rejigger things by
> examining its state before closing it: if relkind is not partitioned we
> know leaf_partitions=0, and only if partitioned we count leaf partitions.
> I think that would save some work.  I also wonder if it's worth writing
> a bespoke function for counting leaf partitions rather than relying on
> find_all_inheritors.
 
 Sure, added this condition to avoid the extra work here.
 
>>> 
   When creating an index on a partitioned table, this column is set to
 -   the total number of partitions on which the index is to be created.
 +   the total number of leaf partitions on which the index is to be 
 created or attached.
>>> 
>>> I think we should also add a note about the (now) non-constant nature
>>> of the value, something along the lines of "This value is updated as
>>> we're processing and discovering partitioned tables in the partition
>>> hierarchy".
>> 
>> But the TOTAL is constant, right ?  Updating the total when being called
>> recursively is the problem these patches fix.
> 
> If that's the case, then I'm not seeing the 'fix' part of the patch. I
> thought this patch was fixing the provably incorrect TOTAL value where
> DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL
> values instead of updating them.
> 
> In HEAD we set TOTAL to whatever number partitioned table we're
> currently processing has - regardless of whether we're the top level
> statement.
> With the patch we instead add the number of child relations to that
> count, for which REL_HAS_STORAGE(child) -- or at least, in the v3
> posted by Ilya. Approximately immediately after updating that count we
> recurse to the child relations, and that only returns once it is done
> creating the indexes, so both TOTAL and DONE go up as we process more
> partitions in the hierarchy.

The TOTAL in the patch is set only when processing the top-level parent and it 
is not updated when we recurse, so yes, it is constant. From v3:

@@ -1219,8 +1243,14 @@ DefineIndex(Oid relationId,
 		Relation	parentIndex;
 		TupleDesc	parentDesc;
 
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
-									 nparts);
+		if (!OidIsValid(parentIndexId))
+		{
+			int			total_parts;
+
+			total_parts = count_leaf_partitions(relationId);
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+										 total_parts);
+		}


It is set to the total number of children on all levels of the hierarchy, not 
just the current one, so the total value doesn’t need to be updated later, 
because it is set to the correct value from the very beginning. 
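
(For context, count_leaf_partitions() in v3 is essentially a walk over
the whole inheritance tree, counting relations that have storage --
along the lines of:

static int
count_leaf_partitions(Oid relid)
{
	int			nleaves = 0;
	List	   *childs = find_all_inheritors(relid, NoLock, NULL);
	ListCell   *lc;

	foreach(lc, childs)
	{
		Oid			childrelid = lfirst_oid(lc);

		if (RELKIND_HAS_STORAGE(get_rel_relkind(childrelid)))
			nleaves++;
	}

	list_free(childs);
	return nleaves;
}

which is also where the pg_inherits scan and syscache lookups that
Alvaro mentioned come from.)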

It is the DONE counter that is updated, and when we attach an index of a 
partition that is itself a partitioned table (like a2 in your example, if it 
already had an index created), it will be updated by the number of children of 
the partition.

@@ -1431,9 +1463,25 @@ DefineIndex(Oid relationId,
 
 				SetUserIdAndSecContext(child_save_userid,
 									   child_save_sec_context);
 			}
+			else
+			{
+				int			attached_parts = 1;
+
+				if (RELKIND_HAS_PARTITIONS(child_relkind))
+					attached_parts = count_leaf_partitions(childRelid);
+
+				/*
+ 

Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:23:27 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 13:05:32 +0300, Aleksander Alekseev wrote:
> >> It works. Perhaps we should add:
> >> ninja -C build alldocs
> 
> > FWIW, just 'docs' would build just the multi-page html/man pages,
> > alldocs takes a lot longer...
> 
> Hmm ... why does 'docs' include the man pages, and not just the html?

I think it's because the makefile is doing things a bit oddly, and I
didn't quite grok that in the right moment.

I probably just saw:
all: html man

but before that there's

# Make "html" the default target, since that is what most people tend
# to want to use.
html:


> It's unlike what "make -C doc/src/sgml all" does in the Makefile
> system, and I don't find it to be an improvement.

Well, that'd actually build the manpages too, afaics :). But I get the
point.

I really have no opinion on what we should should build under what
name. Happy to change what's included in 'docs', add additional targets,
etc.


> I want the man pages approximately never, so I don't care to wait
> around for them to be built.
> 
> While I'm bitching ... section 17.1 doesn't mention that you need
> ninja to use meson, much less mention the minimum version.

Peter rewrote the requirements (almost?) entirely while committing the
docs from Samay and hasn't responded to my concerns about the new
form...


Normally the ninja version that's pulled in by meson should suffice. I
suspect that the problem you found can be worked around.

> And the minimum version appears to be newer than RHEL8's 1.8.2, which
> I find pretty unfortunate.  On RHEL8, it fails with
> $ ninja
> ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
> depslog; bring this up on the mailing list if it affects you

What's in that line +- 2 lines?  And/or what are the steps that got you
to that point?

I'll try building 1.8.2 and reproing.


> I did manage to test this stuff on bleeding-edge Fedora,
> but ...

Yea, I worked a fair bit to avoid requiring a too new version, I'll try
to figure out what went wrong.  I did built on rhel8 not long ago, so I
suspect it's a corner case somewhere.

Greetings,

Andres Freund




Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Peter Geoghegan
On Wed, Feb 1, 2023 at 5:20 AM Robert Haas  wrote:
> If we're dumping a lot of details out of each WAL record, we might
> want to switch to a multi-line format of some kind. No one enjoys a
> 460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

It seems practically mandatory to use a pager when running
pg_walinspect queries in psql right now -- pspg is good for this. I
really can't use expanded table mode here, since it obscures the
relationship between adjoining records. I'm usually looking through
rows/records in LSN order, and want to be able to easily compare the
LSNs (or other details) of groups of adjoining records.

--
Peter Geoghegan




Re: pg_dump versus hash partitioning

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 11:18 AM Tom Lane  wrote:
> Over at [1] we have a complaint that dump-and-restore fails for
> hash-partitioned tables if a partitioning column is an enum,
> because the enum values are unlikely to receive the same OIDs
> in the destination database as they had in the source, and the
> hash codes are dependent on those OIDs.

It seems to me that this is the root of the problem. We can't expect
to hash on something that's not present in the dump file and have
anything work.

> So here's what I think we should actually do: make
> --load-via-partition-root the default.  We can provide a
> switch to turn it off, for those who are brave or foolish
> enough to risk that in the name of saving a few cycles,
> but it ought to be the default.
>
> Furthermore, I think we should make this happen in time for
> next week's releases.  I can write the patch easily enough,
> but we need a consensus PDQ that this is what to do.

This seems extremely precipitous to me and I'm against it. I like the
fact that we have --load-via-partition-root, but it is a bit of a
hack. You don't get a single copy into the partition root, you get one
per child table -- and those COPY statements are listed as data for
the partitions where the data lives now, not for the parent table. I
am not completely sure whether there is a scenario where that's an
issue, but it's certainly an oddity. Also, and I think pretty
significantly, using --load-via-partition-root forces you to pay the
overhead of rerouting every tuple to the target partition whether you
need it or not, which is potentially a large unnecessary expense. I
don't think we should just foist that kind of overhead onto everyone
in every situation for every data type because somebody had a problem
in a certain case.

And even if we do decide to do that at some point, I don't think it is
right at all to rush such a change out on a short time scale, with
little time to mull over consequences and alternative fixes. I think
that could easily hurt more people than it helps.

I think that not all of the cases that you list are of the same type.
Loading a dump under a different encoding or on a different endianness
are surely corner cases. They might come up for some people
occasionally, but they're not typical. In the case of endianness,
that's because little-endian has pretty much taken over the world; in
the case of encoding, that's because converting data between encodings
is a real pain, and combining that with a database dump and restore is
likely to be very little fun. It's hard to argue that collation
changes fall into the same category: we know that they get changed all
the time, often silently. But none of us database geeks think that's a
good thing: just that it's a thing that we have to deal with.

The enum case seems completely different to me. That's not the result
of trying to migrate your data to another architecture or of the glibc
maintainers not believing in sorting working the same way on Tuesday
that it did on Monday. That's the result of the PostgreSQL project
hashing data in a way that does not make any real sense for the
application at hand. Any hash function that we use for partitioning
has to work based on data that is preserved by the dump-and-restore
process. I would argue that the float case is not of the same kind:
yes, if you round your data off, then the values are going to hash
differently, but if you truncate your strings, those will hash
differently too. Duh. Intentionally changing the value is supposed to
change the hash code, that's kind of the point of hashing.

So I think we should be asking ourselves what we could do about the
enum case specifically, rather than resorting to a bazooka.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
>> I like the idea of not relying on system(). In most respects, doing
>> fork() + exec() ourselves seems superior. We can control where the
>> output goes, what we do while waiting, etc. But system() runs the
>> command through the shell, so that for example you don't have to
>> invent your own way of splitting a string into words to be passed to
>> exec[whatever](). I've never understood how you're supposed to get
>> that behavior other than by calling system().

> We could just exec the shell in the forked process, using -c to invoke
> the command. That should give us pretty much the same efficiency as
> system(), with a lot more control.

The main thing that system() brings to the table is platform-specific
knowledge of where the shell is.  I'm not very sure that we want to
wire in "/bin/sh".

regards, tom lane




Re: MacOS: xsltproc fails with "warning: failed to load external entity"

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 13:05:32 +0300, Aleksander Alekseev wrote:
>> It works. Perhaps we should add:
>> ninja -C build alldocs

> FWIW, just 'docs' would build just the multi-page html/man pages,
> alldocs takes a lot longer...

Hmm ... why does 'docs' include the man pages, and not just the html?
It's unlike what "make -C doc/src/sgml all" does in the Makefile
system, and I don't find it to be an improvement.  I want the man
pages approximately never, so I don't care to wait around for them
to be built.

While I'm bitching ... section 17.1 doesn't mention that you need
ninja to use meson, much less mention the minimum version.  And
the minimum version appears to be newer than RHEL8's 1.8.2,
which I find pretty unfortunate.  On RHEL8, it fails with

$ ninja
ninja: error: build.ninja:6771: multiple outputs aren't (yet?) supported by 
depslog; bring this up on the mailing list if it affects you

I did manage to test this stuff on bleeding-edge Fedora,
but ...

regards, tom lane




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 12:08:24 -0500, Robert Haas wrote:
> On Wed, Feb 1, 2023 at 11:58 AM Andres Freund  wrote:
> > > 9a740f81e clearly made things a lot worse, but it wasn't great
> > > before.  Can we see a way forward to removing the problem entirely?
> >
> > Yea, I think we can - we should stop relying on system(). If we instead
> > run the command properly as a subprocess, we don't need to do bad things
> > in the signal handler anymore.
> 
> I like the idea of not relying on system(). In most respects, doing
> fork() + exec() ourselves seems superior. We can control where the
> output goes, what we do while waiting, etc. But system() runs the
> command through the shell, so that for example you don't have to
> invent your own way of splitting a string into words to be passed to
> exec[whatever](). I've never understood how you're supposed to get
> that behavior other than by calling system().

We could just exec the shell in the forked process, using -c to invoke
the command. That should give us pretty much the same efficiency as
system(), with a lot more control.

I think we already do that somewhere. Ah, yes, spawn_process() in
pg_regress.c.  I suspect we couldn't use exec for restore_command etc,
as I think it's not uncommon to use && in the command.


Perhaps we should abstract the relevant pieces of spawn_process() into
something more general? The OS specifics are sufficiently
complicated that I don't think it'd be good to have multiple copies.


It's too bad that we have the history of passing things to the shell,
otherwise we could define common argument handling for the GUC and just
execve ourselves, but that ship has sailed.

Greetings,

Andres Freund




Re: meson: pkgconfig difference

2023-02-01 Thread Peter Eisentraut

On 01.02.23 08:55, Andres Freund wrote:

Hi,

On January 31, 2023 11:40:52 PM PST, Peter Eisentraut 
 wrote:

I think there is a tiny typo in src/interfaces/ecpg/ecpglib/meson.build:

diff --git a/src/interfaces/ecpg/ecpglib/meson.build 
b/src/interfaces/ecpg/ecpglib/meson.build
index dba9e3c3d9..da8d304f54 100644
--- a/src/interfaces/ecpg/ecpglib/meson.build
+++ b/src/interfaces/ecpg/ecpglib/meson.build
@@ -57,7 +57,7 @@ pkgconfig.generate(
   description: 'PostgreSQL libecpg library',
   url: pg_url,
   libraries: ecpglib_so,
-  libraries_private: [frontend_shlib_code, thread_dep],
+  libraries_private: [frontend_stlib_code, thread_dep],
   requires_private: ['libpgtypes', 'libpq'],
)

This makes it match the other libraries.

Without this change, we get

Libs.private: ... -lpgport_shlib -lpgcommon_shlib

which seems wrong.


Ugh, yes, that's wrong. Do you want me to apply the fix?


I've done it now.





Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Robert Haas
On Wed, Feb 1, 2023 at 11:58 AM Andres Freund  wrote:
> > 9a740f81e clearly made things a lot worse, but it wasn't great
> > before.  Can we see a way forward to removing the problem entirely?
>
> Yea, I think we can - we should stop relying on system(). If we instead
> run the command properly as a subprocess, we don't need to do bad things
> in the signal handler anymore.

I like the idea of not relying on system(). In most respects, doing
fork() + exec() ourselves seems superior. We can control where the
output goes, what we do while waiting, etc. But system() runs the
command through the shell, so that for example you don't have to
invent your own way of splitting a string into words to be passed to
exec[whatever](). I've never understood how you're supposed to get
that behavior other than by calling system().

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: About PostgreSQL Core Team

2023-02-01 Thread Pavel Borisov
Hi, Adherent!

IMO "not liking" that you quote in the picture is just other words for
expressing caution for the patch or for the general direction of some
change. At least I never felt personal or arbitrary presumptions in
relation to my patches. So if you can join a discussion with your proposals
to address the sources of caution, and improve or review the patches, it
would be really helpful. And these are really the things that move patches
forward, not just complaining about words.

Regards,
Pavel Borisov,
Supabase


Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Andres Freund
Hi,

On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> >> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> >> handler which is allowed to call that while in_restore_command is
> >> true.
>
> > Ugh, no wonder we're getting crashes. This whole business seems bogus as
> > hell.
>
> Indeed :-(
>
> > I don't see a choice but to revert the recent changes. They need a
> > fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before.  Can we see a way forward to removing the problem entirely?

Yea, I think we can - we should stop relying on system(). If we instead
run the command properly as a subprocess, we don't need to do bad things
in the signal handler anymore.


> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

We could push the functions setting in_restore_command down into
ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
being right either - we'd now use the mechanism in places we previously
didn't (cleanup/end commands).

And there's just plenty of other stuff in 14bdb3f13de and 9a740f81eb0 that
doesn't look right:
- We now have two places open-coding what BuildRestoreCommand did

- I'm doubtful that the new shell_* functions are the base for a good
  API to abstract restoring files

- the error message for a failed restore command seems to have gotten
  worse:
  could not restore file \"%s\" from archive: %s"
  ->
  "%s \"%s\": %s", commandName, command

- shell_* imo is not a good namespace for something called from xlog.c,
  xlogarchive.c. I realize the intention is that shell_archive.c is
  going to be its own "restore module", but for now it imo looks odd

- The comment moved out of RestoreArchivedFile() seems less useful at
  its new location

- explanation of why we use GetOldestRestartPoint() is halfway lost


My name is listed as the first Reviewed-by, but I certainly haven't done
any meaningful review of these patches. I just replied to the top-level
email proposing "recovery modules".

Greetings,

Andres Freund




Re: Timeline ID hexadecimal format

2023-02-01 Thread Sébastien Lardière

On 31/01/2023 20:16, Greg Stark wrote:

The fact that the *filename* has it encoded in hex is an
implementation detail and really gets exposed here because it's giving
you the underlying system error that caused the problem.



It's an implementation detail, but an exposed one, so people refer
to the filename to find the timeline ID (that's how it happened to me).




  The confusion
only arises when the two are juxtaposed. A hint or something just in
that case might be enough?




Thanks, I got your point.

Note that my proposal was to remove the ambiguous notation, which happens
in some cases (as in 11 <-> 17). A hint is useless in most cases, because
there is no ambiguity. That's why I thought of using hexadecimal format
everywhere.



At least, may I propose improving the documentation to make it explicit
that the timeline ID appears in hexadecimal in file names but must be
given in decimal to recovery_target_timeline and pg_waldump?
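
For example (hypothetical segment name), the point could be illustrated
along these lines:

# WAL file names encode the timeline in hexadecimal, while
# recovery_target_timeline is read as a decimal number, so segment
# 000000110000000100000002 belongs to timeline 17:
recovery_target_timeline = '17'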



regards,

--
Sébastien





Re: RLS makes COPY TO process child tables

2023-02-01 Thread Tom Lane
Yugo NAGATA  writes:
> Antonin Houska  wrote:
>> While working on [1] I noticed that if RLS gets enabled, the COPY TO command
>> includes the contents of child table into the result, although the
>> documentation says it should not:

> I think this is a bug because the current behaviour is different from
> the documentation. 

I agree, it shouldn't do that.

> When RLS is enabled on a table in `COPY ... TO ...`, the query is converted
> to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS
> clauses. This causes to dump the rows of child tables.

Do we actually say that in so many words, either in the code or docs?
If so, it ought to read `COPY (SELECT * FROM ONLY ...) TO ...`
instead.  (If we say that in the docs, then arguably the code *does*
conform to the docs.  But I don't see it in the COPY ref page at least.)
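
For illustration, a minimal inheritance setup (made-up table names) showing
the distinction in question:

CREATE TABLE parent (a int);
CREATE TABLE child () INHERITS (parent);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (2);

COPY parent TO STDOUT;                       -- documented behavior: only row 1
COPY (SELECT * FROM parent) TO STDOUT;       -- rows 1 and 2
COPY (SELECT * FROM ONLY parent) TO STDOUT;  -- only row 1 again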

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Matthias van de Meent
On Wed, 1 Feb 2023 at 16:53, Justin Pryzby  wrote:
>
> On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
> > On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  
> > wrote:
> > > > 1 февр. 2023 г., в 16:01, Alvaro Herrera  
> > > > написал(а):
> > > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> > > > lookup for every single element therein ... this sounds slow.
> > > >
> > > > In one of the callsites, we already have the partition descriptor
> > > > available.  We could just scan partdesc->is_leaf[] and add one for each
> > > > 'true' value we see there.
> > >
> > > The problem is that partdesc contains only direct children of the table 
> > > and we need all the children down the inheritance tree to count the total 
> > > number of leaf partitions in the first callsite.
> > >
> > > > In the other callsite, we had the table open just a few lines before the
> > > > place you call count_leaf_partitions.  Maybe we can rejigger things by
> > > > examining its state before closing it: if relkind is not partitioned we
> > > > know leaf_partitions=0, and only if partitioned we count leaf 
> > > > partitions.
> > > > I think that would save some work.  I also wonder if it's worth writing
> > > > a bespoke function for counting leaf partitions rather than relying on
> > > > find_all_inheritors.
> > >
> > > Sure, added this condition to avoid the extra work here.
> > >
> >
> > >When creating an index on a partitioned table, this column is set 
> > > to
> > > -   the total number of partitions on which the index is to be 
> > > created.
> > > +   the total number of leaf partitions on which the index is to be 
> > > created or attached.
> >
> > I think we should also add a note about the (now) non-constant nature
> > of the value, something along the lines of "This value is updated as
> > we're processing and discovering partitioned tables in the partition
> > hierarchy".
>
> But the TOTAL is constant, right ?  Updating the total when being called
> recursively is the problem these patches fix.

If that's the case, then I'm not seeing the 'fix' part of the patch. I
thought this patch was fixing the provably incorrect TOTAL value where
DONE > TOTAL due to the recursive operation overwriting the DONE/TOTAL
values instead of updating them.

In HEAD we set TOTAL to whatever number partitioned table we're
currently processing has - regardless of whether we're the top level
statement.
With the patch we instead add to that count the number of child relations
for which REL_HAS_STORAGE(child) is true -- or at least, that is what the v3
posted by Ilya does. Approximately immediately after updating that count we
recurse to the child relations, and that only returns once it is done
creating the indexes, so both TOTAL and DONE go up as we process more
partitions in the hierarchy.

An example hierarchy:
CREATE TABLE parent (a int, b int) partition by list (a);
CREATE TABLE a1
PARTITION OF parent FOR VALUES IN (1)
PARTITION BY LIST (b);
CREATE TABLE a1bd
PARTITION OF a1 DEFAULT;

CREATE TABLE a2
PARTITION OF parent FOR VALUES IN (2)
PARTITION BY LIST (b);
CREATE TABLE a2bd
PARTITION OF a2 DEFAULT;

INSERT INTO parent (a, b) SELECT * from generate_series(1, 2) a(a)
cross join generate_series(1, 10) b(b);
CREATE INDEX ON parent(a,b);

This will only discover that a2bd will need to be indexed after a1bd
is done (or vice versa, depending on which order a1 and a2 are
processed in).

Kind regards,

Matthias van de Meent




pg_dump versus hash partitioning

2023-02-01 Thread Tom Lane
Over at [1] we have a complaint that dump-and-restore fails for
hash-partitioned tables if a partitioning column is an enum,
because the enum values are unlikely to receive the same OIDs
in the destination database as they had in the source, and the
hash codes are dependent on those OIDs.  So restore tries to
load rows into the wrong leaf tables, and it's all a mess.
The patch approach proposed at [1] doesn't really work, but
what does work is to use pg_dump's --load-via-partition-root
option, so that the tuple routing decisions are all re-made.
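
As a minimal sketch of the hazard (made-up enum type):

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE t (m mood) PARTITION BY HASH (m);
CREATE TABLE t_p0 PARTITION OF t FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES WITH (MODULUS 2, REMAINDER 1);
INSERT INTO t VALUES ('happy');
-- which leaf this row lands in depends on hashing the OID of 'happy',
-- and a restored database will generally assign that label a different OID
SELECT tableoid::regclass, m FROM t;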

I'd initially proposed that we force --load-via-partition-root
if we notice that we have hash partitioning on an enum column.
But the more I thought about this, the more comparable problems
came to mind:

1. Hash partitioning on text columns will likely fail if the
destination uses a different encoding.

2. Hash partitioning on float columns will fail if you use
--extra-float-digits to round off the values.  And then
there's the fact that the behavior of strtod() might vary
across platforms.

3. Hash partitioning on floats is also endian-dependent,
and the same is likely true for some other types.

4. Anybody want to bet that complex types such as jsonb
are entirely free of similar hazards?  (Yes, somebody
thought it'd be a good idea to provide jsonb_hash.)

In general, we've never thought that hash values are
required to be consistent across platforms.

That was leading me to think that we should force
--load-via-partition-root for any hash-partitioned table,
just to pre-emptively avoid these problems.  But then
I remembered that

5. Range partitioning on text columns will likely fail if the
destination uses a different collation.

This is looking like a very large-caliber foot-gun, isn't it?
And remember that --load-via-partition-root acts at pg_dump
time, not restore.  If all you have is a dump file with no
opportunity to go back and get a new one, and it won't load
into your new server, you have got a nasty problem.

I don't think this is an acceptable degree of risk, considering
that the primary use-cases for pg_dump involve target systems
that aren't 100.00% identical to the source.

So here's what I think we should actually do: make
--load-via-partition-root the default.  We can provide a
switch to turn it off, for those who are brave or foolish
enough to risk that in the name of saving a few cycles,
but it ought to be the default.

Furthermore, I think we should make this happen in time for
next week's releases.  I can write the patch easily enough,
but we need a consensus PDQ that this is what to do.

Anyone want to bikeshed on the spelling of the new switch?
I'm contemplating "--load-via-partition-leaf" or perhaps
"--no-load-via-partition-root".

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/765e5968-6c39-470f-95bf-7b14e6b9a1c0%40app.fastmail.com




Re: RLS makes COPY TO process child tables

2023-02-01 Thread Yugo NAGATA
On Wed, 01 Feb 2023 12:45:57 +0100
Antonin Houska  wrote:

> While working on [1] I noticed that if RLS gets enabled, the COPY TO command
> includes the contents of child table into the result, although the
> documentation says it should not:
> 
>   "COPY TO can be used only with plain tables, not views, and does not
>   copy rows from child tables or child partitions. For example, COPY
>   table TO copies the same rows as SELECT * FROM ONLY table. The syntax
>   COPY (SELECT * FROM table) TO ... can be used to dump all of the rows
>   in an inheritance hierarchy, partitioned table, or view."
> 
> A test case is attached (rls.sql) as well as fix proposal
> (copy_rls_no_inh.diff).

I think this is a bug because the current behaviour is different from
the documentation. 

When RLS is enabled on a table in `COPY ... TO ...`, the query is converted
to `COPY (SELECT * FROM ...) TO ...` to allow the rewriter to add in RLS
clauses. This causes to dump the rows of child tables.

The patch fixes this by setting "inh" of the table in the converted query
to false. This seems reasonable and actually fixes the problem.

However, I think we would want a comment on the added line. Also, the
attached test should be placed in the regression test.

Regards,
Yugo Nagata

> 
> [1] https://commitfest.postgresql.org/41/3641/
> 
> -- 
> Antonin Houska
> Web: https://www.cybertec-postgresql.com
> 


-- 
Yugo NAGATA 




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Nathan Bossart
On Wed, Feb 01, 2023 at 10:12:26AM -0500, Tom Lane wrote:
> Andres Freund  writes:
>> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>>> handler which is allowed to call that while in_restore_command is
>>> true.
> 
>> Ugh, no wonder we're getting crashes. This whole business seems bogus as
>> hell.
> 
> Indeed :-(

Ugh.  My bad.

> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call.  That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.

+1.  I'll produce some patches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Justin Pryzby
On Wed, Feb 01, 2023 at 04:21:35PM +0100, Matthias van de Meent wrote:
> On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  
> wrote:
> > > 1 февр. 2023 г., в 16:01, Alvaro Herrera  
> > > написал(а):
> > > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> > > lookup for every single element therein ... this sounds slow.
> > >
> > > In one of the callsites, we already have the partition descriptor
> > > available.  We could just scan partdesc->is_leaf[] and add one for each
> > > 'true' value we see there.
> >
> > The problem is that partdesc contains only direct children of the table and 
> > we need all the children down the inheritance tree to count the total 
> > number of leaf partitions in the first callsite.
> >
> > > In the other callsite, we had the table open just a few lines before the
> > > place you call count_leaf_partitions.  Maybe we can rejigger things by
> > > examining its state before closing it: if relkind is not partitioned we
> > > know leaf_partitions=0, and only if partitioned we count leaf partitions.
> > > I think that would save some work.  I also wonder if it's worth writing
> > > a bespoke function for counting leaf partitions rather than relying on
> > > find_all_inheritors.
> >
> > Sure, added this condition to avoid the extra work here.
> >
> 
> >When creating an index on a partitioned table, this column is set to
> > -   the total number of partitions on which the index is to be created.
> > +   the total number of leaf partitions on which the index is to be 
> > created or attached.
> 
> I think we should also add a note about the (now) non-constant nature
> of the value, something along the lines of "This value is updated as
> we're processing and discovering partitioned tables in the partition
> hierarchy".

But the TOTAL is constant, right ?  Updating the total when being called
recursively is the problem these patches fix.

-- 
Justin




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Matthias van de Meent
On Wed, 1 Feb 2023 at 15:21, Ilya Gladyshev  wrote:
>
> > 1 февр. 2023 г., в 16:01, Alvaro Herrera  
> > написал(а):
> >
> > Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> > lookup for every single element therein ... this sounds slow.
> >
> > In one of the callsites, we already have the partition descriptor
> > available.  We could just scan partdesc->is_leaf[] and add one for each
> > 'true' value we see there.
>
> The problem is that partdesc contains only direct children of the table and 
> we need all the children down the inheritance tree to count the total number 
> of leaf partitions in the first callsite.
>
> > In the other callsite, we had the table open just a few lines before the
> > place you call count_leaf_partitions.  Maybe we can rejigger things by
> > examining its state before closing it: if relkind is not partitioned we
> > know leaf_partitions=0, and only if partitioned we count leaf partitions.
> > I think that would save some work.  I also wonder if it's worth writing
> > a bespoke function for counting leaf partitions rather than relying on
> > find_all_inheritors.
>
> Sure, added this condition to avoid the extra work here.
>

>When creating an index on a partitioned table, this column is set to
> -   the total number of partitions on which the index is to be created.
> +   the total number of leaf partitions on which the index is to be 
> created or attached.

I think we should also add a note about the (now) non-constant nature
of the value, something along the lines of "This value is updated as
we're processing and discovering partitioned tables in the partition
hierarchy".

Kind regards,

Matthias van de Meent




Re: Weird failure with latches in curculio on v15

2023-02-01 Thread Tom Lane
Andres Freund  writes:
> On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
>> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
>> handler which is allowed to call that while in_restore_command is
>> true.

> Ugh, no wonder we're getting crashes. This whole business seems bogus as
> hell.

Indeed :-(

> I don't see a choice but to revert the recent changes. They need a
> fairly large rewrite.

9a740f81e clearly made things a lot worse, but it wasn't great
before.  Can we see a way forward to removing the problem entirely?

The fundamental issue is that we have no good way to break out
of system(), and I think the original idea was that
in_restore_command would be set *only* for the duration of the
system() call.  That's clearly been lost sight of completely,
but maybe as a stopgap we could try to get back to that.
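
In code terms, roughly (illustrative only; the command variable name is
made up):

    /* stopgap: flag is set only while system() is actually running */
    in_restore_command = true;
    rc = system(restoreCmd);
    in_restore_command = false;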

regards, tom lane




Re: Support for dumping extended statistics

2023-02-01 Thread Tom Lane
Tomas Vondra  writes:
> On 1/7/23 03:39, Bruce Momjian wrote:
>> There is certainly interest in allowing the optimizer statistics to be
>> dumped and reloaded.  This could be used by pg_restore and pg_upgrade.

> Indeed, although I think it'd be better to deal with regular statistics
> (which is what 99% of systems use). Furthermore, we should probably
> think about differences between major versions - until now we could
> change on-disk format of the statistics, because we have reset them.

Yeah, it's extremely odd to be proposing dump/reload for extended
stats when we don't yet have it for plain stats.  And yes, the main
stumbling block is that you need to have a plan for stats changing
across versions, or even just environmental issues.  For example,
what if the target DB doesn't use the same collation as the source?
That would affect string sorting and therefore at least partially
invalidate histograms for text columns.

I actually did some work on this, probably close to ten years ago
now, and came up with some hacks that didn't pass community review.
It'd be a good idea to dig up those old discussions if you want to
re-open the topic.

regards, tom lane




Re: Non-superuser subscription owners

2023-02-01 Thread Robert Haas
On Tue, Jan 31, 2023 at 7:01 PM Andres Freund  wrote:
> I don't really understand that - the run-as approach seems like a
> necessary piece of improving the security model.
>
> I think it's perfectly reasonable to want to replicate from one system
> in another, but to not want to allow logical replication to insert into
> pg_class or whatnot. So not using superuser to execute the replication
> makes sense.
>
> This is particularly the case if you're just replicating a small part of
> the tables from one system to another. E.g. in a sharded setup, you may
> want to replicate metadata too servers.

I don't think that a system catalog should be considered a valid
replication target, no matter who owns the subscription, so ISTM that
writing to pg_class should be blocked regardless. The thing I'm
struggling to understand is: if you only want to replicate into tables
that Alice can write, why not just make Alice own the subscription?
For a run-as user to make sense, you need a scenario where we want the
replication to target only tables that Alice can touch, but we also
don't want Alice herself to be able to touch the subscription, so you
make Alice the run-as user and yourself the owner, or something like
that. But I'm not sure what that scenario is exactly.

Mark was postulating a scenario where the publisher and subscriber
don't trust each other. I was thinking a little bit more about that. I
still maintain that the current system is poorly set up to make that
work, but suppose we wanted to do better. We could add filtering on
the subscriber side, like you list schemas or specific relations that
you are or are not willing to replicate into. Then you could, for
example, connect your subscription to a certain remote publication,
but with the restriction that you're only willing to replicate into
the "headquarters" schema. Then we'll replicate whatever tables they
send us, but if the dorks at headquarters mess up the publications on
their end (intentionally or otherwise) and add some tables from the
"locally_controlled_stuff" schema, we'll refuse to replicate that into
our eponymous schema. I don't think this kind of system is well-suited
to environments where people are totally hostile to each other,
because you still need to have replication slots on the remote side
and stuff. Also, having the remote side decode stuff and ignoring it
locally is expensive, and I bet if we add stuff like this then people
will misuse it and be sad. But it would make the system easier to
reason about: I know for sure that this subscription will only write
to these places, because that's all I've given it permission to do.

In the sharding scenario you mention, if you want to prevent
accidental writes to unrelated tables due to the publication not being
what we expect, you can either make the subscription owned by the same
role that owns the sharded tables, or a special-purpose role that has
permission to write to exactly the set of tables that you expect to be
touched and no others. Or, if you had something like what I posited in
the last paragraph, you could use that instead. But I don't see how a
separate run-as user helps. If I'm just being super-dense here, I hope
that one of you will explain using short words. :-)

> I think we'll need two things to improve upon the current situation:
>
> 1) run-as user, to reduce the scope of potential danger
>
> 2) Option to run the database inserts as the owner of the table, with a
>    check that the run-as is actually allowed to perform work as the
>    owning role. That prevents escalation from table owner (who could add
>    default expressions etc) from getting the privs of the
>    run-as/replication owner.

I'm not quite sure what we do here now, but I agree that trigger
firing seems like a problem. It might be that we need to worry about
the user on the origin server, too. If Alice inserts a row that causes
a replicated table owned by Bob to fire a trigger or evaluate a
default expression or whatever due the presence of a subscription
owned by Charlie, there is a risk that Alice might try to attack
either Bob or Charlie, or that Bob might try to attack Charlie.

> > And if we suppose that that already works and is safe, well then
> > what's the case where I do need a run-as user?
>
> It's not at all safe today, IMO. You need to trust that nothing bad will
> be replicated, otherwise the owner of the subscription has to be
> considered compromised.

What kinds of things are bad to replicate?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Progress report of CREATE INDEX for nested partitioned tables

2023-02-01 Thread Ilya Gladyshev

> 1 февр. 2023 г., в 16:01, Alvaro Herrera  написал(а):
> 
> Hmm, count_leaf_partitions has to scan pg_inherits and do a syscache
> lookup for every single element therein ... this sounds slow. 
> 
> In one of the callsites, we already have the partition descriptor
> available.  We could just scan partdesc->is_leaf[] and add one for each
> 'true' value we see there.

The problem is that partdesc contains only direct children of the table and we 
need all the children down the inheritance tree to count the total number of 
leaf partitions in the first callsite.

> In the other callsite, we had the table open just a few lines before the
> place you call count_leaf_partitions.  Maybe we can rejigger things by
> examining its state before closing it: if relkind is not partitioned we
> know leaf_partitions=0, and only if partitioned we count leaf partitions.
> I think that would save some work.  I also wonder if it's worth writing
> a bespoke function for counting leaf partitions rather than relying on
> find_all_inheritors.

Sure, added this condition to avoid the extra work here.


v3-0001-create-index-progress-increment.patch
Description: Binary data





Performance issues with parallelism and LIMIT

2023-02-01 Thread David Geier

Hi hackers,

While migrating from PostgreSQL 14 to 15, we encountered the following 
performance degradation caused by commit 46846433a03dff: "shm_mq: Update 
mq_bytes_written less often", discussion in [1].


The batching can make queries with a LIMIT clause run significantly 
slower compared to PostgreSQL 14, because neither the ring buffer write 
position is updated, nor the latch to inform the leader that there's 
data available is set, before a worker's queue is 1/4th full. This can 
be seen in the number of rows produced by a parallel worker. Worst-case, 
the data set is large and all rows to answer the query appear early, but 
are not big enough to fill the queue to 1/4th (e.g. when the LIMIT and 
the tuple sizes are small). Here is an example to reproduce the problem.


CREATE TABLE t(id1 INT, id2 INT, id3 INT, id4 INT, id5 INT);
INSERT INTO t(id1, id2, id3, id4, id5) SELECT i%1000, i, i, i, i FROM 
generate_series(1, 1000) AS i;

ANALYZE t;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT id2 FROM t WHERE id1 = 100 LIMIT 100;

PostgreSQL 15:

 Limit  (cost=0.00..797.43 rows=100 width=4) (actual time=65.083..69.207 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9947 width=4) (actual time=65.073..68.417 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t  (cost=0.00..79320.18 rows=1243 width=4) (actual time=0.204..33.049 rows=100 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 99345
               Worker 0:  actual time=0.334..32.284 rows=100 loops=1
               Worker 1:  actual time=0.060..32.680 rows=100 loops=1
               Worker 2:  actual time=0.637..33.954 rows=98 loops=1
               Worker 3:  actual time=0.136..33.301 rows=100 loops=1
               Worker 4:  actual time=0.140..31.942 rows=100 loops=1
               Worker 5:  actual time=0.062..33.673 rows=100 loops=1
               Worker 6:  actual time=0.062..33.512 rows=100 loops=1
 Planning Time: 0.113 ms
 Execution Time: 69.772 ms

PostgreSQL 14:

 Limit  (cost=0.00..797.75 rows=100 width=4) (actual time=30.602..38.459 rows=100 loops=1)
   Output: id2
   ->  Gather  (cost=0.00..79320.18 rows=9943 width=4) (actual time=30.592..37.669 rows=100 loops=1)
         Output: id2
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on public.t  (cost=0.00..79320.18 rows=1243 width=4) (actual time=0.221..5.181 rows=15 loops=7)
               Output: id2
               Filter: (t.id1 = 100)
               Rows Removed by Filter: 15241
               Worker 0:  actual time=0.129..4.840 rows=15 loops=1
               Worker 1:  actual time=0.125..4.924 rows=15 loops=1
               Worker 2:  actual time=0.314..5.249 rows=17 loops=1
               Worker 3:  actual time=0.252..5.341 rows=15 loops=1
               Worker 4:  actual time=0.163..5.179 rows=15 loops=1
               Worker 5:  actual time=0.422..5.248 rows=15 loops=1
               Worker 6:  actual time=0.139..5.489 rows=16 loops=1
 Planning Time: 0.084 ms
 Execution Time: 38.880 ms

I had a quick look at the code and I started wondering if we can't 
achieve the same performance improvement without batching by e.g.:


- Only set the latch if new data is written to an empty queue.
Otherwise, the leader should anyway keep trying to read from the queues
without waiting for the latch, so there is no need to set the latch again.


- Reorganize struct shm_mq. There seems to be false sharing happening 
between at least mq_ring_size and the atomics and potentially also 
between the atomics. I'm wondering if that's not the root cause of
the "slow atomics" observed in [1]? I'm happy to do some profiling.


Alternatively, we could always set the latch if numberTuples in 
ExecutePlan() is reasonably low. To do so, the DestReceiver's receive() 
method would only need an additional "force flush" argument.



A slightly different but related problem is when some workers have 
already produced enough rows to answer the LIMIT query, but other 
workers are still running without producing any new rows. In that case 
the "already done" workers will stop running even though they haven't 
reached 1/4th of the queue size, because the for-loop in execMain.c 
bails out in the following condition:


    if (numberTuples && numberTuples == current_tuple_count)
        break;

Subsequently, the leader will end the plan and then wait in the Gather 
node for all workers to shut down. However, workers still running but not
producing any new rows will never reach the following condition in 
execMain.c to check if they're supposed to stop (the shared memory queue 
dest receiver will return false on detached queues):


    /*
 * If we are 

Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Jakub Wartak
On Wed, Feb 1, 2023 at 2:14 PM Tomas Vondra
 wrote:

> > Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> > HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT to sync when
> > we are flushing just because of WAL thortting ?) Would that still be
> > safe?
>
> It's not clear to me how could this work and still be safe. I mean, we
> *must* flush the local WAL first, otherwise the replica could get ahead
> (if we send unflushed WAL to replica and then crash). Which would be
> really bad, obviously.

Well, it was just a thought: in this particular test - with no other
concurrent activity happening - we are fsync()ing uncommitted
Heap/INSERT data much earlier than the final Transaction/COMMIT WAL
record comes into play. I agree that some other concurrent backend's
COMMIT could fsync it, but I was wondering whether it would be a sensible
optimization to perform (so that issue_fsync() would be called only for
commit/rollback records). I can imagine a scenario with 10 such
concurrent backends running - all of them with this $thread-GUC set -
but that would cause 20k unnecessary fsyncs (?) -- (assuming a single
HDD with IOlat=20ms and a standby capable of sync-ack < 0.1ms, close to
400s would be wasted just on local fsyncs?). I don't have a strong
opinion or in-depth knowledge on this, but it smells like IO waste.

-J.




Re: Support for dumping extended statistics

2023-02-01 Thread Tomas Vondra



On 1/7/23 03:39, Bruce Momjian wrote:
> On Thu, Jan  5, 2023 at 06:29:03PM +, Hari krishna Maddileti wrote:
>> Hi Team,
>> In order to restore dumped extended statistics (stxdndistinct,
>> stxddependencies, stxdmcv) we need to provide input functions to parse
>> pg_distinct/pg_dependency/pg_mcv_list strings.
>>
>> Today we get the ERROR "cannot accept a value of type pg_ndistinct/
>> pg_dependencies/pg_mcv_list" when we try to do an insert of any type.
>>
>> Approch tried:
>>
>> - Using yacc grammar file (statistics_gram.y) to parse the input string to 
>> its
>> internal format for the types pg_distinct and pg_dependencies
>>
>> - We are just calling byteain() for serialized input text of type 
>> pg_mcv_list.
>>
>> Currently the changes are working locally,  I would like to push the commit
>> changes to upstream if there any usecase for postgres.   Would like to know 
>> if
>> there any interest from postgres side.
> 
> There is certainly interest in allowing the optimizer statistics to be
> dumped and reloaded.  This could be used by pg_restore and pg_upgrade.
> 

Indeed, although I think it'd be better to deal with regular statistics
(which is what 99% of systems use). Furthermore, we should probably
think about differences between major versions - until now we could
change the on-disk format of the statistics, because we have always reset
them. It'd be silly to do a dump on version X, and then fail to restore it
on (X+1) just because the statistics changed a bit.

So we need to be able to determine whether the statistics have the correct
format/version, or what. And we need to do that for pg_upgrade.

At the very least we need an option to skip restoring statistics, or
something like that.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovery modules

2023-02-01 Thread Andres Freund
Hi,

On 2023-01-31 15:30:13 -0800, Nathan Bossart wrote:


> +/*
> + * basic_archive_startup
> + *
> + * Creates the module's memory context.
> + */
> +void *
> +basic_archive_startup(void)
> +{
> +    return (void *) AllocSetContextCreate(TopMemoryContext,
> +                                           "basic_archive",
> +                                           ALLOCSET_DEFAULT_SIZES);
>  }

I'd make basic_archive's private data a struct, with a member for the
context, but it's not that important.

I'd also be inclined to do the same for the private_state you're passing
around for each module. Even if it's just to reduce the number of
functions accepting void * - losing compiler type checking isn't great.

So maybe an ArchiveModuleState { void *private_data } that's passed to
basic_archive_startup() and all the other callbacks.
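
Roughly, as a sketch (illustrative names and layout, not a settled API):

typedef struct ArchiveModuleState
{
    void       *private_data;   /* module-specific state, e.g. a MemoryContext */
} ArchiveModuleState;

/* callbacks would then take the typed state rather than a bare void * */
typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
typedef bool (*ArchiveFileCB) (ArchiveModuleState *state,
                               const char *file, const char *path);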

Greetings,

Andres Freund




Re: Show various offset arrays for heap WAL records

2023-02-01 Thread Robert Haas
On Tue, Jan 31, 2023 at 6:20 PM Peter Geoghegan  wrote:
> Actually the really wide output comes from COMMIT records. After I run
> the regression tests, and execute some of my own custom pg_walinspect
> queries, I see that some individual COMMIT records have a
> length(description) of over 10,000 bytes/characters. There is even one
> particular COMMIT record whose length(description) is about 46,000
> bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
> uncommon today. The worst case (or even particularly bad cases) won't
> be made any worse by this patch, because there are obviously limits on
> the width of the arrays that it outputs details descriptions of, that
> don't apply to these COMMIT records.

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Syncrep and improving latency due to WAL throttling

2023-02-01 Thread Tomas Vondra



On 2/1/23 11:04, Jakub Wartak wrote:
> On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
>  wrote:
> 
> Hi Bharath, thanks for reviewing.
> 
>> I think measuring the number of WAL flushes with and without this
>> feature that the postgres generates is great to know this feature
>> effects on IOPS. Probably it's even better with variations in
>> synchronous_commit_flush_wal_after or the throttling limits.
> 
> It's the same as point 19, so I covered it there.
> 
>> Although v2 is a WIP patch, I have some comments:
>> 1. Coding style doesn't confirm to the Postgres standard:
> 
> Fixed all thoses that you mentioned and I've removed elog() and code
> for timing. BTW: is there a way to pgindent only my changes (on git
> diff?)
> 
>> 2. It'd be better to add a TAP test hitting the WAL throttling.
> 
> TODO, any hints on how that test should look like are welcome
> (checking pg_stat_wal?) I've could spot only 1 place for it --
> src/test/recovery/007_sync_rep.pl
> 
>> 3. We generally don't need timings to be calculated explicitly when we
>> emit before and after log messages. If needed one can calculate the
>> wait time from timestamps of the log messages. If it still needs an
>> explicit time difference which I don't think is required, because we
>> have a special event and before/after log messages, I think it needs
>> to be under track_wal_io_timing to keep things simple.
> 
> Removed, it was just debugging aid to demonstrate the effect in the
> session waiting.
> 
>> 4. XLogInsertRecord() can be a hot path for high-write workload, so
>> effects of adding anything in it needs to be measured. So, it's better
>> to run benchmarks with this feature enabled and disabled.
> 
> When the GUC is off(0 / default), in my tests the impact is none (it's
> just set of simple IFs), however if the feature is enabled then the
> INSERT is slowed down (I've repeated the initial test from 1st post
> and single-statement INSERT for 50MB WAL result is the same 4-5s and
> ~23s +/- 1s when feature is activated when the RTT is 10.1ms, but
> that's intentional).
> 
>> 5. Missing documentation of this feature and the GUC. I think we can
>> describe extensively why we'd need a feature of this kind in the
>> documentation for better adoption or usability.
> 
> TODO, I'm planning on adding documentation when we'll be a little bit
> closer to adding to CF.
> 
>> 6. Shouldn't the max limit be MAX_KILOBYTES?
>> +_commit_flush_wal_after,
>> +0, 0, 1024L*1024L,
> 
> Fixed.
> 
>> 7. Can this GUC name be something like
>> synchronous_commit_wal_throttle_threshold to better reflect what it
>> does for a backend?
>> +{"synchronous_commit_flush_wal_after", PGC_USERSET,
>> REPLICATION_SENDING,
> 
> Done.
> 
>> 8. A typo - s/confiration/confirmation
> [..]
>> 9. This
>> "Sets the maximum logged WAL in kbytes, after which wait for sync
>> commit confiration even without commit "
>> better be something like below?
>> "Sets the maximum amount of WAL in kilobytes a backend generates after
>> which it waits for synchronous commit confiration even without commit
>> "
> 
> Fixed as you have suggested.
> 
>> 10. I think there's nothing wrong in adding some assertions in
>> HandleXLogDelayPending():
>> Assert(synchronous_commit_flush_wal_after > 0);
>> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
>> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);
> 
> Sure, added.
> 
>> 11. Specify the reason in the comments as to why we had to do the
>> following things:
>> Here:
>> +/* flush only up to the last fully filled page */
>> +XLogRecPtr LastFullyWrittenXLogPage = XactLastThrottledRecEnd
>> - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
>> Talk about how this avoids multiple-flushes for the same page.
>>
>> Here:
>> + * Called from ProcessMessageInterrupts() to avoid waiting while
>> being in critical section
>> + */
>> +void HandleXLogDelayPending()
>> Talk about how waiting in a critical section, that is in
>> XLogInsertRecord() causes locks to be held longer durations and other
>> effects.
> 
> Added.
> 
>> Here:
>> +/* WAL throttling */
>> +backendWalInserted += rechdr->xl_tot_len;
>> Be a bit more verbose about why we try to throttle WAL and why only
>> for sync replication, the advantages, effects etc.
> 
> Added.
> 
>> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
>> clutter server logs.
>> +/* XXX Debug for now */
>> +elog(WARNING, "throttling WAL down on this session
>> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
>> +backendWalInserted,
>> +LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
>> +LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>>
>> +elog(WARNING, "throttling WAL down on this session - end (%f
>> ms)", timediff);
> 
> OK, that's entirely removed.
> 
>> 13. BTW, we don't need to hold interrupts while waiting for sync
>> replication ack as it may block query cancels or proc die pendings.
>> You can 
