Re: multi-install PostgresNode fails with older postgres versions

2021-04-23 Thread Michael Paquier
On Fri, Apr 23, 2021 at 08:10:01AM -0400, Andrew Dunstan wrote:
> Yeah, I think it's ok for comparison purposes just to lump them all
> together. Here's a patch that does that and some consequent cleanup.
> Note we now cache the string rather than trying to reconstruct it.

No objections from here to build the version string beforehand.  

> +  (devel|(?:alpha|beta|rc)\d+)?   # dev marker - see version_stamp.pl
> +  !x);

I have been playing with the patch and version_stamp.pl, and that does the
job.  Nice.
--
Michael




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-23 Thread Justin Pryzby
On Sat, Apr 24, 2021 at 10:50:21AM +0900, Michael Paquier wrote:
> On Fri, Apr 23, 2021 at 01:23:26PM -0500, Justin Pryzby wrote:
> > The patch seems to apply cleanly on v12 but cherry-pick needs help for other
> > branches...
> 
> FWIW, this did not seem bad enough to me to require a back-patch.
> This parameter got introduced in 2016 and this was the only report
> related to it for the last 5 years.

No, it's not the first report - although I'm surprised I wasn't able to find
more than these.

https://www.postgresql.org/message-id/20190102164525.gu25...@telsasoft.com
https://www.postgresql.org/message-id/CAKJS1f_Qi0iboCos3wu6QiAbdF-9FoK57wxzKbe2-WcesN4rFA%40mail.gmail.com

-- 
Justin




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Peter Geoghegan
On Fri, Apr 23, 2021 at 7:53 PM Andres Freund  wrote:
> I mainly suggested it because to me the current logic seems hard to
> understand. I do think it'd be better to check more often. But checking
> depending on the amount of dead tuples at the right time doesn't strike
> me as a good idea - a lot of anti-wraparound vacuums will mainly be
> freezing tuples, rather than removing a lot of dead rows. Which makes it
> hard to understand when the failsafe kicks in.

I'm convinced -- decoupling the logic from the one-pass-versus-two-pass
distinction seems likely to be simpler and more useful, for both the
one-pass and the two-pass/has-indexes case.

-- 
Peter Geoghegan




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-23 Thread Amit Kapila
On Fri, Apr 23, 2021 at 6:45 PM Tom Lane  wrote:
>
> Greg Nancarrow  writes:
> > I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
> > that would "lock down the value" of the strict flag, wouldn't it?
>
> It does, but that's much more directly a property of the function's
> C code than parallel-safety is.
>

Isn't parallel safety also a property of the C code? I mean, unless
someone changes the built-in function's code, changing that property
would be dangerous. The other thing is that even if a user is allowed
to change one function's property, how will they know which other
functions are called by that function and whether they are
parallel-safe or not? For example, if the user wants to change the
parallel-safety property of a built-in function like
brin_summarize_new_values, then unless she also changes its code and
the functions it calls, like brin_summarize_range, it would be
dangerous. So isn't it better to disallow changing parallel safety for
built-in functions?

Also, if the strict property of built-in functions is fixed
internally, why do we allow users to change it, and is that of any
help?
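
For reference, the fmgr builtins table entry under discussion -- shown here
roughly, from memory, so treat the details as approximate -- records
strictness but has no parallel-safety field at all:

    /* roughly the shape of an entry in fmgrtab.h's builtin-function table */
    typedef struct
    {
        Oid         foid;       /* OID of the function */
        short       nargs;      /* 0..FUNC_MAX_ARGS, or -1 if variable count */
        bool        strict;     /* T if function is "strict" */
        bool        retset;     /* T if function returns a set */
        const char *funcName;   /* C name of the function */
        PGFunction  func;       /* pointer to compiled function */
    } FmgrBuiltin;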

-- 
With Regards,
Amit Kapila.




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 19:42:30 -0700, Peter Geoghegan wrote:
> On Fri, Apr 23, 2021 at 7:33 PM Andres Freund  wrote:
> > Check it every so often, independent of whether there are indexes or
> > dead tuples? Or just check it at the boundaries.
>
> I think that the former suggestion might be better -- I actually
> thought about doing it that way myself.

Cool.


> The latter suggestion sounds like you're suggesting that we just check
> it at the beginning and the end in all cases (we do the beginning in
> all cases already, but now we'd also do the end outside of the loop in
> all cases). Is that right?

Yes.


> If that is what you meant, then you should note that there'd hardly be
> any check in the one-pass case with that scheme (apart from the
> initial check that we do already). The only work we'd be skipping at
> the end (in the event of that check triggering the failsafe) would be
> heap truncation, which (as you've pointed out yourself) doesn't seem
> particularly likely to matter.

I mainly suggested it because to me the current logic seems hard to
understand. I do think it'd be better to check more often. But checking
depending on the amount of dead tuples at the right time doesn't strike
me as a good idea - a lot of anti-wraparound vacuums will mainly be
freezing tuples, rather than removing a lot of dead rows. Which makes it
hard to understand when the failsafe kicks in.

Greetings,

Andres Freund




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Peter Geoghegan
On Fri, Apr 23, 2021 at 7:33 PM Andres Freund  wrote:
> Check it every so often, independent of whether there are indexes or
> dead tuples? Or just check it at the boundaries.

I think that the former suggestion might be better -- I actually
thought about doing it that way myself.

The latter suggestion sounds like you're suggesting that we just check
it at the beginning and the end in all cases (we do the beginning in
all cases already, but now we'd also do the end outside of the loop in
all cases). Is that right? If that is what you meant, then you should
note that there'd hardly be any check in the one-pass case with that
scheme (apart from the initial check that we do already). The only
work we'd be skipping at the end (in the event of that check
triggering the failsafe) would be heap truncation, which (as you've
pointed out yourself) doesn't seem particularly likely to matter.

-- 
Peter Geoghegan




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 19:15:43 -0700, Peter Geoghegan wrote:
> > The failsafe mode affects the table scan itself by disabling cost
> > limiting. As far as I can see the ways it triggers for the table scan (vs
> > truncation or index processing) are:
> >
> > 1) Before vacuuming starts, for heap phases and indexes, if already
> >necessary at that point
> > 2) For a table with indexes, before/after each index vacuum, if now
> >necessary
> > 3) On a table without indexes, every 8GB, iff there are dead tuples, if now 
> > necessary
> >
> > Why would we want to trigger the failsafe mode during a scan of a table
> > with dead tuples and no indexes, but not on a table without dead tuples
> > or with indexes but fewer than m_w_m dead tuples? That makes little
> > sense to me.
> 
> What alternative does make sense to you?

Check it every so often, independent of whether there are indexes or
dead tuples? Or just check it at the boundaries.

I'd make it dependent on the number of pages scanned, rather than the
block distance to the last check - otherwise we might end up doing it
way too often when there's only a few individual pages not in the freeze
map.
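
Purely to illustrate that idea (this is a sketch, not the committed code;
FAILSAFE_EVERY_PAGES and the counter are invented names), a pages-scanned
trigger might look roughly like this in the main heap-scan loop:

    /* hypothetical: re-check the failsafe every FAILSAFE_EVERY_PAGES
     * *scanned* pages, regardless of indexes or dead tuples */
    if (++scanned_since_failsafe_check >= FAILSAFE_EVERY_PAGES)
    {
        lazy_check_wraparound_failsafe(vacrel);
        scanned_since_failsafe_check = 0;
    }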

Greetings,

Andres Freund




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Peter Geoghegan
On Fri, Apr 23, 2021 at 5:29 PM Andres Freund  wrote:
> On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> > The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> > related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> > how often we'll consider the failsafe in the single-pass/no-indexes
> > case.
>
> I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES
> and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?

VACUUM_FSM_EVERY_PAGES controls how often VACUUM does work that
usually takes place right after the two pass case finishes a round of
index and heap vacuuming. This is work that we certainly don't want to
do every time we process a single heap page in the one-pass/no-indexes
case. Initially this just meant FSM vacuuming, but it now includes a
failsafe check.
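
Roughly -- paraphrasing from memory, so the details may be off -- the
batching in the one-pass case looks like this in lazy_scan_heap(), with the
failsafe check being the newly added part:

    /* inside the one-pass branch, after a page's LP_DEAD items were vacuumed */
    if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
    {
        /* make newly freed space visible in upper FSM pages */
        FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
        next_fsm_block_to_vacuum = blkno;

        /* since the failsafe commit, also re-check the failsafe here */
        lazy_check_wraparound_failsafe(vacrel);
    }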

Of course all of the precise details here are fairly arbitrary
(including VACUUM_FSM_EVERY_PAGES, which has been around for a couple
of releases now). The overall goal that I had in mind was to make the
one-pass case's use of the failsafe have analogous behavior to the
two-pass/has-indexes case -- a goal which was itself somewhat
arbitrary.

> The failsafe mode affects the table scan itself by disabling cost
> limiting. As far as I can see the ways it triggers for the table scan (vs
> truncation or index processing) are:
>
> 1) Before vacuuming starts, for heap phases and indexes, if already
>necessary at that point
> 2) For a table with indexes, before/after each index vacuum, if now
>necessary
> 3) On a table without indexes, every 8GB, iff there are dead tuples, if now 
> necessary
>
> Why would we want to trigger the failsafe mode during a scan of a table
> with dead tuples and no indexes, but not on a table without dead tuples
> or with indexes but fewer than m_w_m dead tuples? That makes little
> sense to me.

What alternative does make sense to you?

It seemed important to put the failsafe check at points where we do
other analogous work in all cases. We made a pragmatic trade-off. In
theory almost any scheme might not check often enough, and/or might
check too frequently.

> It seems that for the no-index case the warning message is quite off?

I'll fix that up at some point soon. FWIW this happened because the
support for one-pass VACUUM was added quite late, at Robert's request.

Another issue with the failsafe commit is that we haven't considered
the autovacuum_multixact_freeze_max_age table reloption -- we only
check the GUC. That might have accidentally been the right thing to
do, though, since the reloption is interpreted as lower than the GUC
in all cases anyway -- arguably the
autovacuum_multixact_freeze_max_age GUC should be all we care about
anyway. I will need to think about this question some more, though.

> > > For 2), I don't really have a better idea than making that configurable
> > > somehow?
> >
> > That could make sense as a developer/testing option, I suppose. I just
> > doubt that it makes sense as anything else.
>
> Yea, I only was thinking of making it configurable to be able to test
> it. If we change the limit to something considerably lower I wouldn't
> see a need for that anymore.

It would probably be okay to just lower it significantly. Not sure if
that's the best approach, though. Will pick it up next week.

-- 
Peter Geoghegan




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-23 Thread Michael Paquier
On Fri, Apr 23, 2021 at 01:23:26PM -0500, Justin Pryzby wrote:
> The patch seems to apply cleanly on v12 but cherry-pick needs help for other
> branches...

FWIW, this did not seem bad enough to me to require a back-patch.
This parameter got introduced in 2016 and this was the only report
related to it for the last 5 years.
--
Michael




Re: [PATCH] Remove extraneous whitespace in tags: > foo< and >bar

2021-04-23 Thread Michael Paquier
On Fri, Apr 23, 2021 at 01:43:38PM -0500, Justin Pryzby wrote:
> More fixes like the one Peter committed as 9bd563aa9.
> I eyeballed the HTML to make sure this looks right.

-  Conflicting Lock Modes
+ Conflicting Lock Modes
That's a nice regex-fu here to detect cases like this one.

Thanks, applied.
--
Michael




Re: pg_upgrade can result in early wraparound on databases with high transaction load

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 19:28:27 -0500, Justin Pryzby wrote:
> This (combination of) thread(s) seems relevant.
> 
> Subject: pg_upgrade failing for 200+ million Large Objects
> https://www.postgresql.org/message-id/flat/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com
> https://www.postgresql.org/message-id/flat/a9f9376f1c3343a6bb319dce294e20ac%40EX13D05UWC001.ant.amazon.com
> https://www.postgresql.org/message-id/flat/cc089cc3-fc43-9904-fdba-d830d8222145%40enterprisedb.com#3eec85391c6076a4913e96a86fece75e

Huh. Thanks for digging these up.


> > Allows the user to provide a constant via pg_upgrade command-line, that
> > overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
> > (window of) Transaction IDs available for pg_upgrade to complete.

That seems entirely the wrong approach to me, buying further into
the broken idea of inventing random wrong values for oldestXid.

We drive important things like the emergency xid limits off oldestXid. On
databases with tables that are older than ~147 million xids (i.e. not even
affected by the default autovacuum_freeze_max_age) the current constant leads
to setting the oldestXid to a value *in the future*/wrapped around. Any
different constant (or pg_upgrade parameter) will do that too in other
scenarios.

As far as I can tell there is precisely *no* correct behaviour here other than
exactly copying the oldestXid limit from the source database.
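
A rough sketch of what I mean (the option name and variable are purely
hypothetical): pg_resetwal would take the value verbatim, and pg_upgrade
would pass along whatever the old cluster's control data says, instead of
deriving anything from -x:

    /* hypothetical pg_resetwal addition: set oldestXid explicitly */
    if (set_oldest_xid != 0)
    {
        ControlFile.checkPointCopy.oldestXid = set_oldest_xid;
        ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
    }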

Greetings,

Andres Freund




Re: Forgot some LSN_FORMAT_ARGS() in xlogreader.c

2021-04-23 Thread Michael Paquier
On Fri, Apr 23, 2021 at 02:18:10PM +0900, Kyotaro Horiguchi wrote:
> AFAICS it fixes the all remaining LSN parameters.

Thanks for double-checking.  I was not sure if I got all of them or
not.  Applied that now as of 4aba61b.
--
Michael




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> how often we'll consider the failsafe in the single-pass/no-indexes
> case.

I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES
and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?



> I see no reason why it cannot be changed now. VACUUM_FSM_EVERY_PAGES
> also frustrates FSM testing in the single-pass case in about the same
> way, so maybe that should be considered as well? Note that the FSM
> handling for the single pass case is actually a bit different to the
> two pass/has-indexes case, since the single pass case calls
> lazy_vacuum_heap_page() directly in its first and only pass over the
> heap (that's the whole point of having it of course).

I'm not opposed to lowering VACUUM_FSM_EVERY_PAGES (the costs don't seem
all that high compared to vacuuming?), but I don't think there's as
clear a need for testing around that as there is around wraparound.


The failsafe mode affects the table scan itself by disabling cost
limiting. As far as I can see the ways it triggers for the table scan (vs
truncation or index processing) are:

1) Before vacuuming starts, for heap phases and indexes, if already
   necessary at that point
2) For a table with indexes, before/after each index vacuum, if now
   necessary
3) On a table without indexes, every 8GB, iff there are dead tuples, if now 
necessary

Why would we want to trigger the failsafe mode during a scan of a table
with dead tuples and no indexes, but not on a table without dead tuples
or with indexes but fewer than m_w_m dead tuples? That makes little
sense to me.


It seems that for the no-index case the warning message is quite off?

ereport(WARNING,
        (errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a failsafe after %d index scans",

Doesn't exactly make one understand that vacuum cost limiting now is
disabled? And is confusing because there would never be index vacuuming?

And even in the cases indexes exist, it's odd to talk about abandoning
index vacuuming that hasn't even started yet?


> > For 2), I don't really have a better idea than making that configurable
> > somehow?
> 
> That could make sense as a developer/testing option, I suppose. I just
> doubt that it makes sense as anything else.

Yea, I only was thinking of making it configurable to be able to test
it. If we change the limit to something considerably lower I wouldn't
see a need for that anymore.

Greetings,

Andres Freund




Re: pg_upgrade can result in early wraparound on databases with high transaction load

2021-04-23 Thread Justin Pryzby
On Fri, Apr 23, 2021 at 04:42:56PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-06-15 11:37:59 -0700, Noah Misch wrote:
> > On Tue, May 21, 2019 at 03:23:00PM -0700, Peter Geoghegan wrote:
> > > On Mon, May 20, 2019 at 3:10 AM Jason Harvey  wrote:
> > > > This week I upgraded one of my large(2.8TB), high-volume databases from 
> > > > 9 to 11. The upgrade itself went fine. About two days later, we 
> > > > unexpectedly hit transaction ID wraparound. What was perplexing about 
> > > > this was that the age of our oldest `datfrozenxid` was only 1.2 billion 
> > > > - far away from where I'd expect a wraparound. Curiously, the 
> > > > wraparound error referred to a mysterious database of `OID 0`:
> > > >
> > > > UPDATE ERROR:  database is not accepting commands to avoid wraparound 
> > > > data loss in database with OID 0
> > 
> > That's bad.
> 
> Yea. The code triggering it in pg_resetwal is bogus as far as I can
> tell. That pg_upgrade triggers it makes this quite bad.
> 
> I just hit issues related to it when writing a wraparound handling
> test. Peter remembered this issue (how?)...
> 
> Especially before 13 (inserts triggering autovacuum) it is quite common
> to have tables that only ever get vacuumed due to anti-wraparound
> vacuums. And it's common for larger databases to increase
> autovacuum_freeze_max_age. Which makes it fairly likely for this to
> guess an oldestXid value that's *newer* than an accurate one. Since
> oldestXid is used in a few important-ish places (like triggering
> vacuums, and in 14 also some snapshot related logic) I think that's bad.
> 
> The relevant code:
> 
> if (set_xid != 0)
> {
>     ControlFile.checkPointCopy.nextXid =
>         FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
>                                          set_xid);
> 
>     /*
>      * For the moment, just set oldestXid to a value that will force
>      * immediate autovacuum-for-wraparound.  It's not clear whether adding
>      * user control of this is useful, so let's just do something that's
>      * reasonably safe.  The magic constant here corresponds to the
>      * maximum allowed value of autovacuum_freeze_max_age.
>      */
>     ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
>     if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
>         ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
>     ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
> }
> 
> Originally from:
> 
> commit 25ec228ef760eb91c094cc3b6dea7257cc22ffb5
> Author: Tom Lane 
> Date:   2009-08-31 02:23:23 +
> 
> Track the current XID wrap limit (or more accurately, the oldest unfrozen
> XID) in checkpoint records.  This eliminates the need to recompute the 
> value
> from scratch during database startup, which is one of the two remaining
> reasons for the flatfile code to exist.  It should also simplify life for
> hot-standby operation.
> 
> I think we should remove the oldestXid guessing logic, and expose it as
> an explicit option. I think it's important that pg_upgrade sets an
> accurate value. Probably not worth caring about oldestXidDB though?

This (combination of) thread(s) seems relevant.

Subject: pg_upgrade failing for 200+ million Large Objects
https://www.postgresql.org/message-id/flat/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com
https://www.postgresql.org/message-id/flat/a9f9376f1c3343a6bb319dce294e20ac%40EX13D05UWC001.ant.amazon.com
https://www.postgresql.org/message-id/flat/cc089cc3-fc43-9904-fdba-d830d8222145%40enterprisedb.com#3eec85391c6076a4913e96a86fece75e
> Allows the user to provide a constant via pg_upgrade command-line, that
>overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
>(window of) Transaction IDs available for pg_upgrade to complete.

-- 
Justin




Re: pg_upgrade can result in early wraparound on databases with high transaction load

2021-04-23 Thread Andres Freund
Hi,

On 2019-06-15 11:37:59 -0700, Noah Misch wrote:
> On Tue, May 21, 2019 at 03:23:00PM -0700, Peter Geoghegan wrote:
> > On Mon, May 20, 2019 at 3:10 AM Jason Harvey  wrote:
> > > This week I upgraded one of my large(2.8TB), high-volume databases from 9 
> > > to 11. The upgrade itself went fine. About two days later, we 
> > > unexpectedly hit transaction ID wraparound. What was perplexing about 
> > > this was that the age of our oldest `datfrozenxid` was only 1.2 billion - 
> > > far away from where I'd expect a wraparound. Curiously, the wraparound 
> > > error referred to a mysterious database of `OID 0`:
> > >
> > > UPDATE ERROR:  database is not accepting commands to avoid wraparound 
> > > data loss in database with OID 0
> 
> That's bad.

Yea. The code triggering it in pg_resetwal is bogus as far as I can
tell. That pg_upgrade triggers it makes this quite bad.

I just hit issues related to it when writing a wraparound handling
test. Peter remembered this issue (how?)...

Especially before 13 (inserts triggering autovacuum) it is quite common
to have tables that only ever get vacuumed due to anti-wraparound
vacuums. And it's common for larger databases to increase
autovacuum_freeze_max_age. Which makes it fairly likely for this to
guess an oldestXid value that's *newer* than an accurate one. Since
oldestXid is used in a few important-ish places (like triggering
vacuums, and in 14 also some snapshot related logic) I think that's bad.

The relevant code:

if (set_xid != 0)
{
    ControlFile.checkPointCopy.nextXid =
        FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
                                         set_xid);

    /*
     * For the moment, just set oldestXid to a value that will force
     * immediate autovacuum-for-wraparound.  It's not clear whether adding
     * user control of this is useful, so let's just do something that's
     * reasonably safe.  The magic constant here corresponds to the
     * maximum allowed value of autovacuum_freeze_max_age.
     */
    ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
    if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
        ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
    ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
}

Originally from:

commit 25ec228ef760eb91c094cc3b6dea7257cc22ffb5
Author: Tom Lane 
Date:   2009-08-31 02:23:23 +

Track the current XID wrap limit (or more accurately, the oldest unfrozen
XID) in checkpoint records.  This eliminates the need to recompute the value
from scratch during database startup, which is one of the two remaining
reasons for the flatfile code to exist.  It should also simplify life for
hot-standby operation.


I think we should remove the oldestXid guessing logic, and expose it as
an explicit option. I think it's important that pg_upgrade sets an
accurate value. Probably not worth caring about oldestXidDB though?

Greetings,

Andres Freund




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-23 Thread Alvaro Herrera
On 2021-Apr-23, Alvaro Herrera wrote:

> I think the patch I posted was too simple.  I think a real fix requires
> us to keep track of exactly in what way the partdesc is outdated, so
> that we can compare to the current situation in deciding to use that
> partdesc or build a new one.  For example, we could keep track of a list
> of OIDs of detached partitions (and it simplifies things a lot if allow
> only a single partition in this situation, because it's either zero OIDs
> or one OID); or we can save the Xmin of the pg_inherits tuple for the
> detached partition (and we can compare that Xmin to our current active
> snapshot and decide whether to use that partdesc or not).
> 
> I'll experiment with this a little more and propose a patch later today.

This (POC-quality) seems to do the trick.

(I restored the API of find_inheritance_children, because it was getting
a little obnoxious.  I haven't thought this through but I think we
should do something like it.)

> I don't think it's too much of a problem to state that you need to
> finalize one detach before you can do the next one.  After all, with
> regular detach, you'd have the tables exclusively locked so you can't do
> them in parallel anyway.  (It also increases the chances that people
> will finalize detach operations that went unnoticed.)

I haven't added a mechanism to verify this; but with asserts on, this
patch will crash if you have more than one.  I think the behavior is not
necessarily sane with asserts off, since you'll get an arbitrary
detach-Xmin assigned to the partdesc, depending on catalog scan order.
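
To spell out the xmin-comparison idea in its simplest form (this is not the
patch itself; the field and variable names are made up), deciding whether a
cached partdesc that was built omitting the detached partition is still
usable boils down to:

    if (TransactionIdIsValid(partdesc->detached_xmin))
    {
        Snapshot    snap = GetActiveSnapshot();

        if (XidInMVCCSnapshot(partdesc->detached_xmin, snap))
            rebuild = true;     /* detach still in progress for us: we must
                                 * see the partition as attached */
        else
            rebuild = false;    /* detach is visible to us: the cached
                                 * descriptor omitting the partition is right */
    }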

-- 
Álvaro Herrera   Valdivia, Chile
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 6447b52854..e3ee72c0b7 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -63,8 +63,16 @@ typedef struct SeenRelsEntry
  * committed to the active snapshot.
  */
 List *
-find_inheritance_children(Oid parentrelId, bool omit_detached,
-		  LOCKMODE lockmode, bool *detached_exist)
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+{
+	return find_inheritance_children_extended(parentrelId, true, lockmode,
+			  NULL, NULL);
+}
+
+List *
+find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
+   LOCKMODE lockmode, bool *detached_exist,
+   TransactionId *detached_xmin)
 {
 	List	   *list = NIL;
 	Relation	relation;
@@ -132,7 +140,15 @@ find_inheritance_children(Oid parentrelId, bool omit_detached,
 snap = GetActiveSnapshot();
 
 if (!XidInMVCCSnapshot(xmin, snap))
+{
+	/* XXX only one detached partition allowed */
+	if (detached_xmin)
+	{
+		Assert(*detached_xmin == InvalidTransactionId);
+		*detached_xmin = xmin;
+	}
 	continue;
+}
 			}
 		}
 
@@ -247,8 +263,7 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
 		ListCell   *lc;
 
 		/* Get the direct children of this rel */
-		currentchildren = find_inheritance_children(currentrel, true,
-	lockmode, NULL);
+		currentchildren = find_inheritance_children(currentrel, lockmode);
 
 		/*
 		 * Add to the queue only those children not already seen. This avoids
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7d00f4eb25..dd931d09af 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3507,7 +3507,7 @@ renameatt_internal(Oid myrelid,
 		 * expected_parents will only be 0 if we are not already recursing.
 		 */
 		if (expected_parents == 0 &&
-			find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
+			find_inheritance_children(myrelid, NoLock) != NIL)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 	 errmsg("inherited column \"%s\" must be renamed in child tables too",
@@ -3706,7 +3706,7 @@ rename_constraint_internal(Oid myrelid,
 		else
 		{
 			if (expected_parents == 0 &&
-find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
+find_inheritance_children(myrelid, NoLock) != NIL)
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 		 errmsg("inherited constraint \"%s\" must be renamed in child tables too",
@@ -6580,7 +6580,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 */
 	if (colDef->identity &&
 		recurse &&
-		find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
+		find_inheritance_children(myrelid, NoLock) != NIL)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
  errmsg("cannot recursively add identity column to table that has child tables")));
@@ -6826,7 +6826,7 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 * use find_all_inheritors to do it in one pass.
 	 */
 	children =
-		find_inheritance_children(RelationGetRelid(rel), true, lockmode, NULL);
+		

Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 18:08:12 -0500, Justin Pryzby wrote:
> On Fri, Apr 23, 2021 at 01:43:06PM -0700, Andres Freund wrote:
> > 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
> >failsafe mode, we can't really create 4GB relations on the BF. While
> >writing the tests I've lowered this to 4MB...
> 
> > For 2), I don't really have a better idea than making that configurable
> > somehow?
> 
> Does it work to shut down the cluster and create the .0,.1,.2,.3 segments of a
> new, empty relation with zero blocks using something like truncate -s 1G ?

I'd like this to be portable to at least Windows - I don't know how well
that deals with sparse files. But the bigger issue is that, IIRC, that
will trigger vacuum to try to initialize all those pages, which will
then force all that space to be allocated anyway...

Greetings,

Andres Freund




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Peter Geoghegan
On Fri, Apr 23, 2021 at 1:43 PM Andres Freund  wrote:
> I started to write a test for $Subject, which I think we sorely need.

+1

> Currently my approach is to:
> - start a cluster, create a few tables with test data
> - acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
>   autovacuum from doing anything
> - cause dead tuples to exist
> - restart
> - run pg_resetwal -x 227648
> - do things like acquiring pins on pages that block vacuum from progressing
> - commit prepared transaction
> - wait for template0, template1 datfrozenxid to increase
> - wait for relfrozenxid for most relations in postgres to increase
> - release buffer pin
> - wait for postgres datfrozenxid to increase

Just having a standard-ish way to do stress testing like this would
add something.

> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>failsafe mode, we can't really create 4GB relations on the BF. While
>writing the tests I've lowered this to 4MB...

The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
how often we'll consider the failsafe in the single-pass/no-indexes
case.

I see no reason why it cannot be changed now. VACUUM_FSM_EVERY_PAGES
also frustrates FSM testing in the single-pass case in about the same
way, so maybe that should be considered as well? Note that the FSM
handling for the single pass case is actually a bit different to the
two pass/has-indexes case, since the single pass case calls
lazy_vacuum_heap_page() directly in its first and only pass over the
heap (that's the whole point of having it of course).

> 3) pg_resetwal -x requires to carefully choose an xid: It needs to be the
>first xid on a clog page. It's not hard to determine which xids are but it
>depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
>value appropriate for 8KB, but ...

Ugh.

> For 2), I don't really have a better idea than making that configurable
> somehow?

That could make sense as a developer/testing option, I suppose. I just
doubt that it makes sense as anything else.

> 2021-04-23 13:32:30.899 PDT [2027738] LOG:  automatic aggressive vacuum to 
> prevent wraparound of table "postgres.public.small_trunc": index scans: 1
> pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
> tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, 
> oldest xmin: 227651
> buffer usage: 735 hits, 1262 misses, 874 dirtied
> index scan needed: 401 pages from table (1432.14% of total) had 14000 
> dead item identifiers removed
> index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 
> currently deleted, 0 reusable
> avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
> system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
> WAL usage: 1809 records, 474 full page images, 3977538 bytes
>
> '1432.14% of total' - looks like removed pages need to be added before the
> percentage calculation?

Clearly this needs to account for removed heap pages in order to
consistently express the percentage of pages with LP_DEAD items in
terms of a percentage of the original table size. I can fix this
shortly.
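
Working through the numbers from the log output quoted above makes the bug
plain:

    /*
     * pages: 400 removed, 28 remain  =>  428 pages before truncation
     *   401 / 28  = 14.3214...  ->  the reported "1432.14% of total"
     *   401 / 428 =  0.937      ->  ~93.7%, the intended figure
     */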

--
Peter Geoghegan




Re: Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Justin Pryzby
On Fri, Apr 23, 2021 at 01:43:06PM -0700, Andres Freund wrote:
> 2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
>failsafe mode, we can't really create 4GB relations on the BF. While
>writing the tests I've lowered this to 4MB...

> For 2), I don't really have a better idea than making that configurable
> somehow?

Does it work to shut down the cluster and create the .0,.1,.2,.3 segments of a
new, empty relation with zero blocks using something like truncate -s 1G ?

-- 
Justin




Re: PageGetItemIdCareful - should we MAXALIGN sizeof(BTPageOpaqueData)?

2021-04-23 Thread Peter Geoghegan
On Wed, Apr 21, 2021 at 10:10 PM Bharath Rupireddy
 wrote:
> In the PageGetItemIdCareful() introduced by commit a9ce839a, it seems
> like we are using the btree page special-space (pd_special) structure
> BTPageOpaqueData in the error check without MAXALIGNing it.
> if (ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) >
> BLCKSZ - sizeof(BTPageOpaqueData))
> ereport(ERROR,
>
> I'm not sure if it is intentional. ISTM that this was actually not a
> problem because the BTPageOpaqueData already has all-aligned(???)
> members (3 uint32, 2 uint16). But it might be a problem if we add
> unaligned bytes.

Fair point. I pushed a commit to fix this to HEAD just now. Thanks.
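
In other words, the check now treats the special space as MAXALIGN'd, along
these lines (a sketch, not a verbatim quote of the commit):

    if (ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) >
        BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)))
        ereport(ERROR, ...);    /* same error as before */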

> PageInit always max-aligns this structure when we
> initialize the btree page in _bt_pageinit, and in all other places we
> max-align it before use. Since this is an error-throwing path, I think
> we should max-align it just to be on the safer side. While on it, I
> think we can also replace BLCKSZ with PageGetPageSize(page).

I didn't replace BLCKSZ with PageGetPageSize() in the commit, though.
We definitely don't want to rely on that being sane in amcheck (this
is also why we don't use PageGetSpecialPointer(), which is the usual
approach).

Actually, even if this wasn't amcheck code I might make the same call.
I personally don't think that most existing calls to PageGetPageSize()
make very much sense.

> Attaching a small patch. Thoughts?

I'm curious: Was this just something that you noticed randomly, while
looking at the code? Or do you have a specific practical reason to
care about it? (I always like hearing about the ways in which people
use amcheck.)

--
Peter Geoghegan




Re: pg_amcheck contrib application

2021-04-23 Thread Mark Dilger



> On Apr 23, 2021, at 1:31 PM, Robert Haas  wrote:
> 
> Perhaps something like this, closer to the way you had it?
> 
>   expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
>   : extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);

It still suffers the same failures.  I'll try to post something that 
accomplishes the changes to the reports that you are looking for.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Testing autovacuum wraparound (including failsafe)

2021-04-23 Thread Andres Freund
Hi,

I started to write a test for $Subject, which I think we sorely need.

Currently my approach is to:
- start a cluster, create a few tables with test data
- acquire SHARE UPDATE EXCLUSIVE in a prepared transaction, to prevent
  autovacuum from doing anything
- cause dead tuples to exist
- restart
- run pg_resetwal -x 227648
- do things like acquiring pins on pages that block vacuum from progressing
- commit prepared transaction
- wait for template0, template1 datfrozenxid to increase
- wait for relfrozenxid for most relations in postgres to increase
- release buffer pin
- wait for postgres datfrozenxid to increase

So far so good. But I've encountered a few things that stand in the way of
enabling such a test by default:

1) During startup StartupSUBTRANS() zeroes out all pages between
   oldestActiveXID and nextXid. That takes 8s on my workstation, but only
   because I have plenty memory - pg_subtrans ends up 14GB as I currently do
   the test. Clearly not something we could do on the BF.

2) FAILSAFE_MIN_PAGES is 4GB - which seems to make it infeasible to test the
   failsafe mode, we can't really create 4GB relations on the BF. While
   writing the tests I've lowered this to 4MB...

3) pg_resetwal -x requires to carefully choose an xid: It needs to be the
   first xid on a clog page. It's not hard to determine which xids are but it
   depends on BLCKSZ and a few constants in clog.c. I've for now hardcoded a
   value appropriate for 8KB, but ...


I have 2 1/2 ideas about addressing 1);

- We could expose functionality to advance nextXid to a future value at
  runtime, without filling in clog/subtrans pages. Would probably have to live
  in varsup.c and be exposed via regress.so or such? (A rough sketch of this
  follows after this list.)

- The only reason StartupSUBTRANS() does that work is because of the prepared
  transaction holding back oldestActiveXID. That transaction in turn exists to
  prevent autovacuum from doing anything before we do test setup
  steps.

  Perhaps it'd be sufficient to set autovacuum_naptime really high initially,
  perform the test setup, set naptime to something lower, reload config. But
  I'm worried that might not be reliable: If something ends up allocating an
  xid we'd potentially reach the path in GetNewTransactionId() that wakes up the
  launcher?  But probably there wouldn't be anything doing so?

  Another aspect that might not make this a good choice is that it actually
  seems relevant to be able to test cases where there are very old still
  running transactions...

- As a variant of the previous idea: If that turns out to be unreliable, we
  could instead set nextxid, start in single user mode, create a blocking 2PC
  transaction, start normally. Because there's no old active xid we'd not run
  into the StartupSUBTRANS problem.
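
To make the first idea concrete, here is a rough sketch of the sort of
test-only function I have in mind -- every name here is invented, and it
would presumably live next to the nextXid machinery and be exposed through
regress.so:

    #include "postgres.h"

    #include "access/transam.h"
    #include "fmgr.h"
    #include "storage/lwlock.h"

    PG_FUNCTION_INFO_V1(test_advance_nextxid);

    /* Advance nextXid to "target" without zero-filling clog/subtrans pages. */
    Datum
    test_advance_nextxid(PG_FUNCTION_ARGS)
    {
        TransactionId target = PG_GETARG_TRANSACTIONID(0);

        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
        if (TransactionIdFollows(target,
                                 XidFromFullTransactionId(ShmemVariableCache->nextXid)))
            ShmemVariableCache->nextXid =
                FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ShmemVariableCache->nextXid),
                                                 target);
        LWLockRelease(XidGenLock);

        /*
         * A real version would need the same care as pg_resetwal -x about
         * landing on a clog/subtrans page boundary, or would have to extend
         * those SLRUs here.
         */
        PG_RETURN_VOID();
    }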


For 2), I don't really have a better idea than making that configurable
somehow?

3) is probably tolerable for now, we could skip the test if BLCKSZ isn't 8KB,
or we could hardcode the calculation for different block sizes.



I noticed one minor bug that's likely new:

2021-04-23 13:32:30.899 PDT [2027738] LOG:  automatic aggressive vacuum to 
prevent wraparound of table "postgres.public.small_trunc": index scans: 1
pages: 400 removed, 28 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 14000 removed, 1000 remain, 0 are dead but not yet removable, 
oldest xmin: 227651
buffer usage: 735 hits, 1262 misses, 874 dirtied
index scan needed: 401 pages from table (1432.14% of total) had 14000 
dead item identifiers removed
index "small_trunc_pkey": pages: 43 in total, 37 newly deleted, 37 
currently deleted, 0 reusable
avg read rate: 559.048 MB/s, avg write rate: 387.170 MB/s
system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
WAL usage: 1809 records, 474 full page images, 3977538 bytes

'1432.14% of total' - looks like removed pages need to be added before the
percentage calculation?

Greetings,

Andres Freund




Re: pg_amcheck contrib application

2021-04-23 Thread Robert Haas
On Fri, Apr 23, 2021 at 2:36 PM Mark Dilger
 wrote:
> > What's different?
>
> for one thing, if a sequence of chunks happens to fit perfectly, the final 
> chunk will have size TOAST_MAX_CHUNK_SIZE, but you're expecting no larger 
> than one less than that, given how modulo arithmetic works.

Good point.

Perhaps something like this, closer to the way you had it?

   expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
   : extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);
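
A quick worked example of that failure mode, using the chunk size from
Mark's machine:

    /*
     * TOAST_MAX_CHUNK_SIZE = 1996, extsize = 3992: exactly two full chunks
     * (chunk_seq 0 and 1, so last_chunk_seq = 1).
     *
     *   modulo form:      3992 % 1996        = 0      <- wrong for chunk 1
     *   subtraction form: 3992 - (1 * 1996)  = 1996   <- matches the real size
     */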

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: decoupling table and index vacuum

2021-04-23 Thread Peter Geoghegan
On Thu, Apr 22, 2021 at 1:01 PM Andres Freund  wrote:
> The gin case seems a bit easier than the partial index case. Keeping
> stats about the number of new entries in a GIN index doesn't seem too
> hard, nor does tracking the number of cleaned up index entries. But
> knowing which indexes are affected when a heap tuple becomes dead seems
> harder.  I guess we could just start doing a stats-only version of
> ExecInsertIndexTuples() for deletes, but obviously the cost of that is
> not enticing. Perhaps it'd not be too bad if we only did it when there's
> an index with predicates?

Though I agree that we need some handling here, I doubt that an index
with a predicate is truly a special case.

Suppose you have a partial index that covers 10% of the table. How is
that meaningfully different from an index without a predicate that is
otherwise equivalent? If the churn occurs in the same keyspace in
either case, and if that's the part of the keyspace that queries care
about, then ISTM that the practical difference is fairly
insignificant. (If you have some churn all over the standard index but
queries are only interested in the same 10% of the full keyspace then
this will be less true, but still roughly true.)

There is an understandable tendency to focus on the total size of the
index in each case, and to be alarmed that the partial index has (say)
doubled in size, while at the same time not being overly concerned
about lower *proportionate* growth for the standard index case
(assuming otherwise identical workload/conditions). The page splits
that affect the same 10% of the key space in each case will be
approximately as harmful in each case, though. We can expect the same
growth in leaf pages in each case, which will look very similar.

It should be obvious that it's somewhat of a problem that 90% of the
standard index is apparently not useful (this is an unrelated
problem). But if the DBA fixes this unrelated problem (by making the
standard index a partial index), surely it would be absurd to then
conclude that that helpful intervention somehow had the effect of
making the index bloat situation much worse!

I think that a simple heuristic could work very well here, but it
needs to be at least a little sensitive to the extremes. And I mean
all of the extremes, not just the one from my example -- every
variation exists and will cause problems if given zero weight.

-- 
Peter Geoghegan




Re: decoupling table and index vacuum

2021-04-23 Thread Peter Geoghegan
On Fri, Apr 23, 2021 at 8:44 AM Robert Haas  wrote:
> On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan  wrote:
> > Mostly what I'm saying is that I would like to put together a rough
> > list of things that we could do to improve VACUUM along the lines
> > we've discussed -- all of which stem from $SUBJECT. There are
> > literally dozens of goals (some of which are quite disparate) that we
> > could conceivably set out to pursue under the banner of $SUBJECT.
>
> I hope not. I don't have a clue why there would be dozens of possible
> goals here, or why it matters.

Not completely distinct goals, for the most part, but I can certainly
see dozens of benefits.

For example, if we know before index vacuuming starts that heap
vacuuming definitely won't go ahead (quite possible when we decide
that we're only vacuuming a subset of indexes), we can then tell the
index AM about that fact. It can then safely vacuum in an
"approximate" fashion, for example by skipping pages whose LSNs are
from before the last VACUUM, and by not bothering with a
super-exclusive lock in the case of nbtree.
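
Very roughly, and only as a sketch of the idea (nothing like this exists in
nbtree today; the names are invented), the per-page logic could become:

    /* inside a btvacuumpage()-style loop, when it is known up front that
     * heap vacuuming will be skipped for this VACUUM */
    if (PageGetLSN(page) < skip_pages_older_than_lsn)
    {
        /* page untouched since the previous VACUUM: it cannot contain
         * newly dead index tuples, so don't bother with it */
        _bt_relbuf(rel, buf);
        continue;
    }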

The risk of a conflict between this goal and another goal that we may
want to pursue (which might be a bit contrived) is that we fail to do
the right thing when a large range deletion has taken place, which
must be accounted for in the statistics, but creates a tension with the
global index stuff. It's probably only safe to do this when we know
that there have been hardly any DELETEs. There is also the question of
how the TID map thing interacts with the visibility map, and how that
affects how VACUUM behaves (both in general and in order to attain
some kind of specific new benefit from this synergy).

Who knows? We're never going to get on exactly the same page, but some
rough idea of which page each of us are on might save everybody time.

The stuff that I went into about making aborted transactions special
as a means of decoupling transaction status management from garbage
collection is arguably totally unrelated -- perhaps it's just too much
of a stretch to link that to what you want to do now. I suppose it's
hard to invest the time to engage with me on that stuff, and I
wouldn't be surprised if you never did so. If it doesn't happen it
would be understandable, though quite possibly a missed opportunity
for both of us. My basic intuition there is that it's another variety
of decoupling, so (for better or worse) it does *seem* related to me.
(I am an intuitive thinker, which has advantages and disadvantages.)

> I think if we're going to do something
> like $SUBJECT, we should just concentrate on the best way to make that
> particular change happen with minimal change to anything else.
> Otherwise, we risk conflating this engineering effort with others that
> really should be separate endeavors.

Of course it's true that that is a risk. That doesn't mean that the
opposite risk is not also a concern. I am concerned about both risks.
I'm not sure which risk I should be more concerned about.

I agree that we ought to focus on a select few goals as part of the
first round of work in this area (without necessarily doing all or
even most of them at the same time). It's not self-evident which goals
those should be at this point, though. You've said that you're
interested in global indexes. Okay, that's a start. I'll add the basic
idea of not doing index vacuuming for some indexes and not others to
the list -- this will necessitate that we teach index AMs to assess
how much bloat the index has accumulated since the last VACUUM, which
presumably must work in some generalized, composable way.

> For example, as far as possible,
> I think it would be best to try to do this without changing the
> statistics that are currently gathered, and just make the best
> decisions we can with the information we already have.

I have no idea if that's the right way to do it. In any case the
statistics that we gather influence the behavior of autovacuum.c, but
nothing stops us from doing our own dynamic gathering of statistics to
decide what we should do within vacuumlazy.c each time. We don't have
to change the basic triggering conditions to change the work each
VACUUM performs.

As I've said before, I think that we're likely to get more benefit (at
least at first) from making the actual reality of what VACUUM does
simpler and more predictable in practice than we are from changing how
reality is modeled inside autovacuum.c. I'll go further with that now:
if we do change that modelling at some point, I think that it should
work in an additive way, which can probably be compatible with how the
statistics and so on work already. For example, maybe vacuumlazy.c
asks autovacuum.c to do a VACUUM earlier next time. This can be
structured as an exception to the general rule of autovacuum
scheduling, probably -- something that occurs when it becomes evident
that the generic schedule isn't quite cutting it in some important,
specific way.

> Ideally, I'd
> like to 

[PATCH] Remove extraneous whitespace in tags: > foo< and >bar

2021-04-23 Thread Justin Pryzby
More fixes like the one Peter committed as 9bd563aa9.
I eyeballed the HTML to make sure this looks right.

>From a8b782cde7c5d6eef1e3876636feb652bc5f3711 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 22 Apr 2021 21:10:49 -0500
Subject: [PATCH] Remove extraneous whitespace in tags

git grep -E '<([^>]*)>[^<]* ' doc/src/sgml |grep -Evw 
'optional|xsl|lineannotation|entry|prompt|computeroutput'
---
 doc/src/sgml/maintenance.sgml   | 2 +-
 doc/src/sgml/mvcc.sgml  | 2 +-
 doc/src/sgml/pgcrypto.sgml  | 2 +-
 doc/src/sgml/ref/pg_rewind.sgml | 2 +-
 doc/src/sgml/runtime.sgml   | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4adb34a21b..ee6113926a 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -719,7 +719,7 @@ HINT:  Stop the postmaster and vacuum that database in 
single-user mode.
 PostgreSQL has an optional but highly
 recommended feature called autovacuum,
 whose purpose is to automate the execution of
-VACUUM and ANALYZE  commands.
+VACUUM and ANALYZE commands.
 When enabled, autovacuum checks for
 tables that have had a large number of inserted, updated or deleted
 tuples.  These checks use the statistics collection facility;
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index b46cba8158..6cb9c63161 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1074,7 +1074,7 @@ ERROR:  could not serialize access due to read/write 
dependencies among transact
 
 
 
-  Conflicting Lock Modes
+ Conflicting Lock Modes
  
   
   
diff --git a/doc/src/sgml/pgcrypto.sgml b/doc/src/sgml/pgcrypto.sgml
index b6bb23de0f..13770dfc6f 100644
--- a/doc/src/sgml/pgcrypto.sgml
+++ b/doc/src/sgml/pgcrypto.sgml
@@ -1410,7 +1410,7 @@ gen_random_uuid() returns uuid
   KAME kame/sys/crypto
  
  
-  SHA256/384/512 
+  SHA256/384/512
   Aaron D. Gifford
   OpenBSD sys/crypto
  
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 07aae75d8b..33e6bb64ad 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -25,7 +25,7 @@ PostgreSQL documentation
option

 
- -D 
+ -D
  --target-pgdata
 
  directory
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 001d195b8e..f1cbc1d9e9 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -2258,7 +2258,7 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433
The certificates of intermediate certificate authorities
can also be appended to the file.  Doing this avoids the necessity of
storing intermediate certificates on clients, assuming the root and
-   intermediate certificates were created with v3_ca 
+   intermediate certificates were created with v3_ca
extensions.  (This sets the certificate's basic constraint of
CA to true.)
This allows easier expiration of intermediate certificates.
-- 
2.17.0





Re: pg_amcheck contrib application

2021-04-23 Thread Mark Dilger



> On Apr 23, 2021, at 11:29 AM, Robert Haas  wrote:
> 
> +   expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
> +   : extsize % TOAST_MAX_CHUNK_SIZE;
> 
> What's different?

for one thing, if a sequence of chunks happens to fit perfectly, the final 
chunk will have size TOAST_MAX_CHUNK_SIZE, but you're expecting no larger than 
one less than that, given how modulo arithmetic works.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-23 Thread Mark Dilger



> On Apr 23, 2021, at 11:29 AM, Robert Haas  wrote:
> 
> On Fri, Apr 23, 2021 at 2:15 PM Mark Dilger
>  wrote:
>> Another difference I should probably mention is that a bunch of unrelated 
>> tests are failing with errors like:
>> 
>>toast value 13465 chunk 0 has size 1995, but expected size 1996
>> 
>> which leads me to suspect your changes to how the size is calculated.
> 
> That seems like a pretty reasonable suspicion, but I can't see the problem:
> 
> -   expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
> -   : VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) -
> (endchunk * TOAST_MAX_CHUNK_SIZE);
> +   expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
> +   : extsize % TOAST_MAX_CHUNK_SIZE;
> 
> What's different?
> 
> 1. The variables are renamed.
> 
> 2. It uses a new variable extsize instead of recomputing
> VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer), but I think that
> should have the same value.
> 
> 3. I used modulo arithmetic (%) instead of subtracting endchunk *
> TOAST_MAX_CHUNK_SIZE.
> 
> Is TOAST_MAX_CHUNK_SIZE 1996? How long a value did you insert?

On  my laptop, yes, 1996 is TOAST_MAX_CHUNK_SIZE.

I'm not inserting anything.  These failures come from just regular tests that I 
have not changed.  I just applied your patch and ran `make check-world` and 
these fail in src/bin/pg_amcheck 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-23 Thread Robert Haas
On Fri, Apr 23, 2021 at 2:15 PM Mark Dilger
 wrote:
> Another difference I should probably mention is that a bunch of unrelated 
> tests are failing with errors like:
>
> toast value 13465 chunk 0 has size 1995, but expected size 1996
>
> which leads me to suspect your changes to how the size is calculated.

That seems like a pretty reasonable suspicion, but I can't see the problem:

-   expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-   : VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) -
(endchunk * TOAST_MAX_CHUNK_SIZE);
+   expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+   : extsize % TOAST_MAX_CHUNK_SIZE;

What's different?

1. The variables are renamed.

2. It uses a new variable extsize instead of recomputing
VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer), but I think that
should have the same value.

3. I used modulo arithmetic (%) instead of subtracting endchunk *
TOAST_MAX_CHUNK_SIZE.

Is TOAST_MAX_CHUNK_SIZE 1996? How long a value did you insert?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-23 Thread Justin Pryzby
On Wed, Apr 14, 2021 at 03:57:21PM +0900, Michael Paquier wrote:
> On Tue, Apr 13, 2021 at 07:31:39AM -0500, Justin Pryzby wrote:
> > Good point.
> 
> Thanks.  I have used the wording that Tom has proposed upthread, added
> one GUC_NOT_IN_SAMPLE that you forgot, and applied the
> force_parallel_mode patch.

Thanks.  It just occurred to me to ask if we should backpatch it.
The goal is to avoid someone trying to use this as a performance option.

It's to their benefit and ours if they don't do that on v10-13 for the next 5
years, not just v14-17.

The patch seems to apply cleanly on v12 but cherry-pick needs help for other
branches...

-- 
Justin




Re: pg_amcheck contrib application

2021-04-23 Thread Mark Dilger



> On Apr 23, 2021, at 11:05 AM, Mark Dilger  
> wrote:
> 
> Here are the differences between master and your patch:

Another difference I should probably mention is that a bunch of unrelated tests 
are failing with errors like:

toast value 13465 chunk 0 has size 1995, but expected size 1996

which leads me to suspect your changes to how the size is calculated.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-23 Thread Robert Haas
On Fri, Apr 23, 2021 at 2:05 PM Mark Dilger
 wrote:
> Here are the differences between master and your patch:

Thanks. Those messages look reasonable to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-04-23 Thread Mark Dilger



> On Apr 23, 2021, at 10:31 AM, Mark Dilger  
> wrote:
> 
> I will test your patch and see what differs.

Here are the differences between master and your patch:

UPDATE $toastname SET chunk_seq = chunk_seq + 1000 WHERE chunk_id = 
$value_id_to_corrupt

-   qr/${header}toast value 16459 chunk 0 has sequence 
number 1000, but expected sequence number 0/,
-   qr/${header}toast value 16459 chunk 1 has sequence 
number 1001, but expected sequence number 1/,
-   qr/${header}toast value 16459 chunk 2 has sequence 
number 1002, but expected sequence number 2/,
-   qr/${header}toast value 16459 chunk 3 has sequence 
number 1003, but expected sequence number 3/,
-   qr/${header}toast value 16459 chunk 4 has sequence 
number 1004, but expected sequence number 4/,
-   qr/${header}toast value 16459 chunk 5 has sequence 
number 1005, but expected sequence number 5/;
+   qr/${header}toast value 16459 index scan returned chunk 1000 
when expecting chunk 0/,
+   qr/${header}toast value 16459 chunk 1000 follows last expected 
chunk 5/,
+   qr/${header}toast value 16459 chunk 1001 follows last expected 
chunk 5/,
+   qr/${header}toast value 16459 chunk 1002 follows last expected 
chunk 5/,
+   qr/${header}toast value 16459 chunk 1003 follows last expected 
chunk 5/,
+   qr/${header}toast value 16459 chunk 1004 follows last expected 
chunk 5/,
+   qr/${header}toast value 16459 chunk 1005 follows last expected 
chunk 5/;

UPDATE $toastname SET chunk_seq = chunk_seq * 1000 WHERE chunk_id = 
$value_id_to_corrupt

-   qr/${header}toast value $value_id_to_corrupt chunk 1 
has sequence number 1000, but expected sequence number 1/,
-   qr/${header}toast value $value_id_to_corrupt chunk 2 
has sequence number 2000, but expected sequence number 2/,
-   qr/${header}toast value $value_id_to_corrupt chunk 3 
has sequence number 3000, but expected sequence number 3/,
-   qr/${header}toast value $value_id_to_corrupt chunk 4 
has sequence number 4000, but expected sequence number 4/,
-   qr/${header}toast value $value_id_to_corrupt chunk 5 
has sequence number 5000, but expected sequence number 5/;
-
+   qr/${header}toast value 16460 index scan returned chunk 1000 
when expecting chunk 1/,
+   qr/${header}toast value 16460 chunk 1000 follows last expected 
chunk 5/,
+   qr/${header}toast value 16460 index scan returned chunk 2000 
when expecting chunk 1001/,
+   qr/${header}toast value 16460 chunk 2000 follows last expected 
chunk 5/,
+   qr/${header}toast value 16460 index scan returned chunk 3000 
when expecting chunk 2001/,
+   qr/${header}toast value 16460 chunk 3000 follows last expected 
chunk 5/,
+   qr/${header}toast value 16460 index scan returned chunk 4000 
when expecting chunk 3001/,
+   qr/${header}toast value 16460 chunk 4000 follows last expected 
chunk 5/,
+   qr/${header}toast value 16460 index scan returned chunk 5000 
when expecting chunk 4001/,
+   qr/${header}toast value 16460 chunk 5000 follows last expected 
chunk 5/;

INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
(SELECT chunk_id,
10*chunk_seq + 1000,
chunk_data
FROM $toastname
WHERE chunk_id = $value_id_to_corrupt)

-   qr/${header}toast value $value_id_to_corrupt chunk 6 
has sequence number 1000, but expected sequence number 6/,
-   qr/${header}toast value $value_id_to_corrupt chunk 7 
has sequence number 1010, but expected sequence number 7/,
-   qr/${header}toast value $value_id_to_corrupt chunk 8 
has sequence number 1020, but expected sequence number 8/,
-   qr/${header}toast value $value_id_to_corrupt chunk 9 
has sequence number 1030, but expected sequence number 9/,
-   qr/${header}toast value $value_id_to_corrupt chunk 10 
has sequence number 1040, but expected sequence number 10/,
-   qr/${header}toast value $value_id_to_corrupt chunk 11 
has sequence number 1050, but expected sequence number 11/,
-   qr/${header}toast value $value_id_to_corrupt was 
expected to end at chunk 6, but ended at chunk 12/;
+  qr/${header}toast value $value_id_to_corrupt index scan returned 
chunk 1000 when expecting chunk 6/,
+  qr/${header}toast value $value_id_to_corrupt chunk 1000 follows 
last expected chunk 5/,
+  qr/${header}toast value $value_id_to_corrupt index scan returned 
chunk 1010 when expecting chunk 1001/,
+  qr/${header}toast value $value_id_to_corrupt chunk 1010 follows 
last expected chunk 5/,
+  

Re: pgsql: autovacuum: handle analyze for partitioned tables

2021-04-23 Thread Justin Pryzby
On Thu, Apr 22, 2021 at 12:43:46PM -0500, Justin Pryzby wrote:
> Maybe the behavior should be documented, though.  Actually, I thought the
> pre-existing (non)behavior of autoanalyze would've been documented, and we'd
> now update that.  All I can find is this:
> 
> https://www.postgresql.org/docs/current/sql-analyze.html
> |The autovacuum daemon, however, will only consider inserts or updates on the
> |parent table itself when deciding whether to trigger an automatic analyze for
> |that table
> 
> I think that should probably have been written down somewhere other than for
> the manual ANALYZE command, but in any case it seems to be outdated now.

Starting with this 

>From a7ae56a879b6bacc4fc22cbd769851713be89840 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 23 Apr 2021 09:15:58 -0500
Subject: [PATCH] WIP: Add docs for autovacuum processing of partitioned tables

---
 doc/src/sgml/perform.sgml| 3 ++-
 doc/src/sgml/ref/analyze.sgml| 4 +++-
 doc/src/sgml/ref/pg_restore.sgml | 6 --
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 89ff58338e..814c3cffbe 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1767,7 +1767,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND 
somethingelse;

 Whenever you have significantly altered the distribution of data
 within a table, running ANALYZE is strongly 
recommended. This
-includes bulk loading large amounts of data into the table.  Running
+includes bulk loading large amounts of data into the table,
+or attaching/detaching partitions.  Running
 ANALYZE (or VACUUM ANALYZE)
 ensures that the planner has up-to-date statistics about the
 table.  With no statistics or obsolete statistics, the planner might
diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index c8fcebc161..179ae3555d 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -255,11 +255,13 @@ ANALYZE [ VERBOSE ] [ table_and_columnsANALYZE manually.
+For partitioned tables, inserts and updates on the partitions are counted
+towards auto-analyze on the parent.
   
 
   
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 93ea937ac8..260bf0feb7 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -922,8 +922,10 @@ CREATE DATABASE foo WITH TEMPLATE template0;
 
   
Once restored, it is wise to run ANALYZE on each
-   restored table so the optimizer has useful statistics; see
-and
+   restored table so the optimizer has useful statistics.
+   If the table is a partition or an inheritance child, it may also be useful
+   to analyze the parent table.
+   See  and
 for more information.
   
 
-- 
2.17.0





Re: pg_amcheck contrib application

2021-04-23 Thread Mark Dilger



> On Apr 23, 2021, at 10:28 AM, Robert Haas  wrote:
> 
> On Thu, Apr 22, 2021 at 7:28 PM Mark Dilger
>  wrote:
>> I have refactored the patch to address your other concerns.  Breaking the 
>> patch into multiple pieces didn't add any clarity, but refactoring portions 
>> of it made things simpler to read, I think, so here it is as one patch file.
> 
> I was hoping that this version was going to be smaller than the last
> version, but instead it went from 300+ lines to 500+ lines.
> 
> The main thing I'm unhappy about in the status quo is the use of
> chunkno in error messages. I have suggested several times making that
> concept go away, because I think users will be confused. Here's a
> minimal patch that does just that. It's 32 lines and results in a net
> removal of 4 lines. It differs somewhat from my earlier suggestions,
> because my priority here is to get reasonably understandable output
> without needing a ton of code, and as I was working on this I found
> that some of my earlier suggestions would have needed more code to
> implement and I didn't think it bought enough to be worth it. It's
> possible this is too simple, or that it's buggy, so let me know what
> you think. But basically, I think what got committed before is
> actually mostly fine and doesn't need major revision. It just needs
> tidying up to avoid the confusing chunkno concept.
> 
> Now, the other thing we've talked about is adding a few more checks,
> to verify for example that the toastrelid is what we expect, and I see
> in your v22 you thought of a few other things. I think we can consider
> those, possibly as things where we consider it tidying up loose ends
> for v14, or else as improvements for v15. But I don't think that the
> fairly large size of your patch comes primarily from additional
> checks. I think it mostly comes from the code to produce error reports
> getting a lot more complicated. I apologize if my comments have driven
> that complexity, but they weren't intended to.
> 
> One tiny problem with the attached patch is that it does not make any
> regression tests fail, which also makes it hard for me to tell if it
> breaks anything, or if the existing code works. I don't know how
> practical it is to do anything about that. Do you have a patch handy
> that allows manual updates and deletes on TOAST tables, for manual
> testing purposes?

Yes, I haven't been posting that with the patch, but I will test your patch and 
see what differs.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-04-23 Thread Robert Haas
On Thu, Apr 22, 2021 at 7:28 PM Mark Dilger
 wrote:
> I have refactored the patch to address your other concerns.  Breaking the 
> patch into multiple pieces didn't add any clarity, but refactoring portions 
> of it made things simpler to read, I think, so here it is as one patch file.

I was hoping that this version was going to be smaller than the last
version, but instead it went from 300+ lines to 500+ lines.

The main thing I'm unhappy about in the status quo is the use of
chunkno in error messages. I have suggested several times making that
concept go away, because I think users will be confused. Here's a
minimal patch that does just that. It's 32 lines and results in a net
removal of 4 lines. It differs somewhat from my earlier suggestions,
because my priority here is to get reasonably understandable output
without needing a ton of code, and as I was working on this I found
that some of my earlier suggestions would have needed more code to
implement and I didn't think it bought enough to be worth it. It's
possible this is too simple, or that it's buggy, so let me know what
you think. But basically, I think what got committed before is
actually mostly fine and doesn't need major revision. It just needs
tidying up to avoid the confusing chunkno concept.

Now, the other thing we've talked about is adding a few more checks,
to verify for example that the toastrelid is what we expect, and I see
in your v22 you thought of a few other things. I think we can consider
those, possibly as things where we consider it tidying up loose ends
for v14, or else as improvements for v15. But I don't think that the
fairly large size of your patch comes primarily from additional
checks. I think it mostly comes from the code to produce error reports
getting a lot more complicated. I apologize if my comments have driven
that complexity, but they weren't intended to.

One tiny problem with the attached patch is that it does not make any
regression tests fail, which also makes it hard for me to tell if it
breaks anything, or if the existing code works. I don't know how
practical it is to do anything about that. Do you have a patch handy
that allows manual updates and deletes on TOAST tables, for manual
testing purposes?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


simply-remove-chunkno-concept.patch
Description: Binary data


Re: A test for replay of regression tests

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 13:13:15 -0400, Tom Lane wrote:
> contrib/bloom is indeed the only(?) case within contrib, but in my mind
> that's a proxy for what people will be needing to test out-of-core AMs.
> It might not be a test to run by default, but there needs to be a way
> to do it.

Hm. My experience in the past was that the best way to test an external
AM is to run the core regression tests with a different
default_table_access_method. That does require some work of ensuring the
AM is installed and the relevant extension created, which in turn
requires a different test schedule, or hacking template1.  So likely a
different test script anyway?


> Also, I suspect that there are bits of GIST/GIN/SPGIST that are not
> well exercised if you don't run the contrib modules that add opclasses
> for those.

Possible - still think it'd be best to get the minimal thing in asap,
and then try to extend further afterwards... Perfect being the enemy of
good and all that.

Greetings,

Andres Freund




Re: A test for replay of regression tests

2021-04-23 Thread Tom Lane
Andres Freund  writes:
> On 2021-04-23 11:53:35 -0400, Tom Lane wrote:
>> Yeah.  I found out earlier that wal_consistency_checking=all is an
>> absolute PIG.  Running the regression tests that way requires tens of
>> gigabytes of disk space, and a significant amount of time if your
>> disk is not very speedy.  If we put this into the buildfarm at all,
>> it would have to be opt-in, not run-by-default, because a lot of BF
>> animals simply don't have the horsepower.  I think I'd vote against
>> adding it to check-world, too; the cost/benefit ratio is not good
>> unless you are specifically working on replay functions.

> I think it'd be a huge improvement to test recovery during check-world
> by default - it's a huge swath of crucial code that practically has no
> test coverage.  I agree that testing by default with
> wal_consistency_checking=all isn't feasible due to the time & space
> overhead, so I think we should not do that.

I was mainly objecting to enabling wal_consistency_checking by default.
I agree it's bad that we have so little routine test coverage on WAL
replay code.

>> The two things I'd say about this are (1) Whether to use
>> wal_consistency_checking, and with what value, needs to be
>> easily adjustable. (2) People will want to run other test suites
>> than the core regression tests, eg contrib modules.

> I don't think there's actually that much need to test contrib modules
> through recovery - most of them don't seem like they'd add meaningful
> coverage? The exception is contrib/bloom, but perhaps that'd be better
> tackled with a dedicated test?

contrib/bloom is indeed the only(?) case within contrib, but in my mind
that's a proxy for what people will be needing to test out-of-core AMs.
It might not be a test to run by default, but there needs to be a way
to do it.

Also, I suspect that there are bits of GIST/GIN/SPGIST that are not
well exercised if you don't run the contrib modules that add opclasses
for those.

regards, tom lane




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-23 Thread Alvaro Herrera
On 2021-Apr-23, Amit Langote wrote:

> Back in the 1st session:
> 
> end;
> insert into fk select generate_series(1, 1);
> INSERT 0 1
> Time: 47400.792 ms (00:47.401)

I guess I was wrong about that ... the example I tried didn't have 1000s
of partitions, and I used debug print-outs to show when a new partdesc
was being rebuilt, and only six were occurring.  I'm not sure why my
case behaves so differently from yours, but clearly from the timing this
is not good.

> I am afraid we may have to fix this in the code after all, because
> there does not seem a good way to explain this away in the
> documentation.  

Yeah, looking at this case, I agree that it needs a fix of some kind.

> If I read correctly, you did try an approach of caching the
> PartitionDesc that we currently don't, no?

I think the patch I posted was too simple.  I think a real fix requires
us to keep track of exactly in what way the partdesc is outdated, so
that we can compare to the current situation in deciding to use that
partdesc or build a new one.  For example, we could keep track of a list
of OIDs of detached partitions (and it simplifies things a lot if we allow
only a single partition in this situation, because it's either zero OIDs
or one OID); or we can save the Xmin of the pg_inherits tuple for the
detached partition (and we can compare that Xmin to our current active
snapshot and decide whether to use that partdesc or not).

I'll experiment with this a little more and propose a patch later today.

I don't think it's too much of a problem to state that you need to
finalize one detach before you can do the next one.  After all, with
regular detach, you'd have the tables exclusively locked so you can't do
them in parallel anyway.  (It also increases the chances that people
will finalize detach operations that went unnoticed.)
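
To make the Xmin idea concrete, a rough sketch follows; the field names rd_partdesc_nodetached and detach_xmin and the helper build_partdesc_from_catalogs() are invented for illustration, and it assumes a visibility test such as XidInMVCCSnapshot() can be applied to the saved pg_inherits Xmin (the aborted-transaction case is ignored for brevity):

/* Sketch only, not an actual patch. */
PartitionDesc
partdesc_for_snapshot(Relation rel)
{
    PartitionDesc cached = rel->rd_partdesc_nodetached;    /* invented field */

    /*
     * If the transaction that detached the partition is no longer in
     * progress according to our active snapshot, the cached descriptor
     * that omits the detached partition is the right one to use.
     */
    if (cached != NULL &&
        !XidInMVCCSnapshot(cached->detach_xmin, GetActiveSnapshot()))
        return cached;

    /* otherwise build a fresh descriptor from the catalogs, as today */
    return build_partdesc_from_catalogs(rel);              /* invented helper */
}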

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: A test for replay of regression tests

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 11:53:35 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Hm. I wonder if running with wal_consistency_checking=all doesn't also
> > reduce coverage of some aspects of recovery, by increasing record sizes
> > etc.
> 
> Yeah.  I found out earlier that wal_consistency_checking=all is an
> absolute PIG.  Running the regression tests that way requires tens of
> gigabytes of disk space, and a significant amount of time if your
> disk is not very speedy.  If we put this into the buildfarm at all,
> it would have to be opt-in, not run-by-default, because a lot of BF
> animals simply don't have the horsepower.  I think I'd vote against
> adding it to check-world, too; the cost/benefit ratio is not good
> unless you are specifically working on replay functions.

I think it'd be a huge improvement to test recovery during check-world
by default - it's a huge swath of crucial code that practically has no
test coverage.  I agree that testing by default with
wal_consistency_checking=all isn't feasible due to the time & space
overhead, so I think we should not do that.


> The two things I'd say about this are (1) Whether to use
> wal_consistency_checking, and with what value, needs to be
> easily adjustable. (2) People will want to run other test suites
> than the core regression tests, eg contrib modules.

I'm not really excited about tackling 2) initially. I think it's a
significant issue that we don't have test coverage for most redo
routines and that we should change that with some urgency - even if we
back out these WAL prefetch related changes, there've been sufficiently
many changes affecting WAL that I think it's worth doing the minimal
thing soon.

I don't think there's actually that much need to test contrib modules
through recovery - most of them don't seem like they'd add meaningful
coverage? The exception is contrib/bloom, but perhaps that'd be better
tackled with a dedicated test?

Greetings,

Andres Freund




Re: TRUNCATE on foreign table

2021-04-23 Thread Fujii Masao



On 2021/04/23 19:56, Bharath Rupireddy wrote:

On Fri, Apr 23, 2021 at 1:39 PM Fujii Masao  wrote:

+
+ Note that information about ONLY options specified
+ in the original TRUNCATE command is not passed to

I think it is not "information about", no? We just don't pass ONLY
option  instead we skip it. IMO, we can say "Note that the ONLY option
specified with a foreign table in the original TRUNCATE command is
skipped and not passed to ExecForeignTruncate."


Probably I still fail to understand your point.
But if "information about" is confusing, I'm ok to
remove that. Fixed.


A small typo in the docs patch: It is "are not passed to", instead of
"is not passed to" since we used plural "options". "Note that the ONLY
options specified in the original TRUNCATE command are not passed to"

+ Note that the ONLY options specified
   in the original TRUNCATE command is not passed to


Thanks for the review! I fixed this.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index e08441ec8b..8aa7edfe4a 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1076,27 +1076,25 @@ ExecForeignTruncate(List *rels,
 bool restart_seqs);
 
 
- Truncate a set of foreign tables specified in rels.
- This function is called when  is executed
- on foreign tables.  rels is the list of
- Relation data structure that indicates
- a foreign table to truncate.
+ Truncate foreign tables.  This function is called when
+  is executed on a foreign table.
+ rels is a list of Relation
+ data structures of foreign tables to truncate.
 
 
 
- behavior defines how foreign tables should
- be truncated, using as possible values DROP_RESTRICT,
- which means that RESTRICT option is specified,
- and DROP_CASCADE, which means that
- CASCADE option is specified, in
- TRUNCATE command.
+ behavior is either DROP_RESTRICT
+ or DROP_CASCADE indicating that the
+ RESTRICT or CASCADE option was
+ requested in the original TRUNCATE command,
+ respectively.
 
 
 
- restart_seqs is set to true
- if RESTART IDENTITY option is specified in
- TRUNCATE command.  It is false
- if CONTINUE IDENTITY option is specified.
+ If restart_seqs is true,
+ the original TRUNCATE command requested the
+ RESTART IDENTITY behavior, otherwise the
+ CONTINUE IDENTITY behavior was requested.
 
 
 
@@ -1109,11 +1107,10 @@ ExecForeignTruncate(List *rels,
 
 
 
- TRUNCATE invokes
- ExecForeignTruncate once per foreign server
- that foreign tables to truncate belong to.  This means that all foreign
- tables included in rels must belong to the same
- server.
+ ExecForeignTruncate is invoked once per
+ foreign server for which foreign tables are to be truncated.
+ This means that all foreign tables included in rels
+ must belong to the same server.
 
 
 
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index b0806c1274..839126c4ef 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -459,11 +459,17 @@ OPTIONS (ADD password_required 'false');
  
   
This option controls whether postgres_fdw allows
-   foreign tables to be truncated using TRUNCATE
+   foreign tables to be truncated using the TRUNCATE
command. It can be specified for a foreign table or a foreign server.
A table-level option overrides a server-level option.
The default is true.
   
+
+  
+   Of course, if the remote table is not in fact truncatable, an error
+   would occur anyway.  Use of this option primarily allows the error to
+   be thrown locally without querying the remote server.
+  
  
 

diff --git a/doc/src/sgml/ref/truncate.sgml b/doc/src/sgml/ref/truncate.sgml
index acf3633be4..9af42dd008 100644
--- a/doc/src/sgml/ref/truncate.sgml
+++ b/doc/src/sgml/ref/truncate.sgml
@@ -173,7 +173,7 @@ TRUNCATE [ TABLE ] [ ONLY ] name [
 
   
TRUNCATE can be used for foreign tables if
-   the foreign data wrapper supports, for instance,
+   supported by the foreign data wrapper,
see .
   
  
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index bdc4c3620d..7a798530e3 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -2179,24 +2179,19 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List 
**retrieved_attrs)
 void
 deparseTruncateSql(StringInfo buf,
   List *rels,
-  List *rels_extra,
   DropBehavior behavior,
   bool restart_seqs)
 {
-   ListCell   *lc1,
-  *lc2;
+   

Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-23 Thread Tom Lane
Robert Haas  writes:
> On Fri, Apr 23, 2021 at 9:15 AM Tom Lane  wrote:
>> Greg Nancarrow  writes:
>>> I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
>>> that would "lock down the value" of the strict flag, wouldn't it?

>> It does, but that's much more directly a property of the function's
>> C code than parallel-safety is.

> I'm not sure I agree with that, but I think having the "strict" flag
> in FmgrBuiltin isn't that nice either.

Yeah, if we could readily do without it, we probably would.  But the
function call mechanism itself is responsible for implementing strictness,
so it *has* to have that flag available.
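
Concretely, "the function call mechanism implements strictness" means the call sites skip invoking a strict function altogether when any argument is NULL, roughly along these lines (a schematic fragment, not the actual executor code):

Datum       result;

if (fcinfo->flinfo->fn_strict)
{
    /* strict function: any NULL argument yields a NULL result, no call */
    for (int i = 0; i < fcinfo->nargs; i++)
    {
        if (fcinfo->args[i].isnull)
        {
            fcinfo->isnull = true;
            return (Datum) 0;
        }
    }
}
result = FunctionCallInvoke(fcinfo);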

regards, tom lane




Re: A test for replay of regression tests

2021-04-23 Thread Tom Lane
Andres Freund  writes:
> On 2021-04-23 17:37:48 +1200, Thomas Munro wrote:
>> We have automated tests for many specific replication and recovery
>> scenarios, but nothing that tests replay of a wide range of records.

> Yay.

+1

>> Add a new TAP test under src/test/recovery that runs the regression
>> tests with wal_consistency_checking=all.

> Hm. I wonder if running with wal_consistency_checking=all doesn't also
> reduce coverage of some aspects of recovery, by increasing record sizes
> etc.

Yeah.  I found out earlier that wal_consistency_checking=all is an
absolute PIG.  Running the regression tests that way requires tens of
gigabytes of disk space, and a significant amount of time if your
disk is not very speedy.  If we put this into the buildfarm at all,
it would have to be opt-in, not run-by-default, because a lot of BF
animals simply don't have the horsepower.  I think I'd vote against
adding it to check-world, too; the cost/benefit ratio is not good
unless you are specifically working on replay functions.

And as you say, it alters the behavior enough to make it a little
questionable whether we're exercising the "normal" code paths.

The two things I'd say about this are (1) Whether to use
wal_consistency_checking, and with what value, needs to be
easily adjustable. (2) People will want to run other test suites
than the core regression tests, eg contrib modules.

regards, tom lane




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-23 Thread Robert Haas
On Fri, Apr 23, 2021 at 9:15 AM Tom Lane  wrote:
> Greg Nancarrow  writes:
> > I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
> > that would "lock down the value" of the strict flag, wouldn't it?
>
> It does, but that's much more directly a property of the function's
> C code than parallel-safety is.

I'm not sure I agree with that, but I think having the "strict" flag
in FmgrBuiltin isn't that nice either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: decoupling table and index vacuum

2021-04-23 Thread Robert Haas
On Thu, Apr 22, 2021 at 4:52 PM Peter Geoghegan  wrote:
> Mostly what I'm saying is that I would like to put together a rough
> list of things that we could do to improve VACUUM along the lines
> we've discussed -- all of which stem from $SUBJECT. There are
> literally dozens of goals (some of which are quite disparate) that we
> could conceivably set out to pursue under the banner of $SUBJECT.

I hope not. I don't have a clue why there would be dozens of possible
goals here, or why it matters. I think if we're going to do something
like $SUBJECT, we should just concentrate on the best way to make that
particular change happen with minimal change to anything else.
Otherwise, we risk conflating this engineering effort with others that
really should be separate endeavors. For example, as far as possible,
I think it would be best to try to do this without changing the
statistics that are currently gathered, and just make the best
decisions we can with the information we already have. Ideally, I'd
like to avoid introducing a new kind of relation fork that uses a
different on-disk storage format (e.g. 16MB segments that are dropped
from the tail) rather than the one used by the other forks, but I'm
not sure we can get away with that, because conveyor-belt storage
looks pretty appealing here. Regardless, the more we have to change to
accomplish the immediate goal, the more likely we are to introduce
instability into places where it could have been avoided, or to get
tangled up in endless bikeshedding.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Parallel INSERT SELECT take 2

2021-04-23 Thread Bharath Rupireddy
On Thu, Apr 22, 2021 at 4:51 PM houzj.f...@fujitsu.com
 wrote:
>
> > BACKGROUND
> > 
> >
> > We want to realize parallel INSERT SELECT in the following steps:
> > 1) INSERT + parallel SELECT
> > 2) Parallel INSERT + parallel SELECT
> >
> > Below are example use cases.  We don't expect high concurrency or an empty
> > data source.
> > * Data loading (ETL or ELT) into an analytics database, typically a data 
> > ware
> > house.
> > * Batch processing in an OLTP database.
> > 2) Enabling users to declare that the table allows parallel data 
> > modification Add
> > a table property that represents parallel safety of the table for DML 
> > statement
> > execution.  Users specify it as follows:
> >
> > CREATE TABLE table_name (...) PARALLEL { UNSAFE | RESTRICTED | SAFE };
> > ALTER TABLE table_name PARALLEL { UNSAFE | RESTRICTED | SAFE };
> >
> > This property is recorded in pg_class's relparallel column as 'u', 'r', or 
> > 's', just
> > like pg_proc's proparallel.  The default is UNSAFE.
> >
> > The planner assumes that all of the table, its descendant partitions, and 
> > their
> > ancillary objects have the specified parallel safety or safer one.  The 
> > user is
> > responsible for its correctness.  If the parallel processes find an object 
> > that is
> > less safer than the assumed parallel safety during statement execution, it
> > throws an ERROR and abort the statement execution.
> >
> > When the parallel safety of some of these objects is changed, it's costly to
> > reflect it on the parallel safety of tables that depend on them.  So, we 
> > don't do
> > it.  Instead, we provide a utility function 
> > pg_get_parallel_safety('table_name')
> > that returns records of (objid, classid, parallel_safety) that represent the
> > parallel safety of objects that determine the parallel safety of the 
> > specified
> > table.  The function only outputs objects that are not parallel safe.  
> > Otherwise,
> > it will consume excessive memory while accumulating the output.  The user
> > can use this function to identify problematic objects when a parallel DML 
> > fails
> > or is not parallelized in an expected manner.
> >
> > How does the executor detect parallel unsafe objects?  There are two ways:
> >
> > 1) At loading time
> > ...
> > 2) At function execution time
> > All related objects come down to some function execution.  So, add a 
> > parallel
> > safety check there when in a parallel worker.  If the current process is a 
> > parallel
> > worker and the function is parallel unsafe, error out with ereport(ERROR).  
> > This
> > approach eliminates the oversight of parallel safety check with the 
> > additional
> > bonus of tiny code change!
> >
> > The place would be FunctionCallInvoke().  It's a macro in fmgr.h now.  
> > Perhaps
> > we should make it a function in fmgr.c, so that fmgr.h does not have to 
> > include
> > header files for parallelism-related definitions.
> >
> > We have to evaluate the performance effect of converting 
> > FunctionCallInvoke()
> > into a function and adding an if statement there, because it's a relatively
> > low-level function.
>
> Based on the above, we plan to move forward with approach 2) (the declarative 
> idea).
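
As a rough illustration of the run-time check described in 2) above (not the actual proposal; the syscache lookup via func_parallel() here would presumably be replaced by a flag cached in FmgrInfo, and the error code is only a placeholder):

#include "access/parallel.h"     /* IsParallelWorker() */
#include "catalog/pg_proc.h"     /* PROPARALLEL_UNSAFE */
#include "utils/lsyscache.h"     /* func_parallel(), get_func_name() */

static Datum
FunctionCallInvokeChecked(FunctionCallInfo fcinfo)
{
    Oid     fn_oid = fcinfo->flinfo->fn_oid;

    if (IsParallelWorker() &&
        func_parallel(fn_oid) == PROPARALLEL_UNSAFE)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("parallel-unsafe function \"%s\" called in a parallel worker",
                        get_func_name(fn_oid))));

    return FunctionCallInvoke(fcinfo);
}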

IIUC, the declarative behaviour idea attributes parallel
safe/unsafe/restricted tags to each table, with the default being
unsafe. Does it mean that for a parallel-unsafe table no parallel selects or
inserts (and maybe updates) will be picked? Or is it only the parallel
inserts? If both parallel inserts and selects are affected, will the
existing tables need to be adjusted to set the parallel safety tags
while migrating?

Another point: what does it mean for a table to be parallel restricted?
What should happen if it is present in a query of other parallel safe
tables?

I may be wrong here: IIUC, the main problem we are trying to solve
with the declarative approach is to let the user decide parallel
safety for partition tables as it may be costlier for postgres to
determine it. And for the normal tables we can perform parallel safety
checks without incurring much cost. So, I think we should restrict the
declarative approach to only partitioned tables?

While reading the design, I came across this "erroring out during
execution of a query when a parallel unsafe function is detected". If
this is correct, doesn't it require users to run
pg_get_parallel_safety to know the parallel unsafe objects, set
parallel safety to all of them if possible, otherwise disable
parallelism to run the query? Isn't this burdensome? Instead, how
about postgres retries the query upon detecting the error that came
from a parallel unsafe function during execution, disable parallelism
and run the query? I think this kind of retry query feature can be
built outside of the core postgres, but IMO it will be good to have
inside (of course configurable). IIRC, the Teradata database has a
Query Retry feature.

With Regards,
Bharath Rupireddy.
EnterpriseDB: 

Re: A test for replay of regression tests

2021-04-23 Thread Andrew Dunstan


On 4/23/21 1:37 AM, Thomas Munro wrote:
> Hi,
>
> We have automated tests for many specific replication and recovery
> scenarios, but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably
> along with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here),
> that'd hopefully provide some decent improvements in coverage of the
> various redo routines, many of which are not currently exercised at
> all.
>
> I'm not quite sure where it belongs, though.  The attached initial
> sketch patch puts it under src/test/recovery near other similar things,
> but I'm not sure if it's really OK to invoke make -C ../regress from
> here.  I copied pg_upgrade/test.sh's trick of using a different
> outputdir to avoid clobbering a concurrent run under src/test/regress,
> and I also needed to invent a way to stop it from running the cursed
> tablespace test (deferring startup of the standby also works but eats
> way too much space, which I learned after blowing out a smallish
> development VM's disk).   Alternatively, we could put it under
> src/test/regress, which makes some kind of logical sense, but it would
> make a quick "make check" take more than twice as long.  Perhaps it
> could use a different target, one that check-world somehow reaches but
> not check, and that also doesn't seem great.  Ideas on how to
> structure this or improve the perl welcome.



Nice, I like adding a skip-tests option to pg_regress.

The perl looks ok - I assume those

    print "standby 1: $result\n";  

lines are there for debugging. Normally you would just process $result
using the Test::More functions


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: decoupling table and index vacuum

2021-04-23 Thread Robert Haas
On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada  wrote:
> I think we can divide the TID fork into 16MB or 32MB chunks like WAL
> segment files so that we can easily remove old chunks. Regarding the
> efficient search part, I think we need to consider the case where the
> TID fork gets bigger than maintenance_work_mem. In that case, during
> the heap scan, we need to check if the discovered TID exists in a
> chunk of the TID fork that could be on the disk. Even if all
> known-dead-TIDs are loaded into an array on the memory, it could get
> much slower than the current heap scan to bsearch over the array for
> each dead TID discovered during heap scan. So it would be better to
> have a way to skip searching by already recorded TIDs. For example,
> during heap scan or HOT pruning, I think that when marking TIDs dead
> and recording it to the dead TID fork we can mark them “dead and
> recorded” instead of just “dead” so that future heap scans can skip
> those TIDs without existence check.

I'm not very excited about this. If we did this, and if we ever
generated dead-but-not-recorded TIDs, then we will potentially dirty
those blocks again later to mark them recorded.

Also, if bsearch() is a bottleneck, how about just using an O(1)
algorithm instead of an O(lg n) algorithm, rather than changing the
on-disk format?

Also, can you clarify exactly what you think the problem case is here?
It seems to me that:

1. If we're pruning the heap to collect dead TIDs, we should stop when
the number of TIDs we've accumulated reaches maintenance_work_mem. It
is possible that we could find when starting to prune that there are
*already* more dead TIDs than will fit, because maintenance_work_mem
might have been reduced since they were gathered. But that's OK: we
can figure out that there are more than will fit without loading them
all, and since we shouldn't do additional pruning in this case,
there's no issue.

2. If we're sanitizing indexes, we should normally discover that there
are few enough TIDs that we can still fit them all in memory. But if
that proves not to be the case, again because for example
maintenance_work_mem has been reduced, then we can handle that with
multiple index passes just as we do today.

3. If we're going back to the heap to permit TIDs to be recycled by
setting dead line pointers to unused, we can load in as many of those
as will fit in maintenance_work_mem, sort them by block number, and go
through block by block and DTRT. Then, we can release all that memory
and, if necessary, do the whole thing again. This isn't even
particularly inefficient.
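
A sketch of point 3, just to illustrate the shape of the loop; the helpers load_dead_tids() and mark_unused_in_block() and the comparator (sorting by block number) are invented names, not existing functions:

ItemPointer tids = palloc(max_tids_in_memory * sizeof(ItemPointerData));
int         ntids;

while ((ntids = load_dead_tids(tids, max_tids_in_memory)) > 0)
{
    qsort(tids, ntids, sizeof(ItemPointerData), tid_block_comparator);

    for (int i = 0; i < ntids;)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&tids[i]);
        int         j = i;

        while (j < ntids && ItemPointerGetBlockNumber(&tids[j]) == blkno)
            j++;

        /* set the dead line pointers in this block to unused, one pass per block */
        mark_unused_in_block(rel, blkno, &tids[i], j - i);
        i = j;
    }
}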

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A test for replay of regression tests

2021-04-23 Thread Andres Freund
Hi,

On 2021-04-23 17:37:48 +1200, Thomas Munro wrote:
> We have automated tests for many specific replication and recovery
> scenarios, but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably
> along with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here),
> that'd hopefully provide some decent improvements in coverage of the
> various redo routines, many of which are not currently exercised at
> all.

Yay.


> I'm not quite sure where it belongs, though.  The attached initial
> sketch patch puts it under src/test/recovery near other similar things,
> but I'm not sure if it's really OK to invoke make -C ../regress from
> here.

I'd say it's not ok, and we should just invoke pg_regress without make.


> Add a new TAP test under src/test/recovery that runs the regression
> tests with wal_consistency_checking=all.

Hm. I wonder if running with wal_consistency_checking=all doesn't also
reduce coverage of some aspects of recovery, by increasing record sizes
etc.


> I copied pg_upgrade/test.sh's trick of using a different
> outputdir to avoid clobbering a concurrent run under src/test/regress,
> and I also needed to invent a way to stop it from running the cursed
> tablespace test (deferring startup of the standby also works but eats
> way too much space, which I learned after blowing out a smallish
> development VM's disk).

That's because you are using wal_consistency_checking=all, right?
Because IIRC we don't generate that much WAL otherwise?


> +# Create some content on primary and check its presence in standby 1
> +$node_primary->safe_psql('postgres',
> + "CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
> +
> +# Wait for standby to catch up
> +$node_primary->wait_for_catchup($node_standby_1, 'replay',
> + $node_primary->lsn('insert'));

> +my $result =
> +  $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(1002), 'check streamed content on standby 1');

Why is this needed?



Greetings,

Andres Freund




RE: Forget close an open relation in ReorderBufferProcessTXN()

2021-04-23 Thread osumi.takami...@fujitsu.com
On Saturday, April 17, 2021 4:13 PM Amit Kapila  wrote:
> On Sat, Apr 17, 2021 at 12:05 PM Japin Li  wrote:
> >
> > On Sat, 17 Apr 2021 at 14:09, Amit Kapila 
> wrote:
> > > On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila 
> wrote:
> > >>
> > >> On Thu, Apr 15, 2021 at 4:00 PM Japin Li  wrote:
> > >> >
> > >> > The RelationIdGetRelation() comment says:
> > >> >
> > >> > > Caller should eventually decrement count. (Usually, that
> > >> > > happens by calling RelationClose().)
> > >> >
> > >> > However, it doesn't do it in ReorderBufferProcessTXN().
> > >> > I think we should close it, here is a patch that fixes it. Thoughts?
> > >> >
> > >>
> > >> +1. Your fix looks correct to me but can we test it in some way?
> > >>
> > >
> > > I have tried to find a test but not able to find one. I have tried
> > > with a foreign table but we don't log truncate for it, see
> > > ExecuteTruncate. It has a check that it will log for relids where
> > > RelationIsLogicallyLogged. If that is the case, it is not clear to
> > > me how we can ever hit this condition? Have you tried to find the test?
> >
> > I also don't find a test for this.  It is introduced in 5dfd1e5a6696,
> > wrote by Simon Riggs, Marco Nenciarini and Peter Eisentraut.  Maybe
> > they can explain when we can enter this condition?
> 
> My guess is that this has been copied from the code a few lines above to
> handle insert/update/delete where it is required to handle some DDL ops like
> Alter Table but I think we don't need it here (for Truncate op). If that
> understanding turns out to be true then we should either have an Assert for
> this or an elog message.
In this thread, we are discussing 3 topics below...

(1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE in 
ReorderBufferProcessTXN()
(2) discussion of whether we disallow decoding of operations on user catalog 
tables or not
(3) memory leak of maybe_send_schema() (patch already provided)

Let's address those one by one.
In terms of (1), which was close to the motivation of this thread,
first of all, I traced the truncate processing,
and I think the check is also done on the truncate command side.
I preferred an Assert rather than an elog that is never called,
but it's OK to choose elog if someone has a strong opinion on it.
Attached the patch for this.
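
For reference, the pairing that the RelationIdGetRelation() comment quoted upthread asks for is simply the following (an illustrative fragment, not taken from the patch):

Relation    rel = RelationIdGetRelation(relid);

if (RelationIsValid(rel))
{
    /* ... use rel ... */
    RelationClose(rel);     /* decrement the reference count */
}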

Best Regards,
Takamichi Osumi



replace_check_of_ReorderBufferProcessTXN_v01.patch
Description: replace_check_of_ReorderBufferProcessTXN_v01.patch


Re: How to test Postgres for any unaligned memory accesses?

2021-04-23 Thread Tom Lane
Bharath Rupireddy  writes:
> I'm trying to test Postgres code for any unaligned memory accesses. I
> used a hack shown at [1] and put it in exec_simple_query, then I'm
> seeing a SIGBUS error from SplitIdentifierString's strncpy, see [2].

Regardless of Postgres' policy about alignment safety, glibc sees
no reason to avoid unaligned accesses on x86 hardware.  If you want
to test this sort of thing on hardware that's not actually alignment
picky, you have to enlist the toolchain's help.

> I'm not sure this is the right way. I would like to know whether there
> is a standard way of testing Postgres code for any unaligned memory
> accesses. Thanks. Any help would be appreciated.

Per c.h, late-model compilers have options for this:

 * Testing can be done with "-fsanitize=alignment -fsanitize-trap=alignment"
 * on clang, or "-fsanitize=alignment -fno-sanitize-recover=alignment" on gcc.

We have at least one buildfarm member using the former.  I have no idea
how water-tight these checks are though.  They don't seem to cause very
much slowdown, which is suspicious :-(
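
For anyone wanting to see the sanitizer fire, a trivially misaligned access is enough. Something like this standalone example (not from Postgres), built with the flags above and without optimization, should trap:

#include <stdint.h>
#include <string.h>

int
main(void)
{
    char    buf[sizeof(int64_t) + 1];
    int64_t v = 42;

    memcpy(buf + 1, &v, sizeof(v));         /* fine: byte-wise copy */

    int64_t *p = (int64_t *) (buf + 1);     /* deliberately misaligned */
    return (int) *p;                        /* alignment sanitizer traps here */
}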

regards, tom lane




RE: Truncate in synchronous logical replication failed

2021-04-23 Thread osumi.takami...@fujitsu.com
On Friday, April 23, 2021 6:03 PM I wrote:
> I've combined v5 with above accepted comments.
> 
> Just in case, I've conducted make check-world, the test of
> clobber_cache_always mode again for v6 and tight loop test of 100 times for
> 010_truncate.pl.
> The result is that all passed with no failure.
I'm sorry, I realized another minor thing which should be fixed.

In v6, I did below.
+Bitmapset *
+RelationGetIdentityKeyBitmap(Relation relation)
+{
+   Bitmapset  *idindexattrs;   /* columns in the replica identity */
...
+   /* Build attributes to idindexattrs */
+   idindexattrs = NULL;

But since we removed the loop, we can initialize idindexattrs
to NULL directly at its declaration.
v7 updates this part and the related comments from v6.

Best Regards,
Takamichi Osumi



truncate_in_synchronous_logical_replication_v07.patch
Description: truncate_in_synchronous_logical_replication_v07.patch


Re: CTE push down

2021-04-23 Thread Alexander Pyhalov

Ashutosh Bapat писал 2021-04-14 16:01:

On Tue, Apr 13, 2021 at 6:58 PM Alexander Pyhalov
 wrote:



I believe step2 is needed to avoid materializing rows which will never
be selected. That would be a good improvement. However, care needs to
be taken for volatile quals. I think, the quals on CTE will be
evaluated twice, once when materializing the CTE result and second
time when scanning the materialized result. volatile quals may produce
different results when run multiple times.



Is there something else I miss?
Does somebody work on alternative solution or see issues in such
approach?


IMO, a POC patch will help understand your idea.


Hi.

I have a POC patch which allows distributing restrictinfos inside CTEs.

However, I found I can't efficiently do partition pruning.
By the time the CTE replan stage happens, the plans are already done. I can
create alternative paths for relations, for example as in the
Try-prune-partitions patch.

However, new paths are not propagated to finalrel (UPPER_REL).
I'm not sure how to achieve this and need some advice.
Should we redo part of the work done by grouping_planner() at the end of
SS_replan_ctes()?
Should we rely on executor partition pruning (with current patches it 
doesn't work)?
Should we create init plans for ctes after grouping_planner(), not 
before?


--
Best regards,
Alexander Pyhalov,
Postgres ProfessionalFrom df32be065c8c91cd132ccf7cbd1edc1862c7ac93 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Fri, 23 Apr 2021 15:18:21 +0300
Subject: [PATCH] Push down restrictinfos to CTE

---
 src/backend/executor/functions.c  |   1 +
 src/backend/nodes/copyfuncs.c |  20 ++
 src/backend/nodes/equalfuncs.c|  14 +
 src/backend/nodes/outfuncs.c  |  21 ++
 src/backend/nodes/readfuncs.c |   1 +
 src/backend/optimizer/plan/initsplan.c| 340 ++
 src/backend/optimizer/plan/planner.c  |   6 +
 src/backend/optimizer/plan/subselect.c| 252 +++-
 src/backend/optimizer/prep/prepjointree.c |   9 +
 src/backend/parser/parse_relation.c   |   1 +
 src/backend/rewrite/rewriteHandler.c  |   1 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/pathnodes.h |  21 ++
 src/include/optimizer/planmain.h  |   2 +
 src/include/optimizer/subselect.h |   2 +
 16 files changed, 691 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 39580f7d577..7e8af62a954 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -1964,6 +1964,7 @@ tlist_coercion_finished:
 		rte = makeNode(RangeTblEntry);
 		rte->rtekind = RTE_SUBQUERY;
 		rte->subquery = parse;
+		rte->rtoffset = -1;
 		rte->eref = rte->alias = makeAlias("*SELECT*", colnames);
 		rte->lateral = false;
 		rte->inh = false;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 632cc31a045..e46db5f0ce9 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2373,6 +2373,22 @@ _copyRestrictInfo(const RestrictInfo *from)
 	return newnode;
 }
 
+/*
+ * _copyRelRestrictInfos
+ */
+static RelRestrictInfos *
+_copyRelRestrictInfos(const RelRestrictInfos * from)
+{
+	RelRestrictInfos *newnode = makeNode(RelRestrictInfos);
+
+	/* root is used only as address */
+	COPY_SCALAR_FIELD(root);
+	COPY_SCALAR_FIELD(relid);
+	COPY_NODE_FIELD(restrictinfos);
+
+	return newnode;
+}
+
 /*
  * _copyPlaceHolderVar
  */
@@ -2467,6 +2483,7 @@ _copyRangeTblEntry(const RangeTblEntry *from)
 	COPY_NODE_FIELD(tablesample);
 	COPY_NODE_FIELD(subquery);
 	COPY_SCALAR_FIELD(security_barrier);
+	COPY_SCALAR_FIELD(rtoffset);
 	COPY_SCALAR_FIELD(jointype);
 	COPY_SCALAR_FIELD(joinmergedcols);
 	COPY_NODE_FIELD(joinaliasvars);
@@ -5286,6 +5303,9 @@ copyObjectImpl(const void *from)
 		case T_RestrictInfo:
 			retval = _copyRestrictInfo(from);
 			break;
+		case T_RelRestrictInfos:
+			retval = _copyRelRestrictInfos(from);
+			break;
 		case T_PlaceHolderVar:
 			retval = _copyPlaceHolderVar(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index a410a29a178..fea72ebed8f 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -863,6 +863,16 @@ _equalRestrictInfo(const RestrictInfo *a, const RestrictInfo *b)
 	return true;
 }
 
+static bool
+_equalRelRestrictInfos(const RelRestrictInfos * a, const RelRestrictInfos * b)
+{
+	COMPARE_SCALAR_FIELD(root);
+	COMPARE_SCALAR_FIELD(relid);
+	COMPARE_NODE_FIELD(restrictinfos);
+
+	return true;
+}
+
 static bool
 _equalPlaceHolderVar(const PlaceHolderVar *a, const PlaceHolderVar *b)
 {
@@ -2715,6 +2725,7 @@ _equalRangeTblEntry(const RangeTblEntry *a, const RangeTblEntry *b)
 	COMPARE_NODE_FIELD(tablesample);
 	COMPARE_NODE_FIELD(subquery);
 	COMPARE_SCALAR_FIELD(security_barrier);
+	COMPARE_SCALAR_FIELD(rtoffset);
 	

Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-23 Thread Tom Lane
Greg Nancarrow  writes:
> I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
> that would "lock down the value" of the strict flag, wouldn't it?

It does, but that's much more directly a property of the function's
C code than parallel-safety is.

regards, tom lane




Re: convert elog(LOG) calls to ereport

2021-04-23 Thread Peter Eisentraut

On 05.12.20 03:22, Michael Paquier wrote:

On Fri, Dec 04, 2020 at 02:34:26PM +0100, Peter Eisentraut wrote:

On 2020-12-02 15:04, Alvaro Herrera wrote:

I do wonder if it'd be a good idea to move the syscall
name itself out of the message, too; that would reduce the number of
messages to translate 50x to just "%s(%s) failed: %m" instead of one
message per distinct syscall.


Seems useful, but perhaps as a separate project.


-   elog(LOG, "getsockname() failed: %m");
+   ereport(LOG,
+   (errmsg("getsockname() failed: %m")));
FWIW, I disagree with the approach taken by eb93f3a.  As of HEAD, it
is now required to translate all those strings.  I think that it would
have been better to remove the function names from all those error
messages and not require the same pattern to be translated N times.


I made another pass across this and implemented the requested change.
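
The resulting pattern looks roughly like this, with the syscall name passed as a parameter so translators see a single "%s() failed: %m" string instead of one message per syscall:

/* before */
ereport(LOG,
        (errmsg("getsockname() failed: %m")));

/* after: only "%s() failed: %m" needs translation */
ereport(LOG,
        (errmsg("%s() failed: %m", "getsockname")));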




Re: Table refer leak in logical replication

2021-04-23 Thread Amit Langote
On Thu, Apr 22, 2021 at 1:45 PM Michael Paquier  wrote:
> On Wed, Apr 21, 2021 at 09:58:10PM +0900, Amit Langote wrote:
> > Okay, done.
>
> So, I have been working on that today, and tried to apply the full set
> before realizing when writing the commit message that this was a set
> of bullet points, and that this was too much for a single commit.  The
> tests are a nice thing to have to improve the coverage related to
> tuple routing, but that these are not really mandatory for the sake of
> the fix discussed here.  So for now I have applied the main fix as of
> f3b141c to close the open item.

Thanks for that.

> Now..  Coming back to the tests.
>
> -RETURN NULL;
> +IF (NEW.bid = 4 AND NEW.id = OLD.id) THEN
> +   RETURN NEW;
> +ELSE
> +RETURN NULL;
> +END IF
> This part added in test 003 is subtle.  This is a tweak to make sure
> that the second trigger, AFTER trigger added in this patch, that would
> be fired after the already-existing BEFORE entry, gets its hands on
> the NEW tuple values.  I think that this makes the test more confusing
> than it should, and that could cause unnecessary pain to understand
> what's going on here for a future reader.  Anyway, what's actually
> the coverage we gain with this extra trigger in 003?

Not much maybe.  I am fine with dropping the changes made to 003 if
they are confusing, which I agree they can be.

> On the other hand, the tests for partitions have much more value IMO,
> but looking closely I think that we can do better:
> - Create triggers on more relations of the partition tree,
> particularly to also check when a trigger does not fire.

It seems you're suggesting to adopt 003's trigger firing tests for
partitions in 013, but would we gain much by that?

> - Use a more generic name for tab1_2_op_log and its function
> log_tab1_2_op(), say subs{1,2}_log_trigger_activity.

Sure, done.

> - Create some extra BEFORE triggers perhaps?

Again, that seems separate from what we're trying to do here.  AIUI,
our aim here is to expand coverage for after triggers, and not really
that of the trigger functionality proper, because nothing seems broken
about it, but that of the trigger target relation opening/closing.
ISTM that's what you're talking about below...

> By the way, I had an idea of a trick we could use to check that relations
> do not leak: we could scan the logs for such patterns,

It would be interesting to be able to do something like that, but

> similarly to what issues_sql_like() or connect_{fails,ok}() do
> already, but that would mean increasing the log level and we don't do
> that to ease the load of the nodes.

...sorry, I am not very familiar with our Perl testing infra.  Is
there some script that already does something like this?  For example,
checking in the logs generated by a "node" that no instance of a
certain WARNING is logged?

Meanwhile, attached is the updated version containing some of the
changes mentioned above.


--
Amit Langote
EDB: http://www.enterprisedb.com


fix_relcache_leak_in_lrworker_v9.patch
Description: Binary data


Re: multi-install PostgresNode fails with older postgres versions

2021-04-23 Thread Andrew Dunstan

On 4/23/21 12:41 AM, Michael Paquier wrote:
> On Thu, Apr 22, 2021 at 08:43:10PM -0400, Andrew Dunstan wrote:
>> Interesting point. Maybe we need to do something like devel = -4, alpha
>> = -3, beta = -2, rc = -1. Or maybe that's overkill.
> And after that it would come to how many betas, alphas or RCs you
> have, but you can never be sure of how many of each you may finish
> with.  I think that you have the right answer with just marking all
> of them with -1 for the minor number, keeping the code a maximum
> simple.


Yeah, I think it's ok for comparison purposes just to lump them all
together. Here's a patch that does that and some consequent cleanup.
Note we now cache the string rather than trying to reconstruct it.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

diff --git a/src/test/perl/PostgresVersion.pm b/src/test/perl/PostgresVersion.pm
index 3f3744ccfa..7ce9e62b79 100644
--- a/src/test/perl/PostgresVersion.pm
+++ b/src/test/perl/PostgresVersion.pm
@@ -79,18 +79,24 @@ sub new
 	# postgres command line tool
 	my $devel;
 	($arg,$devel) = ($1, $2)
-	  if ($arg =~  m/^(?:\(?PostgreSQL\)? )?(\d+(?:\.\d+)*)(devel)?/);
+	  if ($arg =~
+		  m!^ # beginning of line
+  (?:\(?PostgreSQL\)?\s)? # ignore PostgreSQL marker
+  (\d+(?:\.\d+)*) # version number, dotted notation
+  (devel|(?:alpha|beta|rc)\d+)?   # dev marker - see version_stamp.pl
+		 !x);
 
 	# Split into an array
-	my @result = split(/\./, $arg);
+	my @numbers = split(/\./, $arg);
 
 	# Treat development versions as having a minor/micro version one less than
 	# the first released version of that branch.
-	push @result, -1 if ($devel);
+	push @numbers, -1 if ($devel);
 
-	return bless \@result, $class;
-}
+	$devel ||= "";
 
+	return bless  { str => "$arg$devel", num => \@numbers }, $class;
+}
 
 # Routine which compares the _pg_version_array obtained for the two
 # arguments and returns -1, 0, or 1, allowing comparison between two
@@ -108,27 +114,21 @@ sub _version_cmp
 
 	$b = __PACKAGE__->new($b) unless blessed($b);
 
+	my ($an, $bn) = ($a->{num}, $b->{num});
+
 	for (my $idx = 0;; $idx++)
 	{
-		return 0 unless (defined $a->[$idx] && defined $b->[$idx]);
-		return $a->[$idx] <=> $b->[$idx]
-		  if ($a->[$idx] <=> $b->[$idx]);
+		return 0 unless (defined $an->[$idx] && defined $bn->[$idx]);
+		return $an->[$idx] <=> $bn->[$idx]
+		  if ($an->[$idx] <=> $bn->[$idx]);
 	}
 }
 
-# Render the version number in the standard "joined by dots" notation if
-# interpolated into a string. Put back 'devel' if we previously turned it
-# into a -1.
+# Render the version number using the saved string.
 sub _stringify
 {
 	my $self = shift;
-	my @sections = @$self;
-	if ($sections[-1] == -1)
-	{
-		pop @sections;
-		$sections[-1] = "$sections[-1]devel";
-	}
-	return join('.', @sections);
+	return $self->{str};
 }
 
 1;


Re: decoupling table and index vacuum

2021-04-23 Thread Masahiko Sawada
On Fri, Apr 23, 2021 at 3:47 AM Robert Haas  wrote:
>
> On Thu, Apr 22, 2021 at 10:28 AM Masahiko Sawada  
> wrote:
> > The dead TID fork needs to also be efficiently searched. If the heap
> > scan runs twice, the collected dead TIDs on each heap pass could be
> > overlapped. But we would not be able to merge them if we did index
> > vacuuming on one of indexes at between those two heap scans. The
> > second time heap scan would need to record only TIDs that are not
> > collected by the first time heap scan.
>
> I agree that there's a problem here. It seems to me that it's probably
> possible to have a dead TID fork that implements "throw away the
> oldest stuff" efficiently, and it's probably also possible to have a
> TID fork that can be searched efficiently. However, I am not sure that
> it's possible to have a dead TID fork that does both of those things
> efficiently. Maybe you have an idea. My intuition is that if we have
> to pick one, it's MUCH more important to be able to throw away the
> oldest stuff efficiently. I think we can work around the lack of
> efficient lookup, but I don't see a way to work around the lack of an
> efficient operation to discard the oldest stuff.

Agreed.

I think we can divide the TID fork into 16MB or 32MB chunks, like WAL
segment files, so that we can easily remove old chunks. Regarding the
efficient search part, I think we need to consider the case where the
TID fork gets bigger than maintenance_work_mem. In that case, during
the heap scan, we need to check whether a discovered TID exists in a
chunk of the TID fork that could be on disk. Even if all known dead
TIDs are loaded into an in-memory array, doing a bsearch over that
array for each dead TID discovered during the heap scan could be much
slower than the current heap scan. So it would be better to have a way
to skip searching for already-recorded TIDs. For example, during the
heap scan or HOT pruning, I think that when marking TIDs dead and
recording them to the dead TID fork we can mark them "dead and
recorded" instead of just "dead", so that future heap scans can skip
those TIDs without an existence check.
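
To make the chunk idea a bit more concrete, here is a purely
illustrative sketch in C; the names (DeadTidChunk, DEADTID_CHUNK_SIZE)
and the layout are assumptions made up for the example, not an actual
proposal:

/* Hypothetical on-disk layout of one dead-TID fork chunk. */
#include "postgres.h"
#include "storage/itemptr.h"

#define DEADTID_CHUNK_SIZE	(16 * 1024 * 1024)	/* 16MB, as mentioned above */

typedef struct DeadTidChunkHeader
{
	uint64		chunk_seqno;	/* position of this chunk in the stream */
	uint32		ntids;			/* number of TIDs stored in this chunk */
} DeadTidChunkHeader;

typedef struct DeadTidChunk
{
	DeadTidChunkHeader hdr;
	/* TIDs kept in heap order, so a bsearch within one chunk is possible */
	ItemPointerData tids[FLEXIBLE_ARRAY_MEMBER];
} DeadTidChunk;

With such a layout, discarding the oldest dead TIDs is just dropping
the chunks with the smallest chunk_seqno, much like recycling old WAL
segments, and no per-TID bookkeeping is needed for that case.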

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: TRUNCATE on foreign table

2021-04-23 Thread Bharath Rupireddy
On Fri, Apr 23, 2021 at 1:39 PM Fujii Masao  wrote:
> > +
> > + Note that information about ONLY options specified
> > + in the original TRUNCATE command is not passed to
> >
> > I think it is not "information about", no? We just don't pass ONLY
> > option  instead we skip it. IMO, we can say "Note that the ONLY option
> > specified with a foreign table in the original TRUNCATE command is
> > skipped and not passed to ExecForeignTruncate."
>
> Probably I still fail to understand your point.
> But if "information about" is confusing, I'm ok to
> remove that. Fixed.

A small typo in the docs patch: It is "are not passed to", instead of
"is not passed to" since we used plural "options". "Note that the ONLY
options specified in the original TRUNCATE command are not passed to"

+ Note that the ONLY options specified
  in the original TRUNCATE command is not passed to

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation

2021-04-23 Thread Matthias van de Meent
On Sat, 17 Apr 2021 at 01:05, Peter Geoghegan  wrote:
>
> On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent
>  wrote:
> > > Interesting approach. I think that in an ideal world we would have a
> > > tuple format with attribute lengths/offsets right in the header.
> >
> > I believe that that would indeed be ideal w.r.t. access speed, but
> > also quite expensive w.r.t. amount of data stored. This would add 2
> > bytes per attribute in the current infrastructure (11 bits at least
> > for each attribute to store offsets), on the current 12 bytes of
> > overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is
> > probably always going to be a non-starter, seeing that we can
> > relatively easily optimize our current attribute access patterns.
>
> I don't think that that's why it's a non-starter. This design assumes
> a world in which everything has already been optimized for this
> layout. You no longer get to store the varlena header inline, which
> would break a lot of things in Postgres were it ever to be attempted.
> The space efficiency issues don't really apply because you have an
> offset for fixed-length types -- their presence is always implied. I
> think that you need to encode NULLs differently, which is a lot less
> space efficient when there are a lot of NULLs. But on the whole this
> design seems more efficient than what we have currently.

I believe that that depends on your definition of 'efficiency'. For
storage efficiency, the current design is quite good: all attributes
use just about the least data possible, except for the varlena header
size of 4 bytes for attributes > 127 bytes, which could be 2 bytes
because pages cannot be larger than 64kiB (actually 32kiB) with our
current design. For access efficiency / code complexity, you're
probably right that storing attribute offsets in the tuple header is
preferable, but such a design would still need some alignment calls,
or would have to store the lengths of attributes as well to prevent
reading the alignment padding of the next attribute into the
variable-length attribute, at an additional overhead of up to 2 bytes
per attribute.

> This representation of index tuples would be a totally reasonable
> design were we in a green field situation. (Which is pretty far from
> the situation we're actually in, of course.)

That might indeed be the case, assuming a green field with different
or no processing architecture or storage limitations. CPU to storage
bandwidth can be (and often is) a bottleneck, as well as alignment.

> > I understand and appreciate that the "-INF" truncation that is
> > currently in place is being relied upon in quite some places. Part of
> > the effort for "+INF" truncation would be determining where and how to
> > keep the benefits of the "-INF" truncation. I also believe that for
> > internal pages truncating to "+INF" would be perfectly fine; the
> > optimizations that I know of only rely on it at the leaf level.
>
> I don't doubt that there is nothing special about -inf from a key
> space point of view. Actually...you could say that -inf is special to
> the limited extent that we know it only appears in pivot tuples and
> exploit that property when the !pivotsearch case/optimization is used.
> But that isn't much of an exception at a high level, so whatever.
>
> Anyway, it follows that +inf could in principle be used instead in
> some or all cases -- all that is truly essential for correctness is
> that the invariants always be respected. We're still in agreement up
> until here.

Agreed

> > Yes, I also read and appreciate your comments on +inf vs -inf when
> > this came up in [0].
>
> I'm impressed that you've done your homework on this.
>
> > However, if we could choose, I think that having
> > both options could be quite beneficial, especially when dealing with
> > many duplicates or duplicate prefixes.
>
> This is where things are much less clear -- maybe we're not in
> agreement here. Who knows, though -- maybe you're right. But you
> haven't presented any kind of argument. I understand that it's hard to
> articulate what effects might be in play with stuff like this, so I
> won't force the issue now. Strong evidence is of course the only way
> that you'll reliably convince me of this.
>
> I should point out that I am a little confused about how this +inf
> business could be both independently useful and pivotal to
> implementing [dynamic] prefix truncation/compression. Seems...weird to
> discuss them together, except maybe to mention in passing that this
> +inf thing is notable for particularly helping dynamic prefix stuff --
> which is it?

I agree that my reasoning might have been unclear and confusing.

I mean that most benefits that we could receive from +inf would be in
improving the ability to apply [dynamic] prefix truncation on a page
by limiting the keyspace of that page to 'local' values. If prefix
truncation is impossible / does not apply for some index (a single
unique column !allequalimage index is 

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-23 Thread Fujii Masao




On 2021/04/23 19:11, Magnus Hagander wrote:

On Fri, Apr 23, 2021 at 12:04 PM Fujii Masao
 wrote:




On 2021/04/23 18:46, Magnus Hagander wrote:

On Fri, Apr 23, 2021 at 9:10 AM Fujii Masao  wrote:




On 2021/04/22 18:23, Julien Rouhaud wrote:

On Thu, Apr 22, 2021 at 12:28:11AM +0900, Fujii Masao wrote:


I found another small issue in pg_stat_statements docs. The following
description in the docs should be updated so that toplevel is included?


This view contains one row for each distinct database ID, user ID and query ID


Indeed!  I'm adding Magnus in Cc.

PFA a patch to fix at, and also mention that toplevel will only
contain True values if pg_stat_statements.track is set to top.


Thanks for the patch! LGTM.


Agreed, in general. But going by the example a few lines down, I
changed the second part to:
  True if the query was executed as a top level statement
+   (if pg_stat_statements.track is set to
+   all, otherwise always false)


Isn't this confusing? Users may mistakenly read this as that the toplevel
column always indicates false if pg_stat_statements.track is not "all".


Hmm. I think you're right. It should say "always true", shouldn't it?


You're thinking something like the following?

True if the query was executed as a top level statement
(if pg_stat_statements.track is set to
top, always true)


So not just confusing, but completely wrong? :)


Yeah :)

I'm fine with the original wording by Julien.
Of course, the parameter name should be corrected as you did, though.

Or what about the following?

True if the query was executed as a top level statement
(this can be false only if
pg_stat_statements.track is set to
all and nested statements are also tracked)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




How to test Postgres for any unaligned memory accesses?

2021-04-23 Thread Bharath Rupireddy
Hi,

I'm trying to test Postgres code for any unaligned memory accesses. I
used a hack shown at [1] and put it in exec_simple_query, then I'm
seeing a SIGBUS error from SplitIdentifierString's strncpy, see [2].
It looks like the SIGBUS error occurs even if a simple memcpy(for
testing purpose) is done in recomputeNamespacePath or
SplitIdentifierString.

I'm not sure this is the right way. I would like to know whether there
is a standard way of testing Postgres code for any unaligned memory
accesses. Thanks. Any help would be appreciated.

[1] - https://www.programmersought.com/article/17701994124/
+/* Enable Alignment Checking */
+#if defined(__GNUC__)
+# if defined(__i386__)
+/* Enable Alignment Checking on x86 */
+__asm__("pushf\norl $0x4,(%esp)\npopf");
+# elif defined(__x86_64__)
+ /* Enable Alignment Checking on x86_64 */
+__asm__("pushf\norl $0x4,(%rsp)\npopf");
+# endif
+#endif
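
For reference, here is a tiny standalone program (not part of the hack
above) showing what the flag does. The AC bit is bit 18 of EFLAGS
(0x40000); once it is set, a plain dereference through a misaligned
pointer dies with SIGBUS, which is the same failure class as the
strncpy backtrace in [2]:

/* gcc -O0 align_demo.c && ./a.out  -> killed by SIGBUS on x86-64 */
#include <stdint.h>
#include <string.h>

int
main(void)
{
	char		buf[16];
	uint32_t	v;

	memset(buf, 0, sizeof(buf));

#if defined(__x86_64__)
	/* Enable alignment checking (AC is bit 18 of EFLAGS) */
	__asm__("pushf\norl $0x40000,(%rsp)\npopf");
#endif

	v = *(uint32_t *) (buf + 1);	/* misaligned 4-byte load -> SIGBUS */

	return (int) v;
}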

[2]
Program received signal SIGBUS, Bus error.
0x7f5067188d36 in __strncpy_sse2_unaligned () from /lib64/libc.so.6
(gdb) bt
#0  0x7f5067188d36 in __strncpy_sse2_unaligned () from /lib64/libc.so.6
#1  0x00ada740 in SplitIdentifierString (rawstring=0x1146620 "\"$user",
separator=44 ',', namelist=0x7ffcdf1911d0) at varlena.c:3817
#2  0x005d203b in recomputeNamespacePath () at namespace.c:3761
#3  0x005cde11 in FuncnameGetCandidates (names=0x1145e08,
nargs=2, argnames=0x0,
expand_variadic=true, expand_defaults=true, missing_ok=false) at
namespace.c:971
#4  0x00647dcb in func_get_detail (funcname=0x1145e08, fargs=0x1146570,
fargnames=0x0, nargs=2, argtypes=0x7ffcdf191540, expand_variadic=true,
expand_defaults=true, funcid=0x7ffcdf1916d8, rettype=0x7ffcdf1916dc,
retset=0x7ffcdf19152f, nvargs=0x7ffcdf191528, vatype=0x7ffcdf191524,
true_typeids=0x7ffcdf191538, argdefaults=0x7ffcdf191530) at
parse_func.c:1421
#5  0x00645961 in ParseFuncOrColumn (pstate=0x11462e8,
funcname=0x1145e08,
fargs=0x1146570, last_srf=0x0, fn=0x1145f28, proc_call=false, location=14)
at parse_func.c:265

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Disallow cancellation of waiting for synchronous replication

2021-04-23 Thread Andrey Borodin
Hi Aleksander!

Thanks for looking into this.

> On 23 Apr 2021, at 14:30, Aleksander Alekseev  wrote:
> 
> Hi hackers,
> 
 After using a patch for a while it became obvious that PANICing during 
 termination is not a good idea. Even when we wait for synchronous 
 replication. It generates undesired coredumps.
 I think in presence of SIGTERM it's reasonable to say that we cannot 
 protect user anymore.
 PFA v3.
> 
> This patch, although solving a concrete and important problem, looks
> more like a quick workaround than an appropriate solution. Or is it
> just me?
> 
> Ideally, the transaction should be committed only after getting a
> reply from the standby.
Getting a reply from the standby is part of the commit. The commit is completed only 
when the WAL has reached the standby. The commit, certainly, was initiated before 
getting the reply from the standby. We cannot commit only after we commit.

> If the user cancels the transaction, it
> doesn't get committed anywhere.
The problem is that the user tries to cancel a transaction after they have asked for 
commit. We never promised to roll back a committed transaction.
When the user asks for commit, we insert a commit record into the WAL, and then wait 
until it is acknowledged by a quorum of standbys and by local storage.
We cannot discard this record on standbys. Or, at some point, we would have to 
discard discard records. Or discard discard discard records.

> This is what people into distributed
> systems would expect unless stated otherwise, at least.
I think our transaction semantics are stated clearly in the documentation.

> Although I
> realize how complicated it is to implement, especially considering all
> the possible corner cases (netsplit right after getting a reply, etc).
> Maybe we could come up with a less than ideal, but still sound and
> easy-to-understand model, which, as soon as you learned it, doesn't
> bring unexpected surprises to the user.
The model proposed by my patch is as follows:
transaction effects should not be observable on the primary until the requirements of 
synchronous_commit are satisfied.

E.g. even if the user cancels a locally committed transaction, we should not release 
the locks held by that transaction.
What unexpected surprises do you see in this model?
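
For illustration, here is a rough sketch (my own simplification, not the actual
patch) of how the wait for synchronous replication would treat the two kinds of
interrupts under this model; sync_rep_acknowledged() and commit_lsn are
placeholders invented for the example:

	for (;;)
	{
		if (sync_rep_acknowledged(commit_lsn))
			break;				/* standby confirmed, commit is complete */

		if (ProcDiePending)
			break;				/* SIGTERM: we cannot protect the user anymore */

		if (QueryCancelPending)
		{
			/* Ignore cancel: the transaction is already committed locally. */
			QueryCancelPending = false;
			ereport(WARNING,
					(errmsg("canceling wait for synchronous replication is not allowed"),
					 errdetail("The transaction stays committed locally and keeps its locks until a standby acknowledges it.")));
		}

		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
						 WAIT_EVENT_SYNC_REP);
		ResetLatch(MyLatch);
	}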

Thanks for reviewing!

Best regards, Andrey Borodin.



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-23 Thread Fujii Masao




On 2021/04/23 18:46, Magnus Hagander wrote:

On Fri, Apr 23, 2021 at 9:10 AM Fujii Masao  wrote:




On 2021/04/22 18:23, Julien Rouhaud wrote:

On Thu, Apr 22, 2021 at 12:28:11AM +0900, Fujii Masao wrote:


I found another small issue in pg_stat_statements docs. The following
description in the docs should be updated so that toplevel is included?


This view contains one row for each distinct database ID, user ID and query ID


Indeed!  I'm adding Magnus in Cc.

PFA a patch to fix at, and also mention that toplevel will only
contain True values if pg_stat_statements.track is set to top.


Thanks for the patch! LGTM.


Agreed, in general. But going by the example a few lines down, I
changed the second part to:
 True if the query was executed as a top level statement
+   (if pg_stat_statements.track is set to
+   all, otherwise always false)


Isn't this confusing? Users may mistakenly read this as that the toplevel
column always indicates false if pg_stat_statements.track is not "all".



(changes the wording, but also the name of the parameter is
pg_stat_statements.track, not pg_stat_statements.toplevel (that's the
column, not the parameter). Same error in the commit msg except there
you called it pg_stat_statements.top - but that one needed some more
fix as well)

With those changes, applied. Thanks!


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-23 Thread Magnus Hagander
On Fri, Apr 23, 2021 at 12:04 PM Fujii Masao
 wrote:
>
>
>
> On 2021/04/23 18:46, Magnus Hagander wrote:
> > On Fri, Apr 23, 2021 at 9:10 AM Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2021/04/22 18:23, Julien Rouhaud wrote:
> >>> On Thu, Apr 22, 2021 at 12:28:11AM +0900, Fujii Masao wrote:
> 
>  I found another small issue in pg_stat_statements docs. The following
>  description in the docs should be updated so that toplevel is included?
> 
> > This view contains one row for each distinct database ID, user ID and 
> > query ID
> >>>
> >>> Indeed!  I'm adding Magnus in Cc.
> >>>
> >>> PFA a patch to fix at, and also mention that toplevel will only
> >>> contain True values if pg_stat_statements.track is set to top.
> >>
> >> Thanks for the patch! LGTM.
> >
> > Agreed, in general. But going by the example a few lines down, I
> > changed the second part to:
> >  True if the query was executed as a top level statement
> > +   (if pg_stat_statements.track is set to
> > +   all, otherwise always false)
>
> Isn't this confusing? Users may mistakenly read this as that the toplevel
> column always indicates false if pg_stat_statements.track is not "all".

Hmm. I think you're right. It should say "always true", shouldn't it?
So not just confusing, but completely wrong? :)


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-23 Thread Amit Kapila
On Wed, Apr 21, 2021 at 7:04 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Wed, Apr 21, 2021 at 8:12 AM tsunakawa.ta...@fujitsu.com
> >  wrote:
> >> From: Tom Lane 
> >>> [ raised eyebrow... ]  I find it very hard to understand why that would
> >>> be necessary, or even a good idea.
>
> > IIUC, the idea here is to check for parallel safety of functions at
> > someplace in the code during function invocation so that if we execute
> > any parallel unsafe/restricted function via parallel worker then we
> > error out. If so, isn't it possible to deal with built-in and
> > non-built-in functions in the same way?
>
> Yeah, one of the reasons I doubt this is a great idea is that you'd
> still have to fetch the pg_proc row for non-built-in functions.
>

So, are you suggesting that we should fetch the pg_proc row for
built-in functions as well for this purpose? If not, then how to
identify parallel safety of built-in functions in fmgr_info()?

Another idea could be that we check the parallel safety of built-in
functions based on some static information. As we know the func_ids of
non-parallel-safe built-in functions, we can have a function
fmgr_builtin_parallel_safe() which returns false if the given func_id
is one of the predefined func_ids of non-parallel-safe built-in
functions, and true otherwise. Then, we can call this new function in
fmgr_info() for built-in functions.
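
Something like this minimal sketch is what I mean (the array contents,
the function name, and the call site are placeholders, not a worked-out
proposal):

/* Sketch only: the list of unsafe built-in func_ids is a placeholder. */
static const Oid parallel_unsafe_builtin_fns[] =
{
	/* func_ids of non-parallel-safe built-in functions would go here */
	InvalidOid					/* placeholder so the array is non-empty */
};

static bool
fmgr_builtin_parallel_safe(Oid functionId)
{
	for (int i = 0; i < lengthof(parallel_unsafe_builtin_fns); i++)
	{
		if (parallel_unsafe_builtin_fns[i] == functionId)
			return false;
	}
	return true;
}

Then fmgr_info(), after finding the built-in entry, could do something like:

	if (IsParallelWorker() && !fmgr_builtin_parallel_safe(functionId))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("parallel-unsafe function called in a parallel worker")));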

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-23 Thread Magnus Hagander
On Fri, Apr 23, 2021 at 9:10 AM Fujii Masao  wrote:
>
>
>
> On 2021/04/22 18:23, Julien Rouhaud wrote:
> > On Thu, Apr 22, 2021 at 12:28:11AM +0900, Fujii Masao wrote:
> >>
> >> I found another small issue in pg_stat_statements docs. The following
> >> description in the docs should be updated so that toplevel is included?
> >>
> >>> This view contains one row for each distinct database ID, user ID and 
> >>> query ID
> >
> > Indeed!  I'm adding Magnus in Cc.
> >
> > PFA a patch to fix at, and also mention that toplevel will only
> > contain True values if pg_stat_statements.track is set to top.
>
> Thanks for the patch! LGTM.

Agreed, in general. But going by the example a few lines down, I
changed the second part to:
True if the query was executed as a top level statement
+   (if pg_stat_statements.track is set to
+   all, otherwise always false)

(changes the wording, but also the name of the parameter is
pg_stat_statements.track, not pg_stat_statements.toplevel (that's the
column, not the parameter). Same error in the commit msg except there
you called it pg_stat_statements.top - but that one needed some more
fix as well)

With those changes, applied. Thanks!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-23 Thread Amit Langote
On Fri, Apr 23, 2021 at 4:26 AM Alvaro Herrera  wrote:
> On 2021-Apr-22, Amit Langote wrote:
> > -* The reason for this check is that we want to avoid seeing the
> > +* The reason for this hack is that we want to avoid seeing the
> >  * partition as alive in RI queries during REPEATABLE READ or
> > 
> > +* SERIALIZABLE transactions.
> >
> > The comment doesn't quite make it clear why it is the RI query case
> > that necessitates this hack in the first case.
>
> I added "such queries use a different snapshot than the one used by
> regular (user) queries."  I hope that's sufficient.

Yeah, that makes sense.

> > Maybe the relation to what's going on with the partdesc

(I had to leave my desk while in the middle of typing this, but I
forget what I was going to add :()

> > BTW, I do feel a bit alarmed by the potential performance impact of
> > this.  If detached_exist of a cached partdesc is true, then RI queries
> > invoked during a bulk DML operation would have to rebuild one for
> > every tuple to be checked, right?  I haven't tried an actual example
> > yet though.
>
> Yeah, I was scared about that too (which is why I insisted on trying to
> add a cached copy of the partdesc omitting detached partitions).  But
> AFAICS what happens is that the plan for the RI query gets cached after
> a few tries; so we do build the partdesc a few times at first, but later
> we use the cached plan and so we no longer use that one.  So at least in
> the normal cases this isn't a serious problem that I can see.

Actually, ri_trigger.c (or really plancache.c) is not very good at
caching the plan when querying partitioned tables; it always chooses
to replan because a generic plan, even with runtime pruning built into
it, looks very expensive compared to a custom one.  Now that's a
problem we will have to fix sooner than later, but until then we have
to work around it.

Here is an example that shows the problem:

create unlogged table pk_parted (a int primary key) partition by range (a);
select 'create unlogged table pk_parted_' || i || ' partition of
pk_parted for values from (' || (i-1) * 1000 + 1 || ') to (' ||  i *
1000 + 1 || ');' from generate_series(1, 1000) i;
\gexec
create unlogged table fk (a int references pk_parted);
insert into pk_parted select generate_series(1, 1);
begin;
select * from fk_parted where a = 1;

In another session:

alter table pk_parted detach partition pk_parted_1000 concurrently;


Back in the 1st session:

end;
insert into fk select generate_series(1, 1);
INSERT 0 1
Time: 47400.792 ms (00:47.401)

The insert took unusually long, because the PartitionDesc for
pk_parted had to be built exactly 1 times, because there's a
detach-pending partition lying around.  There is also a danger of an
OOM with such an insert because of leaking into PortalContext the
memory of every PartitionDesc thus built, especially with larger
counts of PK partitions and rows inserted into the FK table.

Also, I noticed that all queries on pk_parted, not just the RI
queries, have to build the PartitionDesc every time, so take that much
longer:

-- note the planning time
explain analyze select * from pk_parted where a = 1;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using pk_parted_1_pkey on pk_parted_1 pk_parted  (cost=0.28..8.29 rows=1 width=4) (actual time=0.016..0.017 rows=1 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 1
 Planning Time: 7.543 ms
 Execution Time: 0.044 ms
(5 rows)

Finalizing the detach makes the insert and the query finish in normal
time, because the PartitionDesc can be cached again:

alter table pk_parted detach partition pk_parted_1000 finalize;
insert into fk select generate_series(1, 1);
INSERT 0 1
Time: 855.336 ms

explain analyze select * from pk_parted where a = 1;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using pk_parted_1_pkey on pk_parted_1 pk_parted  (cost=0.28..8.29 rows=1 width=4) (actual time=0.033..0.036 rows=1 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 1
 Planning Time: 0.202 ms
 Execution Time: 0.075 ms
(5 rows)

I am afraid we may have to fix this in the code after all, because
there does not seem to be a good way to explain this away in the
documentation.  If I read correctly, you did try an approach of
caching the PartitionDesc that we currently don't cache, no?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Disallow cancellation of waiting for synchronous replication

2021-04-23 Thread Aleksander Alekseev
Hi hackers,

> >> After using a patch for a while it became obvious that PANICing during 
> >> termination is not a good idea. Even when we wait for synchronous 
> >> replication. It generates undesired coredumps.
> >> I think in presence of SIGTERM it's reasonable to say that we cannot 
> >> protect user anymore.
> >> PFA v3.

This patch, although solving a concrete and important problem, looks
more like a quick workaround than an appropriate solution. Or is it
just me?

Ideally, the transaction should be committed only after getting a
reply from the standby. If the user cancels the transaction, it
doesn't get committed anywhere. This is what people into distributed
systems would expect unless stated otherwise, at least. Although I
realize how complicated it is to implement, especially considering all
the possible corner cases (netsplit right after getting a reply, etc).
Maybe we could come up with a less than ideal, but still sound and
easy-to-understand model, which, as soon as you learned it, doesn't
bring unexpected surprises to the user.

I believe at this point it's important to agree if the community is
ready to accept a patch as is to make existing users suffer less and
iterate afterward. Or we choose not to do it and to come up with
another idea. Personally, I don't have any better ideas, thus maybe
accepting Andrey's patch would be the lesser of two evils.

-- 
Best regards,
Aleksander Alekseev




Re: Replication slot stats misgivings

2021-04-23 Thread Amit Kapila
On Mon, Apr 19, 2021 at 4:28 PM vignesh C  wrote:
>
> I have made the changes to update the replication statistics at
> replication slot release. Please find the patch attached for the same.
> Thoughts?
>

Thanks, the changes look mostly good to me. The slot stats need to be
initialized in RestoreSlotFromDisk and ReplicationSlotCreate, not in
StartupDecodingContext. Apart from that, I have moved the declaration
of UpdateDecodingStats from slot.h back to logical.h. I have also
added/edited a few comments. Please check and let me know what you
think of the attached.

-- 
With Regards,
Amit Kapila.


v2-0001-Update-decoding-stats-during-replication-slot-rel.patch
Description: Binary data


RE: Truncate in synchronous logical replication failed

2021-04-23 Thread osumi.takami...@fujitsu.com
On Friday, April 23, 2021 3:43 PM  Amit Kapila  wrote:
> On Fri, Apr 23, 2021 at 12:04 PM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On Thursday, April 22, 2021 6:33 PM Amit Kapila
>  wrote:
> > > On Tue, Apr 20, 2021 at 9:00 AM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > > On  Tuesday, April 20, 2021 10:53 AM Ajin Cherian
> > > > 
> > > wrote:
> > > > >
> > > > > I reviewed the patch, ran make check, no issues. One minor
> comment:
> > > > >
> > > > > Could you add the comment similar to
> > > > > RelationGetIndexAttrBitmap() on why the redo, it's not very
> > > > > obvious to someone reading the code, why we are refetching the
> index list here.
> > > > >
> > > > > + /* Check if we need to redo */
> > > > >
> > > > > + newindexoidlist = RelationGetIndexList(relation);
> > > > Yeah, makes sense. Fixed.
> > >
> > > I don't think here we need to restart to get a stable list of
> > > indexes as we do in RelationGetIndexAttrBitmap. The reason is here
> > > we build the cache entry using a historic snapshot and all the later
> > > changes are absorbed while decoding WAL.
> > I rechecked this and I agree with this.
> > I don't see any problem to remove the redo check.
> > Based on this direction, I'll share my several minor comments.
> >
> > [1] a typo of RelationGetIdentityKeyBitmap's comment
> >
> > + * This is a special purpose function used during logical
> > + replication. Here,
> > + * unlike RelationGetIndexAttrBitmap(), we don't a acquire lock on
> > + the required
> >
> > We have "a" in an unnatural place. It should be "we don't acquire...".
> >
> > [2] suggestion to fix RelationGetIdentityKeyBitmap's comment
> >
> > + * later changes are absorbed while decoding WAL. Due to this reason,
> > + we don't
> > + * need to retry here in case of a change in the set of indexes.
> >
> > I think it's better to use "Because of" instead of "Due to".
> > This patch works to solve the deadlock.
> >
> 
> I am not sure which one is better. I would like to keep it as it is unless 
> you feel
> strongly about point 2.
> 
> > [3] wrong comment in RelationGetIdentityKeyBitmap
> >
> > +   /* Save some values to compare after building attributes */
> >
> > I wrote this comment for the redo check that has been removed already.
> > We can delete it.
> >
> > [4] suggestion to remove local variable relreplindex in
> > RelationGetIdentityKeyBitmap
> >
> > I think we don't need relreplindex.
> > We can just pass relaton->rd_replidindex to RelationIdGetRelation().
> > There is no more reference of the variable.
> >
> > [5] suggestion to fix the place to release indexoidlist
> >
> > I thought we can do its list_free() earlier, after checking if there is
> > no indexes.
> >
> 
> Hmm, why? If there are no indexes then we wouldn't have allocated any
> memory.
> 
> > [6] suggestion to remove period in one comment.
> >
> > +   /* Quick exit if we already computed the result. */
> >
> > This comes from RelationGetIndexAttrBitmap (and my posted versions)
> > but I think we can remove the period to give better alignment shared with
> other comments in the function.
> >
> 
> Can you please update the patch except for the two points to which I
> responded above?
I've combined v5 with the above accepted comments.

Just in case, I've run make check-world again for v6, as well as
the clobber_cache_always mode test and a tight loop of 100 runs
of 010_truncate.pl.
All of them passed with no failure.


Best Regards,
Takamichi Osumi



truncate_in_synchronous_logical_replication_v06.patch
Description: truncate_in_synchronous_logical_replication_v06.patch


Re: RFE: Make statistics robust for unplanned events

2021-04-23 Thread Patrik Novotny
>
>
> Yeah, that's what I was thinking as well -- dumping snapshot at
> regular intervals, so that on crash recovery we lose a "controlled
> amount" of recent starts instead of losing *everything*.
>
> I think in most situations a fairly long interval is OK -- if you have
> tables that take so many hits that you need a really quick reaction
> from autovacuum it will probably pick that up quickly enough even
> after a reset. And if it's more the long-term tracking that's
> important, then a longer interval is probably OK.
>
> But perhaps make it configurable with a timeout and a "reasonable default"?
>
>
> > Patrik, for your use cases, would losing at most the stats since the
> > start of last checkpoint be an issue?
>
> Unless there's a particular benefit to tie it specifically to the
> checkpoint occurring, I'd rather keep it as a separate setting. They
> might both come with the same default of course, but I can certainly
> envision cases where you want to increase the checkpoint distance
> while keeping the stats interval lower for example. Many people
> increase the checkpoint timeout quite a lot...
>
>
From what I understand, I think it depends on the interval of those
checkpoints. If the interval was configurable with the mentioned reasonable
default, then it shouldn't be an issue.

If we were to choose a hard coded interval of those checkpoints based on my
case, I would have to consult the original reporter, but then it might not
suit others anyway. Therefore, making it configurable makes more sense to
me personally.

-- 
Patrik Novotný
Associate Software Engineer
Red Hat
panov...@redhat.com


Re: Support tab completion for upper character inputs in psql

2021-04-23 Thread Laurenz Albe
On Fri, 2021-04-23 at 14:44 +0900, Kyotaro Horiguchi wrote:
> > The two examples I know of offhand are in German (eszett "ß" downcases to
> > "ss") and Turkish (dotted "Í" downcases to "i", likewise dotless "I"
> 
> According to Wikipedia, "ss" is equivalent to "ß" and their upper case
> letters are "SS" and "ẞ" respectively. (I didn't even know of the
> existence of "ẞ". AFAIK there's no word begins with eszett, but it
> seems that there's a case where "ẞ" appears in a word is spelled only
> with capital letters.

This "capital sharp s" is a recent invention that has never got much
traction.  I notice that on my Fedora 32 system with glibc 2.31 and de_DE.utf8,

SELECT lower(E'\u1E9E') = E'\u00DF', upper(E'\u00DF') = E'\u1E9E';

 ?column? │ ?column? 
══════════╪══════════
 t        │ f
(1 row)

which to me as a German speaker makes no sense.

But Tom's example was the wrong way around: "ß" is a lower case letter,
and the traditional upper case translation is "SS".

But the Turkish example is correct:

> > downcases to "ı"; one of each of those pairs is an ASCII letter, the
> > other is not).  Depending on which encoding is in use, these
> 
> Upper dotless "I" and lower dotted "i" are in ASCII (or English
> alphabet?).  That's interesting.

Yes.  In languages other than Turkish, "i" is the lower case version of "I",
and both are ASCII.  Only Turkish has an "ı" (U+0131) and an "İ" (U+0130).
That causes annoyance for Turks who create a table named KADIN and find
that PostgreSQL turns it into "kadin".
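
A quick standalone check (assuming glibc with the tr_TR.UTF-8 locale
installed) shows the locale-aware mapping that such users would expect:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int
main(void)
{
	if (setlocale(LC_ALL, "tr_TR.UTF-8") == NULL)
	{
		fprintf(stderr, "tr_TR.UTF-8 locale not installed\n");
		return 1;
	}

	/* In Turkish, lower('I') is U+0131 (dotless i), not 'i'. */
	wprintf(L"lower(I) = %lc\n", (wint_t) towlower(L'I'));
	return 0;
}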

Yours,
Laurenz Albe





Re: RFE: Make statistics robust for unplanned events

2021-04-23 Thread Magnus Hagander
On Fri, Apr 23, 2021 at 12:41 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:
> > Andres mentioned at least once over in the thread about shared memory
> > stats collection that being able to have persistent stats could come
> > out of that one in the future. Whatever is done on the topic should
> > probably be done based on that work, as it provides a better starting
> > point and also one that will stay around.
>
> Yea. I think the main benefit from the shared memory stats patch that
> would make this a easier is that it tracks (with one small hole that can
> probably be addressed) dropped objects accurately, even across crashes /
> replication. Having old stats around runs into danger of mixing stats
> for an old dropped object being combined with stats for a new object.
>
> I don't think making pgstat.c fully durable by continually storing the
> data in a table or something like that is an option. For one, the stats
> for a replica and primary are independent. For another, the overhead
> would be prohibitive.
>
> But after we gain the accurate dropping of stats we can store a stats
> snapshot corresponding to certain WAL records (by serializing to
> something like pg_stats_%redo_lsn%) without ending up with dropped stats
> surviving.
>
> A big question around this is how often we'd want to write out the
> stats. Obviously, the more often we do, the higher the overhead. And the
> less frequently, the more stats updates might be lost.

Yeah, that's what I was thinking as well -- dumping snapshots at
regular intervals, so that on crash recovery we lose a "controlled
amount" of recent stats instead of losing *everything*.

I think in most situations a fairly long interval is OK -- if you have
tables that take so many hits that you need a really quick reaction
from autovacuum it will probably pick that up quickly enough even
after a reset. And if it's more the long-term tracking that's
important, then a longer interval is probably OK.

But perhaps make it configurable with a timeout and a "reasonable default"?
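
As a sketch of what I mean (everything here is invented for
illustration -- the GUC name, the write function, and where exactly
this would be hooked depend entirely on the shared-memory stats work):

static int	stats_snapshot_interval = 600;	/* seconds; imagined GUC */
static TimestampTz last_snapshot_time = 0;

static void
maybe_write_stats_snapshot(void)
{
	TimestampTz now = GetCurrentTimestamp();

	if (stats_snapshot_interval <= 0)
		return;					/* feature disabled */

	if (TimestampDifferenceExceeds(last_snapshot_time, now,
								   stats_snapshot_interval * 1000))
	{
		pgstat_write_snapshot();	/* made-up name: persist current counters */
		last_snapshot_time = now;
	}
}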


> Patrik, for your use cases, would losing at most the stats since the
> start of last checkpoint be an issue?

Unless there's a particular benefit to tie it specifically to the
checkpoint occurring, I'd rather keep it as a separate setting. They
might both come with the same default of course, but I can certainly
envision cases where you want to increase the checkpoint distance
while keeping the stats interval lower for example. Many people
increase the checkpoint timeout quite a lot...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: TRUNCATE on foreign table

2021-04-23 Thread Fujii Masao



On 2021/04/22 20:27, Bharath Rupireddy wrote:

On Thu, Apr 22, 2021 at 12:06 PM Fujii Masao
 wrote:

On 2021/04/22 9:39, Bharath Rupireddy wrote:

One comment on truncate_foreign_table_docs_v1.patch:
1) I think it is "to be truncated"
+ rels is a list of Relation
+ data structures for each foreign table to truncated.


Fixed. Thanks!


How about a slightly changed phrasing like below?
+ rels is a list of Relation
+ data structures of foreign tables to truncate.

Either works at least for me. If you think that this phrasing is
more precise or better, I'm ok with that and will update the patch again.


IMO, "rels is a list of Relation data structures of foreign tables to
truncate." looks better.


Fixed.

Thanks for reviewing the patches.
Attached are the updated versions of the patches.
These patches include the fixes pointed by Justin.





3) How about adding an extra para(after below para in
postgres_fdw.sgml) on WHY we don't push "ONLY" to foreign tables while
truncating? We could add to the same para for other options if at all
we don't choose to push them.
 DELETE, or TRUNCATE.
 (Of course, the remote user you have specified in your user mapping must
 have privileges to do these things.)


I agree to document the behavior that ONLY option is always ignored
for foreign tables. But I'm not sure if we can document WHY.
Because I could not find the past discussion about why ONLY option is
ignored on SELECT, etc... Maybe it's enough to document the behavior?


+1 to specify in the documentation about ONLY option is always
ignored.


Added.


But can we specify the WHY part within deparseTruncateSql, it
will be there for developer reference? I feel it's better if this
change goes with truncate_foreign_table_dont_pass_only_clause_v2.patch


I added this information into fdwhandler.sgml because the developers
usually read fdwhandler.sgml.


Thanks!

+
+ Note that information about ONLY options specified
+ in the original TRUNCATE command is not passed to

I think it is not "information about", no? We just don't pass ONLY
option  instead we skip it. IMO, we can say "Note that the ONLY option
specified with a foreign table in the original TRUNCATE command is
skipped and not passed to ExecForeignTruncate."


Probably I still fail to understand your point.
But if "information about" is confusing, I'm ok to
remove that. Fixed.




+ ExecForeignTruncate.  This is the same behavior as
+ for the callback functions for SELECT,
+ UPDATE and  DELETE on
+ a foreign table.

How about "This behaviour is similar to the callback functions of
SELECT, UPDATE, DELETE on a foreign table"?


Fixed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0e2cd3628c..70d89393d9 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1076,44 +1076,41 @@ ExecForeignTruncate(List *rels,
 bool restart_seqs);
 
 
- Truncate a set of foreign tables specified in rels.
- This function is called when  is executed
- on foreign tables.  rels is the list of
- Relation data structure that indicates
- a foreign table to truncate.
+ Truncate foreign tables.  This function is called when
+  is executed on a foreign table.
+ rels is a list of Relation
+ data structures of foreign tables to truncate.
 
 
 
- behavior defines how foreign tables should
- be truncated, using as possible values DROP_RESTRICT,
- which means that RESTRICT option is specified,
- and DROP_CASCADE, which means that
- CASCADE option is specified, in
- TRUNCATE command.
+ behavior is either DROP_RESTRICT
+ or DROP_CASCADE indicating that the
+ RESTRICT or CASCADE option was
+ requested in the original TRUNCATE command,
+ respectively.
 
 
 
- restart_seqs is set to true
- if RESTART IDENTITY option is specified in
- TRUNCATE command.  It is false
- if CONTINUE IDENTITY option is specified.
+ If restart_seqs is true,
+ the original TRUNCATE command requested the
+ RESTART IDENTITY behavior, otherwise the
+ CONTINUE IDENTITY behavior was requested.
 
 
 
- Note that information about ONLY options specified
+ Note that the ONLY options specified
  in the original TRUNCATE command is not passed to
- ExecForeignTruncate.  This is the same behavior as
- for the callback functions for SELECT,
+ ExecForeignTruncate.  This behavior is similar to
+ the callback functions of SELECT,
  UPDATE and DELETE on
  a foreign table.
 
 
 
- TRUNCATE invokes
- ExecForeignTruncate once per foreign server
- that foreign tables to truncate belong to.  This means that all foreign
- tables included in rels must belong to the same
- server.
+ 

Re: TRUNCATE on foreign table

2021-04-23 Thread Fujii Masao




On 2021/04/22 17:56, Justin Pryzby wrote:

On Thu, Apr 22, 2021 at 03:36:25PM +0900, Fujii Masao wrote:

diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 553524553b..69aa66e73e 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1076,27 +1076,25 @@ ExecForeignTruncate(List *rels,
  bool restart_seqs);
  
- behavior defines how foreign tables should
- be truncated, using as possible values DROP_RESTRICT,
- which means that RESTRICT option is specified,
- and DROP_CASCADE, which means that
- CASCADE option is specified, in
- TRUNCATE command.
+ behavior is either DROP_RESTRICT
+ or DROP_CASCADE, which indicates that the
+ RESTRICT or CASCADE option was
+ requested in the original TRUNCATE command,
+ respectively.


Now that I reread this, I would change "which indicates" to "indicating".


Fixed. Thanks for reviewing the patch!
I will post the updated version of the patch later.





- restart_seqs is set to true
- if RESTART IDENTITY option is specified in
- TRUNCATE command.  It is false
- if CONTINUE IDENTITY option is specified.
+ If restart_seqs is true,
+ the original TRUNCATE command requested the
+ RESTART IDENTITY option, otherwise
+ CONTINUE IDENTITY option.


should it say "specified" instead of requested ?
Or should it say "requested the RESTART IDENTITY behavior" ?

Also, I think it should say "..otherwise, the CONTINUE IDENTITY behavior was
requested".


Fixed.

 

+++ b/doc/src/sgml/ref/truncate.sgml
@@ -173,7 +173,7 @@ TRUNCATE [ TABLE ] [ ONLY ] name [
  


 TRUNCATE can be used for foreign tables if
-   the foreign data wrapper supports, for instance,
+   supported by the foreign data wrapper, for instance,
 see .


what does "for instance" mean here?  I think it should be removed.


Removed.





+++ b/doc/src/sgml/fdwhandler.sgml
@@ -,6 +1099,15 @@ ExecForeignTruncate(List *rels, List *rels_extra,
   if CONTINUE IDENTITY option is specified.
  
  
+

+ Note that information about ONLY options specified
+ in the original TRUNCATE command is not passed to
+ ExecForeignTruncate.  This is the same behavior as
+ for the callback functions for SELECT,
+ UPDATE and  DELETE on


There's an extra space before DELETE


Fixed.





diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 5320accf6f..d03731b7d4 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -69,6 +69,13 @@
have privileges to do these things.)
   
  
+ 

+  Note that ONLY option specified in


add "the" to say: "the ONLY"


Fixed.





+  SELECT, UPDATE,
+  DELETE or TRUNCATE
+  has no effect when accessing or modifyung the remote table.


modifying


Fixed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: A test for replay of regression tests

2021-04-23 Thread Thomas Munro
On Fri, Apr 23, 2021 at 6:27 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Thomas Munro 
> > I'm not quite sure where it belongs, though.  The attached initial sketch 
> > patch
>
> I think that's a good attempt.  It'd be better to confirm that the database 
> contents are identical on the primary and standby.  But... I remember when I 
> ran make installcheck with a standby connected, then ran pg_dumpall on both 
> the primary and standby and compare the two output files, about 40 lines of 
> difference were observed.  Those differences were all about the sequence 
> values.  I didn't think about whether I should/can remove the differences.

Interesting idea.  I hadn't noticed the thing with sequences before.

> +# XXX The tablespace tests don't currently work when the standby shares a
> +# filesystem with the primary due to colliding absolute paths.  We'll skip
> +# that for now.
>
> Maybe we had better have a hidden feature that creates tablespace contents in
>
> /tablespace_location/PG_..._/
>
>  is the value of cluster_name or application_name.
>
> Or, we may provide a visible feature that allows users to store tablespace 
> contents in a specified directory regardless of the LOCATION value in CREATE 
> TABLESPACE.  For instance, add a GUC like
>
> table_space_directory = '/some_dir'
>
> Then, the tablespace contents are stored in /some_dir//.  
> This may be useful if a DBaaS provider wants to offer some tablespace-based 
> feature, say tablespace transparent data encryption, but doesn't want users 
> to specify filesystem directories.

Yeah, a few similar ideas have been put forward before, for example in
this thread:

https://www.postgresql.org/message-id/flat/CALfoeisEF92F5nJ-aAcuWTvF_Aogxq_1bHLem_kVfM_tHc2mfg%40mail.gmail.com

... but also others.  I would definitely like to fix that problem too
(having worked on several things that interact with tablespaces, I
definitely want to be able to extend those tests in future patches,
and get coverage on the build farm and CI), but with --skip-tests that
could be done independently.

Apparently the invocation of make failed for some reason on CI (not
sure why, works for me), but in any case, that feels a bit clunky and
might not ever work on Windows, so perhaps we should figure out how to
invoke the pg_regress[.exe] program directly.




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-04-23 Thread Greg Nancarrow
On Wed, Apr 21, 2021 at 12:22 PM Tom Lane  wrote:
>
> "tsunakawa.ta...@fujitsu.com"  writes:
> > From: Tom Lane 
> >> No.  You'd have to be superuser anyway to do that, and we're not in the
> >> habit of trying to put training wheels on superusers.
>
> > Understood.  However, we may add the parallel safety member in 
> > fmgr_builtins[] in another thread for parallel INSERT SELECT.  I'd 
> > appreciate your comment on this if you see any concern.
>
> [ raised eyebrow... ]  I find it very hard to understand why that would
> be necessary, or even a good idea.  Not least because there's no spare
> room there; you'd have to incur a substantial enlargement of the
> array to add another flag.  But also, that would indeed lock down
> the value of the parallel-safety flag, and that seems like a fairly
> bad idea.
>

I'm curious. The FmgrBuiltin struct includes the "strict" flag, so
that would "lock down the value" of the strict flag, wouldn't it?

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [PATCH] Re: pg_identify_object_as_address() doesn't support pg_event_trigger oids

2021-04-23 Thread Joel Jacobson
On Fri, Apr 23, 2021, at 09:30, Joel Jacobson wrote:
> fix-pg_identify_object_as_address-for-event-triggers.patch

Also, since this is a problem also in v13 maybe this should also be back-ported?
I think it's a bug since both pg_identify_object_as_address() and event 
triggers exist in v13,
so the function should work there as well, otherwise users would need to do 
work-arounds for event triggers.

/Joel

[PATCH] Re: pg_identify_object_as_address() doesn't support pg_event_trigger oids

2021-04-23 Thread Joel Jacobson
On Fri, Apr 23, 2021, at 08:54, Joel Jacobson wrote:
> pg_describe_object| event trigger ddl_postgrest
> pg_identify_object| ("event trigger",,ddl_postgrest,ddl_postgrest)
> pg_identify_object_as_address | ("event trigger",{},{})
> 
> I therefore think the evtname should be added to object_names.

Could it possibly be as simple to fix as the attached patch?
Not sure if the string constructed by appendStringInfo() also needs to be 
adjusted.

With the patch, the example above returns:

pg_identify_object_as_address | ("event trigger",{ddl_postgrest},{})

/Joel

fix-pg_identify_object_as_address-for-event-triggers.patch
Description: Binary data


Re: truncating timestamps on arbitrary intervals

2021-04-23 Thread Peter Eisentraut

On 22.04.21 11:16, Justin Pryzby wrote:

It looks like we all missed that I misspelled "date_bin" as
"date_trunc"...sorry.  I will include this with my next round of doc review, in
case you don't want to make a separate commit for it.


fixed




Re: wal stats questions

2021-04-23 Thread Fujii Masao




On 2021/04/23 10:25, Andres Freund wrote:

Hi,

On 2021-04-23 09:26:17 +0900, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because there 
aren't any counters with negative numbers?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?


Why long? It's of a platform dependent size, which doesn't seem useful here?


I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".


(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsigned long if we use "long", which may be 4 bytes.

I thought WalUsageAccumDiff() seems to return the right value too. But I
researched more deeply and found that ISO/IEC 9899:1999 defines that unsigned
integers never overflow (6.2.5 Types, paragraph 9), although it doesn't define
overflow of signed integer types.

If my understanding is right, the definition guarantees that
WalUsageAccumDiff() returns the right value only if the type is an unsigned
integer. So, it's safe to change "long" to "unsigned long".


I think this should just use 64bit counters. Emitting more than 4
billion records in one transaction isn't actually particularly rare. And


Right. I agree to change the types of the counters to int64.

I think that it's better to make this change as a separate patch from
the changes for pg_stat_wal view.
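
As an aside, the wraparound argument quoted above is easy to see with a
tiny standalone example: unsigned arithmetic is defined modulo 2^N, so
the previous/current difference stays correct across an overflow, which
is not guaranteed for signed types.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* counter sampled just before it wraps ... */
	uint32_t	prev = UINT32_MAX - 5;
	/* ... and again after 10 more increments (the value has wrapped to 4) */
	uint32_t	cur = prev + 10;

	/* modulo-2^32 subtraction still yields the true number of increments */
	printf("diff = %u\n", cur - prev);	/* prints 10 */
	return 0;
}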

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-23 Thread Fujii Masao




On 2021/04/22 18:23, Julien Rouhaud wrote:

On Thu, Apr 22, 2021 at 12:28:11AM +0900, Fujii Masao wrote:


I found another small issue in pg_stat_statements docs. The following
description in the docs should be updated so that toplevel is included?


This view contains one row for each distinct database ID, user ID and query ID


Indeed!  I'm adding Magnus in Cc.

PFA a patch to fix at, and also mention that toplevel will only
contain True values if pg_stat_statements.track is set to top.


Thanks for the patch! LGTM.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: fix a comment

2021-04-23 Thread wangyu...@fujitsu.com
Hi, Amul

Thank you for reviewing.

> How about simply removing these numbering?

Agreed. Please see the v2 patch, which deletes the numbering in the comments.

Best wishes
Yukun Wang

-Original Message-
From: Amul Sul  
Sent: Friday, April 23, 2021 3:51 PM
To: Wang, Yukun/王 俞坤 
Cc: pgsql-hack...@postgresql.org
Subject: Re: fix a comment

On Fri, Apr 23, 2021 at 12:12 PM wangyu...@fujitsu.com  
wrote:
>
> Hi, Hackers:
>
> In function ExecGetTriggerResultRel, we can see comments:
>
> > /* First, search through the query result relations */ ...
> > /*
> > * Third, search through the result relations that were created 
> > during
> > * tuple routing, if any.
> > */
>
> But the 'Second' was deleted since commit 1375422c78.
>
> Update the 'Third' to 'Second', please see the attachment.
>
> Thoughts?
>

Well yes, looks good to me.

How about simply removing these numbering?

Regards,
Amul


fix-a-comment.diff
Description: fix-a-comment.diff


Re: pg_identify_object_as_address() doesn't support pg_event_trigger oids

2021-04-23 Thread Joel Jacobson
On Thu, Apr 22, 2021, at 19:32, Alvaro Herrera wrote:
> On 2021-Apr-22, Joel Jacobson wrote:
> > SELECT * FROM 
> > pg_identify_object_as_address('pg_event_trigger'::regclass,289361636,0);
> > ERROR:  requested object address for unsupported object class 32: text 
> > result "ddl_postgrest"
> 
> Hmm, I think this is an accidental omission and it should be supported.

Oh, I realise now the error came from a server running v13,
but there seems to be a problem in HEAD as well;
the "object_names" text[] output is empty for event triggers,
so the output will be the same for all event triggers,
which doesn't seem right since the output should be unique.

The output from the other functions pg_describe_object() and 
pg_identify_object()
contain the name in the output though.

Example:

SELECT
  *,
  pg_describe_object('pg_event_trigger'::regclass,oid,0),
  pg_identify_object('pg_event_trigger'::regclass,oid,0),
  pg_identify_object_as_address('pg_event_trigger'::regclass,oid,0)
FROM pg_event_trigger;
-[ RECORD 1 ]-+---
oid   | 396715
evtname   | ddl_postgrest
evtevent  | ddl_command_end
evtowner  | 10
evtfoid   | 396714
evtenabled| O
evttags   |
pg_describe_object| event trigger ddl_postgrest
pg_identify_object| ("event trigger",,ddl_postgrest,ddl_postgrest)
pg_identify_object_as_address | ("event trigger",{},{})

I therefore think the evtname should be added to object_names.

/Joel

Re: Stale description for pg_basebackup

2021-04-23 Thread Fujii Masao




On 2021/04/22 13:25, Kyotaro Horiguchi wrote:

At Thu, 22 Apr 2021 13:06:50 +0900, Fujii Masao  
wrote in

Either works for me. I didn't add "()" because I just used the same
description
as that in func.sgml.

 it may be useful to run pg_switch_wal on the
 primary in order to trigger an immediate segment switch.)

..

IMO "in order to..." part is helpful for us to understand why
pg_switch_wal
is useful in this case. So I'd like to leave it.


Ok, I'm fine with both of them. Thanks for the explanation.


Pushed. Thanks!

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: fix a comment

2021-04-23 Thread Amul Sul
On Fri, Apr 23, 2021 at 12:12 PM wangyu...@fujitsu.com
 wrote:
>
> Hi, Hackers:
>
> In function ExecGetTriggerResultRel, we can see comments:
>
> > /* First, search through the query result relations */ ...
> > /*
> > * Third, search through the result relations that were created during
> > * tuple routing, if any.
> > */
>
> But the 'Second' was deleted since commit 1375422c78.
>
> Update the 'Third' to 'Second', please see the attachment.
>
> Thoughts?
>

Well yes, looks good to me.

How about simply removing these numbering?

Regards,
Amul




Re: Truncate in synchronous logical replication failed

2021-04-23 Thread Amit Kapila
On Fri, Apr 23, 2021 at 12:04 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, April 22, 2021 6:33 PM Amit Kapila  
> wrote:
> > On Tue, Apr 20, 2021 at 9:00 AM osumi.takami...@fujitsu.com
> >  wrote:
> > >
> > > On  Tuesday, April 20, 2021 10:53 AM Ajin Cherian 
> > wrote:
> > > >
> > > > I reviewed the patch, ran make check, no issues. One minor comment:
> > > >
> > > > Could you add the comment similar to RelationGetIndexAttrBitmap() on
> > > > why the redo, it's not very obvious to someone reading the code, why
> > > > we are refetching the index list here.
> > > >
> > > > + /* Check if we need to redo */
> > > >
> > > > + newindexoidlist = RelationGetIndexList(relation);
> > > Yeah, makes sense. Fixed.
> >
> > I don't think here we need to restart to get a stable list of indexes as we 
> > do in
> > RelationGetIndexAttrBitmap. The reason is here we build the cache entry
> > using a historic snapshot and all the later changes are absorbed while
> > decoding WAL.
> I rechecked this and I agree.
> I don't see any problem with removing the redo check.
> Based on this direction, I'll share several minor comments.
>
> [1] a typo of RelationGetIdentityKeyBitmap's comment
>
> + * This is a special purpose function used during logical replication. Here,
> + * unlike RelationGetIndexAttrBitmap(), we don't a acquire lock on the 
> required
>
> We have "a" in an unnatural place. It should be "we don't acquire...".
>
> [2] suggestion to fix RelationGetIdentityKeyBitmap's comment
>
> + * later changes are absorbed while decoding WAL. Due to this reason, we 
> don't
> + * need to retry here in case of a change in the set of indexes.
>
> I think it's better to use "Because of" instead of "Due to".
> This patch works to solve the deadlock.
>

I am not sure which one is better. I would like to keep it as it is
unless you feel strongly about point 2.

> [3] wrong comment in RelationGetIdentityKeyBitmap
>
> +   /* Save some values to compare after building attributes */
>
> I wrote this comment for the redo check
> that has been removed already. We can delete it.
>
> [4] suggestion to remove local variable relreplindex in 
> RelationGetIdentityKeyBitmap
>
> I think we don't need relreplindex.
> We can just pass relation->rd_replidindex to RelationIdGetRelation().
> There are no other references to the variable.
>
> [5] suggestion to fix the place to release indexoidlist
>
> I thought we could do its list_free() earlier,
> after checking whether there are any indexes.
>

Hmm, why? If there are no indexes then we wouldn't have allocated any memory.

> [6] suggestion to remove period in one comment.
>
> +   /* Quick exit if we already computed the result. */
>
> This comes from RelationGetIndexAttrBitmap (and my posted versions)
> but I think we can remove the period for consistency with the other
> comments in the function.
>

Can you please update the patch except for the two points to which I
responded above?

-- 
With Regards,
Amit Kapila.




fix a comment

2021-04-23 Thread wangyu...@fujitsu.com
Hi, Hackers:

In function ExecGetTriggerResultRel, we can see comments:

> /* First, search through the query result relations */ ...
> /*
> * Third, search through the result relations that were created during
> * tuple routing, if any.
> */

But the 'Second' one was removed by commit 1375422c78.

Update the 'Third' to 'Second', please see the attachment.

Thoughts?

Best wishes
Yukun Wang



fix-a-typo.diff
Description: fix-a-typo.diff


RE: Truncate in synchronous logical replication failed

2021-04-23 Thread osumi.takami...@fujitsu.com
On Thursday, April 22, 2021 6:33 PM Amit Kapila  wrote:
> On Tue, Apr 20, 2021 at 9:00 AM osumi.takami...@fujitsu.com
>  wrote:
> >
> > On  Tuesday, April 20, 2021 10:53 AM Ajin Cherian 
> wrote:
> > >
> > > I reviewed the patch, ran make check, no issues. One minor comment:
> > >
> > > Could you add the comment similar to RelationGetIndexAttrBitmap() on
> > > why the redo, it's not very obvious to someone reading the code, why
> > > we are refetching the index list here.
> > >
> > > + /* Check if we need to redo */
> > >
> > > + newindexoidlist = RelationGetIndexList(relation);
> > Yeah, makes sense. Fixed.
> 
> I don't think here we need to restart to get a stable list of indexes as we 
> do in
> RelationGetIndexAttrBitmap. The reason is here we build the cache entry
> using a historic snapshot and all the later changes are absorbed while
> decoding WAL.
I rechecked this and I agree.
I don't see any problem with removing the redo check.
Based on this direction, I'll share several minor comments.

[1] a typo of RelationGetIdentityKeyBitmap's comment

+ * This is a special purpose function used during logical replication. Here,
+ * unlike RelationGetIndexAttrBitmap(), we don't a acquire lock on the required

We have "a" in an unnatural place. It should be "we don't acquire...".

[2] suggestion to fix RelationGetIdentityKeyBitmap's comment

+ * later changes are absorbed while decoding WAL. Due to this reason, we don't
+ * need to retry here in case of a change in the set of indexes.

I think it's better to use "Because of" instead of "Due to".
This patch works to solve the deadlock.

[3] wrong comment in RelationGetIdentityKeyBitmap

+   /* Save some values to compare after building attributes */

I wrote this comment for the redo check
that has been removed already. We can delete it.

[4] suggestion to remove local variable relreplindex in 
RelationGetIdentityKeyBitmap

I think we don't need relreplindex.
We can just pass relation->rd_replidindex to RelationIdGetRelation().
There are no other references to the variable.

[5] suggestion to fix the place to release indexoidlist

I thought we could do its list_free() earlier,
after checking whether there are any indexes.

[6] suggestion to remove period in one comment.

+   /* Quick exit if we already computed the result. */

This comes from RelationGetIndexAttrBitmap (and my posted versions)
but I think we can remove the period for consistency with the other
comments in the function.

> I have updated that and modified few comments in the
> attached patch. Can you please test this in clobber_cache_always mode? I
> think just testing subscription/t/010_truncate.pl would be sufficient.
I did it. It didn't fail. No problem.

> Now, this bug exists in prior versions (>= v11) where we have introduced
> decoding of Truncate. Do we want to backpatch this? As no user has reported
> this till now apart from Tang who I think got it while doing some other patch
> testing, we might want to just push it in HEAD. I am fine either way. Petr,
> others, do you have any opinion on this matter?
I think we are fine with applying this patch to the HEAD only,
since nobody has complained about the hang issue.
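
For the archives, a rough sketch of how the hang shows up (all object names
are purely illustrative, and this assumes the subscription is listed in
synchronous_standby_names on the publisher):

-- publisher
CREATE TABLE t (id int PRIMARY KEY);
CREATE PUBLICATION pub FOR TABLE t;
-- subscriber
CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub;
-- publisher, after setting synchronous_standby_names = 'sub'
TRUNCATE t;
-- The backend waits for the synchronous subscriber while still holding
-- AccessExclusiveLock on t and its index; the walsender decoding the
-- TRUNCATE then blocks trying to lock that same index, so neither side
-- can make progress.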


Best Regards,
Takamichi Osumi



RE: A test for replay of regression tests

2021-04-23 Thread tsunakawa.ta...@fujitsu.com
From: Thomas Munro 
> We have automated tests for many specific replication and recovery scenarios,
> but nothing that tests replay of a wide range of records.
> People working on recovery code often use installcheck (presumably along
> with other custom tests) to exercise it, sometimes with
> wal_consistency_check enabled.  So, why don't we automate that?  Aside
> from exercising the WAL decoding machinery (which brought me here), that'd
> hopefully provide some decent improvements in coverage of the various redo
> routines, many of which are not currently exercised at all.
> 
> I'm not quite sure where it belongs, though.  The attached initial sketch 
> patch

I think that's a good attempt.  It'd be better to confirm that the database
contents are identical on the primary and standby.  But... I remember that when
I ran make installcheck with a standby connected, then ran pg_dumpall on both
the primary and standby and compared the two output files, about 40 lines of
difference were observed.  Those differences were all about the sequence
values.  I didn't think about whether I should or could remove the differences.
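
If it helps explain those diffs: IIUC, nextval() WAL-logs values ahead of the
ones actually handed out, so the standby can legitimately report a sequence
state ahead of the primary's.  A minimal way to see it (the sequence name is
just an example):

-- run on both the primary and the standby after some nextval() calls
SELECT last_value, is_called FROM some_seq;
-- the standby's last_value can be up to SEQ_LOG_VALS (32) ahead, because
-- that many future values are logged in one WAL record

pg_dumpall emits setval() based on that state, which would account for the
sequence-only differences.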


+# XXX The tablespace tests don't currently work when the standby shares a
+# filesystem with the primary due to colliding absolute paths.  We'll skip
+# that for now.

Maybe we had better have a hidden feature that creates tablespace contents in

/tablespace_location/PG_..._<cluster_name>/

where <cluster_name> is the value of cluster_name or application_name.

Or, we may provide a visible feature that allows users to store tablespace 
contents in a specified directory regardless of the LOCATION value in CREATE 
TABLESPACE.  For instance, add a GUC like

table_space_directory = '/some_dir'

Then, the tablespace contents are stored in /some_dir/<cluster_name>/<tablespace_name>/.  This 
may be useful if a DBaaS provider wants to offer some tablespace-based feature, 
say tablespace transparent data encryption, but doesn't want users to specify 
filesystem directories.
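
Purely to illustrate the idea (this GUC does not exist today, and the on-disk
layout below is only a guess):

-- with table_space_directory = '/some_dir' set in postgresql.conf,
-- the LOCATION clause would only name the tablespace logically:
CREATE TABLESPACE ts1 LOCATION '/whatever/the/user/asked/for';
-- while the contents would physically live under something like
-- /some_dir/<cluster_name>/ts1/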


Regards
Takayuki Tsunakawa