Re: Handle infinite recursion in logical replication setup

2022-04-26 Thread Peter Smith
Here are my review comments for v10-0001 and v10-0002.

(I did not yet look at the v10-0002 TAP tests. I will check those
separately later).

=
v10-0001 comments
=

1.1 src/include/catalog/pg_subscription.h

@@ -70,6 +70,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
BKI_SHARED_RELATION BKI_ROW

  bool substream; /* Stream in-progress transactions. */

+ bool sublocalonly; /* skip copying of remote origin data */
+
  char subtwophasestate; /* Stream two-phase transactions */

  bool subdisableonerr; /* True if a worker error should cause the

1.1.a Comment should start uppercase.

1.1.b Perhaps it is better to say "Skip replication of" instead of
"Skip copying" ?

~~~

1.2 src/include/catalog/pg_subscription.h

@@ -110,6 +112,7 @@ typedef struct Subscription
  bool binary; /* Indicates if the subscription wants data in
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
+ bool local_only; /* Skip copying of remote origin data */
  char twophasestate; /* Allow streaming two-phase transactions */

Same comment as #1.1

=
v10-0002 comments
=

2.1 Commit message

b) user wil have to create another subscription subscribing from
Node2 to Node3 using local_only option and copy_data as true.

Typo: "wil"

~~~

2.2 Commit message

Here when user has specified local_only 'on' which indicates that the
publisher should only replicate the changes that are generated locally, but in
this case since the publisher node is also subscribing data from other nodes,
the publisher node can have remotely originated data, so throw an error in this
case to prevent remotely generated data being replicated to the subscriber. If
user still intends to continue with the operation user can specify copy_data
as 'force' and proceed.

SUGGESTED (minor rewording)
In the scenario above the user has specified local_only 'on' (which
indicates that the publisher should only replicate the changes that
are generated locally), but in this case, the publisher node is also
subscribing data from other nodes, so the publisher node may have
remotely originated data. We throw an error, in this case, to draw
attention to there being possible remote data. If the user still
wishes to continue with the operation, the user can specify copy_data as
'force' and proceed.

~~~

2.3 doc/src/sgml/logical-replication.sgml

+   
+ Bidirectional replication is useful in creating multi master database
+ which helps in performing read/write operations from any of the nodes.
+ Setting up bidirectional logical replication between two nodes requires
+ creation of the publication in all the nodes, creating subscription in
+ each of the nodes that subcribes to data from all the nodes. The steps
+ to create two node bidirectional replication is listed below:
+   

2.3.a Typo: "subcribes"

2.3.b Wording: "creating subscription” -> "and creating subscriptions..."

2.3.c Wording: "The steps to create two node bidirectional replication
is listed below:" -> "The steps to create a two-node bidirectional
replication are given below:"

~~~

2.4 doc/src/sgml/logical-replication.sgml

+
+node2=# CREATE SUBSCRIPTION sub_node1_node2
+node2=# CONNECTION 'dbname=foo host=node1 user=repuser'
+node2=# PUBLICATION pub_node1
+node2=# WITH (copy_data = off, local_only = on);
+CREATE SUBSCRIPTION
+

I am not sure if those psql continuation prompts are right. Shouldn't
they be "node2-#" instead of "node2=#"?

~~~

2.5 doc/src/sgml/logical-replication.sgml

+
+node2=# CREATE SUBSCRIPTION sub_node1_node2
+node2=# CONNECTION 'dbname=foo host=node1 user=repuser'
+node2=# PUBLICATION pub_node1
+node2=# WITH (copy_data = off, local_only = on);
+CREATE SUBSCRIPTION
+
+   

IIUC the closing  should be on the same line as the
. I recall there was some recent github push about
this sometime in the last month - maybe you can check to confirm it.

~~~

2.6 doc/src/sgml/logical-replication.sgml

+   
+ Create the subscription in node2 to subscribe the changes from node1:
+
+node2=# CREATE SUBSCRIPTION sub_node1_node2

IMO the naming convention here is backwards/confusing. E.g. The
"pub_node1" is the publisher at node1. So similarly, I think the
subscriber at node2 should be called "sub_node2_node1" (not
"sub_node1_node2")

~~~

2.7 doc/src/sgml/logical-replication.sgml

+   
+ Adding a new node node3 to the existing node1 and node2 requires setting
+ up subscription in node1 and node2 to replicate the data from node3 and
+ setting up subscription in node3 to replicate data from node1 and node2.
+ The steps for the same is listed below:
+   

There are several sections that say "The steps for the same is listed
below:". The sentence for all of them seemed redundant to me.

~~~

2.8 doc/src/sgml/logical-replication.sgml

+
+ Adding a new node node3 to the existing node1 and node2 when data is
+ present in existing nodes node1 and node2 needs similar 

Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote:
>> Do you have interest in adding a test like one in my patch?

> I have studied the test case you are proposing, and I am afraid that
> it is too expensive as designed.

That was my feeling too.  It's certainly a useful test for verifying
that we fixed the problem, but that doesn't mean that it's worth the
cycles to add it as a permanent fixture in check-world, even if we
could get rid of the timing assumptions.

regards, tom lane




Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Michael Paquier
On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote:
> Do you have interest in adding a test like one in my patch?

I have studied the test case you are proposing, and I am afraid that
it is too expensive as designed.  And it is actually racy as you
expect the restart point to take longer than the promotion with a
timing based on an arbitrary (and large!) amount of data inserted into
the primary.  Well, the promotion should be shorter than the restart 
point in any case, but such tests should be designed so that they
work reliably on slow machines while still completing quickly on
fast machines.

It would be much better if the test were designed so that the restart
point is stopped at an arbitrary step rather than throttled, moving on
once the promotion of the standby is done.  A well-known method, which
would not work on Windows, is to rely on SIGSTOP, which could be sent
to the checkpointer for such things.  Anyway, we don't have any means
to reliably stop a restart point while in the middle of its
processing, do we?
--
Michael


signature.asc
Description: PGP signature


Re: Fwd: range of composite types!

2022-04-26 Thread Tom Lane
Jian He  writes:
>> for that means the following sql queries should return *false*:

>> select mytyperange (
>> (1,'2022-01-01')::mytype,
>> (8, '2022-01-31')::mytype, '[]') @> (2, '2020-01-19')::mytype;

Why should that return false?  The comparison rules for composites
say that you compare the first column, only if that's equal
compare the second, etc.  Here, "2" is between "1" and "8" so
the contents of the second column don't matter.

regards, tom lane




Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Michael Paquier
On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote:
> On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
>> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart 
>>  wrote in 
>>> I suspect we'll start seeing this problem more often once end-of-recovery
>>> checkpoints are removed [0].  Would you mind creating a commitfest entry
>>> for this thread?  I didn't see one.
>> 
>> I'm not sure the patch makes any change here, because restart points
>> don't run during crash recovery, since no checkpoint records are seen
>> during crash recovery.  Anyway, the patch doesn't apply anymore, so I
>> have rebased it, but only the one for master, for lack of time for now.
> 
> Thanks for the new patch!  Yeah, it wouldn't affect crash recovery, but
> IIUC Robert's patch also applies to archive recovery.
> 
>> +/* Update pg_control */
>> +LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>> +
>>  /*
>>   * Remember the prior checkpoint's redo ptr for
>>   * UpdateCheckPointDistanceEstimate()
>>   */
>>  PriorRedoPtr = ControlFile->checkPointCopy.redo;
> 
> nitpick: Why move the LWLockAcquire() all the way up here?

Yeah, that should not be necessary.  InitWalRecovery() is the only
place outside the checkpointer that would touch this field, but that
happens far too early in the startup sequence to matter with the
checkpointer.

>> +Assert (PriorRedoPtr < RedoRecPtr);
> 
> I think this could use a short explanation.

That's just to make sure that the prior checkpoint's redo LSN is
always older than the current one.  It does not seem really necessary
to me to add that.

>> +/*
>> + * Aarchive recovery has ended. Crash recovery ever after should
>> + * always recover to the end of WAL
>> + */

s/Aarchive/Archive/.

>> +ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
>> +ControlFile->minRecoveryPointTLI = 0;
>> +
>> +/* also update local copy */
>> +LocalMinRecoveryPoint = InvalidXLogRecPtr;
>> +LocalMinRecoveryPointTLI = 0;
> 
> Should this be handled by the code that changes the control file state to
> DB_IN_PRODUCTION instead?  It looks like this is ordinarily done in the
> next checkpoint.  It's not clear to me why it is done this way.

Anyway, that would be the work of the end-of-recovery checkpoint
requested at the end of StartupXLOG() once a promotion happens or of
the checkpoint requested by PerformRecoveryXLogAction() in the second
case, no?  So, I don't quite see why we need to update
minRecoveryPoint and minRecoveryPointTLI in the control file here, as
much as this does not have to be part of the end-of-recovery code
that switches the control file to DB_IN_PRODUCTION.

-   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-   ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
-   {
7ff23c6 has removed the last call to CreateCheckpoint() outside the
checkpointer, meaning that there is one less concurrent race to worry
about, but I have to admit that this change, updating the control
file's checkPoint and checkPointCopy even without checking that
ControlFile->checkPointCopy.redo < lastCheckPoint.redo, would make the
code less robust in ~14.  So I am questioning whether a backpatch
is actually worth the risk here.
--
Michael




Fwd: range of composite types!

2022-04-26 Thread Jian He
Hello.
Just wondering if this is possible or not...

-- Forwarded message -
From: Jian He 
Date: Tue, Apr 26, 2022 at 2:46 PM
Subject: range of composite types!
To: pgsql-general 

range of composite types. I found this would be a great idea!!!
Question on stackoverflow

DB Fiddle


 source code regress test

ranges of composite types code part:

>  504 --
>  505 -- Ranges of composites
>  506 --
>  507
>  508 create type two_ints as (a int, b int);
>  509 create type two_ints_range as range (subtype = two_ints);
>  510
>  511 -- with force_parallel_mode on, this exercises tqueue.c's range remapping
>  512 select *, row_to_json(upper(t)) as u from
>  513   (values (two_ints_range(row(1,2), row(3,4))),
>  514   (two_ints_range(row(5,6), row(7,8)))) v(t);
>

-- composite type range.
> create type mytype as (t1 int, t2 date);
> -- create type my_interval as (t1 int, t2 interval);
> select (2,'2022-01-02')::mytype ;
> create type mytyperange as range(subtype = mytype);
>

I am thinking of constructing a composite type range that would be equivalent to:

> select a, b::date from generate_series(1,8) a,
> generate_series('2022-01-01'::timestamp,
> '2022-01-31'::timestamp, interval '1 day') b;
>
> for that means the following sql queries should return *false*:

select mytyperange (
> (1,'2022-01-01')::mytype,
> (8, '2022-01-31')::mytype, '[]') @> (2, '2020-01-19')::mytype;
>


>  select
> (2, '2020-01-19')::mytype <@
> mytyperange(
> (1,'2022-01-01')::mytype,
> (8, '2022-01-31')::mytype, '[]') ;
>


> --does the range overlaps, that is, have any common element.
> select
> mytyperange ((2,'2020-12-30')::mytype,
> (2, '2020-12-31')::mytype)
> &&
> mytyperange(
> (1,'2022-01-01')::mytype,
> (8, '2022-01-31')::mytype) ;
>

from the db fiddle link, so far I failed.
If this is possible then we may need a *subtype_diff* function and a
*canonical* function.


Re: bogus: logical replication rows/cols combinations

2022-04-26 Thread Amit Kapila
On Tue, Apr 26, 2022 at 4:00 AM Tomas Vondra
 wrote:
>
> On 4/25/22 17:48, Alvaro Herrera wrote:
>
> > The desired result on subscriber is:
> >
> > table uno;
> >  a  │ b │ c
> > ────┼───┼───
> >   1 │ 2 │
> >  -1 │   │ 4
> >
> >
> > Thoughts?
> >
>
> I'm not quite sure which of the two behaviors is more "desirable". In a
> way, it's somewhat similar to publish_as_relid, which is also calculated
> not considering which of the row filters match?
>

Right, or in other words, we check all publications to decide it and
similar is the case for publication actions which are also computed
independently for all publications.

> But maybe you're right and it should behave the way you propose ... the
> example I have in mind is a use case replicating table with two types of
> rows - sensitive and non-sensitive. For sensitive, we replicate only
> some of the columns, for non-sensitive we replicate everything. Which
> could be implemented as two publications
>
> create publication sensitive_rows
>for table t (a, b) where (is_sensitive);
>
> create publication non_sensitive_rows
>for table t where (not is_sensitive);
>
> But the way it's implemented now, we'll always replicate all columns,
> because the second publication has no column list.
>
> Changing this to behave the way you expect would be quite difficult,
> because at the moment we build a single OR expression from all the row
> filters. We'd have to keep the individual expressions, so that we can
> build a column list for each of them (in order to ignore those that
> don't match).
>
> We'd have to remove various other optimizations - for example we can't
> just discard row filters if we found "no_filter" publication.
>

I don't think that is the right way. We need some way to combine
expressions and I feel the current behavior is sane. I mean to say
that even if there is one publication that has no filter (column/row),
we should publish all rows with all columns. Now, as mentioned above
combining row filters or column lists for all publications appears to
be consistent with what we already do and seems correct behavior to
me.

To me, it appears that the method used to decide whether a particular
table is published or not is also similar to what we do for row
filters or column lists. Even if there is one publication that
publishes all tables, we consider the current table to be published
irrespective of whether other publications have published that table
or not.

> Or more
> precisely, we'd have to consider column lists too.
>
> In other words, we'd have to merge pgoutput_column_list_init into
> pgoutput_row_filter_init, and then modify pgoutput_row_filter to
> evaluate the row filters one by one, and build the column list.
>

Hmm, I think even if we want to do something here, we also need to
think about how to achieve similar behavior for initial tablesync
which will be more tricky.

> I can take a stab at it, but it seems strange to not apply the same
> logic to evaluation of publish_as_relid.
>

Yeah, the current behavior seems to be consistent with what we already do.

> I wonder what Amit thinks about
> this, as he wrote the row filter stuff.
>

I feel we can explain a bit more about this in docs. We already have
some explanation of how row filters are combined [1]. We can probably
add a few examples for column lists.

[1] - 
https://www.postgresql.org/docs/devel/logical-replication-row-filter.html#LOGICAL-REPLICATION-ROW-FILTER-COMBINING

-- 
With Regards,
Amit Kapila.




Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Rui Zhao
Kyotaro's patch seems good to me and fixes the test case in my patch.
Do you have interest in adding a test like one in my patch?


+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
 	/*
 	 * Remember the prior checkpoint's redo ptr for
 	 * UpdateCheckPointDistanceEstimate()
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
+	Assert (PriorRedoPtr < RedoRecPtr);

Maybe PriorRedoPtr does not need to be under LWLockAcquire?

regards.
--
Zhao Rui
Alibaba Cloud: https://www.aliyun.com/

--Original--
From: "Kyotaro Horiguchi"

https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

[2] 
https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com

[3] 
https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

0001-Test-of-this-problem-v14.patch
Description: Binary data


Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Bharath Rupireddy
On Wed, Apr 27, 2022 at 8:45 AM Tom Lane  wrote:
>
> I wrote:
> > Thomas Munro  writes:
> >> BTW If you had your local change from debug.patch (upthread), that'd
> >> defeat the patch.  I mean this:
>
> >> +if(!*errormsg)
> >> +  *errormsg = "decode_queue_head is null";
>
> > Oh!  Okay, I'll retry without that.
>
> I've now done several runs with your patch and not seen the test failure.
> However, I think we ought to rethink this API a bit rather than just
> apply the patch as-is.  Even if it were documented, relying on
> errormsg = NULL to mean something doesn't seem like a great plan.

Sorry for being late in the game, occupied with other stuff.

How about using private_data of XLogReaderState for
read_local_xlog_page_no_wait, something like this?

typedef struct ReadLocalXLogPageNoWaitPrivate
{
bool end_of_wal;
} ReadLocalXLogPageNoWaitPrivate;

In read_local_xlog_page_no_wait:

    /* If asked, let's not wait for future WAL. */
    if (!wait_for_wal)
    {
        private_data->end_of_wal = true;
        break;
    }

/*
 * Opaque data for callbacks to use.  Not used by XLogReader.
 */
void   *private_data;

Regards,
Bharath Rupireddy.




Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Nathan Bossart
On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart  
> wrote in 
>> I suspect we'll start seeing this problem more often once end-of-recovery
>> checkpoints are removed [0].  Would you mind creating a commitfest entry
>> for this thread?  I didn't see one.
> 
> I'm not sure the patch makes any change here, because restart points
> don't run during crash recovery, since no checkpoint records are seen
> during crash recovery.  Anyway, the patch doesn't apply anymore, so I
> have rebased it, but only the one for master, for lack of time for now.

Thanks for the new patch!  Yeah, it wouldn't affect crash recovery, but
IIUC Robert's patch also applies to archive recovery.

> + /* Update pg_control */
> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
>   /*
>* Remember the prior checkpoint's redo ptr for
>* UpdateCheckPointDistanceEstimate()
>*/
>   PriorRedoPtr = ControlFile->checkPointCopy.redo;

nitpick: Why move the LWLockAcquire() all the way up here?

> + Assert (PriorRedoPtr < RedoRecPtr);

I think this could use a short explanation.

> +  * Ensure minRecoveryPoint is past the checkpoint record while archive
> +  * recovery is still ongoing.  Normally, this will have happened already
> +  * while writing out dirty buffers, but not necessarily - e.g. because 
> no
> +  * buffers were dirtied.  We do this because a non-exclusive base backup
> +  * uses minRecoveryPoint to determine which WAL files must be included 
> in
> +  * the backup, and the file (or files) containing the checkpoint record
> +  * must be included, at a minimum. Note that for an ordinary restart of
> +  * recovery there's no value in having the minimum recovery point any
> +  * earlier than this anyway, because redo will begin just after the
> +  * checkpoint record.

nitpick: Since exclusive backup mode is now removed, we don't need to
specify that the base backup is non-exclusive.

> + /*
> +  * Aarchive recovery has ended. Crash recovery ever after should
> +  * always recover to the end of WAL
> +  */
> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> + ControlFile->minRecoveryPointTLI = 0;
> +
> + /* also update local copy */
> + LocalMinRecoveryPoint = InvalidXLogRecPtr;
> + LocalMinRecoveryPointTLI = 0;

Should this be handled by the code that changes the control file state to
DB_IN_PRODUCTION instead?  It looks like this is ordinarily done in the
next checkpoint.  It's not clear to me why it is done this way.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> BTW If you had your local change from debug.patch (upthread), that'd
>> defeat the patch.  I mean this:

>> +if(!*errormsg)
>> +  *errormsg = "decode_queue_head is null";

> Oh!  Okay, I'll retry without that.

I've now done several runs with your patch and not seen the test failure.
However, I think we ought to rethink this API a bit rather than just
apply the patch as-is.  Even if it were documented, relying on
errormsg = NULL to mean something doesn't seem like a great plan.

regards, tom lane




Re: Fix primary crash continually with invalid checkpoint after promote

2022-04-26 Thread Michael Paquier
On Wed, Apr 27, 2022 at 11:24:11AM +0900, Kyotaro Horiguchi wrote:
> Zhao Rui's proposal is retention of WAL files according to (the wrong
> content of) the control file.
> 
> Aside from the fact that it may let slots be invalidated earlier, it is
> not great that an actually performed restartpoint is forgotten, which
> may cause the next crash recovery to start from an already performed
> checkpoint.

Yeah, I was analyzing this problem and took a look at what's proposed
here, and I agree that what is proposed on this thread would just do
some unnecessary work if we find ourselves in a situation where we
need to replay from a point earlier than necessary, aka the
checkpoint that should have been already finished.
--
Michael




Re: Fix primary crash continually with invalid checkpoint after promote

2022-04-26 Thread Kyotaro Horiguchi
At Tue, 26 Apr 2022 15:47:13 -0400, Tom Lane  wrote in 
> "=?ISO-8859-1?B?WmhhbyBSdWk=?=" <875941...@qq.com> writes:
> > Newly promoted primary may leave an invalid checkpoint.
> > In function CreateRestartPoint, control file is updated and old wals are 
> > removed. But in some situations, control file is not updated, old wals are 
> > still removed. Thus produces an invalid checkpoint with nonexistent wal. 
> > Crucial log: "invalid primary checkpoint record", "could not locate a valid 
> > checkpoint record".
> 
> I believe this is the same issue being discussed here:
> 
> https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com
> 
> but Horiguchi-san's proposed fix looks quite different from yours.

The root cause is that CreateRestartPoint omits to update the last
checkpoint in the control file if archive recovery exits at an
unfortunate timing.  So my proposal fixes the root cause.

Zhao Rui's proposal is retention of WAL files according to (the wrong
content of) the control file.

Aside from the fact that it may let slots be invalidated earlier, it is
not great that an actually performed restartpoint is forgotten, which
may cause the next crash recovery to start from an already performed
checkpoint.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Tom Lane
Thomas Munro  writes:
> BTW If you had your local change from debug.patch (upthread), that'd
> defeat the patch.  I mean this:

> +if(!*errormsg)
> +  *errormsg = "decode_queue_head is null";

Oh!  Okay, I'll retry without that.

regards, tom lane




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Tom Lane
Thomas Munro  writes:
> Ok, that's the same for me.  Next question: why does the patch I
> posted not help?

I improved the instrumentation a bit, and it looks like what is
happening is that loc > read_upto, causing that code to "break"
independently of wait_for_wal.  success.log is from "make installcheck"
immediately after initdb; fail.log is from "make check".

regards, tom lane

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 960530e..85dcb3a 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -15,7 +15,6 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf
 
 # Disabled because these tests require "wal_level=replica", which
 # some installcheck users do not have (e.g. buildfarm clients).
-NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23..33d2c73 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -373,6 +373,9 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 		 */
 		Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));
 
+		if(!*errormsg)
+		  *errormsg = "decode_queue_head is null";
+		
 		return NULL;
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4257026..5ea59a3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -951,6 +951,9 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
+		  elog(LOG, "considering wait for wal: loc %X/%X read_upto %X/%X wait_for_wal %d",
+		   LSN_FORMAT_ARGS(loc), LSN_FORMAT_ARGS(read_upto),
+		   wait_for_wal);
 
 			if (loc <= read_upto)
 break;
2022-04-27 04:06:21.995 CEST [4503] LOG:  starting PostgreSQL 15devel on 
mipsel-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10+deb8u1) 4.9.2, 32-bit
2022-04-27 04:06:21.997 CEST [4503] LOG:  listening on IPv6 address "::1", port 
5440
2022-04-27 04:06:21.997 CEST [4503] LOG:  listening on IPv4 address 
"127.0.0.1", port 5440
2022-04-27 04:06:22.000 CEST [4503] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5440"
2022-04-27 04:06:22.021 CEST [4523] LOG:  database system was shut down at 
2022-04-27 04:06:11 CEST
2022-04-27 04:06:22.048 CEST [4503] LOG:  database system is ready to accept 
connections
2022-04-27 04:06:30.619 CEST [4738] ERROR:  WAL start LSN must be less than end 
LSN
2022-04-27 04:06:30.619 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info('0/1903EB0', '0/1903E08');
2022-04-27 04:06:30.621 CEST [4738] ERROR:  WAL start LSN must be less than end 
LSN
2022-04-27 04:06:30.621 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_stats('0/1903EB0', '0/1903E08');
2022-04-27 04:06:30.623 CEST [4738] LOG:  considering wait for wal: loc 
0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.623 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_record_info('0/1903E08');
2022-04-27 04:06:30.623 CEST [4738] LOG:  considering wait for wal: loc 
0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.623 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_record_info('0/1903E08');
2022-04-27 04:06:30.626 CEST [4738] LOG:  considering wait for wal: loc 
0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.626 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.626 CEST [4738] LOG:  considering wait for wal: loc 
0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.626 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.630 CEST [4738] LOG:  considering wait for wal: loc 
0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.630 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info_till_end_of_wal('0/1903E08');
2022-04-27 04:06:30.630 CEST [4738] LOG:  considering wait for wal: loc 
0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.630 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info_till_end_of_wal('0/1903E08');
2022-04-27 04:06:30.633 CEST [4738] LOG:  considering wait for wal: loc 
0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.633 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_stats('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.634 CEST [4738] LOG:  considering wait for wal: loc 
0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.634 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_stats('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.638 CEST [4738] LOG:  considering wait for wal: loc 
0/1002000 read_upto 0/1903F58 

Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Thomas Munro
On Wed, Apr 27, 2022 at 1:54 PM Thomas Munro  wrote:
> On Wed, Apr 27, 2022 at 1:47 PM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
> > > maybe it's failing in a different way for you and me.  For me, it
> > > always fails after this break is reached in xlogutil.c:
> >
> > > /* If asked, let's not wait for future WAL. */
> > > if (!wait_for_wal)
> > > break;
> >
> > Hmm.  For me, that statement is not reached at all in successful
> > (make installcheck) runs.  In a failing run, it's reached with
> > wait_for_wal = false, after which we get the "could not read WAL"
> > failure.  Usually that happens twice, as per attached.
>
> Ok, that's the same for me.  Next question: why does the patch I
> posted not help?  For me, the error "could not read WAL at %X/%X",
> seen on the BF log, is raised by ReadNextXLogRecord() in
> pg_walinspect.c.  The patch removes that ereport() entirely (and
> handles NULL in a couple of places).

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch.  I mean this:

+if(!*errormsg)
+  *errormsg = "decode_queue_head is null";




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Apr 27, 2022 at 12:25 PM Tom Lane  wrote:
>> Not sure what we're doing differently, but plain "make check" in
>> contrib/pg_walinspect fails pretty consistently for me on gcc23.
>> I tried it again just now and got five failures in five attempts.

> I tried on the /home filesystem (a slow NFS mount) and then inside a
> directory on /tmp to get ext4 (I saw that Noah had somehow got onto a
> local filesystem, based on the present of "ext4" in the pathname and I
> was trying everything I could think of).  I used what I thought might
> be some relevant starter configure options copied from the animal:

> ./configure --prefix=$HOME/install --enable-cassert --enable-debug
> --enable-tap-tests CC="ccache gcc -mips32r2" CFLAGS="-O2
> -funwind-tables" LDFLAGS="-rdynamic"

Hmph.  I'm also running it on the /home filesystem, and I used
these settings:

export CC="ccache gcc -mips32r2"
export CFLAGS="-O2 -funwind-tables"
export LDFLAGS="-rdynamic"

./configure --enable-debug --enable-cassert --with-systemd --enable-nls 
--with-icu --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo

plus uninteresting stuff like --prefix.  Now maybe some of these
other options affect this, but I'd be pretty surprised if so.
So I'm at a loss why it behaves differently for you.

regards, tom lane




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Thomas Munro
On Wed, Apr 27, 2022 at 1:47 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
> > maybe it's failing in a different way for you and me.  For me, it
> > always fails after this break is reached in xlogutils.c:
>
> > /* If asked, let's not wait for future WAL. */
> > if (!wait_for_wal)
> > break;
>
> Hmm.  For me, that statement is not reached at all in successful
> (make installcheck) runs.  In a failing run, it's reached with
> wait_for_wal = false, after which we get the "could not read WAL"
> failure.  Usually that happens twice, as per attached.

Ok, that's the same for me.  Next question: why does the patch I
posted not help?  For me, the error "could not read WAL at %X/%X",
seen on the BF log, is raised by ReadNextXLogRecord() in
pg_walinspect.c.  The patch removes that ereport() entirely (and
handles NULL in a couple of places).




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Tom Lane
Thomas Munro  writes:
> Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
> maybe it's failing in a different way for you and me.  For me, it
> always fails after this break is reached in xlogutils.c:

> /* If asked, let's not wait for future WAL. */
> if (!wait_for_wal)
> break;

Hmm.  For me, that statement is not reached at all in successful
(make installcheck) runs.  In a failing run, it's reached with
wait_for_wal = false, after which we get the "could not read WAL"
failure.  Usually that happens twice, as per attached.

regards, tom lane

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 960530e..85dcb3a 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -15,7 +15,6 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf
 
 # Disabled because these tests require "wal_level=replica", which
 # some installcheck users do not have (e.g. buildfarm clients).
-NO_INSTALLCHECK = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23..33d2c73 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -373,6 +373,9 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
 		 */
 		Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));
 
+		if(!*errormsg)
+		  *errormsg = "decode_queue_head is null";
+		
 		return NULL;
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4257026..4089efd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -955,6 +955,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			if (loc <= read_upto)
 break;
 
+			elog(LOG, "let's wait for wal: %d", wait_for_wal);
+
 			/* If asked, let's not wait for future WAL. */
 			if (!wait_for_wal)
 break;
2022-04-27 03:40:09.253 CEST postmaster[32569] LOG:  starting PostgreSQL 
15devel on mipsel-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10+deb8u1) 
4.9.2, 32-bit
2022-04-27 03:40:09.254 CEST postmaster[32569] LOG:  listening on Unix socket 
"/tmp/pg_regress-g0HWzi/.s.PGSQL.51696"
2022-04-27 03:40:09.284 CEST startup[32574] LOG:  database system was shut down 
at 2022-04-27 03:40:09 CEST
2022-04-27 03:40:09.314 CEST postmaster[32569] LOG:  database system is ready 
to accept connections
2022-04-27 03:40:12.073 CEST client backend[32598] pg_regress/pg_walinspect 
ERROR:  WAL start LSN must be less than end LSN
2022-04-27 03:40:12.073 CEST client backend[32598] pg_regress/pg_walinspect 
STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info('0/1903EE8', '0/1903E40');
2022-04-27 03:40:12.075 CEST client backend[32598] pg_regress/pg_walinspect 
ERROR:  WAL start LSN must be less than end LSN
2022-04-27 03:40:12.075 CEST client backend[32598] pg_regress/pg_walinspect 
STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903EE8', 
'0/1903E40');
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect 
LOG:  let's wait for wal: 0
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect 
STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect 
ERROR:  could not read WAL at 0/1903E40: decode_queue_head is null
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect 
STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect 
LOG:  let's wait for wal: 0
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect 
STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect 
ERROR:  could not read WAL at 0/1903E40: decode_queue_head is null
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect 
STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.198 CEST postmaster[32569] LOG:  received fast shutdown 
request
2022-04-27 03:40:12.299 CEST postmaster[32569] LOG:  aborting any active 
transactions
2022-04-27 03:40:12.304 CEST postmaster[32569] LOG:  background worker "logical 
replication launcher" (PID 32577) exited with exit code 1
2022-04-27 03:40:12.306 CEST checkpointer[32572] LOG:  shutting down
2022-04-27 03:40:12.310 CEST checkpointer[32572] LOG:  checkpoint starting: 
shutdown immediate
2022-04-27 03:40:12.524 CEST checkpointer[32572] LOG:  checkpoint complete: 
wrote 784 buffers (4.8%); 0 WAL file(s) added, 0 removed, 0 recycled; 
write=0.199 s, sync=0.001 s, total=0.218 s; sync 

Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Kyotaro Horiguchi
At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart  
wrote in 
> On Wed, Mar 16, 2022 at 10:24:44AM +0900, Kyotaro Horiguchi wrote:
> > While discussing on additional LSNs in checkpoint log message,
> > Fujii-san pointed out [2] that there is a case where
> > CreateRestartPoint leaves unrecoverable database when concurrent
> > promotion happens. That corruption is "fixed" by the next checkpoint
> > so it is not a severe corruption.
> 
> I suspect we'll start seeing this problem more often once end-of-recovery
> checkpoints are removed [0].  Would you mind creating a commitfest entry
> for this thread?  I didn't see one.

I'm not sure the patch makes any difference here, because restart points
don't run during crash recovery, since no checkpoint records are seen
during crash recovery.  Anyway, the patch no longer applied, so I have
rebased it, but only the one for master, for lack of time for now.

> > AFAICS since 9.5, no check(/restart)points run concurrently with a
> > restartpoint [3].  So I propose to remove the code path as attached.
> 
> Yeah, this "quick hack" has been around for some time (2de48a8), and I
> believe much has changed since then, so something like what you're
> proposing is probably the right thing to do.

Thanks for checking!

> > /* Also update the info_lck-protected copy */
> > SpinLockAcquire(&XLogCtl->info_lck);
> > -   XLogCtl->RedoRecPtr = lastCheckPoint.redo;
> > +   XLogCtl->RedoRecPtr = RedoRecPtr;
> > SpinLockRelease(&XLogCtl->info_lck);
> >  
> > /*
> > @@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
> > /* Update the process title */
> > update_checkpoint_display(flags, true, false);
> >  
> > -   CheckPointGuts(lastCheckPoint.redo, flags);
> > +   CheckPointGuts(RedoRecPtr, flags);
> 
> I don't understand the purpose of these changes.  Are these related to the
> fix, or is this just tidying up?

The latter.  The mixed use of two variables that are not guaranteed to be
the same, at the same time and for the same purpose, perplexed me (though
I feel the change can hardly be incorporated on its own).  However, you're
right that it is irrelevant to the fix, so I have removed it, along with
other instances of the same.
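As a hedged toy model of the failure mode discussed in this thread (all
names and numbers are invented, not PostgreSQL code): if the control file
keeps the old checkpoint while WAL cleanup is keyed to the newer redo LSN,
the control file ends up pointing into removed segments.

```python
# Toy model of the corruption window: WAL cleanup keyed to the new redo
# LSN while the control file still records the old checkpoint.
SEG_SIZE = 16

def seg_of(lsn):
    return lsn // SEG_SIZE

control_file_redo = 10   # control file not updated (state already changed)
latest_redo = 40         # but cleanup uses the just-completed restartpoint

segments = [0, 1, 2, 3]  # WAL segments on disk
segments = [s for s in segments if s >= seg_of(latest_redo)]

# The segment holding the checkpoint the control file points at is gone:
assert seg_of(control_file_redo) not in segments
print("control file redo in segment", seg_of(control_file_redo),
      "but remaining segments are", segments)
```

Until the next checkpoint rewrites the control file, a crash at this point
leaves no valid checkpoint record to recover from.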

> [0] 
> https://postgr.es/m/CA%2BTgmoY%2BSJLTjma4Hfn1sA7S6CZAgbihYd%3DKzO6srd7Ut%3DXVBQ%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e764f47cec2476c458a2a1ff00228f982ea132a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 27 Apr 2022 10:41:23 +0900
Subject: [PATCH v5] Correctly update control file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup based on the checkpoint that has
just finished in the function.  If the database has exited the
DB_IN_ARCHIVE_RECOVERY state by the time the function is about to update
the control file, the function refrains from updating the file at all,
then proceeds to WAL cleanup using the latest REDO LSN, which is now
inconsistent with the control file.  As a result, the subsequent cleanup
procedure removes too many WAL files relative to the control file and
leaves the database unrecoverable until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case where some
other process runs a simultaneous checkpoint.  It was just a preventive
measure, and it is no longer useful because we are sure that checkpoints
are performed only by the checkpointer, except in single-process mode.
---
 src/backend/access/transam/xlog.c | 62 ++++++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61cda56c6f..a03b8de593 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6921,6 +6921,9 @@ CreateRestartPoint(int flags)
 	XLogSegNo	_logSegNo;
 	TimestampTz xtime;
 
+	/* we don't assume concurrent checkpoint/restartpoint to run */
+	Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
 	/* Get a local copy of the last safe checkpoint record. */
	SpinLockAcquire(&XLogCtl->info_lck);
 	lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -7007,36 +7010,34 @@ CreateRestartPoint(int flags)
 
 	CheckPointGuts(lastCheckPoint.redo, flags);
 
+	/* Update pg_control */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
 	/*
 	 * Remember the prior checkpoint's redo ptr for
 	 * UpdateCheckPointDistanceEstimate()
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	Assert (PriorRedoPtr < RedoRecPtr);
+
+	ControlFile->checkPoint = lastCheckPointRecPtr;
+	ControlFile->checkPointCopy = lastCheckPoint;
+
 	/*
-	 * Update pg_control, using current time.  Check that it still shows
-	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * Ensure minRecoveryPoint is past the checkpoint record while archive
+	 * recovery is still ongoing.  Normally, this will have happened already
+	 * while writing out 

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-26 Thread Michael Paquier
On Tue, Apr 26, 2022 at 01:15:05PM -0500, David Christensen wrote:
> True. :-) This does seem like a tool geared towards "expert mode", so
> maybe we just assume if you need it you know what you're doing?

This is definitely an expert mode toy.
--
Michael




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Thomas Munro
On Wed, Apr 27, 2022 at 12:25 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > I think it's a bug in pg_walinspect, so I'll move the discussion back
> > here.  Here's one rather simple way to fix it, that has survived
> > running the test a thousand times (using a recipe that failed for me
> > quite soon, after 20-100 attempts or so; I never figured out how to
> > get the 50% failure rate reported by Tom).
>
> Not sure what we're doing differently, but plain "make check" in
> contrib/pg_walinspect fails pretty consistently for me on gcc23.
> I tried it again just now and got five failures in five attempts.

I tried on the /home filesystem (a slow NFS mount) and then inside a
directory on /tmp to get ext4 (I saw that Noah had somehow got onto a
local filesystem, based on the presence of "ext4" in the pathname and I
was trying everything I could think of).  I used what I thought might
be some relevant starter configure options copied from the animal:

./configure --prefix=$HOME/install --enable-cassert --enable-debug
--enable-tap-tests CC="ccache gcc -mips32r2" CFLAGS="-O2
-funwind-tables" LDFLAGS="-rdynamic"

For me, make check always succeeds in contrib/pg_walinspect.  For me,
make installcheck fails if I do it enough times in a loop, somewhere
around the 20th loop or so, which I imagine has to do with WAL page
boundaries moving around.

for i in `seq 1 1000` ; do
  make -s installcheck || exit 1
done

> I then installed your patch and got the same failure, three times
> out of three, so I don't think we're there yet.

Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
maybe it's failing in a different way for you and me.  For me, it
always fails after this break is reached in xlogutils.c:

/* If asked, let's not wait for future WAL. */
if (!wait_for_wal)
break;

If you add a log message there, do you see that?  For me, the patch
fixes it, because it teaches pg_walinspect that messageless errors are
a way of detecting end-of-data (due to the code above, introduced by
the pg_walinspect commit).




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Tom Lane
Thomas Munro  writes:
> I think it's a bug in pg_walinspect, so I'll move the discussion back
> here.  Here's one rather simple way to fix it, that has survived
> running the test a thousand times (using a recipe that failed for me
> quite soon, after 20-100 attempts or so; I never figured out how to
> get the 50% failure rate reported by Tom).

Not sure what we're doing differently, but plain "make check" in
contrib/pg_walinspect fails pretty consistently for me on gcc23.
I tried it again just now and got five failures in five attempts.

I then installed your patch and got the same failure, three times
out of three, so I don't think we're there yet.

Again, since I do have this problem in captivity, I'm happy
to spend some time poking at it.  But I could use a little
guidance where to poke, because I've not looked at any of
the WAL prefetch stuff.

regards, tom lane




Re: WIP: WAL prefetch (another approach)

2022-04-26 Thread Thomas Munro
On Tue, Apr 26, 2022 at 6:11 PM Thomas Munro  wrote:
> I will poke some more tomorrow to try to confirm this and try to come
> up with a fix.

Done, and moved over to the pg_walinspect commit thread to reach the
right eyeballs:

https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com




Re: pgsql: Add contrib/pg_walinspect.

2022-04-26 Thread Thomas Munro
On Tue, Apr 26, 2022 at 5:36 PM Michael Paquier  wrote:
> On Tue, Apr 26, 2022 at 01:25:14AM -0400, Tom Lane wrote:
> > I've been wondering if the issue could be traced to topminnow's unusual
> > hardware properties, specifically that it has MAXALIGN 8 even though
> > it's only a 32-bit machine per sizeof(void *).  I think the only
> > other active buildfarm animal like that is my gaur ... but I've
> > failed to reproduce it on gaur.  Best guess at the moment is that
> > it's a timing issue that topminnow manages to reproduce often.
>
> I have managed to miss your message.  Let's continue the discussion
> there, then.

I think it's a bug in pg_walinspect, so I'll move the discussion back
here.  Here's one rather simple way to fix it, that has survived
running the test a thousand times (using a recipe that failed for me
quite soon, after 20-100 attempts or so; I never figured out how to
get the 50% failure rate reported by Tom).  Explanation in commit
message.  You can see that the comments near the first hunk already
contemplated this possibility, but just didn't try to handle it.

Another idea that I slept on, but rejected, is that the new WOULDBLOCK
return value introduced to support WAL prefetching could be used here
(it's a way of reporting a lack of data, different from errors).
Unfortunately it's not exposed to the XLogReadRecord() interface, as I
only intended it for use by XLogReadAhead().  I don't really think
it's a good idea to redesign that API at this juncture.

Maybe there is some other way I haven't considered -- is there a way
to get the LSN past the latest whole flushed record from shmem?
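For what it's worth, the end-of-data condition under discussion can be
sketched abstractly (a hedged toy model, not the pg_walinspect code): stop
decoding as soon as a record's end would pass the flush pointer, rather
than treating the failed read as an error.

```python
# Toy model: WAL as a stream of (start, length) records; the flush
# pointer may fall inside the last record, which is then only partially
# readable and should be treated as end-of-data, not as an error.
records = [(0, 8), (8, 8), (16, 8), (24, 8)]
flush_lsn = 28                      # lands inside the record starting at 24

def readable_records(records, flush_lsn):
    out = []
    for start, length in records:
        end = start + length
        if end > flush_lsn:         # partial record past the flush LSN:
            break                   # end-of-data, not an error
        out.append((start, end))
    return out

print(readable_records(records, flush_lsn))  # -> [(0, 8), (8, 16), (16, 24)]
```

Whether one reaches this state depends only on the alignment of the flush
pointer and record boundaries, which is why the failure looked timing- and
machine-dependent.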
From d78ae8fcac7195d2cc967bcde754d1ce1fa39321 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 27 Apr 2022 11:39:10 +1200
Subject: [PATCH] Fix pg_walinspect race against flush LSN.

Instability in the TAP test for pg_walinspect revealed that
pg_get_wal_records_info_till_end_of_wal(x) would try to decode all the
records beginning in the range [x, flush LSN), even though that might
include a partial record at the end of the range.  In that case, an
error would be reported by read_local_xlog_page_no_wait when it tried to
read past the flush LSN.  That caused a test failure only on a BF
animal that had been restarted recently, but could be expected to happen
in the wild quite easily depending on the alignment of various
parameters.

Adjust pg_walinspect to tolerate errors without messages, as a way to
discover the end of the WAL region to decode.

Discussion: https://postgr.es/m/Ymd/e5eeZMNAkrXo%40paquier.xyz
Discussion: https://postgr.es/m/111657.1650910...@sss.pgh.pa.us
---
 contrib/pg_walinspect/pg_walinspect.c | 51 +++++++++++++++++++++------------------------------
 1 file changed, 21 insertions(+), 30 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index bf38863ff1..10507c0c0c 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -133,6 +133,7 @@ InitXLogReaderState(XLogRecPtr lsn, XLogRecPtr *first_record)
  * We guard against ordinary errors trying to read WAL that hasn't been
  * written yet by limiting end_lsn to the flushed WAL, but that can also
  * encounter errors if the flush pointer falls in the middle of a record.
+ * In that case we'll return NULL.
  */
 static XLogRecord *
 ReadNextXLogRecord(XLogReaderState *xlogreader, XLogRecPtr first_record)
@@ -149,11 +150,11 @@ ReadNextXLogRecord(XLogReaderState *xlogreader, XLogRecPtr first_record)
 	(errcode_for_file_access(),
 	 errmsg("could not read WAL at %X/%X: %s",
 			LSN_FORMAT_ARGS(first_record), errormsg)));
-		else
-			ereport(ERROR,
-	(errcode_for_file_access(),
-	 errmsg("could not read WAL at %X/%X",
-			LSN_FORMAT_ARGS(first_record))));
+
+		/*
+		 * With no error message, assume that read_local_xlog_page_no_wait
+		 * tried to read past the flush LSN and reported an error.
+		 */
 	}
 
 	return record;
@@ -246,7 +247,12 @@ pg_get_wal_record_info(PG_FUNCTION_ARGS)
 
 	xlogreader = InitXLogReaderState(lsn, _record);
 
-	(void) ReadNextXLogRecord(xlogreader, first_record);
+	if (!ReadNextXLogRecord(xlogreader, first_record))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not read WAL at %X/%X",
+				LSN_FORMAT_ARGS(first_record))));
+
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
@@ -327,22 +333,14 @@ GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
 
-	for (;;)
+	while (ReadNextXLogRecord(xlogreader, first_record) &&
+		   xlogreader->EndRecPtr <= end_lsn)
 	{
-		(void) ReadNextXLogRecord(xlogreader, first_record);
+		GetWALRecordInfo(xlogreader, xlogreader->currRecPtr, values, nulls,
+		 PG_GET_WAL_RECORDS_INFO_COLS);
 
-		if (xlogreader->EndRecPtr <= end_lsn)
-		{
-			GetWALRecordInfo(xlogreader, xlogreader->currRecPtr, values, nulls,
-			 

Re: Building Postgres with lz4 on Visual Studio

2022-04-26 Thread Robert Haas
On Wed, Apr 13, 2022 at 10:22 AM Melih Mutlu  wrote:
> What I realized is that c:\lz4\lib\liblz4.lib actually does not exist.
> The latest versions of lz4, downloaded from [2], do not contain \liblz4.lib 
> anymore, as opposed to what's written here [3]. Also there isn't a lib folder 
> too.
>
> After those changes on lz4 side, AFAIU there seems like this line adds 
> library from wrong path in Solution.pm file [4].
>>
>> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
>> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
>
> Even if I spent some time on this problem and tried to fix some places, I'm 
> not able to run a successful build yet.
> This is also the case for zstd too. Enabling zstd gives the same error.
>
> Has anyone had this issue before? Is this something that anyone is aware of 
> and somehow made it work?
> I would appreciate any comment/suggestion etc.

In Solution.pm we have this:

if ($self->{options}->{lz4})
{
$proj->AddIncludeDir($self->{options}->{lz4} . '\include');
$proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
}
if ($self->{options}->{zstd})
{
$proj->AddIncludeDir($self->{options}->{zstd} . '\include');
$proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
}

I think what you're saying is that the relative pathnames here may not
be correct, depending on which version of lz4/zstd you're using.  The
solution is probably to use Perl's -e to test which files actually
exist, e.g.

if ($self->{options}->{lz4})
{
    $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
    if (-e ($self->{options}->{lz4} . '\someplace\somelz4.lib'))
    {
        $proj->AddLibrary($self->{options}->{lz4} .
            '\someplace\somelz4.lib');
    }
    else
    {
        $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');
    }
}

The trick, at least as it seems to me, is figuring out exactly what
the right set of conditions is, based on what kinds of different
builds exist out there and where they put stuff.
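That probing logic can be sketched language-agnostically (a hedged
illustration in Python for brevity; the candidate sub-paths are invented
placeholders, not the real lz4 layout):

```python
# Probe a list of candidate library locations and fall back to the
# historical path; mirrors the idea of testing with -e before AddLibrary.
import os

def find_library(base, candidates, fallback):
    """Return the first existing candidate under base, else the fallback."""
    for rel in candidates:
        path = os.path.join(base, rel)
        if os.path.exists(path):
            return path
    return os.path.join(base, fallback)

# With none of the candidates present, the historical default comes back:
print(find_library(r"c:\lz4", [r"someplace\somelz4.lib"], r"lib\liblz4.lib"))
```

The open question remains the same either way: enumerating which layouts
real lz4/zstd builds actually ship.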

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: avoid multiple hard links to same WAL file after a crash

2022-04-26 Thread Nathan Bossart
Here is an attempt at creating something that can be back-patched.  0001
simply replaces calls to durable_rename_excl() with durable_rename() and is
intended to be back-patched.  0002 removes the definition of
durable_rename_excl() and is _not_ intended for back-patching.  I imagine
0002 will need to be held back for v16devel.

I think back-patching 0001 will encounter a couple of small obstacles.  For
example, the call in basic_archive won't exist on most of the
back-branches, and durable_rename_excl() was named durable_link_or_rename()
before v13.  I don't mind producing a patch for each back-branch if needed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d489c2bff029db6e07e5028788faf869c35f886b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 26 Apr 2022 11:56:50 -0700
Subject: [PATCH v3 1/2] Replace calls to durable_rename_excl() with
 durable_rename().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such overwrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change replaces all calls to durable_rename_excl() with
durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.
The function itself is left around in case any extensions are using
it.  It will be removed in v16 via a follow-up commit.

Back-patch to all supported versions.  Before v13,
durable_rename_excl() was named durable_link_or_rename().

Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 +++--
 src/backend/access/transam/timeline.c | 14 ++------------
 src/backend/access/transam/xlog.c |  8 ++------
 3 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..128f754e87 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -519,12 +514,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61cda56c6f..f49194a8b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3323,14 +3323,10 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Perform the rename using link if 

Re: Fix primary crash continually with invalid checkpoint after promote

2022-04-26 Thread Tom Lane
"Zhao Rui" <875941...@qq.com> writes:
> Newly promoted primary may leave an invalid checkpoint.
> In function CreateRestartPoint, the control file is updated and old WAL
> files are removed. But in some situations the control file is not updated
> while old WAL files are still removed, which produces an invalid checkpoint
> pointing at nonexistent WAL. Crucial log messages: "invalid primary
> checkpoint record", "could not locate a valid checkpoint record".

I believe this is the same issue being discussed here:

https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com

but Horiguchi-san's proposed fix looks quite different from yours.

regards, tom lane




Re: Possible corruption by CreateRestartPoint at promotion

2022-04-26 Thread Nathan Bossart
On Wed, Mar 16, 2022 at 10:24:44AM +0900, Kyotaro Horiguchi wrote:
> While discussing on additional LSNs in checkpoint log message,
> Fujii-san pointed out [2] that there is a case where
> CreateRestartPoint leaves unrecoverable database when concurrent
> promotion happens. That corruption is "fixed" by the next checkpoint
> so it is not a severe corruption.

I suspect we'll start seeing this problem more often once end-of-recovery
checkpoints are removed [0].  Would you mind creating a commitfest entry
for this thread?  I didn't see one.

> AFAICS since 9.5, no check(/restart)points run concurrently with a
> restartpoint [3].  So I propose to remove the code path as attached.

Yeah, this "quick hack" has been around for some time (2de48a8), and I
believe much has changed since then, so something like what you're
proposing is probably the right thing to do.

>   /* Also update the info_lck-protected copy */
>   SpinLockAcquire(&XLogCtl->info_lck);
> - XLogCtl->RedoRecPtr = lastCheckPoint.redo;
> + XLogCtl->RedoRecPtr = RedoRecPtr;
>   SpinLockRelease(&XLogCtl->info_lck);
>  
>   /*
> @@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
>   /* Update the process title */
>   update_checkpoint_display(flags, true, false);
>  
> - CheckPointGuts(lastCheckPoint.redo, flags);
> + CheckPointGuts(RedoRecPtr, flags);

I don't understand the purpose of these changes.  Are these related to the
fix, or is this just tidying up?

[0] 
https://postgr.es/m/CA%2BTgmoY%2BSJLTjma4Hfn1sA7S6CZAgbihYd%3DKzO6srd7Ut%3DXVBQ%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-26 Thread David Christensen
On Mon, Apr 25, 2022 at 9:42 PM Michael Paquier  wrote:
> On Mon, Apr 25, 2022 at 10:24:52AM -0500, David Christensen wrote:
> > On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy
> >  wrote:
> >> Thanks for working on this. I'm just thinking if we can use these FPIs
> >> to repair the corrupted pages? I would like to understand more
> >> detailed usages of the FPIs other than inspecting with pageinspect.
> >
> > My main use case was for being able to look at potential corruption,
> > either in the WAL stream, on heap, or in tools associated with the WAL
> > stream.  I suppose you could use the page images to replace corrupted
> > on-disk pages (and in fact I think I've heard of a tool or two that
> > try to do that), though don't know that I consider this the primary
> > purpose (and having toast tables and the list, as well as clog would
> > make it potentially hard to just drop-in a page version without
> > issues).  Might help in extreme situations though.
>
> You could do a bunch of things with those images, even make things
> worse if you are not careful enough.

True. :-) This does seem like a tool geared towards "expert mode", so
maybe we just assume if you need it you know what you're doing?

> >> 10) Along with pg_pwrite(), can we also fsync the files (of course
> >> users can choose it optionally) so that the writes will be durable for
> >> the OS crashes?
> >
> > Can add; you thinking a separate flag to disable this with default true?
>
> We expect data generated by tools like pg_dump, pg_receivewal
> (depending on the use --synchronous) or pg_basebackup to be consistent
> when we exit from the call.  FWIW, flushing this data does not seem
> like a strong requirement for something aimed at page-level surgery or
> lookups, because the WAL segments should still be around
> even if the host holding the archives is unplugged.

I have added the fsync to the latest path (forthcoming), but I have no
strong preferences here as to the correct/expected behavior.

Best,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-26 Thread David Christensen
On Mon, Apr 25, 2022 at 9:54 PM Michael Paquier  wrote:
> On Mon, Apr 25, 2022 at 10:11:10AM -0500, David Christensen wrote:
> > On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier  wrote:
> >> I don't think that there is any need to rely on a new logic if there
> >> is already some code in place able to do the same work.  See
> >> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example,
> >> that relies on pg_check_dir().  I think that you'd better rely at
> >> least on what pgcheckdir.c offers.
> >
> > Yeah, though I am tending towards what another user had suggested and
> > just accepting an existing directory rather than requiring it to be
> > empty, so thinking I might just skip this one. (Will review the file
> > you've pointed out to see if there is a relevant function though.)
>
> OK.  FWIW, pg_check_dir() is used in initdb and pg_basebackup because
> these care about the behavior to use when specifying a target path.
> You could reuse it, but use a different policy depending on its
> returned result for the needs of what you see fit in this case.

I have a new version of the patch (pending tests) that uses
pg_check_dir's return value to handle things appropriately, so at
least some code reuse now.  It did end up simplifying a lot.
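The directory policy under discussion can be sketched in standalone form. The following is an approximation only — not the real pg_check_dir() from src/common/pgcheckdir.c, whose return codes differ — showing the "classify, then create-if-missing or accept-existing" behavior:

```c
/*
 * Standalone sketch (an approximation, not the actual pg_check_dir() API)
 * of the target-directory policy discussed above: classify the directory,
 * create it if missing, and accept an existing one rather than requiring
 * it to be empty.
 */
#include <dirent.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

enum dir_status
{
    DIR_MISSING,                /* does not exist */
    DIR_EMPTY,                  /* exists, no entries besides . and .. */
    DIR_NOT_EMPTY,              /* exists and has contents */
    DIR_ERROR                   /* could not be opened */
};

static enum dir_status
classify_dir(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    enum dir_status status = DIR_EMPTY;

    if (dir == NULL)
        return (errno == ENOENT) ? DIR_MISSING : DIR_ERROR;

    while ((de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        status = DIR_NOT_EMPTY;    /* found a real entry */
        break;
    }
    closedir(dir);
    return status;
}

/* Policy sketch: create the directory if missing; accept an existing one. */
static int
ensure_target_dir(const char *path)
{
    switch (classify_dir(path))
    {
        case DIR_MISSING:
            return mkdir(path, 0700);
        case DIR_EMPTY:
        case DIR_NOT_EMPTY:
            return 0;               /* accept as-is, per the discussion */
        default:
            return -1;
    }
}
```

The real code would additionally distinguish dot-file-only and lost+found cases, which pg_check_dir() reports separately.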

> >> +   PageSetLSN(page, record->ReadRecPtr);
> >> +   /* if checksum field is non-zero then we have checksums enabled,
> >> +* so recalculate the checksum with new LSN (yes, this is a hack)
> >> +*/
> >> Yeah, that looks like a hack, but putting in place a page on a cluster
> >> that has checksums enabled would be more annoying with
> >> zero_damaged_pages enabled if we don't do that, so that's fine by me
> >> as-is.  Perhaps you should mention that FPWs don't have their
> >> pd_checksum updated when written.
> >
> > Can make a mention; you thinking just a comment in the code is
> > sufficient, or want there to be user-visible docs as well?
>
> I was thinking about a comment, at least.

New patch version has significantly more comments.
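The "fixup" semantics being debated can be modeled in miniature. The page layout and checksum below are toys — PostgreSQL's real page header and FNV-based pg_checksum_page() differ — but they show the point under discussion: once pd_lsn is stamped, the checksum must be recomputed or the page fails verification on a checksum-enabled cluster:

```c
/*
 * Toy model (NOT PostgreSQL's real page layout or checksum algorithm)
 * of the extracted-page fixup step discussed above.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct
{
    uint64_t pd_lsn;            /* stands in for PageXLogRecPtr */
    uint16_t pd_checksum;
    char     payload[PAGE_SIZE - sizeof(uint64_t) - sizeof(uint16_t)];
} MockPage;

/* Toy checksum: hashes every byte except pd_checksum itself. */
static uint16_t
mock_checksum(const MockPage *p)
{
    const unsigned char *b = (const unsigned char *) p;
    uint32_t    sum = 0;

    for (size_t i = 0; i < sizeof(MockPage); i++)
    {
        if (i >= offsetof(MockPage, pd_checksum) &&
            i < offsetof(MockPage, pd_checksum) + sizeof(uint16_t))
            continue;           /* skip the checksum field itself */
        sum = sum * 31 + b[i];
    }
    return (uint16_t) sum;
}

/*
 * The fixup step: stamp the record's LSN onto the page, then recompute
 * the checksum.  A non-zero pd_checksum is taken to mean checksums are
 * enabled -- the "hack" acknowledged in the quoted comment.
 */
static void
fixup_extracted_page(MockPage *p, uint64_t read_rec_ptr)
{
    p->pd_lsn = read_rec_ptr;
    if (p->pd_checksum != 0)
        p->pd_checksum = mock_checksum(p);
}
```

A `--raw` mode, as floated above, would simply skip fixup_extracted_page() and emit the page exactly as it appeared in the WAL record.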

> > This was to make the page as extracted equivalent to the effect of
> > applying the WAL record block on replay (the LSN and checksum both);
> > since recovery calls this function to mark the LSN where the page came
> > from this is why I did this as well.  (I do see perhaps a case for
> > --raw output that doesn't munge the page whatsoever, just made
> > comparisons easier.)
>
> Hm.  Okay.  The argument goes both ways, I guess, depending on what we
> want to do with those raw pages.  Still you should not need pd_lsn if
> the point is to be able to stick the pages back in place to attempt to
> get back as much data as possible when loading it back to the shared
> buffers?

Yeah, I can see that too; I think there's at least enough of an
argument for a flag to apply the fixups or just extract only the raw
page pre-modification.  Not sure which should be the "default"
behavior; either `--raw` or `--fixup-metadata` or something could
work. (Naming suggestions welcomed.)

David




Re: DBT-5 Stored Procedure Development (2022)

2022-04-26 Thread Peter Geoghegan
On Tue, Apr 26, 2022 at 10:36 AM Mark Wong  wrote:
> I'm afraid not.  I'm guessing that pulling in egen 1.14 would address
> that.  Maybe it would make sense to put that at the top of the todo list if
> this project is accepted...

Wouldn't it be a prerequisite here? I don't actually have any reason
to prefer the old function-based code to the new stored procedure
based code. Really, all I'm looking for is a credible implementation
of TPC-E that I can use to model some aspects of OLTP performance for
my own purposes.

TPC-C (which I have plenty of experience with) has only two secondary
indexes (in typical configurations), and doesn't really stress
concurrency control at all. Plus there are no low cardinality indexes
in TPC-C, while TPC-E has quite a few. Chances are high that I'd learn
something from TPC-E, which has all of these things -- I'm really
looking for bottlenecks, where Postgres does entirely the wrong thing.
It's especially interesting to me as somebody that focuses on B-Tree
indexing.

-- 
Peter Geoghegan




Re: DBT-5 Stored Procedure Development (2022)

2022-04-26 Thread Mark Wong
On Tue, Apr 19, 2022 at 05:20:50PM -0700, Peter Geoghegan wrote:
> On Tue, Apr 19, 2022 at 11:31 AM Mark Wong  wrote:
> > As some of the tasks proposed are already in place, one other task could be
> > updating egen (the TPC-supplied code.)  The kit was last developed against
> > 1.12, and 1.14 is current as of this email.
> 
> As you know, I have had some false starts with using DBT5 on a modern
> Linux distribution. Perhaps I gave up too easily at the time, but I'm
> definitely still interested. Has there been work on that since?

I'm afraid not.  I'm guessing that pulling in egen 1.14 would address
that.  Maybe it would make sense to put that at the top of the todo list if
this project is accepted...

Regards,
Mark




Re: orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Pavel Stehule
út 26. 4. 2022 v 18:36 odesílatel Bruce Momjian  napsal:

> On Tue, Apr 26, 2022 at 06:17:41PM +0200, Pavel Stehule wrote:
> >
> > út 26. 4. 2022 v 14:19 odesílatel Ashutosh Sharma  >
> > napsal:
> >
> > Hi,
> >
> > The below files in orafce contrib module are generated at build time.
> > However, these are checked into the repository. Shouldn't these files
> > be removed from the repository and added to the .gitignore file so
> > that they get ignored in the future commits.
> >
> > sqlparse.c
> > sqlscan.c
> > sqlparse.h
> >
> > Without these files there is a problem with MSVC build.
>
> Uh, I am kind of lost here.  Why was this reported to the Postgres
> server email lists and not to the orafce email lists?  Is there
> confusion about who develops/supports orafce?  Which MSVC build is broken?  The
> Postgres server or orafce?
>

I am sorry. This is just an orafce topic.

Regards

Pavel


>
> I assume the problem is that the 'makefile' system generates these
> files, but the MSVC build doesn't generate them, so it is just easier to
> check them in after a make build so MSVC can use them, and another make
> run will just overwrite them.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   Indecision is a decision.  Inaction is an action.  Mark Batterson
>
>


Re: orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Bruce Momjian
On Tue, Apr 26, 2022 at 06:17:41PM +0200, Pavel Stehule wrote:
> 
> út 26. 4. 2022 v 14:19 odesílatel Ashutosh Sharma 
> napsal:
> 
> Hi,
> 
> The below files in orafce contrib module are generated at build time.
> However, these are checked into the repository. Shouldn't these files
> be removed from the repository and added to the .gitignore file so
> that they get ignored in the future commits.
> 
> sqlparse.c
> sqlscan.c
> sqlparse.h
> 
> Without these files there is a problem with MSVC build. 

Uh, I am kind of lost here.  Why was this reported to the Postgres
server email lists and not to the orafce email lists?  Is there
confusion about who develops/supports orafce?  Which MSVC build is broken?  The
Postgres server or orafce?

I assume the problem is that the 'makefile' system generates these
files, but the MSVC build doesn't generate them, so it is just easier to
check them in after a make build so MSVC can use them, and another make
run will just overwrite them.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Pavel Stehule
út 26. 4. 2022 v 14:19 odesílatel Ashutosh Sharma 
napsal:

> Hi,
>
> The below files in orafce contrib module are generated at build time.
> However, these are checked into the repository. Shouldn't these files
> be removed from the repository and added to the .gitignore file so
> that they get ignored in the future commits.
>
> sqlparse.c
> sqlscan.c
> sqlparse.h
>

Without these files there is a problem with MSVC build.


> --
> With Regards,
> Ashutosh Sharma.
>
>
>


Re: Cryptohash OpenSSL error queue in FIPS enabled builds

2022-04-26 Thread Daniel Gustafsson
> On 26 Apr 2022, at 03:55, Michael Paquier  wrote:
> 
> On Tue, Apr 26, 2022 at 12:07:32AM +0200, Daniel Gustafsson wrote:
>> In this particular codepath I think we can afford clearing it on the way out,
>> with a comment explaining why.  It's easily reproducible, and adding a call and
>> a comment is good documentation for ourselves of this OpenSSL behavior.  That
>> being said, clearing on the way in is the important bit.
> 
> + * consumed an error, but cipher initialization can in FIPS enabled
> It seems to me that this comment needs a hyphen, as of
> "FIPS-enabled".

Will fix.

> I am a bit annoyed to assume that having only a localized
> ERR_clear_error() in the error code path of the init() call is the
> only problem that would occur, only because that's the first one we'd
> see in a hash computation.

It's also the only one in this case, since the computation won't get past the
init step with the error, no?  The queue will be cleared for each computation, so
the risk of cross-contamination is removed.

> Perhaps this should be reported to the upstream folks?  We'd still
> need this code for already released versions, but getting two errors
> looks like a mistake.

Not really, the error system in OpenSSL has been defined as a queue with
multiple errors per call possible at least since SSLeay 0.9.1.  I think this is
very much intentional, but a rare case of it.

>> pgcrypto doesn't really consume or even inspect the OpenSSL errors, but passes on
>> a PXE error based on the context of the operation.  We could clear the queue as
>> we leave, but as you say we already clear it before calling in other places so
>> it's not clear that it's useful.  We've had EVP in pgcrypto for some time
>> without seeing issues from error queues; one error left isn't that different
>> from two when not consumed.
> 
> Okay.  I did not recall the full error stack used in pgcrypto.  It is
> annoying to not get from OpenSSL the details of the error, though.
> With FIPS enabled, one computing a hash with pgcrypto would just know
> about the initialization error, but would miss why the computation
> failed.  It looks like we could use a new error code to tell
> px_strerror() to look at the OpenSSL error queue instead of one of the
> hardcoded strings.  Just saying.

I looked at that briefly, and might revisit it during the 16 cycle, but it does
have a smell of diminishing returns to it.  With non-OpenSSL code ripped out
from pgcrypto it's clearly more interesting than before.

--
Daniel Gustafsson   https://vmware.com/





Re: orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Ashutosh Sharma
Sure, I'll do that.

Thanks,
Ashutosh

On Tue, Apr 26, 2022 at 5:51 PM Daniel Gustafsson  wrote:
>
> > On 26 Apr 2022, at 14:19, Ashutosh Sharma  wrote:
>
> > The below files in orafce contrib module are generated at build time.
> > However, these are checked into the repository. Shouldn't these files
> > be removed from the repository and added to the .gitignore file so
> > that they get ignored in the future commits.
>
> You should probably take this to the orafce project instead, possibly with a PR
> against their repo?
>
> --
> Daniel Gustafsson   https://vmware.com/
>




Re: double inclusion of '-p' flex flag

2022-04-26 Thread Ashutosh Sharma
On Tue, Apr 26, 2022 at 5:55 PM John Naylor
 wrote:
>
> On Tue, Apr 26, 2022 at 7:16 PM Ashutosh Sharma  wrote:
> >
> > Hi,
> >
> > I see that we have included '-p' flex flag twice in the commands used
> > to generate the scanner files. See below:
> >
> > src/backend/parser/Makefile:60:scan.c: FLEXFLAGS = -CF -p -p
> > src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
> > src/bin/psql/Makefile:61:   psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
> > src/fe_utils/Makefile:43:psqlscan.c: FLEXFLAGS = -Cfe -p -p
> > src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
> > src/bin/psql/Makefile:61:psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
> >
> > Do we need this or can the extra -p flag be removed?
>
> From the Flex manual:
>
> "generates a performance report to stderr. The report consists of
> comments regarding features of the flex input file which will cause a
> serious loss of performance in the resulting scanner. If you give the
> flag twice, you will also get comments regarding features that lead to
> minor performance losses."
>

Ahh, I see. This information is missing from the man page. Thanks!

--
With Regards,
Ashutosh Sharma.




Re: [Patch] ALTER SYSTEM READ ONLY

2022-04-26 Thread Amul Sul
On Sat, Apr 23, 2022 at 1:34 PM Bharath Rupireddy
 wrote:
>
> On Mon, Mar 15, 2021 at 12:56 PM Amul Sul  wrote:
> > >
> > > It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
> > >
> >
> > Thanks, I am getting one more failure for vacuumlazy.c on the
> > latest master head (d75288fb27b); I fixed that in the attached version.
>
> Thanks Amul! I haven't looked at the whole thread, I may be repeating
> things here, please bear with me.
>

Np, thanks for looking into it.

> 1) Is pg_prohibit_wal() the only way the user sets the WAL prohibit mode? Or
> do we still allow it via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
> think the patches still have ALTER SYSTEM READ ONLY references.

Could you please point me to what those references are? I didn't find
any in the v45 version.

> 2) IIUC, the idea of this patch is not to generate any new WAL when
> set as default_transaction_read_only and transaction_read_only can't
> guarantee that?

No. WAL writes should be disabled completely; in other words, XLogInsert()
should be restricted.

> 3) IMO, the function name pg_prohibit_wal doesn't look good where it
> also allows one to set WAL writes, how about the following functions -
> pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
> pg_allow_wal or pg_allow_wal_{generation, inserts} without any
> arguments and if needed a common function
> pg_set_wal_generation_state(read-only/read-write) something like that?

There have been similar suggestions before, but nothing has been
finalized yet; there are other, bigger challenges that need to be
handled first, so we can keep this work for last.

> 4) It looks like only the checkpointer is setting the WAL prohibit
> state? Is there a strong reason for that? Why can't the backend take a
> lock on prohibit state in shared memory and set it and let the
> checkpointer read it and block itself from writing WAL?

Once the WAL-prohibited state transition is initiated, it must be
completed; there is no fallback. What if the backend exits before the
transition completes? The checkpointer, by contrast, will be restarted
if it exits and will complete the state transition.

> 5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
> checkpointer? Why?

We simply want to wake up the checkpointer process without requesting
specific work in the handler function. Another suitable choice would be
SIGINT; we can use that instead if needed.

> 6) What happens for long-running or in-progress transactions if
> someone prohibits WAL in the midst of them? Do these txns fail? Or do
> we say that we will allow them to run to completion? Or do we fail
> those txns at commit time? One might use this feature to say not let
> server go out of disk space, but if we allow in-progress txns to
> generate/write WAL, then how can one achieve that with this feature?
> Say, I monitor my server in such a way that at 90% of disk space,
> prohibit WAL to avoid server crash. But if this feature allows
> in-progress txns to generate WAL, then the server may still crash?

Read-only transactions will be allowed to continue. If such a
transaction tries to write, or if any other transaction has already
performed writes, then the session running that transaction will be
terminated -- the design is described in the first mail of this
thread.

> 7) What are the other use-cases (I can think of - to avoid out of disk
> crashes, block/freeze writes to database when the server is
> compromised) with this feature? Any usages during/before failover,
> promotion or after it?

The important use case is for failover to avoid split-brain situations.

> 8) Is there a strong reason that we've picked up conditional variable
> wal_prohibit_cv over mutex/lock for updating WALProhibit shared
> memory?

I am not sure how that can be done using a mutex or lock.

> 9) Any tests that you are planning to add?

Yes. I have added fairly thorough tests that cover most of my code
changes, but that is not enough for such critical changes; there is a
lot of room for improvement and for adding more tests, both for this
module and for other parts, e.g. some missing coverage of gin, gist,
and brin, and of the core features where this patch adds checks.
Any help will be greatly appreciated.

Regards,
Amul




Re: double inclusion of '-p' flex flag

2022-04-26 Thread John Naylor
On Tue, Apr 26, 2022 at 7:16 PM Ashutosh Sharma  wrote:
>
> Hi,
>
> I see that we have included '-p' flex flag twice in the commands used
> to generate the scanner files. See below:
>
> src/backend/parser/Makefile:60:scan.c: FLEXFLAGS = -CF -p -p
> src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
> src/bin/psql/Makefile:61:   psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
> src/fe_utils/Makefile:43:psqlscan.c: FLEXFLAGS = -Cfe -p -p
> src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
> src/bin/psql/Makefile:61:psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
>
> Do we need this or can the extra -p flag be removed?

From the Flex manual:

"generates a performance report to stderr. The report consists of
comments regarding features of the flex input file which will cause a
serious loss of performance in the resulting scanner. If you give the
flag twice, you will also get comments regarding features that lead to
minor performance losses."

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Daniel Gustafsson
> On 26 Apr 2022, at 14:19, Ashutosh Sharma  wrote:

> The below files in orafce contrib module are generated at build time.
> However, these are checked into the repository. Shouldn't these files
> be removed from the repository and added to the .gitignore file so
> that they get ignored in the future commits.

You should probably take this to the orafce project instead, possibly with a PR
against their repo?

--
Daniel Gustafsson   https://vmware.com/





orafce: some of the build time generated files are not present in .gitignore and also checked into the repository

2022-04-26 Thread Ashutosh Sharma
Hi,

The below files in the orafce contrib module are generated at build time.
However, they are checked into the repository. Shouldn't these files
be removed from the repository and added to the .gitignore file so
that they get ignored in future commits?

sqlparse.c
sqlscan.c
sqlparse.h

--
With Regards,
Ashutosh Sharma.




double inclusion of '-p' flex flag

2022-04-26 Thread Ashutosh Sharma
Hi,

I see that we have included '-p' flex flag twice in the commands used
to generate the scanner files. See below:

src/backend/parser/Makefile:60:scan.c: FLEXFLAGS = -CF -p -p
src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
src/bin/psql/Makefile:61:   psqlscanslash.c: FLEXFLAGS = -Cfe -p -p
src/fe_utils/Makefile:43:psqlscan.c: FLEXFLAGS = -Cfe -p -p
src/backend/utils/adt/Makefile:122:jsonpath_scan.c: FLEXFLAGS = -CF -p -p
src/bin/psql/Makefile:61:psqlscanslash.c: FLEXFLAGS = -Cfe -p -p

Do we need this or can the extra -p flag be removed?

--
With Regards,
Ashutosh Sharma.




Re: variable filename for psql \copy

2022-04-26 Thread Jiří Fejfar
Dear Daniel, David

On Mon, 25 Apr 2022 at 18:07, David G. Johnston 
wrote:

> On Mon, Apr 25, 2022 at 1:24 AM Jiří Fejfar  wrote:
>
>> contrib_regression=# copy (select 1) to :'afile';
>>
>
> Hopefully you realize that COPY is going to place that file on the server,
> not send it to the psql client to be placed on the local machine.
>
> The best way to do copy in psql is:
> \set afile '...'
> \o :'afile'
> copy ... to stdout; --or the variant where you one-shot the \o ( \g with
> arguments )
>
> Not only do you get variable expansion but you can write the COPY command
> on multiple lines just like any other SQL command.
>
>
thank you for your advice, \g works pretty well in my case


> Additionally, we have a list, and even an online form, for submitting bug
> reports.  That would have been the more appropriate place to direct this
> email.
>
>
sorry, I didn't realize that; next time I will send the report there

J.

> David J.
>
>


Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-04-26 Thread Andrey Borodin



> On 25 Apr 2022, at 21:48, Nathan Bossart  wrote:
> 
> I'm personally in
> favor of just adding a GUC that can be enabled to block canceling
> synchronous replication waits

+1. I think it's the only option to provide quorum commit guarantees.

Best regards, Andrey Borodin.



Re: Perform streaming logical transactions by background workers and parallel apply

2022-04-26 Thread vignesh C
On Fri, Apr 8, 2022 at 2:44 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wednesday, April 6, 2022 1:20 PM Amit Kapila  
> wrote:
>
> > In this email, I would like to discuss allowing streaming logical
> > transactions (large in-progress transactions) by background workers
> > and parallel apply in general. The goal of this work is to improve the
> > performance of the apply work in logical replication.
> >
> > Currently, for large transactions, the publisher sends the data in
> > multiple streams (changes divided into chunks depending upon
> > logical_decoding_work_mem), and then on the subscriber-side, the apply
> > worker writes the changes into temporary files and once it receives
> > the commit, it read from the file and apply the entire transaction. To
> > improve the performance of such transactions, we can instead allow
> > them to be applied via background workers. There could be multiple
> > ways to achieve this:
> >
> > Approach-1: Assign a new bgworker (if available) as soon as the xact's
> > first stream came and the main apply worker will send changes to this
> > new worker via shared memory. We keep this worker assigned till the
> > transaction commit came and also wait for the worker to finish at
> > commit. This preserves commit ordering and avoid writing to and
> > reading from file in most cases. We still need to spill if there is no
> > worker available. We also need to allow stream_stop to complete by the
> > background worker to finish it to avoid deadlocks because T-1's
> > current stream of changes can update rows in conflicting order with
> > T-2's next stream of changes.
> >
>
> Attached is the POC patch for Approach-1 of "Perform streaming logical
> transactions by background workers". The patch is still a WIP patch, as
> there are several TODO items left, including:
>
> * error handling for bgworker
> * support for SKIP the transaction in bgworker
> * handle the case when there is no more worker available
>   (might need spill the data to the temp file in this case)
> * some potential bugs
>
> The original patch is borrowed from an old thread[1] and was rebased and
> extended/cleaned by me. Comments and suggestions are welcome.
>
> [1] 
> https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru
>
> Here are some performance results of the patch shared by Shi Yu off-list.
>
> The performance was tested by varying
> logical_decoding_work_mem, which include two cases:
>
> 1) bulk insert.
> 2) create savepoint and rollback to savepoint.
>
> I used synchronous logical replication in the test, compared SQL execution
> times before and after applying the patch.
>
> The results are as follows. The bar charts and the details of the test are
> Attached as well.
>
> RESULT - bulk insert (5kk)
> --
> logical_decoding_work_mem  64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
> HEAD                       51.673  51.199  51.166  50.259  52.898  50.651  51.156  51.210  50.678  51.256  51.138
> patched                    36.198  35.123  34.223  29.198  28.712  29.090  29.709  29.408  34.367  34.716  35.439
>
> RESULT - rollback to savepoint (600k)
> --
> logical_decoding_work_mem  64kB    128kB   256kB   512kB   1MB     2MB     4MB     8MB     16MB    32MB    64MB
> HEAD                       31.101  31.087  30.931  31.015  30.920  31.109  30.863  31.008  30.875  30.775  29.903
> patched                    28.115  28.487  27.804  28.175  27.734  29.047  28.279  27.909  28.277  27.345  28.375
>
>
> Summary:
> 1) bulk insert
>
> For different logical_decoding_work_mem sizes, it takes about 30% ~ 45% less
> time, which looks good to me. After applying this patch, it seems that the
> performance is better when logical_decoding_work_mem is between 512kB and 8MB.
>
> 2) rollback to savepoint
>
> There is an improvement of about 5% ~ 10% after applying this patch.
>
> In this case, the patch spends less time handling the part that is not
> rolled back, because it saves the time of writing the changes into a
> temporary file and reading them back. For the part that is rolled back,
> it spends more time than HEAD, because writing to the filesystem and
> rolling back take more time than writing a temporary file and
> truncating it. Overall, the results look good.

One comment on the design:
We should have a strategy to release workers that have finished
applying their transactions; otherwise, even though one subscription
has idle workers, they cannot be used by other subscriptions.
Like in the following case:
Let's say max_logical_replication_workers is set to 10, if
subscription sub_1 uses all the 10 workers to apply the transactions
and all the 10 workers have finished applying the transactions and
then subscription sub_2 requests some workers for applying
transactions, subscription sub_2 will not get any workers.
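The starvation scenario can be made concrete with a toy pool. This is not the patch's actual shared-memory structure — just an illustration of why a release-at-commit step matters:

```c
/*
 * Toy model (not the patch's actual data structures) of the worker-pool
 * starvation scenario: without a release step, one subscription can keep
 * every worker assigned even after its transactions have been applied.
 */
#include <stdbool.h>

#define MAX_WORKERS 10

typedef struct
{
    bool in_use;
    int  owner_sub;             /* which subscription holds this worker */
} ApplyWorkerSlot;

static ApplyWorkerSlot pool[MAX_WORKERS];

/* Returns a worker index, or -1 when the pool is exhausted. */
static int
acquire_worker(int sub_id)
{
    for (int i = 0; i < MAX_WORKERS; i++)
    {
        if (!pool[i].in_use)
        {
            pool[i].in_use = true;
            pool[i].owner_sub = sub_id;
            return i;
        }
    }
    return -1;                  /* sub_id starves */
}

/* The missing piece being suggested: hand the worker back at commit. */
static void
release_worker(int idx)
{
    pool[idx].in_use = false;
}
```

If sub_1 acquires all ten workers and never releases them, acquire_worker(2) returns -1 even though every worker is idle; releasing at transaction commit avoids that.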

Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-04-26 Thread Laurenz Albe
On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> With synchronous replication typically all the transactions (txns)
> first locally get committed, then streamed to the sync standbys and
> the backend that generated the transaction will wait for ack from sync
> standbys. While waiting for ack, it may happen that the query or the
> txn gets canceled (QueryCancelPending is true) or the waiting backend
> is asked to exit (ProcDiePending is true). In either of these cases,
> the wait for ack gets canceled and leaves the txn in an inconsistent
> state [...]
> 
> Here's a proposal (mentioned previously by Satya [1]) to avoid the
> above problems:
> 1) Wait a configurable amount of time before canceling the sync
> replication by the backends, i.e. delay processing of
> QueryCancelPending and ProcDiePending. Introduce a new timeout GUC,
> synchronous_replication_naptime_before_cancel; when set, it will let
> the backends wait for the ack before canceling the synchronous
> replication, so that the transaction can be available on sync standbys
> as well.
> 2) Wait for sync standbys to catch up upon restart after the crash or
> in the next txn after the old locally committed txn was canceled.

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby.  I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

Yours,
Laurenz Albe





Re: WIP: WAL prefetch (another approach)

2022-04-26 Thread Thomas Munro
On Tue, Apr 26, 2022 at 6:11 AM Tom Lane  wrote:
> I believe that the WAL prefetch patch probably accounts for the
> intermittent errors that buildfarm member topminnow has shown
> since it went in, eg [1]:
>
> diff -U3 
> /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out 
> /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out

Hmm, maybe but I suspect not.  I think I might see what's happening here.

> +ERROR:  could not read WAL at 0/1903E40

> I've reproduced this manually on that machine, and confirmed that the
> proximate cause is that XLogNextRecord() is returning NULL because
> state->decode_queue_head == NULL, without bothering to provide an errormsg
> (which doesn't seem very well thought out in itself).  I obtained the

Thanks for doing that.  After several hours of trying I also managed
to reproduce it on that gcc23 system (not at all sure why it doesn't
show up elsewhere; MIPS 32 bit layout may be a factor), and added some
trace to get some more clues.  Still looking into it, but here is the
current hypothesis I'm testing:

1.  The reason there's a messageless ERROR in this case is because
there is new read_page callback logic introduced for pg_walinspect,
called via read_local_xlog_page_no_wait(), which is like the old
read_local_xlog_page() except that it returns -1 if you try to read
past the current "flushed" LSN, and we have no queued message.  An
error is then reported by XLogReadRecord(), and appears to the user.

2.  The reason pg_walinspect tries to read WAL data past the flushed
LSN is because its GetWALRecordsInfo() function keeps calling
XLogReadRecord() until EndRecPtr >= end_lsn, where end_lsn is taken
from a snapshot of the flushed LSN, but I don't see where it takes
into account that the flushed LSN might momentarily fall in the middle
of a record.  In that case, xlogreader.c will try to read the next
page, which fails because it's past the flushed LSN (see point 1).

I will poke some more tomorrow to try to confirm this and try to come
up with a fix.
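The hypothesis in point 2 can be simulated with plain byte positions standing in for LSNs (the numbers are made up, purely illustrative):

```c
/*
 * Toy simulation of hypothesis 2 above: a GetWALRecordsInfo()-style loop
 * that keeps reading while EndRecPtr < end_lsn will attempt to read past
 * the flushed LSN whenever the snapshotted end_lsn falls in the middle
 * of a record.  LSNs are plain byte positions here.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtrSim;

/*
 * Does reading the record at [rec_start, rec_start + rec_len) require
 * bytes beyond the snapshotted flushed LSN?  The loop reads the record
 * because its start is below end_lsn, but decoding it needs data past
 * the flushed point -- exactly the case where the no-wait page callback
 * returns -1 and a messageless ERROR surfaces.
 */
static bool
record_straddles_flushed(XLogRecPtrSim rec_start,
                         XLogRecPtrSim rec_len,
                         XLogRecPtrSim flushed_lsn)
{
    return rec_start < flushed_lsn &&
           rec_start + rec_len > flushed_lsn;
}
```

A fix along the lines sketched in the message would stop the loop once the *next* record could not end at or before the flushed LSN, rather than reading and failing.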




RE: Skipping schema changes in publication

2022-04-26 Thread osumi.takami...@fujitsu.com
On Thursday, April 21, 2022 12:15 PM vignesh C  wrote:
> Updated patch by changing the syntax to use EXCEPT instead of SKIP.
Hi


Here are my review comments on the v2 patch.

(1) gram.y

I think we can make a unified function that merges
preprocess_alltables_pubobj_list with check_except_in_pubobj_list.

With regard to preprocess_alltables_pubobj_list,
we don't use the 2nd argument "location" in this function.

(2) create_publication.sgml

+  
+   Create a publication that publishes all changes in all the tables except for
+   the changes of users and
+   departments table;

This sentence should end with ":", not ";".

(3) publication.out & publication.sql

+-- fail - can't set except table to schema  publication
+ALTER PUBLICATION testpub_forschema SET EXCEPT TABLE testpub_tbl1;

There is one unnecessary space in the comment.
Kindly change from "schema  publication" to "schema publication".

(4) pg_dump.c & describe.c

In your first email of this thread, you explained that this feature
is for PG16. Don't we need an additional version branch for PG16?

@@ -6322,6 +6328,21 @@ describePublications(const char *pattern)
}
}

+   if (pset.sversion >= 15)
+   {


@@ -4162,7 +4164,7 @@ getPublicationTables(Archive *fout, TableInfo tblinfo[], int numTables)
/* Collect all publication membership info. */
if (fout->remoteVersion >= 15)
appendPQExpBufferStr(query,
-"SELECT tableoid, oid, prpubid, prrelid, "
+"SELECT tableoid, oid, prpubid, prrelid, prexcept,"


(5) psql-ref.sgml

+If + is appended to the command name, the tables,
+except tables and schemas associated with each publication are shown as
+well.

I'm not sure if "except tables" is a good description.
I suggest "excluded tables". This applies to the entire patch,
in case this is a reasonable suggestion.


Best Regards,
Takamichi Osumi