Re: Race condition in recovery?

2021-06-15 Thread Kyotaro Horiguchi
At Tue, 15 Jun 2021 07:54:49 -0400, Andrew Dunstan  wrote 
in 
> 
> On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> > At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane  wrote in 
> >> I think jacana uses msys[2?], so this likely indicates a problem
> >> in path sanitization for the archive command.  Andrew, any advice?
> > Thanks for fixing it.
> >
> > # I still haven't succeeded in running the TAP tests in an MSYS2
> > # environment. I cannot install IPC::Run for msys perl.
> >
> > regards.
> >
> 
> 
> Unpack the attached somewhere and point your PERL5LIB at it. That's all
> I do.

Thanks a lot, Andrew.  With it I was able to run the TAP test and saw the
same error as jacana.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-15 Thread Robert Haas
On Mon, Jun 14, 2021 at 3:47 PM Andrew Dunstan  wrote:
> So, will you feel happier with this applied? I haven't tested it yet but
> I'm confident it will work.

I'm not all that unhappy now, but yeah, that looks like an improvement
to me. I'm still afraid that I will keep writing tests that blow up on
Windows but that's a bigger problem than we can hope to fix on this
thread, and I do think this discussion has helped.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-15 Thread Andrew Dunstan

On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane  wrote in 
>> I think jacana uses msys[2?], so this likely indicates a problem
>> in path sanitization for the archive command.  Andrew, any advice?
> Thanks for fixing it.
>
> # I still haven't succeeded in running the TAP tests in an MSYS2
> # environment. I cannot install IPC::Run for msys perl.
>
> regards.
>


Unpack the attached somewhere and point your PERL5LIB at it. That's all
I do.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com



IPC-Run-Win-0.94.tgz
Description: application/compressed-tar


Re: Race condition in recovery?

2021-06-14 Thread Kyotaro Horiguchi
At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane  wrote in 
> I think jacana uses msys[2?], so this likely indicates a problem
> in path sanitization for the archive command.  Andrew, any advice?

Thanks for fixing it.

# I still haven't succeeded in running the TAP tests in an MSYS2
# environment. I cannot install IPC::Run for msys perl.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-14 Thread Andrew Dunstan


On 6/14/21 3:32 PM, Andrew Dunstan wrote:
> On 6/14/21 1:50 PM, Andrew Dunstan wrote:
>> On 6/14/21 1:11 PM, Robert Haas wrote:
>>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan  wrote:
 $^X is not at all broken.

 The explanation here is pretty simple - the argument to perl2host is
 meant to be a directory. If we're going to accommodate plain files then
 we have some more work to do in TestLib.
>>> This explanation seems to contradict the documentation in TestLib.pm,
>>> which makes no mention of any such restriction.
>> Here's a snippet:
>>
>>
>> sub perl2host
>> {
>> my ($subject) = @_;
>> ...
>> if (chdir $subject)
>>  
>>
>> Last time I looked you can't chdir to anything except a directory.
>
>
> Actually, I take it back, it does work for a file. I'll change it. I
> probably did this when something else wasn't working.




So, will you feel happier with this applied? I haven't tested it yet but
I'm confident it will work.


diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index e4e58cb8ab..3e19bc4c50 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -24,11 +24,11 @@ my $node_primary = get_new_node('primary');
 # the timeline history file reaches the archive but before any of the WAL files
 # get there.
 $node_primary->init(allows_streaming => 1, has_archiving => 1);
-my $perlbin = $^X;
-if ($^O eq 'msys')
-{
-   $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
-}
+
+# Note: consistent use of forward slashes here avoids any escaping problems
+# that arise from use of backslashes. That means we need to double-quote all
+# the paths in the archive_command
+my $perlbin = TestLib::perl2host($^X);
 $perlbin =~ s!\\!/!g if $TestLib::windows_os;
 my $archivedir_primary = $node_primary->archive_dir;
 $archivedir_primary =~ s!\\!/!g if $TestLib::windows_os;
@@ -36,6 +36,8 @@ $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '"$perlbin" "$FindBin::RealBin/cp_history_files" "%p" 
"$archivedir_primary/%f"'
 wal_keep_size=128MB
 ));
+# make sure that Msys perl doesn't complain about difficulty in setting locale
+# when called this way.
 local $ENV{PERL_BADLANG}=0;
 $node_primary->start;
 


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-14 Thread Andrew Dunstan


On 6/14/21 1:50 PM, Andrew Dunstan wrote:
> On 6/14/21 1:11 PM, Robert Haas wrote:
>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan  wrote:
>>> $^X is not at all broken.
>>>
>>> The explanation here is pretty simple - the argument to perl2host is
> meant to be a directory. If we're going to accommodate plain files then
>>> we have some more work to do in TestLib.
>> This explanation seems to contradict the documentation in TestLib.pm,
>> which makes no mention of any such restriction.
>
> Here's a snippet:
>
>
> sub perl2host
> {
> my ($subject) = @_;
> ...
> if (chdir $subject)
>  
>
> Last time I looked you can't chdir to anything except a directory.



Actually, I take it back, it does work for a file. I'll change it. I
probably did this when something else wasn't working.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-14 Thread Robert Haas
On Mon, Jun 14, 2021 at 1:50 PM Andrew Dunstan  wrote:
> Heres a snippet:
>
> sub perl2host
> {
> my ($subject) = @_;
> ...
> if (chdir $subject)
>
> Last time I looked you can't chdir to anything except a directory.

OK, but like I said, you can't tell that from the documentation. The
documentation says: "Translate a virtual file name to a host file
name. Currently, this is a no-op except for the case of Perl=msys and
host=mingw32.  The subject need not exist, but its parent or
grandparent directory must exist unless cygpath is available." If you
look just at that, there's nothing that would lead you to believe that
it has to be a directory name.
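
For what it's worth, here is a minimal sketch (not code from the tree, and
assuming the usual TAP environment where TestLib is loadable) of the two
forms under discussion: the committed workaround that translates only the
directory part of $^X, and the direct call that works once perl2host copes
with a plain file path:

use File::Basename qw(dirname basename);
use TestLib;

# Workaround from the original commit: translate the directory containing
# the perl interpreter, then glue the file name back on.
my $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);

# Direct form, once perl2host accepts a plain file: translate the whole
# interpreter path in one call.
my $perlbin_direct = TestLib::perl2host($^X);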

> I was trying to get the buildfarm green again. There have been plenty of
> times when small patches for such fixes have been committed directly. And
> that's the only circumstance in which I do.

I wasn't intending to criticize your work on this. I really appreciate
it, in fact, as I also said to you off-list. I do think that there
were some small things in those patches where a little bit of quick
discussion might have been useful: e.g. should the archive_command
change have gone in in the first place? Do we need any comments to
explain the fixes? But it's not like it's a big deal either. I'm
certainly not disagreeing with the goodness of making the buildfarm
green as expediently as possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-14 Thread Andrew Dunstan


On 6/14/21 1:11 PM, Robert Haas wrote:
> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan  wrote:
>> $^X is not at all broken.
>>
>> The explanation here is pretty simple - the argument to perl2host is
>> meant to be a directory. If we're going to accommodate plain files then
>> we have some more work to do in TestLib.
> This explanation seems to contradict the documentation in TestLib.pm,
> which makes no mention of any such restriction.


Here's a snippet:


sub perl2host
{
my ($subject) = @_;
...
if (chdir $subject)
 

Last time I looked you can't chdir to anything except a directory.


>
>>> +local $ENV{PERL_BADLANG}=0;
>>>
>>> Similarly here. There's not a single other reference to PERL_BADLANG
>>> in the repository, so if we need this one here, there should be a
>>> comment explaining why this is different from all the places where we
>>> don't need it.
>> Here's why this is different: this is the only place that we invoke the
>> msys perl in this way (i.e. from a non-msys aware environment - the
>> binaries we build are not msys-aware). We need to do that if for no
>> other reason than that it might well be the only perl available. Doing
>> so makes it complain loudly about missing locale info. Setting this
>> variable makes it shut up. I can add a comment on that if you like.
> Yes, please, but perhaps you'd like to post patches for discussion
> first instead of committing directly.


I was trying to get the buildfarm green again. There have been plenty of
times when small patches for such fixes have been committed directly. And
that's the only circumstance in which I do.



cheers


andrew





Re: Race condition in recovery?

2021-06-14 Thread Robert Haas
On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan  wrote:
> $^X is not at all broken.
>
> The explanation here is pretty simple - the argument to perl2host is
> meant to be a directory. If we're going to accommodate plain files then
> we have some more work to do in TestLib.

This explanation seems to contradict the documentation in TestLib.pm,
which makes no mention of any such restriction.

> > +local $ENV{PERL_BADLANG}=0;
> >
> > Similarly here. There's not a single other reference to PERL_BADLANG
> > in the repository, so if we need this one here, there should be a
> > comment explaining why this is different from all the places where we
> > don't need it.
>
> Here's why this is different: this is the only place that we invoke the
> msys perl in this way (i.e. from a non-msys aware environment - the
> binaries we build are not msys-aware). We need to do that if for no
> other reason than that it might well be the only perl available. Doing
> so makes it complain loudly about missing locale info. Setting this
> variable makes it shut up. I can add a comment on that if you like.

Yes, please, but perhaps you'd like to post patches for discussion
first instead of committing directly.

> Part of the trouble is that I've been living and breathing some of these
> issues so much recently that I forget that what might be fairly obvious
> to me isn't to others. I assure you there is not the faintest touch of
> pixy dust involved.

Every pixie with whom I've spoken today says otherwise!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-14 Thread Andrew Dunstan


On 6/14/21 11:52 AM, Robert Haas wrote:
> On Sat, Jun 12, 2021 at 10:20 AM Tom Lane  wrote:
>> Andrew Dunstan  writes:
>>> I have pushed a fix, tested on a replica of fairywren/drongo,
>> This bit seems a bit random:
>>
>>  # WAL segment, this is enough to guarantee that the history file was
>>  # archived.
>>  my $archive_wait_query =
>> -  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM 
>> pg_stat_archiver;";
>> +  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " 
>> .
>> +  "FROM pg_stat_archiver";
>>  $node_standby->poll_query_until('postgres', $archive_wait_query)
>>or die "Timed out while waiting for WAL segment to be archived";
>>  my $last_archived_wal_file = $walfile_to_be_archived;
>>
>> I wonder whether that is a workaround for the poll_query_until bug
>> I proposed to fix at [1].



This has been reverted.


> I found that a bit random too, but it wasn't the only part of the
> patch I found a bit random. Like, what can this possibly be doing?
>
> +if ($^O eq 'msys')
> +{
> +$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
> +}
>
> The idea here is apparently that on msys, the directory name that is
> part of $^X needs to be passed through perl2host, but the file name
> that is part of the same $^X needs to NOT be passed through perl2host.
> Is $^X really that broken? If so, I think some comments are in order.


$^X is not at all broken.


The explanation here is pretty simple - the argument to perl2host is
meant to be a directory. If we're going to accommodate plain files then
we have some more work to do in TestLib.


> +local $ENV{PERL_BADLANG}=0;
>
> Similarly here. There's not a single other reference to PERL_BADLANG
> in the repository, so if we need this one here, there should be a
> comment explaining why this is different from all the places where we
> don't need it.


Here's why this is different: this is the only place that we invoke the
msys perl in this way (i.e. from a non-msys aware environment - the
binaries we build are not msys-aware). We need to do that if for no
other reason than that it might well be the only perl available. Doing
so makes it complain loudly about missing locale info. Setting this
variable makes it shut up. I can add a comment on that if you like.
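
Something like this, say (a sketch of the comment-plus-setting that the
follow-up patch adds to 025_stuck_on_old_timeline.pl; the start call is just
context):

# make sure that Msys perl doesn't complain about difficulty in setting
# locale when called this way (via archive_command from a non-msys binary).
local $ENV{PERL_BADLANG} = 0;
$node_primary->start;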


> On those occasions when I commit TAP test cases, I do try to think
> about whether they are going to be portable, but I find these kinds of
> changes indistinguishable from magic.



Part of the trouble is that I've been living and breathing some of these
issues so much recently that I forget that what might be fairly obvious
to me isn't to others. I assure you there is not the faintest touch of
pixy dust involved.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-14 Thread Robert Haas
On Sat, Jun 12, 2021 at 10:20 AM Tom Lane  wrote:
> Andrew Dunstan  writes:
> > I have pushed a fix, tested on a replica of fairywren/drongo,
>
> This bit seems a bit random:
>
>  # WAL segment, this is enough to guarantee that the history file was
>  # archived.
>  my $archive_wait_query =
> -  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM 
> pg_stat_archiver;";
> +  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
> +  "FROM pg_stat_archiver";
>  $node_standby->poll_query_until('postgres', $archive_wait_query)
>or die "Timed out while waiting for WAL segment to be archived";
>  my $last_archived_wal_file = $walfile_to_be_archived;
>
> I wonder whether that is a workaround for the poll_query_until bug
> I proposed to fix at [1].

I found that a bit random too, but it wasn't the only part of the
patch I found a bit random. Like, what can this possibly be doing?

+if ($^O eq 'msys')
+{
+$perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
+}

The idea here is apparently that on msys, the directory name that is
part of $^X needs to be passed through perl2host, but the file name
that is part of the same $^X needs to NOT be passed through perl2host.
Is $^X really that broken? If so, I think some comments are in order.

+local $ENV{PERL_BADLANG}=0;

Similarly here. There's not a single other reference to PERL_BADLANG
in the repository, so if we need this one here, there should be a
comment explaining why this is different from all the places where we
don't need it.

On those occasions when I commit TAP test cases, I do try to think
about whether they are going to be portable, but I find these kinds of
changes indistinguishable from magic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-13 Thread Mikael Kjellström




On 2021-06-10 01:09, Tom Lane wrote:

Robert Haas  writes:

Got it. I have now committed the patch to all branches, after adapting
your changes just a little bit.
Thanks to you and Kyotaro-san for all the time spent on this. What a slog!


conchuela failed its first encounter with this test case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25

That machine has a certain, er, history of flakiness; so this may
not mean anything.  Still, we'd better keep an eye out to see if
the test needs more stabilization.


Yes, the flakiness is caused by the rather unusual filesystem (HAMMERFS),
which has garbage collection handling that sometimes fills up the disk and
then never recovers automatically.


I have tried putting the HAMMERFS cleanup utility in cron to run on a
schedule, but it isn't 100% foolproof.


So I am going to upgrade to a newer version of DragonflyBSD in the near 
future.


/Mikael




Re: Race condition in recovery?

2021-06-12 Thread Andrew Dunstan


On 6/12/21 1:54 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 6/12/21 1:07 PM, Tom Lane wrote:
>>> OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
>>> processed query looks different from all the other ones.
>> I'm happy to bring the other two queries that look like this into line
>> with this one if you like.
> I see a lot more than two --- grepping for poll_query_until with
> a test involving a LSN comparison finds a bunch.  Are we sure that
> there are only three in which the LSN could be null?  


Well, I'm counting places that specifically compare it with
pg_stat_archiver.last_archived_wal.



> How much
> does it really matter if it is?
>
>   


It makes it harder to tell if there was any result at all when there's a
failure. If it bugs you that much I can revert just that line. Now that
I have fixed the immediate issue it matters less. I'm not prepared to
put in a lot of effort here, though.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-12 Thread Tom Lane
Andrew Dunstan  writes:
> On 6/12/21 1:07 PM, Tom Lane wrote:
>> OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
>> processed query looks different from all the other ones.

> I'm happy to bring the other two queries that look like this into line
> with this one if you like.

I see a lot more than two --- grepping for poll_query_until with
a test involving a LSN comparison finds a bunch.  Are we sure that
there are only three in which the LSN could be null?  How much
does it really matter if it is?

regards, tom lane




Re: Race condition in recovery?

2021-06-12 Thread Andrew Dunstan


On 6/12/21 1:07 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 6/12/21 10:20 AM, Tom Lane wrote:
>>> I wonder whether that is a workaround for the poll_query_until bug
>>> I proposed to fix at [1].
>> No, it's because I found it annoying and confusing that there was an
>> invisible result when last_archived_wal is null.
> OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
> processed query looks different from all the other ones.
>
>   


I'm happy to bring the other two queries that look like this into line
with this one if you like.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-12 Thread Tom Lane
Andrew Dunstan  writes:
> On 6/12/21 10:20 AM, Tom Lane wrote:
>> I wonder whether that is a workaround for the poll_query_until bug
>> I proposed to fix at [1].

> No, it's because I found it annoying and confusing that there was an
> invisible result when last_archived_wal is null.

OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
processed query looks different from all the other ones.

regards, tom lane




Re: Race condition in recovery?

2021-06-12 Thread Andrew Dunstan


On 6/12/21 10:20 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> I have pushed a fix, tested on a replica of fairywren/drongo,
> This bit seems a bit random:
>
>  # WAL segment, this is enough to guarantee that the history file was
>  # archived.
>  my $archive_wait_query =
> -  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM 
> pg_stat_archiver;";
> +  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
> +  "FROM pg_stat_archiver";
>  $node_standby->poll_query_until('postgres', $archive_wait_query)
>or die "Timed out while waiting for WAL segment to be archived";
>  my $last_archived_wal_file = $walfile_to_be_archived;
>
> I wonder whether that is a workaround for the poll_query_until bug
> I proposed to fix at [1].
>
>   regards, tom lane
>
> [1] https://www.postgresql.org/message-id/2130215.1623450521%40sss.pgh.pa.us



No, it's because I found it annoying and confusing that there was an
invisible result when last_archived_wal is null.
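
To spell it out, the guarded wait now looks like this (same shape as the
hunk quoted above); coalesce() turns the NULL comparison into an explicit
false, so poll_query_until always sees a definite value while
last_archived_wal is still NULL:

my $archive_wait_query =
  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) "
  . "FROM pg_stat_archiver";
$node_standby->poll_query_until('postgres', $archive_wait_query)
  or die "Timed out while waiting for WAL segment to be archived";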


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-12 Thread Tom Lane
Andrew Dunstan  writes:
> I have pushed a fix, tested on a replica of fairywren/drongo,

This bit seems a bit random:

 # WAL segment, this is enough to guarantee that the history file was
 # archived.
 my $archive_wait_query =
-  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM 
pg_stat_archiver;";
+  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
+  "FROM pg_stat_archiver";
 $node_standby->poll_query_until('postgres', $archive_wait_query)
   or die "Timed out while waiting for WAL segment to be archived";
 my $last_archived_wal_file = $walfile_to_be_archived;

I wonder whether that is a workaround for the poll_query_until bug
I proposed to fix at [1].

regards, tom lane

[1] https://www.postgresql.org/message-id/2130215.1623450521%40sss.pgh.pa.us




Re: Race condition in recovery?

2021-06-12 Thread Andrew Dunstan


On 6/12/21 7:31 AM, Andrew Dunstan wrote:
> On 6/12/21 3:48 AM, Michael Paquier wrote:
>> On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote:
>>> I think jacana uses msys[2?], so this likely indicates a problem
>>> in path sanitization for the archive command.  Andrew, any advice?
>> Err, something around TestLib::perl2host()?
>
> I'm working on a fix for this. Yes it includes perl2host, but that's not
> enough :-)
>
>

I have pushed a fix, tested on a replica of fairywren/drongo,


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-12 Thread Andrew Dunstan


On 6/12/21 3:48 AM, Michael Paquier wrote:
> On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote:
>> I think jacana uses msys[2?], so this likely indicates a problem
>> in path sanitization for the archive command.  Andrew, any advice?
> Err, something around TestLib::perl2host()?


I'm working on a fix for this. Yes it includes perl2host, but that's not
enough :-)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Race condition in recovery?

2021-06-12 Thread Michael Paquier
On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote:
> I think jacana uses msys[2?], so this likely indicates a problem
> in path sanitization for the archive command.  Andrew, any advice?

Err, something around TestLib::perl2host()?
--
Michael


signature.asc
Description: PGP signature


Re: Race condition in recovery?

2021-06-11 Thread Tom Lane
Kyotaro Horiguchi  writes:
>> ==~_~===-=-===~_~== 
>> pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log 
>> ==~_~===-=-===~_~==
>> ...
>> 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG:  received 
>> replication command: START_REPLICATION SLOT "pg_basebackup_51792" 0/B00 
>> TIMELINE 1
>> 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT:  
>> START_REPLICATION SLOT "pg_basebackup_51792" 0/B00 TIMELINE 1
>> (log ends here)

> There seems to be some hardware failure?

conchuela has definitely evinced flakiness before.  Not sure what's
up with it, but I have no problem with writing off non-repeatable
failures from that machine.  In any case, it's now passed half a
dozen times in a row on HEAD, so I think we can say that it's okay
with this test.  That leaves jacana, which I'm betting has a
Windows portability issue with the new test.

regards, tom lane




Re: Race condition in recovery?

2021-06-11 Thread Tom Lane
Dilip Kumar  writes:
> On Fri, Jun 11, 2021 at 11:45 AM Kyotaro Horiguchi
>  wrote:
>>> ==~_~===-=-===~_~== 
>>> pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log
>>>  ==~_~===-=-===~_~==
>>> ...
>>> The system cannot find the path specified.
>>> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG:  archive command failed 
>>> with exit code 1
>>> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL:  The failed archive 
>>> command was: /usr/bin/perl 
>>> "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" 
>>> "pg_wal\\00010001" 
>>> "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00010001"

> WAL file copying will not create a problem for us, but I noticed that
> it is failing to copy the history files as well, and that is
> creating a problem.

I think jacana uses msys[2?], so this likely indicates a problem
in path sanitization for the archive command.  Andrew, any advice?

regards, tom lane




Re: Race condition in recovery?

2021-06-11 Thread Dilip Kumar
On Fri, Jun 11, 2021 at 11:45 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane  wrote in
> tgl> Please note that conchuela and jacana are still failing ...
>
> I forgot jacana's case..
>
> It is failing due to the issue the first patch should have fixed.
>
> > ==~_~===-=-===~_~== 
> > pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log
> >  ==~_~===-=-===~_~==
> ...
> > The system cannot find the path specified.
> > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG:  archive command failed 
> > with exit code 1
> > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL:  The failed archive 
> > command was: /usr/bin/perl 
> > "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" 
> > "pg_wal\\00010001" 
> > "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00010001"

WAL file copying will not create a problem for us, but I noticed that it is
failing to copy the history files as well, and that is creating a problem.

2021-06-10 22:56:28.940 EDT [60c2d0db.1208:1] LOG:  archive command
failed with exit code 1
2021-06-10 22:56:28.940 EDT [60c2d0db.1208:2] DETAIL:  The failed
archive command was: /usr/bin/perl
"/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files"
"pg_wal\\0002.history"
"/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/0002.history"

I have noticed that the archive command is failing in some other test
case too (002_archiving_standby2.log), see below logs.

==~_~===-=-===~_~==
pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log
==~_~===-=-===~_~==
...

0 file(s) copied.
2021-06-10 22:44:34.467 EDT [60c2ce10.1270:1] LOG:  archive command
failed with exit code 1
2021-06-10 22:44:34.467 EDT [60c2ce10.1270:2] DETAIL:  The failed
archive command was: copy "pg_wal\\0003.history"
"c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives\\0003.history"
The system cannot find the path specified.
0 file(s) copied.
2021-06-10 22:44:35.478 EDT [60c2ce10.1270:3] LOG:  archive command
failed with exit code 1
2021-06-10 22:44:35.478 EDT [60c2ce10.1270:4] DETAIL:  The failed
archive command was: copy "pg_wal\\0003.history"
"c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives\\0003.history"
2021-06-10 22:44:36.113 EDT [60c2ce0c.283c:5] LOG:  received immediate
shutdown request
2021-06-10 22:44:36.129 EDT [60c2ce0c.283c:6] LOG:  database system is shut down

I am not able to figure out why the archive command is failing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-10 Thread Kyotaro Horiguchi
At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane  wrote in 
tgl> Please note that conchuela and jacana are still failing ...

I forgot jacana's case..

It is failing due to the issue the first patch should have fixed.

> ==~_~===-=-===~_~== 
> pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log
>  ==~_~===-=-===~_~==
...
> The system cannot find the path specified.
> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG:  archive command failed 
> with exit code 1
> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL:  The failed archive 
> command was: /usr/bin/perl 
> "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" 
> "pg_wal\\00010001" 
> "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00010001"

cp_history_files exits with a bare "exit" for files with that name, which
should set the exit status to 0. ActivePerl did so.

If I specify a nonexistent command like /hoge/perl, %ERRORLEVEL% is set to
3, not 1.

I haven't figured out what is happening there yet.
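
For reference, cp_history_files is a ten-line Perl filter along these lines
(a sketch from memory, not a verbatim copy): it copies only history files
and silently exits for anything else, which is why a plain WAL segment
should yield exit status 0.

use File::Copy;
use strict;
use warnings;

die "wrong number of arguments" unless @ARGV == 2;
my ($source, $target) = @ARGV;
exit if $source !~ /history/;
copy($source, $target) or die "couldn't copy $source to $target: $!";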

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-10 Thread Kyotaro Horiguchi
At Fri, 11 Jun 2021 14:07:45 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane  wrote in 
> > conchuela's failure is evidently not every time, but this test
> > definitely postdates the "fix":

conchuela failed recovery_check this time, and

> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08
> So standby2 was stuck after selecting the new timeline and before
> updating the control file, and its postmaster couldn't even respond to
> SIGQUIT.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25

  This is before the "fix"

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-08%2014%3A07%3A46

  failed in pg_verifybackupCheck

> ==~_~===-=-===~_~== 
> pgsql.build/src/bin/pg_verifybackup/tmp_check/log/regress_log_003_corruption 
> ==~_~===-=-===~_~==
...
> #   Failed test 'base backup ok'
> #   at t/003_corruption.pl line 115.
> # Running: pg_verifybackup 
> /home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails
> pg_verifybackup: fatal: could not open file 
> "/home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails/backup_manifest":
>  No such file or directory
> not ok 38 - intact backup verified

The manifest file is missing from the backup. In this case, too, the
servers failed to handle SIGQUIT.

> ==~_~===-=-===~_~== 
> pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log 
> ==~_~===-=-===~_~==
...
> 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG:  received 
> replication command: START_REPLICATION SLOT "pg_basebackup_51792" 0/B00 
> TIMELINE 1
> 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT:  
> START_REPLICATION SLOT "pg_basebackup_51792" 0/B00 TIMELINE 1
(log ends here)

There seems to be some hardware failure?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-10 Thread Kyotaro Horiguchi
At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas  
> > wrote in 
> >> Thanks for the analysis and the patches. I have committed them.
> 
> > Thanks for committing it.
> 
> Please note that conchuela and jacana are still failing ...
> 
> conchuela's failure is evidently not every time, but this test
> definitely postdates the "fix":
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08

A different test is failing there.  Maybe from a different issue.


> ==~_~===-=-===~_~== 
> pgsql.build/src/test/recovery/tmp_check/log/regress_log_002_archiving 
> ==~_~===-=-===~_~==
...
> # Postmaster PID for node "standby2" is 342349
> ### Promoting node "standby2"
> # Running: pg_ctl -D 
> /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata
>  -l 
> /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log
>  promote
> waiting for server to 
> promote
>  stopped waiting
> pg_ctl: server did not promote in time
> Bail out!  system pg_ctl failed

> ==~_~===-=-===~_~== 
> pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log 
> ==~_~===-=-===~_~==
...
> 2021-06-10 16:21:21.870 CEST [342350:9] LOG:  received promote request
> 2021-06-10 16:21:21.870 CEST [342350:10] LOG:  redo done at 0/3030200 system 
> usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s
> 2021-06-10 16:21:21.870 CEST [342350:11] LOG:  last completed transaction was 
> at log time 2021-06-10 16:21:21.010599+02
> 2021-06-10 16:21:21.893 CEST [342350:12] LOG:  restored log file 
> "00010003" from archive
> cp: 
> /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/0003.history:
>  No such file or directory
> 2021-06-10 16:21:21.896 CEST [342350:13] LOG:  selected new timeline ID: 3
(log ends here)


> ==~_~===-=-===~_~== 
> pgsql.build/src/test/recovery/tmp_check/log/002_archiving_primary.log 
> ==~_~===-=-===~_~==
...
> 2021-06-10 16:21:21.107 CEST [342322:4] 002_archiving.pl LOG:  disconnection: 
> session time: 0:00:00.022 user=pgbf database=postgres host=[local]
> 2021-06-10 16:23:21.965 CEST [342279:4] LOG:  received immediate shutdown 
> request

So standby2 was stuck after selecting the new timeline and before updating
the control file, and its postmaster couldn't even respond to SIGQUIT.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-10 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas  wrote 
> in 
>> Thanks for the analysis and the patches. I have committed them.

> Thanks for committing it.

Please note that conchuela and jacana are still failing ...

conchuela's failure is evidently not every time, but this test
definitely postdates the "fix":

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08

regards, tom lane




Re: Race condition in recovery?

2021-06-10 Thread Kyotaro Horiguchi
At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas  wrote 
in 
> On Wed, Jun 9, 2021 at 9:12 PM Kyotaro Horiguchi
>  wrote:
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check
> >
> > > ==~_~===-=-===~_~== 
> > > pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log
> > >  ==~_~===-=-===~_~==
> > 
> > > 2021-06-09 23:31:10.439 CEST [893820:1] LOG:  started streaming WAL from 
> > > primary at 0/200 on timeline 1
> > > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL:  could not receive data 
> > > from WAL stream: ERROR:  requested WAL segment 00010002 
> > > has already been removed
> >
> > The script 025_stuck_on_old_timeline.pl (and I) forgot to set
> > wal_keep_size (wal_keep_segments on older branches).
> 
> Thanks for the analysis and the patches. I have committed them.

Thanks for committing it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-10 Thread Robert Haas
On Wed, Jun 9, 2021 at 9:12 PM Kyotaro Horiguchi
 wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check
>
> > ==~_~===-=-===~_~== 
> > pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log
> >  ==~_~===-=-===~_~==
> 
> > 2021-06-09 23:31:10.439 CEST [893820:1] LOG:  started streaming WAL from 
> > primary at 0/200 on timeline 1
> > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL:  could not receive data from 
> > WAL stream: ERROR:  requested WAL segment 00010002 has 
> > already been removed
>
> The script 025_stuck_on_old_timeline.pl (and I) forgot to set
> wal_keep_size (wal_keep_segments on older branches).

Thanks for the analysis and the patches. I have committed them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-09 Thread Kyotaro Horiguchi
At Wed, 09 Jun 2021 19:09:54 -0400, Tom Lane  wrote in 
> Robert Haas  writes:
> > Got it. I have now committed the patch to all branches, after adapting
> > your changes just a little bit.
> > Thanks to you and Kyotaro-san for all the time spent on this. What a slog!
> 
> conchuela failed its first encounter with this test case:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25
> 
> That machine has a certain, er, history of flakiness; so this may
> not mean anything.  Still, we'd better keep an eye out to see if
> the test needs more stabilization.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check

> ==~_~===-=-===~_~== 
> pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log
>  ==~_~===-=-===~_~==

> 2021-06-09 23:31:10.439 CEST [893820:1] LOG:  started streaming WAL from 
> primary at 0/200 on timeline 1
> 2021-06-09 23:31:10.439 CEST [893820:2] FATAL:  could not receive data from 
> WAL stream: ERROR:  requested WAL segment 00010002 has 
> already been removed

The script 025_stuck_on_old_timeline.pl (and I) forgot to set
wal_keep_size (wal_keep_segments on older branches).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 0d96bb3c15..25c2dff437 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -27,6 +27,7 @@ $perlbin =~ s{\\}{}g if ($TestLib::windows_os);
 my $archivedir_primary = $node_primary->archive_dir;
 $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+wal_keep_size=128MB
 ));
 $node_primary->start;
 
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 0d96bb3c15..8099571299 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -27,6 +27,7 @@ $perlbin =~ s{\\}{}g if ($TestLib::windows_os);
 my $archivedir_primary = $node_primary->archive_dir;
 $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+wal_keep_segments=8
 ));
 $node_primary->start;
 


Re: Race condition in recovery?

2021-06-09 Thread Tom Lane
Robert Haas  writes:
> Got it. I have now committed the patch to all branches, after adapting
> your changes just a little bit.
> Thanks to you and Kyotaro-san for all the time spent on this. What a slog!

conchuela failed its first encounter with this test case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25

That machine has a certain, er, history of flakiness; so this may
not mean anything.  Still, we'd better keep an eye out to see if
the test needs more stabilization.

regards, tom lane




Re: Race condition in recovery?

2021-06-09 Thread Robert Haas
On Wed, Jun 9, 2021 at 4:07 AM Dilip Kumar  wrote:
> The reason for the problem was that the "-Xnone" parameter was not
> accepted by "sub backup" in PostgresNode.pm, so I created that for the
> backpatch.  With the attached patches I am able to make it pass on v12,
> v11, and v10 (with the fix) and fail (without the fix).  However, we will
> have to make some change for 9.6 because pg_basebackup doesn't support
> -Xnone on 9.6; maybe we can delete the contents of pg_wal after the
> backup.  If that approach looks fine, I will make the changes for 9.6 as
> well.

Got it. I have now committed the patch to all branches, after adapting
your changes just a little bit.

Thanks to you and Kyotaro-san for all the time spent on this. What a slog!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-09 Thread Robert Haas
On Wed, Jun 9, 2021 at 4:07 AM Dilip Kumar  wrote:
> The reason for the problem was that the "-Xnone" parameter was not
> accepted by "sub backup" in PostgresNode.pm, so I created that for the
> backpatch.  With the attached patches I am able to make it pass on v12,
> v11, and v10 (with the fix) and fail (without the fix).  However, we will
> have to make some change for 9.6 because pg_basebackup doesn't support
> -Xnone on 9.6; maybe we can delete the contents of pg_wal after the
> backup.  If that approach looks fine, I will make the changes for 9.6 as
> well.

Ah. I looked into this and found that this is because commit
081876d75ea15c3bd2ee5ba64a794fd8ea46d794 is new in master, so actually
that change is absent in all the back-branches. I have now back-ported
that portion of that commit to v13, v12, v11, and v10.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-09 Thread Dilip Kumar
On Wed, Jun 9, 2021 at 1:37 PM Dilip Kumar  wrote:
>
> On Wed, Jun 9, 2021 at 12:14 PM Dilip Kumar  wrote:
> >
> > On Wed, Jun 9, 2021 at 2:07 AM Robert Haas  wrote:
> > 2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
> > 2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file 
> > "0002.history" from archive
> > cp: cannot stat 
> > ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00010002’:
> >  No such file or directory
> > 2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/228
> > 2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state 
> > reached at 0/300
> >
> Next, I will investigate why, without the fix, v11 (and maybe v12, v10,
> ...) is not hitting the defect location at all.  After that, I will check
> the status on the other older versions.
>
> The reason for the problem was that the "-Xnone" parameter was not
> accepted by "sub backup" in PostgresNode.pm, so I created that for the
> backpatch.  With the attached patches I am able to make it pass on v12,
> v11, and v10 (with the fix) and fail (without the fix).  However, we will
> have to make some change for 9.6 because pg_basebackup doesn't support
> -Xnone on 9.6; maybe we can delete the contents of pg_wal after the
> backup.  If that approach looks fine, I will make the changes for 9.6 as
> well.
>
> Note: for the param backport, the same patch applies to v12 and v11, but
> for v10 we need a separate patch due to a conflict (both attached).

I have fixed it for 9.6 as well by removing the WAL from the xlog
directory. I am attaching all the patches in a single mail to avoid
confusion.

Note:
v7-0001 applies to master, v13, and v12 (but for v12 we need to apply the
backport patch first)
v12-v8-0001-Backport is the same as v11-v8-0001-Backport (duplicated for
version-wise separation)
v11-v8-0002 is the same as v10-v8-0002

Basically, the same backport patch works for v12 and v11, and the same main
patch works for v11 and v10; I still duplicated them to avoid confusion.
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 8b96729a51015e850672b2d685ec1b61490f967c Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 8 Jun 2021 14:31:22 -0400
Subject: [PATCH v8 1/2] Back-port a few PostgresNode.pm methods.

The 'lsn' and 'wait_for_catchup' methods only exist in v10 and
higher, but are needed in order to support a planned test case
for a bug that exists all the way back to v9.6. To minimize
cross-branch differences in the test case, back-port these
methods.

Discussion: http://postgr.es/m/ca+tgmoag5dma_8xc1wvbvftpjtwx5uzkgehxe7mij+im9jy...@mail.gmail.com
---
 src/test/perl/PostgresNode.pm | 83 +++
 1 file changed, 83 insertions(+)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 5c59265..5f5a7d2 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1472,6 +1472,89 @@ sub issues_sql_like
 
 =pod
 
+=item $node->lsn(mode)
+
+Look up WAL locations on the server:
+
+ * insert location (master only, error on replica)
+ * write location (master only, error on replica)
+ * flush location (master only, error on replica)
+ * receive location (always undef on master)
+ * replay location (always undef on master)
+
+mode must be specified.
+
+=cut
+
+sub lsn
+{
+	my ($self, $mode) = @_;
+	my %modes = (
+		'insert'  => 'pg_current_xlog_insert_location()',
+		'flush'   => 'pg_current_xlog_flush_location()',
+		'write'   => 'pg_current_xlog_location()',
+		'receive' => 'pg_last_xlog_receive_location()',
+		'replay'  => 'pg_last_xlog_replay_location()');
+
+	$mode = '' if !defined($mode);
+	die "unknown mode for 'lsn': '$mode', valid modes are "
+	  . join(', ', keys %modes)
+	  if !defined($modes{$mode});
+
+	my $result = $self->safe_psql('postgres', "SELECT $modes{$mode}");
+	chomp($result);
+	if ($result eq '')
+	{
+		return;
+	}
+	else
+	{
+		return $result;
+	}
+}
+
+=pod
+
+=item $node->wait_for_catchup(standby_name, mode, target_lsn)
+
+Wait for the node with application_name standby_name (usually from node->name)
+until its replication position in pg_stat_replication equals or passes the
+upstream's xlog insert point at the time this function is called. By default
+the replay_location is waited for, but 'mode' may be specified to wait for any
+of sent|write|flush|replay.
+
+If there is no active replication connection from this peer, waits until
+poll_query_until timeout.
+
+Requires that the 'postgres' db exists and is accessible.
+
+target_lsn may be any arbitrary lsn, but is typically $master_node->lsn('insert').
+
+This is not a test. It die()s on failure.
+
+=cut
+
+sub wait_for_catchup
+{
+	my ($self, $standby_name, $mode, $target_lsn) = @_;
+	$mode = defined($mode) ? $mode : 'replay';
+	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1 );
+	die "unknown mode $mode for 'wait_for_catchup', valid modes are " 

Re: Race condition in recovery?

2021-06-09 Thread Dilip Kumar
On Wed, Jun 9, 2021 at 12:14 PM Dilip Kumar  wrote:
>
> On Wed, Jun 9, 2021 at 2:07 AM Robert Haas  wrote:
> 2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
> 2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file 
> "0002.history" from archive
> cp: cannot stat 
> ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00010002’:
>  No such file or directory
> 2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/228
> 2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state reached 
> at 0/300
>
> Next, I will investigate why, without the fix, v11 (and maybe v12, v10,
> ...) is not hitting the defect location at all.  After that, I will check
> the status on the other older versions.

The reason for the problem was that the "-Xnone" parameter was not
accepted by "sub backup" in PostgresNode.pm, so I created that for the
backpatch.  With the attached patches I am able to make it pass on v12,
v11, and v10 (with the fix) and fail (without the fix).  However, we will
have to make some change for 9.6 because pg_basebackup doesn't support
-Xnone on 9.6; maybe we can delete the contents of pg_wal after the
backup.  If that approach looks fine, I will make the changes for 9.6 as
well.

Note: for the param backport, the same patch applies to v12 and v11, but
for v10 we need a separate patch due to a conflict (both attached).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 3782d36bd8821b1e7785fbd247aafda6a6cf8975 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 9 Jun 2021 13:15:22 +0530
Subject: [PATCH] Back-port backup param in PostgresNode.pm

---
 src/test/perl/PostgresNode.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index fdcc159..9d895c1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -512,13 +512,13 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name= $self->name;
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h',
-		$self->host, '-p', $self->port, '--no-sync');
+		$self->host, '-p', $self->port, '--no-sync', @{ $params{backup_options} });
 	print "# Backup finished\n";
 }
 
-- 
1.8.3.1

From a52e20bd0bde14d5e194e3d853b9f6ea72019ad5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 9 Jun 2021 12:52:42 +0530
Subject: [PATCH v8] Back port backup param in PostgresNode.pm

---
 src/test/perl/PostgresNode.pm | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 61aa048..beb7bc1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -548,7 +548,7 @@ target server since it isn't done by default.
 
 sub backup
 {
-	my ($self, $backup_name) = @_;
+	my ($self, $backup_name, %params) = @_;
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name= $self->name;
 
@@ -556,7 +556,8 @@ sub backup
 	TestLib::system_or_bail(
 		'pg_basebackup', '-D', $backup_path, '-h',
 		$self->host, '-p', $self->port,  '--checkpoint',
-		'fast',  '--no-sync');
+		'fast',  '--no-sync',
+		@{ $params{backup_options} });
 	print "# Backup finished\n";
 	return;
 }
-- 
1.8.3.1

From 0c8e0ebeb480b434b58e29c46594f873db3f7087 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Tue, 8 Jun 2021 12:52:55 -0400
Subject: [PATCH v8 2/2] Fix corner case failure of new standby to follow new
 primary.

This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for it are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.

Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.

Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CAFiTN-sE-jr=lb8jquxeqikd-ux+

Re: Race condition in recovery?

2021-06-08 Thread Dilip Kumar
On Wed, Jun 9, 2021 at 2:07 AM Robert Haas  wrote:

> Then I tried to get things working on 9.6. There's a patch attached to
> back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
> a version of the main patch attached with the necessary wal->xlog,
> lsn->location renaming. Unfortunately ... the new test case still
> fails on 9.6 in a way that looks an awful lot like the bug isn't
> actually fixed:
>
> LOG:  primary server contains no more WAL on requested timeline 1
> cp:
> /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/00010003:
> No such file or directory
> (repeated many times)
>
> I find that the same failure happens if I back-port the master version
> of the patch to v10 or v11,


I think this fails because, prior to v12, the recovery target TLI was not
set to 'latest' by default, since it was not a GUC at that time.  So after
the fix below it started passing on v11 (only tested on v11 so far).


diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl
b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 842878a..b3ce5da 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -50,6 +50,9 @@ my $node_cascade = get_new_node('cascade');
 $node_cascade->init_from_backup($node_standby, $backup_name,
has_streaming => 1);
 $node_cascade->enable_restoring($node_primary);
+$node_cascade->append_conf('recovery.conf', qq(
+recovery_target_timeline='latest'
+));

But now it started passing even without the fix, and the log says that it
never tried to stream from the primary using timeline 1, so it never hit
the defect location.

2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file
"0002.history" from archive
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00010002’:
No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/228
2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state
reached at 0/300

Next, I will investigate why, without the fix, v11 (and maybe v12, v10,
...) is not hitting the defect location at all.  After that, I will check
the status on the other older versions.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Race condition in recovery?

2021-06-08 Thread Robert Haas
On Tue, Jun 8, 2021 at 12:26 PM Robert Haas  wrote:
> I think the problem is here:
>
> Can't locate object method "lsn" via package "PostgresNode" at
> t/025_stuck_on_old_timeline.pl line 84.
>
> When that happens, it bails out, and cleans everything up, doing an
> immediate shutdown of all the nodes. The 'lsn' method was added by
> commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
> v10 and later. I think maybe we can think of back-porting that to 9.6.

Here's an updated set of patches. I removed the extra teardown_node
calls per Kyotaro Horiguchi's request. I adopted his suggestion for
setting a $perlbin variable from $^X, but found that $perlbin was
undefined, so I split the incantation into two lines to fix that. I
updated the code to use ->promote() instead of calling pg_promote(),
and to use poll_query_until() afterwards to wait for promotion as
suggested by Dilip. Also, I added a comment to the change in xlog.c.
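
For clarity, the promotion sequence is now roughly the following sketch,
using the standard PostgresNode helpers rather than calling pg_promote()
via SQL (the wait query is the one Dilip suggested upthread):

$node_standby->promote;
$node_standby->poll_query_until('postgres',
    "SELECT pg_is_in_recovery() = 'f';")
  or die "Timed out while waiting for promotion";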

Then I tried to get things working on 9.6. There's a patch attached to
back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
a version of the main patch attached with the necessary wal->xlog,
lsn->location renaming. Unfortunately ... the new test case still
fails on 9.6 in a way that looks an awful lot like the bug isn't
actually fixed:

LOG:  primary server contains no more WAL on requested timeline 1
cp: 
/Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/00010003:
No such file or directory
(repeated many times)

I find that the same failure happens if I back-port the master version
of the patch to v10 or v11, but if I back-port it to v12 or v13 then
the test passes as expected. I haven't figured out what the issue is
yet. I also noticed that if I back-port it to v12 and then revert the
code change, the test still passes. So I think there may be something
subtly wrong with this test case yet. Or maybe a code bug.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v7-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch
Description: Binary data


9.6-v7-0002-Fix-corner-case-failure-of-new-standby-to-follow-.patch
Description: Binary data


9.6-v7-0001-Back-port-a-few-PostgresNode.pm-methods.patch
Description: Binary data


Re: Race condition in recovery?

2021-06-08 Thread Robert Haas
On Tue, Jun 8, 2021 at 4:47 AM Dilip Kumar  wrote:
> I have made the changes for 9.6, but I am seeing a crash (both with and
> without the fix).  I could not figure out the reason; it did not generate
> any core dump, although I changed pg_ctl in PostgresNode.pm to use "-c"
> so that it could generate a core, but it did not generate any core file.

I think the problem is here:

Can't locate object method "lsn" via package "PostgresNode" at
t/025_stuck_on_old_timeline.pl line 84.

When that happens, it bails out, and cleans everything up, doing an
immediate shutdown of all the nodes. The 'lsn' method was added by
commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
v10 and later. I think maybe we can think of back-porting that to 9.6.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-08 Thread Dilip Kumar
On Tue, Jun 8, 2021 at 11:13 AM Dilip Kumar  wrote:
>
> # Wait until the node exits recovery.
> $standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
> or die "Timed out while waiting for promotion";
>
> I will try to generate a version for 9.6 based on this idea and see how it 
> goes

I have made the changes as per 9.6, but I am seeing some crash (both
with and without the fix). I could not figure out the reason; it did not
generate any core dump, even though I changed pg_ctl in PostgresNode.pm
to use "-c" so that it could generate a core file.

This is log from cascading node (025_stuck_on_old_timeline_cascade.log)
-
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_52dW/archives/00010003’:
No such file or directory
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  could not receive database system identifier and timeline ID
from the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
--

The attached logs are when I ran without a fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 42886664062e4aae35ee33a6b3a87bc242ec86ad Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Fri, 4 Jun 2021 10:55:11 -0400
Subject: [PATCH v6] Fix corner case failure of new standby to follow new
 primary.

This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.

Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.

Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CAFiTN-sE-jr=lb8jquxeqikd-ux+jhixyh4ydizmpedgqku...@mail.gmail.com
---
 src/backend/access/transam/xlog.c|  2 +-
 src/test/recovery/t/025_stuck_on_old_timeline.pl | 96 
 src/test/recovery/t/cp_history_files | 10 +++
 3 files changed, 107 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/025_stuck_on_old_timeline.pl
 create mode 100644 src/test/recovery/t/cp_history_files

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9ad6da9..a7ee14b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11839,7 +11839,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		if (readFile < 0)
 		{
 			if (!expectedTLEs)
-expectedTLEs = readTimeLineHistory(receiveTLI);
+expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
 			readFile = XLogFileRead(readSegNo, PANIC,
 	receiveTLI,
 	XLOG_FROM_STREAM, false);
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
new file mode 100644
index 000..1eafbbd
--- /dev/null
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -0,0 +1,96 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Testing streaming replication where standby is promoted and a new cascading
+# standby (without WAL) is connected to the promoted standby.  Both archiving
+# and streaming are enabled, but only the history file is available from the
+# archive, so the WAL files all have to be streamed.  Test that the cascading
+# standby can follow the new primary (promoted standby).
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use FindBin;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+
+# Set up an archive command that will copy the history file but not the WAL
+# files. No real archive command should behave this way; the p

Re: Race condition in recovery?

2021-06-07 Thread Dilip Kumar
On Tue, Jun 8, 2021 at 12:32 AM Robert Haas  wrote:
>
> I tried back-porting my version of this patch to 9.6 to see what would
> happen there. One problem is that some of the functions have different
> names before v10. So 9.6 needs this:
>
> -"SELECT pg_walfile_name(pg_current_wal_lsn());");
> +"SELECT pg_xlogfile_name(pg_current_xlog_location());");
>
> But there's also another problem, which is that this doesn't work before v12:
>
> $node_standby->psql('postgres', 'SELECT pg_promote()');
>
> So I tried changing it to this:
>
> $node_standby->promote;
>
> But then the test fails, because pg_promote() has logic built into it
> to wait until the promotion actually happens, but ->promote doesn't,
> so SELECT pg_walfile_name(pg_current_wal_lsn()) errors out because the
> system is still in recovery. I'm not sure what to do about that. I
> quickly tried adding -w to 'sub promote' in PostgresNode.pm, but that
> didn't fix it, so I guess we'll have to find some other way to wait
> until the promotion is complete.
>

Maybe we can use it ?

# Wait until the node exits recovery.
$standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
or die "Timed out while waiting for promotion";

I will try to generate a version for 9.6 based on this idea and see how it goes
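
Put together, a minimal sketch of that approach for the 9.6 branch (assuming
the PostgresNode methods used elsewhere in this thread) could look like this:

# Promote without waiting; ->promote does not block until promotion
# completes on the older branches.
$node_standby->promote;

# Poll until the node reports that it has left recovery.
$node_standby->poll_query_until('postgres',
	"SELECT pg_is_in_recovery() = 'f';")
  or die "Timed out while waiting for promotion";

# Only now is it safe to ask for the current write position (9.6 spelling).
my $walfile_to_be_archived = $node_standby->safe_psql('postgres',
	"SELECT pg_xlogfile_name(pg_current_xlog_location());");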

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-07 Thread Kyotaro Horiguchi
At Mon, 7 Jun 2021 10:40:27 -0400, Robert Haas  wrote in 
> On Mon, Jun 7, 2021 at 12:57 AM Kyotaro Horiguchi
>  wrote:
> > Unfortunately no. The backslashes in the binary path need to be
> > escaped. (taken from PostgresNode.pm:1008)
> >
> > > (my $perlbin = $^X) =~ s{\\}{}g if ($TestLib::windows_os);
> > > $node_primary->append_conf(
> > >   'postgresql.conf', qq(
> > > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" 
> > > "$archivedir_primary/%f"'
> > > ));
> >
> > This works for me.
> 
> Hmm, OK. Do you think we also need to use perl2host in this case?

I understand that perl2host converts a '/some/where' style path to the
native Windows path 'X:/any/where' if needed. Since perl's $^X is
already in native style, I think we don't need to use it.

> > Ugh! Sorry. I meant "The explicit teardowns are useless". That's not
> > harmful but it is done by PostgresNode.pm automatically(implicitly)
> > and we don't do that in the existing scripts.
> 
> OK. I don't think it's a big deal, but we can remove them.

Thanks.

> I went back and looked at your patch again, now that I understand the
> issue better. I believe it's not necessary to do this here, because
> StartupXLOG() already contains a check for the same thing:
> 
> /*
>  * If the location of the checkpoint record is not on the expected
>  * timeline in the history of the requested timeline, we cannot proceed:
>  * the backup is not part of the history of the requested timeline.
>  */
> Assert(expectedTLEs);   /* was initialized by reading checkpoint
>  * record */
> if (tliOfPointInHistory(checkPointLoc, expectedTLEs) !=
> checkPoint.ThisTimeLineID)
> ...
> 
> This code is always run after ReadCheckpointRecord() returns. And I
> think that your only concern here is about the case where the
> checkpoint record is being fetched, because otherwise expectedTLEs
> must already be set.

Sure. Thanks for confirming that, and agreed.

> By the way, I also noticed that your version of the patch contains a
> few words which are spelled incorrectly: hearafter, and incosistent.

Mmm. Sorry for them..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-07 Thread Robert Haas
Hi,

I tried back-porting my version of this patch to 9.6 to see what would
happen there. One problem is that some of the functions have different
names before v10. So 9.6 needs this:

-"SELECT pg_walfile_name(pg_current_wal_lsn());");
+"SELECT pg_xlogfile_name(pg_current_xlog_location());");

But there's also another problem, which is that this doesn't work before v12:

$node_standby->psql('postgres', 'SELECT pg_promote()');

So I tried changing it to this:

$node_standby->promote;

But then the test fails, because pg_promote() has logic built into it
to wait until the promotion actually happens, but ->promote doesn't,
so SELECT pg_walfile_name(pg_current_wal_lsn()) errors out because the
system is still in recovery. I'm not sure what to do about that. I
quickly tried adding -w to 'sub promote' in PostgresNode.pm, but that
didn't fix it, so I guess we'll have to find some other way to wait
until the promotion is complete.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-07 Thread Robert Haas
On Mon, Jun 7, 2021 at 12:57 AM Kyotaro Horiguchi
 wrote:
> Unfortunately no. The backslashes in the binary path need to be
> escaped. (taken from PostgresNode.pm:1008)
>
> > (my $perlbin = $^X) =~ s{\\}{}g if ($TestLib::windows_os);
> > $node_primary->append_conf(
> >   'postgresql.conf', qq(
> > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" 
> > "$archivedir_primary/%f"'
> > ));
>
> This works for me.

Hmm, OK. Do you think we also need to use perl2host in this case?

> Ugh! Sorry. I meant "The explicit teardowns are useless". That's not
> harmful but it is done by PostgresNode.pm automatically(implicitly)
> and we don't do that in the existing scripts.

OK. I don't think it's a big deal, but we can remove them.

> As I said upthread the relationship between receiveTLI and
> recoveryTargetTLI is not confirmed yet at the point.
> findNewestTimeLine() simply searches for the history file with the
> largest timeline id so the returned there's a case where the timeline
> id that the function returns is not a future of the latest checkpoint
> TLI.  I think that the fact that rescanLatestTimeLine() checks the
> relationship is telling us that we need to do the same in the path as
> well.
>
> In my previous proposal, it is done just after the line the patch
> touches but it can be in the if (fetching_ckpt) branch.

I went back and looked at your patch again, now that I understand the
issue better. I believe it's not necessary to do this here, because
StartupXLOG() already contains a check for the same thing:

/*
 * If the location of the checkpoint record is not on the expected
 * timeline in the history of the requested timeline, we cannot proceed:
 * the backup is not part of the history of the requested timeline.
 */
Assert(expectedTLEs);   /* was initialized by reading checkpoint
 * record */
if (tliOfPointInHistory(checkPointLoc, expectedTLEs) !=
checkPoint.ThisTimeLineID)
...

This code is always run after ReadCheckpointRecord() returns. And I
think that your only concern here is about the case where the
checkpoint record is being fetched, because otherwise expectedTLEs
must already be set.

By the way, I also noticed that your version of the patch contains a
few words which are spelled incorrectly: hearafter, and incosistent.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-06 Thread Kyotaro Horiguchi
Sorry, some extra words are left alone.

At Mon, 07 Jun 2021 13:57:35 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> As I said upthread the relationship between receiveTLI and
> recoveryTargetTLI is not confirmed yet at the point.
- findNewestTimeLine() simply searches for the history file with the
- largest timeline id so the returned there's a case where the timeline
+ findNewestTimeLine() simply searches for the history file with the
+ largest timeline id so there's a case where the timeline
> id that the function returns is not a future of the latest checkpoint
> TLI.  I think that the fact that rescanLatestTimeLine() checks the
> relationship is telling us that we need to do the same in the path as
> well.
> 
> In my previous proposal, it is done just after the line the patch
> touches but it can be in the if (fetching_ckpt) branch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-06 Thread Kyotaro Horiguchi
At Fri, 4 Jun 2021 10:56:12 -0400, Robert Haas  wrote in 
> On Fri, Jun 4, 2021 at 5:25 AM Kyotaro Horiguchi
>  wrote:
> > I think that's right. And the test script detects the issue for me
> > both on Linux but doesn't work for Windows.
> >
> > '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is 
> > not recognized as an internal command or external command ..
> 
> Hmm, that's a problem. Can you try the attached version?

Unfortunately no. The backslashes in the binary path need to be
escaped. (taken from PostgresNode.pm:1008)

> (my $perlbin = $^X) =~ s{\\}{}g if ($TestLib::windows_os);
> $node_primary->append_conf(
>   'postgresql.conf', qq(
> archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" 
> "$archivedir_primary/%f"'
> ));

This works for me.

> > + # clean up
> > + $node_primary->teardown_node;
> > + $node_standby->teardown_node;
> > + $node_cascade->teardown_node;
> >
> > I don't think explicit teardown is useless as the final cleanup.
> 
> I don't know what you mean by this. If it's not useless, good, because
> we're doing it. Or do you mean that you think it is useless, and we
> should remove it?

Ugh! Sorry. I meant "The explicit teardowns are useless". That's not
harmful but it is done by PostgresNode.pm automatically(implicitly)
and we don't do that in the existing scripts.

> > By the way the attached patch is named as "Fix-corner-case..." but
> > doesn't contain the fix. Is it intentional?
> 
> No, that was a goof.

As I said upthread the relationship between receiveTLI and
recoveryTargetTLI is not confirmed yet at the point.
findNewestTimeLine() simply searches for the history file with the
largest timeline id so the returned there's a case where the timeline
id that the function returns is not a future of the latest checkpoint
TLI.  I think that the fact that rescanLatestTimeLine() checks the
relationship is telling us that we need to do the same in the path as
well.

In my previous proposal, it is done just after the line the patch
touches but it can be in the if (fetching_ckpt) branch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-04 Thread Robert Haas
On Fri, Jun 4, 2021 at 5:25 AM Kyotaro Horiguchi
 wrote:
> I think that's right. And the test script detects the issue for me
> both on Linux but doesn't work for Windows.
>
> '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is 
> not recognized as an internal command or external command ..

Hmm, that's a problem. Can you try the attached version?

> + # clean up
> + $node_primary->teardown_node;
> + $node_standby->teardown_node;
> + $node_cascade->teardown_node;
>
> I don't think explicit teardown is useless as the final cleanup.

I don't know what you mean by this. If it's not useless, good, because
we're doing it. Or do you mean that you think it is useless, and we
should remove it?

> By the way the attached patch is named as "Fix-corner-case..." but
> doesn't contain the fix. Is it intentional?

No, that was a goof.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v6-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch
Description: Binary data


Re: Race condition in recovery?

2021-06-04 Thread Robert Haas
On Fri, Jun 4, 2021 at 3:51 AM Dilip Kumar  wrote:
> I could not reproduce this but I think I got the issue, I think I used
> the wrong target LSN in wait_for_catchup, instead of checking the last
> "insert LSN" of the standby I was waiting for last "replay LSN" of
> standby which was wrong.  Changed as below in the attached patch.

Yeah, that fixes it for me. Thanks.

With that change, this test reliably passes for me with the fix, and
reliably fails for me without the fix. Woohoo!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-06-04 Thread Kyotaro Horiguchi
At Fri, 4 Jun 2021 13:21:08 +0530, Dilip Kumar  wrote in 
> On Fri, Jun 4, 2021 at 2:03 AM Robert Haas  wrote:
> >
> > On Thu, May 27, 2021 at 2:26 AM Dilip Kumar  wrote:
> > > Changed as suggested.
> >
> > I don't think the code as written here is going to work on Windows,
> > because your code doesn't duplicate enable_restoring's call to
> > perl2host or its backslash-escaping logic. It would really be better
> > if we could use enable_restoring directly. Also, I discovered that the
> > 'return' in cp_history_files should really say 'exit', because
> > otherwise it generates a complaint every time it's run. It should also
> > have 'use strict' and 'use warnings' at the top.
> 
> Ok
> 
> > Here's a version of your test case patch with the 1-line code fix
> > added, the above issues addressed, and a bunch of cosmetic tweaks.
> > Unfortunately, it doesn't pass for me consistently. I'm not sure if
> > that's because I broke something with my changes, or because the test
> > contains an underlying race condition which we need to address.
> > Attached also are the log files from a failed run if you want to look
> > at them. The key lines seem to be:
> 
> I could not reproduce this but I think I got the issue, I think I used
> the wrong target LSN in wait_for_catchup, instead of checking the last
> "insert LSN" of the standby I was waiting for last "replay LSN" of
> standby which was wrong.  Changed as below in the attached patch.

I think that's right. And the test script detects the issue for me
on Linux, but it doesn't work on Windows.

'"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not 
recognized as an internal command or external command ..

Because Windows' cmd.exe doesn't have the shbang feature. On Windows,
maybe archive_command should be like

'".../perl" "$FindBin../cp_history_files" "%p"...

If I did this I got another error.

"couldn't copy pg_wal\0002.history to C:/../Documents/work/postgresql/src   
est^Mecovery/tmp_check/t_000_a_primary_data/archives/0002.history:  at 
C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files line 10.^M"

("^M" are the replacement for carrage return)
So.. I'm not sure what is happening but the error messages, or..
Anyway I don't have a time to investigate it.


+ # clean up
+ $node_primary->teardown_node;
+ $node_standby->teardown_node;
+ $node_cascade->teardown_node;

I don't think explicit teardown is useless as the final cleanup.

By the way the attached patch is named as "Fix-corner-case..." but
doesn't contain the fix. Is it intentional?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-04 Thread Dilip Kumar
On Fri, Jun 4, 2021 at 2:03 AM Robert Haas  wrote:
>
> On Thu, May 27, 2021 at 2:26 AM Dilip Kumar  wrote:
> > Changed as suggested.
>
> I don't think the code as written here is going to work on Windows,
> because your code doesn't duplicate enable_restoring's call to
> perl2host or its backslash-escaping logic. It would really be better
> if we could use enable_restoring directly. Also, I discovered that the
> 'return' in cp_history_files should really say 'exit', because
> otherwise it generates a complaint every time it's run. It should also
> have 'use strict' and 'use warnings' at the top.

Ok

> Here's a version of your test case patch with the 1-line code fix
> added, the above issues addressed, and a bunch of cosmetic tweaks.
> Unfortunately, it doesn't pass for me consistently. I'm not sure if
> that's because I broke something with my changes, or because the test
> contains an underlying race condition which we need to address.
> Attached also are the log files from a failed run if you want to look
> at them. The key lines seem to be:

I could not reproduce this, but I think I found the issue: I used
the wrong target LSN in wait_for_catchup. Instead of checking the last
"insert LSN" of the standby, I was waiting for the last "replay LSN" of the
standby, which was wrong.  Changed as below in the attached patch.

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl
b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 09eb3eb..ee7d78d 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -78,7 +78,7 @@ $node_standby->safe_psql('postgres', "CREATE TABLE
tab_int AS SELECT 1 AS a");

 # Wait for the replication to catch up
 $node_standby->wait_for_catchup($node_cascade, 'replay',
-   $node_standby->lsn('replay'));
+   $node_standby->lsn('insert'));

 # Check that cascading standby has the new content
 my $result =

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 7e7b5cad099b1e1d554581276481ef729fc87fbc Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Thu, 3 Jun 2021 16:06:57 -0400
Subject: [PATCH v5] Fix corner case failure of new standby to follow new
 primary.

This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.

Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.

Dilip Kumar and Robert Haas, with input from Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CAFiTN-sE-jr=lb8jquxeqikd-ux+jhixyh4ydizmpedgqku...@mail.gmail.com
---
 src/test/recovery/t/025_stuck_on_old_timeline.pl | 92 
 src/test/recovery/t/cp_history_files | 10 +++
 2 files changed, 102 insertions(+)
 create mode 100644 src/test/recovery/t/025_stuck_on_old_timeline.pl
 create mode 100755 src/test/recovery/t/cp_history_files

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
new file mode 100644
index 000..ee7d78d
--- /dev/null
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -0,0 +1,92 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Testing streaming replication where standby is promoted and a new cascading
+# standby (without WAL) is connected to the promoted standby.  Both archiving
+# and streaming are enabled, but only the history file is available from the
+# archive, so the WAL files all have to be streamed.  Test that the cascading
+# standby can follow the new primary (promoted standby).
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use FindBin;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+
+# Set up an archive command that will copy the history file but not the WAL
+# files. No real archive command should behave this way; the point is to
+# simulate a race condition where the new cascading standby starts up after
+# the timeline history file reaches the archive but before any of the WAL files
+# get there.
+$node_primary->init(allows_streaming => 1, has_arch

Re: Race condition in recovery?

2021-06-03 Thread Robert Haas
On Thu, May 27, 2021 at 2:26 AM Dilip Kumar  wrote:
> Changed as suggested.

I don't think the code as written here is going to work on Windows,
because your code doesn't duplicate enable_restoring's call to
perl2host or its backslash-escaping logic. It would really be better
if we could use enable_restoring directly. Also, I discovered that the
'return' in cp_history_files should really say 'exit', because
otherwise it generates a complaint every time it's run. It should also
have 'use strict' and 'use warnings' at the top.
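
For readers without the attachment handy, here is a hypothetical sketch of what
a cp_history_files-style helper looks like after those fixes (strict/warnings
at the top, exit rather than return); the real script in the patch may differ:

#!/usr/bin/perl
use strict;
use warnings;

use File::Copy;

# Invoked by archive_command as: cp_history_files <source> <destination>.
# Skip anything that is not a timeline history file, so WAL segments never
# reach the archive; "exit", not "return", since this is not inside a sub.
exit 0 unless $ARGV[0] =~ /\.history$/;

copy($ARGV[0], $ARGV[1]) or die "couldn't copy $ARGV[0] to $ARGV[1]: $!";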

Here's a version of your test case patch with the 1-line code fix
added, the above issues addressed, and a bunch of cosmetic tweaks.
Unfortunately, it doesn't pass for me consistently. I'm not sure if
that's because I broke something with my changes, or because the test
contains an underlying race condition which we need to address.
Attached also are the log files from a failed run if you want to look
at them. The key lines seem to be:

2021-06-03 16:16:53.984 EDT [47796] LOG:  restarted WAL streaming at
0/300 on timeline 2
2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl LOG:
statement: SELECT count(*) FROM tab_int
2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl
ERROR:  relation "tab_int" does not exist at character 22

Or from the main log:

Waiting for replication conn cascade's replay_lsn to pass '0/300' on standby
done
error running SQL: 'psql::1: ERROR:  relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int
 ^'

I wonder whether that problem points to an issue with this incantation:

$node_standby->wait_for_catchup($node_cascade, 'replay',
$node_standby->lsn('replay'));

But I'm not sure, and I'm out of time to investigate for today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v4-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch
Description: Binary data


025_stuck_on_old_timeline_primary.log
Description: Binary data


025_stuck_on_old_timeline_cascade.log
Description: Binary data


regress_log_025_stuck_on_old_timeline
Description: Binary data


025_stuck_on_old_timeline_standby.log
Description: Binary data


Re: Race condition in recovery?

2021-06-02 Thread Kyotaro Horiguchi
At Tue, 1 Jun 2021 16:45:52 -0400, Robert Haas  wrote in 
> On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi
>  wrote:
> > Mmmm.  That looks like meaning that we don't intend to support the
> > Dilip's case, and means that we support the use of
> > archive-command-copies-only-other-than-wal-segments?
> 
> Actually, I think Dilip's case ought to be supported, but I also think
> that somebody else might disagree, so it's better for me if the test
> doesn't need to rely on it.

Understood.

> > Agreed. I often annoyed by a long-lasting TAP script when I wanted to
> > do one of the test items in it. However, I was not sure which is our
> > policy here, consolidating all related tests into one script or having
> > separate scripts containing tests up to a "certain" number or a set of
> > tests that would take a certain time, or limiting by number the of
> > lines.  I thought that we are on the first way as I have told several
> > times to put new tests into an existing script.
> 
> Different people might have different opinions about this, but my
> opinion is that when it's possible to combine the test cases in a way
> that feels natural, it's good to do. For example if I have two tests
> that require the same setup and teardown but do different things in
> the middle, and if those things seem related, then it's great to set
> up once, try both things, and tear down once. However I don't support
> combining test cases where it's just concatenating them one after
> another, because that sort of thing seems to have no benefit. Fewer
> files in the source tree is not a goal of itself.

Sounds like a reasonable criteria.

> > No. Thanks for the words, Robert. I might be a bit too naive, but I
> > had an anxious feeling that I might have been totally pointless or my
> > words might have been too cryptic/broken (my fingers are quite fat),
> > or I might have done something wrong or anything other.  Anyway I
> > thought I might have done something wrong here.
> 
> No, I don't think so. I think the difficulty is more that the three of
> us who are mostly involved in this conversation all have different
> native languages, and we are trying to discuss an issue which is very
> subtle. Sometimes I am having difficulty understanding precisely what
> either you or Dilip are intending to say, and it would not surprise me
> to learn that there are difficulties in the other direction also. If
> we seem to be covering the same topics multiple times or if any
> important points seem to be getting ignored, that's probably the
> reason.

That makes me convinced. Thanks for the thought and sorry for
bothering with the complaint.


Anyway, Now I agree to the all of the direction here.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-06-02 Thread Noah Misch
On Tue, Jun 01, 2021 at 04:45:52PM -0400, Robert Haas wrote:
> On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi  
> wrote:
> > Agreed. I often annoyed by a long-lasting TAP script when I wanted to
> > do one of the test items in it. However, I was not sure which is our
> > policy here, consolidating all related tests into one script or having
> > separate scripts containing tests up to a "certain" number or a set of
> > tests that would take a certain time, or limiting by number the of
> > lines.  I thought that we are on the first way as I have told several
> > times to put new tests into an existing script.
> 
> Different people might have different opinions about this, but my
> opinion is that when it's possible to combine the test cases in a way
> that feels natural, it's good to do. For example if I have two tests
> that require the same setup and teardown but do different things in
> the middle, and if those things seem related, then it's great to set
> up once, try both things, and tear down once. However I don't support
> combining test cases where it's just concatenating them one after
> another, because that sort of thing seems to have no benefit. Fewer
> files in the source tree is not a goal of itself.

I agree, particularly for the recovery and subscription TAP suites.  When one
of those tests fails on the buildfarm, it's often not obvious to me which log
messages are relevant to the failure.  Smaller test files simplify the
investigation somewhat.




Re: Race condition in recovery?

2021-06-01 Thread Robert Haas
On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi
 wrote:
> Mmmm.  That looks like meaning that we don't intend to support the
> Dilip's case, and means that we support the use of
> archive-command-copies-only-other-than-wal-segments?

Actually, I think Dilip's case ought to be supported, but I also think
that somebody else might disagree, so it's better for me if the test
doesn't need to rely on it.

> Agreed. I often annoyed by a long-lasting TAP script when I wanted to
> do one of the test items in it. However, I was not sure which is our
> policy here, consolidating all related tests into one script or having
> separate scripts containing tests up to a "certain" number or a set of
> tests that would take a certain time, or limiting by number the of
> lines.  I thought that we are on the first way as I have told several
> times to put new tests into an existing script.

Different people might have different opinions about this, but my
opinion is that when it's possible to combine the test cases in a way
that feels natural, it's good to do. For example if I have two tests
that require the same setup and teardown but do different things in
the middle, and if those things seem related, then it's great to set
up once, try both things, and tear down once. However I don't support
combining test cases where it's just concatenating them one after
another, because that sort of thing seems to have no benefit. Fewer
files in the source tree is not a goal of itself.

> No. Thanks for the words, Robert. I might be a bit too naive, but I
> had an anxious feeling that I might have been totally pointless or my
> words might have been too cryptic/broken (my fingers are quite fat),
> or I might have done something wrong or anything other.  Anyway I
> thought I might have done something wrong here.

No, I don't think so. I think the difficulty is more that the three of
us who are mostly involved in this conversation all have different
native languages, and we are trying to discuss an issue which is very
subtle. Sometimes I am having difficulty understanding precisely what
either you or Dilip are intending to say, and it would not surprise me
to learn that there are difficulties in the other direction also. If
we seem to be covering the same topics multiple times or if any
important points seem to be getting ignored, that's probably the
reason.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-31 Thread Kyotaro Horiguchi
Moved to another thread.

https://www.postgresql.org/message-id/20210531.165825.921389284096975508.horikyota@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-30 Thread Tatsuro Yamada

Hi Horiguchi-san,


(Why me?)


Because the story was also related to PG-REX, which you are
also involved in developing. Perhaps off-list instead of
-hackers would have been better, but I emailed -hackers because
the same problem could be encountered by PostgreSQL users who
do not use PG-REX.

 

In a project I helped with, I encountered an issue where
the archive command kept failing. I thought this issue was
related to the problem in this thread, so I'm sharing it here.
If I should create a new thread, please let me know.

* Problem
   - The archive_command is failed always.


Although I think the configuration is kind of broken, it can be seen
as mimicking the shared-archive case, where primary and
standby share the same archive directory.



To be precise, the environment of this reproduction script is
different from our actual environment. I tried to make it as
simple as possible to reproduce the problem.
(In order to make it look like the actual environment, you have
to build a PG-REX environment.)

A simple replication environment might be enough, so I'll try to
recreate a script that is closer to the actual environment later.

 

Basically we need to use an archive command like the following for
that case to avoid this kind of failure. The script returns "success"
when the target file is found but identical with the source file. I
don't find such a description in the documentation, and haven't
bothered digging into the mailing-list archive.

==
#! /bin/bash

if [ -f $2 ]; then
cmp -s $1 $2
if [ $? != 0 ]; then
exit 1
fi
exit 0
fi

cp $1 $2
==


Thanks for your reply.
Since the above behavior is different from the behavior of the
test command in the following example in postgresql.conf, I think
we should write a note about this example.

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

Let me describe the problem we faced.
- When archive_mode=always, archive_command is (sometimes) executed
  in a situation where the history file already exists on the standby
  side.

- In this case, if "test ! -f" is written in the archive_command of
  postgresql.conf on the standby side, the command will keep failing.

  Note that this problem does not occur when archive_mode=on.

So, what should we do for the user? I think we should put some notes
in postgresql.conf or in the documentation. For example, something
like this:


Note: If you use archive_mode=always, the archive_command on the standby side should not
use "test ! -f".




Regards,
Tatsuro Yamada






Re: Race condition in recovery?

2021-05-28 Thread Kyotaro Horiguchi
(Sorry for being a bit off-topic)

At Fri, 28 May 2021 12:18:35 +0900, Tatsuro Yamada 
 wrote in 
> Hi Horiguchi-san,

(Why me?)

> In a project I helped with, I encountered an issue where
> the archive command kept failing. I thought this issue was
> related to the problem in this thread, so I'm sharing it here.
> If I should create a new thread, please let me know.
> 
> * Problem
>   - The archive_command is failed always.

Although I think the configuration is kind of broken, it can be seen
as mimicking the shared-archive case, where primary and
standby share the same archive directory.

Basically we need to use an archive command like the following for
that case to avoid this kind of failure. The script returns "success"
when the target file is found but identical with the source file. I
don't find such a description in the documentation, and haven't
bothered digging into the mailing-list archive.

==
#! /bin/bash

if [ -f $2 ]; then
cmp -s $1 $2
if [ $? != 0 ]; then
exit 1
fi
exit 0
fi

cp $1 $2
==

A maybe-non-optimal behavior is both 0002.history.done and .ready
files are found at once in archive_status directory but that doesn't
practically matter. (Or I faintly remember that it is designed to work
even in that case.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-27 Thread Kyotaro Horiguchi
Thanks!

At Thu, 27 May 2021 15:05:44 -0400, Robert Haas  wrote 
in 
> On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi
>  wrote:
> > So in the mail [1] and [2] I tried to describe what's going on around
> > the two issues.  Although I haven't have a response to [2], can I
> > think that we clarified the intention of ee994272ca?  And may I think
> > that we decided that we don't add a test for the commit?
> 
> Regarding the first question, I feel that the intention of ee994272ca
> is fairly clear at this point. Someone else might feel differently so
> I won't presume to speak for anyone but me.

I completely misunderstood your intention here.

> Regarding the second question, I am not opposed to adding a test for
> that commit, but I think it is a lot more important to fix the bug we
> have now than to add a test for a bug that was fixed a long time ago.

Yes. I agree to that. Glad to see that.

> > Then it seems to me that Robert refound how to reproduce Dilip's case
> > using basebackup instead of using two standbys.  It is using a broken
> > archive_command with pg_basebackup -Xnone and I showed that the same
> > resulting state is available by pg_basebackup -Xstream/fetch clearing
> > pg_wal directory of the resulting backup including an explanation of
> > why.
> 
> Yes, it makes sense that we could get to the same state either by not
> fetching the WAL in the first place, or alternatively by fetching it
> and then removing it.

Sure. That is an opinion and I can agree to that.

> > *I* think that it is better to avoid to have the archive_command since
> > it seems to me that just unlinking some files seems simpler tha having
> > the broken archive_command.  However, since Robert ignored it, I guess
> > that Robert thinks that the broken archive_command is better than
> > that.
> 
> Well ... I don't see those things as quite related. As far as I can
> see, unlinking files from pg_wal is an alternative to using -Xnone. On
> the other hand, the broken archive_command is there to make sure the
> new primary doesn't archive its WAL segment too soon.

I agree to use the archive_command just to create the desired state.

> Regarding the first point, I think using -Xnone is better than using
> -Xfetch/stream and then removing the WAL, because (1) it doesn't seem
> efficient to fetch WAL only to turn around and remove it and (2)
> someone might question whether removing the WAL afterward is a
> supported procedure, whereas using an option built into the tool must
> surely be supported.

Mmmm.  That sounds like it means we don't intend to support
Dilip's case, and that we do support the use of an
archive command that copies only files other than WAL segments?

> Regarding the second point, I think using the broken archive command
> is superior because we can be sure of the behavior. If we just rely on
> not having crossed a segment boundary, then anything that causes more
> WAL to be generated than we are expecting could break the test. I
> don't think it's particularly likely in a case like this that
> autovacuum or any other thing would kick in and generate extra WAL,
> but the broken archive command ensures that even if it does happen,
> the test will still work as intended. That, to me, seems like a good
> enough reason to do it that way.

Yeah. That is the most convincing reason.

> > FWIW, regarding to the name of the test script, putting aside what it
> > actually does, I proposed to place it as a part or
> > 004_timeline_switch.pl because this issue is related to timeline
> > switching.
> 
> I think it is better to keep it separate. Long test scripts that test
> multiple things with completely separate tests are hard to read.

Agreed. I am often annoyed by a long-lasting TAP script when I want to
do one of the test items in it. However, I was not sure what our
policy is here: consolidating all related tests into one script, or having
separate scripts containing tests up to a "certain" number or a set of
tests that would take a certain time, or limiting by the number of
lines.  I thought that we were on the first way, as I have told people several
times to put new tests into an existing script.

> Kyotaro-san, I hope I have not given any offense. I am doing my best,
> and certainly did not mean to be rude.

No. Thanks for the words, Robert. I might be a bit too naive, but I
had an anxious feeling that I might have been totally pointless or my
words might have been too cryptic/broken (my fingers are quite fat),
or I might have done something wrong or anything other.  Anyway I
thought I might have done something wrong here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-27 Thread Tatsuro Yamada

Hi Horiguchi-san,

In a project I helped with, I encountered an issue where
the archive command kept failing. I thought this issue was
related to the problem in this thread, so I'm sharing it here.
If I should create a new thread, please let me know.

* Problem
  - The archive_command is failed always.

* Conditions under which the problem occurs (parameters)
  - archive_mode=always
  - Using the test command in archive_command
 
* Probable cause

  - I guess that is because the .history file already exists,
and the test command fails.
(but if we use archive_mode=on, archive_command is successful).

* How to reproduce
  - Attached is a script to reproduce the problem.
  Note: the script will remove $PGDATA when it started

The test command is listed as an example of the use of archive_command
in postgresql.conf, and the project faced this problem because it used
the example as is. If this behavior is a specification, it would be
better not to write the test command as a usage example.
Or maybe there should be a note that the test command should not be used
when archive_mode=always. Maybe, I'm missing something, sorry.

Regards,
Tatsuro Yamada
set -v
pg_ctl stop
rm -rf $PGDATA
initdb --encoding=UTF8 --no-locale
mkdir $PGDATA/arc

cat <<@EOF >> $PGDATA/postgresql.conf
archive_mode = always
archive_command = 'test ! -f arc/%f && cp %p arc/%f'

restore_command = 'cp arc/%f %p'
recovery_target_timeline = 'latest'
@EOF

touch $PGDATA/standby.signal
pg_ctl start
sleep 3
pg_ctl promote
sleep 3
psql -c "create table test (a int);"
# These are the trigger for archive_command
psql -c "insert into test values (1); checkpoint; select pg_switch_wal();"  # OK
psql -c "insert into test values (1); checkpoint; select pg_switch_wal();"  # OK
psql -c "insert into test values (1); checkpoint; select pg_switch_wal();"  # OK

sleep 1

touch $PGDATA/standby.signal
pg_ctl stop
pg_ctl start
echo " NG  (If we use archive_mode=on, archive_command is successful)"
sleep 10
pg_ctl promote
sleep 3
# These are the trigger for archive_command
psql -c "insert into test values (1); checkpoint; select pg_switch_wal();"  # 
NG: archive_command failed
psql -c "insert into test values (1); checkpoint; select pg_switch_wal();"  # 
NG: archive_command failed
psql -c "insert into test values (1); checkpoint; select pg_switch_wal();"  # 
NG: archive_command failed
waiting for server to shut down done
server stopped
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /home/postgres/PG140/DATA ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Tokyo
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok


Success. You can now start the database server using:

pg_ctl -D /home/postgres/PG140/DATA -l logfile start

waiting for server to start2021-05-28 12:19:06.378 JST [8215] LOG:  
starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit
2021-05-28 12:19:06.378 JST [8215] LOG:  listening on IPv6 address "::1", port 
1400
2021-05-28 12:19:06.379 JST [8215] LOG:  listening on IPv4 address "127.0.0.1", 
port 1400
2021-05-28 12:19:06.381 JST [8215] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.1400"
2021-05-28 12:19:06.386 JST [8216] LOG:  database system was shut down at 
2021-05-28 12:19:06 JST
cp: cannot stat ‘arc/0002.history’: No such file or directory
2021-05-28 12:19:06.393 JST [8216] LOG:  entering standby mode
cp: cannot stat ‘arc/00010001’: No such file or directory
2021-05-28 12:19:06.402 JST [8216] LOG:  consistent recovery state reached at 
0/16E9CF8
2021-05-28 12:19:06.402 JST [8216] LOG:  invalid record length at 0/16E9CF8: 
wanted 24, got 0
2021-05-28 12:19:06.402 JST [8215] LOG:  database system is ready to accept 
read only connections
cp: cannot stat ‘arc/0002.history’: No such file or directory
cp: cannot stat ‘arc/00010001’: No such file or directory
cp: cannot stat ‘arc/0002.history’: No such file or directory
 done
server started
waiting for server to promotecp: cannot stat 
‘arc/00010001’: No such file or directory
2021-05-28 12:19:09.462 JST [8216] LOG:  received promote request
2021-05-28 12:19:09.463 JST [8216] LOG:  redo is not required
cp: cannot stat ‘arc/00010001’: No such file or directory
cp: cannot stat ‘arc/0002.history’: No such file or directory
2021-05-28 12:19:09.478 JST [8216] LOG:  selected new timeline ID: 2
2021-05-28 12:19:09.502 JST [8216] LOG:  ar

Re: Race condition in recovery?

2021-05-27 Thread Robert Haas
On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi
 wrote:
> So in the mail [1] and [2] I tried to describe what's going on around
> the two issues.  Although I haven't have a response to [2], can I
> think that we clarified the intention of ee994272ca?  And may I think
> that we decided that we don't add a test for the commit?

Regarding the first question, I feel that the intention of ee994272ca
is fairly clear at this point. Someone else might feel differently so
I won't presume to speak for anyone but me.

Regarding the second question, I am not opposed to adding a test for
that commit, but I think it is a lot more important to fix the bug we
have now than to add a test for a bug that was fixed a long time ago.

> Then it seems to me that Robert refound how to reproduce Dilip's case
> using basebackup instead of using two standbys.  It is using a broken
> archive_command with pg_basebackup -Xnone and I showed that the same
> resulting state is available by pg_basebackup -Xstream/fetch clearing
> pg_wal directory of the resulting backup including an explanation of
> why.

Yes, it makes sense that we could get to the same state either by not
fetching the WAL in the first place, or alternatively by fetching it
and then removing it.

> *I* think that it is better to avoid to have the archive_command since
> it seems to me that just unlinking some files seems simpler tha having
> the broken archive_command.  However, since Robert ignored it, I guess
> that Robert thinks that the broken archive_command is better than
> that.

Well ... I don't see those things as quite related. As far as I can
see, unlinking files from pg_wal is an alternative to using -Xnone. On
the other hand, the broken archive_command is there to make sure the
new primary doesn't archive its WAL segment too soon.

Regarding the first point, I think using -Xnone is better than using
-Xfetch/stream and then removing the WAL, because (1) it doesn't seem
efficient to fetch WAL only to turn around and remove it and (2)
someone might question whether removing the WAL afterward is a
supported procedure, whereas using an option built into the tool must
surely be supported.

Regarding the second point, I think using the broken archive command
is superior because we can be sure of the behavior. If we just rely on
not having crossed a segment boundary, then anything that causes more
WAL to be generated than we are expecting could break the test. I
don't think it's particularly likely in a case like this that
autovacuum or any other thing would kick in and generate extra WAL,
but the broken archive command ensures that even if it does happen,
the test will still work as intended. That, to me, seems like a good
enough reason to do it that way.

> FWIW, regarding to the name of the test script, putting aside what it
> actually does, I proposed to place it as a part or
> 004_timeline_switch.pl because this issue is related to timeline
> switching.

I think it is better to keep it separate. Long test scripts that test
multiple things with completely separate tests are hard to read.

Kyotaro-san, I hope I have not given any offense. I am doing my best,
and certainly did not mean to be rude.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-27 Thread Kyotaro Horiguchi
At Thu, 27 May 2021 12:47:30 +0530, Dilip Kumar  wrote 
in 
> On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar  
> > wrote in
> > We're writing at the very beginning of the switching segment at the
> > promotion time. So it is guaranteed that the first segment of the
> > newer timline won't be archived until the rest almost 16MB in the
> > segment is consumed or someone explicitly causes a segment switch
> > (including archive timeout).
> 
> I agree
>
> > > BTW, I have also tested your script and I found below log, which shows
> > > that standby2 is successfully able to select the timeline2 so it is
> > > not reproducing the issue.  Am I missing something?
> >
> > standby_2? My last one 026_timeline_issue_2.pl doesn't use that name
> > and uses "standby_1 and "cascade".  In the ealier ones, standby_4 and
> > 5 (or 3 and 4 in the later versions) are used in ths additional tests.
> >
> > So I think it shold be something different?
> 
> Yeah, I tested with your patch where you had a different test case,
> with "026_timeline_issue_2.pl", I am able to reproduce the issue.

That said, I don't object if we decide to choose the crafted archive
command as far as we consider the trade-offs between the two ways.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-27 Thread Dilip Kumar
On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi
 wrote:
>
> At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar  wrote 
> in
> > Maybe we can somehow achieve that without a broken archive command,
> > but I am not sure how it is enough to just delete WAL from pg_wal?  I
> > mean my original case was that
> > 1. Got the new history file from the archive but did not get the WAL
> > file yet which contains the checkpoint after TL switch
> > 2. So the standby2 try to stream using new primary using old TL and
> > set the wrong TL in expectedTLEs
> >
> > But if you are not doing anything to stop archiving WAL files or to
> > guarantee that WAL has come to archive and you deleted those then I am
> > not sure how we are reproducing the original problem.
>
> Thanks for the reply!
>
> We're writing at the very beginning of the switching segment at the
> promotion time. So it is guaranteed that the first segment of the
> newer timline won't be archived until the rest almost 16MB in the
> segment is consumed or someone explicitly causes a segment switch
> (including archive timeout).

I agree

> > BTW, I have also tested your script and I found below log, which shows
> > that standby2 is successfully able to select the timeline2 so it is
> > not reproducing the issue.  Am I missing something?
>
> standby_2? My last one 026_timeline_issue_2.pl doesn't use that name
> and uses "standby_1 and "cascade".  In the ealier ones, standby_4 and
> 5 (or 3 and 4 in the later versions) are used in ths additional tests.
>
> So I think it shold be something different?

Yeah, when I tested with your patch, which had a different test case
("026_timeline_issue_2.pl"), I was able to reproduce the issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Kyotaro Horiguchi
At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar  wrote 
in 
> Maybe we can somehow achieve that without a broken archive command,
> but I am not sure how it is enough to just delete WAL from pg_wal?  I
> mean my original case was that
> 1. Got the new history file from the archive but did not get the WAL
> file yet which contains the checkpoint after TL switch
> 2. So the standby2 try to stream using new primary using old TL and
> set the wrong TL in expectedTLEs
> 
> But if you are not doing anything to stop archiving WAL files or to
> guarantee that WAL has come to archive and you deleted those then I am
> not sure how we are reproducing the original problem.

Thanks for the reply!

We're writing at the very beginning of the switching segment at
promotion time. So it is guaranteed that the first segment of the
newer timeline won't be archived until the remaining almost-16MB of the
segment is consumed or someone explicitly causes a segment switch
(including archive timeout).

> BTW, I have also tested your script and I found below log, which shows
> that standby2 is successfully able to select the timeline2 so it is
> not reproducing the issue.  Am I missing something?

standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
and uses "standby_1" and "cascade".  In the earlier ones, standby_4 and
5 (or 3 and 4 in the later versions) are used in the additional tests.

So I think it should be something different?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-26 Thread Dilip Kumar
On Wed, May 26, 2021 at 9:40 PM Robert Haas  wrote:

> ...which has a clear race condition.
> src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a
> WAL file to be archived, so maybe we can steal that logic and use it
> here.

Yeah, done that. I think we can use the exact same logic for history files
as well, because if the WAL file has been archived then the history file
must have been too: a) the history file is created during promotion, so it
exists before the WAL file on the new timeline is ready for archiving, and
b) the archiver archives history files before archiving any WAL files.

src/test/recovery/t/025_stuck_on_old_timeline.pl

> I suggest we rename the test to something a bit more descriptive. Like
> instead of 025_timeline_issue.pl, perhaps
> 025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but
> "timeline issue" is a bit too vague for my taste.

Changed as suggested.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 5bbea2f5a21c28729580ec2b7b8cf50fd67e7ed0 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Sun, 23 May 2021 21:27:58 +0530
Subject: [PATCH v3] Test for new standby not following promoted standby

---
 src/test/recovery/t/025_stuck_on_old_timeline.pl | 101 +++
 src/test/recovery/t/cp_history_files |   8 ++
 2 files changed, 109 insertions(+)
 create mode 100644 src/test/recovery/t/025_stuck_on_old_timeline.pl
 create mode 100755 src/test/recovery/t/cp_history_files

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
new file mode 100644
index 000..e44b774
--- /dev/null
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -0,0 +1,101 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Testing streaming replication where standby is promoted and a new cascade
+# standby (without WAL) is connected to the promoted standby.  Both archiving
+# and streaming are enabled but it should only get the history files from the
+# archive but not the WAL files so that it has to get the checkpoint record
+# from the promoted standby through streaming.  Test that the cascade standby
+# should be able to follow the new primary (promoted standby).
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use FindBin;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+
+# Set archive command using 'cp_history_files' (custom command).  The command
+# will only copy the history file and ignore all other WAL files.  This is
+# required to reproduce the scenario where history files reach the archive
+# but not the WAL files when standby try to restore it from the archive so that
+# it needs to stream the checkpoint record from the primary.
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+my $archivedir_primary = $node_primary->archive_dir;
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+archive_command = '"$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+));
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+
+# Take backup from primary
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	allows_streaming => 1, has_streaming => 1, has_archiving => 1);
+$node_standby->start;
+
+# Take backup of standby, use -Xnone so that pg_wal is empty.
+$node_standby->backup($backup_name, backup_options => ['-Xnone']);
+
+# Create cascading standby but don't start it yet.
+# Must set up both streaming and archiving.
+my $node_cascade = get_new_node('cascade');
+$node_cascade->init_from_backup($node_standby, $backup_name,
+	has_streaming => 1);
+
+# Setup restore command
+my $copy_command =
+  $TestLib::windows_os ? 'copy' : 'cp';
+
+$node_cascade->append_conf(
+	'postgresql.conf', qq(
+restore_command = '$copy_command "$archivedir_primary/%f" "%p"'
+));
+
+# Promote the standby.
+$node_standby->psql('postgres', 'SELECT pg_promote()');
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node_standby->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+
+# Make WAL segment eligible for archival
+$node_standby->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait until the WAL segment has been archived.
+# Since the history file get created on promotion and the archiver ensures that
+# the history files get archived before any wal segment so we can ensure after
+# this the history file must have been archived.
+my $archive_wait_query =
+  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+$node_standby->poll_query_until('postgres', $archive_wait_query)
+  or die "Timed out while waiting for WAL segment to be archived";
+my $last_archived_wal_file = $walfile_to_be_archived;
+
+# Start cascade node
+$node_cascade->start;
+
+# Create some content on promoted standby and check its presen

Re: Race condition in recovery?

2021-05-26 Thread Dilip Kumar
On Thu, May 27, 2021 at 6:19 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar  wrote 
> in
> > On Wed, 26 May 2021 at 10:06 PM, Robert Haas  wrote:
> >
> > > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar 
> > > wrote:
> > > > I will check if there is any timing dependency in the test case.
> > >
> > > There is. I explained it in the second part of my email, which you may
> > > have failed to notice.
> >
> >
> > Sorry, my bad.  I got your point now.  I will change the test.
>
> I didn't notice that, but it is actually possible to happen.
>
>
> By the way I'm having a hard time understanding what was happening on
> this thread.
>
> Very early in this thread I posted a test script that exactly
> reproduces Dilip's case, starting from two standbys, based on his
> explanation. But *we* didn't understand what the original commit
> ee994272ca intended, and I understood that we wanted to find that out.
>
> So in the mails [1] and [2] I tried to describe what's going on around
> the two issues.  Although I haven't had a response to [2], can I assume
> that we clarified the intention of ee994272ca?  And may I assume that
> we decided not to add a test for that commit?
>
> Then it seems to me that Robert re-found how to reproduce Dilip's case
> using a base backup instead of two standbys.  It uses a broken
> archive_command with pg_basebackup -Xnone, and I showed that the same
> resulting state can be reached with pg_basebackup -Xstream/fetch by
> clearing the pg_wal directory of the resulting backup, including an
> explanation of why.

Maybe we can somehow achieve that without a broken archive command,
but I am not sure how it is enough to just delete WAL from pg_wal.  I
mean, my original case was that
1. It got the new history file from the archive but did not yet get the
WAL file which contains the checkpoint after the TL switch.
2. So standby2 tries to stream from the new primary using the old TL
and sets the wrong TL in expectedTLEs.

But if you are not doing anything to stop WAL files from being archived,
or to guarantee that the WAL has reached the archive before you delete
it, then I am not sure how we are reproducing the original problem.

BTW, I have also tested your script and I found below log, which shows
that standby2 is successfully able to select the timeline2 so it is
not reproducing the issue.  Am I missing something?

--standby-1--
2021-05-27 10:45:35.866 IST [5096] LOG:  last completed transaction
was at log time 2021-05-27 10:45:35.699316+05:30
2021-05-27 10:45:35.867 IST [5096] LOG:  selected new timeline ID: 2
2021-05-27 10:45:35.882 IST [5096] LOG:  archive recovery complete
2021-05-27 10:45:35.911 IST [5095] LOG:  database system is ready to
accept connections
2021-05-27 10:45:36.096 IST [5134] standby_2 LOG:  received
replication command: IDENTIFY_SYSTEM
2021-05-27 10:45:36.096 IST [5134] standby_2 STATEMENT:  IDENTIFY_SYSTEM

--standby-2--
2021-05-27 10:45:36.089 IST [5129] LOG:  entering standby mode
2021-05-27 10:45:36.090 IST [5129] LOG:  redo starts at 0/228
2021-05-27 10:45:36.092 IST [5129] LOG:  consistent recovery state
reached at 0/3030320
2021-05-27 10:45:36.092 IST [5129] LOG:  invalid record length at
0/3030320: wanted 24, got 0
2021-05-27 10:45:36.092 IST [5128] LOG:  database system is ready to
accept read only connections
2021-05-27 10:45:36.096 IST [5133] LOG:  fetching timeline history
file for timeline 2 from primary server
2021-05-27 10:45:36.097 IST [5133] LOG:  started streaming WAL from
primary at 0/300 on timeline 1
2021-05-27 10:45:36.098 IST [5133] LOG:  replication terminated by
primary server
2021-05-27 10:45:36.098 IST [5133] DETAIL:  End of WAL reached on
timeline 1 at 0/3030320.
2021-05-27 10:45:36.098 IST [5129] LOG:  new target timeline is 2
2021-05-27 10:45:36.098 IST [5133] LOG:  restarted WAL streaming at
0/300 on timeline 2


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Kyotaro Horiguchi
At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar  wrote 
in 
> On Wed, 26 May 2021 at 10:06 PM, Robert Haas  wrote:
> 
> > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar 
> > wrote:
> > > I will check if there is any timing dependency in the test case.
> >
> > There is. I explained it in the second part of my email, which you may
> > have failed to notice.
> 
> 
> Sorry, my bad.  I got your point now.  I will change the test.

I didn't notice that, but it is actually possible to happen.


By the way I'm having a hard time understanding what was happening on
this thread.

Very early in this thread I posted a test script that exactly
reproduces Dilip's case, starting from two standbys, based on his
explanation. But *we* didn't understand what the original commit
ee994272ca intended, and I understood that we wanted to find that out.

So in the mails [1] and [2] I tried to describe what's going on around
the two issues.  Although I haven't had a response to [2], can I assume
that we clarified the intention of ee994272ca?  And may I assume that
we decided not to add a test for that commit?

Then it seems to me that Robert re-found how to reproduce Dilip's case
using a base backup instead of two standbys.  It uses a broken
archive_command with pg_basebackup -Xnone, and I showed that the same
resulting state can be reached with pg_basebackup -Xstream/fetch by
clearing the pg_wal directory of the resulting backup, including an
explanation of why.

*I* think that it is better to avoid having the archive_command, since
it seems to me that just unlinking some files is simpler than having
the broken archive_command.  However, since Robert ignored it, I guess
that Robert thinks that the broken archive_command is better than
that.

Is my understanding above about the current status of this thread
right?


FWIW, regarding the name of the test script, putting aside what it
actually does, I proposed to make it part of
004_timeline_switch.pl because this issue is related to timeline
switching.


[1] 20210521.112105.27166595366072396.horikyota@gmail.com
[2] 
https://www.postgresql.org/message-id/20210524.113402.1922481024406047229.horikyota@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-26 Thread Dilip Kumar
On Wed, 26 May 2021 at 10:06 PM, Robert Haas  wrote:

> On Wed, May 26, 2021 at 12:26 PM Dilip Kumar 
> wrote:
> > I will check if there is any timing dependency in the test case.
>
> There is. I explained it in the second part of my email, which you may
> have failed to notice.


Sorry, my bad.  I got your point now.  I will change the test.

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Race condition in recovery?

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar  wrote:
> I will check if there is any timing dependency in the test case.

There is. I explained it in the second part of my email, which you may
have failed to notice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Dilip Kumar
On Wed, May 26, 2021 at 9:40 PM Robert Haas  wrote:
>
> On Wed, May 26, 2021 at 2:44 AM Dilip Kumar  wrote:
> > I think we need to create some content on promoted standby and check
> > whether the cascade standby is able to get that or not, that will
> > guarantee that it is actually following the promoted standby,  I have
> > added the test for that so that it matches the comments.
>
> OK, but I ran this test against an unpatched server and it passed.
> That's not so great, because the test should fail without the bug fix.
> It seems to be because there's only one actual test in this test case.
> Looking at the log file,
> src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
> only "ok" nor "not ok" line is:
>
> ok 1 - check streamed content on cascade standby
>
> So either that test is not right or some other test is needed. I think
> there's something else going wrong here, because when I run my
> original test.sh script, I see this:

That's strange; when I ran the test I can see the below in the log of
the cascade node (which shows that the cascade gets the history file
but not the WAL file, and then it selects the old timeline and never
goes to the new timeline)

...
2021-05-26 21:46:54.412 IST [84080] LOG:  restored log file
"0002.history" from archive
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0003.history’:
No such file or directory
2021-05-26 21:46:54.415 IST [84080] LOG:  entering standby mode
2021-05-26 21:46:54.419 IST [84080] LOG:  restored log file
"0002.history" from archive
.
2021-05-26 21:46:54.429 IST [84085] LOG:  started streaming WAL from
primary at 0/200 on timeline 1   -> stream using previous TL
2021-05-26 21:46:54.466 IST [84080] LOG:  redo starts at 0/228
2021-05-26 21:46:54.466 IST [84080] LOG:  consistent recovery state
reached at 0/300
2021-05-26 21:46:54.467 IST [84079] LOG:  database system is ready to
accept read only connections
2021-05-26 21:46:54.483 IST [84085] LOG:  replication terminated by
primary server
2021-05-26 21:46:54.483 IST [84085] DETAIL:  End of WAL reached on
timeline 1 at 0/300.
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0003.history’:
No such file or directory
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00010003’:
No such file or directory
2021-05-26 21:46:54.498 IST [84085] LOG:  primary server contains no
more WAL on requested timeline 1



And finally the test case fails because the cascade can never get the changes.

I will check if there is any timing dependency in the test case.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-26 Thread Robert Haas
On Wed, May 26, 2021 at 2:44 AM Dilip Kumar  wrote:
> I think we need to create some content on promoted standby and check
> whether the cascade standby is able to get that or not, that will
> guarantee that it is actually following the promoted standby,  I have
> added the test for that so that it matches the comments.

OK, but I ran this test against an unpatched server and it passed.
That's not so great, because the test should fail without the bug fix.
It seems to be because there's only one actual test in this test case.
Looking at the log file,
src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
only "ok" nor "not ok" line is:

ok 1 - check streamed content on cascade standby

So either that test is not right or some other test is needed. I think
there's something else going wrong here, because when I run my
original test.sh script, I see this:

2021-05-26 11:37:47.794 EDT [57961] LOG:  restored log file
"0002.history" from archive
...
2021-05-26 11:37:47.916 EDT [57961] LOG:  redo starts at 0/228
...
2021-05-26 11:37:47.927 EDT [57966] LOG:  replication terminated by
primary server
2021-05-26 11:37:47.927 EDT [57966] DETAIL:  End of WAL reached on
timeline 1 at 0/300

But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log
file generated by this test case:

cp: 
/Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/0002.history:
No such file or directory
...
2021-05-26 11:41:08.149 EDT [63347] LOG:  fetching timeline history
file for timeline 2 from primary server
...
2021-05-26 11:41:08.288 EDT [63344] LOG:  new target timeline is 2
...
2021-05-26 11:41:08.303 EDT [63344] LOG:  redo starts at 0/228
...
2021-05-26 11:41:08.331 EDT [63347] LOG:  restarted WAL streaming at
0/300 on timeline 2

So it doesn't seem like the test is actually reproducing the problem
correctly. The timeline history file isn't available from the archive,
so it streams it, and then the problem doesn't occur. I guess that's
because there's nothing to guarantee that the history file reaches the
archive before 'cascade' is started. The code just does:

# Promote the standby.
$node_standby->psql('postgres', 'SELECT pg_promote()');

# Start cascade node
$node_cascade->start;

...which has a clear race condition.
src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a
WAL file to be archived, so maybe we can steal that logic and use it
here.
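
For reference, the pattern in question looks roughly like this (a
sketch, using the node name from the patch under review; not a verbatim
copy of 023_pitr_prepared_xact.pl):

# Note the current segment on the promoted standby, force it to be
# completed, and poll pg_stat_archiver until the archiver has processed
# it; since history files are archived ahead of WAL segments, the new
# history file is then guaranteed to be in the archive.
my $walfile_to_be_archived = $node_standby->safe_psql('postgres',
	"SELECT pg_walfile_name(pg_current_wal_lsn());");
$node_standby->safe_psql('postgres', 'SELECT pg_switch_wal()');
$node_standby->poll_query_until('postgres',
	"SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;")
  or die "Timed out while waiting for WAL segment to be archived";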

I suggest we rename the test to something a bit more descriptive. Like
instead of 025_timeline_issue.pl, perhaps
025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but
"timeline issue" is a bit too vague for my taste.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-25 Thread Dilip Kumar
On Tue, May 25, 2021 at 9:16 PM Robert Haas  wrote:

> use FindBin;
>
> and then use $FindBin::RealBin to construct a path name to the executable, 
> e.g.
>
> $node_primary->append_conf(
>'postgresql.conf', qq(
> archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"'
> ));
>
> This avoids issues such as: leaving behind files if the script is
> terminated, needing the current working directory to be writable,
> possible permissions issues with the new file under Windows or
> SE-Linux.

Done

> The restore_command needs to be "cp" on Linux but "copy" on Windows.
> Maybe you can use PostgresNode.pm's enable_restoring? Or if that
> doesn't work, then you need to mimic the logic, as
> src/test/recovery/t/020_archive_status.pl does for archive_command.

Done

> Why do you set log_line_prefix? Is that needed?

No, it was not, removed

> Why are the nodes called standby_1 and cascade? Either use standby and
> cascade or standby_1 and standby_2.

Fixed

> There is a comment that says "Create some content on primary and check
> its presence in standby 1" but it only creates the content, and does
> not check anything. I think we don't really need to do any of this,
> but at least the code and the comment have to match.

I think we need to create some content on promoted standby and check
whether the cascade standby is able to get that or not, that will
guarantee that it is actually following the promoted standby,  I have
added the test for that so that it matches the comments.
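
Concretely, that check amounts to something like this (a sketch; the
LSN mode passed to wait_for_catchup here is an assumption, not taken
from the attached patch):

# Create some content on the promoted standby and verify that the
# cascade standby replays it, which proves it followed the promoted
# standby onto the new timeline.
$node_standby->safe_psql('postgres',
	"CREATE TABLE tab_int AS SELECT 1 AS a");
$node_standby->wait_for_catchup($node_cascade, 'replay',
	$node_standby->lsn('insert'));
my $result =
  $node_cascade->safe_psql('postgres', "SELECT count(*) FROM tab_int");
is($result, qq(1), 'check streamed content on cascade standby');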

> Let's not call the command skip_cp. It's not very descriptive. If you
> don't like recalcitrant_cp, then maybe something like cp_history_files
> or so.

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 0c9235f77fa9c75b1957810aa88a2572ed4ac1dc Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Sun, 23 May 2021 21:27:58 +0530
Subject: [PATCH v2] Test for new standby not following promoted standby

---
 src/test/recovery/t/025_timeline_issue.pl | 84 +++
 src/test/recovery/t/cp_history_files  |  8 +++
 2 files changed, 92 insertions(+)
 create mode 100644 src/test/recovery/t/025_timeline_issue.pl
 create mode 100755 src/test/recovery/t/cp_history_files

diff --git a/src/test/recovery/t/025_timeline_issue.pl b/src/test/recovery/t/025_timeline_issue.pl
new file mode 100644
index 000..9031354
--- /dev/null
+++ b/src/test/recovery/t/025_timeline_issue.pl
@@ -0,0 +1,84 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Testing streaming replication where standby is promoted and a new cascade
+# standby (without WAL) is connected to the promoted standby.  Both archiving
+# and streaming are enabled but it should only get the history files from the
+# archive but not the WAL files so that it has to get the checkpoint record
+# from the promoted standby through streaming.  Test that the cascade standby
+# should be able to follow the new primary (promoted standby).
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use FindBin;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+
+# Set archive command using 'cp_history_files' (custom command).  The command
+# will only copy the history file and ignore all other WAL files.  This is
+# required to reproduce the scenario where history files reach the archive
+# but not the WAL files when standby try to restore it from the archive so that
+# it needs to stream the checkpoint record from the primary.
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+my $archivedir_primary = $node_primary->archive_dir;
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+archive_command = '"$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+));
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+
+# Take backup from primary
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	allows_streaming => 1, has_streaming => 1, has_archiving => 1);
+$node_standby->start;
+
+# Take backup of standby, use -Xnone so that pg_wal is empty.
+$node_standby->backup($backup_name, backup_options => ['-Xnone']);
+
+# Create cascading standby but don't start it yet.
+# Must set up both streaming and archiving.
+my $node_cascade = get_new_node('cascade');
+$node_cascade->init_from_backup($node_standby, $backup_name,
+	has_streaming => 1);
+
+# Setup restore command
+my $copy_command =
+  $TestLib::windows_os ? 'copy' : 'cp';
+
+$node_cascade->append_conf(
+	'postgresql.conf', qq(
+restore_command = '$copy_command "$archivedir_primary/%f" "%p"'
+));
+
+# Promote the standby.
+$node_standby->psql('postgres', 'SELECT pg_promote()');
+
+# Start cascade node
+$node_cascade->start;
+
+# Create some content on promoted standby and check its presence in cascade standby
+$node_standby->safe_psql('postgre

Re: Race condition in recovery?

2021-05-25 Thread Robert Haas
On Sun, May 23, 2021 at 12:08 PM Dilip Kumar  wrote:
> I have created a tap test based on Robert's test.sh script.  It
> reproduces the issue.  I am new with perl so this still needs some
> cleanup/improvement, but at least it shows the idea.

Thanks. I think this is the right idea but just needs a few adjustments.

I don't think that dynamically writing out a file into the current
working directory of the script is the right approach. Instead I think
we should be planning to check this file into the repository and then
have the test script find it. Now the trick is how to do that in a
portable way. I think we can probably use the same idea that the
pg_rewind tests use to find a perl module located in the test
directory.  That is:

use FindBin;

and then use $FindBin::RealBin to construct a path name to the executable, e.g.

$node_primary->append_conf(
   'postgresql.conf', qq(
archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"'
));

This avoids issues such as: leaving behind files if the script is
terminated, needing the current working directory to be writable,
possible permissions issues with the new file under Windows or
SE-Linux.

The restore_command needs to be "cp" on Linux but "copy" on Windows.
Maybe you can use PostgresNode.pm's enable_restoring? Or if that
doesn't work, then you need to mimic the logic, as
src/test/recovery/t/020_archive_status.pl does for archive_command.
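
If enable_restoring can't be used as-is, picking the copy program by
hand is only a couple of lines, along these lines (a sketch):

my $copy_command = $TestLib::windows_os ? 'copy' : 'cp';
$node_cascade->append_conf(
	'postgresql.conf', qq(
restore_command = '$copy_command "$archivedir_primary/%f" "%p"'
));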

Why do you set log_line_prefix? Is that needed?

Why are the nodes called standby_1 and cascade? Either use standby and
cascade or standby_1 and standby_2.

There is a comment that says "Create some content on primary and check
its presence in standby 1" but it only creates the content, and does
not check anything. I think we don't really need to do any of this,
but at least the code and the comment have to match.

Let's not call the command skip_cp. It's not very descriptive. If you
don't like recalcitrant_cp, then maybe something like cp_history_files
or so.
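
For illustration, a standalone version of the inline skip_cp script can
stay tiny, something like this (a sketch, not a finished patch):

#!/usr/bin/perl
# Copy only timeline history files into the archive; exit successfully
# for (but otherwise ignore) every other file, so WAL segments never
# reach the archive while history files do.
use strict;
use warnings;
use File::Copy;

my ($source, $target) = @ARGV;
if ($source =~ /history/)
{
	copy($source, $target) or die "couldn't copy $source to $target: $!";
}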

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-23 Thread Dilip Kumar
On Mon, May 24, 2021 at 10:17 AM Kyotaro Horiguchi
 wrote:
>
> At Sun, 23 May 2021 21:37:58 +0530, Dilip Kumar  wrote 
> in
> > On Sun, May 23, 2021 at 2:19 PM Dilip Kumar  wrote:
> > >
> > > On Sat, May 22, 2021 at 8:33 PM Robert Haas  wrote:
> >
> > I have created a tap test based on Robert's test.sh script.  It
> > reproduces the issue.  I am new with perl so this still needs some
> > cleanup/improvement, but at least it shows the idea.
>
> I'm not sure I'm following the discussion here, however, if we were
> trying to reproduce Dilip's case using base backup, we would need such
> a broken archive command if using pg_basebackup with -Xnone.  Because
> the current version of pg_basebackup waits for all required WAL
> segments to be archived when connecting to a standby with -Xnone.

Right, that's the reason why, in my patch, I have dynamically
generated such an archive command, which skips everything other than
the history file.
See the below snippet from my patch, where I am generating a skip_cp
command and then using that as the archive command.

==
+# Prepare a alternative archive command to skip WAL files
+my $script = "#!/usr/bin/perl \n
+use File::Copy; \n
+my (\$source, \$target) = \@ARGV; \n
+if (\$source =~ /history/) \n
+{ \n
+ copy(\$source, \$target); \n
+}";
+
+open my $fh, '>', "skip_cp";
+print {$fh} $script;
===

> I didn't bother reconfirming the version in which that fix took place,
> but just by using -X stream instead of "none" we successfully miss the
> first segment of the new timeline in the upstream archive, though we
> need to erase pg_wal in the backup.  Either the broken archive command
> or erasing pg_wal of the cascade is required for the behavior to occur.
>
> The attached is how it looks like.

I will test this and let you know.  Thanks!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-23 Thread Kyotaro Horiguchi
At Sun, 23 May 2021 21:37:58 +0530, Dilip Kumar  wrote 
in 
> On Sun, May 23, 2021 at 2:19 PM Dilip Kumar  wrote:
> >
> > On Sat, May 22, 2021 at 8:33 PM Robert Haas  wrote:
> 
> I have created a tap test based on Robert's test.sh script.  It
> reproduces the issue.  I am new with perl so this still needs some
> cleanup/improvement, but at least it shows the idea.

I'm not sure I'm following the discussion here; however, if we were
trying to reproduce Dilip's case using a base backup, we would need such
a broken archive command if using pg_basebackup with -Xnone.  That is
because the current version of pg_basebackup waits for all required WAL
segments to be archived when connecting to a standby with -Xnone.  I
didn't bother reconfirming the version in which that fix took place,
but just by using -X stream instead of "none" we successfully miss the
first segment of the new timeline in the upstream archive, though we
need to erase pg_wal in the backup.  Either the broken archive command
or erasing pg_wal of the cascade is required for the behavior to occur.

The attached is how it looks like.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
# Copyright (c) 2021, PostgreSQL Global Development Group

# Minimal test testing streaming replication
use Cwd;
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Initialize primary node
my $node_primary = get_new_node('primary');
# A specific role is created to perform some tests related to replication,
# and it needs proper authentication configuration.
$node_primary->init(allows_streaming => 1);
$node_primary->append_conf(
'postgresql.conf', qq(
wal_keep_size=128MB
));
$node_primary->start;

my $backup_name = 'my_backup';

# Take backup
$node_primary->backup($backup_name);

my $node_standby_1 = get_new_node('standby_1');
$node_standby_1->init_from_backup($node_primary, $backup_name,
	allows_streaming => 1, has_streaming => 1);
my $archivedir_standby_1 = $node_standby_1->archive_dir;
$node_standby_1->append_conf(
'postgresql.conf', qq(
archive_mode=always
archive_command='cp "%p" "$archivedir_standby_1/%f"'
));
$node_standby_1->start;


# Take backup of standby 1
# NB: Use -Xnone so that pg_wal is empty.
#$node_standby_1->backup($backup_name, backup_options => ['-Xnone']);
$node_standby_1->backup($backup_name);

# Promote the standby.
$node_standby_1->psql('postgres', 'SELECT pg_promote()');

# clean up pg_wal from the backup
my $pgwaldir = $node_standby_1->backup_dir. "/" . $backup_name . "/pg_wal";
opendir my $dh, $pgwaldir or die "failed to open $pgwaldir";
while (my $f = readdir($dh))
{
unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f");
}
closedir($dh);

# Create cascading standby but don't start it yet.
# NB: Must set up both streaming and archiving.
my $node_cascade = get_new_node('cascade');
$node_cascade->init_from_backup($node_standby_1, $backup_name,
has_streaming => 1);
$node_cascade->append_conf(
'postgresql.conf', qq(
restore_command = 'cp "$archivedir_standby_1/%f" "%p"'
log_line_prefix = '%m [%p:%b] %q%a '
archive_mode=off
));


# Start cascade node
$node_cascade->start;

# Create some content on primary and check its presence in standby 1
$node_standby_1->safe_psql('postgres',
"CREATE TABLE tab_int AS SELECT 1 AS a");

# Wait for standbys to catch up
$node_standby_1->wait_for_catchup($node_cascade, 'replay',
$node_standby_1->lsn('replay'));

ok(1, 'test'); # it's a success if we get here.


Re: Race condition in recovery?

2021-05-23 Thread Kyotaro Horiguchi
At Fri, 21 May 2021 12:52:54 -0400, Robert Haas  wrote 
in 
> I had trouble following it completely, but I didn't really spot
> anything that seemed definitely wrong. However, I don't understand
> what it has to do with where we are now. What I want to understand is:
> under exactly what circumstances does it matter that
> WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> will stream from receiveTLI rather than recoveryTargetTLI?

Extracting related descriptions from my previous mail,

- recoveryTargetTimeLine is initialized with
  ControlFile->checkPointCopy.ThisTimeLineID

- readRecoveryCommandFile():
  ...or in the case of
  latest, move it forward up to the maximum timeline among the history
  files found in either pg_wal or archive.

- ReadRecord...XLogFileReadAnyTLI

  Tries to load the history file for recoveryTargetTLI either from
  pg_wal or the archive onto a local TLE list; if the history file is
  not found, it uses a generated list with one entry for the
  recoveryTargetTLI.

  (b) If such a segment is *not* found, expectedTLEs is left
NIL. Usually recoveryTargetTLI is equal to the last checkpoint
TLI.

  (c) However, in the case where timeline switches happened in the
    segment and the recoveryTargetTLI has been increased, that is, the
    history file for the recoveryTargetTLI is found in pg_wal or the
    archive (the issue raised here), recoveryTargetTLI becomes a
    future timeline of the checkpoint TLI.

- WaitForWALToBecomeAvailable

In the case of (c), recoveryTargetTLI > checkpoint TLI.  In this case
  we expect that the checkpoint TLI is in the history of
  recoveryTargetTLI; otherwise recovery fails.  This case is similar
  to case (a), but the relationship between recoveryTargetTLI and
  the checkpoint TLI is not confirmed yet.  ReadRecord barks later if
  they are not compatible, so there's no serious problem, but it might
  be better to check the relationship there.  My first proposal
  performed a mutual check between the two, but we only need to check
  in one direction.

===
So the condition for Dilip's case is, as you wrote in another mail:

- ControlFile->checkPointCopy.ThisTimeLineID is in the older timeline.
- Archive or pg_wal offers the history file for the newer timeline.
- The segment for the checkpoint is not found in pg_wal nor in archive.

That is,

- A grandchild(c) node is stopped
- Then the child node(b) is promoted.

- Clear the pg_wal directory of (c), then connect it to (b) *before* (b)
  archives the newer-timeline copy of the switching segment.  (If we
  have switched at segment 3, TLI=1, the segment file of the older
  timeline is renamed to .partial and then the same segment is created
  for TLI=2.  The former is archived while promotion is performed, but
  the latter won't be archived until the segment ends.)


The original case, after the commit ee994272ca:

- recoveryTargetTimeLine is initialized with
  ControlFile->checkPointCopy.ThisTimeLineID

(X) (Before the commit, we created the one-entry expectedTLEs consisting
   only of ControlFile->checkPointCopy.ThisTimeLineID.)

- readRecoveryCommandFile():

  Move recoveryTargetTLI forward to the specified target timeline if
  the history file for the timeline is found, or in the case of
  latest, move it forward up to the maximum timeline among the history
  files found in either pg_wal or archive.

- ReadRecord...XLogFileReadAnyTLI

  Tries to load the history file for recoveryTargetTLI either from
  pg_wal or the archive onto a local TLE list; if the history file is
  not found, it uses a generated list with one entry for the
  recoveryTargetTLI.

  (b) If such a segment is *not* found, expectedTLEs is left
NIL. Usually recoveryTargetTLI is equal to the last checkpoint
TLI.

- WaitForWALToBecomeAvailable

  If we have had no segments for the last checkpoint, initiate
  streaming from the REDO point of the last checkpoint.  We should have
  all the history files before receiving segment data.

  After sufficient WAL data has been received, the only cases where
  expectedTLEs is still NIL are (b) and (c) above.

  In the case of (b) recoveryTargetTLI == checkpoint TLI.

So I thought that the commit fixed this scenario.  Even in this case,
ReadRecord fails, because the checkpoint segment contains pages for the
older timeline, which is not in expectedTLEs if we had done (X).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-23 Thread Dilip Kumar
On Sun, May 23, 2021 at 2:19 PM Dilip Kumar  wrote:
>
> On Sat, May 22, 2021 at 8:33 PM Robert Haas  wrote:

I have created a tap test based on Robert's test.sh script.  It
reproduces the issue.  I am new with perl so this still needs some
cleanup/improvement, but at least it shows the idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From c797ac980eb774c6ca7400b4b90d76164717acc7 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Sun, 23 May 2021 21:27:58 +0530
Subject: [PATCH] Test for new standby not following promoted standby

---
 src/test/recovery/t/025_timeline_issue.pl | 85 +++
 1 file changed, 85 insertions(+)
 create mode 100644 src/test/recovery/t/025_timeline_issue.pl

diff --git a/src/test/recovery/t/025_timeline_issue.pl b/src/test/recovery/t/025_timeline_issue.pl
new file mode 100644
index 000..3ea3720
--- /dev/null
+++ b/src/test/recovery/t/025_timeline_issue.pl
@@ -0,0 +1,85 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Minimal test testing streaming replication
+use Cwd;
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Prepare a alternative archive command to skip WAL files
+my $script = "#!/usr/bin/perl \n
+use File::Copy; \n
+my (\$source, \$target) = \@ARGV; \n
+if (\$source =~ /history/) \n
+{ \n
+	copy(\$source, \$target); \n
+}";
+
+open my $fh, '>', "skip_cp";
+print {$fh} $script;
+chmod 0777, "skip_cp";
+close $fh;
+
+my $cwd = getcwd();
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+# A specific role is created to perform some tests related to replication,
+# and it needs proper authentication configuration.
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+my $archivedir_primary = $node_primary->archive_dir;
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+archive_command = '"$cwd/skip_cp" "%p" "$archivedir_primary/%f"'
+));
+$node_primary->start;
+my $backup_name = 'my_backup';
+
+# Take backup
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby_1 = get_new_node('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+	allows_streaming => 1, has_streaming => 1, has_archiving => 1);
+$node_standby_1->start;
+
+# Take backup of standby 1
+# NB: Use -Xnone so that pg_wal is empty.
+$node_standby_1->backup($backup_name, backup_options => ['-Xnone']);
+
+# Create cascading standby but don't start it yet.
+# NB: Must set up both streaming and archiving.
+my $node_cascade = get_new_node('cascade');
+$node_cascade->init_from_backup($node_standby_1, $backup_name,
+	has_streaming => 1);
+$node_cascade->append_conf(
+	'postgresql.conf', qq(
+restore_command = 'cp "$archivedir_primary/%f" "%p"'
+log_line_prefix = '%m [%p:%b] %q%a '
+));
+
+# Promote the standby.
+$node_standby_1->psql('postgres', 'SELECT pg_promote()');
+
+# Create some content on primary and check its presence in standby 1
+$node_standby_1->safe_psql('postgres',
+	"CREATE TABLE tab_int AS SELECT 1 AS a");
+
+# Start cascade node
+$node_cascade->start;
+
+# Wait for standbys to catch up
+$node_standby_1->wait_for_catchup($node_cascade, 'replay',
+	$node_standby_1->lsn('replay'));
+
+# clean up
+$node_primary->teardown_node;
+$node_standby_1->teardown_node;
+$node_cascade->teardown_node;
+
+unlink "skip_cp";
+exit(0);
\ No newline at end of file
-- 
1.8.3.1



Re: Race condition in recovery?

2021-05-23 Thread Dilip Kumar
On Sat, May 22, 2021 at 8:33 PM Robert Haas  wrote:

> > For my original case, both standby1 and standby2 are connected to the
> > primary.  Now, standby1 is promoted and standby2 is shut down. And,
> > before restarting, all the local WAL of the standby2 is removed so
> > that it can follow the new primary. The primary info and restore
> > command for standby2 are changed as per the new primary(standby1).
>
> One thing I don't understand is why the final WAL segment from the
> original primary didn't end up in the archive in this scenario. If it
> had, then we would not have seen the issue in that case.

I used different archive folders for the primary and the new
primary (standby1).  I have modified your test.sh slightly (modified
test2.sh attached) so that I can demonstrate my scenario where I was
seeing the issue, and it gets fixed after applying the fix we
discussed [1]:

[1]
- expectedTLEs = readTimeLineHistory(receiveTLI);
+ expectedTLEs = readTimeLineHistory(recoveryTargetTLI);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


test2.sh
Description: application/shellscript


Re: Race condition in recovery?

2021-05-22 Thread Robert Haas
On Sat, May 22, 2021 at 12:45 AM Dilip Kumar  wrote:
> No, in my original scenario also the new standby was not old primary,
> I had 3 nodes
> node1-> primary, node2 -> standby1, node3-> standby2
> node2 promoted as a new primary and node3's local WAL was removed (so
> that it has to stream checkpoint record from new primary and then
> remaining everything happens as you explain in remaining steps).

Oh, OK. I misunderstood. I think it could happen that way, though.

> I haven't tested this, but I will do that.  But now we are on the same
> page about the cause of the actual problem I reported.

Yeah, sorry, I just didn't understand the exact chain of events before.

> For my original case, both standby1 and standby2 are connected to the
> primary.  Now, standby1 is promoted and standby2 is shut down. And,
> before restarting, all the local WAL of the standby2 is removed so
> that it can follow the new primary. The primary info and restore
> command for standby2 are changed as per the new primary(standby1).

One thing I don't understand is why the final WAL segment from the
original primary didn't end up in the archive in this scenario. If it
had, then we would not have seen the issue in that case.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-21 Thread Dilip Kumar
On Sat, May 22, 2021 at 10:15 AM Dilip Kumar  wrote:
>
> On Sat, May 22, 2021 at 1:14 AM Robert Haas  wrote:
> >
> > The attached test script, test.sh seems to reliably reproduce this.
> > Put that file and the recalcitrant_cp script, also attached, into an
>
> I haven't tested this, but I will do that.  But now we are on the same
> page about the cause of the actual problem I reported.

Now, I have tested. I am able to reproduce the actual problem with your script.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Sat, May 22, 2021 at 10:15 AM Dilip Kumar  wrote:
>
> On Sat, May 22, 2021 at 1:14 AM Robert Haas  wrote:
> >
> > On Fri, May 21, 2021 at 12:52 PM Robert Haas  wrote:
> > > I had trouble following it completely, but I didn't really spot
> > > anything that seemed definitely wrong. However, I don't understand
> > > what it has to do with where we are now. What I want to understand is:
> > > under exactly what circumstances does it matter that
> > > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> > > will stream from receiveTLI rather than recoveryTargetTLI?
> >
> > Ah ha! I think I figured it out. To hit this bug, you need to meet the
> > following conditions:
> >
> > 1. Both streaming and archiving have to be configured.
> > 2. You have to promote a new primary.
> > 3. After promoting the new primary you have to start a new standby
> > that doesn't have local WAL and for which the backup was taken from
> > the previous timeline. In Dilip's original scenario, this new standby
> > is actually the old primary, but that's not required.
>
> No, in my original scenario also the new standby was not old primary,
> I had 3 nodes
> node1-> primary, node2 -> standby1, node3-> standby2
> node2 promoted as a new primary and node3's local WAL was removed (so
> that it has to stream checkpoint record from new primary and then
> remaining everything happens as you explain in remaining steps).
>
> > 4. The new standby has to be able to find the history file it needs in
> > the archive but not the WAL files.
> > 5. The new standby needs to have recovery_target_timeline='latest'
> > (which is the default)
> >
> > When you start the new standby, it will fetch the current TLI from its
> > control file. Then, since recovery_target_timeline=latest, the system
> > will try to figure out the latest timeline, which only works because
> > archiving is configured. There seems to be no provision for detecting
> > the latest timeline via streaming. With archiving enabled, though,
> > findNewestTimeLine() will be able to restore the history file created
> > by the promotion of the new primary, which will cause
> > validateRecoveryParameters() to change recoveryTargetTLI.
>
> Right
>
>  Then we'll
> > try to read the WAL segment containing the checkpoint record and fail
> > because, by stipulation, only history files are available from the
> > archive. Now, because streaming is also configured, we'll try
> > streaming. That will work, so we'll be able to read the checkpoint
> > record, but now, because WaitForWALToBecomeAvailable() initialized
> > expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
> > switch to the correct timeline and it all goes wrong.
>
> exactly
>
> > The attached test script, test.sh seems to reliably reproduce this.
> > Put that file and the recalcitrant_cp script, also attached, into an
> > empty directory, cd to that directory, and run test.sh. Afterwards
> > examine pgcascade.log. Basically, these scripts just set up the
> > scenario described above. We set up primary and a standby that use
> > recalcitrant_cp as the archive command, and because it's recalcitrant,
> > it's only willing to copy history files, and always fails for WAL
> files.  Then we create a cascading standby by taking a base backup from
> > the standby, but before actually starting it, we promote the original
> > standby. So now it meets all the conditions described above. I tried a
> > couple variants of this test. If I switch the archive command from
> > recalcitrant_cp to just regular cp, then there's no problem. And if I
> > switch it to something that always fails, then there's also no
> > problem. That's because, with either of those changes, condition (4)
> > above is no longer met. In the first case, both files end up in the
> > archive, and in the second case, neither file.
>
> I haven't tested this, but I will do that.  But now we are on the same
> page about the cause of the actual problem I reported.
>
> > What about hitting this in real life, with a real archive command?
> > Well, you'd probably need the archive command to be kind of slow and
> > get unlucky on the timing, but there's nothing to prevent it from
> > happening.
>
> Right
>
> > But, it will be WAY more likely if you have Dilip's original scenario,
> > where you try to repurpose an old primary as a standby. It would
> > normally be unlikely that the backup used to create a new standby
> > woul

Re: Race condition in recovery?

2021-05-21 Thread Dilip Kumar
On Sat, May 22, 2021 at 1:14 AM Robert Haas  wrote:
>
> On Fri, May 21, 2021 at 12:52 PM Robert Haas  wrote:
> > I had trouble following it completely, but I didn't really spot
> > anything that seemed definitely wrong. However, I don't understand
> > what it has to do with where we are now. What I want to understand is:
> > under exactly what circumstances does it matter that
> > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> > will stream from receiveTLI rather than recoveryTargetTLI?
>
> Ah ha! I think I figured it out. To hit this bug, you need to meet the
> following conditions:
>
> 1. Both streaming and archiving have to be configured.
> 2. You have to promote a new primary.
> 3. After promoting the new primary you have to start a new standby
> that doesn't have local WAL and for which the backup was taken from
> the previous timeline. In Dilip's original scenario, this new standby
> is actually the old primary, but that's not required.

No, in my original scenario also the new standby was not the old primary;
I had 3 nodes:
node1 -> primary, node2 -> standby1, node3 -> standby2
node2 was promoted as the new primary and node3's local WAL was removed
(so that it has to stream the checkpoint record from the new primary, and
then everything else happens as you explain in the remaining steps).

> 4. The new standby has to be able to find the history file it needs in
> the archive but not the WAL files.
> 5. The new standby needs to have recovery_target_timeline='latest'
> (which is the default)
>
> When you start the new standby, it will fetch the current TLI from its
> control file. Then, since recovery_target_timeline=latest, the system
> will try to figure out the latest timeline, which only works because
> archiving is configured. There seems to be no provision for detecting
> the latest timeline via streaming. With archiving enabled, though,
> findNewestTimeLine() will be able to restore the history file created
> by the promotion of the new primary, which will cause
> validateRecoveryParameters() to change recoveryTargetTLI.

Right

> Then we'll
> try to read the WAL segment containing the checkpoint record and fail
> because, by stipulation, only history files are available from the
> archive. Now, because streaming is also configured, we'll try
> streaming. That will work, so we'll be able to read the checkpoint
> record, but now, because WaitForWALToBecomeAvailable() initialized
> expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
> switch to the correct timeline and it all goes wrong.

exactly

> The attached test script, test.sh seems to reliably reproduce this.
> Put that file and the recalcitrant_cp script, also attached, into an
> empty directory, cd to that directory, and run test.sh. Afterwards
> examine pgcascade.log. Basically, these scripts just set up the
> scenario described above. We set up primary and a standby that use
> recalcitrant_cp as the archive command, and because it's recalcitrant,
> it's only willing to copy history files, and always fails for WAL
> files.  Then we create a cascading standby by taking a base backup from
> the standby, but before actually starting it, we promote the original
> standby. So now it meets all the conditions described above. I tried a
> couple variants of this test. If I switch the archive command from
> recalcitrant_cp to just regular cp, then there's no problem. And if I
> switch it to something that always fails, then there's also no
> problem. That's because, with either of those changes, condition (4)
> above is no longer met. In the first case, both files end up in the
> archive, and in the second case, neither file.

I haven't tested this, but I will do that.  But now we are on the same
page about the cause of the actual problem I reported.

> What about hitting this in real life, with a real archive command?
> Well, you'd probably need the archive command to be kind of slow and
> get unlucky on the timing, but there's nothing to prevent it from
> happening.

Right

> But, it will be WAY more likely if you have Dilip's original scenario,
> where you try to repurpose an old primary as a standby. It would
> normally be unlikely that the backup used to create a new standby
> would have an older TLI, because you typically wouldn't switch masters
> in between taking a base backup and using it to create a new standby.
> But the old master always has an older TLI. So (3) is satisfied. For
> (4) to be satisfied, you need the old master to fail to archive all of
> its WAL when it shuts down.

For my original case, both standby1 and standby2 are connected to the
primary.  Now, standby1 is promoted and standby2 is shut down. And,
before restarting, all the local WAL of the standby2 is removed so
that it can follow the new primary. The primary info and restore
command for standby2 are changed as per the new primary(standby1).
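
In TAP terms, that re-pointing of standby2 is roughly the following (a
rough sketch, not the attached test2.sh; the node names and the
conninfo handling are assumptions, and standby1 is assumed to have been
set up with archiving enabled):

# Stop standby2 and throw away its local WAL so that it has to fetch
# everything from the promoted standby1.
$node_standby_2->stop;
unlink glob($node_standby_2->data_dir . '/pg_wal/0*');

# Point standby2 at standby1, for both streaming and archive recovery.
my $archivedir_standby_1 = $node_standby_1->archive_dir;
my $connstr_standby_1 = $node_standby_1->connstr;
$node_standby_2->append_conf(
	'postgresql.conf', qq(
primary_conninfo = '$connstr_standby_1'
restore_command = 'cp "$archivedir_standby_1/%f" "%p"'
));
$node_standby_2->start;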

Now the scenario is that the standby1 has switched the timeline in the
middle of the segment which cont

Re: Race condition in recovery?

2021-05-21 Thread Robert Haas
On Fri, May 21, 2021 at 12:52 PM Robert Haas  wrote:
> I had trouble following it completely, but I didn't really spot
> anything that seemed definitely wrong. However, I don't understand
> what it has to do with where we are now. What I want to understand is:
> under exactly what circumstances does it matter that
> WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> will stream from receiveTLI rather than recoveryTargetTLI?

Ah ha! I think I figured it out. To hit this bug, you need to meet the
following conditions:

1. Both streaming and archiving have to be configured.
2. You have to promote a new primary.
3. After promoting the new primary you have to start a new standby
that doesn't have local WAL and for which the backup was taken from
the previous timeline. In Dilip's original scenario, this new standby
is actually the old primary, but that's not required.
4. The new standby has to be able to find the history file it needs in
the archive but not the WAL files.
5. The new standby needs to have recovery_target_timeline='latest'
(which is the default)

When you start the new standby, it will fetch the current TLI from its
control file. Then, since recovery_target_timeline=latest, the system
will try to figure out the latest timeline, which only works because
archiving is configured. There seems to be no provision for detecting
the latest timeline via streaming. With archiving enabled, though,
findNewestTimeLine() will be able to restore the history file created
by the promotion of the new primary, which will cause
validateRecoveryParameters() to change recoveryTargetTLI. Then we'll
try to read the WAL segment containing the checkpoint record and fail
because, by stipulation, only history files are available from the
archive. Now, because streaming is also configured, we'll try
streaming. That will work, so we'll be able to read the checkpoint
record, but now, because WaitForWALToBecomeAvailable() initialized
expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
switch to the correct timeline and it all goes wrong.

The attached test script, test.sh seems to reliably reproduce this.
Put that file and the recalcitrant_cp script, also attached, into an
empty directory, cd to that directory, and run test.sh. Afterwards
examine pgcascade.log. Basically, these scripts just set up the
scenario described above. We set up primary and a standby that use
recalcitrant_cp as the archive command, and because it's recalcitrant,
it's only willing to copy history files, and always fails for WAL
files.  Then we create a cascading standby by taking a base backup from
the standby, but before actually starting it, we promote the original
standby. So now it meets all the conditions described above. I tried a
couple variants of this test. If I switch the archive command from
recalcitrant_cp to just regular cp, then there's no problem. And if I
switch it to something that always fails, then there's also no
problem. That's because, with either of those changes, condition (4)
above is no longer met. In the first case, both files end up in the
archive, and in the second case, neither file.

What about hitting this in real life, with a real archive command?
Well, you'd probably need the archive command to be kind of slow and
get unlucky on the timing, but there's nothing to prevent it from
happening.

But, it will be WAY more likely if you have Dilip's original scenario,
where you try to repurpose an old primary as a standby. It would
normally be unlikely that the backup used to create a new standby
would have an older TLI, because you typically wouldn't switch masters
in between taking a base backup and using it to create a new standby.
But the old master always has an older TLI. So (3) is satisfied. For
(4) to be satisfied, you need the old master to fail to archive all of
its WAL when it shuts down.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


recalcitrant_cp
Description: Binary data
#!/bin/bash

PG_PRIMARY_PORT=5450
PG_STANDBY_PORT=5451
PG_CASCADE_PORT=5452

# Cleanup anything left over from previous runs.
for d in pgprimary pgstandby pgcascade; do
if test -d $d; then
		pg_ctl stop -D $d;
		rm -rf $d
	fi
	rm -f $d.log
done
rm -rf wal_archive

# Echo commands from this point onward and exit on failure.
set -ex

# Create empty WAL archive.
mkdir wal_archive

# Initialize and start primary.
# NB: Must enable archiving, but only for history files not WAL files.
initdb -D pgprimary
cat >> pgprimary/postgresql.auto.conf <> pgstandby/postgresql.auto.conf <> pgcascade/postgresql.auto.conf <

Re: Race condition in recovery?

2021-05-21 Thread Robert Haas
On Thu, May 20, 2021 at 10:21 PM Kyotaro Horiguchi
 wrote:
> > > Conclusion:
> > > - I think now we agree on the point that initializing expectedTLEs
> > > with the recovery target timeline is the right fix.
> > > - We still have some differences of opinion about what was the
> > > original problem in the base code which was fixed by the commit
> > > (ee994272ca50f70b53074f0febaec97e28f83c4e).
> >
> > I am also still concerned about whether we understand in exactly what
> > cases the current logic doesn't work. We seem to more or less agree on
> > the fix, but I don't think we really understand precisely what case we
> > are fixing.
>
> Does the discussion above make sense?

I had trouble following it completely, but I didn't really spot
anything that seemed definitely wrong. However, I don't understand
what it has to do with where we are now. What I want to understand is:
under exactly what circumstances does it matter that
WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
will stream from receiveTLI rather than recoveryTargetTLI?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-21 Thread Robert Haas
On Fri, May 21, 2021 at 10:39 AM Dilip Kumar  wrote:
> > so we might have
> > the timeline history in RECOVERYHISTORY but that's not the filename
> > we're actually going to try to read from inside readTimeLineHistory().
> > In the second case, findNewestTimeLine() will call
> > existsTimeLineHistory() which results in the same situation. So I
> > think when readRecoveryCommandFile() returns expectedTLI can be
> > different but the history file can be absent since it was only ever
> > restored under a temporary name.
>
> I agree that readTimeLineHistory() will not look for that filename,
> but it will also try to get the file using (RestoreArchivedFile(path,
> histfname, "RECOVERYHISTORY", 0, false)).  So after we check the
> history file existence in existsTimeLineHistory(), if the file got
> removed from the archive (not sure how) then it is possible that now
> readTimeLineHistory() will not find that history file again.  Am I
> missing something?

That sounds right.

I've lost the thread of what we're talking about here a bit. I think
what we've established is that, when running a commit prior to
ee994272ca50f70b53074f0febaec97e28f83c4e, if (a) recovery_target_tli
is set, (b) restore_command works, and (c) nothing's being removed
from the archive concurrently, then by the time StartupXLOG() does
expectedTLEs = readTimeLineHistory(recoveryTargetTLI), any timeline
history file that exists in the archive will have been restored, and
the scenario ee994272ca50f70b53074f0febaec97e28f83c4e was concerned
about won't occur. That's because it was concerned about a scenario
where we failed to restore the history file until after we set
expectedTLEs.

Consequently, if we want to try to reproduce the problem fixed by that
commit, we should look for a scenario that does not involve setting
recovery_target_tli.

Is that the conclusion you were driving towards?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-21 Thread Dilip Kumar
On Fri, May 21, 2021 at 7:51 AM Kyotaro Horiguchi
 wrote:
>
> https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com
>
> > That leaves one case not covered: If you take a backup with plain
> > "pg_basebackup" from a standby, without -X, and the first WAL segment
> > contains a timeline switch (ie. you take the backup right after a
> > failover), and you try to recover from it without a WAL archive, it
> > doesn't work. This is the original issue that started this thread,
> > except that I used "-x" in my original test case. The problem here is
> > that even though streaming replication will fetch the timeline history
> > file when it connects, at the very beginning of recovery, we expect that
> > we already have the timeline history file corresponding the initial
> > timeline available, either in pg_xlog or the WAL archive. By the time
> > streaming replication has connected and fetched the history file, we've
> > already initialized expectedTLEs to contain just the latest TLI. To fix
> > that, we should delay calling readTimeLineHistoryFile() until streaming
> > replication has connected and fetched the file.
> > If the first segment read by recovery contains a timeline switch, the first
> > pages have older timeline than segment timeline. However, if
> > expectedTLEs contained only the segment timeline, we cannot know if
> > we can use the record.  In that case the following error is issued.
>
> If expectedTLEs is initialized with the pseudo list,
> tliOfPointInHistory always returns the recoveryTargetTLI regardless of
> the given LSN so the checking about timelines later doesn't work. And
> later ReadRecord is surprised to see a page of an unknown timeline.

From this whole discussion (on the thread given by you), IIUC the
issue arises if the checkpoint LSN does not exist on the
"ControlFile->checkPointCopy.ThisTimeLineID" timeline. If that is true then I
agree that we will just initialize expectedTLEs based on the one-line
entry (ControlFile->checkPointCopy.ThisTimeLineID) and later we will
fail to find the checkpoint record on this timeline because the
checkpoint LSN is smaller than the start LSN of this timeline. Right?
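
As a tiny standalone illustration of that condition (the LSN values below are made up, not taken from any real cluster): when the checkpoint LSN precedes the point at which the control file's timeline begins, a one-entry expectedTLEs for that timeline cannot cover the checkpoint record.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t checkpoint_lsn = UINT64_C(0x2FFF028);   /* written on TLI 5 */
    uint64_t tli6_start_lsn = UINT64_C(0x30006E0);   /* where TLI 6 begins */

    if (checkpoint_lsn < tli6_start_lsn)
        printf("checkpoint %" PRIX64 " precedes the start of TLI 6 (%" PRIX64 "); "
               "the parent timeline must be in expectedTLEs to read it\n",
               checkpoint_lsn, tli6_start_lsn);
    return 0;
}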

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-21 Thread Dilip Kumar
On Thu, May 20, 2021 at 11:19 PM Robert Haas  wrote:
>
> On Tue, May 18, 2021 at 1:33 AM Dilip Kumar  wrote:
> > Yeah, it will be a fake 1-element list.  But just to be clear that
> > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> > nothing else, do you agree to this?  Because we initialize
> > recoveryTargetTLI to this value and we might change it in
> > readRecoveryCommandFile() but for that, we must get the history file,
> > so if we are talking about the case where we don't have the history
> > file then "recoveryTargetTLI" will still be
> > "ControlFile->checkPointCopy.ThisTimeLineID".
>
> I don't think your conclusion is correct. As I understand it, you're
> talking about the state before
> ee994272ca50f70b53074f0febaec97e28f83c4e,

Right, I am talking about before this commit.

> because as of now
> readRecoveryCommandFile() no longer exists in master. Before that
> commit, StartupXLOG did this:
>
> recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
> readRecoveryCommandFile();
> expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
>
> Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
> doing so, it will call existsTimeLineHistory if
> recovery_target_timeline was set to an integer, and findNewestTimeLine
> if it was set to latest. In the first case, existsTimeLineHistory()
> calls RestoreArchivedFile(), but that restores it using a temporary
> name, and KeepFileRestoredFromArchive is not called,

Yes, I agree with this.

> so we might have
> the timeline history in RECOVERYHISTORY but that's not the filename
> we're actually going to try to read from inside readTimeLineHistory().
> In the second case, findNewestTimeLine() will call
> existsTimeLineHistory() which results in the same situation. So I
> think when readRecoveryCommandFile() returns expectedTLI can be
> different but the history file can be absent since it was only ever
> restored under a temporary name.

I agree that readTimeLineHistory() will not look for that filename,
but it will also try to get the file using (RestoreArchivedFile(path,
histfname, "RECOVERYHISTORY", 0, false)).  So after we check the
history file existence in existsTimeLineHistory(), if the file got
removed from the archive (not sure how) then it is possible that now
readTimeLineHistory() will not find that history file again.  Am I
missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-21 Thread Kyotaro Horiguchi
At Fri, 21 May 2021 11:21:05 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 20 May 2021 13:49:10 -0400, Robert Haas  wrote 
> in 
>   In the case of (c) recoveryTargetTLI > checkpoint TLI.  In this case
>   we expect that the checkpoint TLI is in the history of
>   recoveryTargetTLI. Otherwise recovery fails.  This case is similar
>   to the case (a) but the relationship between recoveryTargetTLI and
>   the checkpoint TLI is not confirmed yet. ReadRecord barks later if
>   they are not compatible so there's not a serious problem, but it might
>   be better to check the relationship there.  My first proposal
>   performed a mutual check between the two but we need to check only
>   unidirectionally.
> 
>   if (readFile < 0)
>   {
>  if (!expectedTLEs)
>{
>   expectedTLEs = readTimeLineHistory(receiveTLI);
> +   if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
> +  ereport(ERROR, "the received timeline %d is not found in the 
> history file for timeline %d");
> 
> 
> > > Conclusion:
> > > - I think now we agree on the point that initializing expectedTLEs
> > > with the recovery target timeline is the right fix.
> > > - We still have some differences of opinion about what was the
> > > original problem in the base code which was fixed by the commit
> > > (ee994272ca50f70b53074f0febaec97e28f83c4e).
> > 
> > I am also still concerned about whether we understand in exactly what
> > cases the current logic doesn't work. We seem to more or less agree on
> > the fix, but I don't think we really understand precisely what case we
> > are fixing.
> 
> Does the discussion above make sense?

This is a revised version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 6664808340494dffc775e21cfa1029d6dfb8452e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 21 May 2021 16:24:14 +0900
Subject: [PATCH v3] Set expectedTLEs correctly based on recoveryTargetTLI

When a standby starts streaming before it determines expectedTLEs, the
list is determined based on the timeline of the WAL segment just
streamed from the primary.  If the standby fetches an older timeline
than recoveryTargetTLI for the starting checkpoint, this behavior leads
to setting expectedTLEs based on that older timeline.  This is
inconsistent with the definition of the variable and prevents later
calls to rescanLatestTimeLine from updating recoveryTargetTLI and
expectedTLEs correctly. If that happens the standby stops following
the upstream.

This behavior was introduced by commit ee994272ca, but there seems to
be no reason for it and it looks like a typo. Since the relationship
between the two timeline IDs is not verified until then, also add a
check for the relationship for safety's sake.
---
 src/backend/access/transam/xlog.c  | 35 ++
 src/test/recovery/t/004_timeline_switch.pl | 55 +-
 2 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 441a9124cd..fda729c63f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12659,22 +12659,37 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 * info is set correctly and XLogReceiptTime isn't
 		 * changed.
 		 */
-		if (readFile < 0)
-		{
-			if (!expectedTLEs)
-expectedTLEs = readTimeLineHistory(receiveTLI);
-			readFile = XLogFileRead(readSegNo, PANIC,
-	receiveTLI,
-	XLOG_FROM_STREAM, false);
-			Assert(readFile >= 0);
-		}
-		else
+		if (readFile >= 0)
 		{
 			/* just make sure source info is correct... */
 			readSource = XLOG_FROM_STREAM;
 			XLogReceiptSource = XLOG_FROM_STREAM;
 			return true;
 		}
+
+		readFile = XLogFileRead(readSegNo, PANIC,
+receiveTLI,
+XLOG_FROM_STREAM, false);
+		Assert(readFile >= 0);
+
+		if (expectedTLEs)
+			break;
+
+		expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+
+		/*
+		 * We haven't checked the relationship between the
+		 * receiveTLI and recoveryTargetTLI. Make sure that
+		 * receiveTLI is in the history for
+		 * recoveryTargetTLI. We don't need to do that
+		 * hereafter since recoveryTargetTLI and expectedTLEs
+		 * will be kept in sync.
+		 */
+		if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
+			ereport(ERROR,
+	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+	 errmsg("history file of timeline %d is inconsistent with the current timeline %d",
+			recoveryTargetTLI, receiveTLI)));
 		break;
 	}
 
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index c101980e9e..357f3bf8fa 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -90,7 +90,14 @@ $node_primary_2->backup

Re: Race condition in recovery?

2021-05-20 Thread Kyotaro Horiguchi
At Thu, 20 May 2021 13:49:10 -0400, Robert Haas  wrote 
in 
> On Tue, May 18, 2021 at 1:33 AM Dilip Kumar  wrote:
> > Yeah, it will be a fake 1-element list.  But just to be clear that
> > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> > nothing else, do you agree to this?  Because we initialize
> > recoveryTargetTLI to this value and we might change it in
> > readRecoveryCommandFile() but for that, we must get the history file,
> > so if we are talking about the case where we don't have the history
> > file then "recoveryTargetTLI" will still be
> > "ControlFile->checkPointCopy.ThisTimeLineID".
> 
> I don't think your conclusion is correct. As I understand it, you're
> talking about the state before
> ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now
> readRecoveryCommandFile() no longer exists in master. Before that
> commit, StartupXLOG did this:
> 
> recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
> readRecoveryCommandFile();
> expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> 
> Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
> doing so, it will call existsTimeLineHistory if
> recovery_target_timeline was set to an integer, and findNewestTimeLine
> if it was set to latest. In the first case, existsTimeLineHistory()
> calls RestoreArchivedFile(), but that restores it using a temporary
> name, and KeepFileRestoredFromArchive is not called, so we might have
> the timeline history in RECOVERYHISTORY but that's not the filename
> we're actually going to try to read from inside readTimeLineHistory().
> In the second case, findNewestTimeLine() will call
> existsTimeLineHistory() which results in the same situation. So I
> think when readRecoveryCommandFile() returns expectedTLI can be
> different but the history file can be absent since it was only ever
> restored under a temporary name.

Anyway it seems that the commit tried to fix an issue that happens
without using a WAL archive.

https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com

> That leaves one case not covered: If you take a backup with plain 
> "pg_basebackup" from a standby, without -X, and the first WAL segment 
> contains a timeline switch (ie. you take the backup right after a 
> failover), and you try to recover from it without a WAL archive, it 
> doesn't work. This is the original issue that started this thread, 
> except that I used "-x" in my original test case. The problem here is 
> that even though streaming replication will fetch the timeline history 
> file when it connects, at the very beginning of recovery, we expect that 
> we already have the timeline history file corresponding the initial 
> timeline available, either in pg_xlog or the WAL archive. By the time 
> streaming replication has connected and fetched the history file, we've 
> already initialized expectedTLEs to contain just the latest TLI. To fix 
> that, we should delay calling readTimeLineHistoryFile() until streaming 
> replication has connected and fetched the file.
> If the first segment read by recovery contains a timeline switch, the first
> pages have older timeline than segment timeline. However, if
> expectedTLEs contained only the segment timeline, we cannot know if
> we can use the record.  In that case the following error is issued.

If expectedTLEs is initialized with the pseudo list,
tliOfPointInHistory always returns the recoveryTargetTLI regardless of
the given LSN so the checking about timelines later doesn't work. And
later ReadRecord is surprised to see a page of an unknown timeline.

"unexpected timeline ID 1 in log segment"

So the objective is to initialize expectedTLEs with the right content
of the history file for the recoveryTargetTLI before ReadRecord fetches
the first record.  After the fix, things work as follows.

- recoveryTargetTimeLine is initialized with
  ControlFile->checkPointCopy.ThisTimeLineID

- readRecoveryCommandFile():

  Move recoveryTargetTLI forward to the specified target timeline if
  the history file for the timeline is found, or in the case of
  latest, move it forward up to the maximum timeline among the history
  files found in either pg_wal or archive.

  !!! Anyway recoveryTargetTLI won't go back behind the checkpoint
  TLI.

- ReadRecord...XLogFileReadAnyTLI

  Tries to load the history file for recoveryTargetTLI either from
  pg_wal or the archive into a local TLE list; if the history file is
  not found, use a generated list with one entry for the
  recoveryTargetTLI.

  (a) If the specified segment file for any timeline in the TLE list
is found, expectedTLEs is initialized with the local list. No need
to worry about expectedTLEs any longer.

  (b) If such a segment is *not* found, expectedTLEs is left
NIL. Usually recoveryTargetTLI is equal to the last checkpoint
TLI.

  (c) However, in the case where timeline switches happened in the
segment and the recoveryTargetTLI has be

Re: Race condition in recovery?

2021-05-20 Thread Robert Haas
On Tue, May 18, 2021 at 1:33 AM Dilip Kumar  wrote:
> Yeah, it will be a fake 1-element list.  But just to be clear that
> 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> nothing else, do you agree to this?  Because we initialize
> recoveryTargetTLI to this value and we might change it in
> readRecoveryCommandFile() but for that, we must get the history file,
> so if we are talking about the case where we don't have the history
> file then "recoveryTargetTLI" will still be
> "ControlFile->checkPointCopy.ThisTimeLineID".

I don't think your conclusion is correct. As I understand it, you're
talking about the state before
ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now
readRecoveryCommandFile() no longer exists in master. Before that
commit, StartupXLOG did this:

recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
readRecoveryCommandFile();
expectedTLEs = readTimeLineHistory(recoveryTargetTLI);

Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
doing so, it will call existsTimeLineHistory if
recovery_target_timeline was set to an integer, and findNewestTimeLine
if it was set to latest. In the first case, existsTimeLineHistory()
calls RestoreArchivedFile(), but that restores it using a temporary
name, and KeepFileRestoredFromArchive is not called, so we might have
the timeline history in RECOVERYHISTORY but that's not the filename
we're actually going to try to read from inside readTimeLineHistory().
In the second case, findNewestTimeLine() will call
existsTimeLineHistory() which results in the same situation. So I
think when readRecoveryCommandFile() returns expectedTLI can be
different but the history file can be absent since it was only ever
restored under a temporary name.

> Conclusion:
> - I think now we agree on the point that initializing expectedTLEs
> with the recovery target timeline is the right fix.
> - We still have some differences of opinion about what was the
> original problem in the base code which was fixed by the commit
> (ee994272ca50f70b53074f0febaec97e28f83c4e).

I am also still concerned about whether we understand in exactly what
cases the current logic doesn't work. We seem to more or less agree on
the fix, but I don't think we really understand precisely what case we
are fixing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-19 Thread Dilip Kumar
On Tue, May 18, 2021 at 12:22 PM Kyotaro Horiguchi
 wrote:

> And finally I think I could reach the situation the commit wanted to fix.
>
> I took a basebackup from a standby just before replaying the first
> checkpoint of the new timeline (by using debugger), without copying
> pg_wal.  In this backup, the control file contains checkPointCopy of
> the previous timeline.
>
> I modified StartupXLOG so that expectedTLEs is set just after first
> determining recoveryTargetTLI, then started the grandchild node.  I
> have the following error and the server fails to continue replication.

> [postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu...
> [startup] LOG:  database system was interrupted while in recovery at log...
> [startup] LOG:  set expectedtles tli=6, length=1
> [startup] LOG:  Probing history file for TLI=7
> [startup] LOG:  entering standby mode
> [startup] LOG:  scanning segment 3 TLI 6, source 0
> [startup] LOG:  Trying fetching history file for TLI=6
> [walreceiver] LOG:  fetching timeline history file for timeline 5 from pri...
> [walreceiver] LOG:  fetching timeline history file for timeline 6 from pri...
> [walreceiver] LOG:  started streaming ... primary at 0/300 on timeline 5
> [walreceiver] DETAIL:  End of WAL reached on timeline 5 at 0/30006E0.
[startup] LOG:  unexpected timeline ID 1 in log segment 00050003, offset 0
> [startup] LOG:  Probing history file for TLI=7
> [startup] LOG:  scanning segment 3 TLI 6, source 0
> (repeats forever)

So IIUC, these logs show that
"ControlFile->checkPointCopy.ThisTimeLineID" is 6 but the
"ControlFile->checkPoint" record is on TL 5?  I think if you had the
old version of the code (before the commit), or the below code [1] right
after initializing expectedTLEs, then you would have hit the FATAL that
the patch had fixed.

While debugging did you check what was the "ControlFile->checkPoint"
LSN vs the first LSN of the first segment with TL6?

expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
[1]
if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
    ControlFile->checkPointCopy.ThisTimeLineID)
{
    ereport(FATAL, ...);
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-17 Thread Kyotaro Horiguchi
At Tue, 18 May 2021 15:52:07 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> FWIW, you could be get a problematic base backup by the following steps.
> 
> 0. (make sure /tmp/hoge is removed)
> 1. apply the attached patch
> 2. create a primary then start
> 3. create a standby then start
> 4. place standby.signal to the primary, then restart it.
> 5. place the file /tmp/hoge.
> 6. promote the "primary".
> 7. You will see a log line like this
>LOG:  WAIT START: CHECKPOINT_ONLINE: TLI=2
> 8. Take a base backup (without copying WAL files)

I carelessly left the "modification" uncommented in the diff file.

@@ -6577,6 +6584,8 @@ StartupXLOG(void)
else
recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 
+   expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+   elog(LOG, "set expectedtles %d, %d", recoveryTargetTLI, list_length(expectedTLEs));

Disabling the lines would show the result of the ancient fix.

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-17 Thread Kyotaro Horiguchi
At Mon, 17 May 2021 10:46:24 +0530, Dilip Kumar  wrote 
in 
> On Mon, May 17, 2021 at 10:09 AM Dilip Kumar  wrote:
> >
> > On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > Before the commit expectedTLEs is always initialized with just one
> > > entry for the TLI of the last checkpoint record.
> >
> > Right
> >
> > > (1) If XLogFileReadAnyTLI() found the segment but no history file
> > > found, that is, using the dummy TLE-list, expectedTLEs is initialized
> > > with the dummy one-entry list. So there's no behavioral change in this
> > > aspect.
> >
> > Yeah, you are right.
> 
> But do you agree that one line entry will always be a checkpoint
> timeline entry?  Because if you notice below code[1] in function
> "readRecoveryCommandFile();", then you will realize that once we come
> out of this function either the "recoveryTargetTLI" is still the checkpoint
> TL, as it was before calling this function, or we must have the
> history file.  That means after exiting this function if we execute
> this line (expectedTLEs = readTimeLineHistory(recoveryTargetTLI);)
> that means either "expectedTLEs" could point to one dummy entry which
> will be nothing but the checkpoint TL entry or it will be holding
> complete history.

Right.

> The patch is trying to say that without the history file the
> checkpoint TL will not be found in "expectedTLEs" because the older TL
> (checkpoint TL) is not the ancestor of the target
> timeline(recoveryTargetTLI).  But ideally, either the target timeline
> should be the same as the checkpoint timeline or we must have the
> history file as I stated in the above paragraph.  Am I missing
> something?

Yeah, that has been the most mysterious point here.  So I searched for
a situation where the one-entry expectedTLEs does not work.

I vaguely believed that there's a case where the REDO point of a
checkpoint is in the timeline previous to the record of the
checkpoint. The previous discussion is based on this case, but that
doesn't seem to happen.  The last replayed checkpoint (that causes a
restartpoint) record is found before promotion and the first
checkpoint starts after promotion.

A little while ago I tried to make a situation where a checkpoint
record is placed in the previous timeline of the tli written in the
control file. But the control file is always written after the
checkpoint record is flushed.


I rebooted myself from this:

ee994272ca:
>There is at least one scenario where this makes a difference: if you take
>a base backup from a standby server right after a timeline switch, the
>WAL segment containing the initial checkpoint record will begin with an
>older timeline ID. Without the timeline history file, recovering that file

And finally I think I could reach the situation the commit wanted to fix.

I took a basebackup from a standby just before replaying the first
checkpoint of the new timeline (by using debugger), without copying
pg_wal.  In this backup, the control file contains checkPointCopy of
the previous timeline.

I modified StartupXLOG so that expectedTLEs is set just after first
determining recoveryTargetTLI, then started the grandchild node.  I
have the following error and the server fails to continue replication.

[postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu...
[startup] LOG:  database system was interrupted while in recovery at log...
[startup] LOG:  set expectedtles tli=6, length=1
[startup] LOG:  Probing history file for TLI=7
[startup] LOG:  entering standby mode
[startup] LOG:  scanning segment 3 TLI 6, source 0
[startup] LOG:  Trying fetching history file for TLI=6
[walreceiver] LOG:  fetching timeline history file for timeline 5 from pri...
[walreceiver] LOG:  fetching timeline history file for timeline 6 from pri...
[walreceiver] LOG:  started streaming ... primary at 0/300 on timeline 5
[walreceiver] DETAIL:  End of WAL reached on timeline 5 at 0/30006E0.
[startup] LOG:  unexpected timeline ID 1 in log segment 00050003, offset 0
[startup] LOG:  Probing history file for TLI=7
[startup] LOG:  scanning segment 3 TLI 6, source 0
(repeats forever)

This seems like the behavior the patch wanted to fix. (I'm not sure
precisely what happened at the time of the "unexpected timeline ID
1..", though. The line is seen only just after the first connection.)

>will fail as the older timeline ID is not recognized to be an ancestor of
>the target timeline. If you try to recover from such a backup, using only
>streaming replication to fetch the WAL, this patch is required for that to
>work.

After I reverted the modification, I got the following behavior
instead from the same backup.

[postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-...
[startup] JST LOG:  database system was interrupted while in recovery at log time 2021-05-18 13:45:59 JST
[startup] JST LOG:  Probing history file for TLI=7
[startup] JST LOG:  entering standby mode
[startup] JST LOG:  Loading history file for TLI=6 (2

Re: Race condition in recovery?

2021-05-17 Thread Dilip Kumar
On Tue, May 18, 2021 at 1:28 AM Robert Haas  wrote:
>
> Sorry, you're right. It couldn't be uninitialized, but it could be a
> fake 1-element list saying there are no ancestors rather than the real
> value. So I think the point was to avoid that.

Yeah, it will be a fake 1-element list.  But just to be clear that
1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
nothing else, do you agree to this?  Because we initialize
recoveryTargetTLI to this value and we might change it in
readRecoveryCommandFile() but for that, we must get the history file,
so if we are talking about the case where we don't have the history
file then "recoveryTargetTLI" will still be
"ControlFile->checkPointCopy.ThisTimeLineID".

>
> I think the issue here is: If expectedTLEs was initialized before the
> history file was available, and contained a dummy 1-element list, then
> tliOfPointInHistory() is going to say that every LSN is on that
> timeline rather than any previous timeline. And if we are supposed to
> be switching timelines then that will lead to this sanity check
> failing.

You are talking about the sanity check of validating the timeline of
the checkpoint record, right?  But as I mentioned earlier, the only
entry in expectedTLEs will be the TLE of the checkpoint record, so how
will the sanity check fail?

>
> I agree, but that's actually bad, isn't it?

Yes, it is bad.

> I mean if we want the
> sanity check to never fail we can just take it out. What we want to
> happen is that the sanity check should pass if the startup timeline if
> the TLI of the startup checkpoint is in the history of the recovery
> target timeline, but fail if it isn't. The only way to achieve that
> behavior is if expectedTLEs is initialized from the recovery target
> timeline.

Yes, I agree with this.  So initializing expectedTLEs with the
recovery target timeline is the right fix.

Conclusion:
- I think now we agree on the point that initializing expectedTLEs
with the recovery target timeline is the right fix.
- We still have some differences of opinion about what was the
original problem in the base code which was fixed by the commit
(ee994272ca50f70b53074f0febaec97e28f83c4e).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-17 Thread Robert Haas
On Sat, May 15, 2021 at 1:25 AM Dilip Kumar  wrote:
> >  As I understand it, the general issue here was that if
> > XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> > prior to this commit it would have to fail, because the foreach() loop
> > in that function would be iterating over an empty list. Heikki tried
> > to make it not fail in that case, by setting tles =
> > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> > *wouldn't* get an empty list.
>
> I might be missing something but I don't agree with this logic.  If
> you see prior to this commit the code flow was like below[1].  So my
> point is if we are calling XLogFileReadAnyTLI() somewhere while
> reading the checkpoint record then note that expectedTLEs were
> initialized unconditionally before we even try to read that checkpoint
> record.  So how could expectedTLEs be uninitialized in
> XLogFileReadAnyTLI?

Sorry, you're right. It couldn't be uninitialized, but it could be a
fake 1-element list saying there are no ancestors rather than the real
value. So I think the point was to avoid that.

> Another point which I am not sure about but still I think that one
> line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere
> related to the purpose of this commit.  Let me explain why do I think
> so.  Basically, before this commit, we were initializing
> "expectedTLEs" based on the history file of "recoveryTargetTLI", right
> after calling "readRecoveryCommandFile()" (this function will
> initialize recoveryTargetTLI based on the recovery target, and it
> ensures it read the respective history file).  Now, right after this
> point, there was a check as shown below[2], which is checking whether
> the checkpoint TLI exists in the  "expectedTLEs" which is initialized
> based on recoveryTargetTLI.  And it appeared that this check was
> failing in some cases which this commit tried to fix and all other
> code is there to support that.  Because now before going for reading
> the checkpoint we are not initializing "expectedTLEs" so now after
> moving this line from here it was possible that "expectedTLEs" is not
> initialized in XLogFileReadAnyTLI() and the remaining code in
> XLogFileReadAnyTLI() is to handle that part.

I think the issue here is: If expectedTLEs was initialized before the
history file was available, and contained a dummy 1-element list, then
tliOfPointInHistory() is going to say that every LSN is on that
timeline rather than any previous timeline. And if we are supposed to
be switching timelines then that will lead to this sanity check
failing.
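
To make that concrete, here is a small standalone model (simplified types and made-up LSNs; not the server's tliOfPointInHistory() or TimeLineHistoryEntry): with the full history the checkpoint LSN maps to the older timeline, while the one-entry dummy list attributes every LSN to the latest timeline, so the comparison against the control file's TLI spuriously fails.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr 0

typedef struct
{
    unsigned   tli;
    XLogRecPtr begin;   /* switch point where this TLI starts; 0 = unbounded */
    XLogRecPtr end;     /* where it ends; 0 = open-ended */
} TleEntry;

static unsigned
tli_of_point_in_history(XLogRecPtr ptr, const TleEntry *tles, int n)
{
    for (int i = 0; i < n; i++)
    {
        if ((tles[i].begin == InvalidXLogRecPtr || tles[i].begin <= ptr) &&
            (tles[i].end == InvalidXLogRecPtr || ptr < tles[i].end))
            return tles[i].tli;
    }
    return 0;   /* not found */
}

int
main(void)
{
    XLogRecPtr checkpoint = 0x2FFF028;  /* checkpoint actually written on TLI 5 */

    /* real history for TLI 6: TLI 5 up to 0x30006E0, then TLI 6 */
    TleEntry full[] = {
        {6, 0x30006E0, 0},
        {5, 0, 0x30006E0},
    };
    /* dummy list built without the history file: just "TLI 6, everything" */
    TleEntry dummy[] = { {6, 0, 0} };

    printf("full history : checkpoint on TLI %u (matches control file's TLI 5)\n",
           tli_of_point_in_history(checkpoint, full, 2));
    printf("dummy history: checkpoint on TLI %u (mismatch -> spurious FATAL)\n",
           tli_of_point_in_history(checkpoint, dummy, 1));
    return 0;
}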

> Now, coming to my point that why this one line is related,  In this
> one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
> completely avoiding recoveryTargetTLI and initializing "expectedTLEs"
> based on the history file of the TL from which we read the checkpoint,
> so now, there is no scope of below[2] check to fail because note that
> we are not initializing "expectedTLEs" based on the
> "recoveryTargetTLI" but we are initializing from the history from
> where we read checkpoint.

I agree, but that's actually bad, isn't it? I mean if we want the
sanity check to never fail we can just take it out. What we want to
happen is that the sanity check should pass if the TLI of the startup
checkpoint is in the history of the recovery
target timeline, but fail if it isn't. The only way to achieve that
behavior is if expectedTLEs is initialized from the recovery target
timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-16 Thread Dilip Kumar
On Mon, May 17, 2021 at 10:09 AM Dilip Kumar  wrote:
>
> On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi
>  wrote:
> >
> > Before the commit expectedTLEs is always initialized with just one
> > entry for the TLI of the last checkpoint record.
>
> Right
>
> > (1) If XLogFileReadAnyTLI() found the segment but no history file
> > found, that is, using the dummy TLE-list, expectedTLEs is initialized
> > with the dummy one-entry list. So there's no behavioral change in this
> > aspect.
>
> Yeah, you are right.

But do you agree that one line entry will always be a checkpoint
timeline entry?  Because if you notice below code[1] in function
"readRecoveryCommandFile();", then you will realize that once we come
out of this function either the "recoveryTargetTLI" is still the checkpoint
TL, as it was before calling this function, or we must have the
history file.  That means after exiting this function if we execute
this line (expectedTLEs = readTimeLineHistory(recoveryTargetTLI);)
that means either "expectedTLEs" could point to one dummy entry which
will be nothing but the checkpoint TL entry or it will be holding
complete history.

The patch is trying to say that without the history file the
checkpoint TL will not be found in "expectedTLEs" because the older TL
(checkpoint TL) is not the ancestor of the target
timeline(recoveryTargetTLI).  But ideally, either the target timeline
should be the same as the checkpoint timeline or we must have the
history file as I stated in the above paragraph.  Am I missing
something?

[1]
if (rtli)
{
    /* Timeline 1 does not have a history file, all else should */
    if (rtli != 1 && !existsTimeLineHistory(rtli))
        ereport(FATAL,
                (errmsg("recovery target timeline %u does not exist",
                        rtli)));
    recoveryTargetTLI = rtli;
    recoveryTargetIsLatest = false;
}
else
{
    /* We start the "latest" search from pg_control's timeline */
    recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
    recoveryTargetIsLatest = true;
}
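
For illustration, a standalone sketch of what the "latest" search conceptually does. This is not the server's findNewestTimeLine(), which probes existsTimeLineHistory() for successive TLIs in both pg_wal and the archive; the directory name and starting TLI below are assumptions for the example.

#include <dirent.h>
#include <stdio.h>
#include <string.h>

static unsigned
find_newest_timeline(const char *dir, unsigned start_tli)
{
    unsigned newest = start_tli;
    DIR *d = opendir(dir);
    struct dirent *de;

    if (d == NULL)
        return newest;

    while ((de = readdir(d)) != NULL)
    {
        unsigned tli;
        char suffix[16];

        /* history files are named like 00000006.history */
        if (sscanf(de->d_name, "%8X.%15s", &tli, suffix) == 2 &&
            strcmp(suffix, "history") == 0 &&
            tli > newest)
            newest = tli;
    }
    closedir(d);
    return newest;
}

int
main(void)
{
    /* assumption: history files live in ./pg_wal for this toy */
    printf("newest timeline found: %u\n", find_newest_timeline("pg_wal", 1));
    return 0;
}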


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-16 Thread Dilip Kumar
On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi
 wrote:
>
> Before the commit expectedTLEs is always initialized with just one
> entry for the TLI of the last checkpoint record.

Right

> (1) If XLogFileReadAnyTLI() found the segment but no history file
> found, that is, using the dummy TLE-list, expectedTLEs is initialized
> with the dummy one-entry list. So there's no behavioral change in this
> aspect.

Yeah, you are right.

> (2) If we didn't find the segment for the checkpoint record, it starts
> replication and fetches history files and WAL records then revisits
> XLogFileReadAnyTLI.  Now we have both the history file and segments,
> it successfully reads the record.  The difference in expectedTLEs made
> by the patch is having just one entry or all the entries for the past.

Correct.

> Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
> rescanLatestTimeLine updates the list properly so no need to worry
> about the future.  So the issue would be in the past timelines.  After
> reading the checkpoint record, if we need to rewind to the previous
> timeline for the REDO point, the dummy list is inconvenient.
>
> So there is a possibility that the patch fixed the case (2), where the
> standby has neither the segment for the checkpoint record nor
> the history file for the checkpoint, and the REDO point is on the last
> TLI.  If it is correct, the patch still fails for the case (1), that
> is, the issue raised here. Anyway it would be useless (and rather
> harmful) to initialize expectedTLEs based on receiveTLI there.
>
> So my result there is:
>
> The commit fixed the case (2)

Yes, by maintaining the entire history instead of one entry if history
was missing.

> The fix caused the issue for the case (1).

Basically, before this commit expectedTLEs and recoveryTargetTLI were
always in sync, which this patch broke.

> The proposed fix fixes the case (1), caused by the commit.

Right, I agree with the fix. So the fix should be just to change that
one line and initialize expectedTLEs from recoveryTargetTLI.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-16 Thread Kyotaro Horiguchi
At Mon, 17 May 2021 13:01:04 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 17 May 2021 12:20:12 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
> > rescanLatestTimeLine updates the list properly so no need to worry
> > about the future.  So the issue would be in the past timelines.  After
> > reading the checkpoint record, if we need to rewind to the previous
> > timeline for the REDO point, the dummy list is inconvenient.
> 
> By the way, I tried reproducing this situation, but ended up finding
> it kind of impossible because pg_basebackup (or pg_stop_backup())
> waits for the promotion checkpoint to end.

Mmm. That's wrong.  What the tool waits for is not a promotion
checkpoint, but a backup checkpoint, maybe.  (I don't remember clearly
about that, sorry.)

> If we make a backup in a somewhat broken steps, that could be done but
> I didn't try.

So there might still be a way to reproduce that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-16 Thread Kyotaro Horiguchi
At Mon, 17 May 2021 12:20:12 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
> rescanLatestTimeLine updates the list properly so no need to worry
> about the future.  So the issue would be in the past timelines.  After
> reading the checkpoint record, if we need to rewind to the previous
> timeline for the REDO point, the dummy list is inconvenient.

By the way, I tried reproducing this situation, but ended up finding
it kind of impossible because pg_basebackup (or pg_stop_backup())
waits for the promotion checkpoint to end.

If we make a backup in a somewhat broken steps, that could be done but
I didn't try.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-16 Thread Kyotaro Horiguchi
At Sat, 15 May 2021 10:55:05 +0530, Dilip Kumar  wrote 
in 
> On Sat, May 15, 2021 at 3:58 AM Robert Haas  wrote:
> >
> > I did notice, but keep in mind that this was more than 8 years ago.
> > Even if Heikki is reading this thread, he may not remember why he
> > changed 1 line of code one way rather than another in 2013. I mean if
> > he does that's great, but it's asking a lot.
> 
> I agree with your point.  But I think that one line is related to the
> purpose of this commit and I have explained (in 3rd paragraph below)
> why I think so.
> 
>  As I understand it, the general issue here was that if
> > XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> > prior to this commit it would have to fail, because the foreach() loop
> > in that function would be iterating over an empty list. Heikki tried
> > to make it not fail in that case, by setting tles =
> > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> > *wouldn't* get an empty list.
> 
> I might be missing something but I don't agree with this logic.  If
> you see prior to this commit the code flow was like below[1].  So my
> point is if we are calling XLogFileReadAnyTLI() somewhere while
> reading the checkpoint record then note that expectedTLEs were
> initialized unconditionally before we even try to read that checkpoint
> record.  So how could expectedTLEs be uninitialized in
> XLogFileReadAnyTLI?

Mmm. I think both of you are right.  Before the commit,
XLogFileReadAnyTLI expected that expectedTLEs is initialized. After
the commit it can no longer expect that, so the function was changed
to try fetching the history by itself.  *If* an appropriate history
file is found, it *initializes* expectedTLEs with the content.

> [1]
> StartupXLOG(void)
> {
> 
> 
> recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
> ...
> readRecoveryCommandFile();
> ...
> expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> ...
> ..
> record = ReadCheckpointRecord(checkPointLoc, 0);
> 
> 
> Another point which I am not sure about but still I think that one
> line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere
> related to the purpose of this commit.  Let me explain why do I think
> so.  Basically, before this commit, we were initializing
> "expectedTLEs" based on the history file of "recoveryTargetTLI", right
> after calling "readRecoveryCommandFile()" (this function will
> initialize recoveryTargetTLI based on the recovery target, and it
> ensures it read the respective history file).  Now, right after this
> point, there was a check as shown below[2], which is checking whether
> the checkpoint TLI exists in the  "expectedTLEs" which is initialized
> based on recoveryTargetTLI.  And it appeared that this check was
> failing in some cases which this commit tried to fix and all other
> code is there to support that.  Because now before going for reading
> the checkpoint we are not initializing "expectedTLEs" so now after
> moving this line from here it was possible that "expectedTLEs" is not
> initialized in XLogFileReadAnyTLI() and the remaining code in
> XLogFileReadAnyTLI() is to handle that part.

Before the commit expectedTLEs is always initialized with just one
entry for the TLI of the last checkpoint record.

(1) If XLogFileReadAnyTLI() found the segment but no history file
found, that is, using the dummy TLE-list, expectedTLEs is initialized
with the dummy one-entry list. So there's no behavioral change in this
aspect.

(2) If we didn't find the segment for the checkpoint record, it starts
replication and fetches history files and WAL records then revisits
XLogFileReadAnyTLI.  Now we have both the history file and segments,
it successfully reads the record.  The difference in expectedTLEs made
by the patch is having just one entry or all the entries for the past.

Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
rescanLatestTimeLine updates the list properly so no need to worry
about the future.  So the issue would be in the past timelines.  After
reading the checkpoint record, if we need to rewind to the previous
timeline for the REDO point, the dummy list is inconvenient.

So there is a possibility that the patch fixed the case (2), where the
standby has neither the segment for the checkpoint record nor
the history file for the checkpoint, and the REDO point is on the last
TLI.  If it is correct, the patch still fails for the case (1), that
is, the issue raised here. Anyway it would be useless (and rather
harmful) to initialize expectedTLEs based on receiveTLI there.

So my result there is:

The commit fixed the case (2)
The fix caused the issue for the case (1).
The proposed fix fixes the case (1), caused by the commit.

There's another issue in the case (1) where the REDO point is back on
the previous timeline, which it is doubtful we need to fix.

> Now, coming to my point that why this one line is related,  In this
> one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
> co

Re: Race condition in recovery?

2021-05-14 Thread Dilip Kumar
On Sat, May 15, 2021 at 3:58 AM Robert Haas  wrote:
>
> I did notice, but keep in mind that this was more than 8 years ago.
> Even if Heikki is reading this thread, he may not remember why he
> changed 1 line of code one way rather than another in 2013. I mean if
> he does that's great, but it's asking a lot.

I agree with your point.  But I think that one line is related to the
purpose of this commit and I have explained (in 3rd paragraph below)
why I think so.

>  As I understand it, the general issue here was that if
> XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> prior to this commit it would have to fail, because the foreach() loop
> in that function would be iterating over an empty list. Heikki tried
> to make it not fail in that case, by setting tles =
> readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> *wouldn't* get an empty list.

I might be missing something but I don't agree with this logic.  If
you see prior to this commit the code flow was like below[1].  So my
point is if we are calling XLogFileReadAnyTLI() somewhere while
reading the checkpoint record then note that expectedTLEs were
initialized unconditionally before we even try to read that checkpoint
record.  So how could expectedTLEs be uninitialized in
XLogFileReadAnyTLI?

[1]
StartupXLOG(void)
{


recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
...
readRecoveryCommandFile();
...
expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
...
..
record = ReadCheckpointRecord(checkPointLoc, 0);


Another point which I am not sure about but still I think that one
line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere
related to the purpose of this commit.  Let me explain why do I think
so.  Basically, before this commit, we were initializing
"expectedTLEs" based on the history file of "recoveryTargetTLI", right
after calling "readRecoveryCommandFile()" (this function will
initialize recoveryTargetTLI based on the recovery target, and it
ensures it read the respective history file).  Now, right after this
point, there was a check as shown below[2], which is checking whether
the checkpoint TLI exists in the  "expectedTLEs" which is initialized
based on recoveryTargetTLI.  And it appeared that this check was
failing in some cases which this commit tried to fix and all other
code is there to support that.  Because now before going for reading
the checkpoint we are not initializing "expectedTLEs" so now after
moving this line from here it was possible that "expectedTLEs" is not
initialized in XLogFileReadAnyTLI() and the remaining code in
XLogFileReadAnyTLI() is to handle that part.

Now, coming to my point that why this one line is related,  In this
one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
completely avoiding recoveryTargetTLI and initializing "expectedTLEs"
based on the history file of the TL from which we read the checkpoint,
so now, there is no scope of below[2] check to fail because note that
we are not initializing "expectedTLEs" based on the
"recoveryTargetTLI" but we are initializing from the history from
where we read checkpoint.

So I feel if we directly fix this one line to initialize
"expectedTLEs" from "recoveryTargetTLI" then it will expose to the
same problem this commit tried to fix.

[2]
if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
    ControlFile->checkPointCopy.ThisTimeLineID)
{
    ereport(FATAL, ...);
}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Race condition in recovery?

2021-05-14 Thread Robert Haas
On Fri, May 14, 2021 at 12:59 AM Dilip Kumar  wrote:
> I am not sure that have you noticed the commit id which changed the
> definition of expectedTLEs, Heikki has committed that change so adding
> him in the list to know his opinion.

I did notice, but keep in mind that this was more than 8 years ago.
Even if Heikki is reading this thread, he may not remember why he
changed 1 line of code one way rather than another in 2013. I mean if
he does that's great, but it's asking a lot.

> =
> ee994272ca50f70b53074f0febaec97e28f83c4e
> Author: Heikki Linnakangas   2013-01-03 14:11:58
> Committer: Heikki Linnakangas   2013-01-03 14:11:58
>
> Delay reading timeline history file until it's fetched from master.
> .
>  Without the timeline history file, recovering that file
> will fail as the older timeline ID is not recognized to be an ancestor of
> the target timeline. If you try to recover from such a backup, using only
> streaming replication to fetch the WAL, this patch is required for that to
> work.
> =
>
> Part of this commit message says that it will not identify the
> recoveryTargetTLI as the ancestor of the checkpoint timeline (without
> history file).  I did not understand what it is trying to say.  Is
> it trying to say that even though the recoveryTargetTLI is the
> ancestor of the checkpoint timeline, we cannot track that because
> we don't have a history file?  So to handle this problem the
> definition of expectedTLEs was changed to directly point to the checkpoint
> timeline?
>
> Because before this commit, we were directly initializing expectedTLEs
> with the history file of recoveryTargetTLI, we were not even waiting
> for reading the checkpoint,  but under this commit, it is changed.

Well, I think that is talking about what the commit did in general,
not specifically the one line of code that we think may be incorrect.
As I understand it, the general issue here was that if
XLogFileReadAnyTLI() was called before expectedTLEs got set, then
prior to this commit it would have to fail, because the foreach() loop
in that function would be iterating over an empty list. Heikki tried
to make it not fail in that case, by setting tles =
readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
*wouldn't* get an empty list.

Thinking about this a bit more, I think the idea behind the logic this
commit added to XLogFileReadAnyTLI() is that
XLogFileReadAnyTLI(recoveryTargetTLI) may or may not produce the
correct answer. If the timeline history file exists, it will contain
all the information that we need and will return a complete list of
TLEs. But if the file does not exist yet, then it will return a
1-entry list saying that the TLI in question has no parents. If
readTimeLineHistory() actually reads the file, then it's safe to save
the return value in expectedTLEs, but if it doesn't, then it may or
may not be safe. If XLogFileReadAnyTLI calls XLogFileRead and it
works, then the WAL segment we need exists on our target timeline and
we don't actually need the timeline history for anything because we
can just directly begin replay from the target timeline. But if
XLogFileRead fails with the 1-entry dummy list, then we need the
timeline history and don't have it yet, so we have to retry later,
when the history file will hopefully be present, and then at that
point readTimeLineHistory will return a different and better answer
and hopefully it will all work.
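
A rough standalone model of that decision, with stubbed booleans standing in for the actual history-file read and XLogFileRead() probing (this is not the real XLogFileReadAnyTLI()): the list is published into expectedTLEs only when the wanted segment is actually found; otherwise nothing is published and the caller retries after streaming has fetched more.

#include <stdbool.h>
#include <stdio.h>

typedef enum
{
    TLES_UNSET,   /* expectedTLEs not yet initialized */
    TLES_DUMMY,   /* one-entry list: target TLI "has no ancestors" */
    TLES_FULL     /* real contents of the timeline history file */
} TleListKind;

static TleListKind expectedTLEs = TLES_UNSET;

static bool
xlog_file_read_any_tli(bool history_file_available, bool segment_found)
{
    TleListKind tles = history_file_available ? TLES_FULL : TLES_DUMMY;

    /* "segment_found" stands in for trying XLogFileRead() on each TLI in tles */
    if (segment_found)
    {
        /* safe: either we have the real history, or no timeline switch is needed */
        expectedTLEs = tles;
        return true;
    }

    /* dummy list and no segment: leave expectedTLEs unset and retry later */
    return false;
}

int
main(void)
{
    if (!xlog_file_read_any_tli(false, false))
        printf("segment not found: expectedTLEs stays unset, retry later\n");

    /* streaming replication has since fetched the history file and the WAL */
    if (xlog_file_read_any_tli(true, true))
        printf("retry: expectedTLEs = %s\n",
               expectedTLEs == TLES_FULL ? "full history" : "dummy list");
    return 0;
}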

I think this is what the commit message is talking about when it says
that "Without the timeline history file, recovering that file will
fail as the older timeline ID is not recognized to be an ancestor of
the target timeline." Without the timeline history file, we can't know
whether some other timeline is an ancestor or not. But the specific
way that manifests is that XLogFileReadAnyTLI() returns a 1-entry
dummy list instead of the real contents of the timeline history file.
This commit doesn't prevent that from happening, but it does prevent
the 1-entry dummy list from getting stored in the global variable
expectedTLEs, except in the case where no timeline switch is occurring
and the lack of history therefore doesn't matter. Without this commit,
if the call to readTimeLineHistory(recoveryTargetTLI) happens at a
time when the timeline history file is not yet available, the 1-entry
dummy list ends up in the global variable and there's no way for it to
ever be replaced with a real history even if the timeline history file
shows up in the archive later.

As I see it, the question on the table here is whether there's any
justification for the fact that when the second switch in
WaitForWALToBecomeAvailable takes the
XLOG_FROM_ARCHIVE/XLOG_FROM_PG_WAL branch, it calls XLogFileReadAnyTLI
which tries to read the history of recoveryTargetTLI, while when that
same switch takes the XLOG_FROM_STREAM branch, it tries to read the
history of receiveTLI. I tend to think it doesn't make sense. On
general principle, archiving 
