Re: snapper and skink and fairywren (oh my!)

2022-01-27 Thread Andrew Dunstan


On 1/26/22 18:45, Andrew Dunstan wrote:
> On 1/26/22 16:17, Robert Haas wrote:
>> 3. fairywren failed the last run in module-commit_tsCheck. It's unhappy 
>> because:
>>
> Intermittent failures can falsely implicate whatever set of commits
> happens to be the latest. These failures started happening regularly
> about 49 days ago, so we need to look at what exactly happened around
> then. See
> 


Specifically, the issue is discussed here

Tom has already reverted the offending changes on the back branches.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: snapper and skink and fairywren (oh my!)

2022-01-26 Thread Noah Misch
On Wed, Jan 26, 2022 at 04:17:19PM -0500, Robert Haas wrote:
> 1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
> latest run as of this writing shows this:
> 
> [19:09:50] t/026_overwrite_contrecord.pl  ok    43136 ms
> # poll_query_until timed out executing this query:
> # SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
> pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
> # expecting this output:
> # t
> # last actual query output:
> #

I expect commit ce6d793 will end the failures on snapper, kittiwake, and
tadarida.  I apologize for letting the damage last months.




Re: snapper and skink and fairywren (oh my!)

2022-01-26 Thread Tom Lane
Kyotaro Horiguchi  writes:
> I'm not sure why the standby's walsender kept running for such a long
> time without noticing that the primary had gone down.

Isn't this precisely the problem that made us revert the "graceful
disconnection" patch [1]?  Munro seems to have a theory about
why that broke things, but in any case it's not Haas' fault ;-)

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com




Re: snapper and skink and fairywren (oh my!)

2022-01-26 Thread Kyotaro Horiguchi
At Wed, 26 Jan 2022 18:45:53 -0500, Andrew Dunstan wrote:
> It's very unlikely any of this is your fault. In any case, intermittent
> failures are very hard to nail down.

The primary starts at 2022-01-26 16:30:06.613, then accepts a connection
from the standby at 2022-01-26 16:30:09.911.

P: 2022-01-26 16:30:06.613 UTC [74668:1] LOG:  starting PostgreSQL 15devel on 
x86_64-w64-mingw32, compiled by gcc.exe (Rev2, Built by MSYS2 project) 10.3.0, 
64-bit
S: 2022-01-26 16:30:09.637 UTC [72728:1] LOG:  starting PostgreSQL 15devel on 
x86_64-w64-mingw32, compiled by gcc.exe (Rev2, Built by MSYS2 project) 10.3.0, 
64-bit
P: 2022-01-26 16:30:09.911 UTC [74096:3] [unknown] LOG:  replication connection 
authorized: user=pgrunner application_name=standby
S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG:  started streaming WAL from 
primary at 0/300 on timeline 1

After this the primary restarts.

P: 2022-01-26 16:30:11.832 UTC [74668:7] LOG:  database system is shut down
P: 2022-01-26 16:30:12.010 UTC [72140:1] LOG:  starting PostgreSQL 15devel on 
x86_64-w64-mingw32, compiled by gcc.exe (Rev2, Built by MSYS2 project) 10.3.0, 
64-bit
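
For context, what the test is doing at this point is essentially
"restart the primary, then wait for the standby to replay up to a
target LSN". A minimal sketch, using the standard TAP node methods on
hypothetical $primary/$standby handles rather than the actual
003_standby_2.pl code:

# Rough sketch only; the real test does more around this step.
$primary->restart;
my $target_lsn = $primary->lsn('insert');    # LSN the standby must replay past
$standby->poll_query_until('postgres',
    "SELECT '$target_lsn'::pg_lsn <= pg_last_wal_replay_lsn()")
  or die "timed out waiting for standby to replay past $target_lsn";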

But the standby doesn't notice the disconnection, and the test keeps
running poll_query_until() while the standby's streaming connection is
stale. Still, the query should eventually have succeeded once the
standby reconnected. The following standby log lines are complete;
nothing has been omitted.

S: 2022-01-26 16:30:09.912 UTC [73932:1] LOG:  started streaming WAL from 
primary at 0/300 on timeline 1
S: 2022-01-26 16:30:12.825 UTC [72760:1] [unknown] LOG:  connection received: 
host=127.0.0.1 port=60769
S: 2022-01-26 16:30:12.830 UTC [72760:2] [unknown] LOG:  connection 
authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi 
(C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_003_standby_2_standby_data/pgdata/pg_hba.conf:2)
S: 2022-01-26 16:30:12.830 UTC [72760:3] [unknown] LOG:  connection authorized: 
user=pgrunner database=postgres application_name=003_standby_2.pl
S: 2022-01-26 16:30:12.838 UTC [72760:4] 003_standby_2.pl LOG:  statement: 
SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()

Since the standby doesn't receive any WAL from the restarted server,
poll_query_until() never succeeds.
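
For reference, poll_query_until() amounts to a retry loop along these
lines (a simplified sketch under assumptions, not the real TAP helper,
which handles timeouts, diagnostics and connection details more
carefully):

# Re-run $query until it returns $expected or we give up.  Assumes the
# environment (PGHOST/PGPORT) points at the standby under test.
sub poll_query_until_sketch
{
    my ($query, $expected) = @_;
    $expected //= 't';
    foreach my $attempt (1 .. 180)   # the real helper derives its limit from a timeout setting
    {
        my $out = `psql -XAt -d postgres -c "$query"`;
        chomp $out;
        return 1 if $out eq $expected;
        sleep 1;
    }
    return 0;    # timed out: the expected output never appeared
}

# In this test run, the call that timed out was effectively:
poll_query_until_sketch("SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()");

Each retry re-evaluates pg_last_wal_replay_lsn() on the standby, so the
loop can only succeed once the standby reconnects to the restarted
primary and replays past the target LSN.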

I'm not sure why the standby's walsender kept running for such a long
time without noticing that the primary had gone down.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: snapper and skink and fairywren (oh my!)

2022-01-26 Thread Andrew Dunstan


On 1/26/22 16:17, Robert Haas wrote:
> 3. fairywren failed the last run in module-commit_tsCheck. It's unhappy 
> because:
>
> [16:30:02] t/002_standby.pl  ok    13354 ms ( 0.06 usr  0.00 sys +
>  1.11 cusr  3.20 csys =  4.37 CPU)
> # poll_query_until timed out executing this query:
> # SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> I don't know what is causing any of these failures, and I don't know
> if there's already some discussion elsewhere that I've missed, but
> maybe this email will be helpful to someone. I also noticed that 2 out
> of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
> pg_hba_file_rules for authentication method cert" as the only new
> commit since the last run, and it's hardly believable that that commit
> would have broken this. Nor do I see any other recent changes that
> look like likely culprits. Apologies in advance if any of this is my
> fault.
>

Intermittent failures can falsely implicate whatever set of commits
happens to be the latest. These failures started happening regularly
about 49 days ago, so we need to look at what exactly happened around
then. See


It's very unlikely any of this is your fault. In any case, intermittent
failures are very hard to nail down.

I should add that both drongo and fairywren run on a single
AWS/EC2/Windows 2019 instance. I haven't updated the OS in a while, so
I'm doing that now to see if matters improve. FWIW, I just reproduced
the error on a similar instance where I'm testing some stuff for
Andres' project to make the TAP tests more portable.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





snapper and skink and fairywren (oh my!)

2022-01-26 Thread Robert Haas
I was thinking about whether it made sense to try to commit anything
and decided it would be a good idea to check how the buildfarm looks
first. It doesn't look great.

1. snapper failed 4 out of the last 5 runs in recoveryCheck. The
latest run as of this writing shows this:

[19:09:50] t/026_overwrite_contrecord.pl  ok    43136 ms
# poll_query_until timed out executing this query:
# SELECT '0/1415E310' <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'standby_1';
# expecting this output:
# t
# last actual query output:
#

2. skink failed the last run in the MiscCheck phase. I had no idea
what that phase was, because the name isn't very descriptive. It seems
that it runs some stuff in contrib and some stuff in src/test/modules,
which is a bit confusing. Anyway, the failure here is:

test oldest_xmin  ... FAILED 5533 ms

To find the failure, the regression test suite suggests looking at
contrib/test_decoding/output_iso/regression.diffs or
regression.out. But neither file is in the buildfarm results so far as
I can see. It does have the postmaster log but I can't tell what's
gone wrong from looking at that. In fact I'm not really sure it's the
correct log file, because oldest_xmin.spec uses a slot called
"isolation_slot" and test_decoding/log/postmaster.log refers only to
"regression_slot", so it seems like this postmaster.log file might
cover only the pg_regress tests and not the results from the isolation
tester.

3. fairywren failed the last run in module-commit_tsCheck. It's unhappy because:

[16:30:02] t/002_standby.pl  ok    13354 ms ( 0.06 usr  0.00 sys +
 1.11 cusr  3.20 csys =  4.37 CPU)
# poll_query_until timed out executing this query:
# SELECT '0/303C7D0'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
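
The two poll_query_until timeouts in items 1 and 3 boil down to checks
one can also run by hand; a rough sketch (the ports and connection
details here are assumptions, not taken from the buildfarm
configuration):

# Ask the primary whether the standby is connected and how far it has replayed.
system('psql', '-X', '-d', 'postgres', '-c',
    "SELECT application_name, state, replay_lsn "
      . "FROM pg_catalog.pg_stat_replication");
# Ask the standby directly how far it has replayed (assumed standby port 5433).
system('psql', '-X', '-p', '5433', '-d', 'postgres', '-c',
    "SELECT pg_last_wal_replay_lsn()");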

I don't know what is causing any of these failures, and I don't know
if there's already some discussion elsewhere that I've missed, but
maybe this email will be helpful to someone. I also noticed that 2 out
of the 3 failures report 2dbb7b9b2279d064f66ce9008869fd0e2b794534 "Fix
pg_hba_file_rules for authentication method cert" as the only new
commit since the last run, and it's hardly believable that that commit
would have broken this. Nor do I see any other recent changes that
look like likely culprits. Apologies in advance if any of this is my
fault.

-- 
Robert Haas
EDB: http://www.enterprisedb.com