RE: Random pg_upgrade test failure on drongo

2024-01-16 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> 
> Thanks to both of you. I have pushed the patch.
>

I have been tracking the BF animals these days, and this failure has not been
seen anymore.
I think we can close the topic. Again, thanks for all your efforts.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Random pg_upgrade test failure on drongo

2024-01-11 Thread Amit Kapila
On Thu, Jan 11, 2024 at 8:15 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > But tomorrow it could be for other tables and if we change this
> > > TRUNCATE logic for pg_largeobject (of which chances are less) then
> > > there is always a chance that one misses changing this comment. I feel
> > > keeping it generic in this case would be better as the problem is
> > > generic but it is currently shown for pg_largeobject.
> >
> > Yes, for sure. So let's keep it generic as you prefer.
> >
> > Thank you!
>
> Thanks for working on the patch. I'm also OK with pushing Amit's fix patch.
>

Thanks to both of you. I have pushed the patch.

-- 
With Regards,
Amit Kapila.




RE: Random pg_upgrade test failure on drongo

2024-01-10 Thread Hayato Kuroda (Fujitsu)
Dear Alexander, Amit,

> > But tomorrow it could be for other tables and if we change this
> > TRUNCATE logic for pg_largeobject (of which chances are less) then
> > there is always a chance that one misses changing this comment. I feel
> > keeping it generic in this case would be better as the problem is
> > generic but it is currently shown for pg_largeobject.
> 
> Yes, for sure. So let's keep it generic as you prefer.
> 
> Thank you!

Thanks for working on the patch. I'm also OK with pushing Amit's fix patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Alexander Lakhin

10.01.2024 13:37, Amit Kapila wrote:

But tomorrow it could be for other tables and if we change this
TRUNCATE logic for pg_largeobject (of which chances are less) then
there is always a chance that one misses changing this comment. I feel
keeping it generic in this case would be better as the problem is
generic but it is currently shown for pg_largeobject.


Yes, for sure. So let's keep it generic as you prefer.

Thank you!

Best regards,
Alexander




Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Amit Kapila
On Wed, Jan 10, 2024 at 3:30 PM Alexander Lakhin  wrote:
>
> 10.01.2024 12:31, Amit Kapila wrote:
> > I am slightly hesitant to add any particular system table name in the
> > comments as this can happen for any other system table as well, so
> > slightly adjusted the comments in the attached. However, I think it is
> > okay to mention the particular system table name in the commit
> > message. Let me know what you think.
>
> Thank you, Amit!
>
> I'd like to note that the culprit is exactly pg_largeobject as coded in
> dumpDatabase():
>  /*
>   * pg_largeobject comes from the old system intact, so set its
>   * relfrozenxids, relminmxids and relfilenode.
>   */
>  if (dopt->binary_upgrade)
> ...
>  appendPQExpBufferStr(loOutQry,
>   "TRUNCATE pg_catalog.pg_largeobject;\n");
>
> I see no other TRUNCATEs (or similar logic) around, so I would specify the
> table name in the comments. Though maybe I'm missing something...
>

But tomorrow it could be for other tables and if we change this
TRUNCATE logic for pg_largeobject (of which chances are less) then
there is always a chance that one misses changing this comment. I feel
keeping it generic in this case would be better as the problem is
generic but it is currently shown for pg_largeobject.

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Alexander Lakhin

10.01.2024 12:31, Amit Kapila wrote:

I am slightly hesitant to add any particular system table name in the
comments as this can happen for any other system table as well, so
slightly adjusted the comments in the attached. However, I think it is
okay to mention the particular system table name in the commit
message. Let me know what you think.


Thank you, Amit!

I'd like to note that the culprit is exactly pg_largeobject as coded in
dumpDatabase():
    /*
 * pg_largeobject comes from the old system intact, so set its
 * relfrozenxids, relminmxids and relfilenode.
 */
    if (dopt->binary_upgrade)
...
    appendPQExpBufferStr(loOutQry,
 "TRUNCATE pg_catalog.pg_largeobject;\n");

I see no other TRUNCATEs (or similar logic) around, so I would specify the
table name in the comments. Though maybe I'm missing something...

Best regards,
Alexander




Re: Random pg_upgrade test failure on drongo

2024-01-10 Thread Amit Kapila
On Tue, Jan 9, 2024 at 4:30 PM Alexander Lakhin  wrote:
>
> 09.01.2024 13:08, Amit Kapila wrote:
> >
> >> As to checkpoint_timeout, personally I would not increase it, because it
> >> seems unbelievable to me that pg_restore (with the cluster containing only
> >> two empty databases) can run for longer than 5 minutes. I'd rather
> >> investigate such a situation separately, in case we encounter it, but maybe
> >> it's only me.
> >>
> > I feel it is okay to set a higher value of checkpoint_timeout due to
> > the same reason though the probability is less. I feel here it is
> > important to explain in the comments why we are using these settings
> > in the new test. I have thought of something like: "During the
> > upgrade, bgwriter or checkpointer could hold the file handle for some
> > removed file. Now, during restore when we try to create the file with
> > the same name, it errors out. This behavior is specific to only some
> > specific Windows versions and the probability of seeing this behavior
> > is higher in this test because we use wal_level as logical via
> > allows_streaming => 'logical' which in turn sets shared_buffers to
> > 1MB."
> >
> > Thoughts?
>
> I would describe that behavior as "During upgrade, when pg_restore performs
> CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
> handle for pg_largeobject, so a later TRUNCATE pg_largeobject command will
> fail if the OS (such as older Windows versions) doesn't remove an unlinked file
> completely while it's still open. ..."
>

I am slightly hesitant to add any particular system table name in the
comments as this can happen for any other system table as well, so
slightly adjusted the comments in the attached. However, I think it is
okay to mention the particular system table name in the commit
message. Let me know what you think.

-- 
With Regards,
Amit Kapila.


v2-0001-Fix-an-intermetant-BF-failure-in-003_logical_slot.patch
Description: Binary data


Re: Random pg_upgrade test failure on drongo

2024-01-09 Thread Alexander Lakhin

Hello Amit,

09.01.2024 13:08, Amit Kapila wrote:



As to checkpoint_timeout, personally I would not increase it, because it
seems unbelievable to me that pg_restore (with the cluster containing only
two empty databases) can run for longer than 5 minutes. I'd rather
investigate such a situation separately, in case we encounter it, but maybe
it's only me.


I feel it is okay to set a higher value of checkpoint_timeout due to
the same reason though the probability is less. I feel here it is
important to explain in the comments why we are using these settings
in the new test. I have thought of something like: "During the
upgrade, bgwriter or checkpointer could hold the file handle for some
removed file. Now, during restore when we try to create the file with
the same name, it errors out. This behavior is specific to only some
specific Windows versions and the probability of seeing this behavior
is higher in this test because we use wal_level as logical via
allows_streaming => 'logical' which in turn sets shared_buffers to
1MB."

Thoughts?


I would describe that behavior as "During upgrade, when pg_restore performs
CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
handle for pg_largeobject, so a later TRUNCATE pg_largeobject command will
fail if the OS (such as older Windows versions) doesn't remove an unlinked file
completely while it's still open. ..."

Best regards,
Alexander





Re: Random pg_upgrade test failure on drongo

2024-01-09 Thread Amit Kapila
On Tue, Jan 9, 2024 at 2:30 PM Alexander Lakhin  wrote:
>
> 09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> > Based on the suggestion by Amit, I have created a patch with the alternative
> > approach. This just does GUC settings. The reported failure is only for
> > 003_logical_slots, but the patch also includes changes for the recently 
> > added
> > test, 004_subscription. IIUC, there is a possibility that 004 would fail as 
> > well.
> >
> > Per our understanding, this patch can stop the random failures. Alexander,
> > can you test it to confirm?
> >
>
> Yes, the patch fixes the issue for me (without the patch I observe failures
> on iterations 1-2, with 10 tests running in parallel, but with the patch
> 10 iterations succeeded).
>
> But as far as I can see, 004_subscription is not affected by the issue,
> because it doesn't enable streaming for nodes new_sub, new_sub1.
> As I noted before, I could see the failure only with
> shared_buffers = 1MB (which is set with allows_streaming => 'logical').
> So I'm not sure, whether we need to modify 004 (or any other test that
> runs pg_upgrade).
>

I see your point and the probable reason for failure with
shared_buffers=1MB is that the probability of bgwriter holding the
file handle for pg_largeobject increases. So, let's change it only for
003.

> As to checkpoint_timeout, personally I would not increase it, because it
> seems unbelievable to me that pg_restore (with the cluster containing only
> two empty databases) can run for longer than 5 minutes. I'd rather
> investigate such a situation separately, in case we encounter it, but maybe
> it's only me.
>

I feel it is okay to set a higher value of checkpoint_timeout due to
the same reason though the probability is less. I feel here it is
important to explain in the comments why we are using these settings
in the new test. I have thought of something like: "During the
upgrade, bgwriter or checkpointer could hold the file handle for some
removed file. Now, during restore when we try to create the file with
the same name, it errors out. This behavior is specific to only some
specific Windows versions and the probability of seeing this behavior
is higher in this test because we use wal_level as logical via
allows_streaming => 'logical' which in turn sets shared_buffers to
1MB."

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-09 Thread Alexander Lakhin

Hello Kuroda-san,

09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:

Based on the suggestion by Amit, I have created a patch with the alternative
approach. This just does GUC settings. The reported failure is only for
003_logical_slots, but the patch also includes changes for the recently added
test, 004_subscription. IIUC, there is a possibility that 004 would fail as 
well.

Per our understanding, this patch can stop the random failures. Alexander, can you
test it to confirm?



Yes, the patch fixes the issue for me (without the patch I observe failures
on iterations 1-2, with 10 tests running in parallel, but with the patch
10 iterations succeeded).

But as far as I can see, 004_subscription is not affected by the issue,
because it doesn't enable streaming for nodes new_sub, new_sub1.
As I noted before, I could see the failure only with
shared_buffers = 1MB (which is set with allows_streaming => 'logical').
So I'm not sure, whether we need to modify 004 (or any other test that
runs pg_upgrade).

As to checkpoint_timeout, personally I would not increase it, because it
seems unbelievable to me that pg_restore (with the cluster containing only
two empty databases) can run for longer than 5 minutes. I'd rather
investigate such a situation separately, in case we encounter it, but maybe
it's only me.
On the other hand, if a checkpoint could occur for some reason within a
shorter time span, then increasing the timeout would not matter, I suppose.
(I've also tested the bgwriter_lru_maxpages-only modification of your patch
and can confirm that it works as well.)

Best regards,
Alexander




RE: Random pg_upgrade test failure on drongo

2024-01-08 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Alexander,

> > We get the effect discussed when the background writer process decides to
> > flush a file buffer for pg_largeobject during stage 1.
> > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> > the result must be the same.)
> > And another important factor is shared_buffers = 1MB (set during the test).
> > With the default setting of 128MB I couldn't see the failure.
> >
> > It can be reproduced easily (on old Windows versions) just by running
> > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> > default cluster)).
> > If an old cluster contains a dozen databases, this increases the failure
> > probability significantly (with 10 additional databases I've got failures
> > on iterations 4, 1, 6).
> >
> 
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
> 
> [1] - https://commitfest.postgresql.org/40/3951/

Based on the suggestion by Amit, I have created a patch with the alternative
approach. This just does GUC settings. The reported failure is only for
003_logical_slots, but the patch also includes changes for the recently added
test, 004_subscription. IIUC, there is a possibility that 004 would fail as 
well.

Per our understanding, this patch can stop the random failures. Alexander, can you
test it to confirm?
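
(A minimal sketch of the settings being discussed, assuming they end up in the
new cluster's postgresql.conf for 003_logical_slots; the attached patch is the
authoritative form:)

```
# Sketch only: GUCs proposed in this thread to avoid the Windows failure mode.
bgwriter_lru_maxpages = 0   # keep the bgwriter from flushing buffers (and
                            # holding file handles) while pg_upgrade runs
checkpoint_timeout = 1h     # make a timed checkpoint during the short test
                            # very unlikely
```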

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Add-GUC-settings-to-avoid-random-failures-on-Windows.patch
Description:  0001-Add-GUC-settings-to-avoid-random-failures-on-Windows.patch


Re: Random pg_upgrade test failure on drongo

2024-01-08 Thread Amit Kapila
On Mon, Jan 8, 2024 at 9:36 PM Jim Nasby  wrote:
>
> On 1/4/24 10:19 PM, Amit Kapila wrote:
> > On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin  wrote:
> >>
> >> 03.01.2024 14:42, Amit Kapila wrote:
> >>>
> >>
>  And the internal process is ... background writer (BgBufferSync()).
> 
>  So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>  got 20 x 10 tests passing.
> 
>  Thus, if we want just to get rid of the test failure, maybe it's enough 
>  to
>  add this to the test's config...
> 
> >>> What about checkpoints? Can't it do the same while writing the buffers?
> >>
> >> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
> >> the desired effect, but I think it's not impossible in principle.
> >> More details below.
> >> What happens during the pg_upgrade execution is essentially:
> >> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> >> -- this command flushes file buffers as well
> >> 2) ALTER DATABASE postgres OWNER TO ...
> >> 3) COMMENT ON DATABASE "postgres" IS ...
> >> 4) -- For binary upgrade, preserve pg_largeobject and index 
> >> relfilenodes
> >>   SELECT 
> >> pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
> >>   SELECT 
> >> pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
> >>   TRUNCATE pg_catalog.pg_largeobject;
> >> --  ^^^ here we can get the error "could not create file "base/5/2683": 
> >> File exists"
> >> ...
> >>
> >> We get the effect discussed when the background writer process decides to
> >> flush a file buffer for pg_largeobject during stage 1.
> >> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> >> the result must be the same.)
> >> And another important factor is shared_buffers = 1MB (set during the test).
> >> With the default setting of 128MB I couldn't see the failure.
> >>
> >> It can be reproduced easily (on old Windows versions) just by running
> >> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> >> default cluster)).
> >> If an old cluster contains a dozen databases, this increases the failure
> >> probability significantly (with 10 additional databases I've got failures
> >> on iterations 4, 1, 6).
> >>
> >
> > I don't have an old Windows environment to test but I agree with your
> > analysis and theory. The question is what should we do for these new
> > random BF failures? I think we should set bgwriter_lru_maxpages to 0
> > and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> > fix as part of this doesn't sound reasonable because this is an
> > existing problem and there seems to be another patch by Thomas that
> > probably deals with the root cause of the existing problem [1] as
> > pointed out by you.
> >
> > [1] - https://commitfest.postgresql.org/40/3951/
>
> Isn't this just sweeping the problem (non-POSIX behavior on SMB and
> ReFS) under the carpet? I realize that synthetic test workloads like
> pg_upgrade in a loop aren't themselves real-world scenarios, but what
> about other cases? Even if we're certain it's not possible for these
> issues to wedge a server, it's still not a good experience for users to
> get random, unexplained IO-related errors...
>

The point is that this is an existing known Windows behavior and that
too only in certain versions. The fix doesn't seem to be
straightforward, so it seems advisable to avoid random BF failures by
having an appropriate configuration.

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-08 Thread Jim Nasby

On 1/4/24 10:19 PM, Amit Kapila wrote:

On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin  wrote:


03.01.2024 14:42, Amit Kapila wrote:





And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...


What about checkpoints? Can't it do the same while writing the buffers?


As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
  SELECT 
pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
  SELECT 
pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
  TRUNCATE pg_catalog.pg_largeobject;
--  ^^^ here we can get the error "could not create file "base/5/2683": File 
exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).



I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/


Isn't this just sweeping the problem (non-POSIX behavior on SMB and 
ReFS) under the carpet? I realize that synthetic test workloads like 
pg_upgrade in a loop aren't themselves real-world scenarios, but what 
about other cases? Even if we're certain it's not possible for these 
issues to wedge a server, it's still not a good experience for users to 
get random, unexplained IO-related errors...

--
Jim Nasby, Data Architect, Austin TX





Re: pg_upgrade test failure

2024-01-04 Thread vignesh C
On Sun, 29 Oct 2023 at 11:14, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Andres,
>
> While tracking BF failures related to pg_upgrade, I found that the same failure
> has still been happening [1] - [4].
> According to the log, the output directory remained even after a successful
> upgrade [5].
> I analyzed it and attached a fix patch; below is my analysis... what do you
> think?
>

The same failure occurs randomly at [1] for a newly added test too:
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log":
Directory not empty
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796":
Directory not empty
pg_upgrade: warning: could not stat file
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log/pg_upgrade_internal.log":
No such file or directory
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log":
Directory not empty
pg_upgrade: warning: could not remove directory
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796":
Directory not empty

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-01-04%2019%3A56%3A20

Regards,
Vignesh




Re: Random pg_upgrade test failure on drongo

2024-01-04 Thread Amit Kapila
On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin  wrote:
>
> 03.01.2024 14:42, Amit Kapila wrote:
> >
>
> >> And the internal process is ... background writer (BgBufferSync()).
> >>
> >> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> >> got 20 x 10 tests passing.
> >>
> >> Thus, if we want just to get rid of the test failure, maybe it's enough to
> >> add this to the test's config...
> >>
> > What about checkpoints? Can't it do the same while writing the buffers?
>
> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
> the desired effect, but I think it's not impossible in principle.
> More details below.
> What happens during the pg_upgrade execution is essentially:
> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> -- this command flushes file buffers as well
> 2) ALTER DATABASE postgres OWNER TO ...
> 3) COMMENT ON DATABASE "postgres" IS ...
> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>  SELECT 
> pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>  SELECT 
> pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>  TRUNCATE pg_catalog.pg_largeobject;
> --  ^^^ here we can get the error "could not create file "base/5/2683": File 
> exists"
> ...
>
> We get the effect discussed when the background writer process decides to
> flush a file buffer for pg_largeobject during stage 1.
> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> the result must be the same.)
> And another important factor is shared_buffers = 1MB (set during the test).
> With the default setting of 128MB I couldn't see the failure.
>
> It can be reproduced easily (on old Windows versions) just by running
> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> default cluster)).
> If an old cluster contains a dozen databases, this increases the failure
> probability significantly (with 10 additional databases I've got failures
> on iterations 4, 1, 6).
>

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-04 Thread Alexander Lakhin

Hello Amit,

03.01.2024 14:42, Amit Kapila wrote:



So I started to think about another approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.


There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."


Yes, I saw it, but initially I thought that we have a transient condition
there, so waiting in open() (instead of failing immediately) seemed like a
good idea then...


And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...


What about checkpoints? Can't it do the same while writing the buffers?


As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
    SELECT 
pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
    SELECT 
pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
    TRUNCATE pg_catalog.pg_largeobject;
--  ^^^ here we can get the error "could not create file "base/5/2683": File 
exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

Please see the reproducing script attached.

Best regards,
Alexander

@echo off
REM define PGBIN and PATH
set PGBIN=%cd%\tmp_install\usr\local\pgsql\bin
set PATH=%PGBIN%;%PATH%

setlocal enabledelayedexpansion

rmdir /S /Q tmpdb
initdb -D tmpdb >initdb.log 2>&1
echo shared_buffers = 1MB>>tmpdb\postgresql.conf

set /a i = 0
:LOOP
set /a i+=1
echo ITERATION %i%

rmdir /S /Q tmpdb_old
xcopy /e /i /q /r tmpdb tmpdb_old\

pg_ctl -D tmpdb_old -l server.log start
echo SELECT concat('CREATE DATABASE db', g) FROM generate_series(1, 10) g \gexec | psql -d postgres
pg_ctl -D tmpdb_old stop

rmdir /S /Q tmpdb_new
xcopy /e /i /q /r tmpdb tmpdb_new\

pg_upgrade --no-sync -r -b "%PGBIN%" -B "%PGBIN%" -d tmpdb_old -D tmpdb_new
if %ERRORLEVEL% NEQ 0 goto ERR
goto LOOP

:ERR
echo ERROR
pause

:EXIT


Re: Random pg_upgrade test failure on drongo

2024-01-03 Thread Amit Kapila
On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin  wrote:
>
> 28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> > Dear Alexander,
> >
> >> I agree with your analysis and would like to propose a PoC fix (see
> >> attached). With this patch applied, 20 iterations succeeded for me.
> > There are no reviewers, so I will review it again. Let's move the PoC
> > to a concrete patch. Note that I only focused on fixing the random failure;
> > other parts are out of scope.
>
> Thinking about that fix more, I'm not satisfied with the approach proposed.
> First, it turns every unlink operation into two write operations
> (rename + unlink), not to mention new risks of having stale .tmp files
> (perhaps, it's ok for regular files (MoveFileEx can overwrite existing
> files), but not for directories)
> Second, it does that on any Windows OS versions, including modern ones,
> which are not affected by the issue, as we know.
>
> So I started to think about another approach: to perform unlink as it's
> implemented now, but then wait until the DELETE_PENDING state is gone.
>

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."

> And I was very surprised to see that this state is not transient in our case.
> Additional investigation showed that the test fails not because some aside
> process opens a file (concretely, {template1_id/postgres_id}/2683), that is
> being deleted, but because of an internal process that opens the file and
> holds a handle to it indefinitely.
> And the internal process is ... background writer (BgBufferSync()).
>
> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> got 20 x 10 tests passing.
>
> Thus, if we want just to get rid of the test failure, maybe it's enough to
> add this to the test's config...
>

What about checkpoints? Can't it do the same while writing the buffers?

-- 
With Regards,
Amit Kapila.




Re: Random pg_upgrade test failure on drongo

2024-01-01 Thread Alexander Lakhin

Hello Kuroda-san,

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:

Dear Alexander,


I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

There are no reviewers, so I will review it again. Let's move the PoC
to a concrete patch. Note that I only focused on fixing the random failure;
other parts are out of scope.


Thinking about that fix more, I'm not satisfied with the approach proposed.
First, it turns every unlink operation into two write operations
(rename + unlink), not to mention new risks of having stale .tmp files
(perhaps, it's ok for regular files (MoveFileEx can overwrite existing
files), but not for directories)
Second, it does that on any Windows OS versions, including modern ones,
which are not affected by the issue, as we know.

So I started to think about another approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.
And I was very surprised to see that this state is not transient in our case.
Additional investigation showed that the test fails not because some aside
process opens a file (concretely, {template1_id/postgres_id}/2683), that is
being deleted, but because of an internal process that opens the file and
holds a handle to it indefinitely.
And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

The other way to go is to find out whether the background writer process
should react to a shared-inval message sent from smgrdounlinkall(), and
close that file's handle,

Maybe we could also (after changing the bgwriter's behaviour) add a waiting
loop into pgwin32_open_handle() to completely rule out transient open()
failures due to some other process (such as Windows Explorer) opening a file
being deleted, but I would not complicate things until we have a clear
vision/plans of using modern APIs/relying on modern OS versions' behavior.
I mean proceeding with something like:
https://commitfest.postgresql.org/40/3951/
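
(For illustration, a rough sketch of such a waiting loop, assuming the
delete-pending state surfaces as ERROR_ACCESS_DENIED; the function name and
limits are made up and this is not the actual pgwin32_open_handle() code:)

```
#include <windows.h>

/*
 * Sketch only: retry CreateFile() for up to ~10 seconds while the failure
 * looks like a not-yet-finished (delete-pending) removal, which shows up as
 * ERROR_ACCESS_DENIED on the affected Windows builds.
 */
static HANDLE
open_retrying_delete_pending(const char *path, DWORD access, DWORD disposition)
{
	int			loops = 0;

	for (;;)
	{
		HANDLE		h = CreateFileA(path, access,
									FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
									NULL, disposition,
									FILE_ATTRIBUTE_NORMAL, NULL);

		if (h != INVALID_HANDLE_VALUE)
			return h;
		if (GetLastError() != ERROR_ACCESS_DENIED || ++loops > 100)
			return INVALID_HANDLE_VALUE;	/* give up; caller reports the error */
		Sleep(100);				/* ms */
	}
}
```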

Best regards,
Alexander




RE: Random pg_upgrade test failure on drongo

2023-12-27 Thread Hayato Kuroda (Fujitsu)
Dear Alexander, 

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

There are no reviewers, so I will review it again. Let's move the PoC
to a concrete patch. Note that I only focused on fixing the random failure;
other parts are out of scope.

Basically, code comments can be updated accordingly.

01.

```
/*
 * This function might be called for a regular file or for a junction
 * point (which we use to emulate symlinks).  The latter must be unlinked
 * with rmdir() on Windows.  Before we worry about any of that, let's see
 * if we can unlink directly, since that's expected to be the most common
 * case.
 */
snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
if (pgrename(path, tmppath) == 0)
{
    if (unlink(tmppath) == 0)
        return 0;
    curpath = tmppath;
}
```

You can update the comments atop the changes because they are not trivial.
Below is my draft:

```
 * XXX: we rename the target file to ".tmp" before calling unlink. The
 * removal may fail with STATUS_DELETE_PENDING status on Windows, so
 * creating the same file would fail. This assumes that renaming is a
 * synchronous operation.
```

02.

```
loops = 0;
while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
{
    if (++loops > 100)  /* time out after 10 sec */
        return -1;
    pg_usleep(100000);  /* us */
}
```

A comment can be added atop this part. Below is my draft.

```
/*
 * Wait until the removal is really finished to avoid ERRORs when creating
 * the same file in other functions.
 */
```


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Random pg_upgrade test failure on drongo

2023-11-30 Thread Alexander Lakhin

Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:

It's also interesting, what is full version/build of OS on drongo and
fairywren.



It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122



I've updated it to 17763.5122 now.



Thank you for the information! It pushed me to upgrade my Server
2019 1809 17763.592 to 17763.4252. And then I discovered that I had
difficulties reproducing the issue on all my VMs after a reboot (even
on old versions/builds). It took me a while to understand what's going on
and what affects reproduction of the issue.
I was puzzled by the fact that I couldn't reproduce the issue with my
unlink-open test program under seemingly the same conditions as before,
until I realized that the issue is reproduced only when the target directory
is opened in Windows Explorer.

Now I'm sorry for bringing more mystery into the topic and for misleading
information.

So, the issue is reproduced only when something scans the working directory
for files or opens them. I added the same logic to my test program (see
unlink-open-scandir attached) and now I see the failure on Windows Server
2019 (Version 10.0.17763.4252).
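
(For illustration, an assumed sketch of the kind of scanning logic described;
it enumerates the current directory and briefly opens each regular file,
roughly what a directory browser does:)

```
#include <windows.h>

/* Sketch only: imitate Explorer-like scanning of the current directory. */
static void
scan_and_touch_current_dir(void)
{
	WIN32_FIND_DATAA fd;
	HANDLE		find = FindFirstFileA("*", &fd);

	if (find == INVALID_HANDLE_VALUE)
		return;
	do
	{
		if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
		{
			/* briefly open the file with full sharing, then close it */
			HANDLE		h = CreateFileA(fd.cFileName, GENERIC_READ,
										FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
										NULL, OPEN_EXISTING,
										FILE_ATTRIBUTE_NORMAL, NULL);

			if (h != INVALID_HANDLE_VALUE)
				CloseHandle(h);
		}
	} while (FindNextFileA(find, &fd));
	FindClose(find);
}
```
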
A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
-- LOG1
-- LOG10
fopen() after unlink() failed (13)
-- LOG2
fopen() after unlink() failed (13)
-- LOG3
fopen() after unlink() failed (13)
-- LOG4
fopen() after unlink() failed (13)
-- LOG5
fopen() after unlink() failed (13)
-- LOG6
fopen() after unlink() failed (13)
-- LOG7
fopen() after unlink() failed (13)
-- LOG8
fopen() after unlink() failed (13)
-- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really took place, but my previous estimations were
incorrect (my apologies).

BTW, "rename" mode of the test program can produce more rare errors on
rename:
-- LOG3
MoveFileEx() failed (0)

but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:


Thanks for your interest in the issue. I have been tracking the failure, but it
has not occurred.
Your analysis suggests that the BF failures can be solved by updating the OSes.


Yes, but I don't think that leaving Server 2019 behind (I suppose Windows
Server 2019 build 20348 would have the same behaviour as Windows 10 19045)
is affordable. (Though looking at Cirrus CI logs, I see that what is
entitled "Windows Server 2019" in fact is Windows Server 2022 there.)


I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.
  
OK. Actually I could not find any description of this behavior, but your experiment showed the facts.


I don't know what this peculiarity is called, but it looks like when some
other process captures the file handle, unlink() exits as if the file was
deleted completely, but the subsequent open() fails.
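
(A minimal Win32-level sketch of that behavior, assuming the pre-POSIX delete
semantics of the affected builds; names and error handling are illustrative:)

```
#include <stdio.h>
#include <windows.h>

int
main(void)
{
	const char *path = "delete_pending_demo.tmp";
	HANDLE		holder;
	HANDLE		recreate;

	/* The "other process": keep a handle open with full sharing. */
	holder = CreateFileA(path, GENERIC_WRITE,
						 FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
						 NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
	if (holder == INVALID_HANDLE_VALUE)
		return 1;

	/* unlink(): succeeds, but the file only enters DELETE_PENDING. */
	if (!DeleteFileA(path))
	{
		CloseHandle(holder);
		return 1;
	}

	/* open() of the same name: fails on the affected builds. */
	recreate = CreateFileA(path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
						   FILE_ATTRIBUTE_NORMAL, NULL);
	if (recreate == INVALID_HANDLE_VALUE)
		printf("re-create failed, error %lu\n", (unsigned long) GetLastError());
	else
	{
		printf("re-create succeeded\n");	/* expected on newer builds */
		CloseHandle(recreate);
	}

	CloseHandle(holder);		/* only now does the deletion really complete */
	return 0;
}
```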

Best regards,
Alexander

#include <stdio.h>
#include <errno.h>
#include <windows.h>

char buf[1048576] = {0};

int main(int argc, char *argv[])
{
	int nmb;
	int res;
	int num_iterations;
	char *filename, *curfilename;
	char tmpfilename[MAX_PATH];
	FILE *fh;
	char do_rename;
	WIN32_FIND_DATA FindFileData;

	if (argc < 4)
		{ printf("Usage: %s {filename} {size_mb} {num_iterations}", argv[0]); return 1; }

	filename = argv[1];

	nmb = atoi(argv[2]);
	if (nmb < 1) nmb = 1;

	num_iterations = atoi(argv[3]);
	if (num_iterations < 1) num_iterations = 1;

	do_rename = (strcmp(argv[argc - 1], "rename") == 0);
	
	for (int i = 1; i <= num_iterations; i++) {
		printf("iteration %d\n", i);

		fh = fopen(filename, "wb");
		if (fh == NULL)
			{ printf("first fopen() failed (%d)\n", errno); return 1; }

		for (int n = 0; n <= nmb; n++) {
			if (fwrite(buf, 1, sizeof(buf), fh) != sizeof(buf))
{ printf("fwrite() failed (%d)\n", errno); return 1; }
		}

		res = fclose(fh);
		if (res < 0)
			{ printf("first fclose() failed (%d)\n", errno); return 1; }

		curfilename = filename;
		if (do_rename)
		{
			sprintf(tmpfilename, "%s.tmp", filename);
			res = MoveFileEx(filename, tmpfilename, MOVEFILE_REPLACE_EXISTING); // from dirmod.c
			if (res == 0)
{ printf("MoveFileEx() failed (%d)\n", errno); return 1; }
			curfilename = tmpfilename;
		}
			
		res = unlink(curfilename);
		if (res < 0)
			{ printf("first unlink() failed (%d)\n", errno); return 1; }

		fh = fopen(filename, "w");
		if (fh == NULL)
			{ printf("fopen() after unlink() failed (%d)\n", errno); return 2; }

		res = fclose(fh);
		if (res < 0)
			{ printf("fclose() failed (%d)\n", errno); return 1; }

		res = unlink(filename);
		if (res < 0)
			{ printf("unlink() failed (%d)\n", errno); return 1; }

		

RE: Random pg_upgrade test failure on drongo

2023-11-30 Thread Hayato Kuroda (Fujitsu)
Dear Alexander, Andrew,
 
Thanks for your analysis!
 
> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure occurs depending not on Windows edition, but
> rather on its build. For Windows Server 2019 the "good" build is
> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
> 14393.0 and 17763.1.
> (Maybe there was some change related to
> FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It's also interesting, what is full version/build of OS on drongo and
> fairywren.
 
Thanks for your interest in the issue. I have been tracking the failure, but it
has not occurred.
Your analysis suggests that the BF failures can be solved by updating the OSes.
 
> I think that's because unlink() is performed asynchronously on those old
> Windows versions, but rename() is always synchronous.
 
OK. Actually I could not find any description of this behavior, but your
experiment showed the facts.
 
> I've managed to reproduce that issue (or at least a situation that
> manifested similarly) with a sleep added in miscinit.c:
>  ereport(IsPostmasterEnvironment ? LOG : NOTICE,
> (errmsg("database system is shut down")));
> +   pg_usleep(50L);
>
> With this change, I get the same warning as in [1] when running in
> parallel 10 tests 002_pg_upgrade with a minimal olddump (on iterations
> 33, 46, 8). And with my PoC patch applied, I could see the same warning
> as well (on iteration 6).
>
> I believe that's because rename() can't rename a directory containing an
> open file, just as unlink() can't remove it.
>
> In the light of the above, I think that the issue in question should be
> fixed in accordance with/as a supplement to [2].
 
OK, I understand that more needs to be fixed around here. For now, we should stay
focused on this point.
 
Your patch seems good, but it needs more eyes from Windows-friendly developers.
What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Random pg_upgrade test failure on drongo

2023-11-27 Thread Andrew Dunstan



On 2023-11-27 Mo 07:39, Andrew Dunstan wrote:


On 2023-11-27 Mo 07:00, Alexander Lakhin wrote:

Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure occurred on Windows
Server, right?


I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on Windows edition, 
but

rather on its build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to 
FILE_DISPOSITION_POSIX_SEMANTICS/

FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It's also interesting, what is full version/build of OS on drongo and
fairywren.




It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122





I've updated it to 17763.5122 now.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Random pg_upgrade test failure on drongo

2023-11-27 Thread Andrew Dunstan



On 2023-11-27 Mo 07:00, Alexander Lakhin wrote:

Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure occurred on Windows
Server, right?


I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on Windows edition, but
rather on its build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It's also interesting, what is full version/build of OS on drongo and
fairywren.




It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Random pg_upgrade test failure on drongo

2023-11-27 Thread Alexander Lakhin

Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure occurred on Windows Server,
right?


I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on Windows edition, but
rather on its build. For Windows Server 2019 the "good" build is
somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
14393.0 and 17763.1.
(Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/
FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
information about that change.)

It's also interesting, what is full version/build of OS on drongo and
fairywren.


That is, my idea was to try removing a file through renaming it as a fast
path (thus avoiding that troublesome state DELETE PENDING), and if that
fails, to perform removal as before. Maybe the whole function could be
simplified, but I'm not sure about special cases yet.

I felt that your result showed pgrename() would be more rarely delayed than 
unlink().
That's why a file with the original name would not exist when the subsequent
open() was called.


I think that's because unlink() is performed asynchronously on those old
Windows versions, but rename() is always synchronous.


* IIUC, the important point is the latter part, which waits until the status is
changed. Based on that, can we remove the double rmtree() from
cleanup_output_dirs()?
They seem to have been added for a similar motivation.

I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
was observed in [1]), with c28911750 reverted, so I need more time to
research that issue to answer this question.

Yeah, in the first place, this failure seldom occurred


I've managed to reproduce that issue (or at least a situation that
manifested similarly) with a sleep added in miscinit.c:
    ereport(IsPostmasterEnvironment ? LOG : NOTICE,
    (errmsg("database system is shut down")));
+   pg_usleep(50L);

With this change, I get the same warning as in [1] when running in
parallel 10 tests 002_pg_upgrade with a minimal olddump (on iterations
33, 46, 8). And with my PoC patch applied, I could see the same warning
as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an
open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be
fixed in accordance with/as a supplement to [2].

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com
[2] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com

Best regards,
Alexander




RE: Random pg_upgrade test failure on drongo

2023-11-25 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> 
> Please look at the simple test program attached. It demonstrates the
> failure for me when running in two sessions as follows:
> unlink-open test 150 1000
>  unlink-open test2 150 1000

Thanks for attaching a program. This helps us to understand the issue.
I wanted to confirm your env - this failure occurred on Windows Server,
right?

> 
> That is, my idea was to try removing a file through renaming it as a fast
> path (thus avoiding that troublesome state DELETE PENDING), and if that
> fails, to perform removal as before. Maybe the whole function could be
> simplified, but I'm not sure about special cases yet.

I felt that your result showed pgrename() would be more rarely delayed than 
unlink().
That's why a file with the original name would not exist when the subsequent
open() was called.

About special cases, I wanted seniors to check.

> > * IIUC, the important point is the latter part, which waits until the
> >   status is changed. Based on that, can we remove the double rmtree() from
> >   cleanup_output_dirs()?
> >   They seem to have been added for a similar motivation.
> 
> I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
> was observed in [1]), with c28911750 reverted, so I need more time to
> research that issue to answer this question.

Yeah, in the first place, this failure seldom occurred

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Random pg_upgrade test failure on drongo

2023-11-24 Thread Alexander Lakhin

Hello Kuroda-san,

23.11.2023 15:15, Hayato Kuroda (Fujitsu) wrote:



I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

Thanks, here are my comments. I'm not quite sure about Windows, so I may say
something wrong.

* I'm not sure why the file/directory name is changed before doing an unlink.
   Could you add a description?


Please look at the simple test program attached. It demonstrates the
failure for me when running in two sessions as follows:
unlink-open test 150 1000
                unlink-open test2 150 1000
...
iteration 60
iteration 61
fopen() after unlink() failed (13)

Process Monitor shows:
...
        9:16:55.9249792 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 138,412,032, 
Length: 1,048,576

### unlink() performed for the file "test":
9:16:55.9852903 AM    unlink-open.exe    4968    CreateFile C:\src\test    SUCCESS    Desired Access: Read Attributes, 
Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, 
Delete, AllocationSize: n/a, OpenResult: Opened
9:16:55.9853637 AM    unlink-open.exe    4968 QueryAttributeTagFile    C:\src\test    SUCCESS    Attributes: A, 
ReparseTag: 0x0

### file "test" gets into state DELETE PENDING:
9:16:55.9853756 AM    unlink-open.exe    4968 SetDispositionInformationFile    
C:\src\test    SUCCESS    Delete: True
9:16:55.9853888 AM    unlink-open.exe    4968    CloseFile C:\src\test    
SUCCESS
### concurrent operations with file "test2":
        9:16:55.9866362 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 139,460,608, 
Length: 1,048,576

...
        9:16:55.9972373 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 157,286,400, 
Length: 1,048,576

        9:16:55.9979040 AM    unlink-open.exe    3232 CloseFile    C:\src\test2 
   SUCCESS
### unlink() for "test2":
        9:16:56.1029981 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Read 
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: 
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
        9:16:56.1030432 AM    unlink-open.exe    3232 QueryAttributeTagFile    C:\src\test2    SUCCESS    Attributes: 
A, ReparseTag: 0x0

### file "test2" gets into state DELETE PENDING:
        9:16:56.1030517 AM    unlink-open.exe    3232 SetDispositionInformationFile    C:\src\test2    SUCCESS    
Delete: True

        9:16:56.1030625 AM    unlink-open.exe    3232 CloseFile    C:\src\test2 
   SUCCESS
### and then it opened successfully:
        9:16:56.1189503 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Generic 
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, 
ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created

        9:16:56.1192016 AM    unlink-open.exe    3232 CloseFile    C:\src\test2 
   SUCCESS
### operations with file "test2" continued:
        9:16:56.1193394 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Read 
Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: 
Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
        9:16:56.1193895 AM    unlink-open.exe    3232 QueryAttributeTagFile    C:\src\test2    SUCCESS    Attributes: 
A, ReparseTag: 0x0
        9:16:56.1194042 AM    unlink-open.exe    3232 SetDispositionInformationFile    C:\src\test2    SUCCESS    
Delete: True

        9:16:56.1194188 AM    unlink-open.exe    3232 CloseFile    C:\src\test2 
   SUCCESS
        9:16:56.1198459 AM    unlink-open.exe    3232 CreateFile    C:\src\test2    SUCCESS    Desired Access: Generic 
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, 
ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
        9:16:56.1200302 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 0, Length: 
1,048,576, Priority: Normal

...
        9:16:56.1275871 AM    unlink-open.exe    3232 WriteFile    C:\src\test2    SUCCESS    Offset: 10,485,760, 
Length: 1,048,576


### at the same time, CreateFile() for file "test" failed:
9:16:56.1276453 AM    unlink-open.exe    4968    CreateFile C:\src\test    DELETE PENDING    Desired Access: Generic 
Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, 
ShareMode: Read, Write, AllocationSize: 0

9:16:56.1279359 AM    unlink-open.exe    3232    WriteFile C:\src\test2    
SUCCESS    Offset: 11,534,336, Length: 1,048,576
9:16:56.1283452 AM    unlink-open.exe    3232    WriteFile C:\src\test2    
SUCCESS    Offset: 12,582,912, Length: 1,048,576

RE: Random pg_upgrade test failure on drongo

2023-11-23 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> 
> I can easily reproduce this failure on my workstation by running 5 tests
> 003_logical_slots in parallel inside Windows VM with it's CPU resources
> limited to 50%, like so:
> VBoxManage controlvm "Windows" cpuexecutioncap 50
> 
> set PGCTLTIMEOUT=180
> python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in
> range(1,NUMTESTS+1):
> tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in
> range(1,NUMITERATIONS+1):print(f\"iteration {i}\");
> assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
> ...
> iteration 2
> ninja: Entering directory `C:\src\postgresql\build'
> ninja: no work to do.
> 1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots
> ERROR60.30s   exit status 25
> ...
> pg_restore: error: could not execute query: ERROR:  could not create file
> "base/1/2683": File exists
> ...

Great. I do not have such an environment, so I could not find it. This seems to
suggest that the failure occurred because the system was busy.

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

Thanks, here are my comments. I'm not quite sure about Windows, so I may say
something wrong.

* I'm not sure why the file/directory name is changed before doing an unlink.
  Could you add a description?
* IIUC, the important point is the latter part, which waits until the status is
  changed. Based on that, can we remove the double rmtree() from
  cleanup_output_dirs()?
  They seem to have been added for a similar motivation.

```
+   loops = 0;
+   while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+   {
+   if (++loops > 100)  /* time out after 10 sec */
+   return -1;
+   pg_usleep(100000);  /* us */
+   }
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Random pg_upgrade test failure on drongo

2023-11-23 Thread Alexander Lakhin

Hello Kuroda-san,

21.11.2023 13:37, Hayato Kuroda (Fujitsu) wrote:

This email is just an update. The machine drongo failed the test a week ago [1]
and I finally got the log files. PSA files.

Oh, sorry. I forgot to attach the files. You can see pg_upgrade_server.log for now.



I can easily reproduce this failure on my workstation by running 5 tests
003_logical_slots in parallel inside a Windows VM with its CPU resources
limited to 50%, like so:
VBoxManage controlvm "Windows" cpuexecutioncap 50

set PGCTLTIMEOUT=180
python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in range(1,NUMTESTS+1): 
tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in range(1,NUMITERATIONS+1):print(f\"iteration {i}\"); 
assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"

...
iteration 2
ninja: Entering directory `C:\src\postgresql\build'
ninja: no work to do.
1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots ERROR    
60.30s   exit status 25
...
pg_restore: error: could not execute query: ERROR:  could not create file 
"base/1/2683": File exists
...

I agree with your analysis and would like to propose a PoC fix (see
attached). With this patch applied, 20 iterations succeeded for me.

Best regards,
Alexander

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 07dd190cbc..5185648388 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -121,6 +121,8 @@ pgunlink(const char *path)
 	bool		is_lnk;
 	int			loops = 0;
 	struct stat st;
+	char		tmppath[MAX_PATH];
+	char	   *curpath = path;
 
 	/*
 	 * This function might be called for a regular file or for a junction
@@ -129,9 +131,14 @@ pgunlink(const char *path)
 	 * if we can unlink directly, since that's expected to be the most common
 	 * case.
 	 */
-	if (unlink(path) == 0)
-		return 0;
-	if (errno != EACCES)
+	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
+	if (pgrename(path, tmppath) == 0)
+	{
+		if (unlink(tmppath) == 0)
+			return 0;
+		curpath = tmppath;
+	}
+	else if (errno != EACCES)
 		return -1;
 
 	/*
@@ -150,7 +157,7 @@ pgunlink(const char *path)
 	 * fail. We want to wait until the file truly goes away so that simple
 	 * recursive directory unlink algorithms work.
 	 */
-	if (lstat(path, &st) < 0)
+	if (lstat(curpath, &st) < 0)
 	{
 		if (lstat_error_was_status_delete_pending())
 			is_lnk = false;
@@ -167,7 +174,7 @@ pgunlink(const char *path)
 	 * someone else to close the file, as the caller might be holding locks
 	 * and blocking other backends.
 	 */
-	while ((is_lnk ? rmdir(path) : unlink(path)) < 0)
+	while ((is_lnk ? rmdir(curpath) : unlink(curpath)) < 0)
 	{
 		if (errno != EACCES)
 			return -1;
@@ -175,6 +182,14 @@ pgunlink(const char *path)
 			return -1;
 		pg_usleep(100000);		/* us */
 	}
+
+	loops = 0;
+	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+	{
+		if (++loops > 100)		/* time out after 10 sec */
+			return -1;
+		pg_usleep(100000);		/* us */
+	}
 	return 0;
 }
 


RE: Random pg_upgrade test failure on drongo

2023-11-21 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

This email is just an update. The machine drongo failed the test a week ago [1]
and I finally got the log files. PSA files.

## Observed failure

pg_upgrade_server.log is the server log taken during the pg_upgrade command.
According to it, the TRUNCATE command seems to have failed due to a "File exists"
error.

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:  could not 
create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:  
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT 
pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT 
pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think it was caused by the STATUS_DELETE_PENDING problem, and is not related to
the recent updates for pg_upgrade.

The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and 
the
output meant that file creation was failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the file that has
the same name. Below is the call trace down to that unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found that Thomas said [2] that pgunlink() sometimes cannot actually remove the
file even though it returns OK; at that point the NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle() mentions the same
thing:

```
/*
 * ERROR_ACCESS_DENIED is returned if the file is deleted but 
not yet
 * gone (Windows NT status code is STATUS_DELETE_PENDING).  In 
that
 * case, we'd better ask for the NT status too so we can 
translate it
 * to a more Unix-like error.  We hope that nothing clobbers 
the NT
 * status in between the internal NtCreateFile() call and 
CreateFile()
 * returning.
 *
```

The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that, open()
can indeed fail with STATUS_DELETE_PENDING if a deletion of the file is still
pending when we try to open it.

Another thread [4] also hits this issue while doing rmtree()->unlink(), and it
retries the removal if it fails with STATUS_DELETE_PENDING. So, should we likewise
retry the open() when it fails this way? Anyway, such a fix seems out of scope for
pg_upgrade.

What do you think? Do you have any thoughts about it?
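
Just to illustrate the shape of that retry, here is a hypothetical sketch (not a
concrete proposal). Since our open() wrapper maps STATUS_DELETE_PENDING to EEXIST
(with O_CREAT) or ENOENT (without), a caller-side loop can only key on those two
errnos and so cannot cheaply tell a pending deletion apart from a file that
genuinely exists or is missing; pg_usleep() is the src/port function.

```
static int
open_retrying_delete_pending(const char *path, int flags, mode_t mode)
{
	for (int loops = 0; loops < 100; loops++)	/* give up after ~10 seconds */
	{
		int			fd = open(path, flags, mode);

		/* succeed, or fail for a reason other than a possibly-pending delete */
		if (fd >= 0 || (errno != EEXIST && errno != ENOENT))
			return fd;
		pg_usleep(100000);		/* wait 100 ms and try the open again */
	}
	return -1;					/* still blocked, or the file really exists */
}
```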

## Acknowledgement

I want to thank Sholk and Vingesh for helping with the analysis.

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15
[2]: 
https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[3]: 
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]: 
https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





RE: Random pg_upgrade test failure on drongo

2023-11-15 Thread Hayato Kuroda (Fujitsu)
Dear hackers,
 
> While tracking the buildfarm, I found that drongo failed the test
> pg_upgrade/003_logical_slots [1].
> A strange point is that the test passed in the next iteration. Currently I'm not
> sure of the reason, but I will keep an eye on it and investigate if it
> happens again.
 
This email is just an update. We found that fairywren also failed for the same
reason [2]. It fails inconsistently, but there might be something wrong on Windows.
I'm now trying to reproduce it with my colleagues to analyze it in more detail.
Also, I am working with Andrew to get the logs emitted during the upgrade.
I will continue to keep an eye on it.
 
[2]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-08%2010%3A22%3A45

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Random pg_upgrade test failure on drongo

2023-11-08 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

While tracking the buildfarm, I found that drongo failed the test
pg_upgrade/003_logical_slots [1].
A strange point is that the test passed in the next iteration. Currently I'm not
sure of the reason, but I will keep an eye on it and investigate if it
happens again.

I think this failure is not related to our logical slots work, even though it
happened in 003_logical_slots.pl. For more detail, please see the latter part of
this mail.

For further investigation, a server log taken during the upgrade may be needed. It
will be in the data directory, so the BF system will not upload it. I may need
additional information if it fails again.

# Analysis of failure

According to the output, pg_upgrade seems to have failed while restoring objects
to the new cluster [2].

At the code level, the pg_upgrade command failed in exec_prog().
In that function, pg_restore was executed for the database "postgres".
Below is a brief call stack. Note that pg_restore is not used for migrating
logical replication slots; that is done by the pg_upgrade binary itself. Also, the
migration is done after all objects are copied, not in create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- Since -j option is not specified, it 
is just a wrapper
create_new_objects()
main()
```

In exec_prog(), the system() call was made but returned a non-zero value.
The documentation says that system() returns the value returned by the command
interpreter when the input is not NULL [3]. Unfortunately, the current code does
not output that return code.  Also, the BF system does not upload the data
directory for failed tests. Therefore, I could not get more information for the
investigation.
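
For illustration only (the real exec_prog() builds and logs the command
differently), reporting that raw value could look like the following hypothetical
helper:

```
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static bool
run_command_logging_status(const char *cmd)
{
	int			rc = system(cmd);

	if (rc != 0)
		fprintf(stderr, "command \"%s\" failed with return value %d\n", cmd, rc);
	return rc == 0;
}
```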

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2023-11-07%2013%3A43%3A23&stg=pg_upgrade-check
[2]:
```
...
# No postmaster PID for node "oldpub"
# Running: pg_upgrade --no-sync -d 
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata
 -D 
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata
 -b C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B 
C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -s 
127.0.0.1 -p 54813 -P 54814 --copy
Performing Consistency Checks
...
Setting frozenxid and minmxid counters in new cluster ok
Restoring global objects in the new cluster   ok
Restoring database schemas in the new cluster 
*failure*

Consult the last few lines of 
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log"
 for
the probable cause of the failure.
Failure, exiting
[14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster
[14:23:26.632](0.000s) #   Failed test 'run of pg_upgrade of old cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl 
line 170.
### Starting node "newpub"
# Running: pg_ctl -w -D 
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata
 -l 
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log
 -o --cluster-name=newpub start
waiting for server to start done
server started
# Postmaster PID for node "newpub" is 4604
[14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster
[14:23:28.398](0.001s) #   Failed test 'check the slot exists on new cluster'
#   at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl 
line 176.
[14:23:28.399](0.000s) #  got: ''
# expected: 'regress_sub|t'
...
```
[3]: 
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170


Best Regards,
Hayato Kuroda
FUJITSU LIMITED





RE: pg_upgrade test failure

2023-10-28 Thread Hayato Kuroda (Fujitsu)
Dear Andres,

While tracking BF failures related to pg_upgrade, I found that the same failure
still happens [1] - [4].
According to the log, the output directory remained even after a successful
upgrade [5].
I analyzed it and attached a fix patch; below is my analysis... what do you think?

=

lstat() seems to fail during the second try of rmtree(). This error message is
output from get_dirent_type().

Unlike pgunlink(), get_dirent_type() does not have a retry mechanism when
lstat()->_pglstat64() detects STATUS_DELETE_PENDING. Therefore, I think rmtree()
may not wait for the file to be really removed if that status is detected in
get_dirent_type().

One solution is to retry the stat() or lstat() in get_dirent_type() as well, as in
the attached patch.
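
To sketch the idea (the attached patch is the real proposal; the helper below
borrows lstat_error_was_status_delete_pending(), which currently lives as a static
function in dirmod.c, and pg_usleep() from src/port):

```
static int
lstat_retrying_delete_pending(const char *path, struct stat *st)
{
	int			rc;
	int			loops = 0;

	/* mirror pgunlink()'s wait for STATUS_DELETE_PENDING to clear */
	while ((rc = lstat(path, st)) < 0 &&
		   lstat_error_was_status_delete_pending())
	{
		if (++loops > 100)		/* time out after 10 sec */
			break;
		pg_usleep(100000);		/* us */
	}
	return rc;
}
```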


[1]: 2023-07-21 02:21:53 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-07-21%2002%3A21%3A53
[2]: 2023-10-21 13:39:15 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-21%2013%3A39%3A15
[3]: 2023-10-23 09:03:07 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-23%2009%3A03%3A07
[4]: 2023-10-27 23:06:17 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-10-27%2023%3A06%3A17
[5]
```
...
*Clusters are compatible*
pg_upgrade: warning: could not remove directory 
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log":
 Directory not empty
pg_upgrade: warning: could not remove directory 
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867":
 Directory not empty
pg_upgrade: warning: could not stat file 
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log/pg_upgrade_internal.log":
 No such file or directory
pg_upgrade: warning: could not remove directory 
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log":
 Directory not empty
pg_upgrade: warning: could not remove directory 
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867":
 Directory not empty
[23:46:07.585](17.106s) ok 12 - run of pg_upgrade --check for new instance
[23:46:07.587](0.002s) not ok 13 - pg_upgrade_output.d/ removed after 
pg_upgrade --check success
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Retry-stat-and-lstat-even-in-get_dirent_type.patch
Description: 0001-Retry-stat-and-lstat-even-in-get_dirent_type.patch


Re: pg_upgrade test failure

2023-02-06 Thread Andres Freund
On 2023-02-06 14:14:22 -0800, Andres Freund wrote:
> On 2023-02-07 11:03:18 +1300, Thomas Munro wrote:
> > What I see is that there were 1254 FreeBSD tasks run in that window, of
> > which 163 failed, and (more interestingly) 111 of those failures succeeded
> > on every other platform.  And clicking on a few on cfbot's page reveals that
> > it's the new running stuff, and I'm still trying to find the interesting
> > logs...
> 
> I think I figured out why the logs frequently fail to upload - the server is
> still running, so the size changes during the upload, causing the upload to
> fail with errors like:
> 
> [12:46:43.552] Failed to upload artifacts: Put 
> "https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request=20230206T124536Z=600=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52":
>  http2: request body larger than specified content length
> 
> I'm testing adding a pg_ctl stop to the on_failure right now.

Pushed the fix doing so.




Re: pg_upgrade test failure

2023-02-06 Thread Andres Freund
Hi,

On 2023-02-07 11:03:18 +1300, Thomas Munro wrote:
> On Tue, Feb 7, 2023 at 10:57 AM Andres Freund  wrote:
> > On February 6, 2023 1:51:20 PM PST, Thomas Munro  
> > wrote:
> > >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> > >builds on FreeBSD.  I'll go and ping that thread...
> >
> > Is that rate unchanged? I thought I fixed the main issue last week?
> 
> Unfortunately my cfbot database only holds a week's history.

Would be interesting to increase that to a considerably longer time. I can't
imagine that that'd take all that much resources?


> What I see is that there were 1254 FreeBSD tasks run in that window, of
> which 163 failed, and (more interestingly) 111 of those failures succeeded
> on every other platform.  And clicking on a few on cfbot's page reveals that
> it's the new running stuff, and I'm still trying to find the interesting
> logs...

I think I figured out why the logs frequently fail to upload - the server is
still running, so the size changes during the upload, causing the upload to
fail with errors like:

[12:46:43.552] Failed to upload artifacts: Put 
"https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request=20230206T124536Z=600=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52":
 http2: request body larger than specified content length

I'm testing adding a pg_ctl stop to the on_failure right now.

Greetings,

Andres Freund




Re: pg_upgrade test failure

2023-02-06 Thread Thomas Munro
On Tue, Feb 7, 2023 at 11:03 AM Thomas Munro  wrote:
> On Tue, Feb 7, 2023 at 10:57 AM Andres Freund  wrote:
> > On February 6, 2023 1:51:20 PM PST, Thomas Munro  
> > wrote:
> > >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> > >builds on FreeBSD.  I'll go and ping that thread...
> >
> > Is that rate unchanged? I thought I fixed the main issue last week?
>
> Unfortunately my cfbot database only holds a week's history.  What I
> see is that there were 1254 FreeBSD tasks run in that window, of which
> 163 failed, and (more interestingly) 111 of those failures succeeded
> on every other platform.  And clicking on a few on cfbot's page
> reveals that it's the new running stuff, and I'm still trying to find
> the interesting logs...

Ah, that number might include some other problems, including in
subscription (#2900).  That's the problem with flapping tests, you get
desensitised and stop looking closely and miss things...




Re: pg_upgrade test failure

2023-02-06 Thread Thomas Munro
On Tue, Feb 7, 2023 at 10:57 AM Andres Freund  wrote:
> On February 6, 2023 1:51:20 PM PST, Thomas Munro  
> wrote:
> >Next up: the new "running" tests, spuriously failing around 8.8% of CI
> >builds on FreeBSD.  I'll go and ping that thread...
>
> Is that rate unchanged? I thought I fixed the main issue last week?

Unfortunately my cfbot database only holds a week's history.  What I
see is that there were 1254 FreeBSD tasks run in that window, of which
163 failed, and (more interestingly) 111 of those failures succeeded
on every other platform.  And clicking on a few on cfbot's page
reveals that it's the new running stuff, and I'm still trying to find
the interesting logs...




Re: pg_upgrade test failure

2023-02-06 Thread Andres Freund
Hi, 

On February 6, 2023 1:51:20 PM PST, Thomas Munro  wrote:
>Next up: the new "running" tests, spuriously failing around 8.8% of CI
>builds on FreeBSD.  I'll go and ping that thread...

Is that rate unchanged? I thought I fixed the main issue last week?

Greetings,

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pg_upgrade test failure

2023-02-06 Thread Thomas Munro
On Wed, Feb 1, 2023 at 2:44 PM Thomas Munro  wrote:
> OK, I pushed that.  Third time lucky?

I pulled down logs for a week of Windows CI, just over ~1k builds.
The failure rate was a few per day before, but there are no failures
like that after that went in.  There are logs that contain the
"Directory not empty" warning, but clearly the
try-again-and-this-time-wait-for-the-other-process logic must be
working (as horrible as it is) because then the test checks that the
directory is gone, and succeeds.  Hooray.

So that's one of our biggest CI flappers fixed.  Unfortunately without
treating the root cause, really.

Next up: the new "running" tests, spuriously failing around 8.8% of CI
builds on FreeBSD.  I'll go and ping that thread...




Re: pg_upgrade test failure

2023-01-31 Thread Thomas Munro
On Wed, Feb 1, 2023 at 10:08 AM Thomas Munro  wrote:
> On Wed, Feb 1, 2023 at 10:04 AM Andres Freund  wrote:
> > Maybe we should just handle it by sleeping and retrying, if on windows? Sad 
> > to even propose...
>
> Yeah, that's what that code I posted would do automatically, though
> it's a bit hidden.  The second attempt to unlink() would see delete
> already pending, and activate its secret internal sleep/retry loop.

OK, I pushed that.  Third time lucky?

I tracked down the discussion of that existing comment about pg_ctl,
which comes from the 9.2 days:

https://www.postgresql.org/message-id/flat/5044DE59.5020500%40dunslane.net

I guess maybe back then fopen() was Windows' own fopen() that wouldn't
allow two handles to a file at the same time?  These days we redirect
it to a wrapper with the magic "shared" flags, so the kluge installed
by commit f8c81c5dde2 may not even be needed anymore.  It does
demonstrate that there are long standing timing races around log
files, process exit and wait-for-shutdown logic, though.

Someone who develops for Windows could probably chase this right down,
and make sure that we do certain things in the right order, and/or
find better kernel facilities; at a wild guess, something like
OpenProcess() before you initiate shutdown, so you can then wait on
its handle, for example.  The docs for ExitProcess() make it clear
that handles are synchronously closed, so I think it's probably just
that our tests for when processes have exited are too fuzzy.
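
To make that wild guess slightly more concrete, a rough sketch of the kind of wait
it implies (illustrative only, names made up):

```
#include <windows.h>
#include <stdbool.h>

static bool
wait_for_process_exit(DWORD pid, DWORD timeout_ms)
{
	/* take the handle before initiating shutdown, then wait on it */
	HANDLE		h = OpenProcess(SYNCHRONIZE, FALSE, pid);
	DWORD		rc;

	if (h == NULL)
		return true;			/* already gone, or we lack access */
	rc = WaitForSingleObject(h, timeout_ms);
	CloseHandle(h);
	return rc == WAIT_OBJECT_0;	/* true only once the process is torn down */
}
```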




Re: pg_upgrade test failure

2023-01-31 Thread Thomas Munro
On Wed, Feb 1, 2023 at 9:54 AM Thomas Munro  wrote:
> ... I have one more idea ...

I also had a second idea, barely good enough to mention and probably
just paranoia.  In a nearby thread I learned that process exit does
not release Windows advisory file locks synchronously, which surprised
this Unix hacker; it made me wonder what else might be released lazily
after process exit.  Handles?!  However, as previously mentioned, it's
possible that even with fully Unix-like resource cleanup on process
exit, we could be confused if we are using "the process that was on
the end of this pipe has closed it" as a proxy for "the process is
gone, *all* its handles are closed".  In any case, the previous kluge
should help wallpaper over any of that too, for this test anyway.




Re: pg_upgrade test failure

2023-01-31 Thread Thomas Munro
On Wed, Feb 1, 2023 at 10:04 AM Andres Freund  wrote:
> On January 31, 2023 12:54:42 PM PST, Thomas Munro  
> wrote:
> >I'm not sure about anything, but if that's what's happening here, then
> >maybe the attached would help.  In short, it would make the previous
> >theory true (the idea of a second unlink() saving the day).
>
> Maybe we should just handle it by sleeping and retrying, if on windows? Sad 
> to even propose...

Yeah, that's what that code I posted would do automatically, though
it's a bit hidden.  The second attempt to unlink() would see delete
already pending, and activate its secret internal sleep/retry loop.




Re: pg_upgrade test failure

2023-01-31 Thread Andres Freund
Hi, 

On January 31, 2023 12:54:42 PM PST, Thomas Munro  
wrote:
>On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby  wrote:
>> > I pushed the rmtree() change.  Let's see if that helps, or tells us
>> > something new.
>>
>> I found a few failures since then:
>>
>> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>>
>> pg_upgrade: warning: could not remove directory 
>> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log":
>>  Directory not empty
>> pg_upgrade: warning: could not remove directory 
>> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720":
>>  Directory not empty
>
>So no change: we didn't see "could not unlink file ...".  So I think
>that means that it was rmtree() that unlinked the file for the *first*
>time, but someone else has it open.
>
>Even though Windows is at this point eroding my love of computers and
>making me consider a new career in, I dunno, carrot farming or
>something, I have one more idea.  Check out this kluge in
>src/bin/pg_upgrade/exec.c:
>
>/*
> * "pg_ctl -w stop" might have reported that the server has stopped
> * because the postmaster.pid file has been removed, but "pg_ctl -w
> * start" might still be in the process of closing and might still be
> * holding its stdout and -l log file descriptors open.  Therefore,
> * try to open the log file a few more times.
> */
>
>I'm not sure about anything, but if that's what's happening here, then
>maybe the attached would help.  In short, it would make the previous
>theory true (the idea of a second unlink() saving the day).


Maybe we should just handle it by sleeping and retrying, if on windows? Sad to 
even propose... 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pg_upgrade test failure

2023-01-31 Thread Thomas Munro
On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby  wrote:
> > I pushed the rmtree() change.  Let's see if that helps, or tells us
> > something new.
>
> I found a few failures since then:
>
> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> pg_upgrade: warning: could not remove directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log":
>  Directory not empty
> pg_upgrade: warning: could not remove directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720":
>  Directory not empty

So no change: we didn't see "could not unlink file ...".  So I think
that means that it was rmtree() that unlinked the file for the *first*
time, but someone else has it open.

Even though Windows is at this point eroding my love of computers and
making me consider a new career in, I dunno, carrot farming or
something, I have one more idea.  Check out this kluge in
src/bin/pg_upgrade/exec.c:

/*
 * "pg_ctl -w stop" might have reported that the server has stopped
 * because the postmaster.pid file has been removed, but "pg_ctl -w
 * start" might still be in the process of closing and might still be
 * holding its stdout and -l log file descriptors open.  Therefore,
 * try to open the log file a few more times.
 */

I'm not sure about anything, but if that's what's happening here, then
maybe the attached would help.  In short, it would make the previous
theory true (the idea of a second unlink() saving the day).
diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c
index 42dcbfc5b5..f214b7737f 100644
--- a/src/bin/pg_upgrade/util.c
+++ b/src/bin/pg_upgrade/util.c
@@ -68,7 +68,12 @@ cleanup_output_dirs(void)
 	if (log_opts.retain)
 		return;
 
-	(void) rmtree(log_opts.basedir, true);
+	/*
+	 * Try twice.  The second time might wait for files to be concurrently
+	 * closed on Windows.
+	 */
+	if (!rmtree(log_opts.basedir, true))
+		rmtree(log_opts.basedir, true);
 
 	/* Remove pg_upgrade_output.d only if empty */
 	switch (pg_check_dir(log_opts.rootdir))
@@ -80,7 +85,12 @@ cleanup_output_dirs(void)
 
 		case 1:	/* exists and empty */
 		case 2:	/* exists and contains only dot files */
-			(void) rmtree(log_opts.rootdir, true);
+			/*
+			 * Try twice.  The second time might wait for files to be
+			 * concurrently closed on Windows.
+			 */
+			if (!rmtree(log_opts.rootdir, true))
+rmtree(log_opts.rootdir, true);
 			break;
 
 		case 4:	/* exists */


Re: pg_upgrade test failure

2023-01-31 Thread Justin Pryzby
On Tue, Jan 31, 2023 at 02:00:05PM +1300, Thomas Munro wrote:
> On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro  wrote:
> > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund  wrote:
> > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > > > So [1] on its own didn't fix this.  My next guess is that the attached
> > > > might help.
> >
> > > What is our plan here? This afaict is the most common "false positive" for
> > > cfbot in the last weeks.
> 
> I pushed the rmtree() change.  Let's see if that helps, or tells us
> something new.

I found a few failures since then:

https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

pg_upgrade: warning: could not remove directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log":
 Directory not empty
pg_upgrade: warning: could not remove directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720":
 Directory not empty

https://api.cirrus-ci.com/v1/artifact/task/5119776607961088/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
same

I verified that those both include your 54e72b66e, which is pretty
strange, since the patch passed tests tens of times on CI until it was
merged, at which point it started (and kept) failing.

-- 
Justin




Re: pg_upgrade test failure

2023-01-30 Thread Thomas Munro
On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro  wrote:
> On Wed, Dec 7, 2022 at 7:15 AM Andres Freund  wrote:
> > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > > So [1] on its own didn't fix this.  My next guess is that the attached
> > > might help.
>
> > What is our plan here? This afaict is the most common "false positive" for
> > cfbot in the last weeks.

I pushed the rmtree() change.  Let's see if that helps, or tells us
something new.

Michael:  There were some questions from Andres above.  FWIW I think
if you wanted to investigate this properly on a local Windows system
to chase down who's got the file open (shutdown sequence problem or
whatever), you'd probably have to install Server 2019, or maybe use an
old 8.1 VM if you still have such a thing, based on the suspicion that
typical 10 and 11 systems won't exhibit the problem.  But then I could
be wrong about what's going on...




Re: pg_upgrade test failure

2023-01-04 Thread Thomas Munro
On Wed, Dec 7, 2022 at 7:15 AM Andres Freund  wrote:
> On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > So [1] on its own didn't fix this.  My next guess is that the attached
> > might help.

> What is our plan here? This afaict is the most common "false positive" for
> cfbot in the last weeks.

That branch hasn't failed on cfbot[1], except once due to one of the
other known flapping races we have to fix.  Which doesn't prove
anything, of course, but it is encouraging.  I wish we knew why the
test does this, though.

Here's a better version that works harder to avoid opening more than
one fd at a time (like the pgfnames()-based code it replaces), and
also uses fd.c facilities in the backend version (unlike pgfnames(),
which looks like it could leak a descriptor if palloc() threw, and
also doesn't know how to handle file descriptor pressure).

[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/41/4011
From 3c9ed22ee207f2a5c008a971f8b19561d2e0ccfa Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 5 Jan 2023 14:15:37 +1300
Subject: [PATCH v2] Refactor rmtree() to use get_dirent_type().

Switch to get_dirent_type() instead of lstat() while traversing a
directory graph, to see if that fixes intermittent ENOTEMPTY failures
while cleaning up after the pg_upgrade tests, on our Windows CI build.

Our CI image uses Windows Server 2019, a version known not to have POSIX
unlink semantics enabled by default yet, unlike typical Windows 10 and
11 systems.

The theory is that the offending directory entries must be
STATUS_DELETE_PENDING.  With this change, rmtree() should not skip them,
and try to unlink again.  Our unlink() wrapper should either wait a
short time for them to go away (because other processes close the
handle) or log a message to tell us the path of the problem file if not,
so we can dig further.

Discussion: https://postgr.es/m/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de
---
 src/common/rmtree.c | 120 +++-
 1 file changed, 64 insertions(+), 56 deletions(-)

diff --git a/src/common/rmtree.c b/src/common/rmtree.c
index d445e21ba2..a3e536149b 100644
--- a/src/common/rmtree.c
+++ b/src/common/rmtree.c
@@ -20,13 +20,21 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "common/file_utils.h"
+
 #ifndef FRONTEND
+#include "storage/fd.h"
 #define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
+#define LOG_LEVEL WARNING
+#define OPENDIR(x) AllocateDir(x)
+#define CLOSEDIR(x) FreeDir(x)
 #else
 #include "common/logging.h"
+#define LOG_LEVEL PG_LOG_WARNING
+#define OPENDIR(x) opendir(x)
+#define CLOSEDIR(x) closedir(x)
 #endif
 
-
 /*
  *	rmtree
  *
@@ -41,82 +49,82 @@
 bool
 rmtree(const char *path, bool rmtopdir)
 {
-	bool		result = true;
 	char		pathbuf[MAXPGPATH];
-	char	  **filenames;
-	char	  **filename;
-	struct stat statbuf;
-
-	/*
-	 * we copy all the names out of the directory before we start modifying
-	 * it.
-	 */
-	filenames = pgfnames(path);
+	DIR		   *dir;
+	struct dirent *de;
+	bool		result = true;
+	size_t		dirnames_size = 0;
+	size_t		dirnames_capacity = 8;
+	char	  **dirnames = palloc(sizeof(char *) * dirnames_capacity);
 
-	if (filenames == NULL)
+	dir = OPENDIR(path);
+	if (dir == NULL)
+	{
+		pg_log_warning("could not open directory \"%s\": %m", path);
 		return false;
+	}
 
-	/* now we have the names we can start removing things */
-	for (filename = filenames; *filename; filename++)
+	while (errno = 0, (de = readdir(dir)))
 	{
-		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, *filename);
-
-		/*
-		 * It's ok if the file is not there anymore; we were just about to
-		 * delete it anyway.
-		 *
-		 * This is not an academic possibility. One scenario where this
-		 * happens is when bgwriter has a pending unlink request for a file in
-		 * a database that's being dropped. In dropdb(), we call
-		 * ForgetDatabaseSyncRequests() to flush out any such pending unlink
-		 * requests, but because that's asynchronous, it's not guaranteed that
-		 * the bgwriter receives the message in time.
-		 */
-		if (lstat(pathbuf, &statbuf) != 0)
-		{
-			if (errno != ENOENT)
-			{
-pg_log_warning("could not stat file or directory \"%s\": %m",
-			   pathbuf);
-result = false;
-			}
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
-		}
-
-		if (S_ISDIR(statbuf.st_mode))
-		{
-			/* call ourselves recursively for a directory */
-			if (!rmtree(pathbuf, true))
-			{
-/* we already reported the error */
-result = false;
-			}
-		}
-		else
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
+		switch (get_dirent_type(pathbuf, de, false, LOG_LEVEL))
 		{
-			if (unlink(pathbuf) != 0)
-			{
-if (errno != ENOENT)
+			case PGFILETYPE_ERROR:
+/* already logged, press on */
+break;
+			case PGFILETYPE_DIR:
+
+/*
+ * Defer recursion until after we've closed this directory, to
+ * avoid using more than one file descriptor at a time.
+ */
+if (dirnames_size == 

Re: pg_upgrade test failure

2022-12-06 Thread Andres Freund
Hi,

On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this.  My next guess is that the attached
> might help.
> 
> Hmm.  Following Michael's clue that this might involve log files and
> pg_ctl, I noticed one thing: pg_ctl implements
> wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
> our kill emulation does CallNamedPipe().  If the server is in the
> process of exiting and the kernel is cleaning up all the handles we
> didn't close, is there any reason to expect the signal pipe to be
> closed after the log file?

What is our plan here? This afaict is the most common "false positive" for
cfbot in the last weeks.

E.g.:

https://api.cirrus-ci.com/v1/artifact/task/5462686092230656/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
...
[00:02:58.761](93.859s) ok 10 - run of pg_upgrade for new instance
[00:02:58.808](0.047s) not ok 11 - pg_upgrade_output.d/ removed after 
pg_upgrade success
[00:02:58.815](0.007s) #   Failed test 'pg_upgrade_output.d/ removed after 
pg_upgrade success'
#   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 288.


Michael:

Why does 002_pg_upgrade.pl try to filter the list of files in
pg_upgrade_output.d for files ending in .log? And why does it print those
only after starting the new node?

How about moving the iteration through the pg_upgrade_output.d to before the
->start and printing all the files, but only slurp_file() if the filename ends
with *.log?

Minor nit: It seems off to have quite so many copies of
  $newnode->data_dir . "/pg_upgrade_output.d"
particularly where the test defines $log_path, but then still builds
it from scratch after (line 304).

Greetings,

Andres Freund




Re: pg_upgrade test failure

2022-11-16 Thread Justin Pryzby
On Tue, Nov 08, 2022 at 01:16:09AM +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this.  My next guess is that the attached
> might help.

I took the liberty of adding a CF entry for this
https://commitfest.postgresql.org/41/4011/

And afterwards figured I could be a little bit wasteful and run the
tests using meson test --repeat, rather than let cfbot do it over the
course of a month.
https://cirrus-ci.com/task/5115893722644480

So I didn't find evidence that it doesn't resolve the issue (but this
also doesn't prove that it will work).

-- 
Justin




Re: pg_upgrade test failure

2022-11-07 Thread Thomas Munro
So [1] on its own didn't fix this.  My next guess is that the attached
might help.

Hmm.  Following Michael's clue that this might involve log files and
pg_ctl, I noticed one thing: pg_ctl implements
wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
our kill emulation does CallNamedPipe().  If the server is in the
process of exiting and the kernel is cleaning up all the handles we
didn't close, is there any reason to expect the signal pipe to be
closed after the log file?

[1] 
https://www.postgresql.org/message-id/flat/20221025213055.GA8537%40telsasoft.com#9030de6c4c5e544d2b057b793a5b42af
From 9cbfd5c1fc1327443152d89ca7f5346bfeda3f52 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 7 Nov 2022 17:43:42 +1300
Subject: [PATCH] Refactor rmtree() to use get_dirent_type().

Switch to get_dirent_type() instead of lstat() while traversing a
directory graph, to see if that fixes intermittent ENOTEMPTY failures
while cleaning up after the pg_upgrade tests.

The idea is that the offending directory entries must be
STATUS_DELETE_PENDING.  With this change, rmtree() should not skip them,
and try to unlink again.  That should either wait a short time for them
to go away (because other processes close the handle) or log a message
to tell us the path of the problem file if not so we can dig further.

Discussion: https://postgr.es/m/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de
---
 src/common/rmtree.c | 94 ++---
 1 file changed, 37 insertions(+), 57 deletions(-)

diff --git a/src/common/rmtree.c b/src/common/rmtree.c
index 221d0e20a7..736e34892c 100644
--- a/src/common/rmtree.c
+++ b/src/common/rmtree.c
@@ -20,13 +20,16 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "common/file_utils.h"
+
 #ifndef FRONTEND
 #define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
+#define LOG_LEVEL WARNING
 #else
 #include "common/logging.h"
+#define LOG_LEVEL PG_LOG_WARNING
 #endif
 
-
 /*
  *	rmtree
  *
@@ -41,82 +44,59 @@
 bool
 rmtree(const char *path, bool rmtopdir)
 {
-	bool		result = true;
 	char		pathbuf[MAXPGPATH];
-	char	  **filenames;
-	char	  **filename;
-	struct stat statbuf;
-
-	/*
-	 * we copy all the names out of the directory before we start modifying
-	 * it.
-	 */
-	filenames = pgfnames(path);
+	DIR		   *dir;
+	struct dirent *de;
+	bool		result = true;
 
-	if (filenames == NULL)
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		pg_log_warning("could not open directory \"%s\": %m", path);
 		return false;
+	}
 
-	/* now we have the names we can start removing things */
-	for (filename = filenames; *filename; filename++)
+	while (errno = 0, (de = readdir(dir)))
 	{
-		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, *filename);
-
-		/*
-		 * It's ok if the file is not there anymore; we were just about to
-		 * delete it anyway.
-		 *
-		 * This is not an academic possibility. One scenario where this
-		 * happens is when bgwriter has a pending unlink request for a file in
-		 * a database that's being dropped. In dropdb(), we call
-		 * ForgetDatabaseSyncRequests() to flush out any such pending unlink
-		 * requests, but because that's asynchronous, it's not guaranteed that
-		 * the bgwriter receives the message in time.
-		 */
-		if (lstat(pathbuf, &statbuf) != 0)
-		{
-			if (errno != ENOENT)
-			{
-pg_log_warning("could not stat file or directory \"%s\": %m",
-			   pathbuf);
-result = false;
-			}
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
-		}
-
-		if (S_ISDIR(statbuf.st_mode))
-		{
-			/* call ourselves recursively for a directory */
-			if (!rmtree(pathbuf, true))
-			{
-/* we already reported the error */
-result = false;
-			}
-		}
-		else
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
+		switch (get_dirent_type(pathbuf, de, false, LOG_LEVEL))
 		{
-			if (unlink(pathbuf) != 0)
-			{
-if (errno != ENOENT)
+			case PGFILETYPE_ERROR:
+/* already logged, press on */
+break;
+			case PGFILETYPE_DIR:
+if (!rmtree(pathbuf, true))
+	result = false;
+break;
+			default:
+if (unlink(pathbuf) != 0 && errno != ENOENT)
 {
-	pg_log_warning("could not remove file or directory \"%s\": %m",
-   pathbuf);
+	pg_log_warning("could not unlink file \"%s\": %m", pathbuf);
 	result = false;
 }
-			}
+break;
 		}
 	}
 
+	if (errno != 0)
+	{
+		pg_log_warning("could not read directory \"%s\": %m", path);
+		result = false;
+	}
+
+	closedir(dir);
+
 	if (rmtopdir)
 	{
 		if (rmdir(path) != 0)
 		{
-			pg_log_warning("could not remove file or directory \"%s\": %m",
-		   path);
+			pg_log_warning("could not remove directory \"%s\": %m", path);
 			result = false;
 		}
 	}
 
-	pgfnames_cleanup(filenames);
-
 	return result;
 }
-- 
2.30.2



Re: pg_upgrade test failure

2022-10-18 Thread Andres Freund
Hi,

On 2022-10-17 23:31:44 -0500, Justin Pryzby wrote:
> On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> > On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > > * Server 2019, as used on CI, still uses the traditional NT semantics
> > > (unlink is asynchronous, when all handles closes)
> > > * the fix I proposed has the right effect (I will follow up with tests
> > > to demonstrate)
> > 
> > Wow, nice investigation.  And cirrus does not offer a newer option
> > either..
> 
> Currently Andres builds images based on cirrus's 2019 image, but I think
> we could use any windows docker image.

You unfortunately can't run newer containers than the host OS :(, just use
older ones. And if you use mismatching containers the startup gets slower
because it switches to use full virtualization rather than containers.

I think we need to switch to use full VMs rather than containers. The
performance of the windows containers is just atrocious (build times on a
local VM with the same number of cores is 1/2 of what we see in CI, test times
1/3), they're slow to start due to pulling all files and decompressing them,
and they're fragile. I've asked Bilal (CCed) to work on generating both
containers and images.

Greetings,

Andres Freund




Re: pg_upgrade test failure

2022-10-17 Thread Justin Pryzby
On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > * Server 2019, as used on CI, still uses the traditional NT semantics
> > (unlink is asynchronous, when all handles closes)
> > * the fix I proposed has the right effect (I will follow up with tests
> > to demonstrate)
> 
> Wow, nice investigation.  And cirrus does not offer a newer option
> either..

Currently Andres builds images based on cirrus's 2019 image, but I think
we could use any windows docker image.

> Do you think that Windows server 2022 (successor of 2019) is
> able to use POSIX semantics for unlink()?

I think it's possible to use it now, like what's done here.
https://commitfest.postgresql.org/40/3347/

The only caveat is that it's done conditionally.




Re: pg_upgrade test failure

2022-10-17 Thread Michael Paquier
On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> * Server 2019, as used on CI, still uses the traditional NT semantics
> (unlink is asynchronous, when all handles closes)
> * the fix I proposed has the right effect (I will follow up with tests
> to demonstrate)

Wow, nice investigation.  And cirrus does not offer a newer option
either..  Do you think that Windows server 2022 (successor of 2019) is
able to use POSIX semantics for unlink()?  It looks like we are a few
years away from being able to do that, assuming that cirrus offers a
newer version than Server 2019, but I guess that the code could
mention this possibility in a comment, at least.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade test failure

2022-10-17 Thread Thomas Munro
On Mon, Oct 3, 2022 at 7:29 PM Michael Paquier  wrote:
> On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:
> > So I think that setting is_lnk = false is good enough here.  Do
> > you see a hole in it?
>
> I cannot think on one, on top of my head.  Thanks for the
> explanation.

Some things I learned while trying to understand how I managed to
introduce that bug despite reading and testing:

* the code in pgunlink() has comments saying that its retry loop is to
handle sharing violations
* in fact that retry loop also comes into play for STATUS_DELETE_PENDING
* that case is fairly well hidden, because to reach it you need to
unlink(pathname) twice!  the second call will wait up to 10 seconds
for handles to close and then report ENOENT, allowing rmdir(parent) to
succeed
* I guess this code is relying on that double-unlink to block until
the directory is empty?
* you wouldn't notice any of this if you were testing on Windows 10 on
a desktop/laptop/VM, because it now uses POSIX semantics for unlink on
NTFS, so the first unlink truly (synchronously) unlinks (no more
STATUS_DELETE_PENDING)
* Server 2019, as used on CI, still uses the traditional NT semantics
(unlink is asynchronous, when all handles closes)
* the fix I proposed has the right effect (I will follow up with tests
to demonstrate)

I'll post my tests for this and a bunch more things I figured out
shortly in a new Windows-filesystem-semantics-omnibus thread.
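
To spell out the double-unlink dance from the list above (purely illustrative,
assuming the traditional Server 2019 delete semantics):

```
static void
remove_file_then_parent(const char *file, const char *parent)
{
	(void) unlink(file);	/* returns 0, but may leave a delete-pending zombie */
	(void) unlink(file);	/* pgunlink() retries up to ~10 s, ends with ENOENT */
	(void) rmdir(parent);	/* only now does this have a chance to succeed */
}
```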




Re: pg_upgrade test failure

2022-10-03 Thread Michael Paquier
On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote:
> So I think that setting is_lnk = false is good enough here.  Do
> you see a hole in it?

I cannot think on one, on top of my head.  Thanks for the
explanation.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade test failure

2022-10-02 Thread Thomas Munro
On Mon, Oct 3, 2022 at 1:40 PM Michael Paquier  wrote:
> On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:
> > I think something like the attached should do the right thing for
> > STATUS_DELETE_PENDING (sort of "ENOENT-in-progress").  unlink() goes
> > back to being blocking (sleep+retry until eventually we reach ENOENT
> > or we time out and give up with EACCES), but we still distinguish it
> > from true ENOENT so we have a fast exit in that case.  This is passing
> > CI, but not tested yet.
>
if (lstat(path, &st) < 0)
> -   return -1;
> +   {
> +   if (lstat_error_was_status_delete_pending())
> +   is_lnk = false;
> +   else
> +   return -1;
> +   }
> +   else
> +   is_lnk = S_ISLNK(st.st_mode);

> Sorry, I don't remember all the details in this area, but a directory
> can never be marked as STATUS_DELETE_PENDING with some of its contents
> still inside, right?

That's my understanding, yes: just like Unix, you can't remove a
directory with something in it.  Unlike Unix, that includes files that
have been unlinked but are still open somewhere.  (Note that in this
case it's not exactly a real directory, it's a junction point, which
is a directory but it doesn't have contents, it is a reparse point
pointing somewhere else, so I suspect that it can't really suffer from
ENOTEMPTY, but it probably can suffer from 'someone has it open for a
short time because they are concurrently stat-ing it'.)

> If it has some contents, forcing unlink() all
> the time would be fine?

Here's why I think it's probably OK to use unlink() unconditionally
after detecting STATUS_DELETE_PENDING.  AFAICT there is no way to even
find out if it's a file or a junction in this state, but it doesn't
matter: we are not waiting for rmdir() or unlink() to succeed, we are
waiting for it to fail with an error other than EACCES, most likely
ENOENT (or to time out, perhaps because someone held the file open for
11 seconds, or because EACCES was actually telling us about a
permissions problem).  EACCES is the errno for many things including
STATUS_DELETE_PENDING and also "you called unlink() but it's a
directory" (should be EPERM according to POSIX, or EISDIR according
to Linux).  Both of those reasons imply that the zombie directory
entry still exists, and we don't care which of those reasons triggered
it.  So I think that setting is_lnk = false is good enough here.  Do
you see a hole in it?




Re: pg_upgrade test failure

2022-10-02 Thread Michael Paquier
On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote:
> I think something like the attached should do the right thing for
> STATUS_DELETE_PENDING (sort of "ENOENT-in-progress").  unlink() goes
> back to being blocking (sleep+retry until eventually we reach ENOENT
> or we time out and give up with EACCES), but we still distinguish it
> from true ENOENT so we have a fast exit in that case.  This is passing
> CI, but not tested yet.

if (lstat(path, &st) < 0)
-   return -1;
+   {
+   if (lstat_error_was_status_delete_pending())
+   is_lnk = false;
+   else
+   return -1;
+   }
+   else
+   is_lnk = S_ISLNK(st.st_mode);
Sorry, I don't remember all the details in this area, but a directory
can never be marked as STATUS_DELETE_PENDING with some of its contents
still inside, right?  If it has some contents, forcing unlink() all
the time would be fine?

> One ugly thing in this patch is that it has to deal with our
> historical mistake (?) of including Windows headers in this file in
> Cygwin builds for no reason and thus getting WIN32 defined on a
> non-WIN32 build, as I've complained about before[1] but not yet tidied
> up.

Your proposal remains local to dirmod.c, so that does not sound like a
big deal to me for the time being.
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade test failure

2022-10-02 Thread Thomas Munro
On Mon, Oct 3, 2022 at 9:07 AM Thomas Munro  wrote:
> On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby  wrote:
> > I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> > didn't warn about the file itself, but then failed one moment later in
> > rmdir.
>
> Yeah, I think this is my fault.  In commit f357233c the new lstat()
> call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
> enter pgunlink()'s 10 second sleep-retry loop.  Let me think about how
> best to fix that, and how to write a regression test program that
> would exercise stuff like this.  Might take a couple of days as I am
> away from computers until mid-week.

I think something like the attached should do the right thing for
STATUS_DELETE_PENDING (sort of "ENOENT-in-progress").  unlink() goes
back to being blocking (sleep+retry until eventually we reach ENOENT
or we time out and give up with EACCES), but we still distinguish it
from true ENOENT so we have a fast exit in that case.  This is passing
CI, but not tested yet.

One ugly thing in this patch is that it has to deal with our
historical mistake (?) of including Windows headers in this file in
Cygwin builds for no reason and thus getting WIN32 defined on a
non-WIN32 build, as I've complained about before[1] but not yet tidied
up.

Remembering why we do any of this weird looking stuff that we don't
need on Unix, the general idea is that things that scan directories to
unlink everything before unlinking the parent directory need to block
for a while on STATUS_DELETE_PENDING to increase their probability of
success, while things that do anything else probably just want to skip
such zombie files completely.  To recap, we have:

 * readdir() sees files that are ENOENT-in-progress (so recursive
unlinks can see them)
 * unlink() waits for ENOENT-in-progress to reach ENOENT (what broke here)
 * stat() and lstat() report ENOENT-in-progress as ENOENT (done to fix
eg pg_basebackup, which used to fail at random on Windows)
 * open() reports ENOENT-in-progress as either ENOENT or EEXIST
depending on O_CREAT (because used by stat())

Clearly this set of kludges isn't perfect and other kludge-sets would
be possible too.  One thought is that we could hide ENOENT-in-progress
from readdir(), and add a new rmdir() wrapper instead.  If it gets a
directory-not-empty error from the kernel, it could at that point wait
for zombie files to go away (perhaps registering for file system
events with some local equivalent of KQ_FILTER_VNODE if there is one,
to be less sloppy that the current sleep() nonsense, but sleep would
work too).
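
A sketch of that thought, with made-up names and pg_usleep() (from src/port)
standing in for a proper change notification (illustrative only):

```
#include <errno.h>
#include <unistd.h>

static int
pgrmdir_waiting(const char *path)
{
	for (int loops = 0; loops < 100; loops++)	/* give up after ~10 seconds */
	{
		if (rmdir(path) == 0)
			return 0;
		/* only wait out "not empty", i.e. possible delete-pending zombies */
		if (errno != ENOTEMPTY && errno != EEXIST)
			return -1;
		pg_usleep(100000);		/* 100 ms */
	}
	return -1;
}
```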

When I'm back at my real keyboard I'll try to come up with tests for
this stuff, but I'm not sure how solid we can really make a test for
this particular case -- I think you'd need to have another thread open
the file and then close it after different periods of time, to
demonstrate that the retry loop works but also gives up, and that's
exactly the sort of timing-dependent stuff we try to avoid.  But I
think I'll try that anyway, because it's essential infrastructure to
allow Unix-only hackers to work on this stuff.  Once we have that,
we might be able to make some more progress with the various
FILE_DISPOSITION_POSIX_SEMANTICS proposals, if it helps, because we'll
have reproducible evidence for what it really does.

[1] https://commitfest.postgresql.org/39/3781/
From 0fa290b4d3597c843804d3ee35559074864424aa Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 3 Oct 2022 09:15:02 +1300
Subject: [PATCH] Fix STATUS_DELETE_PENDING handling in pgunlink().

Commit c5cb8f3b didn't handle STATUS_DELETE_PENDING correctly, and would
report ENOENT immediately without waiting.  In order for simple recursive
unlink algorithms to have a chance of succeeding while other backends
might not have closed them yet, we rely on the 10-seconds-of-retries
behavior present in pgunlink().

Diagnosed-by: Justin Pryzby 
Discussion: https://postgr.es/m/20220920013122.GA31833%40telsasoft.com
---
 src/port/dirmod.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index ae6301dd6c..aae81228ad 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -39,6 +39,10 @@
 #endif
 #endif
 
+#if defined(WIN32) && !defined(__CYGWIN__)
+#include "port/win32ntdll.h"
+#endif
+
 #if defined(WIN32) || defined(__CYGWIN__)
 
 /*
@@ -91,6 +95,22 @@ pgrename(const char *from, const char *to)
 	return 0;
 }
 
+/*
+ * Check if _pglstat64()'s reason for failure was STATUS_DELETE_PENDING.
+ * This doesn't apply to Cygwin, which has its own lstat() that would report
+ * the case as EACCES.
+*/
+static bool
+lstat_error_was_status_delete_pending(void)
+{
+	if (errno != ENOENT)
+		return false;
+#if defined(WIN32) && !defined(__CYGWIN__)
+	if (pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
+		return true;
+#endif
+	return false;
+}
 
 /*
  *	pgunlink
@@ -98,6 +118,7 @@ pgrename(const char *from, const 

Re: pg_upgrade test failure

2022-10-02 Thread Thomas Munro
On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby  wrote:
> I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
> didn't warn about the file itself, but then failed one moment later in
> rmdir.

Yeah, I think this is my fault.  In commit f357233c the new lstat()
call might return ENOENT for STATUS_DELETE_PENDING, and then we don't
enter pgunlink()'s 10 second sleep-retry loop.  Let me think about how
best to fix that, and how to write a regression test program that
would exercise stuff like this.  Might take a couple of days as I am
away from computers until mid-week.
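
(Roughly, the problem shape is the following -- an illustration only, not
the committed code, with the function name and loop counts made up:)

static int
pgunlink_broken_sketch(const char *path)
{
	int			loops = 0;
	struct stat st;

	if (lstat(path, &st) < 0)
		return -1;				/* STATUS_DELETE_PENDING surfaces as plain
								 * ENOENT here, so we bail out and never
								 * reach the retry loop below */

	while (unlink(path) < 0)
	{
		if (errno != EACCES)
			return -1;
		if (++loops > 100)		/* ~10 seconds */
			return -1;
		pg_usleep(100000);
	}
	return 0;
}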




Re: pg_upgrade test failure

2022-10-02 Thread Andres Freund
Hi,

On 2022-09-27 11:47:37 +0530, Bharath Rupireddy wrote:
> Just for the record - the same issue was also seen here [1], [2].
> 
> [1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
> [2] 
> https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

Yea, this is at the moment one of the top sources of spurious test failures
for cfbot. Just manually looking at http://cfbot.cputube.org/ for tasks that
recently changed state on Windows:

https://cirrus-ci.com/task/6422687231770624?logs=check_world#L60
https://cirrus-ci.com/task/6408332243107840?logs=check_world#L60
https://cirrus-ci.com/task/6202259712245760?logs=check_world#L60
https://cirrus-ci.com/task/6150885981028352?logs=check_world#L60
https://cirrus-ci.com/task/5361597290905600?logs=check_world#L60
https://cirrus-ci.com/task/5177327624650752?logs=check_world#L60
https://cirrus-ci.com/task/4862503887831040?logs=check_world#L60
https://cirrus-ci.com/task/4576362479484928?logs=check_world#L60

Something needs to happen here.

Greetings,

Andres Freund




Re: pg_upgrade test failure

2022-09-27 Thread Bharath Rupireddy
On Tue, Sep 20, 2022 at 7:01 AM Justin Pryzby  wrote:
>
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > Hi,
> >
> > After my last rebase of the meson tree I encountered the following test
> > failure:
> >
> > https://cirrus-ci.com/task/5532444261613568
> >
> > [20:23:04.171] - 8< 
> > -
> > [20:23:04.171] stderr:
> > [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
> > pg_upgrade --check success'
> > [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> > 249.
> > [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after 
> > pg_upgrade success'
> > [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> > 261.
> > [20:23:04.171] # Looks like you failed 2 tests of 13.
> >
> > regress_log:
> > https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
> >
> > The pg_upgrade output contains these potentially relevant warnings:
> >
> > ...
> > *Clusters are compatible*
> > pg_upgrade: warning: could not remove file or directory 
> > "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
> >  Directory not empty
> > pg_upgrade: warning: could not remove file or directory 
> > "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
> >  Directory not empty
> > ...

Just for the record - the same issue was also seen here [1], [2].

[1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82
[2] 
https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_upgrade test failure

2022-09-19 Thread Justin Pryzby
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> Hi,
> 
> After my last rebase of the meson tree I encountered the following test
> failure:
> 
> https://cirrus-ci.com/task/5532444261613568
> 
> [20:23:04.171] - 8< 
> -
> [20:23:04.171] stderr:
> [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
> pg_upgrade --check success'
> [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> 249.
> [20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after pg_upgrade 
> success'
> [20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 
> 261.
> [20:23:04.171] # Looks like you failed 2 tests of 13.
> 
> regress_log:
> https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
> 
> The pg_upgrade output contains these potentially relevant warnings:
> 
> ...
> *Clusters are compatible*
> pg_upgrade: warning: could not remove file or directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
>  Directory not empty
> pg_upgrade: warning: could not remove file or directory 
> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
>  Directory not empty
> ...

It looks like it failed to remove a *.log file under Windows, which
caused rmtree to fail.

src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK   "pg_upgrade_dump_%u.log"
src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE         "pg_upgrade_server.log"
src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE        "pg_upgrade_utility.log"
src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE       "pg_upgrade_internal.log"

ee5353abb only changed .txt files used for errors so can't be the cause,
but the original commit 38bfae3 might be related.

I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so
didn't warn about the file itself, but then failed one moment later in
rmdir.
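
In other words, something like this shape (not the actual rmtree() code --
just an illustration of the suspected sequence using plain POSIX calls, with
a made-up function name and warning text):

#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
remove_contents_then_dir(const char *dirpath)
{
	DIR		   *dir = opendir(dirpath);
	struct dirent *de;
	char		path[4096];

	if (dir == NULL)
		return;

	while ((de = readdir(dir)) != NULL)
	{
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);

		/*
		 * A delete-pending file reported ENOENT from pgunlink(), so no
		 * warning was printed and we moved on immediately...
		 */
		if (unlink(path) != 0 && errno != ENOENT)
			fprintf(stderr, "could not remove file \"%s\": %s\n",
					path, strerror(errno));
	}
	closedir(dir);

	/* ...but the zombie entry still occupied the directory at this point. */
	if (rmdir(dirpath) != 0)
		fprintf(stderr, "could not remove directory \"%s\": %s\n",
				dirpath, strerror(errno));
}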

-- 
Justin




Re: pg_upgrade test failure

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote:
> I don't really see what'd race with what here? pg_upgrade has precise control
> over what's happening here, no?

A code path could have forgotten an fclose(), for example, but this code
is rather old and close-proof as far as I know.  Most of the log files
are used with redirections for external calls, though I don't see
how these could still be held after pg_upgrade finishes :/
Could the use of meson somehow influence things when running tests on Windows?

> I've only seen it once so far, but there haven't been many CI runs of the
> meson branch since rebasing ontop of the last changes to pg_upgrade.

Hmm, okay.  Is that a specific branch in one of your public repos?
--
Michael




Re: pg_upgrade test failure

2022-09-19 Thread Andres Freund
Hi,

On 2022-09-20 10:08:41 +0900, Michael Paquier wrote:
> On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> > I don't know if actually related to the commit below, but there've been a
> > lot of runs of the pg_upgrade tests in the meson branch, and this is the 
> > first
> > failure of this kind. Unfortunately the error seems to be transient -
> > rerunning the tests succeeded.
> 
> This smells to me like a race condition in pg_upgrade (or even pg_ctl
> for SERVER_LOG_FILE) where the code still has handles on some of the
> files in the log/ subdirectory, causing its removal to not be able to
> finish.

I don't really see what'd race with what here? pg_upgrade has precise control
over what's happening here, no?


> If this proves to be rather easy to reproduce, giving
> a list of the files still present in this path would give a hint that is easy
> to follow.  Does this reproduce with a good frequency?

I've only seen it once so far, but there haven't been many CI runs of the
meson branch since rebasing ontop of the last changes to pg_upgrade.

Greetings,

Andres Freund




Re: pg_upgrade test failure

2022-09-19 Thread Michael Paquier
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote:
> I don't know if actually related to the commit below, but there've been a
> lot of runs of the pg_upgrade tests in the meson branch, and this is the first
> failure of this kind. Unfortunately the error seems to be transient -
> rerunning the tests succeeded.

This smells to me like a race condition in pg_upgrade (or even pg_ctl
for SERVER_LOG_FILE) where the code still has handles on some of the
files in the log/ subdirectory, causing its removal to not be able to
finish.  If this proves to be rather easy to reproduce, giving
a list of the files still present in this path would give a hint that is easy
to follow.  Does this reproduce with a good frequency?
--
Michael




pg_upgrade test failure

2022-09-19 Thread Andres Freund
Hi,

After my last rebase of the meson tree I encountered the following test
failure:

https://cirrus-ci.com/task/5532444261613568

[20:23:04.171] - 8< 
-
[20:23:04.171] stderr:
[20:23:04.171] #   Failed test 'pg_upgrade_output.d/ not removed after 
pg_upgrade --check success'
[20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249.
[20:23:04.171] #   Failed test 'pg_upgrade_output.d/ removed after pg_upgrade 
success'
[20:23:04.171] #   at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261.
[20:23:04.171] # Looks like you failed 2 tests of 13.

regress_log:
https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

The pg_upgrade output contains these potentially relevant warnings:

...
*Clusters are compatible*
pg_upgrade: warning: could not remove file or directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log":
 Directory not empty
pg_upgrade: warning: could not remove file or directory 
"C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511":
 Directory not empty
...


I don't know if actually related to the commit below, but there've been a
lot of runs of the pg_upgrade tests in the meson branch, and this is the first
failure of this kind. Unfortunately the error seems to be transient -
rerunning the tests succeeded.

On 2022-09-13 01:39:59 +, Michael Paquier wrote:
> Move any remaining files generated by pg_upgrade into an internal subdir
>
> This change concerns a couple of .txt files (for internal state checks)
> that were still written in the path where the binary is executed, and
> not in the subdirectory located in the target cluster.  Like the other
> .txt files doing already so (like loadable_libraries.txt), these are
> saved in the base output directory.  Note that on failure, the logs
> report the full path to the .txt file generated, so these are easy to
> find.
>
> Oversight in 38bfae3.
>
> Author: Daniel Gustafsson
> Reviewed-by: Michael Paquier, Justin Pryzby
> Discussion: https://postgr.es/m/181a6da8-3b7f-4b71-82d5-363ff0146...@yesql.se
> Backpatch-through: 15


Greetings,

Andres Freund