RE: Random pg_upgrade test failure on drongo
Dear hackers,

> Thanks to both of you. I have pushed the patch.

I have been tracking the BF animals these days, and this failure has not
been seen anymore. I think we can close the topic. Again, thanks for all
the efforts.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Random pg_upgrade test failure on drongo
On Thu, Jan 11, 2024 at 8:15 AM Hayato Kuroda (Fujitsu) wrote:
>
> > > But tomorrow it could be for other tables and if we change this
> > > TRUNCATE logic for pg_largeobject (of which chances are less) then
> > > there is always a chance that one misses changing this comment. I feel
> > > keeping it generic in this case would be better as the problem is
> > > generic but it is currently shown for pg_largeobject.
> >
> > Yes, for sure. So let's keep it generic as you prefer.
> >
> > Thank you!
>
> Thanks for working on the patch. I'm also OK to push Amit's fix patch.
>

Thanks to both of you. I have pushed the patch.

--
With Regards,
Amit Kapila.
RE: Random pg_upgrade test failure on drongo
Dear Alexander, Amit,

> > But tomorrow it could be for other tables and if we change this
> > TRUNCATE logic for pg_largeobject (of which chances are less) then
> > there is always a chance that one misses changing this comment. I feel
> > keeping it generic in this case would be better as the problem is
> > generic but it is currently shown for pg_largeobject.
>
> Yes, for sure. So let's keep it generic as you prefer.
>
> Thank you!

Thanks for working on the patch. I'm also OK to push Amit's fix patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Random pg_upgrade test failure on drongo
10.01.2024 13:37, Amit Kapila wrote:
> But tomorrow it could be for other tables and if we change this
> TRUNCATE logic for pg_largeobject (of which chances are less) then
> there is always a chance that one misses changing this comment. I feel
> keeping it generic in this case would be better as the problem is
> generic but it is currently shown for pg_largeobject.

Yes, for sure. So let's keep it generic as you prefer.

Thank you!

Best regards,
Alexander
Re: Random pg_upgrade test failure on drongo
On Wed, Jan 10, 2024 at 3:30 PM Alexander Lakhin wrote:
>
> 10.01.2024 12:31, Amit Kapila wrote:
> > I am slightly hesitant to add any particular system table name in the
> > comments as this can happen for any other system table as well, so
> > slightly adjusted the comments in the attached. However, I think it is
> > okay to mention the particular system table name in the commit
> > message. Let me know what you think.
>
> Thank you, Amit!
>
> I'd like to note that the culprit is exactly pg_largeobject, as coded in
> dumpDatabase():
>     /*
>      * pg_largeobject comes from the old system intact, so set its
>      * relfrozenxids, relminmxids and relfilenode.
>      */
>     if (dopt->binary_upgrade)
>     ...
>             appendPQExpBufferStr(loOutQry,
>                                  "TRUNCATE pg_catalog.pg_largeobject;\n");
>
> I see no other TRUNCATEs (or similar logic) around, so I would specify the
> table name in the comments. Though maybe I'm missing something...
>

But tomorrow it could be for other tables and if we change this
TRUNCATE logic for pg_largeobject (of which chances are less) then
there is always a chance that one misses changing this comment. I feel
keeping it generic in this case would be better as the problem is
generic but it is currently shown for pg_largeobject.

--
With Regards,
Amit Kapila.
Re: Random pg_upgrade test failure on drongo
10.01.2024 12:31, Amit Kapila wrote:
> I am slightly hesitant to add any particular system table name in the
> comments as this can happen for any other system table as well, so
> slightly adjusted the comments in the attached. However, I think it is
> okay to mention the particular system table name in the commit
> message. Let me know what you think.

Thank you, Amit!

I'd like to note that the culprit is exactly pg_largeobject, as coded in
dumpDatabase():
    /*
     * pg_largeobject comes from the old system intact, so set its
     * relfrozenxids, relminmxids and relfilenode.
     */
    if (dopt->binary_upgrade)
    ...
            appendPQExpBufferStr(loOutQry,
                                 "TRUNCATE pg_catalog.pg_largeobject;\n");

I see no other TRUNCATEs (or similar logic) around, so I would specify the
table name in the comments. Though maybe I'm missing something...

Best regards,
Alexander
Re: Random pg_upgrade test failure on drongo
On Tue, Jan 9, 2024 at 4:30 PM Alexander Lakhin wrote:
>
> 09.01.2024 13:08, Amit Kapila wrote:
> >
> >> As to checkpoint_timeout, personally I would not increase it, because it
> >> seems unbelievable to me that pg_restore (with the cluster containing only
> >> two empty databases) can run for longer than 5 minutes. I'd rather
> >> investigate such a situation separately, in case we encounter it, but
> >> maybe it's only me.
> >>
> > I feel it is okay to set a higher value of checkpoint_timeout due to
> > the same reason though the probability is less. I feel here it is
> > important to explain in the comments why we are using these settings
> > in the new test. I have thought of something like: "During the
> > upgrade, bgwriter or checkpointer could hold the file handle for some
> > removed file. Now, during restore, when we try to create the file with
> > the same name, it errors out. This behavior is specific to only some
> > specific Windows versions and the probability of seeing this behavior
> > is higher in this test because we use wal_level as logical via
> > allows_streaming => 'logical' which in turn sets shared_buffers as
> > 1MB."
> >
> > Thoughts?
>
> I would describe that behavior as "During upgrade, when pg_restore performs
> CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
> handle for pg_largeobject, so the later TRUNCATE pg_largeobject command will
> fail if the OS (such as older Windows versions) doesn't completely remove an
> unlinked file while it's still open. ..."
>

I am slightly hesitant to add any particular system table name in the
comments as this can happen for any other system table as well, so
slightly adjusted the comments in the attached. However, I think it is
okay to mention the particular system table name in the commit
message. Let me know what you think.

--
With Regards,
Amit Kapila.

v2-0001-Fix-an-intermetant-BF-failure-in-003_logical_slot.patch
Description: Binary data
Re: Random pg_upgrade test failure on drongo
Hello Amit,

09.01.2024 13:08, Amit Kapila wrote:
>> As to checkpoint_timeout, personally I would not increase it, because it
>> seems unbelievable to me that pg_restore (with the cluster containing only
>> two empty databases) can run for longer than 5 minutes. I'd rather
>> investigate such a situation separately, in case we encounter it, but
>> maybe it's only me.
>
> I feel it is okay to set a higher value of checkpoint_timeout due to
> the same reason though the probability is less. I feel here it is
> important to explain in the comments why we are using these settings
> in the new test. I have thought of something like: "During the
> upgrade, bgwriter or checkpointer could hold the file handle for some
> removed file. Now, during restore, when we try to create the file with
> the same name, it errors out. This behavior is specific to only some
> specific Windows versions and the probability of seeing this behavior
> is higher in this test because we use wal_level as logical via
> allows_streaming => 'logical' which in turn sets shared_buffers as
> 1MB."
>
> Thoughts?

I would describe that behavior as "During upgrade, when pg_restore performs
CREATE DATABASE, bgwriter or checkpointer may flush buffers and hold a file
handle for pg_largeobject, so the later TRUNCATE pg_largeobject command will
fail if the OS (such as older Windows versions) doesn't completely remove an
unlinked file while it's still open. ..."

Best regards,
Alexander
Re: Random pg_upgrade test failure on drongo
On Tue, Jan 9, 2024 at 2:30 PM Alexander Lakhin wrote:
>
> 09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> > Based on the suggestion by Amit, I have created a patch with the
> > alternative approach. This just does GUC settings. The reported failure
> > is only for 003_logical_slots, but the patch also includes changes for
> > the recently added test, 004_subscription. IIUC, there is a possibility
> > that 004 would fail as well.
> >
> > Per our understanding, this patch can stop the random failures.
> > Alexander, can you test for confirmation?
>
> Yes, the patch fixes the issue for me (without the patch I observe failures
> on iterations 1-2, with 10 tests running in parallel, but with the patch
> 10 iterations succeeded).
>
> But as far as I can see, 004_subscription is not affected by the issue,
> because it doesn't enable streaming for nodes new_sub, new_sub1.
> As I noted before, I could see the failure only with
> shared_buffers = 1MB (which is set with allows_streaming => 'logical').
> So I'm not sure whether we need to modify 004 (or any other test that
> runs pg_upgrade).
>

I see your point, and the probable reason for failure with
shared_buffers=1MB is that the probability of bgwriter holding the file
handle for pg_largeobject increases. So, let's change it only for 003.

> As to checkpoint_timeout, personally I would not increase it, because it
> seems unbelievable to me that pg_restore (with the cluster containing only
> two empty databases) can run for longer than 5 minutes. I'd rather
> investigate such a situation separately, in case we encounter it, but
> maybe it's only me.
>

I feel it is okay to set a higher value of checkpoint_timeout due to
the same reason though the probability is less. I feel here it is
important to explain in the comments why we are using these settings
in the new test. I have thought of something like: "During the
upgrade, bgwriter or checkpointer could hold the file handle for some
removed file. Now, during restore, when we try to create the file with
the same name, it errors out. This behavior is specific to only some
specific Windows versions and the probability of seeing this behavior
is higher in this test because we use wal_level as logical via
allows_streaming => 'logical' which in turn sets shared_buffers as
1MB."

Thoughts?

--
With Regards,
Amit Kapila.
Re: Random pg_upgrade test failure on drongo
Hello Kuroda-san,

09.01.2024 08:49, Hayato Kuroda (Fujitsu) wrote:
> Based on the suggestion by Amit, I have created a patch with the
> alternative approach. This just does GUC settings. The reported failure
> is only for 003_logical_slots, but the patch also includes changes for
> the recently added test, 004_subscription. IIUC, there is a possibility
> that 004 would fail as well.
>
> Per our understanding, this patch can stop the random failures.
> Alexander, can you test for confirmation?

Yes, the patch fixes the issue for me (without the patch I observe failures
on iterations 1-2, with 10 tests running in parallel, but with the patch
10 iterations succeeded).

But as far as I can see, 004_subscription is not affected by the issue,
because it doesn't enable streaming for nodes new_sub, new_sub1.
As I noted before, I could see the failure only with
shared_buffers = 1MB (which is set with allows_streaming => 'logical').
So I'm not sure whether we need to modify 004 (or any other test that
runs pg_upgrade).

As to checkpoint_timeout, personally I would not increase it, because it
seems unbelievable to me that pg_restore (with the cluster containing only
two empty databases) can run for longer than 5 minutes. I'd rather
investigate such a situation separately, in case we encounter it, but
maybe it's only me.
On the other hand, if a checkpoint could occur for some reason within a
shorter time span, then increasing the timeout would not matter, I suppose.

(I've also tested the bgwriter_lru_maxpages-only modification of your patch
and can confirm that it works as well.)

Best regards,
Alexander
RE: Random pg_upgrade test failure on drongo
Dear Amit, Alexander,

> > We get the effect discussed when the background writer process decides to
> > flush a file buffer for pg_largeobject during stage 1.
> > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> > the result must be the same.)
> > And another important factor is shared_buffers = 1MB (set during the test).
> > With the default setting of 128MB I couldn't see the failure.
> >
> > It can be reproduced easily (on old Windows versions) just by running
> > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> > default cluster)).
> > If an old cluster contains a dozen databases, this increases the failure
> > probability significantly (with 10 additional databases I've got failures
> > on iterations 4, 1, 6).
> >
>
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
>
> [1] - https://commitfest.postgresql.org/40/3951/

Based on the suggestion by Amit, I have created a patch with the
alternative approach. This just does GUC settings. The reported failure
is only for 003_logical_slots, but the patch also includes changes for
the recently added test, 004_subscription. IIUC, there is a possibility
that 004 would fail as well.

Per our understanding, this patch can stop the random failures.
Alexander, can you test for confirmation?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

0001-Add-GUC-settings-to-avoid-random-failures-on-Windows.patch
Description: 0001-Add-GUC-settings-to-avoid-random-failures-on-Windows.patch
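[Editor's note: the patch is only available as a binary attachment here. Going by Amit's suggestion upthread, the GUC changes presumably amount to a postgresql.conf fragment along these lines (values assumed from the discussion, not read from the attachment):]

```
# Assumed test-only settings, per the discussion upthread:
#
# stop the background writer from flushing (and thereby holding handles
# on) buffers of freshly-unlinked catalog files during pg_upgrade
bgwriter_lru_maxpages = 0
# keep the checkpointer from doing the same via a timed checkpoint
checkpoint_timeout = 1h
```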
Re: Random pg_upgrade test failure on drongo
On Mon, Jan 8, 2024 at 9:36 PM Jim Nasby wrote:
>
> On 1/4/24 10:19 PM, Amit Kapila wrote:
> > On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin wrote:
> >>
> >> 03.01.2024 14:42, Amit Kapila wrote:
> >>>
> >>>> And the internal process is ... background writer (BgBufferSync()).
> >>>>
> >>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> >>>> got 20 x 10 tests passing.
> >>>>
> >>>> Thus, if we want just to get rid of the test failure, maybe it's enough
> >>>> to add this to the test's config...
> >>>>
> >>> What about checkpoints? Can't it do the same while writing the buffers?
> >>
> >> As we deal here with pg_upgrade/pg_restore, it must not be very easy to
> >> get the desired effect, but I think it's not impossible in principle.
> >> More details below.
> >> What happens during the pg_upgrade execution is essentially:
> >> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> >> -- this command flushes file buffers as well
> >> 2) ALTER DATABASE postgres OWNER TO ...
> >> 3) COMMENT ON DATABASE "postgres" IS ...
> >> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
> >> SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
> >> SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
> >> TRUNCATE pg_catalog.pg_largeobject;
> >> -- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
> >> ...
> >>
> >> We get the effect discussed when the background writer process decides to
> >> flush a file buffer for pg_largeobject during stage 1.
> >> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> >> the result must be the same.)
> >> And another important factor is shared_buffers = 1MB (set during the test).
> >> With the default setting of 128MB I couldn't see the failure.
> >>
> >> It can be reproduced easily (on old Windows versions) just by running
> >> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> >> default cluster)).
> >> If an old cluster contains a dozen databases, this increases the failure
> >> probability significantly (with 10 additional databases I've got failures
> >> on iterations 4, 1, 6).
> >>
> >
> > I don't have an old Windows environment to test but I agree with your
> > analysis and theory. The question is what should we do for these new
> > random BF failures? I think we should set bgwriter_lru_maxpages to 0
> > and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> > fix as part of this doesn't sound reasonable because this is an
> > existing problem and there seems to be another patch by Thomas that
> > probably deals with the root cause of the existing problem [1] as
> > pointed out by you.
> >
> > [1] - https://commitfest.postgresql.org/40/3951/
>
> Isn't this just sweeping the problem (non-POSIX behavior on SMB and
> ReFS) under the carpet? I realize that synthetic test workloads like
> pg_upgrade in a loop aren't themselves real-world scenarios, but what
> about other cases? Even if we're certain it's not possible for these
> issues to wedge a server, it's still not a good experience for users to
> get random, unexplained IO-related errors...
>

The point is that this is an existing known Windows behavior, and that too
only in certain versions. The fix doesn't seem to be straightforward, so it
seems advisable to avoid random BF failures by having an appropriate
configuration.

--
With Regards,
Amit Kapila.
Re: Random pg_upgrade test failure on drongo
On 1/4/24 10:19 PM, Amit Kapila wrote:
> On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin wrote:
>>
>> 03.01.2024 14:42, Amit Kapila wrote:
>>
>>>> And the internal process is ... background writer (BgBufferSync()).
>>>>
>>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>>>> got 20 x 10 tests passing.
>>>>
>>>> Thus, if we want just to get rid of the test failure, maybe it's enough
>>>> to add this to the test's config...
>>>>
>>> What about checkpoints? Can't it do the same while writing the buffers?
>>
>> As we deal here with pg_upgrade/pg_restore, it must not be very easy to
>> get the desired effect, but I think it's not impossible in principle.
>> More details below.
>> What happens during the pg_upgrade execution is essentially:
>> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
>> -- this command flushes file buffers as well
>> 2) ALTER DATABASE postgres OWNER TO ...
>> 3) COMMENT ON DATABASE "postgres" IS ...
>> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>> SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>> SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>> TRUNCATE pg_catalog.pg_largeobject;
>> -- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
>> ...
>>
>> We get the effect discussed when the background writer process decides to
>> flush a file buffer for pg_largeobject during stage 1.
>> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
>> the result must be the same.)
>> And another important factor is shared_buffers = 1MB (set during the test).
>> With the default setting of 128MB I couldn't see the failure.
>>
>> It can be reproduced easily (on old Windows versions) just by running
>> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
>> default cluster)).
>> If an old cluster contains a dozen databases, this increases the failure
>> probability significantly (with 10 additional databases I've got failures
>> on iterations 4, 1, 6).
>
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
>
> [1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and
ReFS) under the carpet? I realize that synthetic test workloads like
pg_upgrade in a loop aren't themselves real-world scenarios, but what
about other cases? Even if we're certain it's not possible for these
issues to wedge a server, it's still not a good experience for users to
get random, unexplained IO-related errors...

--
Jim Nasby, Data Architect, Austin TX
Re: pg_upgrade test failure
On Sun, 29 Oct 2023 at 11:14, Hayato Kuroda (Fujitsu) wrote:
>
> Dear Andres,
>
> While tracking BF failures related to pg_upgrade, I found the same failure
> has still happened [1] - [4].
> According to the log, the output directory remained even after the
> successful upgrade [5].
> I analyzed and attached the fix patch, and below is my analysis... what do
> you think?
>

The same failure occurs randomly at [1] for a newly added test too:

pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty
pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log/pg_upgrade_internal.log": No such file or directory
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20240104T215133.796": Directory not empty

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2024-01-04%2019%3A56%3A20

Regards,
Vignesh
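[Editor's note: the failure pattern above (rmdir on a directory whose entries are still pending deletion) is commonly worked around with a short retry loop. A portable sketch of that idea follows; this is illustrative only, not the actual pg_upgrade code, and the function name and retry policy are assumptions.]

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: retry rmdir() a few times, since "Directory not
 * empty" can be transient on Windows while unlinked files inside the
 * directory are still in the DELETE_PENDING state.  Returns 0 on
 * success, -1 on failure.
 */
static int
rmdir_with_retries(const char *path, int max_tries)
{
	for (int i = 0; i < max_tries; i++)
	{
		if (rmdir(path) == 0)
			return 0;
		/* ENOTEMPTY/EEXIST may be transient; anything else is fatal */
		if (errno != ENOTEMPTY && errno != EEXIST)
			return -1;
		/* a real implementation would sleep briefly between attempts */
	}
	return -1;
}
```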
Re: Random pg_upgrade test failure on drongo
On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin wrote:
>
> 03.01.2024 14:42, Amit Kapila wrote:
> >
> >> And the internal process is ... background writer (BgBufferSync()).
> >>
> >> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> >> got 20 x 10 tests passing.
> >>
> >> Thus, if we want just to get rid of the test failure, maybe it's enough
> >> to add this to the test's config...
> >>
> > What about checkpoints? Can't it do the same while writing the buffers?
>
> As we deal here with pg_upgrade/pg_restore, it must not be very easy to
> get the desired effect, but I think it's not impossible in principle.
> More details below.
> What happens during the pg_upgrade execution is essentially:
> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
> -- this command flushes file buffers as well
> 2) ALTER DATABASE postgres OWNER TO ...
> 3) COMMENT ON DATABASE "postgres" IS ...
> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
> SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
> SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
> TRUNCATE pg_catalog.pg_largeobject;
> -- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
> ...
>
> We get the effect discussed when the background writer process decides to
> flush a file buffer for pg_largeobject during stage 1.
> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> the result must be the same.)
> And another important factor is shared_buffers = 1MB (set during the test).
> With the default setting of 128MB I couldn't see the failure.
>
> It can be reproduced easily (on old Windows versions) just by running
> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> default cluster)).
> If an old cluster contains a dozen databases, this increases the failure
> probability significantly (with 10 additional databases I've got failures
> on iterations 4, 1, 6).
>

I don't have an old Windows environment to test but I agree with your
analysis and theory. The question is what should we do for these new
random BF failures? I think we should set bgwriter_lru_maxpages to 0
and checkpoint_timeout to 1hr for these new tests. Doing some invasive
fix as part of this doesn't sound reasonable because this is an
existing problem and there seems to be another patch by Thomas that
probably deals with the root cause of the existing problem [1] as
pointed out by you.

[1] - https://commitfest.postgresql.org/40/3951/

--
With Regards,
Amit Kapila.
Re: Random pg_upgrade test failure on drongo
Hello Amit,

03.01.2024 14:42, Amit Kapila wrote:
> > So I started to think about another approach: to perform unlink as it's
> > implemented now, but then wait until the DELETE_PENDING state is gone.
>
> There is a comment in the code which suggests we shouldn't wait
> indefinitely. See "However, we won't wait indefinitely for someone else
> to close the file, as the caller might be holding locks and blocking
> other backends."

Yes, I saw it, but initially I thought that we have a transient condition
there, so waiting in open() (instead of failing immediately) seemed like a
good idea then...

> > And the internal process is ... background writer (BgBufferSync()).
> >
> > So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> > got 20 x 10 tests passing.
> >
> > Thus, if we want just to get rid of the test failure, maybe it's enough
> > to add this to the test's config...
>
> What about checkpoints? Can't it do the same while writing the buffers?

As we deal here with pg_upgrade/pg_restore, it must not be very easy to
get the desired effect, but I think it's not impossible in principle.
More details below.
What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
-- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...

We get the effect discussed when the background writer process decides to
flush a file buffer for pg_largeobject during stage 1.
(Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
the result must be the same.)
And another important factor is shared_buffers = 1MB (set during the test).
With the default setting of 128MB I couldn't see the failure.

It can be reproduced easily (on old Windows versions) just by running
pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
default cluster)).
If an old cluster contains a dozen databases, this increases the failure
probability significantly (with 10 additional databases I've got failures
on iterations 4, 1, 6).

Please see the reproducing script attached.

Best regards,
Alexander

@echo off
REM define PGBIN and PATH
set PGBIN=%cd%\tmp_install\usr\local\pgsql\bin
set PATH=%PGBIN%;%PATH%
setlocal enabledelayedexpansion

rmdir /S /Q tmpdb
initdb -D tmpdb >initdb.log 2>&1
echo shared_buffers = 1MB>>tmpdb\postgresql.conf

set /a i = 0
:LOOP
set /a i+=1
echo ITERATION %i%
rmdir /S /Q tmpdb_old
xcopy /e /i /q /r tmpdb tmpdb_old\
pg_ctl -D tmpdb_old -l server.log start
echo SELECT concat('CREATE DATABASE db', g) FROM generate_series(1, 10) g \gexec | psql -d postgres
pg_ctl -D tmpdb_old stop
rmdir /S /Q tmpdb_new
xcopy /e /i /q /r tmpdb tmpdb_new\
pg_upgrade --no-sync -r -b "%PGBIN%" -B "%PGBIN%" -d tmpdb_old -D tmpdb_new
if %ERRORLEVEL% NEQ 0 goto ERR
goto LOOP

:ERR
echo ERROR
pause
:EXIT
Re: Random pg_upgrade test failure on drongo
On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin wrote:
>
> 28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> > Dear Alexander,
> >
> >> I agree with your analysis and would like to propose a PoC fix (see
> >> attached). With this patch applied, 20 iterations succeeded for me.
> >
> > There are no reviewers, so I will review it again. Let's move the PoC
> > to a concrete patch. Note that I only focused on fixes of the random
> > failure; other parts are out of scope.
>
> Thinking about that fix more, I'm not satisfied with the approach proposed.
> First, it turns every unlink operation into two write operations
> (rename + unlink), not to say about new risks of having stale .tmp files
> (perhaps, it's ok for regular files (MoveFileEx can overwrite existing
> files), but not for directories).
> Second, it does that on any Windows OS version, including modern ones,
> which are not affected by the issue, as we know.
>
> So I started to think about another approach: to perform unlink as it's
> implemented now, but then wait until the DELETE_PENDING state is gone.
>

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone else to
close the file, as the caller might be holding locks and blocking other
backends."

> And I was very surprised to see that this state is not transient in our
> case.
> Additional investigation showed that the test fails not because some
> outside process opens a file (concretely, {template1_id/postgres_id}/2683)
> that is being deleted, but because of an internal process that opens the
> file and holds a handle to it indefinitely.
> And the internal process is ... background writer (BgBufferSync()).
>
> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> got 20 x 10 tests passing.
>
> Thus, if we want just to get rid of the test failure, maybe it's enough to
> add this to the test's config...
>

What about checkpoints? Can't it do the same while writing the buffers?

--
With Regards,
Amit Kapila.
Re: Random pg_upgrade test failure on drongo
Hello Kuroda-san,

28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> I agree with your analysis and would like to propose a PoC fix (see
>> attached). With this patch applied, 20 iterations succeeded for me.
>
> There are no reviewers, so I will review it again. Let's move the PoC
> to a concrete patch. Note that I only focused on fixes of the random
> failure; other parts are out of scope.

Thinking about that fix more, I'm not satisfied with the approach proposed.
First, it turns every unlink operation into two write operations
(rename + unlink), not to say about new risks of having stale .tmp files
(perhaps, it's ok for regular files (MoveFileEx can overwrite existing
files), but not for directories).
Second, it does that on any Windows OS version, including modern ones,
which are not affected by the issue, as we know.

So I started to think about another approach: to perform unlink as it's
implemented now, but then wait until the DELETE_PENDING state is gone.
And I was very surprised to see that this state is not transient in our
case.
Additional investigation showed that the test fails not because some
outside process opens a file (concretely, {template1_id/postgres_id}/2683)
that is being deleted, but because of an internal process that opens the
file and holds a handle to it indefinitely.
And the internal process is ... background writer (BgBufferSync()).

So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
got 20 x 10 tests passing.

Thus, if we want just to get rid of the test failure, maybe it's enough to
add this to the test's config...

The other way to go is to find out whether the background writer process
should react to a shared-inval message, sent from smgrdounlinkall(), and
close that file's handle.
Maybe we could also (after changing the bgwriter's behaviour) add a waiting
loop into pgwin32_open_handle() to completely rule out transient open()
failures due to some other process (such as Windows Explorer) opening a
file being deleted, but I would not complicate things until we have a
clear vision/plans of using modern APIs/relying on modern OS versions'
behavior. I mean proceeding with something like:
https://commitfest.postgresql.org/40/3951/

Best regards,
Alexander
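[Editor's note: the rename-before-unlink idea from the PoC patch under discussion can be sketched in portable C roughly as follows. This is illustrative only; the real patch operates inside pgunlink() on Win32 paths, and the function name here is made up.]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the rename-before-unlink idea (simplified, hypothetical).
 * Renaming frees the original name immediately, so a subsequent create
 * of the same name cannot collide with a DELETE_PENDING directory
 * entry left behind by unlink() on affected Windows versions.
 */
static int
unlink_via_rename(const char *path)
{
	char		tmppath[1024];

	if (snprintf(tmppath, sizeof(tmppath), "%s.tmp", path) >= (int) sizeof(tmppath))
		return -1;				/* path too long for this sketch */
	if (rename(path, tmppath) != 0)
		return -1;				/* caller could fall back to plain unlink */
	return remove(tmppath);
}
```

This also makes Alexander's objection concrete: every unlink becomes two metadata writes, and a crash between the two can leave a stale .tmp file behind.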
RE: Random pg_upgrade test failure on drongo
Dear Alexander,

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

There are no reviewers, so I will review it again. Let's turn the PoC into a concrete patch. Note that I only focused on fixing the random failure; other parts are out of scope. Basically, code comments can be updated accordingly.

01.
```
	/*
	 * This function might be called for a regular file or for a junction
	 * point (which we use to emulate symlinks).  The latter must be unlinked
	 * with rmdir() on Windows.  Before we worry about any of that, let's see
	 * if we can unlink directly, since that's expected to be the most common
	 * case.
	 */
	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
	if (pgrename(path, tmppath) == 0)
	{
		if (unlink(tmppath) == 0)
			return 0;
		curpath = tmppath;
	}
```
You should adjust the comment atop these changes because they are not trivial. Below is my draft:
```
 * XXX: we rename the target file to ".tmp" before calling unlink. The
 * removal may fail with the STATUS_DELETE_PENDING status on Windows, in
 * which case creating a file with the same name would fail. This assumes
 * that renaming is a synchronous operation.
```

02.
```
	loops = 0;
	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
	{
		if (++loops > 100)		/* time out after 10 sec */
			return -1;
		pg_usleep(100000);	/* us */
	}
```
A comment can be added atop this part. Below is my draft:
```
/*
 * Wait until the removal is really finished, to avoid ERRORs when creating
 * the same file in other functions.
 */
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Random pg_upgrade test failure on drongo
Hello Andrew and Kuroda-san,

27.11.2023 16:58, Andrew Dunstan wrote:
> > It's also interesting, what is the full version/build of the OS on drongo and fairywren.
> It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122
> I've updated it to 17763.5122 now.

Thank you for the information! It pushed me to upgrade my Server 2019 1809 17763.592 to 17763.4252. And then I discovered that I had difficulties reproducing the issue on all my VMs after a reboot (even on old versions/builds). It took me a while to understand what was going on and what affects reproduction of the issue. I was puzzled by the fact that I couldn't reproduce the issue with my unlink-open test program under seemingly the same conditions as before, until I realized that the issue is reproduced only when the target directory is opened in Windows Explorer. I'm sorry for bringing more mystery into the topic and for the misleading information.

So, the issue is reproduced only when something scans the working directory for files/opens them. I added the same logic into my test program (see unlink-open-scandir attached) and now I see the failure on Windows Server 2019 (Version 10.0.17763.4252).

A script like this:
start cmd /c "unlink-open-scandir test1 10 5000 >log1 2>&1"
...
start cmd /c "unlink-open-scandir test10 10 5000 >log10 2>&1"

results in:
C:\temp>find "failed" log*
-- LOG1
-- LOG10
fopen() after unlink() failed (13)
-- LOG2
fopen() after unlink() failed (13)
-- LOG3
fopen() after unlink() failed (13)
-- LOG4
fopen() after unlink() failed (13)
-- LOG5
fopen() after unlink() failed (13)
-- LOG6
fopen() after unlink() failed (13)
-- LOG7
fopen() after unlink() failed (13)
-- LOG8
fopen() after unlink() failed (13)
-- LOG9
fopen() after unlink() failed (13)

C:\temp>type log10
...
iteration 108
fopen() after unlink() failed (13)

The same is observed on:
Windows 10 Version 1809 (OS Build 17763.1)

But there are no failures on:
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So the behavior change really took place, but my previous estimations were incorrect (my apologies). BTW, the "rename" mode of the test program can produce rarer errors on rename:
-- LOG3
MoveFileEx() failed (0)
but not on open.

30.11.2023 13:00, Hayato Kuroda (Fujitsu) wrote:
> Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again.
> Your analysis suggests the BF failures can be solved by updating the OSes.

Yes, but I don't think that leaving Server 2019 behind (I suppose Windows Server 2019 build 20348 would have the same behaviour as Windows 10 19045) is affordable. (Though looking at Cirrus CI logs, I see that what is entitled "Windows Server 2019" is in fact Windows Server 2022 there.)

> > I think that's because unlink() is performed asynchronously on those old
> > Windows versions, but rename() is always synchronous.
> OK. Actually, I could not find documentation about this, but your experiment demonstrates the facts.

I don't know what this peculiarity is called, but it looks like when some other process holds a handle to the file, unlink() exits as if the file was deleted completely, but the subsequent open() fails.
Best regards, Alexander

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <windows.h>

char buf[1048576] = {0};

int main(int argc, char *argv[])
{
	int nmb;
	int res;
	int num_iterations;
	char *filename, *curfilename;
	char tmpfilename[MAX_PATH];
	FILE *fh;
	char do_rename;
	WIN32_FIND_DATA FindFileData;

	if (argc < 4)
	{
		printf("Usage: %s {filename} {size_mb} {num_iterations}", argv[0]);
		return 1;
	}
	filename = argv[1];
	nmb = atoi(argv[2]);
	if (nmb < 1)
		nmb = 1;
	num_iterations = atoi(argv[3]);
	if (num_iterations < 1)
		num_iterations = 1;
	do_rename = (strcmp(argv[argc - 1], "rename") == 0);

	for (int i = 1; i <= num_iterations; i++)
	{
		printf("iteration %d\n", i);
		fh = fopen(filename, "wb");
		if (fh == NULL)
		{
			printf("first fopen() failed (%d)\n", errno);
			return 1;
		}
		for (int n = 0; n <= nmb; n++)
		{
			if (fwrite(buf, 1, sizeof(buf), fh) != sizeof(buf))
			{
				printf("fwrite() failed (%d)\n", errno);
				return 1;
			}
		}
		res = fclose(fh);
		if (res < 0)
		{
			printf("first fclose() failed (%d)\n", errno);
			return 1;
		}

		curfilename = filename;
		if (do_rename)
		{
			sprintf(tmpfilename, "%s.tmp", filename);
			res = MoveFileEx(filename, tmpfilename, MOVEFILE_REPLACE_EXISTING); // from dirmod.c
			if (res == 0)
			{
				printf("MoveFileEx() failed (%d)\n", errno);
				return 1;
			}
			curfilename = tmpfilename;
		}
		res = unlink(curfilename);
		if (res < 0)
		{
			printf("first unlink() failed (%d)\n", errno);
			return 1;
		}

		fh = fopen(filename, "w");
		if (fh == NULL)
		{
			printf("fopen() after unlink() failed (%d)\n", errno);
			return 2;
		}
		res = fclose(fh);
		if (res < 0)
		{
			printf("fclose() failed (%d)\n", errno);
			return 1;
		}
		res = unlink(filename);
		if (res < 0)
		{
			printf("unlink() failed (%d)\n", errno);
			return 1;
		}
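The synchronous-rename observation behind the PoC patch and the test program's "rename" mode can be reduced to a minimal, portable sketch: once rename() returns, the original name is reusable even if removal of the renamed file is still pending. The helper name `unlink_via_rename` and the 4096-byte buffer are inventions of this sketch; dirmod.c uses MAX_PATH and pgrename(), and would fall back to a plain unlink() when the rename fails.

```c
#include <stdio.h>

/*
 * Rename the file out of the way, then remove it under the temporary
 * name.  The ".tmp" suffix mirrors the PoC patch.
 */
static int
unlink_via_rename(const char *path)
{
    char        tmppath[4096];

    if (snprintf(tmppath, sizeof(tmppath), "%s.tmp", path) >= (int) sizeof(tmppath))
        return -1;              /* path too long for this sketch */
    if (rename(path, tmppath) != 0)
        return -1;              /* real code would fall back to unlink(path) */
    return remove(tmppath);     /* the original name is already free here */
}
```

After a successful call, re-creating a file under the original name cannot collide with a pending delete of the old one, which is exactly the property the PoC relies on.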
RE: Random pg_upgrade test failure on drongo
Dear Alexander, Andrew,

Thanks for your analysis!

> I see that behavior on:
> Windows 10 Version 1607 (OS Build 14393.0)
> Windows Server 2016 Version 1607 (OS Build 14393.0)
> Windows Server 2019 Version 1809 (OS Build 17763.1)
>
> But it's not reproduced on:
> Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
> Windows Server 2019 Version 1809 (OS Build 17763.592)
> Windows 10 Version 22H2 (OS Build 19045.3693)
> Windows 11 Version 21H2 (OS Build 22000.613)
>
> So it looks like the failure occurs depending not on the Windows edition, but
> rather on its build. For Windows Server 2019 the "good" build is
> somewhere between 17763.1 and 17763.592, but for Windows 10 it's between
> 14393.0 and 17763.1.
> (Maybe there was some change related to
> FILE_DISPOSITION_POSIX_SEMANTICS/
> FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find
> information about that change.)
>
> It's also interesting, what is the full version/build of the OS on drongo and
> fairywren.

Thanks for your interest in the issue. I have been tracking the failure, but it has not occurred again. Your analysis suggests the BF failures can be solved by updating the OSes.

> I think that's because unlink() is performed asynchronously on those old
> Windows versions, but rename() is always synchronous.

OK. Actually, I could not find documentation about this, but your experiment demonstrates the facts.

> I've managed to reproduce that issue (or at least a situation that
> manifested similarly) with a sleep added in miscinit.c:
> ereport(IsPostmasterEnvironment ? LOG : NOTICE,
> (errmsg("database system is shut down")));
> + pg_usleep(50L);
>
> With this change, I get the same warning as in [1] when running in
> parallel 10 tests 002_pg_upgrade with a minimal olddump (on iterations
> 33, 46, 8). And with my PoC patch applied, I could see the same warning
> as well (on iteration 6).
>
> I believe that's because rename() can't rename a directory containing an
> open file, just as unlink() can't remove it.
> > In the light of the above, I think that the issue in question should be
> fixed in accordance with/as a supplement to [2].

OK, I understand that more fixes are needed around here. For now, we should stay focused on the main points. Your patch seems good, but it needs more review from Windows-savvy developers. What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Random pg_upgrade test failure on drongo
On 2023-11-27 Mo 07:39, Andrew Dunstan wrote: On 2023-11-27 Mo 07:00, Alexander Lakhin wrote: Hello Kuroda-san, 25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote: Thanks for attaching a program. This helps us to understand the issue. I wanted to confirm your env - this failure was occurred on windows server , right? I see that behavior on: Windows 10 Version 1607 (OS Build 14393.0) Windows Server 2016 Version 1607 (OS Build 14393.0) Windows Server 2019 Version 1809 (OS Build 17763.1) But it's not reproduced on: Windows 10 Version 1809 (OS Build 17763.1) (triple-checked) Windows Server 2019 Version 1809 (OS Build 17763.592) Windows 10 Version 22H2 (OS Build 19045.3693) Windows 11 Version 21H2 (OS Build 22000.613) So it looks like the failure occurs depending not on Windows edition, but rather on it's build. For Windows Server 2019 the "good" build is somewhere between 17763.1 and 17763.592, but for Windows 10 it's between 14393.0 and 17763.1. (Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/ FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find information about that change.) It's also interesting, what is full version/build of OS on drongo and fairywren. It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122 I've updated it to 17763.5122 now. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Random pg_upgrade test failure on drongo
On 2023-11-27 Mo 07:00, Alexander Lakhin wrote: Hello Kuroda-san, 25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote: Thanks for attaching a program. This helps us to understand the issue. I wanted to confirm your env - this failure was occurred on windows server , right? I see that behavior on: Windows 10 Version 1607 (OS Build 14393.0) Windows Server 2016 Version 1607 (OS Build 14393.0) Windows Server 2019 Version 1809 (OS Build 17763.1) But it's not reproduced on: Windows 10 Version 1809 (OS Build 17763.1) (triple-checked) Windows Server 2019 Version 1809 (OS Build 17763.592) Windows 10 Version 22H2 (OS Build 19045.3693) Windows 11 Version 21H2 (OS Build 22000.613) So it looks like the failure occurs depending not on Windows edition, but rather on it's build. For Windows Server 2019 the "good" build is somewhere between 17763.1 and 17763.592, but for Windows 10 it's between 14393.0 and 17763.1. (Maybe there was some change related to FILE_DISPOSITION_POSIX_SEMANTICS/ FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find information about that change.) It's also interesting, what is full version/build of OS on drongo and fairywren. It's WS 2019 1809/17763.4252. The latest available AFAICT is 17763.5122 cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Random pg_upgrade test failure on drongo
Hello Kuroda-san,

25.11.2023 18:19, Hayato Kuroda (Fujitsu) wrote:
Thanks for attaching a program. This helps us to understand the issue. I wanted to confirm your env - this failure occurred on Windows Server, right?

I see that behavior on:
Windows 10 Version 1607 (OS Build 14393.0)
Windows Server 2016 Version 1607 (OS Build 14393.0)
Windows Server 2019 Version 1809 (OS Build 17763.1)

But it's not reproduced on:
Windows 10 Version 1809 (OS Build 17763.1) (triple-checked)
Windows Server 2019 Version 1809 (OS Build 17763.592)
Windows 10 Version 22H2 (OS Build 19045.3693)
Windows 11 Version 21H2 (OS Build 22000.613)

So it looks like the failure occurs depending not on the Windows edition, but rather on its build. For Windows Server 2019 the "good" build is somewhere between 17763.1 and 17763.592, but for Windows 10 it's between 14393.0 and 17763.1. (Maybe there was some change related to the FILE_DISPOSITION_POSIX_SEMANTICS/FILE_DISPOSITION_ON_CLOSE implementation; I don't know where to find information about that change.)

It's also interesting, what is the full version/build of the OS on drongo and fairywren.

That is, my idea was to try removing a file through renaming it as a fast path (thus avoiding that troublesome DELETE PENDING state), and if that fails, to perform removal as before. Maybe the whole function could be simplified, but I'm not sure about special cases yet.

I felt that your result showed pgrename() would be delayed more rarely than unlink(). That's why a file with the original name would not exist when the subsequent open() was called.

I think that's because unlink() is performed asynchronously on those old Windows versions, but rename() is always synchronous.

* IIUC, the important point is the latter part, which waits until the status is changed. Based on that, can we remove the double rmtree() from cleanup_output_dirs()? They seem to be added for a similar motivation.
I couldn't yet reproduce the failure which motivated that doubling (IIUC, it was observed in [1]), with c28911750 reverted, so I need more time to research that issue to answer this question.

> Yeah, in the first place, this failure seldom occurred

I've managed to reproduce that issue (or at least a situation that manifested similarly) with a sleep added in miscinit.c:
	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
			(errmsg("database system is shut down")));
+	pg_usleep(50L);

With this change, I get the same warning as in [1] when running 10 tests 002_pg_upgrade in parallel with a minimal olddump (on iterations 33, 46, 8). And with my PoC patch applied, I could see the same warning as well (on iteration 6).

I believe that's because rename() can't rename a directory containing an open file, just as unlink() can't remove it.

In the light of the above, I think that the issue in question should be fixed in accordance with/as a supplement to [2].

[1] https://www.postgresql.org/message-id/20230131172806.GM22427%40telsasoft.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com

Best regards, Alexander
RE: Random pg_upgrade test failure on drongo
Dear Alexander,

> Please look at the simple test program attached. It demonstrates the
> failure for me when running in two sessions as follows:
> unlink-open test 150 1000
> unlink-open test2 150 1000

Thanks for attaching a program. This helps us to understand the issue. I wanted to confirm your environment - this failure occurred on Windows Server, right?

> That is, my idea was to try removing a file through renaming it as a fast
> path (thus avoiding that troublesome state DELETE PENDING), and if that
> fails, to perform removal as before. Maybe the whole function could be
> simplified, but I'm not sure about special cases yet.

I felt that your result showed pgrename() would be delayed more rarely than unlink(). That's why a file with the original name would no longer exist when the subsequent open() was called. As for special cases, I'd like senior developers to check.

> > * IIUC, the important point is the latter part, which waits until the
> > status is changed. Based on that, can we remove the double rmtree() from
> > cleanup_output_dirs()? They seem to be added for a similar motivation.
>
> I couldn't yet reproduce a failure, which motivated that doubling (IIUC, it
> was observed in [1]), with c28911750 reverted, so I need more time to
> research that issue to answer this question.

Yeah, in the first place, this failure seldom occurred.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Random pg_upgrade test failure on drongo
Hello Kuroda-san, 23.11.2023 15:15, Hayato Kuroda (Fujitsu) wrote: I agree with your analysis and would like to propose a PoC fix (see attached). With this patch applied, 20 iterations succeeded for me. Thanks, here are comments. I'm quite not sure for the windows, so I may say something wrong. * I'm not sure why the file/directory name was changed before doing a unlink. Could you add descriptions? Please look at the simple test program attached. It demonstrates the failure for me when running in two sessions as follows: unlink-open test 150 1000 unlink-open test2 150 1000 ... iteration 60 iteration 61 fopen() after unlink() failed (13) Process Monitor shows: ... 9:16:55.9249792 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 138,412,032, Length: 1,048,576 ### unlink() performed for the file "test": 9:16:55.9852903 AM unlink-open.exe 4968 CreateFile C:\src\test SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 9:16:55.9853637 AM unlink-open.exe 4968 QueryAttributeTagFile C:\src\test SUCCESS Attributes: A, ReparseTag: 0x0 ### file "test" gets into state DELETE PENDING: 9:16:55.9853756 AM unlink-open.exe 4968 SetDispositionInformationFile C:\src\test SUCCESS Delete: True 9:16:55.9853888 AM unlink-open.exe 4968 CloseFile C:\src\test SUCCESS ### concurrent operations with file "test2": 9:16:55.9866362 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 139,460,608, Length: 1,048,576 ... 
9:16:55.9972373 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 157,286,400, Length: 1,048,576 9:16:55.9979040 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS ### unlink() for "test2": 9:16:56.1029981 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 9:16:56.1030432 AM unlink-open.exe 3232 QueryAttributeTagFile C:\src\test2 SUCCESS Attributes: A, ReparseTag: 0x0 ### file "test2" gets into state DELETE PENDING: 9:16:56.1030517 AM unlink-open.exe 3232 SetDispositionInformationFile C:\src\test2 SUCCESS Delete: True 9:16:56.1030625 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS ### and then it opened successfully: 9:16:56.1189503 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created 9:16:56.1192016 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS ### operations with file "test2" continued: 9:16:56.1193394 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 9:16:56.1193895 AM unlink-open.exe 3232 QueryAttributeTagFile C:\src\test2 SUCCESS Attributes: A, ReparseTag: 0x0 9:16:56.1194042 AM unlink-open.exe 3232 SetDispositionInformationFile C:\src\test2 SUCCESS Delete: True 9:16:56.1194188 AM unlink-open.exe 3232 CloseFile C:\src\test2 SUCCESS 9:16:56.1198459 AM unlink-open.exe 3232 CreateFile C:\src\test2 SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory 
File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created 9:16:56.1200302 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 0, Length: 1,048,576, Priority: Normal ... 9:16:56.1275871 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 10,485,760, Length: 1,048,576 ### at the same time, CreateFile() for file "test" failed: 9:16:56.1276453 AM unlink-open.exe 4968 CreateFile C:\src\test DELETE PENDING Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0 9:16:56.1279359 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 11,534,336, Length: 1,048,576 9:16:56.1283452 AM unlink-open.exe 3232 WriteFile C:\src\test2 SUCCESS Offset: 12,582,912, Length: 1,048,576
RE: Random pg_upgrade test failure on drongo
Dear Alexander,

> I can easily reproduce this failure on my workstation by running 5 tests
> 003_logical_slots in parallel inside a Windows VM with its CPU resources
> limited to 50%, like so:
> VBoxManage controlvm "Windows" cpuexecutioncap 50
>
> set PGCTLTIMEOUT=180
> python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in
> range(1,NUMTESTS+1):
> tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in
> range(1,NUMITERATIONS+1):print(f\"iteration {i}\");
> assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
> ...
> iteration 2
> ninja: Entering directory `C:\src\postgresql\build'
> ninja: no work to do.
> 1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots
> ERROR 60.30s exit status 25
> ...
> pg_restore: error: could not execute query: ERROR: could not create file
> "base/1/2683": File exists
> ...

Great. I do not have such an environment, so I could not reproduce it. This seems to suggest that the failure occurred because the system was busy.

> I agree with your analysis and would like to propose a PoC fix (see
> attached). With this patch applied, 20 iterations succeeded for me.

Thanks, here are some comments. I'm not quite sure about Windows, so I may say something wrong.

* I'm not sure why the file/directory name is changed before doing an unlink. Could you add a description?

* IIUC, the important point is the latter part, which waits until the status is changed. Based on that, can we remove the double rmtree() from cleanup_output_dirs()? They seem to be added for a similar motivation.

```
+	loops = 0;
+	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+	{
+		if (++loops > 100)		/* time out after 10 sec */
+			return -1;
+		pg_usleep(100000);	/* us */
+	}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Random pg_upgrade test failure on drongo
Hello Kuroda-san,

21.11.2023 13:37, Hayato Kuroda (Fujitsu) wrote:
> This email tells an update. The machine drongo failed the test a week ago [1]
> and finally got logfiles. PSA files.
> Oh, sorry. I forgot to attach the files. You can see pg_upgrade_server.log for now.

I can easily reproduce this failure on my workstation by running 5 tests 003_logical_slots in parallel inside a Windows VM with its CPU resources limited to 50%, like so:
VBoxManage controlvm "Windows" cpuexecutioncap 50

set PGCTLTIMEOUT=180
python3 -c "NUMITERATIONS=20;NUMTESTS=5;import os;tsts='';exec('for i in range(1,NUMTESTS+1): tsts+=f\"pg_upgrade_{i}/003_logical_slots \"'); exec('for i in range(1,NUMITERATIONS+1):print(f\"iteration {i}\"); assert(os.system(f\"meson test --num-processes {NUMTESTS} {tsts}\") == 0)')"
...
iteration 2
ninja: Entering directory `C:\src\postgresql\build'
ninja: no work to do.
1/5 postgresql:pg_upgrade_2 / pg_upgrade_2/003_logical_slots ERROR 60.30s exit status 25
...
pg_restore: error: could not execute query: ERROR: could not create file "base/1/2683": File exists
...

I agree with your analysis and would like to propose a PoC fix (see attached). With this patch applied, 20 iterations succeeded for me.

Best regards, Alexander

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 07dd190cbc..5185648388 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -121,6 +121,8 @@ pgunlink(const char *path)
 	bool		is_lnk;
 	int			loops = 0;
 	struct stat st;
+	char		tmppath[MAX_PATH];
+	char	   *curpath = path;
 
 	/*
 	 * This function might be called for a regular file or for a junction
@@ -129,9 +131,14 @@ pgunlink(const char *path)
 	 * if we can unlink directly, since that's expected to be the most common
 	 * case.
 	 */
-	if (unlink(path) == 0)
-		return 0;
-	if (errno != EACCES)
+	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
+	if (pgrename(path, tmppath) == 0)
+	{
+		if (unlink(tmppath) == 0)
+			return 0;
+		curpath = tmppath;
+	}
+	else if (errno != EACCES)
 		return -1;
 
 	/*
@@ -150,7 +157,7 @@ pgunlink(const char *path)
 	 * fail.  We want to wait until the file truly goes away so that simple
 	 * recursive directory unlink algorithms work.
 	 */
-	if (lstat(path, &st) < 0)
+	if (lstat(curpath, &st) < 0)
 	{
 		if (lstat_error_was_status_delete_pending())
 			is_lnk = false;
@@ -167,7 +174,7 @@ pgunlink(const char *path)
 	 * someone else to close the file, as the caller might be holding locks
 	 * and blocking other backends.
 	 */
-	while ((is_lnk ? rmdir(path) : unlink(path)) < 0)
+	while ((is_lnk ? rmdir(curpath) : unlink(curpath)) < 0)
 	{
 		if (errno != EACCES)
 			return -1;
@@ -175,6 +182,14 @@ pgunlink(const char *path)
 			return -1;
 		pg_usleep(100000);	/* us */
 	}
+
+	loops = 0;
+	while (lstat(curpath, &st) < 0 && lstat_error_was_status_delete_pending())
+	{
+		if (++loops > 100)		/* time out after 10 sec */
+			return -1;
+		pg_usleep(100000);	/* us */
+	}
 	return 0;
 }
RE: Random pg_upgrade test failure on drongo
Dear hackers,

This email tells an update. The machine drongo failed the test a week ago [1], and we finally got the logfiles. PSA files.

## Observed failure

pg_upgrade_server.log is a server log taken during the pg_upgrade command. According to it, the TRUNCATE command seems to have failed due to a "File exists" error.

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:  could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
	-- For binary upgrade, preserve pg_largeobject and index relfilenodes
	SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
	SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
	TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think it was caused by the STATUS_DELETE_PENDING issue, and is not related to the recent updates for pg_upgrade. The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and the output means that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the file which has the same name. Below is a trace up to the unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found that Thomas said [2] that pgunlink() sometimes cannot actually remove the file even if it returns OK; at that time the NTSTATUS is STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle() mentions the same thing:

```
	/*
	 * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
	 * gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
	 * case, we'd better ask for the NT status too so we can translate it
	 * to a more Unix-like error.  We hope that nothing clobbers the NT
	 * status in between the internal NtCreateFile() call and CreateFile()
	 * returning.
	 */
```

The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that, indeed, open() can fail with STATUS_DELETE_PENDING if the deletion is pending while the file is opened again.

Another thread [4] also addresses this issue for rmtree()->unlink(), and it retries the removal if it fails with STATUS_DELETE_PENDING. So, should we retry the open when it fails as well? Anyway, such a fix seems out of scope for pg_upgrade. What do you think?

## Acknowledgement

I want to say thanks to Sholk and Vingesh for helping with the analysis.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer=2023-11-15%2006%3A16%3A15
[2]: https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[3]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]: https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
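The failing call sequence analyzed above, reduced to its essence as portable C. On POSIX systems unlink()/remove() is synchronous, so the re-create always succeeds; on the affected Windows builds the re-open is what fails while the old file is still DELETE_PENDING. The helper name `drop_and_recreate` and the mapping comments are made up for illustration.

```c
#include <stdio.h>

/*
 * Remove a file and immediately create a new one under the same name --
 * the pattern that the TRUNCATE path exercises via mdunlink()/mdcreate().
 */
static int
drop_and_recreate(const char *path)
{
    FILE       *fh;

    if (remove(path) != 0)
        return -1;              /* corresponds to mdunlink() */
    fh = fopen(path, "wb");     /* corresponds to mdcreate() */
    if (fh == NULL)
        return -1;              /* the "File exists" failure seen on drongo */
    fclose(fh);
    return 0;
}
```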
RE: Random pg_upgrade test failure on drongo
Dear hackers,

> While tracking a buildfarm, I found that drongo failed the test
> pg_upgrade/003_logical_slots [1].
> A strange point is that the test passed in the next iteration. Currently I'm not
> sure of the reason, but I will keep an eye on it and will investigate if it
> happens again.

This email is just an update. We found that fairywren also failed for the same reason [2]. It fails inconsistently, but there might be something bad on Windows. I'm now trying to reproduce it with my colleagues to analyze it in more detail. I'm also working with Andrew to get the logs emitted during the upgrade. I will continue to keep an eye on it.

[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-11-08%2010%3A22%3A45

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Random pg_upgrade test failure on drongo
Dear hackers,

While tracking the buildfarm, I found that drongo failed the test pg_upgrade/003_logical_slots [1]. A strange point is that the test passed in the next iteration. Currently I'm not sure of the reason, but I will keep an eye on it and will investigate if it happens again. I think this failure is not related to our logical slots work, even though it failed in 003_logical_slots.pl. For more detail, please see the latter part.

For more investigation, a server log taken during the upgrade may be needed. It will be in the data directory, so the BF system will not upload it. I may need additional information if it fails again.

# Analysis of failure

According to the output, pg_upgrade seems to have failed while restoring objects to the new cluster [2]. At the code level, the pg_upgrade command failed in exec_prog(). In that function, pg_restore was executed for the database "postgres". Below is a brief call stack. Note that pg_restore is not used for migrating logical replication slots; that is done by the pg_upgrade binary itself. Also, the migration is done after all objects are copied, not in create_new_objects().

```
exec_prog()
parallel_exec_prog("pg_restore ... ") <-- Since -j option is not specified, it is just a wrapper
create_new_objects()
main()
```

In exec_prog(), the system() system call was called but returned a non-zero value. The documentation says that system() returns the value returned by the command interpreter when its input is not NULL [3]. Unfortunately, the current code does not output the return code. Also, the BF system does not upload the data directory for failed tests. Therefore, I could not get more information for the investigation.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo=2023-11-07%2013%3A43%3A23=pg_upgrade-check

[2]:
```
...
# No postmaster PID for node "oldpub" # Running: pg_upgrade --no-sync -d C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_oldpub_data/pgdata -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -b C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -B C:/prog/bf/root/HEAD/PGSQL~1.BUI/TMP_IN~1/prog/bf/root/HEAD/inst/bin -s 127.0.0.1 -p 54813 -P 54814 --copy Performing Consistency Checks ... Setting frozenxid and minmxid counters in new cluster ok Restoring global objects in the new cluster ok Restoring database schemas in the new cluster *failure* Consult the last few lines of "C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/003_logical_slots/data/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231107T142224.580/log/pg_upgrade_dump_5.log" for the probable cause of the failure. Failure, exiting [14:23:26.632](70.141s) not ok 10 - run of pg_upgrade of old cluster [14:23:26.632](0.000s) # Failed test 'run of pg_upgrade of old cluster' # at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 170. ### Starting node "newpub" # Running: pg_ctl -w -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\data/t_003_logical_slots_newpub_data/pgdata -l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/pg_upgrade/003_logical_slots\\log/003_logical_slots_newpub.log -o --cluster-name=newpub start waiting for server to start done server started # Postmaster PID for node "newpub" is 4604 [14:23:28.398](1.766s) not ok 11 - check the slot exists on new cluster [14:23:28.398](0.001s) # Failed test 'check the slot exists on new cluster' # at C:/prog/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/003_logical_slots.pl line 176. [14:23:28.399](0.000s) # got: '' # expected: 'regress_sub|t' ... 
```

[3]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/system-wsystem?view=msvc-170

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: pg_upgrade test failure
Dear Andres,

While tracking BF failures related to pg_upgrade, I found that the same failure has still been happening [1] - [4]. According to the log, the output directory remained even after a successful upgrade [5]. I analyzed it and attached a fix patch; below is my analysis. What do you think?

=

lstat() seemed to fail during the second try of rmtree(). This error message is output from get_dirent_type(). Unlike pgunlink(), get_dirent_type() does not have a retry mechanism for when lstat()->_pglstat64() detects STATUS_DELETE_PENDING. Therefore, I think rmtree() may not wait for the file to actually be removed if that status is detected in get_dirent_type(). One solution is to retry stat() or lstat() in get_dirent_type() as well, as attached.

[1]: 2023-07-21 02:21:53 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-07-21%2002%3A21%3A53
[2]: 2023-10-21 13:39:15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-21%2013%3A39%3A15
[3]: 2023-10-23 09:03:07 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-23%2009%3A03%3A07
[4]: 2023-10-27 23:06:17 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren=2023-10-27%2023%3A06%3A17

[5]
```
...
*Clusters are compatible*
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867": Directory not empty
pg_upgrade: warning: could not stat file "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log/pg_upgrade_internal.log": No such file or directory
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867/log": Directory not empty
pg_upgrade: warning: could not remove directory "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20231027T234552.867": Directory not empty
[23:46:07.585](17.106s) ok 12 - run of pg_upgrade --check for new instance
[23:46:07.587](0.002s) not ok 13 - pg_upgrade_output.d/ removed after pg_upgrade --check success
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: 0001-Retry-stat-and-lstat-even-in-get_dirent_type.patch
Re: pg_upgrade test failure
On 2023-02-06 14:14:22 -0800, Andres Freund wrote: > On 2023-02-07 11:03:18 +1300, Thomas Munro wrote: > > What I see is that there were 1254 FreeBSD tasks run in that window, of > > which 163 failed, and (more interestingly) 111 of those failures succeeded > > on every other platform. And clicking on a few on cfbot's page reveals that > > it's the new running stuff, and I'm still trying to find the interesting > > logs... > > I think I figured out why the logs frequently fail to upload - the server is > still running, so the size changes during the upload, causing the upload to > fail with errors like: > > [12:46:43.552] Failed to upload artifacts: Put > "https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request=20230206T124536Z=600=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52": > http2: request body larger than specified content length > > I'm testing adding a pg_ctl stop to the on_failure right now. Pushed the fix doing so.
Re: pg_upgrade test failure
Hi, On 2023-02-07 11:03:18 +1300, Thomas Munro wrote: > On Tue, Feb 7, 2023 at 10:57 AM Andres Freund wrote: > > On February 6, 2023 1:51:20 PM PST, Thomas Munro > > wrote: > > >Next up: the new "running" tests, spuriously failing around 8.8% of CI > > >builds on FreeBSD. I'll go and ping that thread... > > > > Is that rate unchanged? I thought I fixed the main issue last week? > > Unfortunately my cfbot database only holds a week's history. Would be interesting to increase that to a considerably longer time. I can't imagine that that'd take all that much resources? > What I see is that there were 1254 FreeBSD tasks run in that window, of > which 163 failed, and (more interestingly) 111 of those failures succeeded > on every other platform. And clicking on a few on cfbot's page reveals that > it's the new running stuff, and I'm still trying to find the interesting > logs... I think I figured out why the logs frequently fail to upload - the server is still running, so the size changes during the upload, causing the upload to fail with errors like: [12:46:43.552] Failed to upload artifacts: Put 
"https://storage.googleapis.com/cirrus-ci-5309429912436736-3271c9/artifacts/postgresql-cfbot/postgresql/6729936359129088/testrun/build/testrun/runningcheck.log?X-Goog-Algorithm=GOOG4-RSA-SHA256=cirrus-ci%40cirrus-ci-community.iam.gserviceaccount.com%2F20230206%2Fauto%2Fstorage%2Fgoog4_request=20230206T124536Z=600=host%3Bx-goog-content-length-range%3Bx-goog-meta-created_by_task=8e84192cbc754180b8baa6c00c41b463f580fe7183f0e7113c253aac13cc2458b835caef7940b91e102e96d54cff2b5714c77390e74244e2fb88c00c9957a801e33cbee2ac960e0db8a01fe08ee945bedf4616881e6beafa3a162c22948ac0b9a9359d93e1f461fc9f49385b784b75d633f1b01805b987d9d53bc7fb55263917ec85180a2140659d50990f066160f03e8bb8984e8d2aadb64c875c253167cf24da152a18d69fcd3d941edce145931e4feb23dc8cf43de7b7bbfc565786c1c692406f2a0a127f30385a8c4b66f96709b51d26d3c71617991c731b0e7206ee3906338dedf6359412edd024f8c76bd33400f4c9320c2bde9512fa8bcd6289e54d52": http2: request body larger than specified content length I'm testing adding a pg_ctl stop to the on_failure right now. Greetings, Andres Freund
Re: pg_upgrade test failure
On Tue, Feb 7, 2023 at 11:03 AM Thomas Munro wrote: > On Tue, Feb 7, 2023 at 10:57 AM Andres Freund wrote: > > On February 6, 2023 1:51:20 PM PST, Thomas Munro > > wrote: > > >Next up: the new "running" tests, spuriously failing around 8.8% of CI > > >builds on FreeBSD. I'll go and ping that thread... > > > > Is that rate unchanged? I thought I fixed the main issue last week? > > Unfortunately my cfbot database only holds a week's history. What I > see is that there were 1254 FreeBSD tasks run in that window, of which > 163 failed, and (more interestingly) 111 of those failures succeeded > on every other platform. And clicking on a few on cfbot's page > reveals that it's the new running stuff, and I'm still trying to find > the interesting logs... Ah, that number might include some other problems, including in subscription (#2900). That's the problem with flapping tests, you get desensitised and stop looking closely and miss things...
Re: pg_upgrade test failure
On Tue, Feb 7, 2023 at 10:57 AM Andres Freund wrote: > On February 6, 2023 1:51:20 PM PST, Thomas Munro > wrote: > >Next up: the new "running" tests, spuriously failing around 8.8% of CI > >builds on FreeBSD. I'll go and ping that thread... > > Is that rate unchanged? I thought I fixed the main issue last week? Unfortunately my cfbot database only holds a week's history. What I see is that there were 1254 FreeBSD tasks run in that window, of which 163 failed, and (more interestingly) 111 of those failures succeeded on every other platform. And clicking on a few on cfbot's page reveals that it's the new running stuff, and I'm still trying to find the interesting logs...
Re: pg_upgrade test failure
Hi, On February 6, 2023 1:51:20 PM PST, Thomas Munro wrote: >Next up: the new "running" tests, spuriously failing around 8.8% of CI >builds on FreeBSD. I'll go and ping that thread... Is that rate unchanged? I thought I fixed the main issue last week? Greetings, Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: pg_upgrade test failure
On Wed, Feb 1, 2023 at 2:44 PM Thomas Munro wrote: > OK, I pushed that. Third time lucky? I pulled down logs for a week of Windows CI, just over ~1k builds. The failure rate was a few per day before, but there are no failures like that after that went in. There are logs that contain the "Directory not empty" warning, but clearly the try-again-and-this-time-wait-for-the-other-process logic must be working (as horrible as it is) because then the test checks that the directory is gone, and succeeds. Hooray. So that's one of our biggest CI flappers fixed. Unfortunately without treating the root cause, really. Next up: the new "running" tests, spuriously failing around 8.8% of CI builds on FreeBSD. I'll go and ping that thread...
Re: pg_upgrade test failure
On Wed, Feb 1, 2023 at 10:08 AM Thomas Munro wrote: > On Wed, Feb 1, 2023 at 10:04 AM Andres Freund wrote: > > Maybe we should just handle it by sleeping and retrying, if on windows? Sad > > to even propose... > > Yeah, that's what that code I posted would do automatically, though > it's a bit hidden. The second attempt to unlink() would see delete > already pending, and activate its secret internal sleep/retry loop. OK, I pushed that. Third time lucky? I tracked down the discussion of that existing comment about pg_ctl, which comes from the 9.2 days: https://www.postgresql.org/message-id/flat/5044DE59.5020500%40dunslane.net I guess maybe back then fopen() was Windows' own fopen() that wouldn't allow two handles to a file at the same time? These days we redirect it to a wrapper with the magic "shared" flags, so the kluge installed by commit f8c81c5dde2 may not even be needed anymore. It does demonstrate that there are long standing timing races around log files, process exit and wait-for-shutdown logic, though. Someone who develops for Windows could probably chase this right down, and make sure that we do certain things in the right order, and/or find better kernel facilities; at a wild guess, something like OpenProcess() before you initiate shutdown, so you can then wait on its handle, for example. The docs for ExitProcess() make it clear that handles are synchronously closed, so I think it's probably just that our tests for when processes have exited are too fuzzy.
Re: pg_upgrade test failure
On Wed, Feb 1, 2023 at 9:54 AM Thomas Munro wrote: > ... I have one more idea ... I also had a second idea, barely good enough to mention and probably just paranoia. In a nearby thread I learned that process exit does not release Windows advisory file locks synchronously, which surprised this Unix hacker; it made me wonder what else might be released lazily after process exit. Handles?! However, as previously mentioned, it's possible that even with fully Unix-like resource cleanup on process exit, we could be confused if we are using "the process that was on the end of this pipe has closed it" as a proxy for "the process is gone, *all* its handles are closed". In any case, the previous kluge should help wallpaper over any of that too, for this test anyway.
Re: pg_upgrade test failure
On Wed, Feb 1, 2023 at 10:04 AM Andres Freund wrote: > On January 31, 2023 12:54:42 PM PST, Thomas Munro > wrote: > >I'm not sure about anything, but if that's what's happening here, then > >maybe the attached would help. In short, it would make the previous > >theory true (the idea of a second unlink() saving the day). > > Maybe we should just handle it by sleeping and retrying, if on windows? Sad > to even propose... Yeah, that's what that code I posted would do automatically, though it's a bit hidden. The second attempt to unlink() would see delete already pending, and activate its secret internal sleep/retry loop.
Re: pg_upgrade test failure
Hi, On January 31, 2023 12:54:42 PM PST, Thomas Munro wrote: >On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby wrote: >> > I pushed the rmtree() change. Let's see if that helps, or tells us >> > something new. >> >> I found a few failures since then: >> >> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade >> >> pg_upgrade: warning: could not remove directory >> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": >> Directory not empty >> pg_upgrade: warning: could not remove directory >> "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": >> Directory not empty > >So no change: we didn't see "could not unlink file ...". So I think >that means that it was rmtree() that unlinked the file for the *first* >time, but someone else has it open. > >Even though Windows is at this point eroding my love of computers and >making me consider a new career in, I dunno, carrot farming or >something, I have one more idea. Check out this kluge in >src/bin/pg_upgrade/exec.c: > >/* > * "pg_ctl -w stop" might have reported that the server has stopped > * because the postmaster.pid file has been removed, but "pg_ctl -w > * start" might still be in the process of closing and might still be > * holding its stdout and -l log file descriptors open. Therefore, > * try to open the log file a few more times. > */ > >I'm not sure about anything, but if that's what's happening here, then >maybe the attached would help. In short, it would make the previous >theory true (the idea of a second unlink() saving the day). Maybe we should just handle it by sleeping and retrying, if on windows? Sad to even propose... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: pg_upgrade test failure
On Wed, Feb 1, 2023 at 6:28 AM Justin Pryzby wrote:
> > I pushed the rmtree() change. Let's see if that helps, or tells us
> > something new.
>
> I found a few failures since then:
>
> https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade
>
> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty
> pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty

So no change: we didn't see "could not unlink file ...". So I think that means that it was rmtree() that unlinked the file for the *first* time, but someone else has it open.

Even though Windows is at this point eroding my love of computers and making me consider a new career in, I dunno, carrot farming or something, I have one more idea. Check out this kluge in src/bin/pg_upgrade/exec.c:

/*
 * "pg_ctl -w stop" might have reported that the server has stopped
 * because the postmaster.pid file has been removed, but "pg_ctl -w
 * start" might still be in the process of closing and might still be
 * holding its stdout and -l log file descriptors open. Therefore,
 * try to open the log file a few more times.
 */

I'm not sure about anything, but if that's what's happening here, then maybe the attached would help. In short, it would make the previous theory true (the idea of a second unlink() saving the day).

diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c
index 42dcbfc5b5..f214b7737f 100644
--- a/src/bin/pg_upgrade/util.c
+++ b/src/bin/pg_upgrade/util.c
@@ -68,7 +68,12 @@ cleanup_output_dirs(void)
 	if (log_opts.retain)
 		return;
 
-	(void) rmtree(log_opts.basedir, true);
+	/*
+	 * Try twice.  The second time might wait for files to be concurrently
+	 * closed on Windows.
+	 */
+	if (!rmtree(log_opts.basedir, true))
+		rmtree(log_opts.basedir, true);
 
 	/* Remove pg_upgrade_output.d only if empty */
 	switch (pg_check_dir(log_opts.rootdir))
@@ -80,7 +85,12 @@ cleanup_output_dirs(void)
 		case 1:					/* exists and empty */
 		case 2:					/* exists and contains only dot files */
-			(void) rmtree(log_opts.rootdir, true);
+			/*
+			 * Try twice.  The second time might wait for files to be
+			 * concurrently closed on Windows.
+			 */
+			if (!rmtree(log_opts.rootdir, true))
+				rmtree(log_opts.rootdir, true);
 			break;
 		case 4:					/* exists */
Re: pg_upgrade test failure
On Tue, Jan 31, 2023 at 02:00:05PM +1300, Thomas Munro wrote: > On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro wrote: > > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund wrote: > > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote: > > > > So [1] on its own didn't fix this. My next guess is that the attached > > > > might help. > > > > > What is our plan here? This afaict is the most common "false positive" for > > > cfbot in the last weeks. > > I pushed the rmtree() change. Let's see if that helps, or tells us > something new. I found a few failures since then: https://api.cirrus-ci.com/v1/artifact/task/6696942420361216/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720/log": Directory not empty pg_upgrade: warning: could not remove directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20230131T134931.720": Directory not empty https://api.cirrus-ci.com/v1/artifact/task/5119776607961088/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade same I verified that those both include your 54e72b66e, which is pretty strange, since the patch passed tests 10s of times on CI until it was merged, when it started/kept failing. -- Justin
Re: pg_upgrade test failure
On Thu, Jan 5, 2023 at 4:11 PM Thomas Munro wrote: > On Wed, Dec 7, 2022 at 7:15 AM Andres Freund wrote: > > On 2022-11-08 01:16:09 +1300, Thomas Munro wrote: > > > So [1] on its own didn't fix this. My next guess is that the attached > > > might help. > > > What is our plan here? This afaict is the most common "false positive" for > > cfbot in the last weeks. I pushed the rmtree() change. Let's see if that helps, or tells us something new. Michael: There were some questions from Andres above. FWIW I think if you wanted to investigate this properly on a local Windows system to chase down who's got the file open (shutdown sequence problem or whatever), you'd probably have to install Server 2019, or maybe use an old 8.1 VM if you still have such a thing, based on the suspicion that typical 10 and 11 systems won't exhibit the problem. But then I could be wrong about what's going on...
Re: pg_upgrade test failure
On Wed, Dec 7, 2022 at 7:15 AM Andres Freund wrote:
> On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> > So [1] on its own didn't fix this. My next guess is that the attached
> > might help.
>
> What is our plan here? This afaict is the most common "false positive" for
> cfbot in the last weeks.

That branch hasn't failed on cfbot [1], except once due to one of the other known flapping races we have to fix. Which doesn't prove anything, of course, but it is encouraging. I wish we knew why the test does this, though.

Here's a better version that works harder to avoid opening more than one fd at a time (like the pgfnames()-based code it replaces), and also uses fd.c facilities in the backend version (unlike pgfnames(), which looks like it could leak a descriptor if palloc() threw, and also doesn't know how to handle file descriptor pressure).

[1] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/41/4011

From 3c9ed22ee207f2a5c008a971f8b19561d2e0ccfa Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 5 Jan 2023 14:15:37 +1300
Subject: [PATCH v2] Refactor rmtree() to use get_dirent_type().

Switch to get_dirent_type() instead of lstat() while traversing a
directory graph, to see if that fixes intermittent ENOTEMPTY failures
while cleaning up after the pg_upgrade tests, on our Windows CI build.

Our CI image uses Windows Server 2019, a version known not to have POSIX
unlink semantics enabled by default yet, unlike typical Windows 10 and 11
systems.

The theory is that the offending directory entries must be
STATUS_DELETE_PENDING. With this change, rmtree() should not skip them,
and try to unlink again. Our unlink() wrapper should either wait a short
time for them to go away (because other processes close the handle) or
log a message to tell us the path of the problem file if not, so we can
dig further.

Discussion: https://postgr.es/m/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de
---
 src/common/rmtree.c | 120 +++-
 1 file changed, 64 insertions(+), 56 deletions(-)

diff --git a/src/common/rmtree.c b/src/common/rmtree.c
index d445e21ba2..a3e536149b 100644
--- a/src/common/rmtree.c
+++ b/src/common/rmtree.c
@@ -20,13 +20,21 @@
 #include 
 #include 
 
+#include "common/file_utils.h"
+
 #ifndef FRONTEND
+#include "storage/fd.h"
 #define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
+#define LOG_LEVEL WARNING
+#define OPENDIR(x) AllocateDir(x)
+#define CLOSEDIR(x) FreeDir(x)
 #else
 #include "common/logging.h"
+#define LOG_LEVEL PG_LOG_WARNING
+#define OPENDIR(x) opendir(x)
+#define CLOSEDIR(x) closedir(x)
 #endif
-
 /*
  * rmtree
  *
@@ -41,82 +49,82 @@
 bool
 rmtree(const char *path, bool rmtopdir)
 {
-	bool		result = true;
 	char		pathbuf[MAXPGPATH];
-	char	  **filenames;
-	char	  **filename;
-	struct stat statbuf;
-
-	/*
-	 * we copy all the names out of the directory before we start modifying
-	 * it.
-	 */
-	filenames = pgfnames(path);
+	DIR		   *dir;
+	struct dirent *de;
+	bool		result = true;
+	size_t		dirnames_size = 0;
+	size_t		dirnames_capacity = 8;
+	char	  **dirnames = palloc(sizeof(char *) * dirnames_capacity);
 
-	if (filenames == NULL)
+	dir = OPENDIR(path);
+	if (dir == NULL)
+	{
+		pg_log_warning("could not open directory \"%s\": %m", path);
 		return false;
+	}
 
-	/* now we have the names we can start removing things */
-	for (filename = filenames; *filename; filename++)
+	while (errno = 0, (de = readdir(dir)))
 	{
-		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, *filename);
-
-		/*
-		 * It's ok if the file is not there anymore; we were just about to
-		 * delete it anyway.
-		 *
-		 * This is not an academic possibility. One scenario where this
-		 * happens is when bgwriter has a pending unlink request for a file in
-		 * a database that's being dropped. In dropdb(), we call
-		 * ForgetDatabaseSyncRequests() to flush out any such pending unlink
-		 * requests, but because that's asynchronous, it's not guaranteed that
-		 * the bgwriter receives the message in time.
-		 */
-		if (lstat(pathbuf, &statbuf) != 0)
-		{
-			if (errno != ENOENT)
-			{
-				pg_log_warning("could not stat file or directory \"%s\": %m",
-							   pathbuf);
-				result = false;
-			}
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
-		}
-
-		if (S_ISDIR(statbuf.st_mode))
-		{
-			/* call ourselves recursively for a directory */
-			if (!rmtree(pathbuf, true))
-			{
-				/* we already reported the error */
-				result = false;
-			}
-		}
-		else
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
+		switch (get_dirent_type(pathbuf, de, false, LOG_LEVEL))
 		{
-			if (unlink(pathbuf) != 0)
-			{
-				if (errno != ENOENT)
+			case PGFILETYPE_ERROR:
+				/* already logged, press on */
+				break;
+			case PGFILETYPE_DIR:
+
+				/*
+				 * Defer recursion until after we've closed this directory, to
+				 * avoid using more than one file descriptor at a time.
+				 */
+				if (dirnames_size ==
Re: pg_upgrade test failure
Hi,

On 2022-11-08 01:16:09 +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this. My next guess is that the attached
> might help.
>
> Hmm. Following Michael's clue that this might involve log files and
> pg_ctl, I noticed one thing: pg_ctl implements
> wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and
> our kill emulation does CallNamedPipe(). If the server is in the
> process of exiting and the kernel is cleaning up all the handles we
> didn't close, is there any reason to expect the signal pipe to be
> closed after the log file?

What is our plan here? This afaict is the most common "false positive" for cfbot in the last weeks. E.g.:

https://api.cirrus-ci.com/v1/artifact/task/5462686092230656/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

...
[00:02:58.761](93.859s) ok 10 - run of pg_upgrade for new instance
[00:02:58.808](0.047s) not ok 11 - pg_upgrade_output.d/ removed after pg_upgrade success
[00:02:58.815](0.007s) # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success'
# at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 288.

Michael: Why does 002_pg_upgrade.pl try to filter the list of files in pg_upgrade_output.d for files ending in .log? And why does it print those only after starting the new node? How about moving the iteration through pg_upgrade_output.d to before the ->start and printing all the files, but only calling slurp_file() if the filename ends with *.log?

Minor nit: It seems off to have quite so many copies of $newnode->data_dir . "/pg_upgrade_output.d", particularly where the test defines $log_path, but then still builds it from scratch afterwards (line 304).

Greetings,

Andres Freund
Re: pg_upgrade test failure
On Tue, Nov 08, 2022 at 01:16:09AM +1300, Thomas Munro wrote:
> So [1] on its own didn't fix this. My next guess is that the attached
> might help.

I took the liberty of adding a CF entry for this:
https://commitfest.postgresql.org/41/4011/

And afterwards figured I could be a little bit wasteful and run the tests using meson test --repeat, rather than letting cfbot do it over the course of a month.
https://cirrus-ci.com/task/5115893722644480

So I didn't find evidence that it doesn't resolve the issue (but this also doesn't prove that it will work).

--
Justin
Re: pg_upgrade test failure
So [1] on its own didn't fix this. My next guess is that the attached might help.

Hmm. Following Michael's clue that this might involve log files and pg_ctl, I noticed one thing: pg_ctl implements wait_for_postmaster_stop() by waiting for kill(pid, 0) to fail, and our kill emulation does CallNamedPipe(). If the server is in the process of exiting and the kernel is cleaning up all the handles we didn't close, is there any reason to expect the signal pipe to be closed after the log file?

[1] https://www.postgresql.org/message-id/flat/20221025213055.GA8537%40telsasoft.com#9030de6c4c5e544d2b057b793a5b42af

From 9cbfd5c1fc1327443152d89ca7f5346bfeda3f52 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 7 Nov 2022 17:43:42 +1300
Subject: [PATCH] Refactor rmtree() to use get_dirent_type().

Switch to get_dirent_type() instead of lstat() while traversing a
directory graph, to see if that fixes intermittent ENOTEMPTY failures
while cleaning up after the pg_upgrade tests. The idea is that the
offending directory entries must be STATUS_DELETE_PENDING. With this
change, rmtree() should not skip them, and try to unlink again. That
should either wait a short time for them to go away (because other
processes close the handle) or log a message to tell us the path of the
problem file if not, so we can dig further.

Discussion: https://postgr.es/m/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de
---
 src/common/rmtree.c | 94 ++---
 1 file changed, 37 insertions(+), 57 deletions(-)

diff --git a/src/common/rmtree.c b/src/common/rmtree.c
index 221d0e20a7..736e34892c 100644
--- a/src/common/rmtree.c
+++ b/src/common/rmtree.c
@@ -20,13 +20,16 @@
 #include 
 #include 
 
+#include "common/file_utils.h"
+
 #ifndef FRONTEND
 #define pg_log_warning(...) elog(WARNING, __VA_ARGS__)
+#define LOG_LEVEL WARNING
 #else
 #include "common/logging.h"
+#define LOG_LEVEL PG_LOG_WARNING
 #endif
-
 /*
  * rmtree
  *
@@ -41,82 +44,59 @@
 bool
 rmtree(const char *path, bool rmtopdir)
 {
-	bool		result = true;
 	char		pathbuf[MAXPGPATH];
-	char	  **filenames;
-	char	  **filename;
-	struct stat statbuf;
-
-	/*
-	 * we copy all the names out of the directory before we start modifying
-	 * it.
-	 */
-	filenames = pgfnames(path);
+	DIR		   *dir;
+	struct dirent *de;
+	bool		result = true;
 
-	if (filenames == NULL)
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		pg_log_warning("could not open directory \"%s\": %m", path);
 		return false;
+	}
 
-	/* now we have the names we can start removing things */
-	for (filename = filenames; *filename; filename++)
+	while (errno = 0, (de = readdir(dir)))
 	{
-		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, *filename);
-
-		/*
-		 * It's ok if the file is not there anymore; we were just about to
-		 * delete it anyway.
-		 *
-		 * This is not an academic possibility. One scenario where this
-		 * happens is when bgwriter has a pending unlink request for a file in
-		 * a database that's being dropped. In dropdb(), we call
-		 * ForgetDatabaseSyncRequests() to flush out any such pending unlink
-		 * requests, but because that's asynchronous, it's not guaranteed that
-		 * the bgwriter receives the message in time.
-		 */
-		if (lstat(pathbuf, &statbuf) != 0)
-		{
-			if (errno != ENOENT)
-			{
-				pg_log_warning("could not stat file or directory \"%s\": %m",
-							   pathbuf);
-				result = false;
-			}
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
-		}
-
-		if (S_ISDIR(statbuf.st_mode))
-		{
-			/* call ourselves recursively for a directory */
-			if (!rmtree(pathbuf, true))
-			{
-				/* we already reported the error */
-				result = false;
-			}
-		}
-		else
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
+		switch (get_dirent_type(pathbuf, de, false, LOG_LEVEL))
 		{
-			if (unlink(pathbuf) != 0)
-			{
-				if (errno != ENOENT)
+			case PGFILETYPE_ERROR:
+				/* already logged, press on */
+				break;
+			case PGFILETYPE_DIR:
+				if (!rmtree(pathbuf, true))
+					result = false;
+				break;
+			default:
+				if (unlink(pathbuf) != 0 && errno != ENOENT)
 				{
-					pg_log_warning("could not remove file or directory \"%s\": %m",
-								   pathbuf);
+					pg_log_warning("could not unlink file \"%s\": %m", pathbuf);
 					result = false;
 				}
-			}
+				break;
 		}
 	}
 
+	if (errno != 0)
+	{
+		pg_log_warning("could not read directory \"%s\": %m", path);
+		result = false;
+	}
+
+	closedir(dir);
+
 	if (rmtopdir)
 	{
		if (rmdir(path) != 0)
		{
-			pg_log_warning("could not remove file or directory \"%s\": %m",
-						   path);
+			pg_log_warning("could not remove directory \"%s\": %m", path);
			result = false;
		}
	}
 
-	pgfnames_cleanup(filenames);
-
	return result;
 }
-- 
2.30.2
Re: pg_upgrade test failure
Hi,

On 2022-10-17 23:31:44 -0500, Justin Pryzby wrote:
> On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote:
> > On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote:
> > > * Server 2019, as used on CI, still uses the traditional NT semantics
> > > (unlink is asynchronous, when all handles close)
> > > * the fix I proposed has the right effect (I will follow up with tests
> > > to demonstrate)
> >
> > Wow, nice investigation. And cirrus does not offer a newer option
> > either..
>
> Currently Andres builds images based on cirrus's 2019 image, but I think
> we could use any windows docker image.

You unfortunately can't run newer containers than the host OS :(, just use older ones. And if you use mismatching containers the startup gets slower because it switches to use full virtualization rather than containers.

I think we need to switch to using full VMs rather than containers. The performance of the windows containers is just atrocious (build times on a local VM with the same number of cores are 1/2 of what we see in CI, test times 1/3), they're slow to start due to pulling all the files and decompressing them, and they're fragile.

I've asked Bilal (CCed) to work on generating both containers and images.

Greetings,

Andres Freund
Re: pg_upgrade test failure
On Tue, Oct 18, 2022 at 01:06:15PM +0900, Michael Paquier wrote: > On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote: > > * Server 2019, as used on CI, still uses the traditional NT semantics > > (unlink is asynchronous, when all handles close) > > * the fix I proposed has the right effect (I will follow up with tests > > to demonstrate) > > Wow, nice investigation. And cirrus does not offer a newer option > either.. Currently Andres builds images based on cirrus's 2019 image, but I think we could use any Windows docker image. > Do you think that Windows server 2022 (successor of 2019) is > able to use POSIX semantics for unlink()? I think it's possible to use it now, like what's done here. https://commitfest.postgresql.org/40/3347/ The only caveat is that it's done conditionally.
Re: pg_upgrade test failure
On Tue, Oct 18, 2022 at 09:47:37AM +1300, Thomas Munro wrote: > * Server 2019, as used on CI, still uses the traditional NT semantics > (unlink is asynchronous, when all handles close) > * the fix I proposed has the right effect (I will follow up with tests > to demonstrate) Wow, nice investigation. And cirrus does not offer a newer option either.. Do you think that Windows Server 2022 (successor of 2019) is able to use POSIX semantics for unlink()? It looks like we are a few years away from being able to do that assuming that cirrus offers a newer version than Server 2019, but I guess that the code could mention this possibility in a comment, at least.. -- Michael signature.asc Description: PGP signature
Re: pg_upgrade test failure
On Mon, Oct 3, 2022 at 7:29 PM Michael Paquier wrote: > On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote: > > So I think that setting is_lnk = false is good enough here. Do > > you see a hole in it? > > I cannot think of one, off the top of my head. Thanks for the > explanation. Some things I learned while trying to understand how I managed to introduce that bug despite reading and testing:
* the code in pgunlink() has comments saying that its retry loop is to handle sharing violations
* in fact that retry loop also comes into play for STATUS_DELETE_PENDING
* that case is fairly well hidden, because to reach it you need to unlink(pathname) twice! The second call will wait up to 10 seconds for handles to close and then report ENOENT, allowing rmdir(parent) to succeed
* I guess this code is relying on that double-unlink to block until the directory is empty?
* you wouldn't notice any of this if you were testing on Windows 10 on a desktop/laptop/VM, because it now uses POSIX semantics for unlink on NTFS, so the first unlink truly (synchronously) unlinks (no more STATUS_DELETE_PENDING)
* Server 2019, as used on CI, still uses the traditional NT semantics (unlink is asynchronous, when all handles close)
* the fix I proposed has the right effect (I will follow up with tests to demonstrate)
I'll post my tests for this and a bunch more things I figured out shortly in a new Windows-filesystem-semantics-omnibus thread.
Re: pg_upgrade test failure
On Mon, Oct 03, 2022 at 04:03:12PM +1300, Thomas Munro wrote: > So I think that setting is_lnk = false is good enough here. Do > you see a hole in it? I cannot think of one, off the top of my head. Thanks for the explanation. -- Michael
Re: pg_upgrade test failure
On Mon, Oct 3, 2022 at 1:40 PM Michael Paquier wrote: > On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote: > > I think something like the attached should do the right thing for > > STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes > > back to being blocking (sleep+retry until eventually we reach ENOENT > > or we time out and give up with EACCES), but we still distinguish it > > from true ENOENT so we have a fast exit in that case. This is passing > > CI, but not tested yet.
>  	if (lstat(path, &st) < 0)
> -		return -1;
> +	{
> +		if (lstat_error_was_status_delete_pending())
> +			is_lnk = false;
> +		else
> +			return -1;
> +	}
> +	else
> +		is_lnk = S_ISLNK(st.st_mode);
> Sorry, I don't remember all the details in this area, but a directory > can never be marked as STATUS_DELETE_PENDING with some of its contents > still inside, right? That's my understanding, yes: just like Unix, you can't remove a directory with something in it. Unlike Unix, that includes files that have been unlinked but are still open somewhere. (Note that in this case it's not exactly a real directory, it's a junction point, which is a directory but it doesn't have contents, it is a reparse point pointing somewhere else, so I suspect that it can't really suffer from ENOTEMPTY, but it probably can suffer from 'someone has it open for a short time because they are concurrently stat-ing it'.) > If it has some contents, forcing unlink() all > the time would be fine? Here's why I think it's probably OK to use unlink() unconditionally after detecting STATUS_DELETE_PENDING. AFAICT there is no way to even find out if it's a file or a junction in this state, but it doesn't matter: we are not waiting for rmdir() or unlink() to succeed, we are waiting for it to fail with an error other than EACCES, most likely ENOENT (or to time out, perhaps because someone held the file open for 11 seconds, or because EACCES was actually telling us about a permissions problem). 
EACCES is the errno for many things including STATUS_DELETE_PENDING and also "you called unlink() but it's a directory" (should be EPERM according to POSIX, or EISDIR according to Linux). Both of those reasons imply that the zombie directory entry still exists, and we don't care which of those reasons triggered it. So I think that setting is_lnk = false is good enough here. Do you see a hole in it?
Re: pg_upgrade test failure
On Mon, Oct 03, 2022 at 12:10:06PM +1300, Thomas Munro wrote: > I think something like the attached should do the right thing for > STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes > back to being blocking (sleep+retry until eventually we reach ENOENT > or we time out and give up with EACCES), but we still distinguish it > from true ENOENT so we have a fast exit in that case. This is passing > CI, but not tested yet.
 	if (lstat(path, &st) < 0)
-		return -1;
+	{
+		if (lstat_error_was_status_delete_pending())
+			is_lnk = false;
+		else
+			return -1;
+	}
+	else
+		is_lnk = S_ISLNK(st.st_mode);
Sorry, I don't remember all the details in this area, but a directory can never be marked as STATUS_DELETE_PENDING with some of its contents still inside, right? If it has some contents, forcing unlink() all the time would be fine? > One ugly thing in this patch is that it has to deal with our > historical mistake (?) of including Windows headers in this file in > Cygwin builds for no reason and thus getting WIN32 defined on a > non-WIN32 build, as I've complained about before[1] but not yet tidied > up. Your proposal remains local to dirmod.c, so that does not sound like a big deal to me for the time being. -- Michael
Re: pg_upgrade test failure
On Mon, Oct 3, 2022 at 9:07 AM Thomas Munro wrote: > On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby wrote: > > I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so > > didn't warn about the file itself, but then failed one moment later in > > rmdir. > > Yeah, I think this is my fault. In commit f357233c the new lstat() > call might return ENOENT for STATUS_DELETE_PENDING, and then we don't > enter pgunlink()'s 10 second sleep-retry loop. Let me think about how > best to fix that, and how to write a regression test program that > would exercise stuff like this. Might take a couple of days as I am > away from computers until mid-week. I think something like the attached should do the right thing for STATUS_DELETE_PENDING (sort of "ENOENT-in-progress"). unlink() goes back to being blocking (sleep+retry until eventually we reach ENOENT or we time out and give up with EACCES), but we still distinguish it from true ENOENT so we have a fast exit in that case. This is passing CI, but not tested yet. One ugly thing in this patch is that it has to deal with our historical mistake (?) of including Windows headers in this file in Cygwin builds for no reason and thus getting WIN32 defined on a non-WIN32 build, as I've complained about before[1] but not yet tidied up. Remembering why we do any of this weird looking stuff that we don't need on Unix, the general idea is that things that scan directories to unlink everything before unlinking the parent directory need to block for a while on STATUS_DELETE_PENDING to increase their probability of success, while things that do anything else probably just want to skip such zombie files completely. 
To recap, we have:
* readdir() sees files that are ENOENT-in-progress (so recursive unlinks can see them)
* unlink() waits for ENOENT-in-progress to reach ENOENT (what broke here)
* stat() and lstat() report ENOENT-in-progress as ENOENT (done to fix eg pg_basebackup, which used to fail at random on Windows)
* open() reports ENOENT-in-progress as either ENOENT or EEXIST depending on O_CREAT (because used by stat())
Clearly this set of kludges isn't perfect and other kludge-sets would be possible too. One thought is that we could hide ENOENT-in-progress from readdir(), and add a new rmdir() wrapper instead. If it gets a directory-not-empty error from the kernel, it could at that point wait for zombie files to go away (perhaps registering for file system events with some local equivalent of KQ_FILTER_VNODE if there is one, to be less sloppy than the current sleep() nonsense, but sleep would work too). When I'm back at my real keyboard I'll try to come up with tests for this stuff, but I'm not sure how solid we can really make a test for this particular case -- I think you'd need to have another thread open the file and then close it after different periods of time, to demonstrate that the retry loop works but also gives up, and that's exactly the sort of timing-dependent stuff we try to avoid. But I think I'll try that anyway, because it's essential infrastructure to allow Unix-only hackers to work on this stuff. Once we have that, we might be able to make some more progress with the various FILE_DISPOSITION_POSIX_SEMANTICS proposals, if it helps, because we'll have reproducible evidence for what it really does.

[1] https://commitfest.postgresql.org/39/3781/

From 0fa290b4d3597c843804d3ee35559074864424aa Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 3 Oct 2022 09:15:02 +1300
Subject: [PATCH] Fix STATUS_DELETE_PENDING handling in pgunlink().

Commit c5cb8f3b didn't handle STATUS_DELETE_PENDING correctly, and would report ENOENT immediately without waiting. 
In order for simple recursive unlink algorithms to have a chance of succeeding while other backends might not have closed them yet, we rely on the 10-seconds-of-retries behavior present in pgunlink().

Diagnosed-by: Justin Pryzby
Discussion: https://postgr.es/m/20220920013122.GA31833%40telsasoft.com
---
 src/port/dirmod.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index ae6301dd6c..aae81228ad 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -39,6 +39,10 @@
 #endif
 #endif
 
+#if defined(WIN32) && !defined(__CYGWIN__)
+#include "port/win32ntdll.h"
+#endif
+
 #if defined(WIN32) || defined(__CYGWIN__)
 
 /*
@@ -91,6 +95,22 @@ pgrename(const char *from, const char *to)
 	return 0;
 }
 
+/*
+ * Check if _pglstat64()'s reason for failure was STATUS_DELETE_PENDING.
+ * This doesn't apply to Cygwin, which has its own lstat() that would report
+ * the case as EACCES.
+ */
+static bool
+lstat_error_was_status_delete_pending(void)
+{
+	if (errno != ENOENT)
+		return false;
+#if defined(WIN32) && !defined(__CYGWIN__)
+	if (pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
+		return true;
+#endif
+	return false;
+}
 
 /*
  *	pgunlink
@@ -98,6 +118,7 @@ pgrename(const char *from, const
Re: pg_upgrade test failure
On Tue, Sep 20, 2022 at 1:31 PM Justin Pryzby wrote: > I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so > didn't warn about the file itself, but then failed one moment later in > rmdir. Yeah, I think this is my fault. In commit f357233c the new lstat() call might return ENOENT for STATUS_DELETE_PENDING, and then we don't enter pgunlink()'s 10 second sleep-retry loop. Let me think about how best to fix that, and how to write a regression test program that would exercise stuff like this. Might take a couple of days as I am away from computers until mid-week.
Re: pg_upgrade test failure
Hi, On 2022-09-27 11:47:37 +0530, Bharath Rupireddy wrote: > Just for the records - the same issue was also seen here [1], [2]. > > [1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82 > [2] > https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade Yea, this is at the moment one of the top sources of spurious test failures for cfbot. Just manually looking at http://cfbot.cputube.org/ for tasks that recently changed state on windows: https://cirrus-ci.com/task/6422687231770624?logs=check_world#L60 https://cirrus-ci.com/task/6408332243107840?logs=check_world#L60 https://cirrus-ci.com/task/6202259712245760?logs=check_world#L60 https://cirrus-ci.com/task/6150885981028352?logs=check_world#L60 https://cirrus-ci.com/task/5361597290905600?logs=check_world#L60 https://cirrus-ci.com/task/5177327624650752?logs=check_world#L60 https://cirrus-ci.com/task/4862503887831040?logs=check_world#L60 https://cirrus-ci.com/task/4576362479484928?logs=check_world#L60 Something needs to happen here. Greetings, Andres Freund
Re: pg_upgrade test failure
On Tue, Sep 20, 2022 at 7:01 AM Justin Pryzby wrote: > > On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote: > > Hi, > > > > After my last rebase of the meson tree I encountered the following test > > failure: > > > > https://cirrus-ci.com/task/5532444261613568 > > > > [20:23:04.171] - 8< > > - > > [20:23:04.171] stderr: > > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after > > pg_upgrade --check success' > > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line > > 249. > > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after > > pg_upgrade success' > > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line > > 261. > > [20:23:04.171] # Looks like you failed 2 tests of 13. > > > > regress_log: > > https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade > > > > The pg_upgrade output contains these potentially relevant warnings: > > > > ... > > *Clusters are compatible* > > pg_upgrade: warning: could not remove file or directory > > "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": > > Directory not empty > > pg_upgrade: warning: could not remove file or directory > > "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": > > Directory not empty > > ... Just for the records - the same issue was also seen here [1], [2]. [1] https://cirrus-ci.com/task/5709014662119424?logs=check_world#L82 [2] https://api.cirrus-ci.com/v1/artifact/task/5709014662119424/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade test failure
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote: > Hi, > > After my last rebase of the meson tree I encountered the following test > failure: > > https://cirrus-ci.com/task/5532444261613568 > > [20:23:04.171] - 8< > - > [20:23:04.171] stderr: > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after > pg_upgrade --check success' > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line > 249. > [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade > success' > [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line > 261. > [20:23:04.171] # Looks like you failed 2 tests of 13. > > regress_log: > https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade > > The pg_upgrade output contains these potentially relevant warnings: > > ... > *Clusters are compatible* > pg_upgrade: warning: could not remove file or directory > "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": > Directory not empty > pg_upgrade: warning: could not remove file or directory > "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": > Directory not empty > ... It looks like it failed to remove a *.log file under windows, which caused rmtree to fail. src/bin/pg_upgrade/pg_upgrade.h-#define DB_DUMP_LOG_FILE_MASK "pg_upgrade_dump_%u.log" src/bin/pg_upgrade/pg_upgrade.h-#define SERVER_LOG_FILE "pg_upgrade_server.log" src/bin/pg_upgrade/pg_upgrade.h-#define UTILITY_LOG_FILE "pg_upgrade_utility.log" src/bin/pg_upgrade/pg_upgrade.h:#define INTERNAL_LOG_FILE "pg_upgrade_internal.log" ee5353abb only changed .txt files used for errors so can't be the cause, but the original commit 38bfae3 might be related. 
I suspect that rmtree() was looping in pgunlink(), and got ENOENT, so didn't warn about the file itself, but then failed one moment later in rmdir. -- Justin
Re: pg_upgrade test failure
On Mon, Sep 19, 2022 at 06:13:17PM -0700, Andres Freund wrote: > I don't really see what'd race with what here? pg_upgrade has precise control > over what's happening here, no? A code path could have forgotten an fclose() for example, but this code is rather old and close-proof as far as I know. Most of the log files are used with redirections for external calls, though I don't see how these could still be held after pg_upgrade finishes :/ Could the use of meson somehow influence when running tests on Windows? > I've only seen it once so far, but there haven't been many CI runs of the > meson branch since rebasing on top of the last changes to pg_upgrade. Hmm, okay. Is that a specific branch in one of your public repos? -- Michael
Re: pg_upgrade test failure
Hi, On 2022-09-20 10:08:41 +0900, Michael Paquier wrote: > On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote: > > I don't know if actually related to the commit below, but there've been a > > lot of runs of the pg_upgrade tests in the meson branch, and this is the > > first > > failure of this kind. Unfortunately the error seems to be transient - > > rerunning the tests succeeded. > > This smells to me like a race condition in pg_upgrade (or even pg_ctl > for SERVER_LOG_FILE) where the code still has handles on some of the > files in the log/ subdirectory, causing its removal to not be able to > finish. I don't really see what'd race with what here? pg_upgrade has precise control over what's happening here, no? > If this proves to be rather easy to reproduce, giving > a list of the files still present in this path would give a hint easy > to follow. Does this reproduce with a good frequency? I've only seen it once so far, but there haven't been many CI runs of the meson branch since rebasing on top of the last changes to pg_upgrade. Greetings, Andres Freund
Re: pg_upgrade test failure
On Mon, Sep 19, 2022 at 02:32:17PM -0700, Andres Freund wrote: > I don't know if actually related to the commit below, but there've been a > lot of runs of the pg_upgrade tests in the meson branch, and this is the first > failure of this kind. Unfortunately the error seems to be transient - > rerunning the tests succeeded. This smells to me like a race condition in pg_upgrade (or even pg_ctl for SERVER_LOG_FILE) where the code still has handles on some of the files in the log/ subdirectory, causing its removal to not be able to finish. If this proves to be rather easy to reproduce, giving a list of the files still present in this path would give a hint easy to follow. Does this reproduce with a good frequency? -- Michael
pg_upgrade test failure
Hi, After my last rebase of the meson tree I encountered the following test failure: https://cirrus-ci.com/task/5532444261613568 [20:23:04.171] - 8< - [20:23:04.171] stderr: [20:23:04.171] # Failed test 'pg_upgrade_output.d/ not removed after pg_upgrade --check success' [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 249. [20:23:04.171] # Failed test 'pg_upgrade_output.d/ removed after pg_upgrade success' [20:23:04.171] # at C:/cirrus/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 261. [20:23:04.171] # Looks like you failed 2 tests of 13. regress_log: https://api.cirrus-ci.com/v1/artifact/task/5532444261613568/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade The pg_upgrade output contains these potentially relevant warnings: ... *Clusters are compatible* pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511/log": Directory not empty pg_upgrade: warning: could not remove file or directory "C:/cirrus/build/testrun/pg_upgrade/002_pg_upgrade/data/t_002_pg_upgrade_new_node_data/pgdata/pg_upgrade_output.d/20220919T201958.511": Directory not empty ... I don't know if actually related to the commit below, but there've been a lot of runs of the pg_upgrade tests in the meson branch, and this is the first failure of this kind. Unfortunately the error seems to be transient - rerunning the tests succeeded. On 2022-09-13 01:39:59 +, Michael Paquier wrote: > Move any remaining files generated by pg_upgrade into an internal subdir > > This change concerns a couple of .txt files (for internal state checks) > that were still written in the path where the binary is executed, and > not in the subdirectory located in the target cluster. Like the other > .txt files doing already so (like loadable_libraries.txt), these are > saved in the base output directory. 
Note that on failure, the logs > report the full path to the .txt file generated, so these are easy to > find. > > Oversight in 38bfae3. > > Author: Daniel Gustafsson > Reviewed-by: Michael Paquier, Justin Prysby > Discussion: https://postgr.es/m/181a6da8-3b7f-4b71-82d5-363ff0146...@yesql.se > Backpatch-through: 15 Greetings, Andres Freund