Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Thu, Jan 11, 2024 at 3:33 AM Kyotaro Horiguchi wrote: > Is it correct to understand that you are requesting changes as follows? > > --- a/src/bin/pg_ctl/pg_ctl.c > +++ b/src/bin/pg_ctl/pg_ctl.c > @@ -1995,11 +1995,14 @@ pgwin32_find_postmaster_pid(pid_t shell_pid) > * > * Check for duplicate processes to ensure reliability. > * > -* The launcher shell might start other cmd.exe instances or programs > -* besides postgres.exe. Verifying the program file name is essential. > -* > -* The launcher shell process isn't checked in this function. It > will be > -* checked by the caller. > +* The ppe entry to be examined is identified by th32ParentProcessID, > which > +* should correspond to the cmd.exe process that executes the > postgres.exe > +* binary. Additionally, th32ProcessID in the same entry should be > the PID > +* of the launched postgres.exe. However, even though we have > launched the > +* parent cmd.exe with the /D option specified, it is sometimes > observed > +* that another cmd.exe is launched for unknown reasons. Therefore, > it is > +* crucial to verify the program file name to avoid returning the > wrong > +* PID. > */ This kind of change looks massively helpful to me. I don't know if it is exactly right or not, but it would have been a big help to me when writing my previous review, so +1 for some change of this general type. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Thanks for restarting this thread. At Tue, 9 Jan 2024 09:40:23 +0900, Michael Paquier wrote in > On Fri, Jan 05, 2024 at 02:58:55PM -0500, Robert Haas wrote: > > I'm not a Windows expert, but my guess is that 0001 is a very good > > idea. I hope someone who is a Windows expert will comment on that. > > I am +1 on 0001. It is just something we've never anticipated when > these wrappers around cmd in pg_ctl were written. Thanks for committing it. > > 0002 seems problematic to me. One potential issue is that it would > > break if someone renamed postgres.exe to something else -- although > > that's probably not really a serious problem. > > We do a find_other_exec_or_die() on "postgres" with what could be a > custom execution path. So we're actually sure that the binary will be > there in the start path, no? I don't like much the hardcoded > dependency to .exe here. The patch doesn't care of the path for postgres.exe. If you are referring to the code you cited below, it's for another reason. I'll describe that there. > > A bigger issue is that > > it seems like it would break if someone used pg_ctl to start several > > instances in different data directories on the same machine. If I'm > > understanding correctly, that currently mostly works, and this would > > break it. > > Not having the guarantee that a single shell_pid is associated to a > single postgres.exe would be a problem. Now the patch includes this > code: > + if (ppe.th32ParentProcessID == shell_pid && > + strcmp("postgres.exe", ppe.szExeFile) == 0) > + { > + if (pm_pid != ppe.th32ProcessID && pm_pid != 0) > + multiple_children = true; > + pm_pid = ppe.th32ProcessID; > + } > > Which is basically giving this guarantee? multiple_children should > never happen once the autorun part is removed. Is that right? The patch indeed ensures the relationship between the parent pg_ctl.exe and postgres.exe. However, for some reason, in my Windows 11 environment with the /D option specified, I observed that another cmd.exe is spawned as the second child process of the parent cmd.exe. This is why there is a need to verify the executable file name. I have no idea how the second cmd.exe is being spawned. > +* The launcher shell might start other cmd.exe instances or programs > +* besides postgres.exe. Veryfying the program file name is essential. > > With the autorun part of cmd.exe removed, what's still relevant here? Yes, if the strcmp() is commented out, multiple_children sometimes becomes true.. > s/Veryfying/Verifying/. Oops! > Perhaps 0002 should make more efforts in documenting things like > th32ProcessID and th32ParentProcessID. Is it correct to understand that you are requesting changes as follows? --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -1995,11 +1995,14 @@ pgwin32_find_postmaster_pid(pid_t shell_pid) * * Check for duplicate processes to ensure reliability. * -* The launcher shell might start other cmd.exe instances or programs -* besides postgres.exe. Verifying the program file name is essential. -* -* The launcher shell process isn't checked in this function. It will be -* checked by the caller. +* The ppe entry to be examined is identified by th32ParentProcessID, which +* should correspond to the cmd.exe process that executes the postgres.exe +* binary. Additionally, th32ProcessID in the same entry should be the PID +* of the launched postgres.exe. However, even though we have launched the +* parent cmd.exe with the /D option specified, it is sometimes observed +* that another cmd.exe is launched for unknown reasons. Therefore, it is +* crucial to verify the program file name to avoid returning the wrong +* PID. */ regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From f9f2486f18502357fb76f2f1da00e19db40b2bc9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 14:46:50 +0900 Subject: [PATCH v6 1/2] Improve pg_ctl postmaster process check on Windows Currently pg_ctl on Windows does not verify that it actually executed a postmaster process due to lack of process ID knowledge. This can lead to false positives in cases where another pg_ctl instance starts a different server simultaneously. This patch adds the capability to identify the process ID of the launched postmaster on Windows, similar to other OS versions, ensuring more reliable detection of concurrent server startups. --- src/bin/pg_ctl/pg_ctl.c | 105 1 file changed, 96 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 6900b27675..dec0afdb29 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -131,6 +131,7 @@
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Michael Paquier writes: > I am wondering if we'd better backpatch all that, TBH. Seems like a good idea to me. regards, tom lane
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Tue, Jan 09, 2024 at 09:40:12PM -0500, Tom Lane wrote: > No Windows expert here, but it does seem like the same argument > applies. Yeah, I've applied the same restriction for pg_regress to avoid similar problems as we spawn a postgres process in this case. I've tested it and it was not causing issues in my own setup or the CI. I am wondering if we'd better backpatch all that, TBH. -- Michael signature.asc Description: PGP signature
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Michael Paquier writes: > I have now applied 0001 for pg_ctl. > While reviewing that, I have also noticed spawn_process() in > pg_regress.c that includes direct command invocations with cmd.exe /c. > Could it make sense to append an extra /d for this case as well? No Windows expert here, but it does seem like the same argument applies. regards, tom lane
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Tue, Jan 09, 2024 at 09:40:23AM +0900, Michael Paquier wrote: > On Fri, Jan 05, 2024 at 02:58:55PM -0500, Robert Haas wrote: > > I'm not a Windows expert, but my guess is that 0001 is a very good > > idea. I hope someone who is a Windows expert will comment on that. > > I am +1 on 0001. It is just something we've never anticipated when > these wrappers around cmd in pg_ctl were written. I have now applied 0001 for pg_ctl. While reviewing that, I have also noticed spawn_process() in pg_regress.c that includes direct command invocations with cmd.exe /c. Could it make sense to append an extra /d for this case as well? -- Michael signature.asc Description: PGP signature
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Fri, Jan 05, 2024 at 02:58:55PM -0500, Robert Haas wrote: > I'm not a Windows expert, but my guess is that 0001 is a very good > idea. I hope someone who is a Windows expert will comment on that. I am +1 on 0001. It is just something we've never anticipated when these wrappers around cmd in pg_ctl were written. > 0002 seems problematic to me. One potential issue is that it would > break if someone renamed postgres.exe to something else -- although > that's probably not really a serious problem. We do a find_other_exec_or_die() on "postgres" with what could be a custom execution path. So we're actually sure that the binary will be there in the start path, no? I don't like much the hardcoded dependency to .exe here. > A bigger issue is that > it seems like it would break if someone used pg_ctl to start several > instances in different data directories on the same machine. If I'm > understanding correctly, that currently mostly works, and this would > break it. Not having the guarantee that a single shell_pid is associated to a single postgres.exe would be a problem. Now the patch includes this code: + if (ppe.th32ParentProcessID == shell_pid && + strcmp("postgres.exe", ppe.szExeFile) == 0) + { + if (pm_pid != ppe.th32ProcessID && pm_pid != 0) + multiple_children = true; + pm_pid = ppe.th32ProcessID; + } Which is basically giving this guarantee? multiple_children should never happen once the autorun part is removed. Is that right? +* The launcher shell might start other cmd.exe instances or programs +* besides postgres.exe. Veryfying the program file name is essential. With the autorun part of cmd.exe removed, what's still relevant here? s/Veryfying/Verifying/. Perhaps 0002 should make more efforts in documenting things like th32ProcessID and th32ParentProcessID. -- Michael signature.asc Description: PGP signature
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Tue, Oct 24, 2023 at 4:28 AM Kyotaro Horiguchi wrote: > In the attached, fixed the existing two messages, and adjusted one > message to display an error code, all in the consistent format. Hi, I'm not a Windows expert, but my guess is that 0001 is a very good idea. I hope someone who is a Windows expert will comment on that. 0002 seems problematic to me. One potential issue is that it would break if someone renamed postgres.exe to something else -- although that's probably not really a serious problem. A bigger issue is that it seems like it would break if someone used pg_ctl to start several instances in different data directories on the same machine. If I'm understanding correctly, that currently mostly works, and this would break it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Tue, 24 Oct 2023 07:37:22 +, "Hayato Kuroda (Fujitsu)" wrote in > Dear Horiguchi-san, > > Thanks for updates! I was quite not sure the Windows env, but I can post > comments. > (We need reviews by windows-friendly developers...) Indeed, I haven't managed to successfully build using Meson on Windows... > Formatting of messages for write_stderr() seem different from others. In v3, > I slightly modified for readability like below. I wanted to let you know just > in case > because you did not say anything about these changes... Ah. Sorry, I was lazy about the messages because I didn't regard this to be at that stage yet. In the attached, fixed the existing two messages, and adjusted one message to display an error code, all in the consistent format. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center >From 65013b5ccebd101f91be0e46b5209dc9cfcc7d5f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 11:43:57 +0900 Subject: [PATCH v5 1/3] Disable autoruns for cmd.exe on Windows On Windows, we use cmd.exe to launch the postmaster process for easy redirection setup. However, cmd.exe may execute other programs at startup due to autorun configurations. This patch adds /D flag to the launcher cmd.exe command line to disable autorun settings written in the registry. --- src/bin/pg_ctl/pg_ctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..3ac2fcc004 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -553,11 +553,11 @@ start_postmaster(void) else close(fd); - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file); } else - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, , false)) -- 2.39.3 >From 18c9a319b5160721d08cc1d3a3df770a63756f65 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 14:46:50 +0900 Subject: [PATCH v5 2/3] Improve pg_ctl postmaster process check on Windows Currently pg_ctl on Windows does not verify that it actually executed a postmaster process due to lack of process ID knowledge. This can lead to false positives in cases where another pg_ctl instance starts a different server simultaneously. This patch adds the capability to identify the process ID of the launched postmaster on Windows, similar to other OS versions, ensuring more reliable detection of concurrent server startups. --- src/bin/pg_ctl/pg_ctl.c | 102 1 file changed, 93 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 3ac2fcc004..221049db0e 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -132,6 +132,7 @@ static void adjust_data_dir(void); #ifdef WIN32 #include +#include static bool pgwin32_IsInstalled(SC_HANDLE); static char *pgwin32_CommandLine(bool); static void pgwin32_doRegister(void); @@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *); static void pgwin32_doRunAsService(void); static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service); static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken); +static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid); #endif static pid_t get_pgpid(bool is_status_request); @@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = pgwin32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the @@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken) return tokenPrivs; } + +/* + * Find the PID of the launched postmaster. + * + * On Windows, the cmd.exe doesn't support the exec command. As a result, we + * don't directly get the postmaster's PID. This function identifies the PID of + * the postmaster started by the child cmd.exe. + * + * Returns the postmaster's
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Horiguchi-san, Thanks for updates! I was quite not sure the Windows env, but I can post comments. (We need reviews by windows-friendly developers...) > Other error cases will fit to "shouldn't occur under normal > conditions" errors. Formatting of messages for write_stderr() seem different from others. In v3, I slightly modified for readability like below. I wanted to let you know just in case because you did not say anything about these changes... ``` + /* create a process snapshot */ + hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0); + if (hSnapshot == INVALID_HANDLE_VALUE) + { + write_stderr(_("%s: could not create a snapshot: error code %lu\n"), +progname, (unsigned long) GetLastError()); + exit(1); + } + + /* start iterating on the snapshot */ + ppe.dwSize = sizeof(PROCESSENTRY32); + if (!Process32First(hSnapshot, )) + { + write_stderr(_("%s: cound not retrieve information about the process: error code %lu\n"), +progname, GetLastError()); + exit(1); + } + ``` Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Mon, 23 Oct 2023 08:57:19 +, "Hayato Kuroda (Fujitsu)" wrote in > > I agree with your suggestion. Ultimately, if there's a possibility > > for this to be committed, the message will be consolidated to "could > > not start server". > > Based on the suggestion, I tried to update the patch. > A new argument is_valid is added for reporting callee. Also, reporting formats > are adjusted based on other functions. How do you think? An equivalent check is already done shortly afterward in the calling function. Therefore, we can simply remove the code path for "launcher shell died", and it will work the same way. Please find the attached. Other error cases will fit to "shouldn't occur under normal conditions" errors. There is a possibility that the launcher shell terminates while postmaster is running. Even in such a case, the server continue working without any problems. I contemplated accomodating this case but the effort required seemed disproportionate to the possibility. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 530dc21be72e4e7f5ea199a9fceee00bbb88cf75 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 11:43:57 +0900 Subject: [PATCH v4 1/3] Disable autoruns for cmd.exe on Windows On Windows, we use cmd.exe to launch the postmaster process for easy redirection setup. However, cmd.exe may execute other programs at startup due to autorun configurations. This patch adds /D flag to the launcher cmd.exe command line to disable autorun settings written in the registry. --- src/bin/pg_ctl/pg_ctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..3ac2fcc004 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -553,11 +553,11 @@ start_postmaster(void) else close(fd); - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file); } else - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, , false)) -- 2.39.3 >From 913a0fbe0937331fca6ea4099834ed3001714339 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 24 Oct 2023 14:46:50 +0900 Subject: [PATCH v4 2/3] Improve pg_ctl postmaster process check on Windows Currently pg_ctl on Windows does not verify that it actually executed a postmaster process due to lack of process ID knowledge. This can lead to false positives in cases where another pg_ctl instance starts a different server simultaneously. This patch adds the capability to identify the process ID of the launched postmaster on Windows, similar to other OS versions, ensuring more reliable detection of concurrent server startups. --- src/bin/pg_ctl/pg_ctl.c | 102 1 file changed, 93 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 3ac2fcc004..9c0168b075 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -132,6 +132,7 @@ static void adjust_data_dir(void); #ifdef WIN32 #include +#include static bool pgwin32_IsInstalled(SC_HANDLE); static char *pgwin32_CommandLine(bool); static void pgwin32_doRegister(void); @@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *); static void pgwin32_doRunAsService(void); static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service); static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken); +static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid); #endif static pid_t get_pgpid(bool is_status_request); @@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = pgwin32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the @@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken) return tokenPrivs; } + +/* + * Find the PID of the launched postmaster. + * + * On Windows,
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Horiguchi-san, Shlok, > > At Fri, 6 Oct 2023 12:28:32 +0530, Shlok Kyal wrote > i> D:\project\pg_dev\bin>pg_ctl -D ../data -l data2.log start > > pg_ctl: another server might be running; trying to start server anyway > > waiting for server to startpg_ctl: launcher shell died > > > > The output message after patch is different from the HEAD. I felt that > > with patch as well we should get the message "pg_ctl: could not start > > server". > > Is this message change intentional? > > Partly no, partly yes. My focus was on verifying the accuracy of > identifying the actual postmaster PID on Windows. The current patch > provides a detailed description of the events, primarily because I > lack a comprehensive understanding of both the behavior of Windows > APIs and the associated processes. Given that context, the messages > essentially serve debugging purposes. > > I agree with your suggestion. Ultimately, if there's a possibility > for this to be committed, the message will be consolidated to "could > not start server". Based on the suggestion, I tried to update the patch. A new argument is_valid is added for reporting callee. Also, reporting formats are adjusted based on other functions. How do you think? Best Regards, Hayato Kuroda FUJITSU LIMITED v3-0001-Disable-autoruns-for-cmd.exe-on-Windows.patch Description: v3-0001-Disable-autoruns-for-cmd.exe-on-Windows.patch v3-0002-Improve-pg_ctl-postmaster-process-check-on-Window.patch Description: v3-0002-Improve-pg_ctl-postmaster-process-check-on-Window.patch v3-0003-Remove-short-sleep-from-001_start_stop.pl.patch Description: v3-0003-Remove-short-sleep-from-001_start_stop.pl.patch
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Thank you for testing this! At Fri, 6 Oct 2023 12:28:32 +0530, Shlok Kyal wrote i> D:\project\pg_dev\bin>pg_ctl -D ../data -l data2.log start > pg_ctl: another server might be running; trying to start server anyway > waiting for server to startpg_ctl: launcher shell died > > The output message after patch is different from the HEAD. I felt that > with patch as well we should get the message "pg_ctl: could not start > server". > Is this message change intentional? Partly no, partly yes. My focus was on verifying the accuracy of identifying the actual postmaster PID on Windows. The current patch provides a detailed description of the events, primarily because I lack a comprehensive understanding of both the behavior of Windows APIs and the associated processes. Given that context, the messages essentially serve debugging purposes. I agree with your suggestion. Ultimately, if there's a possibility for this to be committed, the message will be consolidated to "could not start server". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Fri, 6 Oct 2023 at 11:38, Hayato Kuroda (Fujitsu) wrote: > > Dear Horiguchi-san, > > Thank you for making a patch! They can pass ci. > I'm still not sure what should be, but I can respond a part. > > > Another issue is.. that I haven't been able to cause the false > > positive of pg_ctl start.. Do you have a concise reproducer of the > > issue? > > I found a short sleep in pg_ctl/t/001_start_stop.pl. This was introduced in > 6bcce2580 to ensure waiting more than 2 seconds. I've tested on my CI and > found that removing the sleep can trigger the failure. Also, I confirmed your > patch > fixes the problem. PSA the small patch for cfbot. 0001 and 0002 were not > changed. I have tested the patches on my windows setup. I am trying to start two postgres servers with an interval of 5 secs. with HEAD (when same server is started after an interval of 5 secs): D:\project\pg\bin>pg_ctl -D ../data -l data2.log start pg_ctl: another server might be running; trying to start server anyway waiting for server to start stopped waiting pg_ctl: could not start server Examine the log output. with Patch:(when same server is started after an interval of 5 secs) D:\project\pg_dev\bin>pg_ctl -D ../data -l data2.log start pg_ctl: another server might be running; trying to start server anyway waiting for server to startpg_ctl: launcher shell died The output message after patch is different from the HEAD. I felt that with patch as well we should get the message "pg_ctl: could not start server". Is this message change intentional? Thanks, Shlok Kumar Kyal
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Horiguchi-san, Thank you for making a patch! They can pass ci. I'm still not sure what should be, but I can respond a part. > Another issue is.. that I haven't been able to cause the false > positive of pg_ctl start.. Do you have a concise reproducer of the > issue? I found a short sleep in pg_ctl/t/001_start_stop.pl. This was introduced in 6bcce2580 to ensure waiting more than 2 seconds. I've tested on my CI and found that removing the sleep can trigger the failure. Also, I confirmed your patch fixes the problem. PSA the small patch for cfbot. 0001 and 0002 were not changed. Best Regards, Hayato Kuroda FUJITSU LIMITED v2-0001-Disable-autoruns-for-cmd.exe-on-Windows.patch Description: v2-0001-Disable-autoruns-for-cmd.exe-on-Windows.patch v2-0002-Improve-pg_ctl-postmaster-process-check-on-Window.patch Description: v2-0002-Improve-pg_ctl-postmaster-process-check-on-Window.patch v2-0003-Remove-short-sleep-from-001_start_stop.pl.patch Description: v2-0003-Remove-short-sleep-from-001_start_stop.pl.patch
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Wed, 20 Sep 2023 14:18:41 +0900 (JST), Kyotaro Horiguchi wrote in > I was able to see the trouble in the CI environment, but not > locally. I'll delve deeper into this. Thanks you for bringing it to my > attention. I found two instances with multiple child processes. # child-pid / parent-pid / given-pid : exec name process parent PID child PID target PID exec file shell 1228 64721228 cmd.exe child 5184 12281228 cmd.exe child 6956 12281228 postgres.exe > launcher shell executed multiple processes process parent PID child PID target PID exec file shell 4296 58804296 cmd.exe child 5156 42964296 agent.exe child 5640 42964296 postgres.exe > launcher shell executed multiple processes It looks like the environment has autorun setups for cmd.exe. There's another known issue related to auto-launching chcp at startup. Ideally, we would avoid such behavior in the postmaster-launcher shell. I think we should add "/D" flag to cmd.exe command line, perhaps in a separate patch. Even after making that change, I still see something being launched from the launcher cmd.exe... process parent PID child PID target PID exec file shell 2784 66682784 cmd.exe child 6140 27842784 MicrosoftEdgeUpdate.exe child 6260 27842784 postgres.exe > launcher shell executed multiple processes I'm not sure what triggers this; perhaps some other kind of hooks? If we cannot avoid this behavior, we'll have to verify the executable file name. It should be fine, given that the file name is constant, but I'm not fully convinced that this is the ideal solution. Another issue is.. that I haven't been able to cause the false positive of pg_ctl start.. Do you have a concise reproducer of the issue? regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From c6de7b541be55b01bbd0d2e8e8d95aac6799a82c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 21 Sep 2023 16:54:33 +0900 Subject: [PATCH 1/3] Disable autoruns for cmd.exe on Windows On Windows, we use cmd.exe to launch the postmaster process for easy redirection setup. However, cmd.exe may execute other programs at startup due to autorun configurations. This patch adds /D flag to the launcher cmd.exe command line to disable autorun settings written in the registry. --- src/bin/pg_ctl/pg_ctl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..3ac2fcc004 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -553,11 +553,11 @@ start_postmaster(void) else close(fd); - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file); } else - cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"", + cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"", comspec, exec_path, pgdata_opt, post_opts, DEVNULL); if (!CreateRestrictedProcess(cmd, , false)) -- 2.39.3 >From f1484974f5bcf14d5ac3c6a702dcb02d0eb45f9f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 20 Sep 2023 13:16:35 +0900 Subject: [PATCH 2/3] Improve pg_ctl postmaster process check on Windows Currently pg_ctl on Windows does not verify that it actually executed a postmaster process due to lack of process ID knowledge. This can lead to false positives in cases where another pg_ctl instance starts a different server simultaneously. This patch adds the capability to identify the process ID of the launched postmaster on Windows, similar to other OS versions, ensuring more reliable detection of concurrent server startups. --- src/bin/pg_ctl/pg_ctl.c | 109 1 file changed, 100 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 3ac2fcc004..ed1b0c43fc 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -132,6 +132,7 @@ static void adjust_data_dir(void); #ifdef WIN32 #include +#include static bool pgwin32_IsInstalled(SC_HANDLE); static char *pgwin32_CommandLine(bool); static void pgwin32_doRegister(void); @@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *); static void pgwin32_doRunAsService(void); static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service); static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken); +static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid); #endif static pid_t get_pgpid(bool is_status_request); @@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; -
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Tue, 19 Sep 2023 13:48:55 +, "Hayato Kuroda (Fujitsu)" wrote in > Dear Horiguchi-san, > > > I added the thread to next CF entry, so let's see the how cfbot says. > > At least there are several compiler warnings. E.g., > > * pgwin32_find_postmaster_pid() has "return;", but IIUC it should be "exit(1)" > * When DWORD is printed, "%lx" should be used. > * The variable "flags" seems not needed. Yeah, I thought that they all have been fixed but.. you are right in every respect. > Here is a patch which suppresses warnings, whereas test would fail... > You can use it if acceptable. I was able to see the trouble in the CI environment, but not locally. I'll delve deeper into this. Thanks you for bringing it to my attention. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Horiguchi-san, > I added the thread to next CF entry, so let's see the how cfbot says. At least there are several compiler warnings. E.g., * pgwin32_find_postmaster_pid() has "return;", but IIUC it should be "exit(1)" * When DWORD is printed, "%lx" should be used. * The variable "flags" seems not needed. Here is a patch which suppresses warnings, whereas test would fail... You can use it if acceptable. Best Regards, Hayato Kuroda FUJITSU LIMITED pg_ctl_waits_for_postmasters_pid_on_windows_3.patch Description: pg_ctl_waits_for_postmasters_pid_on_windows_3.patch
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Horiguchi-san, I have tested your patch on my CI, but several test could not patch with error: "pg_ctl: launcher shell executed multiple processes". I added the thread to next CF entry, so let's see the how cfbot says. [1]: https://commitfest.postgresql.org/45/4573/ Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Fri, 8 Sep 2023 08:02:57 +, "Hayato Kuroda (Fujitsu)" wrote in > > > Ditching cmd.exe seems like a big hassle. So, on the flip side, I > > > tried to identify the postmaster PID using the shell's PID, and it > > > seem to work. The APIs used are avaiable from XP/2003 onwards. > > According to 495ed0ef2, Windows 10 seems the minimal requirement for using > the postgres. So the approach seems OK. > > Followings are my comment, but I can say only cosmetic ones because I do not > have > windows machine which can run postgres. Thank you for the comment! > 1. > Forward declaration seems missing. In the pg_ctl.c, the static function seems > to > be declared even if there is only one caller (c.f., GetPrivilegesToDelete). Agreed. > 2. > I think the argument should be pid_t. Yeah, I didn't linger on that detail earlier. But revisiting it, I coucur it is best suited since it is a local function in pg_ctl.c. I've now positioned it at the end of a WIN32 section defining other win32-specific functions. Hence, a forward declaration became necessary:p > 3. > I'm not sure the return type of the function should be pid_t or not. According > to the document, DWORD corrresponds to the pid_t. In win32_port.h, the pid_t > is > defiend as int (_MSC_VER seems to be defined when the VisualStduio is used). > It > is harmless, but I perfer to match the interface between caller/callee. IIUC > we > can add just a cast. For the reason previously stated, I've adjusted the type for both the parameter and the return value to pid_t. start_postmaster() already assumed that pid_t is wider than DWORD. I noticed that PID 0 is valid on Windows. However, it is consistently the PID for the system idle process, so it can't be associated with cmd.exe or postgres. I've added a comment noting that premise. Also I did away with an unused variable. For the CreateToolhelp32Snapshot function, I changed the second parameter to 0 from shell_pid, since it is not used when using TH32CS_SNAPPROCESS. I changed the comparison operator for pid_t from > to !=, ensuring correct behavior even with negative values. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..68ab83779c 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -132,6 +132,7 @@ static void adjust_data_dir(void); #ifdef WIN32 #include +#include static bool pgwin32_IsInstalled(SC_HANDLE); static char *pgwin32_CommandLine(bool); static void pgwin32_doRegister(void); @@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *); static void pgwin32_doRunAsService(void); static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service); static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken); +static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid); #endif static pid_t get_pgpid(bool is_status_request); @@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = pgwin32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the @@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken) return tokenPrivs; } + +/* + * Find the PID of the launched postmaster. + * + * On Windows, the cmd.exe doesn't support the exec command. As a result, we + * don't directly get the postmaster's PID. This function identifies the PID of + * the postmaster started by the child cmd.exe. + * + * Returns the postmaster's PID. If the shell is alive but the postmaster is + * missing, returns 0. Otherwise terminates this command with an error. + * + * This function uses PID 0 as an invalid value, assuming the system idle + * process occupies it and it won't be a PID for a shell or postmaster. + */ +pid_t +pgwin32_find_postmaster_pid(pid_t shell_pid) +{ + HANDLE hSnapshot; + DWORD flags; + PROCESSENTRY32 ppe; + pid_t pm_pid = 0; /* abusing 0 as an invalid value */ + bool shell_exists = false; + bool multiple_children = false; + DWORD last_error; + + /* create a process snapshot */ + hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0); + if (hSnapshot == INVALID_HANDLE_VALUE) + { + write_stderr(_("%s:
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Hoiguchi-san, Thank you for making the patch! > It doesn't seem to work as expected. We still lose the relationship > between the PID file and the launched postmaster. Yes, I did not expect that the relationship can be kept. Conceptually +1 for your approach. > > Ditching cmd.exe seems like a big hassle. So, on the flip side, I > > tried to identify the postmaster PID using the shell's PID, and it > > seem to work. The APIs used are avaiable from XP/2003 onwards. According to 495ed0ef2, Windows 10 seems the minimal requirement for using the postgres. So the approach seems OK. Followings are my comment, but I can say only cosmetic ones because I do not have windows machine which can run postgres. 1. Forward declaration seems missing. In the pg_ctl.c, the static function seems to be declared even if there is only one caller (c.f., GetPrivilegesToDelete). 2. I think the argument should be pid_t. 3. I'm not sure the return type of the function should be pid_t or not. According to the document, DWORD corrresponds to the pid_t. In win32_port.h, the pid_t is defiend as int (_MSC_VER seems to be defined when the VisualStduio is used). It is harmless, but I perfer to match the interface between caller/callee. IIUC we can add just a cast. ``` #ifdef _MSC_VER typedef int pid_t; #endif ``` Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Fri, 08 Sep 2023 14:17:16 +0900 (JST), Kyotaro Horiguchi wrote in > Ditching cmd.exe seems like a big hassle. So, on the flip side, I > tried to identify the postmaster PID using the shell's PID, and it > seem to work. The APIs used are avaiable from XP/2003 onwards. Cleaned it up a bit. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..25d3354dc6 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -32,6 +32,9 @@ #ifdef WIN32 /* on Unix, we don't need libpq */ #include "pqexpbuffer.h" #endif +#ifdef WIN32 +#include +#endif typedef enum @@ -425,6 +428,90 @@ free_readfile(char **optlines) * start/test/stop routines */ +#ifdef WIN32 +/* + * Find the PID of the launched postmaster. + * + * On Windows, the cmd.exe doesn't support the exec command. As a result, we + * don't directly get the postmaster's PID. This function identifies the PID of + * the postmaster started by the child cmd.exe. + * + * Returns the postmaster's PID. If the shell is alive but the postmaster is + * missing, returns 0. Otherwise terminates this command with an error. + */ +DWORD +win32_find_postmaster_pid(int shell_pid) +{ + HANDLE hSnapshot; + DWORD flags; + DWORD p_pid = shell_pid; + PROCESSENTRY32 ppe; + DWORD pm_pid = 0; + bool shell_exists = false; + bool multiple_children = false; + DWORD last_error; + + /* Create a process snapshot */ + hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, shell_pid); + if (hSnapshot == INVALID_HANDLE_VALUE) + { + write_stderr(_("%s: CreateToolhelp32Snapshot failed\n"), + progname); + return; + } + + /* start iterating on the snapshot */ + ppe.dwSize = sizeof(PROCESSENTRY32); + if (!Process32First(hSnapshot, )) + { + write_stderr(_("%s: Process32First failed: errcode=%08x\n"), + progname, GetLastError()); + exit(1); + } + + /* iterate over the snapshot */ + do + { + if (ppe.th32ProcessID == shell_pid) + shell_exists = true; + if (ppe.th32ParentProcessID == shell_pid) + { + if (pm_pid > 0) +multiple_children = true; + pm_pid = ppe.th32ProcessID; + } + } + while (Process32Next(hSnapshot, )); + + /* avoid multiple calls primary for clarity, not out of necessity */ + last_error = GetLastError(); + if (last_error != ERROR_NO_MORE_FILES) + { + write_stderr(_("%s: Process32Next failed: errcode=%08x\n"), + progname, last_error); + exit(1); + } + CloseHandle(hSnapshot); + + /* assuming the launching shell executes a single process */ + if (multiple_children) + { + write_stderr(_("%s: launcher shell executed multiple processes\n"), + progname); + exit(1); + } + + /* check if the process is still alive */ + if (!shell_exists) + { + write_stderr(_("%s: launcher shell died\n"), progname); + exit(1); + } + + return pm_pid; +} +#endif + /* * Start the postmaster and return its PID. * @@ -609,7 +696,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = win32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,14 +710,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 7 Sep 2023 10:53:41 +, "Hayato Kuroda (Fujitsu)" wrote in > My first idea is that to move the checking part to above, but this may not > handle > the case the postmaster is still alive (now sure this is real issue). Do we > have to > add a new indicator which ensures the identity of processes for windows? > Please tell me how you feel. It doesn't seem to work as expected. We still lose the relationship between the PID file and the launched postmaster. > > Now, I > > also recall that the processes spawned by pg_ctl on Windows make the > > status handling rather tricky to reason about.. > > Did you say about the below comment? Currently I have no idea to make > codes more proper, sorry. > > ``` >* On Windows, we may be checking the postmaster's parent > shell, but >* that's fine for this purpose. > ``` Ditching cmd.exe seems like a big hassle. So, on the flip side, I tried to identify the postmaster PID using the shell's PID, and it seem to work. The APIs used are avaiable from XP/2003 onwards. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 807d7023a9..d54cc6ab54 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -32,6 +32,9 @@ #ifdef WIN32 /* on Unix, we don't need libpq */ #include "pqexpbuffer.h" #endif +#ifdef WIN32 +#include +#endif typedef enum @@ -425,6 +428,91 @@ free_readfile(char **optlines) * start/test/stop routines */ +#ifdef WIN32 +/* + * Find the PID of the launched postmaster. + * + * On Windows, the cmd.exe doesn't support the exec command. As a result, we + * don't directly get the postmaster's PID. This function identifies the PID of + * the postmaster started by the child cmd.exe. + * + * Returns the postmaster's PID. If the shell is alive but the postmaster is + * missing, returns 0. Otherwise terminates this command with an error. + */ +DWORD +win32_find_postmaster_pid(int shell_pid) +{ + HANDLE hSnapshot; + DWORD flags; + DWORD p_pid = shell_pid; + PROCESSENTRY32 ppe; + DWORD pm_pid = 0; + bool shell_exists = false; + bool multiple_children = false; + DWORD last_error; + + /* Create a process snapshot */ + hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, shell_pid); + if (hSnapshot == INVALID_HANDLE_VALUE) + { + write_stderr(_("%s: CreateToolhelp32Snapshot failed\n"), + progname); + return; + } + + /* start iterating on the snapshot */ + ppe.dwSize = sizeof(PROCESSENTRY32); + if (!Process32First(hSnapshot, )) + { + write_stderr(_("%s: Process32First failed: errcode=%08x\n"), + progname, GetLastError()); + exit(1); + } + + /* iterate over the snapshot */ + do + { + if (ppe.th32ProcessID == shell_pid) + shell_exists = true; + if (ppe.th32ParentProcessID == shell_pid) + { + if (pm_pid > 0) +multiple_children = true; + pm_pid = ppe.th32ProcessID; + } + } + while (Process32Next(hSnapshot, )); + + /* avoid multiple calls primary for clarity, not out of necessity */ + last_error = GetLastError(); + if (last_error != ERROR_NO_MORE_FILES) + { + write_stderr(_("%s: Process32Next failed: errcode=%08x\n"), + progname, last_error); + exit(1); + } + CloseHandle(hSnapshot); + + /* assuming the launching shell executes a single process */ + if (multiple_children) + { + write_stderr(_("%s: lancher shell executed multiple processes\n"), + progname); + exit(1); + } + + /* Check if the process is gone for any reason */ + if (!shell_exists) + { + /* Report the condition as server start failure */ + write_stderr(_("%s: launcher shell died\n"), progname); + exit(1); + } + + return pm_pid; +} +#endif + /* * Start the postmaster and return its PID. * @@ -609,7 +697,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; - +#ifndef WIN32 + pid_t wait_pid = pm_pid; +#else + pid_t wait_pid = win32_find_postmaster_pid(pm_pid); +#endif /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking @@ -619,21 +711,15 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint) */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); - if (pmstart >= start_time - 2 && -#ifndef WIN32 -pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ -pmpid > 0 -#endif -) + + if (pmstart >= start_time - 2 && pmpid == wait_pid) { /* * OK, seems to be a valid pidfile from our child. Check the * status line (this assumes a v10 or later server). */ char *pmstatus = optlines[LOCK_FILE_LINE_PM_STATUS - 1]; - + if (strcmp(pmstatus, PM_STATUS_READY) == 0 || strcmp(pmstatus, PM_STATUS_STANDBY) == 0) {
RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear Michael, Thank you for replying! > Not failing on `pg_ctl start` if the command is run on a data folder > that has already been started previously by a different command with a > postmaster still alive feels like cheating, because pg_ctl is lying > about its result. If pg_ctl wants to start a cluster but is not able > to do it, either because the postmaster failed at startup or because > the cluster has already started, it should report a failure. I have a same feelings as you. Users may use the return code in their batch file and they may decide what to do based on the wrong status. Reporting the status more accurately is nice. My first idea is that to move the checking part to above, but this may not handle the case the postmaster is still alive (now sure this is real issue). Do we have to add a new indicator which ensures the identity of processes for windows? Please tell me how you feel. > Now, I > also recall that the processes spawned by pg_ctl on Windows make the > status handling rather tricky to reason about.. Did you say about the below comment? Currently I have no idea to make codes more proper, sorry. ``` * On Windows, we may be checking the postmaster's parent shell, but * that's fine for this purpose. ``` Best Regards, Hayato Kuroda FUJITSU LIMITED move_checking.patch Description: move_checking.patch
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
On Thu, Sep 07, 2023 at 07:07:36AM +, Hayato Kuroda (Fujitsu) wrote: > # Problem > > The "pg_ctl start" command returns 0 (succeeded) even if the cluster has > already been started. This occurs on Windows environment, and when the command > is executed just after postmaster starts. Not failing on `pg_ctl start` if the command is run on a data folder that has already been started previously by a different command with a postmaster still alive feels like cheating, because pg_ctl is lying about its result. If pg_ctl wants to start a cluster but is not able to do it, either because the postmaster failed at startup or because the cluster has already started, it should report a failure. Now, I also recall that the processes spawned by pg_ctl on Windows make the status handling rather tricky to reason about.. -- Michael signature.asc Description: PGP signature
pg_ctl start may return 0 even if the postmaster has been already started on Windows
Dear hackers, While investigating the cfbot failure [1], I found a strange behavior of pg_ctl command. How do you think? Is this a bug to be fixed or in the specification? # Problem The "pg_ctl start" command returns 0 (succeeded) even if the cluster has already been started. This occurs on Windows environment, and when the command is executed just after postmaster starts. # Analysis The primal reason is in wait_for_postmaster_start(). In this function the postmaster.pid file is read and checked whether the start command is successfully done or not. Check (1) requires that the postmaster must be started after the our pg_ctl command, but 2 seconds delay is accepted. In the linux mode, the check (2) is also executed to ensures that the forked process modified the file, so this time window is not so problematic. But in the windows system, (2) is ignored, *so the pg_ctl command may be succeeded if the postmaster is started within 2 seconds.* ``` if ((optlines = readfile(pid_file, )) != NULL && numlines >= LOCK_FILE_LINE_PM_STATUS) { /* File is complete enough for us, parse it */ pid_t pmpid; time_t pmstart; /* * Make sanity checks. If it's for the wrong PID, or the recorded * start time is before pg_ctl started, then either we are looking * at the wrong data directory, or this is a pre-existing pidfile * that hasn't (yet?) been overwritten by our child postmaster. * Allow 2 seconds slop for possible cross-process clock skew. */ pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); if (pmstart >= start_time - 2 && // (1) #ifndef WIN32 pmpid == pm_pid // (2) #else /* Windows can only reject standalone-backend PIDs */ pmpid > 0 #endif ``` # Appendix - how do I found? I found it while investigating the failure. In the test "pg_upgrade --check" is executed just after old cluster has been started. I checked the output file [2] and found that the banner says "Performing Consistency Checks", which meant that the parameter live_check was set to false (see output_check_banner()). This parameter is set to true when the postmaster has been started at that time and the pg_ctl start fails. That's how I find. [1]: https://cirrus-ci.com/task/4634769732927488 [2]: https://api.cirrus-ci.com/v1/artifact/task/4634769732927488/testrun/build/testrun/pg_upgrade/003_logical_replication_slots/data/t_003_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20230905T080645.548/log/pg_upgrade_internal.log Best Regards, Hayato Kuroda FUJITSU LIMITED