Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Michael Paquier writes: > On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane wrote: >> I'm not sure if this will completely fix our problems with "pg_ctl start" >> related buildfarm failures on very slow critters. It does get rid of the >> hard wired 5-second timeout, but the 60-second timeout could still be an >> issue. I think Noah was considering a patch to allow that number to be >> raised. I'd be in favor of letting pg_ctl accept a default timeout length >> from an environment variable, and then the slower critters could be fixed >> by adjusting their buildfarm configurations. > Being able to pass that as a command-line parameter (master-only > change) would be welcome as well IMO. Huh? The --timeout parameter has always been there. I'm just expressing an opinion that modifying all the makefiles and test scripts to pass through a timeout setting would be too messy. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane wrote: > So there's still something to be desired on Windows; but it's still > better than before in that we can reliably detect child process exit > instead of having to use the five-second heuristic. And of course on > Unix this is way better than before. > > So I've pushed this with some cosmetic adjustments, as well as the not > so cosmetic adjustment of making the service-start path also use handle > testing. If there are remaining problems, the buildfarm should tell us. Thanks! I used at some point when hacking postmasterProcess as well to save the handle but for clarity as this was already used for shutdown when running things as a service of the Windows SCM I put that in another variable. So far so good based on some extra testing I just did now. > I'm not sure if this will completely fix our problems with "pg_ctl start" > related buildfarm failures on very slow critters. It does get rid of the > hard wired 5-second timeout, but the 60-second timeout could still be an > issue. I think Noah was considering a patch to allow that number to be > raised. I'd be in favor of letting pg_ctl accept a default timeout length > from an environment variable, and then the slower critters could be fixed > by adjusting their buildfarm configurations. Being able to pass that as a command-line parameter (master-only change) would be welcome as well IMO. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Michael Paquier writes: >> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane wrote: >>> I think there is still room to salvage something without fully rewriting >>> the postmaster invocation logic to avoid using CMD, because it's still >>> true that checking whether the CMD process is still there should be as >>> good as checking the postmaster proper. We just can't use kill() for it. > I had a look at that, and attached is an updated patch showing the > concept of using an HANDLE that pg_ctl waits for. I agree that saving > an HANDLE the way this patch does using a static variable is a bit > ugly especially knowing that service registration uses > test_postmaster_connection as well with directly a postmaster. The > good thing is that tapcheck runs smoothly, with one single failure > though: the second call to pg_ctl start -w may succeed instead of > failing if kicked within an interval of 3 seconds after the first one, > based on the tests on my Windows VM. My guess is that this is caused > by the fact that we monitor the shell process and not the postmaster > itself. Thoughts? Yeah, it can still succeed because pg_ctl can't tell that the postmaster.pid created by the earlier invocation isn't the one it wants. It adopts the values out of that file, tests the connection, finds it works, and declares victory, not realizing that the postmaster *it* started will soon fail (or maybe already has). Waiting more than 2 seconds is enough to make sure that test_postmaster_connection sees the pre-existing pidfile as stale and doesn't believe that it represents a successful postmaster start. So there's still something to be desired on Windows; but it's still better than before in that we can reliably detect child process exit instead of having to use the five-second heuristic. And of course on Unix this is way better than before. So I've pushed this with some cosmetic adjustments, as well as the not so cosmetic adjustment of making the service-start path also use handle testing. If there are remaining problems, the buildfarm should tell us. I'm not sure if this will completely fix our problems with "pg_ctl start" related buildfarm failures on very slow critters. It does get rid of the hard wired 5-second timeout, but the 60-second timeout could still be an issue. I think Noah was considering a patch to allow that number to be raised. I'd be in favor of letting pg_ctl accept a default timeout length from an environment variable, and then the slower critters could be fixed by adjusting their buildfarm configurations. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Alvaro Herrera writes: > I wonder if you are interested in rewriting this whole thing to not use > cmd.exe at all, which as I understand is just about output redirection. FWIW, I have little interest in going there, because I'm afraid that getting it to be bug-compatible with the existing behavior on points such as quoting of -o arguments would be extremely fiddly. Perhaps it'd be worth doing it in master, but I'd be afraid to back-patch such a change --- and what I'm after right now is a back-patchable way of getting rid of the 5-second-timeout behavior. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Michael Paquier wrote: > > That would be WaitForSingleObject(handle, ms_timeout) == > > WAIT_OBJECT_0, no? The handle should be picked up from > > CreateRestrictedProcess, and I think that CloseHandle should not be > > called on pi.hProcess if we are going to wait for it afterwards. > > Reference: > > https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx > > I had a look at that, and attached is an updated patch showing the > concept of using an HANDLE that pg_ctl waits for. I wonder if you are interested in rewriting this whole thing to not use cmd.exe at all, which as I understand is just about output redirection. Earlier in the thread I pointed out that on Microsoft docs they claim that you can redirect stdout/stderr when creating a new process without going through cmd.exe; see https://www.postgresql.org/message-id/20150903202114.GG2912%40alvherre.pgsql -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Thu, Oct 8, 2015 at 3:09 PM, Michael Paquier wrote: > On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane wrote: >> Michael Paquier writes: >>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: So the attached modified patch adjusts the PID-match logic and some comments, but is otherwise what I posted before. I believe that this might actually work on Windows, but I have no way to test it. Someone please try that? (Don't forget to test the service-start path, too.) >> >>> If "pg_ctl start" is invoked directly from a command prompt, it works. >>> Now, if I run "pg_ctl start" within a script (in an msysgit >>> environment for example), pg_ctl fails, complaining that >>> postmaster.pid already exists. The TAP tests get broken as well, >> >> Reading that again, I think you mean that "pg_ctl start" works but you >> didn't try "pg_ctl start -w" manually? Because it's "-w" that's at >> issue here, and the failing test cases are using that. > > Yes, sorry. I fat-fingered the command from the prompt and forgot the > -w switch. test_postmaster_connection just behaves incorrectly, and > exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting, > could not start server, blah". > >> I think there is still room to salvage something without fully rewriting >> the postmaster invocation logic to avoid using CMD, because it's still >> true that checking whether the CMD process is still there should be as >> good as checking the postmaster proper. We just can't use kill() for it. >> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah >> mentioned upthread, if we're using fork/exec we might as well pay >> attention to waitpid's results instead. On Windows, I imagine that the >> thing to do is have start_postmaster() save aside the handle for the CMD >> process, and then in test_postmaster_connection(), check the handle state >> to see if the process is still running (I assume there's a way to do >> that). > > That would be WaitForSingleObject(handle, ms_timeout) == > WAIT_OBJECT_0, no? The handle should be picked up from > CreateRestrictedProcess, and I think that CloseHandle should not be > called on pi.hProcess if we are going to wait for it afterwards. > Reference: > https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx I had a look at that, and attached is an updated patch showing the concept of using an HANDLE that pg_ctl waits for. I agree that saving an HANDLE the way this patch does using a static variable is a bit ugly especially knowing that service registration uses test_postmaster_connection as well with directly a postmaster. The good thing is that tapcheck runs smoothly, with one single failure though: the second call to pg_ctl start -w may succeed instead of failing if kicked within an interval of 3 seconds after the first one, based on the tests on my Windows VM. My guess is that this is caused by the fact that we monitor the shell process and not the postmaster itself. Thoughts? Regards, -- Michael diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index ba189e7..81e0631 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #ifdef HAVE_SYS_RESOURCE_H @@ -110,6 +111,7 @@ static SERVICE_STATUS status; static SERVICE_STATUS_HANDLE hStatus = (SERVICE_STATUS_HANDLE) 0; static HANDLE shutdownHandles[2]; static pid_t postmasterPID = -1; +static HANDLE commandProcess = INVALID_HANDLE_VALUE; #define shutdownEvent shutdownHandles[0] #define postmasterProcess shutdownHandles[1] @@ -153,10 +155,10 @@ static int CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); -static int start_postmaster(void); +static pgpid_t start_postmaster(void); static void read_post_opts(void); -static PGPing test_postmaster_connection(bool); +static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) @@ -419,36 +421,70 @@ free_readfile(char **optlines) * start/test/stop routines */ -static int +/* + * Start the postmaster and return its PID. + * + * Currently, on Windows what we return is the PID of the shell process + * that launched the postmaster (and, we trust, is waiting for it to exit). + * So the PID is usable for "is the postmaster still running" checks, + * but cannot be compared directly to postmaster.pid. + */ +static pgpid_t start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 + pgpid_t pm_pid; + + /* Flush stdio channels just before fork, to avoid double-output problems */ + fflush(stdout); + fflush(stderr); + + pm_pid = fork(); + if (pm_pid < 0) + { + /* fork failed */ + write_stderr(_("%s: could not start server: %s\n"), + progname, strerro
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane wrote: > Michael Paquier writes: >> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >>> So the attached modified patch adjusts the PID-match logic and some >>> comments, but is otherwise what I posted before. I believe that this >>> might actually work on Windows, but I have no way to test it. Someone >>> please try that? (Don't forget to test the service-start path, too.) > >> If "pg_ctl start" is invoked directly from a command prompt, it works. >> Now, if I run "pg_ctl start" within a script (in an msysgit >> environment for example), pg_ctl fails, complaining that >> postmaster.pid already exists. The TAP tests get broken as well, > > Reading that again, I think you mean that "pg_ctl start" works but you > didn't try "pg_ctl start -w" manually? Because it's "-w" that's at > issue here, and the failing test cases are using that. Yes, sorry. I fat-fingered the command from the prompt and forgot the -w switch. test_postmaster_connection just behaves incorrectly, and exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting, could not start server, blah". > I think there is still room to salvage something without fully rewriting > the postmaster invocation logic to avoid using CMD, because it's still > true that checking whether the CMD process is still there should be as > good as checking the postmaster proper. We just can't use kill() for it. > I'd be inclined to get rid of the use of kill() on Unix as well; as Noah > mentioned upthread, if we're using fork/exec we might as well pay > attention to waitpid's results instead. On Windows, I imagine that the > thing to do is have start_postmaster() save aside the handle for the CMD > process, and then in test_postmaster_connection(), check the handle state > to see if the process is still running (I assume there's a way to do > that). That would be WaitForSingleObject(handle, ms_timeout) == WAIT_OBJECT_0, no? The handle should be picked up from CreateRestrictedProcess, and I think that CloseHandle should not be called on pi.hProcess if we are going to wait for it afterwards. Reference: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx > I can take care of the Unix side of that, but as before, somebody else > would need to code and test the Windows side. It's probably just a few > lines of code to add ... any volunteers? I could give it a shot, do you still want to rework the Unix portion of the patch? There are not many folks around able to test the patch either way. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Michael Paquier writes: > On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >> So the attached modified patch adjusts the PID-match logic and some >> comments, but is otherwise what I posted before. I believe that this >> might actually work on Windows, but I have no way to test it. Someone >> please try that? (Don't forget to test the service-start path, too.) > If "pg_ctl start" is invoked directly from a command prompt, it works. > Now, if I run "pg_ctl start" within a script (in an msysgit > environment for example), pg_ctl fails, complaining that > postmaster.pid already exists. The TAP tests get broken as well, Reading that again, I think you mean that "pg_ctl start" works but you didn't try "pg_ctl start -w" manually? Because it's "-w" that's at issue here, and the failing test cases are using that. After studying the test logs more carefully, I think the behavior they're showing is consistent with the idea that postmaster_is_alive() doesn't work on the CMD shell process. Which seems very likely now that I think about it, because we're depending on an implementation of kill() that is PG-specific. So in fact this idea doesn't work :-(. I think there is still room to salvage something without fully rewriting the postmaster invocation logic to avoid using CMD, because it's still true that checking whether the CMD process is still there should be as good as checking the postmaster proper. We just can't use kill() for it. I'd be inclined to get rid of the use of kill() on Unix as well; as Noah mentioned upthread, if we're using fork/exec we might as well pay attention to waitpid's results instead. On Windows, I imagine that the thing to do is have start_postmaster() save aside the handle for the CMD process, and then in test_postmaster_connection(), check the handle state to see if the process is still running (I assume there's a way to do that). I can take care of the Unix side of that, but as before, somebody else would need to code and test the Windows side. It's probably just a few lines of code to add ... any volunteers? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Michael Paquier writes: > On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: >> So the attached modified patch adjusts the PID-match logic and some >> comments, but is otherwise what I posted before. I believe that this >> might actually work on Windows, but I have no way to test it. Someone >> please try that? (Don't forget to test the service-start path, too.) > If "pg_ctl start" is invoked directly from a command prompt, it works. > Now, if I run "pg_ctl start" within a script (in an msysgit > environment for example), pg_ctl fails, complaining that > postmaster.pid already exists. The TAP tests get broken as well, Hm, the TAP trace makes it look like the postmaster start does actually work, but pg_ctl isn't correctly waiting for that to happen. (The complaint about "already exists" seems to be from the second start attempt, and is exactly what to expect there.) I could believe that I'd fat-fingered the PID comparison logic in test_postmaster_connection, but I don't understand how it would work from a command prompt but not from a script. Any ideas? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote: > So the attached modified patch adjusts the PID-match logic and some > comments, but is otherwise what I posted before. I believe that this > might actually work on Windows, but I have no way to test it. Someone > please try that? (Don't forget to test the service-start path, too.) If "pg_ctl start" is invoked directly from a command prompt, it works. Now, if I run "pg_ctl start" within a script (in an msysgit environment for example), pg_ctl fails, complaining that postmaster.pid already exists. The TAP tests get broken as well, attached are the log files of the run, and here is an interested bit: # Running: pg_ctl start -D C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w waiting for server to start... stopped waiting not ok 12 - pg_ctl start -w # Failed test 'pg_ctl start -w' # at C:/Users/ioltas/git/postgres/src/test/perl/TestLib.pm line 282. # Running: pg_ctl start -D C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w waiting for server to start...FATAL: lock file "postmaster.pid" already exists HINT: Is another postmaster (PID 4040) running in data directory "C:/Users/ioltas/git/postgres/src/bin/pg_ctl/tmp_testor2K/data"? LOG: database system was shut down at 2015-10-06 23:58:02 PDT LOG: MultiXact member wraparound protections are now enabled FATAL: the database system is starting up LOG: database system is ready to accept connections LOG: autovacuum launcher started stopped waiting Registering a service to the SCM works, but the service registered refuses to start. Regards, -- Michael regress_log_001_start_stop Description: Binary data regress_log_002_status Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Wed, Oct 7, 2015 at 7:05 AM, Michael Paquier wrote: > On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane wrote: >> I wrote: >>> So the attached modified patch adjusts the PID-match logic and some >>> comments, but is otherwise what I posted before. I believe that this >>> might actually work on Windows, but I have no way to test it. Someone >>> please try that? (Don't forget to test the service-start path, too.) >> >> Has anyone tried that patch on Windows? I'd like to push it in hopes of >> improving buildfarm stability, but I'm hesitant to do so without some >> confirmation that it acts as I think it will. > > I marked this patch as something to look at, though I haven't take the > time to do it yet... s/take/taken -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane wrote: > I wrote: >> So the attached modified patch adjusts the PID-match logic and some >> comments, but is otherwise what I posted before. I believe that this >> might actually work on Windows, but I have no way to test it. Someone >> please try that? (Don't forget to test the service-start path, too.) > > Has anyone tried that patch on Windows? I'd like to push it in hopes of > improving buildfarm stability, but I'm hesitant to do so without some > confirmation that it acts as I think it will. I marked this patch as something to look at, though I haven't take the time to do it yet... -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
I wrote: > So the attached modified patch adjusts the PID-match logic and some > comments, but is otherwise what I posted before. I believe that this > might actually work on Windows, but I have no way to test it. Someone > please try that? (Don't forget to test the service-start path, too.) Has anyone tried that patch on Windows? I'd like to push it in hopes of improving buildfarm stability, but I'm hesitant to do so without some confirmation that it acts as I think it will. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
I wrote: > Attached is a draft patch for this. I think it's fine for Unix (unless > someone wants to object to relying on "/bin/sh -c"), but I have no idea > whether it works for Windows. The main risk is that if CMD.EXE runs > the postmaster as a subprocess rather than overlaying itself a la shell > "exec", the PID we'd get back would apply only to CMD.EXE not to the > eventual postmaster. If so, I do not know how to fix that, or whether > it's fixable at all. > Note that this makes the test case in question fail reliably, which is > reasonable behavior IMO so I just changed the test. > If this does (or can be made to) work on Windows, I'd propose applying > it back to 9.2, where the current coding came in. Nobody seems to have stepped up to fix the Windows side of this, but after some more thought it occurred to me that maybe it doesn't need (much) fixing. There are two things that pg_ctl wants the PID for: 1. To check if the postmaster.pid file contains the expected child process PID. Well, that's a nice-to-have, but we never had it before and got along well enough, because the start-timestamp comparison check covers most real-world cases. We can omit the PID-match check on Windows. 2. To see whether the child postmaster process is still running, via a "kill(pid, 0)" test. But if we've launched the postmaster through a shell, and not used "&" semantics, presumably that shell will wait for its child process to exit and then exit itself. So *testing whether the shell is still there is just about as good as testing the real postmaster PID*. Therefore, we don't really need to worry about rewriting the Windows subprocess-launch logic. If someone would like to do it, that would be nice, but we don't have to wait for that to happen before we can get rid of the 5-second-timeout issues. So the attached modified patch adjusts the PID-match logic and some comments, but is otherwise what I posted before. I believe that this might actually work on Windows, but I have no way to test it. Someone please try that? (Don't forget to test the service-start path, too.) regards, tom lane diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 6a36d29..62db328 100644 *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** *** 28,33 --- 28,34 #include #include #include + #include #include #ifdef HAVE_SYS_RESOURCE_H *** static int CreateRestrictedProcess(char *** 153,162 static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static int start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(bool); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) --- 154,163 static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static pgpid_t start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) *** free_readfile(char **optlines) *** 419,454 * start/test/stop routines */ ! static int start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 /* * Since there might be quotes to handle here, it is easier simply to pass ! * everything to a shell to process them. ! * ! * XXX it would be better to fork and exec so that we would know the child ! * postmaster's PID directly; then test_postmaster_connection could use ! * the PID without having to rely on reading it back from the pidfile. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL); ! return system(cmd); #else /* WIN32 */ /* ! * On win32 we don't use system(). So we don't need to use & (which would ! * be START /B on win32). However, we still call the shell (CMD.EXE) with ! * it to handle redirection etc. */ PROCESS_INFORMATION pi; --- 420,489 * start/test/stop routines */ ! /* ! * Start the postmaster and return its PID. ! * ! * Currently, on Windows what we return is the PID of the shell process ! * that launched the postmaster (and, we trust, is waiting for it to exit). ! * So the PID is usable for "is the postmaster still running" checks, ! * but cannot be compared directly to postmaster.pid. ! */ ! static pgpid_t start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 + pgpid_t pm_pid; + + /* Flush stdio channels just before fork, to
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Noah Misch writes: > On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote: >> The kill() coding works on Windows (I believe); waitpid() not so much. > Windows has the concept under a different name. See postmaster.c. Well, if someone else wants to code and test that, feel free. I don't do Windows ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote: > Noah Misch writes: > > kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough. > > The kill() coding works on Windows (I believe); waitpid() not so much. Windows has the concept under a different name. See postmaster.c. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Noah Misch writes: > kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough. The kill() coding works on Windows (I believe); waitpid() not so much. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Fri, Sep 04, 2015 at 10:54:51PM -0400, Tom Lane wrote: > Noah Misch writes: > > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: > >>> This is the first time I've seen an indication that the > >>> start_postmaster change mentioned in the comment is actually important > >>> for production use, rather than just being cleanup. > > > I scratched my head awhile without thinking of a credible production use > > case > > that would notice the difference. Which one did you have in mind? > > Well, mainly that it's making our regression tests fail, which suggests > behavioral instability that could be important for production. I doubt issuing two "pg_ctl start" within two seconds, with no intervening stop, has a production use. However, ... > Aside from the specific case you diagnosed, it's clear from buildfarm > results that the five-second timeout elsewhere in > test_postmaster_connection has got funny behavior under load; there are > way too many cases where pg_ctl gives up after exactly that long, with > no useful information printed. The draft patch I posted gets rid of that > behavior ... ... fixing this is a good deal more valuable. Besides coping with stalls in early postmaster startup, pg_ctl would no longer take 5s to exit when the postmaster reports a postgresql.conf defect or other early error. On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: > If this does (or can be made to) work on Windows, I'd propose applying > it back to 9.2, where the current coding came in. That seems prudent. We'll keep meeting buildfarm members with performance variability sufficient to reach that 5s timeout. (Incidentally, the current coding first appeared in 9.1.) > --- 650,671 > } > > /* > ! * Reap child process if it died; if it remains in zombie state > then > ! * our kill-based test will think it's still running. >*/ > ! #ifndef WIN32 > { > ! int exitstatus; > > ! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG); Let's report any exit status. When a signal kills the young postmaster, it's a waste to omit that detail. > } > + #endif > > /* > ! * Check whether the child postmaster process is still alive. > This > ! * lets us exit early if the postmaster fails during startup. >*/ > ! if (!postmaster_is_alive((pid_t) pm_pid)) > return PQPING_NO_RESPONSE; kill() adds nothing when dealing with a pg_ctl child. waitpid() is enough. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Noah Misch writes: > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: >>> This is the first time I've seen an indication that the >>> start_postmaster change mentioned in the comment is actually important >>> for production use, rather than just being cleanup. > I scratched my head awhile without thinking of a credible production use case > that would notice the difference. Which one did you have in mind? Well, mainly that it's making our regression tests fail, which suggests behavioral instability that could be important for production. Aside from the specific case you diagnosed, it's clear from buildfarm results that the five-second timeout elsewhere in test_postmaster_connection has got funny behavior under load; there are way too many cases where pg_ctl gives up after exactly that long, with no useful information printed. The draft patch I posted gets rid of that behavior ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote: > I wrote: > > This is the first time I've seen an indication that the > > start_postmaster change mentioned in the comment is actually important > > for production use, rather than just being cleanup. I scratched my head awhile without thinking of a credible production use case that would notice the difference. Which one did you have in mind? > Note that this makes the test case in question fail reliably, which is > reasonable behavior IMO so I just changed the test. That's the better behavior, agreed. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Tom Lane wrote: > Andrew Dunstan writes: > > There is no equivalent of execl, nor a cmd.exe exquivalent of the > > shell's exec. But surely the equivalent of the fork/execl you're doing > > here would be a simple CreateProcess(). I don't see why you need a shell > > in the middle on Windows at all. > > The problem is to get the output redirection to work. I imagine it could > be rewritten without involving the shell, but it'd be less than trivial > (or at least, I'm not going to do it). In the CreateProcess model, I think you can use startupInfo.hStdOutput, see https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Andrew Dunstan writes: > There is no equivalent of execl, nor a cmd.exe exquivalent of the > shell's exec. But surely the equivalent of the fork/execl you're doing > here would be a simple CreateProcess(). I don't see why you need a shell > in the middle on Windows at all. The problem is to get the output redirection to work. I imagine it could be rewritten without involving the shell, but it'd be less than trivial (or at least, I'm not going to do it). If we did get that to work, it could probably be applied to the service-start case as well, which currently just does a CreateProcess and pays no attention to what happens to postmaster stderr. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On 09/03/2015 03:31 PM, Tom Lane wrote: I wrote: Andres Freund writes: I'don't like adding a couple seconds of test runtime for the benefit of very slow platforms. Me either. This is the first time I've seen an indication that the start_postmaster change mentioned in the comment is actually important for production use, rather than just being cleanup. I think we ought to just fix it. I'm willing to take care of the Unix side if someone will explain how to change the Windows side. Attached is a draft patch for this. I think it's fine for Unix (unless someone wants to object to relying on "/bin/sh -c"), but I have no idea whether it works for Windows. The main risk is that if CMD.EXE runs the postmaster as a subprocess rather than overlaying itself a la shell "exec", the PID we'd get back would apply only to CMD.EXE not to the eventual postmaster. If so, I do not know how to fix that, or whether it's fixable at all. Note that this makes the test case in question fail reliably, which is reasonable behavior IMO so I just changed the test. If this does (or can be made to) work on Windows, I'd propose applying it back to 9.2, where the current coding came in. There is no equivalent of execl, nor a cmd.exe exquivalent of the shell's exec. But surely the equivalent of the fork/execl you're doing here would be a simple CreateProcess(). I don't see why you need a shell in the middle on Windows at all. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
I wrote: > Andres Freund writes: >> I'don't like adding a couple seconds of test runtime for the benefit of >> very slow platforms. > Me either. This is the first time I've seen an indication that the > start_postmaster change mentioned in the comment is actually important > for production use, rather than just being cleanup. I think we ought > to just fix it. I'm willing to take care of the Unix side if someone > will explain how to change the Windows side. Attached is a draft patch for this. I think it's fine for Unix (unless someone wants to object to relying on "/bin/sh -c"), but I have no idea whether it works for Windows. The main risk is that if CMD.EXE runs the postmaster as a subprocess rather than overlaying itself a la shell "exec", the PID we'd get back would apply only to CMD.EXE not to the eventual postmaster. If so, I do not know how to fix that, or whether it's fixable at all. Note that this makes the test case in question fail reliably, which is reasonable behavior IMO so I just changed the test. If this does (or can be made to) work on Windows, I'd propose applying it back to 9.2, where the current coding came in. regards, tom lane diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 6a36d29..f0025d7 100644 *** a/src/bin/pg_ctl/pg_ctl.c --- b/src/bin/pg_ctl/pg_ctl.c *** *** 28,33 --- 28,34 #include #include #include + #include #include #ifdef HAVE_SYS_RESOURCE_H *** static int CreateRestrictedProcess(char *** 153,162 static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static int start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(bool); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) --- 154,163 static pgpid_t get_pgpid(bool is_status_request); static char **readfile(const char *path); static void free_readfile(char **optlines); ! static pgpid_t start_postmaster(void); static void read_post_opts(void); ! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint); static bool postmaster_is_alive(pid_t pid); #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE) *** free_readfile(char **optlines) *** 419,454 * start/test/stop routines */ ! static int start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 /* * Since there might be quotes to handle here, it is easier simply to pass * everything to a shell to process them. - * - * XXX it would be better to fork and exec so that we would know the child - * postmaster's PID directly; then test_postmaster_connection could use - * the PID without having to rely on reading it back from the pidfile. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &", exec_path, pgdata_opt, post_opts, DEVNULL); ! return system(cmd); #else /* WIN32 */ /* ! * On win32 we don't use system(). So we don't need to use & (which would ! * be START /B on win32). However, we still call the shell (CMD.EXE) with ! * it to handle redirection etc. */ PROCESS_INFORMATION pi; --- 420,479 * start/test/stop routines */ ! static pgpid_t start_postmaster(void) { char cmd[MAXPGPATH]; #ifndef WIN32 + pgpid_t pm_pid; + + /* Flush stdio channels just before fork, to avoid double-output problems */ + fflush(stdout); + fflush(stderr); + + pm_pid = fork(); + if (pm_pid < 0) + { + /* fork failed */ + write_stderr(_("%s: could not start server: %s\n"), + progname, strerror(errno)); + exit(1); + } + if (pm_pid > 0) + { + /* fork succeeded, in parent */ + return pm_pid; + } + + /* fork succeeded, in child */ /* * Since there might be quotes to handle here, it is easier simply to pass * everything to a shell to process them. */ if (log_file != NULL) ! snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1", exec_path, pgdata_opt, post_opts, DEVNULL, log_file); else ! snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1", exec_path, pgdata_opt, post_opts, DEVNULL); ! (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL); ! ! /* exec failed */ ! write_stderr(_("%s: could not start server: %s\n"), ! progname, strerror(errno)); ! exit(1); ! ! return 0; /* keep dumb compilers quiet */ ! #else /* WIN32 */ /* ! * As with the Unix case, it's easiest to use the shell (CMD.EXE) to ! * handle redirection etc. */ PROCESS_INFORMATION pi; *** start_postmaster(void) *** 460,469 exec_
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
Andres Freund writes: > On 2015-09-03 02:25:00 -0400, Noah Misch wrote: >> --- a/src/bin/pg_ctl/t/001_start_stop.pl >> +++ b/src/bin/pg_ctl/t/001_start_stop.pl >> @@ -35,6 +35,7 @@ close CONF; >> command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], >> 'pg_ctl start -w'); >> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], >> -'second pg_ctl start succeeds'); >> +sleep 3;# bridge test_postmaster_connection() slop threshold >> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], >> +'second pg_ctl start fails'); >> command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], >> 'pg_ctl stop -w'); > I'don't like adding a couple seconds of test runtime for the benefit of > very slow platforms. Me either. This is the first time I've seen an indication that the start_postmaster change mentioned in the comment is actually important for production use, rather than just being cleanup. I think we ought to just fix it. I'm willing to take care of the Unix side if someone will explain how to change the Windows side. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members
On 2015-09-03 02:25:00 -0400, Noah Misch wrote: > --- a/src/bin/pg_ctl/t/001_start_stop.pl > +++ b/src/bin/pg_ctl/t/001_start_stop.pl > @@ -35,6 +35,7 @@ close CONF; > command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], > 'pg_ctl start -w'); > -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], > - 'second pg_ctl start succeeds'); > +sleep 3;# bridge test_postmaster_connection() slop threshold > +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ], > + 'second pg_ctl start fails'); > command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ], > 'pg_ctl stop -w'); I'don't like adding a couple seconds of test runtime for the benefit of very slow platforms. The second pg_ctl start doesn't seem to test something very interesting. I'm inclined to just remove it. I'm not caffeinated sufficiently, but afaics that ought to sidestep the issue as stop doesn't depend on the slop time? > crake failed the same way, once: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06 Sounds like an actual production hazard too. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers