Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-13 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane  wrote:
>> I'm not sure if this will completely fix our problems with "pg_ctl start"
>> related buildfarm failures on very slow critters.  It does get rid of the
>> hard wired 5-second timeout, but the 60-second timeout could still be an
>> issue.  I think Noah was considering a patch to allow that number to be
>> raised.  I'd be in favor of letting pg_ctl accept a default timeout length
>> from an environment variable, and then the slower critters could be fixed
>> by adjusting their buildfarm configurations.

> Being able to pass that as a command-line parameter (master-only
> change) would be welcome as well IMO.

Huh?  The --timeout parameter has always been there.  I'm just expressing
an opinion that modifying all the makefiles and test scripts to pass
through a timeout setting would be too messy.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-12 Thread Michael Paquier
On Tue, Oct 13, 2015 at 7:41 AM, Tom Lane  wrote:
> So there's still something to be desired on Windows; but it's still
> better than before in that we can reliably detect child process exit
> instead of having to use the five-second heuristic.  And of course on
> Unix this is way better than before.
>
> So I've pushed this with some cosmetic adjustments, as well as the not
> so cosmetic adjustment of making the service-start path also use handle
> testing.  If there are remaining problems, the buildfarm should tell us.

Thanks! I used at some point when hacking postmasterProcess as well to
save the handle but for clarity as this was already used for shutdown
when running things as a service of the Windows SCM I put that in
another variable. So far so good based on some extra testing I just
did now.

> I'm not sure if this will completely fix our problems with "pg_ctl start"
> related buildfarm failures on very slow critters.  It does get rid of the
> hard wired 5-second timeout, but the 60-second timeout could still be an
> issue.  I think Noah was considering a patch to allow that number to be
> raised.  I'd be in favor of letting pg_ctl accept a default timeout length
> from an environment variable, and then the slower critters could be fixed
> by adjusting their buildfarm configurations.

Being able to pass that as a command-line parameter (master-only
change) would be welcome as well IMO.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-12 Thread Tom Lane
Michael Paquier  writes:
>> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane  wrote:
>>> I think there is still room to salvage something without fully rewriting
>>> the postmaster invocation logic to avoid using CMD, because it's still
>>> true that checking whether the CMD process is still there should be as
>>> good as checking the postmaster proper.  We just can't use kill() for it.

> I had a look at that, and attached is an updated patch showing the
> concept of using an HANDLE that pg_ctl waits for. I agree that saving
> an HANDLE the way this patch does using a static variable is a bit
> ugly especially knowing that service registration uses
> test_postmaster_connection as well with directly a postmaster. The
> good thing is that tapcheck runs smoothly, with one single failure
> though: the second call to pg_ctl start -w may succeed instead of
> failing if kicked within an interval of 3 seconds after the first one,
> based on the tests on my Windows VM. My guess is that this is caused
> by the fact that we monitor the shell process and not the postmaster
> itself. Thoughts?

Yeah, it can still succeed because pg_ctl can't tell that the
postmaster.pid created by the earlier invocation isn't the one it wants.
It adopts the values out of that file, tests the connection, finds it
works, and declares victory, not realizing that the postmaster *it*
started will soon fail (or maybe already has).

Waiting more than 2 seconds is enough to make sure that
test_postmaster_connection sees the pre-existing pidfile as stale and
doesn't believe that it represents a successful postmaster start.

So there's still something to be desired on Windows; but it's still
better than before in that we can reliably detect child process exit
instead of having to use the five-second heuristic.  And of course on
Unix this is way better than before.

So I've pushed this with some cosmetic adjustments, as well as the not
so cosmetic adjustment of making the service-start path also use handle
testing.  If there are remaining problems, the buildfarm should tell us.

I'm not sure if this will completely fix our problems with "pg_ctl start"
related buildfarm failures on very slow critters.  It does get rid of the
hard wired 5-second timeout, but the 60-second timeout could still be an
issue.  I think Noah was considering a patch to allow that number to be
raised.  I'd be in favor of letting pg_ctl accept a default timeout length
from an environment variable, and then the slower critters could be fixed
by adjusting their buildfarm configurations.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-09 Thread Tom Lane
Alvaro Herrera  writes:
> I wonder if you are interested in rewriting this whole thing to not use
> cmd.exe at all, which as I understand is just about output redirection.

FWIW, I have little interest in going there, because I'm afraid that
getting it to be bug-compatible with the existing behavior on points such
as quoting of -o arguments would be extremely fiddly.  Perhaps it'd be
worth doing it in master, but I'd be afraid to back-patch such a change
--- and what I'm after right now is a back-patchable way of getting rid
of the 5-second-timeout behavior.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-09 Thread Alvaro Herrera
Michael Paquier wrote:

> > That would be WaitForSingleObject(handle, ms_timeout) ==
> > WAIT_OBJECT_0, no? The handle should be picked up from
> > CreateRestrictedProcess, and I think that CloseHandle should not be
> > called on pi.hProcess if we are going to wait for it afterwards.
> > Reference:
> > https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx
> 
> I had a look at that, and attached is an updated patch showing the
> concept of using an HANDLE that pg_ctl waits for.

I wonder if you are interested in rewriting this whole thing to not use
cmd.exe at all, which as I understand is just about output redirection.
Earlier in the thread I pointed out that on Microsoft docs they claim
that you can redirect stdout/stderr when creating a new process without
going through cmd.exe; see
https://www.postgresql.org/message-id/20150903202114.GG2912%40alvherre.pgsql

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-08 Thread Michael Paquier
On Thu, Oct 8, 2015 at 3:09 PM, Michael Paquier
 wrote:
> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane  wrote:
>> Michael Paquier  writes:
>>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
 So the attached modified patch adjusts the PID-match logic and some
 comments, but is otherwise what I posted before.  I believe that this
 might actually work on Windows, but I have no way to test it.  Someone
 please try that?  (Don't forget to test the service-start path, too.)
>>
>>> If "pg_ctl start" is invoked directly from a command prompt, it works.
>>> Now, if I run "pg_ctl start" within a script (in an msysgit
>>> environment for example), pg_ctl fails, complaining that
>>> postmaster.pid already exists. The TAP tests get broken as well,
>>
>> Reading that again, I think you mean that "pg_ctl start" works but you
>> didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
>> issue here, and the failing test cases are using that.
>
> Yes, sorry. I fat-fingered the command from the prompt and forgot the
> -w switch. test_postmaster_connection just behaves incorrectly, and
> exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting,
> could not start server, blah".
>
>> I think there is still room to salvage something without fully rewriting
>> the postmaster invocation logic to avoid using CMD, because it's still
>> true that checking whether the CMD process is still there should be as
>> good as checking the postmaster proper.  We just can't use kill() for it.
>> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
>> mentioned upthread, if we're using fork/exec we might as well pay
>> attention to waitpid's results instead.  On Windows, I imagine that the
>> thing to do is have start_postmaster() save aside the handle for the CMD
>> process, and then in test_postmaster_connection(), check the handle state
>> to see if the process is still running (I assume there's a way to do
>> that).
>
> That would be WaitForSingleObject(handle, ms_timeout) ==
> WAIT_OBJECT_0, no? The handle should be picked up from
> CreateRestrictedProcess, and I think that CloseHandle should not be
> called on pi.hProcess if we are going to wait for it afterwards.
> Reference:
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx

I had a look at that, and attached is an updated patch showing the
concept of using an HANDLE that pg_ctl waits for. I agree that saving
an HANDLE the way this patch does using a static variable is a bit
ugly especially knowing that service registration uses
test_postmaster_connection as well with directly a postmaster. The
good thing is that tapcheck runs smoothly, with one single failure
though: the second call to pg_ctl start -w may succeed instead of
failing if kicked within an interval of 3 seconds after the first one,
based on the tests on my Windows VM. My guess is that this is caused
by the fact that we monitor the shell process and not the postmaster
itself. Thoughts?
Regards,
-- 
Michael
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index ba189e7..81e0631 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef HAVE_SYS_RESOURCE_H
@@ -110,6 +111,7 @@ static SERVICE_STATUS status;
 static SERVICE_STATUS_HANDLE hStatus = (SERVICE_STATUS_HANDLE) 0;
 static HANDLE shutdownHandles[2];
 static pid_t postmasterPID = -1;
+static HANDLE commandProcess = INVALID_HANDLE_VALUE;
 
 #define shutdownEvent	  shutdownHandles[0]
 #define postmasterProcess shutdownHandles[1]
@@ -153,10 +155,10 @@ static int	CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo,
 static pgpid_t get_pgpid(bool is_status_request);
 static char **readfile(const char *path);
 static void free_readfile(char **optlines);
-static int	start_postmaster(void);
+static pgpid_t start_postmaster(void);
 static void read_post_opts(void);
 
-static PGPing test_postmaster_connection(bool);
+static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
 static bool postmaster_is_alive(pid_t pid);
 
 #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
@@ -419,36 +421,70 @@ free_readfile(char **optlines)
  * start/test/stop routines
  */
 
-static int
+/*
+ * Start the postmaster and return its PID.
+ *
+ * Currently, on Windows what we return is the PID of the shell process
+ * that launched the postmaster (and, we trust, is waiting for it to exit).
+ * So the PID is usable for "is the postmaster still running" checks,
+ * but cannot be compared directly to postmaster.pid.
+ */
+static pgpid_t
 start_postmaster(void)
 {
 	char		cmd[MAXPGPATH];
 
 #ifndef WIN32
+	pgpid_t		pm_pid;
+
+	/* Flush stdio channels just before fork, to avoid double-output problems */
+	fflush(stdout);
+	fflush(stderr);
+
+	pm_pid = fork();
+	if (pm_pid < 0)
+	{
+		/* fork failed */
+		write_stderr(_("%s: could not start server: %s\n"),
+	 progname, strerro

Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-07 Thread Michael Paquier
On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>>> So the attached modified patch adjusts the PID-match logic and some
>>> comments, but is otherwise what I posted before.  I believe that this
>>> might actually work on Windows, but I have no way to test it.  Someone
>>> please try that?  (Don't forget to test the service-start path, too.)
>
>> If "pg_ctl start" is invoked directly from a command prompt, it works.
>> Now, if I run "pg_ctl start" within a script (in an msysgit
>> environment for example), pg_ctl fails, complaining that
>> postmaster.pid already exists. The TAP tests get broken as well,
>
> Reading that again, I think you mean that "pg_ctl start" works but you
> didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
> issue here, and the failing test cases are using that.

Yes, sorry. I fat-fingered the command from the prompt and forgot the
-w switch. test_postmaster_connection just behaves incorrectly, and
exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting,
could not start server, blah".

> I think there is still room to salvage something without fully rewriting
> the postmaster invocation logic to avoid using CMD, because it's still
> true that checking whether the CMD process is still there should be as
> good as checking the postmaster proper.  We just can't use kill() for it.
> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
> mentioned upthread, if we're using fork/exec we might as well pay
> attention to waitpid's results instead.  On Windows, I imagine that the
> thing to do is have start_postmaster() save aside the handle for the CMD
> process, and then in test_postmaster_connection(), check the handle state
> to see if the process is still running (I assume there's a way to do
> that).

That would be WaitForSingleObject(handle, ms_timeout) ==
WAIT_OBJECT_0, no? The handle should be picked up from
CreateRestrictedProcess, and I think that CloseHandle should not be
called on pi.hProcess if we are going to wait for it afterwards.
Reference:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx

> I can take care of the Unix side of that, but as before, somebody else
> would need to code and test the Windows side.  It's probably just a few
> lines of code to add ... any volunteers?

I could give it a shot, do you still want to rework the Unix portion
of the patch? There are not many folks around able to test the patch
either way.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-07 Thread Tom Lane
Michael Paquier  writes:
> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>> So the attached modified patch adjusts the PID-match logic and some
>> comments, but is otherwise what I posted before.  I believe that this
>> might actually work on Windows, but I have no way to test it.  Someone
>> please try that?  (Don't forget to test the service-start path, too.)

> If "pg_ctl start" is invoked directly from a command prompt, it works.
> Now, if I run "pg_ctl start" within a script (in an msysgit
> environment for example), pg_ctl fails, complaining that
> postmaster.pid already exists. The TAP tests get broken as well,

Reading that again, I think you mean that "pg_ctl start" works but you
didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
issue here, and the failing test cases are using that.

After studying the test logs more carefully, I think the behavior they're
showing is consistent with the idea that postmaster_is_alive() doesn't
work on the CMD shell process.  Which seems very likely now that I think
about it, because we're depending on an implementation of kill() that
is PG-specific.

So in fact this idea doesn't work :-(.

I think there is still room to salvage something without fully rewriting
the postmaster invocation logic to avoid using CMD, because it's still
true that checking whether the CMD process is still there should be as
good as checking the postmaster proper.  We just can't use kill() for it.
I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
mentioned upthread, if we're using fork/exec we might as well pay
attention to waitpid's results instead.  On Windows, I imagine that the
thing to do is have start_postmaster() save aside the handle for the CMD
process, and then in test_postmaster_connection(), check the handle state
to see if the process is still running (I assume there's a way to do
that).

I can take care of the Unix side of that, but as before, somebody else
would need to code and test the Windows side.  It's probably just a few
lines of code to add ... any volunteers?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-07 Thread Tom Lane
Michael Paquier  writes:
> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>> So the attached modified patch adjusts the PID-match logic and some
>> comments, but is otherwise what I posted before.  I believe that this
>> might actually work on Windows, but I have no way to test it.  Someone
>> please try that?  (Don't forget to test the service-start path, too.)

> If "pg_ctl start" is invoked directly from a command prompt, it works.
> Now, if I run "pg_ctl start" within a script (in an msysgit
> environment for example), pg_ctl fails, complaining that
> postmaster.pid already exists. The TAP tests get broken as well,

Hm, the TAP trace makes it look like the postmaster start does actually
work, but pg_ctl isn't correctly waiting for that to happen.  (The
complaint about "already exists" seems to be from the second start
attempt, and is exactly what to expect there.)  I could believe that
I'd fat-fingered the PID comparison logic in test_postmaster_connection,
but I don't understand how it would work from a command prompt but not
from a script.  Any ideas?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-07 Thread Michael Paquier
On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
> So the attached modified patch adjusts the PID-match logic and some
> comments, but is otherwise what I posted before.  I believe that this
> might actually work on Windows, but I have no way to test it.  Someone
> please try that?  (Don't forget to test the service-start path, too.)

If "pg_ctl start" is invoked directly from a command prompt, it works.
Now, if I run "pg_ctl start" within a script (in an msysgit
environment for example), pg_ctl fails, complaining that
postmaster.pid already exists. The TAP tests get broken as well,
attached are the log files of the run, and here is an interested bit:
# Running: pg_ctl start -D
C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w
waiting for server to start... stopped waiting
not ok 12 - pg_ctl start -w
#   Failed test 'pg_ctl start -w'
#   at C:/Users/ioltas/git/postgres/src/test/perl/TestLib.pm line 282.
# Running: pg_ctl start -D
C:\Users\ioltas\git\postgres\src\bin\pg_ctl\tmp_testor2K/data -w
waiting for server to start...FATAL:  lock file "postmaster.pid" already exists
HINT:  Is another postmaster (PID 4040) running in data directory
"C:/Users/ioltas/git/postgres/src/bin/pg_ctl/tmp_testor2K/data"?
LOG:  database system was shut down at 2015-10-06 23:58:02 PDT
LOG:  MultiXact member wraparound protections are now enabled
FATAL:  the database system is starting up
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
 stopped waiting
Registering a service to the SCM works, but the service registered
refuses to start.
Regards,
-- 
Michael


regress_log_001_start_stop
Description: Binary data


regress_log_002_status
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-06 Thread Michael Paquier
On Wed, Oct 7, 2015 at 7:05 AM, Michael Paquier
 wrote:
> On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane  wrote:
>> I wrote:
>>> So the attached modified patch adjusts the PID-match logic and some
>>> comments, but is otherwise what I posted before.  I believe that this
>>> might actually work on Windows, but I have no way to test it.  Someone
>>> please try that?  (Don't forget to test the service-start path, too.)
>>
>> Has anyone tried that patch on Windows?  I'd like to push it in hopes of
>> improving buildfarm stability, but I'm hesitant to do so without some
>> confirmation that it acts as I think it will.
>
> I marked this patch as something to look at, though I haven't take the
> time to do it yet...

s/take/taken
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-06 Thread Michael Paquier
On Wed, Oct 7, 2015 at 6:44 AM, Tom Lane  wrote:
> I wrote:
>> So the attached modified patch adjusts the PID-match logic and some
>> comments, but is otherwise what I posted before.  I believe that this
>> might actually work on Windows, but I have no way to test it.  Someone
>> please try that?  (Don't forget to test the service-start path, too.)
>
> Has anyone tried that patch on Windows?  I'd like to push it in hopes of
> improving buildfarm stability, but I'm hesitant to do so without some
> confirmation that it acts as I think it will.

I marked this patch as something to look at, though I haven't take the
time to do it yet...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-10-06 Thread Tom Lane
I wrote:
> So the attached modified patch adjusts the PID-match logic and some
> comments, but is otherwise what I posted before.  I believe that this
> might actually work on Windows, but I have no way to test it.  Someone
> please try that?  (Don't forget to test the service-start path, too.)

Has anyone tried that patch on Windows?  I'd like to push it in hopes of
improving buildfarm stability, but I'm hesitant to do so without some
confirmation that it acts as I think it will.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-25 Thread Tom Lane
I wrote:
> Attached is a draft patch for this.  I think it's fine for Unix (unless
> someone wants to object to relying on "/bin/sh -c"), but I have no idea
> whether it works for Windows.  The main risk is that if CMD.EXE runs
> the postmaster as a subprocess rather than overlaying itself a la shell
> "exec", the PID we'd get back would apply only to CMD.EXE not to the
> eventual postmaster.  If so, I do not know how to fix that, or whether
> it's fixable at all.

> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.

> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.

Nobody seems to have stepped up to fix the Windows side of this, but
after some more thought it occurred to me that maybe it doesn't need
(much) fixing.  There are two things that pg_ctl wants the PID for:

1. To check if the postmaster.pid file contains the expected child process
PID.  Well, that's a nice-to-have, but we never had it before and got
along well enough, because the start-timestamp comparison check covers
most real-world cases.  We can omit the PID-match check on Windows.

2. To see whether the child postmaster process is still running, via a
"kill(pid, 0)" test.  But if we've launched the postmaster through a
shell, and not used "&" semantics, presumably that shell will wait for its
child process to exit and then exit itself.  So *testing whether the shell
is still there is just about as good as testing the real postmaster PID*.

Therefore, we don't really need to worry about rewriting the Windows
subprocess-launch logic.  If someone would like to do it, that would
be nice, but we don't have to wait for that to happen before we can
get rid of the 5-second-timeout issues.

So the attached modified patch adjusts the PID-match logic and some
comments, but is otherwise what I posted before.  I believe that this
might actually work on Windows, but I have no way to test it.  Someone
please try that?  (Don't forget to test the service-start path, too.)

regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6a36d29..62db328 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***
*** 28,33 
--- 28,34 
  #include 
  #include 
  #include 
+ #include 
  #include 
  
  #ifdef HAVE_SYS_RESOURCE_H
*** static int	CreateRestrictedProcess(char 
*** 153,162 
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static int	start_postmaster(void);
  static void read_post_opts(void);
  
! static PGPing test_postmaster_connection(bool);
  static bool postmaster_is_alive(pid_t pid);
  
  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 154,163 
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static pgpid_t start_postmaster(void);
  static void read_post_opts(void);
  
! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);
  
  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*** free_readfile(char **optlines)
*** 419,454 
   * start/test/stop routines
   */
  
! static int
  start_postmaster(void)
  {
  	char		cmd[MAXPGPATH];
  
  #ifndef WIN32
  
  	/*
  	 * Since there might be quotes to handle here, it is easier simply to pass
! 	 * everything to a shell to process them.
! 	 *
! 	 * XXX it would be better to fork and exec so that we would know the child
! 	 * postmaster's PID directly; then test_postmaster_connection could use
! 	 * the PID without having to rely on reading it back from the pidfile.
  	 */
  	if (log_file != NULL)
! 		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
   exec_path, pgdata_opt, post_opts,
   DEVNULL, log_file);
  	else
! 		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
   exec_path, pgdata_opt, post_opts, DEVNULL);
  
! 	return system(cmd);
  #else			/* WIN32 */
  
  	/*
! 	 * On win32 we don't use system(). So we don't need to use & (which would
! 	 * be START /B on win32). However, we still call the shell (CMD.EXE) with
! 	 * it to handle redirection etc.
  	 */
  	PROCESS_INFORMATION pi;
  
--- 420,489 
   * start/test/stop routines
   */
  
! /*
!  * Start the postmaster and return its PID.
!  *
!  * Currently, on Windows what we return is the PID of the shell process
!  * that launched the postmaster (and, we trust, is waiting for it to exit).
!  * So the PID is usable for "is the postmaster still running" checks,
!  * but cannot be compared directly to postmaster.pid.
!  */
! static pgpid_t
  start_postmaster(void)
  {
  	char		cmd[MAXPGPATH];
  
  #ifndef WIN32
+ 	pgpid_t		pm_pid;
+ 
+ 	/* Flush stdio channels just before fork, to 

Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-05 Thread Tom Lane
Noah Misch  writes:
> On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote:
>> The kill() coding works on Windows (I believe); waitpid() not so much.

> Windows has the concept under a different name.  See postmaster.c.

Well, if someone else wants to code and test that, feel free.  I don't
do Windows ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-05 Thread Noah Misch
On Sat, Sep 05, 2015 at 10:22:59PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > kill() adds nothing when dealing with a pg_ctl child.  waitpid() is enough.
> 
> The kill() coding works on Windows (I believe); waitpid() not so much.

Windows has the concept under a different name.  See postmaster.c.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-05 Thread Tom Lane
Noah Misch  writes:
> kill() adds nothing when dealing with a pg_ctl child.  waitpid() is enough.

The kill() coding works on Windows (I believe); waitpid() not so much.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-05 Thread Noah Misch
On Fri, Sep 04, 2015 at 10:54:51PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> >>> This is the first time I've seen an indication that the
> >>> start_postmaster change mentioned in the comment is actually important
> >>> for production use, rather than just being cleanup.
> 
> > I scratched my head awhile without thinking of a credible production use 
> > case
> > that would notice the difference.  Which one did you have in mind?
> 
> Well, mainly that it's making our regression tests fail, which suggests
> behavioral instability that could be important for production.

I doubt issuing two "pg_ctl start" within two seconds, with no intervening
stop, has a production use.  However, ...

> Aside from the specific case you diagnosed, it's clear from buildfarm
> results that the five-second timeout elsewhere in
> test_postmaster_connection has got funny behavior under load; there are
> way too many cases where pg_ctl gives up after exactly that long, with
> no useful information printed.  The draft patch I posted gets rid of that
> behavior ...

... fixing this is a good deal more valuable.  Besides coping with stalls in
early postmaster startup, pg_ctl would no longer take 5s to exit when the
postmaster reports a postgresql.conf defect or other early error.

On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> If this does (or can be made to) work on Windows, I'd propose applying
> it back to 9.2, where the current coding came in.

That seems prudent.  We'll keep meeting buildfarm members with performance
variability sufficient to reach that 5s timeout.  (Incidentally, the current
coding first appeared in 9.1.)

> --- 650,671 
>   }
>   
>   /*
> !  * Reap child process if it died; if it remains in zombie state 
> then
> !  * our kill-based test will think it's still running.
>*/
> ! #ifndef WIN32
>   {
> ! int exitstatus;
>   
> ! (void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);

Let's report any exit status.  When a signal kills the young postmaster, it's
a waste to omit that detail.

>   }
> + #endif
>   
>   /*
> !  * Check whether the child postmaster process is still alive.  
> This
> !  * lets us exit early if the postmaster fails during startup.
>*/
> ! if (!postmaster_is_alive((pid_t) pm_pid))
>   return PQPING_NO_RESPONSE;

kill() adds nothing when dealing with a pg_ctl child.  waitpid() is enough.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-04 Thread Tom Lane
Noah Misch  writes:
> On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
>>> This is the first time I've seen an indication that the
>>> start_postmaster change mentioned in the comment is actually important
>>> for production use, rather than just being cleanup.

> I scratched my head awhile without thinking of a credible production use case
> that would notice the difference.  Which one did you have in mind?

Well, mainly that it's making our regression tests fail, which suggests
behavioral instability that could be important for production.

Aside from the specific case you diagnosed, it's clear from buildfarm
results that the five-second timeout elsewhere in
test_postmaster_connection has got funny behavior under load; there are
way too many cases where pg_ctl gives up after exactly that long, with
no useful information printed.  The draft patch I posted gets rid of that
behavior ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-04 Thread Noah Misch
On Thu, Sep 03, 2015 at 03:31:06PM -0400, Tom Lane wrote:
> I wrote:
> > This is the first time I've seen an indication that the
> > start_postmaster change mentioned in the comment is actually important
> > for production use, rather than just being cleanup.

I scratched my head awhile without thinking of a credible production use case
that would notice the difference.  Which one did you have in mind?

> Note that this makes the test case in question fail reliably, which is
> reasonable behavior IMO so I just changed the test.

That's the better behavior, agreed.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Alvaro Herrera
Tom Lane wrote:
> Andrew Dunstan  writes:
> > There is no equivalent of execl, nor a cmd.exe exquivalent of the 
> > shell's exec. But surely the equivalent of the fork/execl you're doing 
> > here would be a simple CreateProcess(). I don't see why you need a shell 
> > in the middle on Windows at all.
> 
> The problem is to get the output redirection to work.  I imagine it could
> be rewritten without involving the shell, but it'd be less than trivial
> (or at least, I'm not going to do it).

In the CreateProcess model, I think you can use startupInfo.hStdOutput,
see

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425%28v=vs.85%29.aspx

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Tom Lane
Andrew Dunstan  writes:
> There is no equivalent of execl, nor a cmd.exe exquivalent of the 
> shell's exec. But surely the equivalent of the fork/execl you're doing 
> here would be a simple CreateProcess(). I don't see why you need a shell 
> in the middle on Windows at all.

The problem is to get the output redirection to work.  I imagine it could
be rewritten without involving the shell, but it'd be less than trivial
(or at least, I'm not going to do it).

If we did get that to work, it could probably be applied to the
service-start case as well, which currently just does a CreateProcess and
pays no attention to what happens to postmaster stderr.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Andrew Dunstan



On 09/03/2015 03:31 PM, Tom Lane wrote:

I wrote:

Andres Freund  writes:

I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.

Me either.  This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.  I think we ought
to just fix it.  I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.

Attached is a draft patch for this.  I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows.  The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster.  If so, I do not know how to fix that, or whether
it's fixable at all.

Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.

If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.




There is no equivalent of execl, nor a cmd.exe exquivalent of the 
shell's exec. But surely the equivalent of the fork/execl you're doing 
here would be a simple CreateProcess(). I don't see why you need a shell 
in the middle on Windows at all.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> I'don't like adding a couple seconds of test runtime for the benefit of
>> very slow platforms.

> Me either.  This is the first time I've seen an indication that the
> start_postmaster change mentioned in the comment is actually important
> for production use, rather than just being cleanup.  I think we ought
> to just fix it.  I'm willing to take care of the Unix side if someone
> will explain how to change the Windows side.

Attached is a draft patch for this.  I think it's fine for Unix (unless
someone wants to object to relying on "/bin/sh -c"), but I have no idea
whether it works for Windows.  The main risk is that if CMD.EXE runs
the postmaster as a subprocess rather than overlaying itself a la shell
"exec", the PID we'd get back would apply only to CMD.EXE not to the
eventual postmaster.  If so, I do not know how to fix that, or whether
it's fixable at all.

Note that this makes the test case in question fail reliably, which is
reasonable behavior IMO so I just changed the test.

If this does (or can be made to) work on Windows, I'd propose applying
it back to 9.2, where the current coding came in.

regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 6a36d29..f0025d7 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***
*** 28,33 
--- 28,34 
  #include 
  #include 
  #include 
+ #include 
  #include 
  
  #ifdef HAVE_SYS_RESOURCE_H
*** static int	CreateRestrictedProcess(char 
*** 153,162 
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static int	start_postmaster(void);
  static void read_post_opts(void);
  
! static PGPing test_postmaster_connection(bool);
  static bool postmaster_is_alive(pid_t pid);
  
  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
--- 154,163 
  static pgpid_t get_pgpid(bool is_status_request);
  static char **readfile(const char *path);
  static void free_readfile(char **optlines);
! static pgpid_t start_postmaster(void);
  static void read_post_opts(void);
  
! static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
  static bool postmaster_is_alive(pid_t pid);
  
  #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
*** free_readfile(char **optlines)
*** 419,454 
   * start/test/stop routines
   */
  
! static int
  start_postmaster(void)
  {
  	char		cmd[MAXPGPATH];
  
  #ifndef WIN32
  
  	/*
  	 * Since there might be quotes to handle here, it is easier simply to pass
  	 * everything to a shell to process them.
- 	 *
- 	 * XXX it would be better to fork and exec so that we would know the child
- 	 * postmaster's PID directly; then test_postmaster_connection could use
- 	 * the PID without having to rely on reading it back from the pidfile.
  	 */
  	if (log_file != NULL)
! 		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
   exec_path, pgdata_opt, post_opts,
   DEVNULL, log_file);
  	else
! 		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
   exec_path, pgdata_opt, post_opts, DEVNULL);
  
! 	return system(cmd);
  #else			/* WIN32 */
  
  	/*
! 	 * On win32 we don't use system(). So we don't need to use & (which would
! 	 * be START /B on win32). However, we still call the shell (CMD.EXE) with
! 	 * it to handle redirection etc.
  	 */
  	PROCESS_INFORMATION pi;
  
--- 420,479 
   * start/test/stop routines
   */
  
! static pgpid_t
  start_postmaster(void)
  {
  	char		cmd[MAXPGPATH];
  
  #ifndef WIN32
+ 	pgpid_t		pm_pid;
+ 
+ 	/* Flush stdio channels just before fork, to avoid double-output problems */
+ 	fflush(stdout);
+ 	fflush(stderr);
+ 
+ 	pm_pid = fork();
+ 	if (pm_pid < 0)
+ 	{
+ 		/* fork failed */
+ 		write_stderr(_("%s: could not start server: %s\n"),
+ 	 progname, strerror(errno));
+ 		exit(1);
+ 	}
+ 	if (pm_pid > 0)
+ 	{
+ 		/* fork succeeded, in parent */
+ 		return pm_pid;
+ 	}
+ 
+ 	/* fork succeeded, in child */
  
  	/*
  	 * Since there might be quotes to handle here, it is easier simply to pass
  	 * everything to a shell to process them.
  	 */
  	if (log_file != NULL)
! 		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
   exec_path, pgdata_opt, post_opts,
   DEVNULL, log_file);
  	else
! 		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
   exec_path, pgdata_opt, post_opts, DEVNULL);
  
! 	(void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
! 
! 	/* exec failed */
! 	write_stderr(_("%s: could not start server: %s\n"),
!  progname, strerror(errno));
! 	exit(1);
! 
! 	return 0;	/* keep dumb compilers quiet */
! 
  #else			/* WIN32 */
  
  	/*
! 	 * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
! 	 * handle redirection etc.
  	 */
  	PROCESS_INFORMATION pi;
  
*** start_postmaster(void)
*** 460,469 
   exec_

Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Tom Lane
Andres Freund  writes:
> On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
>> --- a/src/bin/pg_ctl/t/001_start_stop.pl
>> +++ b/src/bin/pg_ctl/t/001_start_stop.pl
>> @@ -35,6 +35,7 @@ close CONF;
>> command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> 'pg_ctl start -w');
>> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> -'second pg_ctl start succeeds');
>> +sleep 3;# bridge test_postmaster_connection() slop threshold
>> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>> +'second pg_ctl start fails');
>> command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
>> 'pg_ctl stop -w');

> I'don't like adding a couple seconds of test runtime for the benefit of
> very slow platforms.

Me either.  This is the first time I've seen an indication that the
start_postmaster change mentioned in the comment is actually important
for production use, rather than just being cleanup.  I think we ought
to just fix it.  I'm willing to take care of the Unix side if someone
will explain how to change the Windows side.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

2015-09-03 Thread Andres Freund
On 2015-09-03 02:25:00 -0400, Noah Misch wrote:
> --- a/src/bin/pg_ctl/t/001_start_stop.pl
> +++ b/src/bin/pg_ctl/t/001_start_stop.pl
> @@ -35,6 +35,7 @@ close CONF;
>  command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
>   'pg_ctl start -w');
> -command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
> - 'second pg_ctl start succeeds');
> +sleep 3;# bridge test_postmaster_connection() slop threshold
> +command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
> + 'second pg_ctl start fails');
>  command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
>   'pg_ctl stop -w');

I'don't like adding a couple seconds of test runtime for the benefit of
very slow platforms.

The second pg_ctl start doesn't seem to test something very
interesting. I'm inclined to just remove it. I'm not caffeinated
sufficiently, but afaics that ought to sidestep the issue as stop
doesn't depend on the slop time?

> crake failed the same way, once:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2015-07-07%2016%3A35%3A06

Sounds like an actual production hazard too.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers