Re: [HACKERS] Debugging buildfarm pg_upgrade check failures
On Sat, Jul 25, 2015 at 10:59:27AM -0400, Tom Lane wrote: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl=2015-07-24%2020%3A29%3A18 > > What evidently happened there is that "pg_ctl start" gave up waiting for > the postmaster to start too soon. The postmaster log appears to contain > > LOG: database system was shut down at 2015-07-24 16:45:40 EDT > FATAL: the database system is starting up > LOG: MultiXact member wraparound protections are now enabled > LOG: database system is ready to accept connections > > which indicates that it did successfully come up, but not till after one > "PQping" probe from pg_ctl, which was rejected with "still starting up". For the record, that was not a PQping() probe. Since "connection to database failed: FATAL: the database system is starting up" appeared in the buildfarm log, it was pg_upgrade attempting to connect after its pg_ctl had given up. > Meanwhile we've got this log output from pg_ctl: > > waiting for server to start stopped waiting > pg_ctl: could not start server > Examine the log output. > > Counting the dots indicates that pg_ctl gave up after precisely 5 seconds. Thanks for improving that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Debugging buildfarm pg_upgrade check failures
Now that we've restored proper logging of make check, I looked into today's failure report from axolotl: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotldt=2015-07-24%2020%3A29%3A18 What evidently happened there is that pg_ctl start gave up waiting for the postmaster to start too soon. The postmaster log appears to contain LOG: database system was shut down at 2015-07-24 16:45:40 EDT FATAL: the database system is starting up LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections which indicates that it did successfully come up, but not till after one PQping probe from pg_ctl, which was rejected with still starting up. Meanwhile we've got this log output from pg_ctl: waiting for server to start stopped waiting pg_ctl: could not start server Examine the log output. Counting the dots indicates that pg_ctl gave up after precisely 5 seconds. Now, looking at the logic in pg_ctl's test_postmaster_connection(), the only explanation that seems to fit the observed output is that the stat() on the postmaster pidfile (at line 650 in HEAD) failed. It's not clear why though, since the postmaster was clearly still alive at this point, and we must have been able to read the pidfile earlier to construct a connection string, else there would have been no PQping attempt. Maybe the stat failed for some unexpected resource-exhaustion kind of reason? It seems plausible to me that we should change pg_ctl to only consider stat() failure to be a reason to give up waiting if errno is ENOENT, not anything else. At a minimum, I'd like to modify it to print the errno if it's anything else, so that we can confirm or deny this theory next time we see this buildfarm failure. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Debugging buildfarm pg_upgrade check failures
On 07/25/2015 10:59 AM, Tom Lane wrote: Now that we've restored proper logging of make check, I looked into today's failure report from axolotl: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotldt=2015-07-24%2020%3A29%3A18 What evidently happened there is that pg_ctl start gave up waiting for the postmaster to start too soon. The postmaster log appears to contain LOG: database system was shut down at 2015-07-24 16:45:40 EDT FATAL: the database system is starting up LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections which indicates that it did successfully come up, but not till after one PQping probe from pg_ctl, which was rejected with still starting up. Meanwhile we've got this log output from pg_ctl: waiting for server to start stopped waiting pg_ctl: could not start server Examine the log output. Counting the dots indicates that pg_ctl gave up after precisely 5 seconds. Now, looking at the logic in pg_ctl's test_postmaster_connection(), the only explanation that seems to fit the observed output is that the stat() on the postmaster pidfile (at line 650 in HEAD) failed. It's not clear why though, since the postmaster was clearly still alive at this point, and we must have been able to read the pidfile earlier to construct a connection string, else there would have been no PQping attempt. Maybe the stat failed for some unexpected resource-exhaustion kind of reason? It seems plausible to me that we should change pg_ctl to only consider stat() failure to be a reason to give up waiting if errno is ENOENT, not anything else. At a minimum, I'd like to modify it to print the errno if it's anything else, so that we can confirm or deny this theory next time we see this buildfarm failure. Comments? Certainly let's look at the errno. cheers andrdew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers