Re: [HACKERS] Explanation for intermittent buildfarm pg_upgradecheck failures

2015-08-02 Thread Tom Lane
I wrote:
 Further experimentation says that 9.0-9.2 do this in the expected order;
 so somebody broke it during 9.3.

Depressingly enough, the somebody was me.  Fixed now.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Explanation for intermittent buildfarm pg_upgradecheck failures

2015-08-02 Thread Tom Lane
Observe the smoking gun at
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=muledt=2015-08-02%2003%3A30%3A02

to wit this extract from pg_upgrade_server.log:

command: 
/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl
 -w -D 
/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data.old
 -o   stop  pg_upgrade_server.log 21
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
waiting for server to shut down.LOG:  database system is shut down
 done
server stopped

command: 
/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/install//home/pg/build-farm-4.15.1/build/HEAD/inst/bin/pg_ctl
 -w -l pg_upgrade_server.log -D 
/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/data
 -o -p 57832 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off 
 -c listen_addresses='' -c unix_socket_permissions=0700 -c 
unix_socket_directories='/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade'
 start  pg_upgrade_server.log 21
waiting for server to startFATAL:  lock file 
/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832.lock
 already exists
HINT:  Is another postmaster (PID 12295) using socket file 
/home/pg/build-farm-4.15.1/build/HEAD/pgsql.build/src/bin/pg_upgrade/.s.PGSQL.57832?
 stopped waiting
pg_ctl: could not start server
Examine the log output.

Now, the old postmaster clearly was shut down, so how come the socket lock
file still existed?  The answer is that what pg_ctl is watching for is for
the postmaster.pid file to disappear, and *the postmaster deletes its lock
files in the wrong order*.  strace'ing HEAD on my own machine shows this
sequence of kernel calls during postmaster stop:

wait4(-1, 0x723f3cbc, WNOHANG, NULL) = -1 ECHILD (No child processes)
stat(backup_label, 0x723f3bd0)= -1 ENOENT (No such file or directory)
munmap(0x7fd16e8f8000, 2316)= 0
unlink(/dev/shm/PostgreSQL.1804289383) = 0
semctl(1547862018, 0, IPC_RMID, 0)  = 0
semctl(1547894787, 0, IPC_RMID, 0)  = 0
semctl(1547927556, 0, IPC_RMID, 0)  = 0
semctl(1547960325, 0, IPC_RMID, 0)  = 0
semctl(1547993094, 0, IPC_RMID, 0)  = 0
semctl(1548025863, 0, IPC_RMID, 0)  = 0
semctl(1548058632, 0, IPC_RMID, 0)  = 0
semctl(1548091401, 0, IPC_RMID, 0)  = 0
shmdt(0x7fd16e8f9000)   = 0
munmap(0x7fd165b3c000, 148488192)   = 0
shmctl(194314244, IPC_RMID, 0)  = 0
unlink(/tmp/.s.PGSQL.5432)= 0
unlink(postmaster.pid)= 0
unlink(/tmp/.s.PGSQL.5432.lock)   = 0
exit_group(0)   = ?
+++ exited with 0 +++

I haven't looked to find out why the unlinks happen in this order, but on
a heavily loaded machine, it's certainly possible that the process would
lose the CPU after unlink(postmaster.pid), and then a new postmaster
could get far enough to see the socket lock file still there.  So that
would account for low-probability failures in the pg_upgradecheck test,
which is exactly what we've been seeing.

It seems clear to me that what the exit sequence ought to be is
unlink socket, unlink socket lock file, repeat for any additional
sockets, then unlink postmaster.pid.  Can anyone think of a reason
why the current ordering isn't a bug?

Another point that's rather obvious from the trace is that at no time
do we bother to close the postmaster's TCP socket; it only goes away when
the postmaster process dies.  So there's a second race condition here:
a new postmaster could be unable to bind() to the desired TCP port
because the old one still has it open.  I think we've seen unexplained
failures of the TCP-socket-in-use variety, too.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for intermittent buildfarm pg_upgradecheck failures

2015-08-02 Thread Tom Lane
I wrote:
 unlink(/tmp/.s.PGSQL.5432)= 0
 unlink(postmaster.pid)= 0
 unlink(/tmp/.s.PGSQL.5432.lock)   = 0
 exit_group(0)   = ?
 +++ exited with 0 +++

 I haven't looked to find out why the unlinks happen in this order, but on
 a heavily loaded machine, it's certainly possible that the process would
 lose the CPU after unlink(postmaster.pid), and then a new postmaster
 could get far enough to see the socket lock file still there.  So that
 would account for low-probability failures in the pg_upgradecheck test,
 which is exactly what we've been seeing.

Further experimentation says that 9.0-9.2 do this in the expected order;
so somebody broke it during 9.3.

The lack of a close() on the postmaster socket goes all the way back
though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for intermittent buildfarm pg_upgradecheck failures

2015-08-02 Thread Michael Paquier
On Mon, Aug 3, 2015 at 1:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I haven't looked to find out why the unlinks happen in this order, but on
 a heavily loaded machine, it's certainly possible that the process would
 lose the CPU after unlink(postmaster.pid), and then a new postmaster
 could get far enough to see the socket lock file still there.  So that
 would account for low-probability failures in the pg_upgradecheck test,
 which is exactly what we've been seeing.

Oh... This may explain the different failures seen with TAP tests on
hamster, and axolotl with pg_upgrade as well. It is rather easy to get
them heavily loaded.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers