Re: [HACKERS] (re)start in our init scripts seems broken

2016-07-20 Thread Tom Lane
Michael Paquier writes:
> On Wed, Jul 20, 2016 at 11:41 AM, Tomas Vondra wrote:
>> Is there a reason why it's coded like this? I think we should use pg_ctl
>> instead, or (at the very least) check the postmaster return code. Also,
>> perhaps we should add an explicit timeout, higher than 60 seconds.

> c8196c87 is one reason.

I think that 8f5500e6b improved that situation.  You still have to be
really careful when writing the init script that there not be more than
one postgres-owned shell process.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] (re)start in our init scripts seems broken

2016-07-19 Thread Michael Paquier
On Wed, Jul 20, 2016 at 11:41 AM, Tomas Vondra wrote:
> Is there a reason why it's coded like this? I think we should use pg_ctl
> instead, or (at the very least) check the postmaster return code. Also,
> perhaps we should add an explicit timeout, higher than 60 seconds.

c8196c87 is one reason. Honestly, I have always found that using
pg_ctl start -w is more robust in such scripts, and it avoids
maintaining sanity checks that are duplicates of the ones in pg_ctl
after the postmaster has started. So +1 for using that. Passing the
PG_OOM_* flags is not an issue either.
-- 
Michael




[HACKERS] (re)start in our init scripts seems broken

2016-07-19 Thread Tomas Vondra

Hi,

A few days ago I ran into a problem with the init script shipped in our
community RPM packages. A restart was initiated, but this happened:


# /etc/init.d/postgresql-9.3 restart
Stopping postgresql-9.3 service:   [FAILED]
Starting postgresql-9.3 service:   [  OK  ]

The database, however, was still shutting down, performing a
checkpoint. Sadly, the init script uses the default timeout, so the stop
gives up after just 60 seconds. That part seems fine, as the init
script reports the failure correctly.


However, the start action then appears to succeed, because it does this:

echo -n "$PSQL_START"
$SU -l postgres -c "$PGENGINE/postmaster -D '$PGDATA' ${PGOPTS} &" >> "$PGLOG" 2>&1 < /dev/null

sleep 2
pid=`head -n 1 "$PGDATA/postmaster.pid" 2>/dev/null`
if [ "x$pid" != x ]
then
    success "$PSQL_START"
    touch "$lockfile"
    echo $pid > "$pidfile"
    echo
else
    failure "$PSQL_START"
    echo
    script_result=1
fi

It simply attempts to start the postmaster directly (instead of using 
pg_ctl), does not check the return code and instead proceeds to check 
the postmaster.pid file and existence of the process.


This check does not work, however: the old instance is still running
(shutting down), so the new postmaster fails to start and does not
overwrite the file. And of course the PID in it still matches a running
process.
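
The false positive can be reproduced without PostgreSQL at all. The
sketch below is a hypothetical stand-in, not the packaged script: it
uses a temporary directory as PGDATA, writes the current shell's PID
where the old postmaster's would be, starts nothing, and then applies
the same "is the first line of postmaster.pid non-empty" test.

```shell
#!/bin/sh
# Hypothetical reproduction; the temporary PGDATA and the use of $$ are
# assumptions made for illustration only.
PGDATA=$(mktemp -d)

# A leftover pid file from the instance that is still shutting down.
echo "$$" > "$PGDATA/postmaster.pid"

# Deliberately start nothing here: this models a new postmaster that
# exited immediately because the old one still holds the data directory.

# Same test as in the init script: only checks that a PID can be read.
pid=$(head -n 1 "$PGDATA/postmaster.pid" 2>/dev/null)
if [ "x$pid" != x ]
then
    result="OK"
else
    result="FAILED"
fi

echo "start reported: $result"
rm -rf "$PGDATA"
```

The script reports OK although nothing was started, since the test
cannot tell a fresh pid file from a stale one.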


Is there a reason why it's coded like this? I think we should use
pg_ctl instead, or (at the very least) check the postmaster return code.
Also, perhaps we should add an explicit timeout, higher than 60 seconds.
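
A minimal sketch of what the pg_ctl variant could look like, assuming
the function runs as the postgres user (the real script would still go
through $SU). The names PG_CTL, PGSTARTTIMEOUT and start_postgres are
made up for this example; -D, -l, -w and -t are existing pg_ctl options.

```shell
# Sketch only: PG_CTL, PGSTARTTIMEOUT and start_postgres are assumed
# names, not taken from the packaged init script.
PG_CTL=${PG_CTL:-pg_ctl}
PGSTARTTIMEOUT=${PGSTARTTIMEOUT:-300}   # seconds; the 300 is arbitrary

start_postgres() {
    # -w waits until startup has completed, -t bounds that wait.
    # The exit status of pg_ctl decides success, not the pid file.
    if "$PG_CTL" start -D "$PGDATA" -l "$PGLOG" -w -t "$PGSTARTTIMEOUT" \
        >/dev/null 2>&1
    then
        echo "start: OK"
        return 0
    else
        echo "start: FAILED"
        return 1
    fi
}
```

The stop side could pass the same -w and -t options to "pg_ctl stop",
so that a long shutdown checkpoint is given more than 60 seconds.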


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

