Re: [HACKERS] FATAL: could not send end-of-streaming message to primary: no COPY in progress

2016-04-20 Thread Kyotaro HORIGUCHI
At Wed, 20 Apr 2016 16:16:40 +0900, Fujii Masao wrote:
> On Thu, Mar 31, 2016 at 9:15 AM, Thomas Munro wrote:
> > Hi hackers,
> >
> > If you shut down a primary server, a standby that is streaming from it
> > says:
> >
> > LOG:  replication terminated by primary server
> > DETAIL:  End of WAL reached on timeline 1 at 0/14F4B68.
> > FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> >
> > Isn't that FATAL ereport a bug?
> 
> ISTM that the cause is that walsender exits and the replication connection is
> closed just after "COPY 0" is sent. That is, after receiving "COPY 0",
> walreceiver tries to send an end-of-copy message to the primary, but fails
> because the connection has already been closed.

Though the message is followed by repetitions of other FATAL
messages, the message above itself seems a bit alarming.

> > How is clean server shutdown supposed to work?
> 
> One option is to make walsender wait for an end-of-copy message from walreceiver
> before it closes the connection and exits, after sending the "COPY 0" message.
> But one question is: how should walsender behave when walreceiver gets stuck
> and cannot reply with an end-of-copy message? We probably need
> a timeout (maybe we can use wal_sender_timeout here, but I'm not sure yet
> whether it's appropriate).

-1. It is totally useless other than to avoid the FATAL message.

> Another option is to prevent walreceiver from sending an end-of-copy message.
> If "COPY 0" always means the exit of walsender and the termination of
> the connection, there seems to be no need to send back an end-of-copy message.
> I've not checked yet how this interferes with other replication logic,
> though.

Looking into walsender.c, walsender treats "COPY 0" as the last thing
it sends before it dies, that is, it calls proc_exit(0) right after.

On the other hand the comment at the beginning of walreceiver.c
says that,

 * If the primary server ends streaming, but doesn't disconnect, walreceiver
 * goes into "waiting" mode, and waits for the startup process to give new
 * instructions. The startup process will treat that the same as
 * disconnection, and will rescan the archive/pg_xlog directory. But when the
 * startup process wants to try streaming replication again, it will just
 * nudge the existing walreceiver process that's waiting, instead of launching
 * a new one.

If we assume this is a useful behavior and want to keep it, a
termination after the end of XLOG streaming is just the same as
what psql sees:

| FATAL:  terminating connection due to administrator command
| server closed the connection unexpectedly
| This probably means the server terminated abnormally
| before or while processing the request.

Or we should provide another command to announce the termination.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: could not send end-of-streaming message to primary: no COPY in progress

2016-04-20 Thread Fujii Masao
On Thu, Mar 31, 2016 at 9:15 AM, Thomas Munro wrote:
> Hi hackers,
>
> If you shut down a primary server, a standby that is streaming from it says:
>
> LOG:  replication terminated by primary server
> DETAIL:  End of WAL reached on timeline 1 at 0/14F4B68.
> FATAL:  could not send end-of-streaming message to primary: no COPY in progress
>
> Isn't that FATAL ereport a bug?

ISTM that the cause is that walsender exits and the replication connection is
closed just after "COPY 0" is sent. That is, after receiving "COPY 0",
walreceiver tries to send an end-of-copy message to the primary, but fails
because the connection has already been closed.

> How is clean server shutdown supposed to work?

One option is to make walsender wait for an end-of-copy message from walreceiver
before it closes the connection and exits, after sending the "COPY 0" message.
But one question is: how should walsender behave when walreceiver gets stuck
and cannot reply with an end-of-copy message? We probably need
a timeout (maybe we can use wal_sender_timeout here, but I'm not sure yet
whether it's appropriate).

Another option is to prevent walreceiver from sending an end-of-copy message.
If "COPY 0" always means the exit of walsender and the termination of
the connection, there seems to be no need to send back an end-of-copy message.
I've not checked yet how this interferes with other replication logic, though.
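
As a rough illustration of the first option only, the decision walsender
would face while waiting for the end-of-copy reply might be sketched as
below. The enum, the function name, and the idea of passing elapsed time
in milliseconds are all hypothetical and not taken from walsender.c. The
sketch also assumes that a timeout of zero means "no timeout", as it does
for wal_sender_timeout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a walsender that waits after sending "COPY 0". */
typedef enum {
    SHUTDOWN_WAIT_MORE,        /* keep waiting for the end-of-copy reply */
    SHUTDOWN_CLOSE_ACKED,      /* reply arrived, close the connection */
    SHUTDOWN_CLOSE_TIMED_OUT   /* walreceiver seems stuck, close anyway */
} ShutdownWaitAction;

/*
 * Decide what to do next. timeout_ms <= 0 is assumed to disable the
 * timeout, in which case a stuck walreceiver keeps us waiting forever.
 */
ShutdownWaitAction
shutdown_wait_action(bool got_end_of_copy, long elapsed_ms, long timeout_ms)
{
    assert(elapsed_ms >= 0);

    if (got_end_of_copy)
        return SHUTDOWN_CLOSE_ACKED;
    if (timeout_ms > 0 && elapsed_ms >= timeout_ms)
        return SHUTDOWN_CLOSE_TIMED_OUT;
    return SHUTDOWN_WAIT_MORE;
}
```

With the timeout disabled, the function never returns
SHUTDOWN_CLOSE_TIMED_OUT, which is exactly the stuck-walreceiver case
the question above is about.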

Regards,

-- 
Fujii Masao




[HACKERS] FATAL: could not send end-of-streaming message to primary: no COPY in progress

2016-03-30 Thread Thomas Munro
Hi hackers,

If you shut down a primary server, a standby that is streaming from it says:

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/14F4B68.
FATAL:  could not send end-of-streaming message to primary: no COPY in progress

Isn't that FATAL ereport a bug?

I haven't worked out the root cause, but the immediate problem seems to
be that libpqrcv_endstreaming calls PQputCopyEnd, which doesn't like the
state that the libpq connection is in, namely PGASYNC_BUSY.  That
state seems to have been established by the call to walrcv_receive
that returned -1 (end of copy).  It doesn't happen in the similar case
of promotion of the remote server.
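
To make the state problem concrete, here is a toy model of it. This is
not libpq code: the Model* names are hypothetical stand-ins. It also
relies on an assumption of mine about the state changes, namely that a
CopyDone message leaves the sending half of the copy open while any
other non-data message (such as the CommandComplete carrying "COPY 0")
ends the copy entirely.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for libpq's connection states; not libpq's own types. */
typedef enum {
    MODEL_BUSY,      /* plays the role of PGASYNC_BUSY */
    MODEL_COPY_IN,   /* we may still send copy data or an end-of-copy */
    MODEL_COPY_BOTH  /* streaming in both directions */
} ModelAsyncStatus;

typedef struct {
    ModelAsyncStatus status;
    char errmsg[64];
} ModelConn;

/*
 * Model of receiving one message while streaming.
 * Returns 1 for copy data, -1 for end of copy, -2 if no copy is active.
 */
int
model_receive(ModelConn *conn, char msgtype)
{
    assert(conn != NULL);
    if (conn->status != MODEL_COPY_BOTH)
        return -2;
    if (msgtype == 'd')              /* CopyData: keep streaming */
        return 1;
    if (msgtype == 'c')              /* CopyDone: our sending half stays open */
        conn->status = MODEL_COPY_IN;
    else                             /* e.g. 'C' (CommandComplete, "COPY 0") */
        conn->status = MODEL_BUSY;
    return -1;
}

/* Model of sending end-of-copy: only allowed while a copy is still open. */
int
model_put_copy_end(ModelConn *conn)
{
    assert(conn != NULL);
    if (conn->status != MODEL_COPY_IN && conn->status != MODEL_COPY_BOTH)
    {
        snprintf(conn->errmsg, sizeof(conn->errmsg), "no COPY in progress");
        return -1;
    }
    conn->status = MODEL_BUSY;
    return 1;
}
```

In this model, receiving 'C' while streaming and then calling
model_put_copy_end fails with "no COPY in progress", whereas receiving
'c' first lets the end-of-copy go through. That would be consistent
with the error not showing up in the promotion case.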

How is clean server shutdown supposed to work?  It looks like
walsender sends COPY 0 and then just hangs up.  Meanwhile, walreceiver
has to distinguish between that case and the new timeline case
which involves a further exchange of messages.  Is an explicit message
at the end of the copy stream saying either "goodbye" or "but wait,
there's more" lacking here?  Or is there some other way that
walreceiver could distinguish between clean shutdown of remote server
(no error necessary), unclean shutdown of remote server, and timeline
negotiation?

-- 
Thomas Munro
http://www.enterprisedb.com

