Re: [HACKERS] walsender termination error messages worse in v10

2017-06-09 Thread Peter Eisentraut
On 6/8/17 01:57, Noah Misch wrote:
> On Fri, Jun 02, 2017 at 11:51:26AM -0700, Andres Freund wrote:
>> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
>> Author: Peter Eisentraut 
>> Date:   2017-03-23 08:36:36 -0400
>>
>> Logical replication support for initial data copy
>>
>> made walreceiver emit worse messages in v10 than before when the master
>> gets shut down.  Before 10 you'll get something like:
>>
>> 2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by 
>> primary server
>> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on 
>> timeline 1 at 0/1B7FB8C8.
>> 2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send 
>> end-of-streaming message to primary: no COPY in progress
>>
>> the last bit is a bit superflous, but it still kinda makes sense.  Now
>> you get:
>> 2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after 
>> CommandComplete: server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
> 
> [Action required within three days.  This is a generic notification.]

This was fixed by Andres.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-08 Thread Alvaro Herrera
Petr Jelinek wrote:

> It took me a while to understand why walreceiver does this originally,
> but it did make sense to me once I understood the [...], so I did it
> the same way.

In other words, the coding pattern needs a comment in both places.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-08 Thread Petr Jelinek
On 08/06/17 23:57, Andres Freund wrote:
> On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote:
>> On 02/06/17 23:45, Andres Freund wrote:
>>> Hi Petr,
>>>
>>> On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
 On 02/06/17 20:51, Andres Freund wrote:
> I don't understand why the new block is there, nor does the commit
> message explain it.
>

 Hmm, that particular change can actually be reverted. It was needed for
 one those custom replication commands which were replaced by normal
 query support. I have missed it during the rewrite.
>>>
>>> Doesn't appear to be quite that simple, I get regression test failures
>>> in that case.
>>>
>>
>> Hmm, looks like we still use it for normal COPY handling. So basically
>> the problem is that if we run COPY TO STDOUT and then consume it using
>> the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
>> need to call PQgetResult() in that case otherwise libpq thinks the
>> command is still active and any following command will fail, but if we
>> call PQgetResult on dead connection we get that error you complained about.
> 
> Should this possibly handled at the caller level?  This is a bit too
> much magic for my taste.

It is, I agree with that. I guess the main issue with handling it in
callers is that callers can't access libpq api so there would need to be
some additional abstraction (we did some of the necessary parts of that
for libpqrcv_exec).

> 
> Looking at the callers, the new code isn't super-obvious either:
> 
> len = walrcv_receive(wrconn, , );
> 
> if (len != 0)
> {
> /* Process the data */
> for (;;)
> {
> CHECK_FOR_INTERRUPTS();
> 
> if (len == 0)
> {
> break;
> }
> else if (len < 0)
> {
> ereport(LOG,
> (errmsg("data stream from publisher has ended")));
> endofstream = true;
> break;
> }
> 
> The len < 0, hidden inside a len != 0, which in the loop again chcks if
> len == 0 (because it's decremented in loop)?  And there's two different[5~
>   len = walrcv_receive(wrconn, , );
> statements?
> 

The logic there is that you have one main loop and one busy loop. When
there is flow of data, things happen as the data are processed and we
don't need to do the bookkeeping of timeouts, latching, other
maintenance tasks that need to be done (like checking table
synchronization in logical replication) because it all happens naturally
and that's what the inner loop is for. But when the upstream gets idle
and does not send anything (or there is network issue, etc) we need to
do these things (otherwise we'd miss timeouts, table synchronization
worker would wait forever on apply, etc) and that's what the outer loop
is for.

We only enter the inner loop if there were some data that's why there is
the firs len != 0, but then the inner loop bases decisions on the return
code of the walrcv_receive() so the if is repeated again (it also will
call walrcv_receive() as long as there are data so it needs to check for
those calls too).

It took me a while to understand why walreceiver does this originally,
but it did make sense to me once I understood the split between the main
(idle) loop and busy loop, so I did it the same way.

>> I guess it would make sense to do conditional exit on
>> (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
>> quite ugly code-wise though.
> 
> I think that's fine for now.  It'd imo be a good idea to improve matters
> here a bit, but for now I've just applied your patch.
> 

Okay, thanks.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-08 Thread Andres Freund
On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote:
> On 02/06/17 23:45, Andres Freund wrote:
> > Hi Petr,
> > 
> > On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
> >> On 02/06/17 20:51, Andres Freund wrote:
> >>> I don't understand why the new block is there, nor does the commit
> >>> message explain it.
> >>>
> >>
> >> Hmm, that particular change can actually be reverted. It was needed for
> >> one those custom replication commands which were replaced by normal
> >> query support. I have missed it during the rewrite.
> > 
> > Doesn't appear to be quite that simple, I get regression test failures
> > in that case.
> > 
> 
> Hmm, looks like we still use it for normal COPY handling. So basically
> the problem is that if we run COPY TO STDOUT and then consume it using
> the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
> need to call PQgetResult() in that case otherwise libpq thinks the
> command is still active and any following command will fail, but if we
> call PQgetResult on dead connection we get that error you complained about.

Should this possibly handled at the caller level?  This is a bit too
much magic for my taste.

Looking at the callers, the new code isn't super-obvious either:

len = walrcv_receive(wrconn, , );

if (len != 0)
{
/* Process the data */
for (;;)
{
CHECK_FOR_INTERRUPTS();

if (len == 0)
{
break;
}
else if (len < 0)
{
ereport(LOG,
(errmsg("data stream from publisher has ended")));
endofstream = true;
break;
}

The len < 0, hidden inside a len != 0, which in the loop again chcks if
len == 0 (because it's decremented in loop)?  And there's two different[5~
len = walrcv_receive(wrconn, , );
statements?

> I guess it would make sense to do conditional exit on
> (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
> quite ugly code-wise though.

I think that's fine for now.  It'd imo be a good idea to improve matters
here a bit, but for now I've just applied your patch.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-07 Thread Noah Misch
On Fri, Jun 02, 2017 at 11:51:26AM -0700, Andres Freund wrote:
> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> Author: Peter Eisentraut 
> Date:   2017-03-23 08:36:36 -0400
> 
> Logical replication support for initial data copy
> 
> made walreceiver emit worse messages in v10 than before when the master
> gets shut down.  Before 10 you'll get something like:
> 
> 2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by primary 
> server
> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on timeline 
> 1 at 0/1B7FB8C8.
> 2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send end-of-streaming 
> message to primary: no COPY in progress
> 
> the last bit is a bit superflous, but it still kinda makes sense.  Now
> you get:
> 2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after 
> CommandComplete: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-02 Thread Petr Jelinek
On 02/06/17 23:45, Andres Freund wrote:
> Hi Petr,
> 
> On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
>> On 02/06/17 20:51, Andres Freund wrote:
>>> I don't understand why the new block is there, nor does the commit
>>> message explain it.
>>>
>>
>> Hmm, that particular change can actually be reverted. It was needed for
>> one those custom replication commands which were replaced by normal
>> query support. I have missed it during the rewrite.
> 
> Doesn't appear to be quite that simple, I get regression test failures
> in that case.
> 

Hmm, looks like we still use it for normal COPY handling. So basically
the problem is that if we run COPY TO STDOUT and then consume it using
the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
need to call PQgetResult() in that case otherwise libpq thinks the
command is still active and any following command will fail, but if we
call PQgetResult on dead connection we get that error you complained about.

I guess it would make sense to do conditional exit on
(PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
quite ugly code-wise though.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


libpqwalsender-disconnect-fix.diff
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-02 Thread Andres Freund
Hi Petr,

On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
> On 02/06/17 20:51, Andres Freund wrote:
> > I don't understand why the new block is there, nor does the commit
> > message explain it.
> > 
> 
> Hmm, that particular change can actually be reverted. It was needed for
> one those custom replication commands which were replaced by normal
> query support. I have missed it during the rewrite.

Doesn't appear to be quite that simple, I get regression test failures
in that case.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] walsender termination error messages worse in v10

2017-06-02 Thread Petr Jelinek
Hi Andres,

On 02/06/17 20:51, Andres Freund wrote:
> Hi,
> 
> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> Author: Peter Eisentraut 
> Date:   2017-03-23 08:36:36 -0400
> 
> Logical replication support for initial data copy
> 
> made walreceiver emit worse messages in v10 than before when the master
> gets shut down.  Before 10 you'll get something like:
> 
> 2017-06-02 11:46:07.173 PDT [16143][] LOG:  replication terminated by primary 
> server
> 2017-06-02 11:46:07.173 PDT [16143][] DETAIL:  End of WAL reached on timeline 
> 1 at 0/1B7FB8C8.
> 2017-06-02 11:46:07.173 PDT [16143][] FATAL:  could not send end-of-streaming 
> message to primary: no COPY in progress
> 
> the last bit is a bit superflous, but it still kinda makes sense.  Now
> you get:
> 2017-06-02 11:47:09.362 PDT [17061][] FATAL:  unexpected result after 
> CommandComplete: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> 
> the reason is that
> 
> static int
> libpqrcv_receive(char **buffer, pgsocket *wait_fd)
> {
> 
> previously did:
> if (rawlen == -1)   /* end-of-streaming or error */
> {
> PGresult   *res;
> 
> res = PQgetResult(streamConn);
> if (PQresultStatus(res) == PGRES_COMMAND_OK ||
> PQresultStatus(res) == PGRES_COPY_IN)
> {
> PQclear(res);
> return -1;
> }
> else
> {
> PQclear(res);
> ereport(ERROR,
> (errmsg("could not receive data from WAL stream: %s",
> PQerrorMessage(streamConn;
> }
> }
> 
> and now does
> 
> if (rawlen == -1)   /* end-of-streaming or error */
> {
> PGresult   *res;
> 
> res = PQgetResult(conn->streamConn);
> if (PQresultStatus(res) == PGRES_COMMAND_OK)
> {
> PQclear(res);
> 
> /* Verify that there are no more results */
> res = PQgetResult(conn->streamConn);
> if (res != NULL)
> ereport(ERROR,
> (errmsg("unexpected result after CommandComplete: %s",
> PQerrorMessage(conn->streamConn;
> return -1;
> }
> else if (PQresultStatus(res) == PGRES_COPY_IN)
> {
> PQclear(res);
> return -1;
> }
> else
> {
> PQclear(res);
> ereport(ERROR,
> (errmsg("could not receive data from WAL stream: %s",
> pchomp(PQerrorMessage(conn->streamConn);
> }
> }
> 
> note the new /* Verify that there are no more results */ bit.
> 
> I don't understand why the new block is there, nor does the commit
> message explain it.
> 

Hmm, that particular change can actually be reverted. It was needed for
one those custom replication commands which were replaced by normal
query support. I have missed it during the rewrite.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers