Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-21 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Committed.  I also added a slight tweak to the wording of the documentation.

Thank you, the doc looks clearer.

Regards
Takayuki Tsunakawa


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-19 Thread Robert Haas
On Tue, May 16, 2017 at 11:58 PM, Michael Paquier
 wrote:
> Thanks for the updated patch. This looks good to me.

Committed.  I also added a slight tweak to the wording of the documentation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-16 Thread Michael Paquier
On Wed, May 17, 2017 at 11:49 AM, Tsunakawa, Takayuki
 wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier
>> pqWait is internal to libpq, so we are free to set up what we want here.
>> Still I think that we should be consistent with what pqSocketCheck returns:
>
> Please let this what it is now for the same reason Robert mentioned.
>
>> +intret = 0;
>> +inttimeout = 0;
>> The declaration of "ret" should be internal in the for(;;) loop.
>
> Done.
>
>> +   /* Attempt connection to the next host, starting the
>> connect_timeout timer */
>> +   pqDropConnection(conn, true);
>> +   conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
>> +   conn->status = CONNECTION_NEEDED;
>> +   finish_time = time(NULL) + timeout;
>> +   }
>> I think that it would be safer to not set finish_time if
>> conn->connect_timeout is NULL. I agree that your code works because
>> pqWaitTimed() will never complain on timeout reached if finish_time is -1.
>> That's for robustness sake.
>
> Done, but I'm not sure how this contributes to the robustness.  I guess you 
> were concerned just in case pqWaitTimed() returned 0 (timeout) even when it 
> should not.

Thanks for the updated patch. This looks good to me.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-16 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier
> pqWait is internal to libpq, so we are free to set up what we want here.
> Still I think that we should be consistent with what pqSocketCheck returns:

Please let this what it is now for the same reason Robert mentioned.

> +intret = 0;
> +inttimeout = 0;
> The declaration of "ret" should be internal in the for(;;) loop.

Done.


> +   /* Attempt connection to the next host, starting the
> connect_timeout timer */
> +   pqDropConnection(conn, true);
> +   conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
> +   conn->status = CONNECTION_NEEDED;
> +   finish_time = time(NULL) + timeout;
> +   }
> I think that it would be safer to not set finish_time if
> conn->connect_timeout is NULL. I agree that your code works because
> pqWaitTimed() will never complain on timeout reached if finish_time is -1.
> That's for robustness sake.

Done, but I'm not sure how this contributes to the robustness.  I guess you 
were concerned just in case pqWaitTimed() returned 0 (timeout) even when it 
should not.

Regards
Takayuki Tsunakawa




libpq-retry-connect-on-timeout_v2.patch
Description: libpq-retry-connect-on-timeout_v2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-16 Thread Michael Paquier
On Wed, May 17, 2017 at 12:47 AM, Robert Haas  wrote:
> On Mon, May 15, 2017 at 9:59 PM, Michael Paquier
>  wrote:
>> + *
>> + * Returns -1 on failure, 0 if the socket is readable/writable, 1 if
>> it timed out.
>>   */
>> pqWait is internal to libpq, so we are free to set up what we want
>> here. Still I think that we should be consistent with what
>> pqSocketCheck returns:
>> * >0 means that the socket is readable/writable, counting things.
>> * 0 is for timeout.
>> * -1 on failure.
>
> That would imply a lot more change, though.  The way that the patch
> currently does it, none of the other callers of pqWait() or
> pqWaitTimed() need to be adjusted.  So I prefer the way that Tsunakawa
> Takayuki currently has this over your proposal.

Consistency in APIs matters, but I won't fight hard in favor of this
item either. In short I am fine to discard this comment.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-16 Thread Robert Haas
On Mon, May 15, 2017 at 9:59 PM, Michael Paquier
 wrote:
> + *
> + * Returns -1 on failure, 0 if the socket is readable/writable, 1 if
> it timed out.
>   */
> pqWait is internal to libpq, so we are free to set up what we want
> here. Still I think that we should be consistent with what
> pqSocketCheck returns:
> * >0 means that the socket is readable/writable, counting things.
> * 0 is for timeout.
> * -1 on failure.

That would imply a lot more change, though.  The way that the patch
currently does it, none of the other callers of pqWait() or
pqWaitTimed() need to be adjusted.  So I prefer the way that Tsunakawa
Takayuki currently has this over your proposal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-16 Thread Tsunakawa, Takayuki
Hello Robert, Tom, Michael,

Thanks a lot for checking my patch.  Sorry, let me check Michael's review 
comments and reply tomorrow.

Regards
Takayuki Tsunakawa


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-15 Thread Michael Paquier
On Tue, May 16, 2017 at 3:13 AM, Tom Lane  wrote:
> FWIW, I think the position most of us were taking is that this feature
> is meant to retry transport-level connection failures, not cases where
> we successfully make a connection to a server and then it rejects our
> login attempt.  I would classify a timeout as a transport-level failure
> as long as it occurs before we got any server response --- if it happens
> during the authentication protocol, that's less clear.  But it might not
> be very practical to distinguish those two cases.
>
> In short, +1 for retrying on timeout during connection, and I'm okay with
> retrying a timeout during authentication if it's not practical to treat
> that differently.

Sensible argument here. It could happen that a server is unresponsive,
for example in split-brains (?).

I have been playing a bit with the patch.

+ *
+ * Returns -1 on failure, 0 if the socket is readable/writable, 1 if
it timed out.
  */
pqWait is internal to libpq, so we are free to set up what we want
here. Still I think that we should be consistent with what
pqSocketCheck returns:
* >0 means that the socket is readable/writable, counting things.
* 0 is for timeout.
* -1 on failure.

+intret = 0;
+inttimeout = 0;
The declaration of "ret" should be internal in the for(;;) loop.

+   /* Attempt connection to the next host, starting the
connect_timeout timer */
+   pqDropConnection(conn, true);
+   conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
+   conn->status = CONNECTION_NEEDED;
+   finish_time = time(NULL) + timeout;
+   }
I think that it would be safer to not set finish_time if
conn->connect_timeout is NULL. I agree that your code works because
pqWaitTimed() will never complain on timeout reached if finish_time is
-1. That's for robustness sake.

The docs say that for connect_timeout:
  
   Maximum wait for connection, in seconds (write as a decimal integer
   string). Zero or not specified means wait indefinitely.  It is not
   recommended to use a timeout of less than 2 seconds.
   This timeout applies separately to each connection attempt.
   For example, if you specify two hosts and both of them are unreachable,
   and connect_timeout is 5, the total time spent waiting for a
   connection might be up to 10 seconds.
  

It seems to me that this implies that if a timeout occurs on the first
connection, the counter is reset, which is what this patch is doing.
So we are all set.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-15 Thread Tom Lane
Robert Haas  writes:
> Takayuki Tsunakawa raised a very similar issue in another thread
> related to another open item, namely
> https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F6F5659%40G01JPEXMBYT05
> in which he argued that libpq ought to try then next host after a
> connection failure regardless of the reason for the connection
> failure.  Tom, Michael Paquier, and I all disagreed; none of us
> believe that this feature was intended to retry the connection to a
> different host after an arbitrary error reported by the remote server.
> This thread is essentially the same issue, except here the question
> isn't what should happen after we connect to a server and it returns
> an error, but rather what happens when we time out waiting to connect
> to a server.  When that happens, should we give up, or try the next
> server?

FWIW, I think the position most of us were taking is that this feature
is meant to retry transport-level connection failures, not cases where
we successfully make a connection to a server and then it rejects our
login attempt.  I would classify a timeout as a transport-level failure
as long as it occurs before we got any server response --- if it happens
during the authentication protocol, that's less clear.  But it might not
be very practical to distinguish those two cases.

In short, +1 for retrying on timeout during connection, and I'm okay with
retrying a timeout during authentication if it's not practical to treat
that differently.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-15 Thread Robert Haas
On Sun, May 14, 2017 at 11:45 PM, Noah Misch  wrote:
>> I'll add this item in the PostgreSQL 10 Open Items.
>
> [Action required within three days.  This is a generic notification.]

I think there is a good argument that the existing behavior is as per
the documentation, but I think we may want to change it anyway.  What
the documentation is saying - or at least what I believe I intended
for it to say - is that connect_timeout is restarted for each new
host, so you could end up waiting longer than connect_timeout - but
not forever - if you specify multiple hosts.  And I believe that
statement to be correct. Takayuki Tsunakawa is saying something
different. He's saying that when connect_timeout expires, we should
try the next host instead of giving up. That may or may not be a good
idea, but it doesn't contradict the passage from the documentation
which he quoted.  That passage from the documentation doesn't say
anything at all about what happens when connect_timeout expires.  It
only talks about how much time might pass before that happens.

Takayuki Tsunakawa raised a very similar issue in another thread
related to another open item, namely
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F6F5659%40G01JPEXMBYT05
in which he argued that libpq ought to try then next host after a
connection failure regardless of the reason for the connection
failure.  Tom, Michael Paquier, and I all disagreed; none of us
believe that this feature was intended to retry the connection to a
different host after an arbitrary error reported by the remote server.
This thread is essentially the same issue, except here the question
isn't what should happen after we connect to a server and it returns
an error, but rather what happens when we time out waiting to connect
to a server.  When that happens, should we give up, or try the next
server?

Despite the chorus of support for the opposite conclusion on the other
thread, I'm inclined to think that it would be best to change the
behavior here as per the proposed patch.  The point of being able to
specify multiple hosts is to be able to have multiple database servers
(or perhaps, multiple ways to access the same database server) and use
whichever one of those servers is currently up.  I think that when the
server fails with a complaint like "I've never heard of the database
to which you want to connect" that's not a case of the server being
down, but some other kind of trouble that the administrator really
ought to fix; thus it's best to stop and report the error.  But if
connect_timeout expires, that sounds a whole lot like the server being
down.  It sounds morally equivalent to socket() or connect() failing
outright, which *would* trigger advancing to the next host.

So I'm inclined to accept the patch, but as a definitional change
rather than a bug fix.  However, I'd like to hear some other opinions.
I'll wait until Friday for such opinions to arrive, and then update on
next steps.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-14 Thread Noah Misch
On Fri, May 12, 2017 at 08:54:13AM +, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tsunakawa,
> > Takayuki
> > I found a wrong sentence here in the doc.  I'm sorry, this is what I asked
> > you to add...
> > 
> > https://www.postgresql.org/docs/devel/static/libpq-connect.html#libpq-
> > paramkeywords
> > 
> > connect_timeout
> > Maximum wait for connection, in seconds (write as a decimal integer string).
> > Zero or not specified means wait indefinitely. It is not recommended to
> > use a timeout of less than 2 seconds. This timeout applies separately to
> > each connection attempt. For example, if you specify two hosts and both
> > of them are unreachable, and connect_timeout is 5, the total time spent
> > waiting for a connection might be up to 10 seconds.
> > 
> > 
> > The program behavior is that libpq times out after connect_timeout seconds
> > regardless of how many hosts are specified.  I confirmed it like this:
> > 
> > $ export PGOPTIONS="-c post_auth_delay=30"
> > $ psql -d "dbname=postgres connect_timeout=5" -h localhost,localhost -p
> > 5432,5433
> > (psql erros out after 5 seconds)
> > 
> > Could you fix the doc with something like this?
> > 
> > "This timeout applies across all the connection attempts. For example, if
> > you specify two hosts and both of them are unreachable, and connect_timeout
> > is 5, the total time spent waiting for a connection is up to 5 seconds."
> > 
> > Should we also change the minimum "2 seconds" part to be longer, according
> > to the number of hosts?
> 
> Instead, I think we should fix the program to match the documented behavior.  
> Otherwise, if the first database machine is down, libpq might wait for about 
> 2 hours (depending on the OS's TCP keepalive setting), during which it tims 
> out after connect_timeout and does not attempt to connect to other hosts.
> 
> I'll add this item in the PostgreSQL 10 Open Items.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-14 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tsunakawa,> Takayuki
> Instead, I think we should fix the program to match the documented behavior.
> Otherwise, if the first database machine is down, libpq might wait for about
> 2 hours (depending on the OS's TCP keepalive setting), during which it tims
> out after connect_timeout and does not attempt to connect to other hosts.
> 
> I'll add this item in the PostgreSQL 10 Open Items.

Please use the attached patch to fix the problem.  I confirmed the success as 
follows:

$ add "post_auth_delay = 10" in postgresql.conf on host1
$ start database servers on host1 and host2
$ psql -h host1,host2 -p 5432,5433 -d "dbname=postgres connect_timeout=3"
(psql connected to host2 after 3 seconds, which I checked with \conninfo)

Regards
Takayuki Tsunakawa

diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
index eb5aaf7098..d02e5201fa 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -1720,6 +1720,8 @@ connectDBComplete(PGconn *conn)
 {
PostgresPollingStatusType flag = PGRES_POLLING_WRITING;
time_t  finish_time = ((time_t) -1);
+   int ret = 0;
+   int timeout = 0;
 
if (conn == NULL || conn->status == CONNECTION_BAD)
return 0;
@@ -1729,8 +1731,7 @@ connectDBComplete(PGconn *conn)
 */
if (conn->connect_timeout != NULL)
{
-   int timeout = atoi(conn->connect_timeout);
-
+   timeout = atoi(conn->connect_timeout);
if (timeout > 0)
{
/*
@@ -1761,7 +1762,8 @@ connectDBComplete(PGconn *conn)
return 1;   /* success! */
 
case PGRES_POLLING_READING:
-   if (pqWaitTimed(1, 0, conn, finish_time))
+   ret = pqWaitTimed(1, 0, conn, finish_time);
+   if (ret == -1)
{
conn->status = CONNECTION_BAD;
return 0;
@@ -1769,7 +1771,8 @@ connectDBComplete(PGconn *conn)
break;
 
case PGRES_POLLING_WRITING:
-   if (pqWaitTimed(0, 1, conn, finish_time))
+   ret = pqWaitTimed(0, 1, conn, finish_time);
+   if (ret == -1)
{
conn->status = CONNECTION_BAD;
return 0;
@@ -1782,6 +1785,22 @@ connectDBComplete(PGconn *conn)
return 0;
}
 
+   if (ret == 1)   /* connect_timeout elapsed */
+   {
+   /* If there are no more hosts, return (the error 
message is already set) */
+   if (++conn->whichhost >= conn->nconnhost)
+   {
+   conn->whichhost = 0;
+   conn->status = CONNECTION_BAD;
+   return 0;
+   }
+   /* Attempt connection to the next host, starting the 
connect_timeout timer */
+   pqDropConnection(conn, true);
+   conn->addr_cur = 
conn->connhost[conn->whichhost].addrlist;
+   conn->status = CONNECTION_NEEDED;
+   finish_time = time(NULL) + timeout;
+   }
+
/*
 * Now try to advance the state machine.
 */
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 756c6d7779..1d6ea93a0a 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -991,11 +991,9 @@ pqWait(int forRead, int forWrite, PGconn *conn)
 /*
  * pqWaitTimed: wait, but not past finish_time.
  *
- * If finish_time is exceeded then we return failure (EOF).  This is like
- * the response for a kernel exception because we don't want the caller
- * to try to read/write in that case.
- *
  * finish_time = ((time_t) -1) disables the wait limit.
+ *
+ * Returns -1 on failure, 0 if the socket is readable/writable, 1 if it timed 
out.
  */
 int
 pqWaitTimed(int forRead, int forWrite, PGconn *conn, time_t finish_time)
@@ -1005,13 +1003,13 @@ pqWaitTimed(int forRead, int forWrite, PGconn *conn, 
time_t finish_time)
result = pqSocketCheck(conn, forRead, forWrite, finish_time);
 
if (result < 0)
-   return EOF; /* errorMessage is 
already set */
+   return -1;  /* errorMessage is 
already set */
 
if (result == 0)
{

[HACKERS] Re: [doc fix] PG10: wroing description on connect_timeout when multiple hosts are specified

2017-05-12 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tsunakawa,
> Takayuki
> I found a wrong sentence here in the doc.  I'm sorry, this is what I asked
> you to add...
> 
> https://www.postgresql.org/docs/devel/static/libpq-connect.html#libpq-
> paramkeywords
> 
> connect_timeout
> Maximum wait for connection, in seconds (write as a decimal integer string).
> Zero or not specified means wait indefinitely. It is not recommended to
> use a timeout of less than 2 seconds. This timeout applies separately to
> each connection attempt. For example, if you specify two hosts and both
> of them are unreachable, and connect_timeout is 5, the total time spent
> waiting for a connection might be up to 10 seconds.
> 
> 
> The program behavior is that libpq times out after connect_timeout seconds
> regardless of how many hosts are specified.  I confirmed it like this:
> 
> $ export PGOPTIONS="-c post_auth_delay=30"
> $ psql -d "dbname=postgres connect_timeout=5" -h localhost,localhost -p
> 5432,5433
> (psql erros out after 5 seconds)
> 
> Could you fix the doc with something like this?
> 
> "This timeout applies across all the connection attempts. For example, if
> you specify two hosts and both of them are unreachable, and connect_timeout
> is 5, the total time spent waiting for a connection is up to 5 seconds."
> 
> Should we also change the minimum "2 seconds" part to be longer, according
> to the number of hosts?

Instead, I think we should fix the program to match the documented behavior.  
Otherwise, if the first database machine is down, libpq might wait for about 2 
hours (depending on the OS's TCP keepalive setting), during which it tims out 
after connect_timeout and does not attempt to connect to other hosts.

I'll add this item in the PostgreSQL 10 Open Items.

Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers