Re: [HACKERS] Possible fix for occasional failures on castoroides etc
On 2014-05-18 01:35:04 -0400, Tom Lane wrote:
> Dave Page writes:
> > On Sat, May 3, 2014 at 8:29 PM, Andres Freund
> > wrote:
> >> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
> >>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.
> >
> >> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> >> problems) that this doesn't seem to have fixed the issue. One further
> >> thing to try would be to try whether tcp connections don't have the same
> >> problem.
>
> > I've added:
> > EXTRA_REGRESS_OPTS => '--host=localhost',
> > to the build_env setting for both animals.
>
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-05-16%2014%3A27%3A58
> this did not fix the problem; however, the failure is
>
> ! psql: could not connect to server: Connection refused
> ! Is the server running locally and accepting
> ! connections on Unix domain socket "/tmp/.s.PGSQL.57345"?
>
> which shows that this configuration change did not actually have the
> desired effect of forcing the regression tests to be run across TCP.
> I'm too tired to check into what *would* force that.

I think that's just because EXTRA_REGRESS_OPTS is fairly new
(19fa6161dd6ba85b6c88b3476d165745dd5192d9). No idea if there's a nice
way to pass options to the pg_regress invocations of buildfarm animals.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
Dave Page writes:
> On Sat, May 3, 2014 at 8:29 PM, Andres Freund wrote:
>> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

>> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
>> problems) that this doesn't seem to have fixed the issue. One further
>> thing to try would be to try whether tcp connections don't have the same
>> problem.

> I've added:
> EXTRA_REGRESS_OPTS => '--host=localhost',
> to the build_env setting for both animals.

According to
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-05-16%2014%3A27%3A58
this did not fix the problem; however, the failure is

! psql: could not connect to server: Connection refused
! Is the server running locally and accepting
! connections on Unix domain socket "/tmp/.s.PGSQL.57345"?

which shows that this configuration change did not actually have the
desired effect of forcing the regression tests to be run across TCP.
I'm too tired to check into what *would* force that.

regards, tom lane
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
On Sat, May 3, 2014 at 8:29 PM, Andres Freund wrote:
> On 2014-05-03 13:25:32 -0400, Tom Lane wrote:
>> Andres Freund writes:
>> > On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>> >> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.
>>
>> > I've just noticed (while checking whether backporting 4c8aa8b5aea caused
>> > problems) that this doesn't seem to have fixed the issue. One further
>> > thing to try would be to try whether tcp connections don't have the same
>> > problem.
>>
>> I did some googling on this, and found out that people have seen identical
>> behavior on Solaris with mysql and other products, so at least we're not
>> alone.
>
> Yea, I found a couple of reports of that as well.
>
>> Googling also reminded me that we could have a look at the source
>> (duh), which is still available from hg.openindiana.org.
>
> I didn't get that far ;)
>
> I think we should try whether the problem disappears if tcp connections
> are used. That ought to be much more heavily used in the real
> world. Thus less likely to be buggy.
>
> While it's not documented as such, passing --host=localhost to
> pg_regress seems to have the desired effect. Dave, could you make your
> animal specify that?

I've added:

EXTRA_REGRESS_OPTS => '--host=localhost',

to the build_env setting for both animals.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
On 2014-05-03 13:25:32 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2012-09-17 08:23:01 -0400, Dave Page wrote:
> >> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.
>
> > I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> > problems) that this doesn't seem to have fixed the issue. One further
> > thing to try would be to try whether tcp connections don't have the same
> > problem.
>
> I did some googling on this, and found out that people have seen identical
> behavior on Solaris with mysql and other products, so at least we're not
> alone.

Yea, I found a couple of reports of that as well.

> Googling also reminded me that we could have a look at the source
> (duh), which is still available from hg.openindiana.org.

I didn't get that far ;)

I think we should try whether the problem disappears if tcp connections
are used. That ought to be much more heavily used in the real
world. Thus less likely to be buggy.

While it's not documented as such, passing --host=localhost to
pg_regress seems to have the desired effect. Dave, could you make your
animal specify that?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
I wrote:
> Unfortunately, it seems the Solaris implementors didn't read Stevens,
> because it looks to me like they *do* return ECONNREFUSED on accept queue
> overflow. Still, it's hard to see how that would be the issue if we're
> still seeing this failure with only five clients.

Also, after further inspection of the source code, it appears to me that
the kernel's limit on accept queue length is hard-wired at 4096 in Solaris.
So there's basically no way that we're hitting that limit in the
regression tests, and the MAX_CONNECTIONS configuration is irrelevant.
We seem to be left with the race condition theory.

In that connection, this comment in /usr/src/uts/common/io/tl.c is
interesting:

 * The T_CONN_CON is generated when processing the T_CONN_REQ i.e. before
 * a T_CONN_RES is received from the acceptor. This means that a socket
 * connect will complete before the peer has called accept.

I'm not sure that explains anything of value, but it's probably unlike any
other implementation, which makes it perhaps relevant. It implies that this
is totally unrelated to any server-side behavior; so if it's possible for us
to work around it at all, we'd have to do so client-side.

regards, tom lane
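To make the client-side idea concrete, here is a minimal sketch of a connect wrapper that retries a Unix-domain connect() when it fails with ECONNREFUSED. Nothing like this was proposed in the thread, and it is not how libpq or pg_regress behave: the name connect_unix_with_retry, the attempt limit, and the delay are all hypothetical, and whether a retry would actually hide the Solaris behavior is untested.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libpq or pg_regress): connect to the
 * Unix-domain stream socket at "path", retrying up to max_attempts times
 * when connect() fails with ECONNREFUSED.  Any other error is returned
 * immediately.  Returns a connected descriptor, or -1 with errno set.
 */
static int
connect_unix_with_retry(const char *path, int max_attempts, int delay_ms)
{
    struct sockaddr_un addr;
    int         attempt;

    assert(path != NULL);
    if (max_attempts < 1 || strlen(path) >= sizeof(addr.sun_path))
    {
        errno = EINVAL;
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    for (attempt = 1;; attempt++)
    {
        int         fd = socket(AF_UNIX, SOCK_STREAM, 0);
        int         saved_errno;

        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0)
            return fd;

        /* After a failed connect() the socket is discarded, not reused. */
        saved_errno = errno;
        close(fd);

        if (saved_errno != ECONNREFUSED || attempt >= max_attempts)
        {
            errno = saved_errno;
            return -1;
        }

        /* Sleep between attempts; poll() with no descriptors is a timer. */
        if (delay_ms > 0)
            (void) poll(NULL, 0, delay_ms);
    }
}
```

A retry like this cannot tell a spurious refusal from a server that really is down, so the attempt limit would have to stay small to keep genuine failures visible.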
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
Andres Freund writes:
> On 2012-09-17 08:23:01 -0400, Dave Page wrote:
>> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

> I've just noticed (while checking whether backporting 4c8aa8b5aea caused
> problems) that this doesn't seem to have fixed the issue. One further
> thing to try would be to try whether tcp connections don't have the same
> problem.

I did some googling on this, and found out that people have seen identical
behavior on Solaris with mysql and other products, so at least we're not
alone. Googling also reminded me that we could have a look at the source
(duh), which is still available from hg.openindiana.org. I poked around
a bit and more or less confirmed the theory mentioned here:
https://www.varnish-cache.org/trac/ticket/865

That is, Solaris' unix-sockets code will generate ECONNREFUSED if it finds
that the socket is not connected and not waiting for a connection *and*
there is no saved error code. One example is:

    if (so->so_error != 0)
        return (sogeterr(so, B_TRUE));
    /*
     * Under normal circumstances, so_error should contain an error
     * in case the connect failed. However, it is possible for another
     * thread to come in a consume the error, so generate a sensible
     * error in that case.
     */
    if ((so->so_state & SS_ISCONNECTED) == 0)
        return (ECONNREFUSED);

Now, I can't imagine where the "other thread" hypothesized in this comment
could be, so what I'm thinking is that maybe there's a bug somewhere that
drops the connection attempt without setting any error in so_error; or
maybe there's a race condition that releases the waiting client before
so_error is set. But that still leaves the question of why the connection
attempt is getting dropped at all.

BTW, I also found no less an authority than W. Richard Stevens saying that
my theory that this could happen from accept queue overflow was wrong,
at least in a sane implementation:
https://groups.google.com/forum/#!topic/comp.unix.solaris/e8QxFyXxr84

: >- there are too many outstanding connections that haven't
: >  been accepted yet (perhaps you can up the second parameter
: >  to listen)
:
: No. When the pending connection queue is filled, TCP ignores an
: arriving SYN, it does not respond with an RST. This is a soft error
: (a busy server) and by ignoring it, TCP forces the client to retransmit
: the SYN, hopefully finding a less busy server at some time in the future.
: For additional details and an example, check out pp. 257-260 of my "TCP/IP
: Illustrated" (Addison-Wesley, 1994).
:
: Rich Stevens

Unfortunately, it seems the Solaris implementors didn't read Stevens,
because it looks to me like they *do* return ECONNREFUSED on accept queue
overflow. Still, it's hard to see how that would be the issue if we're
still seeing this failure with only five clients.

There are just not that many references to ECONNREFUSED in the portions
of the Solaris source tree that look like they could be related to Unix
sockets, so it's hard to come up with more theories than this.

regards, tom lane
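One way to check the accept-queue theory against a given kernel would be a small probe along the following lines. This is only a sketch under stated assumptions, not something anyone in the thread ran: probe_accept_queue, struct probe_result, and MAX_PROBE_CLIENTS are hypothetical names. It binds a Unix-domain listener with a chosen backlog, issues a number of non-blocking connect() calls without ever calling accept(), and counts how each attempt ended.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MAX_PROBE_CLIENTS 64    /* arbitrary cap for this sketch */

/* Outcome counts for one probe run (hypothetical type). */
struct probe_result
{
    int         connected;  /* connect() succeeded with no accept() call */
    int         refused;    /* connect() failed with ECONNREFUSED */
    int         other;      /* any other failure */
};

/*
 * Listen on "path" with the given backlog, then attempt "nclients"
 * non-blocking connects while never accepting.  Fills *res and returns 0,
 * or returns -1 if the arguments are bad or the listener cannot be set up.
 */
static int
probe_accept_queue(const char *path, int backlog, int nclients,
                   struct probe_result *res)
{
    struct sockaddr_un addr;
    int         clients[MAX_PROBE_CLIENTS];
    int         nopen = 0;
    int         lfd;
    int         i;

    assert(path != NULL && res != NULL);
    if (nclients > MAX_PROBE_CLIENTS || strlen(path) >= sizeof(addr.sun_path))
    {
        errno = EINVAL;
        return -1;
    }

    memset(res, 0, sizeof(*res));
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    unlink(path);               /* remove a stale socket file, if any */
    if (bind(lfd, (struct sockaddr *) &addr, sizeof(addr)) != 0 ||
        listen(lfd, backlog) != 0)
    {
        close(lfd);
        return -1;
    }

    for (i = 0; i < nclients; i++)
    {
        int         fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
        {
            res->other++;
            continue;
        }
        /* Non-blocking, so that a full queue cannot stall the probe. */
        (void) fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0)
        {
            res->connected++;
            clients[nopen++] = fd;  /* keep it open so it stays queued */
            continue;
        }
        if (errno == ECONNREFUSED)
            res->refused++;
        else
            res->other++;
        close(fd);
    }

    for (i = 0; i < nopen; i++)
        close(clients[i]);
    close(lfd);
    unlink(path);
    return 0;
}
```

Running it with five clients and a backlog well above five mirrors the MAX_CONNECTIONS=5 setup: any refusals in that configuration would point away from queue overflow. The clients here connect one after another from a single process, so the probe does not exercise the concurrent timing that a race condition would depend on.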
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
On 2012-09-17 08:23:01 -0400, Dave Page wrote:
> On Sun, Sep 16, 2012 at 12:44 PM, Andrew Dunstan wrote:
> >
> > On 09/16/2012 12:04 PM, Tom Lane wrote:
> >>
> >> It's annoying that the buildfarm animals running on older versions of
> >> Solaris randomly fail with "Connection refused" errors, such as in
> >> today's example:
> >>
> >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52
> >>
> >> I believe what's probably happening there is that the kernel has a small
> >> hard-wired limit on the length of the postmaster's accept queue, and you
> >> get this failure if too many connection attempts arrive faster than the
> >> postmaster can service them. If that theory is correct, we could
> >> probably prevent these failures by reducing the number of tests run in
> >> parallel, which could be done by adding say
> >> MAX_CONNECTIONS=5
> >> to the environment in which the regression tests run. I'm not sure
> >> though if that's "build_env" or some other setting for the buildfarm
> >> script --- Andrew?
> >>
> >
> > Yes, in the build_env section of the config file.
> >
> > It's in the distributed sample config file, commented out.
>
> I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

I've just noticed (while checking whether backporting 4c8aa8b5aea caused
problems) that this doesn't seem to have fixed the issue. One further
thing to try would be to try whether tcp connections don't have the same
problem.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
On Sun, Sep 16, 2012 at 12:44 PM, Andrew Dunstan wrote:
>
> On 09/16/2012 12:04 PM, Tom Lane wrote:
>>
>> It's annoying that the buildfarm animals running on older versions of
>> Solaris randomly fail with "Connection refused" errors, such as in
>> today's example:
>>
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52
>>
>> I believe what's probably happening there is that the kernel has a small
>> hard-wired limit on the length of the postmaster's accept queue, and you
>> get this failure if too many connection attempts arrive faster than the
>> postmaster can service them. If that theory is correct, we could
>> probably prevent these failures by reducing the number of tests run in
>> parallel, which could be done by adding say
>> MAX_CONNECTIONS=5
>> to the environment in which the regression tests run. I'm not sure
>> though if that's "build_env" or some other setting for the buildfarm
>> script --- Andrew?
>>
>
> Yes, in the build_env section of the config file.
>
> It's in the distributed sample config file, commented out.

I've added MAX_CONNECTIONS=5 to both Castoroides and Protosciurus.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Possible fix for occasional failures on castoroides etc
On 09/16/2012 12:04 PM, Tom Lane wrote:
> It's annoying that the buildfarm animals running on older versions of
> Solaris randomly fail with "Connection refused" errors, such as in
> today's example:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52
>
> I believe what's probably happening there is that the kernel has a small
> hard-wired limit on the length of the postmaster's accept queue, and you
> get this failure if too many connection attempts arrive faster than the
> postmaster can service them. If that theory is correct, we could
> probably prevent these failures by reducing the number of tests run in
> parallel, which could be done by adding say
> MAX_CONNECTIONS=5
> to the environment in which the regression tests run. I'm not sure
> though if that's "build_env" or some other setting for the buildfarm
> script --- Andrew?

Yes, in the build_env section of the config file.

It's in the distributed sample config file, commented out.

cheers

andrew
[HACKERS] Possible fix for occasional failures on castoroides etc
It's annoying that the buildfarm animals running on older versions of
Solaris randomly fail with "Connection refused" errors, such as in
today's example:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=castoroides&dt=2012-09-15%2015%3A42%3A52

I believe what's probably happening there is that the kernel has a small
hard-wired limit on the length of the postmaster's accept queue, and you
get this failure if too many connection attempts arrive faster than the
postmaster can service them. If that theory is correct, we could
probably prevent these failures by reducing the number of tests run in
parallel, which could be done by adding say
MAX_CONNECTIONS=5
to the environment in which the regression tests run. I'm not sure
though if that's "build_env" or some other setting for the buildfarm
script --- Andrew?

regards, tom lane