Re: [HACKERS] SSL renegotiation and other related woes

2015-02-13 Thread Heikki Linnakangas

On 02/12/2015 01:41 AM, Andres Freund wrote:

Hi,

On 2015-02-05 23:03:02 +0200, Heikki Linnakangas wrote:

On 01/26/2015 12:14 PM, Andres Freund wrote:

Hi,

When working on getting rid of ImmediateInterruptOK I wanted to verify
that ssl still works correctly. Turned out it didn't. But neither did it
in master.

Turns out there's two major things we do wrong:

1) We ignore the rule that once called and returning
SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
again with the same parameters. Unfortunately that rule doesn't mean
just that the same parameters have to be passed in, but also that we
can't just constantly switch between _read()/write(). Especially
nonblocking backend code (i.e. walsender) and the whole frontend code
violate this rule.


I don't buy that. Googling around, and looking at the man pages, I just
don't see anything to support that claim. I believe it is supported to
switch between reads and writes.


There's at least two implementations that seem to have workaround to
achieve something like that. Apache's mod_ssl and the ssl layer for
libevent.


2) We start renegotiations in be_tls_write() while in nonblocking mode,
but don't properly retry to handle socket readyness. There's a loop
that retries handshakes twenty times (???), but what actually is
needed is to call SSL_get_error() and ensure that there's actually
data available.


Yeah, that's just crazy.

Actually, I don't think we need to call SSL_do_handshake() at all. At least
on modern OpenSSL versions. OpenSSL internally uses a state machine, and
SSL_read() and SSL_write() calls will complete the handshake just as well.


Some blaming seems to show that that's been the case for a long time.


2) is easy enough to fix, but 1) is pretty hard. Before anybody says
that 1) isn't an important rule: It reliably causes connection aborts
within a couple renegotiations. The higher the latency the higher the
likelihood of aborts. Even locally it doesn't take very long to
abort. Errors usually are something like SSL connection has been closed
unexpectedly or SSL Error: sslv3 alert unexpected message and a host
of other similar messages. There's a couple reports of those in the
archives and I've seen many more in client logs.


Yeah, I can reproduce that. There's clearly something wrong.



I believe this is an OpenSSL bug. I traced the unexpected message error to
this code in OpenSSL's s3_read_bytes() function:


Yep, I got to that as well.


case SSL3_RT_APPLICATION_DATA:
/*
 * At this point, we were expecting handshake data, but have
 * application data.  If the library was running inside ssl3_read()
 * (i.e. in_read_app_data is set) and it makes sense to read
 * application data at this point (session renegotiation not yet
 * started), we will indulge it.
 */
if (s-s3-in_read_app_data 
(s-s3-total_renegotiations != 0) 
(((s-state  SSL_ST_CONNECT) 
  (s-state = SSL3_ST_CW_CLNT_HELLO_A) 
  (s-state = SSL3_ST_CR_SRVR_HELLO_A)
 ) || ((s-state  SSL_ST_ACCEPT) 
   (s-state = SSL3_ST_SW_HELLO_REQ_A) 
   (s-state = SSL3_ST_SR_CLNT_HELLO_A)
 )
)) {
s-s3-in_read_app_data = 2;
return (-1);
} else {
al = SSL_AD_UNEXPECTED_MESSAGE;
SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
goto f_err;
}


So that quite clearly throws an error, *unless* the original application
call was SSL_read(). As you also concluded, OpenSSL doesn't like it when
SSL_write() is called when it's in renegotiation. I think that's OpenSSL's
fault and it should indulge the application data message even in
SSL_write().


That'd be nice, but I think there were multiple reports to them about
this and there wasn't much of a response. Maybe it's time to try again.


In theory, I guess we should do similar hacks in the server, and always call
SSL_read() before SSL_write(), but it seems to work without it. Or maybe
not; OpenSSL server and client code is not symmetric, so perhaps it works in
server mode without that.


I think we should do it on the server side - got some server side errors
when I cranked up the pg_receivexlog's status interval and set the wal
sender timeout very low.


I've not been able to reproduce any errors on the server side, with my 
latest client-only patch. Not sure why. I would assume that the server 
has the same problem, but perhaps there are some mitigating factors that 
make it impossible or at least less likely there.


I'm planning to commit and backpatch the first two of the attached 
patches. The first is essentially the same as what I've posted before, 
with some minor cleanup. I realized that the hack can be isolated 
completely in fe-secure-openssl.c by putting the kill-switch in the 
custom BIO_read routine, instead of 

Re: [HACKERS] SSL renegotiation and other related woes

2015-02-12 Thread Heikki Linnakangas

On 02/12/2015 01:33 AM, Andres Freund wrote:

On 2015-02-11 14:54:03 +0200, Heikki Linnakangas wrote:

Thoughts? Can you reproduce any errors with this?


I'm on battery right now, so I can't really test much.

Did you test having an actual standby instead of pg_receivexlog? I saw
some slightly different errors there.

Does this patch alone work for you or did you test it together with the
earlier one to fix the renegotiation handling server side when
nonblocking? Because that frequently seemed to error out for me, at
least over actual network. A latch loop + checking for the states seemed
to fix that for me.


This patch alone worked for me.


+* NB: This relies on the calling code to call pqsecure_read(), 
completing
+* the renegotiation handshake, if pqsecure_write() returns 0. Otherwise
+* we'll never make progress.
+*/


I think this should a) refer to the fact that pqSendSome does that in
blocking scenarios b) expand the comment around the pqReadData to
reference that fact.

How are we going to deal with callers using libpq in nonblocking mode. A
client doing a large COPY FROM STDIN to the server using a nonblocking
connection (not a stupid thing to do) could hit this IIUC.


Hmm, that's problematic even without SSL. If you do a large COPY FROM 
STDIN, and the server sends a lot of NOTICEs, the NOTICEs will be 
accumulated in the TCP send and receive buffers until they fill up. 
After that, the server will block on the send, and will stop reading the 
tuple data the client sends, and you get a deadlock. In blocking mode, 
pqSendSome calls pqReadData to avoid that.


I think pqSendSome should call pqReadData() after getting EWOULDBLOCK, 
even in non-blocking mode. It won't completely prevent the problem: it's 
possible that the receive buffer fills up, but the application doesn't 
call pqSendSome() until the socket becomes writeable, which won't happen 
until the client drains the incoming data is drained and unblocks the 
server. But it greatly reduces the risk. In practice that will solve the 
problem for SSL renegotiations.


We should also document that the application should always wait for the 
socket readable condition, in addition to writeable, and call 
pqConsumeInput(). That fixes the issue for good.



Could we perhaps call SSL_peek() when SSL_write() hits WANTS_READ? In
combination with your fix I think that should fix the possibility of
such a deadlock? Especially on the serverside where the socket most of
the time is is in, although emulated, blocking mode that seems easier -
we can just retry afterwards.


Hmm. Not sure what the semantics of SSL_peek() are. At least we'd need 
to call it with a large enough buffer that it would read all the 
incoming data from the OS buffer. I think it would be safer to just call 
SSL_read(). Both the server and libpq have an input buffer that you can 
easily read into at any time.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSL renegotiation and other related woes

2015-02-11 Thread Andres Freund
On 2015-02-11 14:54:03 +0200, Heikki Linnakangas wrote:
 Thoughts? Can you reproduce any errors with this?

I'm on battery right now, so I can't really test much.

Did you test having an actual standby instead of pg_receivexlog? I saw
some slightly different errors there.

Does this patch alone work for you or did you test it together with the
earlier one to fix the renegotiation handling server side when
nonblocking? Because that frequently seemed to error out for me, at
least over actual network. A latch loop + checking for the states seemed
to fix that for me.

I'm pretty darn happy that you seem to have found a much less ugly
solution than mine. Although it's only pretty by comparison ;)

 + /*
 +  * To work-around an issue with OpenSSL and renegotiation, we don't
 +  * let SSL_write() to read any incoming data.

superflous to.

 +  * NB: This relies on the calling code to call pqsecure_read(), 
 completing
 +  * the renegotiation handshake, if pqsecure_write() returns 0. Otherwise
 +  * we'll never make progress.
 +  */

I think this should a) refer to the fact that pqSendSome does that in
blocking scenarios b) expand the comment around the pqReadData to
reference that fact.

How are we going to deal with callers using libpq in nonblocking mode. A
client doing a large COPY FROM STDIN to the server using a nonblocking
connection (not a stupid thing to do) could hit this IIUC.

Could we perhaps call SSL_peek() when SSL_write() hits WANTS_READ? In
combination with your fix I think that should fix the possibility of
such a deadlock? Especially on the serverside where the socket most of
the time is is in, although emulated, blocking mode that seems easier -
we can just retry afterwards.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSL renegotiation and other related woes

2015-02-11 Thread Andres Freund
Hi,

On 2015-02-05 23:03:02 +0200, Heikki Linnakangas wrote:
 On 01/26/2015 12:14 PM, Andres Freund wrote:
 Hi,
 
 When working on getting rid of ImmediateInterruptOK I wanted to verify
 that ssl still works correctly. Turned out it didn't. But neither did it
 in master.
 
 Turns out there's two major things we do wrong:
 
 1) We ignore the rule that once called and returning
 SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
 again with the same parameters. Unfortunately that rule doesn't mean
 just that the same parameters have to be passed in, but also that we
 can't just constantly switch between _read()/write(). Especially
 nonblocking backend code (i.e. walsender) and the whole frontend code
 violate this rule.
 
 I don't buy that. Googling around, and looking at the man pages, I just
 don't see anything to support that claim. I believe it is supported to
 switch between reads and writes.

There's at least two implementations that seem to have workaround to
achieve something like that. Apache's mod_ssl and the ssl layer for
libevent.

 2) We start renegotiations in be_tls_write() while in nonblocking mode,
 but don't properly retry to handle socket readyness. There's a loop
 that retries handshakes twenty times (???), but what actually is
 needed is to call SSL_get_error() and ensure that there's actually
 data available.
 
 Yeah, that's just crazy.
 
 Actually, I don't think we need to call SSL_do_handshake() at all. At least
 on modern OpenSSL versions. OpenSSL internally uses a state machine, and
 SSL_read() and SSL_write() calls will complete the handshake just as well.

Some blaming seems to show that that's been the case for a long time.

 2) is easy enough to fix, but 1) is pretty hard. Before anybody says
 that 1) isn't an important rule: It reliably causes connection aborts
 within a couple renegotiations. The higher the latency the higher the
 likelihood of aborts. Even locally it doesn't take very long to
 abort. Errors usually are something like SSL connection has been closed
 unexpectedly or SSL Error: sslv3 alert unexpected message and a host
 of other similar messages. There's a couple reports of those in the
 archives and I've seen many more in client logs.
 
 Yeah, I can reproduce that. There's clearly something wrong.

 I believe this is an OpenSSL bug. I traced the unexpected message error to
 this code in OpenSSL's s3_read_bytes() function:

Yep, I got to that as well.

 case SSL3_RT_APPLICATION_DATA:
 /*
  * At this point, we were expecting handshake data, but have
  * application data.  If the library was running inside ssl3_read()
  * (i.e. in_read_app_data is set) and it makes sense to read
  * application data at this point (session renegotiation not yet
  * started), we will indulge it.
  */
 if (s-s3-in_read_app_data 
 (s-s3-total_renegotiations != 0) 
 (((s-state  SSL_ST_CONNECT) 
   (s-state = SSL3_ST_CW_CLNT_HELLO_A) 
   (s-state = SSL3_ST_CR_SRVR_HELLO_A)
  ) || ((s-state  SSL_ST_ACCEPT) 
(s-state = SSL3_ST_SW_HELLO_REQ_A) 
(s-state = SSL3_ST_SR_CLNT_HELLO_A)
  )
 )) {
 s-s3-in_read_app_data = 2;
 return (-1);
 } else {
 al = SSL_AD_UNEXPECTED_MESSAGE;
  SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
 goto f_err;
 }
 
 So that quite clearly throws an error, *unless* the original application
 call was SSL_read(). As you also concluded, OpenSSL doesn't like it when
 SSL_write() is called when it's in renegotiation. I think that's OpenSSL's
 fault and it should indulge the application data message even in
 SSL_write().

That'd be nice, but I think there were multiple reports to them about
this and there wasn't much of a response. Maybe it's time to try again.

 In theory, I guess we should do similar hacks in the server, and always call
 SSL_read() before SSL_write(), but it seems to work without it. Or maybe
 not; OpenSSL server and client code is not symmetric, so perhaps it works in
 server mode without that.

I think we should do it on the server side - got some server side errors
when I cranked up the pg_receivexlog's status interval and set the wal
sender timeout very low.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSL renegotiation and other related woes

2015-02-11 Thread Heikki Linnakangas

On 02/05/2015 11:03 PM, Heikki Linnakangas wrote:

On 01/26/2015 12:14 PM, Andres Freund wrote:
Can we work-around that easily? I believe we can. The crucial part is
that we mustn't let SSL_write() to process any incoming application
data. We can achieve that if we always call SSL_read() to drain such
data, before calling SSL_write(). But that's not quite enough; if we're
in renegotiation, SSL_write() might still try to read more data from the
socket that has arrived after the SSL_read() call. Fortunately, we can
easily prevent that by hacking pqsecure_raw_read() to always return
EWOULDBLOCK, when it's called through SSL_write().

The attached patch does that. I've been running your pg_receivexlog test
case with this for about 15 minutes without any errors now, with
ssl_renegotiation_limit=50kB, while before it errored out within seconds.


Here is a simplified version of this patch. It seems to be enough to not 
let SSL_write() to read any data from the socket. No need to call 
SSL_read() first, and the server-side changes I made were only needed 
because of the other patch I had applied.


Thoughts? Can you reproduce any errors with this?


In theory, I guess we should do similar hacks in the server, and always
call SSL_read() before SSL_write(), but it seems to work without it. Or
maybe not; OpenSSL server and client code is not symmetric, so perhaps
it works in server mode without that.


Not included in this patch, but I believe we apply a similar patch to 
the server-side too.


- Heikki

From b78bfbb70127cbe3f360458eb2a97dcc28194fbc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakangas@iki.fi
Date: Thu, 5 Feb 2015 22:44:58 +0200
Subject: [PATCH 1/1] Fix sslv3 alert unexpected message errors in SSL
 renegotiation.

There seems to be a bug in OpenSSL renegotiation, so that when SSL_write()
needs to read data to complete the handshake, but it receives application
data instead, it gets confused and throws an unexpected message error. To
work around that, don't let SSL_write() to read data from the socket, even
if some is available. Arrange so that SSL_read() processes all incoming
messages instead.
---
 src/interfaces/libpq/fe-misc.c   | 14 +-
 src/interfaces/libpq/fe-secure-openssl.c | 20 
 src/interfaces/libpq/fe-secure.c |  9 +
 src/interfaces/libpq/libpq-int.h |  2 ++
 4 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 945e283..9dd02d5 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -920,11 +920,15 @@ pqSendSome(PGconn *conn, int len)
 			 * to clear the channel eventually because it's blocked trying to
 			 * send data to us.  (This can happen when we are sending a large
 			 * amount of COPY data, and the server has generated lots of
-			 * NOTICE responses.)  To avoid a deadlock situation, we must be
-			 * prepared to accept and buffer incoming data before we try
-			 * again.  Furthermore, it is possible that such incoming data
-			 * might not arrive until after we've gone to sleep.  Therefore,
-			 * we wait for either read ready or write ready.
+			 * NOTICE responses.)  Another scenario is that SSL renegotiation
+			 * is in progress, and the SSL library needs to read a message
+			 * to complete the handshake, before it can send more data.
+			 *
+			 * To avoid a deadlock situation, we must be prepared to accept
+			 * and buffer incoming data before we try again.  Furthermore, it
+			 * is possible that such incoming data might not arrive until
+			 * after we've gone to sleep.  Therefore, we wait for either read
+			 * ready or write ready.
 			 */
 			if (pqReadData(conn)  0)
 			{
diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index a32af34..2e75f3c 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -319,8 +319,28 @@ pgtls_write(PGconn *conn, const void *ptr, size_t len)
 	char		sebuf[256];
 	int			err;
 
+	/*
+	 * To work-around an issue with OpenSSL and renegotiation, we don't
+	 * let SSL_write() to read any incoming data.
+	 *
+	 * If SSL_write() is called, and renegotiation has just been inititated,
+	 * SSL_write() might try to read from the socket to complete the
+	 * handshake. If there was some application data in-flight, it might
+	 * receive the application data instead. That confuses it, and it throws
+	 * an sslv3 alert unexpected message error.
+	 *
+	 * We avoid that setting a kill-switch, pqsecure_block_raw_read, which
+	 * tells pqsecure_raw_read() to not return any data to the caller, even
+	 * if some is available.
+	 *
+	 * NB: This relies on the calling code to call pqsecure_read(), completing
+	 * the renegotiation handshake, if pqsecure_write() returns 0. Otherwise
+	 * we'll never make progress.
+	 */
 	SOCK_ERRNO_SET(0);
+	pqsecure_block_raw_read = true;
 	n = 

Re: [HACKERS] SSL renegotiation and other related woes

2015-02-05 Thread Heikki Linnakangas

On 01/26/2015 12:14 PM, Andres Freund wrote:

Hi,

When working on getting rid of ImmediateInterruptOK I wanted to verify
that ssl still works correctly. Turned out it didn't. But neither did it
in master.

Turns out there's two major things we do wrong:

1) We ignore the rule that once called and returning
SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
again with the same parameters. Unfortunately that rule doesn't mean
just that the same parameters have to be passed in, but also that we
can't just constantly switch between _read()/write(). Especially
nonblocking backend code (i.e. walsender) and the whole frontend code
violate this rule.


I don't buy that. Googling around, and looking at the man pages, I just 
don't see anything to support that claim. I believe it is supported to 
switch between reads and writes.



2) We start renegotiations in be_tls_write() while in nonblocking mode,
but don't properly retry to handle socket readyness. There's a loop
that retries handshakes twenty times (???), but what actually is
needed is to call SSL_get_error() and ensure that there's actually
data available.


Yeah, that's just crazy.

Actually, I don't think we need to call SSL_do_handshake() at all. At 
least on modern OpenSSL versions. OpenSSL internally uses a state 
machine, and SSL_read() and SSL_write() calls will complete the 
handshake just as well.



2) is easy enough to fix, but 1) is pretty hard. Before anybody says
that 1) isn't an important rule: It reliably causes connection aborts
within a couple renegotiations. The higher the latency the higher the
likelihood of aborts. Even locally it doesn't take very long to
abort. Errors usually are something like SSL connection has been closed
unexpectedly or SSL Error: sslv3 alert unexpected message and a host
of other similar messages. There's a couple reports of those in the
archives and I've seen many more in client logs.


Yeah, I can reproduce that. There's clearly something wrong.

I believe this is an OpenSSL bug. I traced the unexpected message 
error to this code in OpenSSL's s3_read_bytes() function:



case SSL3_RT_APPLICATION_DATA:
/*
 * At this point, we were expecting handshake data, but have
 * application data.  If the library was running inside ssl3_read()
 * (i.e. in_read_app_data is set) and it makes sense to read
 * application data at this point (session renegotiation not yet
 * started), we will indulge it.
 */
if (s-s3-in_read_app_data 
(s-s3-total_renegotiations != 0) 
(((s-state  SSL_ST_CONNECT) 
  (s-state = SSL3_ST_CW_CLNT_HELLO_A) 
  (s-state = SSL3_ST_CR_SRVR_HELLO_A)
 ) || ((s-state  SSL_ST_ACCEPT) 
   (s-state = SSL3_ST_SW_HELLO_REQ_A) 
   (s-state = SSL3_ST_SR_CLNT_HELLO_A)
 )
)) {
s-s3-in_read_app_data = 2;
return (-1);
} else {
al = SSL_AD_UNEXPECTED_MESSAGE;
SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
goto f_err;
}


So that quite clearly throws an error, *unless* the original application 
call was SSL_read(). As you also concluded, OpenSSL doesn't like it when 
SSL_write() is called when it's in renegotiation. I think that's 
OpenSSL's fault and it should indulge the application data message 
even in SSL_write().


Can we work-around that easily? I believe we can. The crucial part is 
that we mustn't let SSL_write() to process any incoming application 
data. We can achieve that if we always call SSL_read() to drain such 
data, before calling SSL_write(). But that's not quite enough; if we're 
in renegotiation, SSL_write() might still try to read more data from the 
socket that has arrived after the SSL_read() call. Fortunately, we can 
easily prevent that by hacking pqsecure_raw_read() to always return 
EWOULDBLOCK, when it's called through SSL_write().


The attached patch does that. I've been running your pg_receivexlog test 
case with this for about 15 minutes without any errors now, with 
ssl_renegotiation_limit=50kB, while before it errored out within seconds.


In theory, I guess we should do similar hacks in the server, and always 
call SSL_read() before SSL_write(), but it seems to work without it. Or 
maybe not; OpenSSL server and client code is not symmetric, so perhaps 
it works in server mode without that.


PS. The server code started renegotiation when it's 1kB from reaching 
ssl_renegotiation_limit. I increased that to 32kB, because in testing, I 
got some SSL failed to renegotiate connection before limit expired 
errors in testing before I did that.


PPS. I did all testing of this patch with my sleeping logic 
simplification patch applied 
(http://www.postgresql.org/message-id/54d3821e.9060...@vmware.com). It 
applies without it too, and I don't think it matters, but I thought 

[HACKERS] SSL renegotiation and other related woes

2015-01-26 Thread Andres Freund
Hi,

When working on getting rid of ImmediateInterruptOK I wanted to verify
that ssl still works correctly. Turned out it didn't. But neither did it
in master.

Turns out there's two major things we do wrong:

1) We ignore the rule that once called and returning
   SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
   again with the same parameters. Unfortunately that rule doesn't mean
   just that the same parameters have to be passed in, but also that we
   can't just constantly switch between _read()/write(). Especially
   nonblocking backend code (i.e. walsender) and the whole frontend code
   violate this rule.

2) We start renegotiations in be_tls_write() while in nonblocking mode,
   but don't properly retry to handle socket readyness. There's a loop
   that retries handshakes twenty times (???), but what actually is
   needed is to call SSL_get_error() and ensure that there's actually
   data available.

2) is easy enough to fix, but 1) is pretty hard. Before anybody says
that 1) isn't an important rule: It reliably causes connection aborts
within a couple renegotiations. The higher the latency the higher the
likelihood of aborts. Even locally it doesn't take very long to
abort. Errors usually are something like SSL connection has been closed
unexpectedly or SSL Error: sslv3 alert unexpected message and a host
of other similar messages. There's a couple reports of those in the
archives and I've seen many more in client logs.

As far as I can see the only realistic way to fix 1) is to change both
frontend and backend code to:
a) Always check for socket read/writeability before calling
   SSL_read/write() when in nonblocking mode. That's a bit annoying
   because it nearly doubles the amount of syscalls we do or client
   communication, but I can't really se an alternative. That allows us
   to avoid waiting inside after a WANT_READ/WRITE, or havin to setup a
   larger state machine that keeps track what we tried last.

b) When SSL_read/write nonetheless returns WANT_READ/WRITE, even though
   we tested for read/writeability, we're very likely doing
   renegotiation. In that case we'll just have to block. There's already
   code that busy loops (and thus waits) in the frontend
   (c.f. pgtls_read's WANT_WRITE case, triggered during reneg). We can't
   just return immediately to the upper layers as we'd otherwise likely
   violate the rule about calling ssl with the same parameters again.

c) Add a somewhat hacky optimization whereas we allow to break out of a
   WANT_READ condition in a nonblocking socket when ssl-state ==
   SSL_ST_OK. That's the cases where it actually, at least by my reading
   of the unreadable ssl code, safe to not wait. That case is somewhat
   important because we otherwise can end up waiting on both sides due
   to b), even when nonblocking calls where actually made.  That
   condition essentially means that we'll only block if renegotiation or
   partial reads are in progress. Afaics at least.

d) Remove the SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER hack - we don't
   actually need it anymore.

These errors are much less frequent when using a plain frontend
(e.g. psql/pgbench) because they don't use copy both stuff - the way
these clients use the FE/BE protocol there's essentially natural
synchronization points where nothing but renegotiation happens. With
walsender (or pipelined queries!) both sides can write at the same time.


My testcase for this is just to setup a server with a low
ssl_renegotiation_limit, generate lots of WAL (wal.sql attached) and
receive data via pg_receivexlog -n. Usually it'll error out quickly.


I've done a preliminary implementation of the above steps and it
survives transferring 25GB of WAL via the replication protocol with a
ssl_renegotiation_limit=100kB - previously it failed much earlier.


Does anybody have a neater way to tackle this? I'm not happy about this
solution, but I really can't think of anything better (save ditching
openssl maybe).  I'm willing to clean up my hacked up fix for this, but
not if we can't find agreement on the approach.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSL renegotiation and other related woes

2015-01-26 Thread Andres Freund
On 2015-01-26 11:14:05 +0100, Andres Freund wrote:
 My testcase for this is just to setup a server with a low
 ssl_renegotiation_limit, generate lots of WAL (wal.sql attached) and
 receive data via pg_receivexlog -n. Usually it'll error out quickly.

...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
DROP TABLE IF EXISTS foo_:client_id;
CREATE TABLE foo_:client_id AS SELECT * FROM generate_series(1, 100) a(a), 
generate_series(1, 1) b(b);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers