Re: Blocking on a non-blocking socket?

2024-05-31 Thread Wiebe Cazemier via openssl-users
- Original Message -
> From: "Wiebe Cazemier" 
> To: openssl-users@openssl.org
> Sent: Thursday, 23 May, 2024 12:22:31
> Subject: Blocking on a non-blocking socket?
>
> Hi List,
> 
> I have a very obscure problem with an application using O_NONBLOCK still
> blocking. Over the course of a year of running with hundreds of thousands of
> clients, it has happened twice over the last month that a worker thread froze.
> It's a long story, but I'm pretty sure it's not a deadlock or spinning event
> loop or something, primarily because the application recovers after about 20
> minutes with a client erroring out with ETIMEDOUT. Coincidentally, that 20
> minutes matches the timeout description of the tcp man page [1].
> 
> It really looks like a non-blocking socket is still blocking. I found a
> report of a similar problem ([2]), but what they think SSL_MODE_AUTO_RETRY
> does doesn't match the documentation.
> 
> So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on
> (which is the default since 1.1.1) can block? Looking at the source code, I
> don't see any calls to fcntl() that remove O_NONBLOCK.
> 
> My IO method is SSL_read() and SSL_write() with an SSL object given to
> SSL_set_fd().
> 
> The only SSL mode I change from the default is
> SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER, which I set.
> 
> There are two primary deployments of this application, one with OpenSSL 1.1.1
> and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a
> coincidence.
> 
> Side question, is it a problem to call SSL_set_fd() before using fcntl() to
> set the fd to O_NONBLOCK? I ask, because the docs say "The BIO and hence the
> SSL engine inherit the behaviour of fd. If fd is non-blocking, the ssl will
> also have non-blocking behaviour.". The 'inherit' may be a key word here; not
> sure when it's done.
> 
> Regards,
> 
> Wiebe Cazemier


As a follow-up, the fault did turn out to be my own, as I imagine is also the 
case for [1]. They describe SSL_MODE_AUTO_RETRY as something that 'attempts to 
renegotiate a broken SSL connection', but all it really does is let a single 
read call process multiple records without returning in between.

Despite what I thought before, my code did have an unfortunate edge case: a 
while loop that kept spinning on SSL_write() when there was no room in the 
socket's send buffer. This would eventually fail with ETIMEDOUT.
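
For illustration, here is a minimal sketch of the alternative to such a spin 
loop. It uses plain POSIX write() on a socketpair rather than OpenSSL so that 
it stays self-contained; the names try_write, write_status and 
demo_fill_send_buffer are made up for this example. With OpenSSL, the 
analogous signal would be SSL_write() returning <= 0 and SSL_get_error() 
reporting SSL_ERROR_WANT_WRITE (or SSL_ERROR_WANT_READ). The point is the 
same in both cases: hand control back to the event loop and wait for 
readiness instead of calling again immediately.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum write_status { WRITE_DONE, WRITE_WOULD_BLOCK, WRITE_FAILED };

/*
 * Write as much of buf as the kernel accepts right now on a non-blocking fd.
 * On EAGAIN/EWOULDBLOCK this returns WRITE_WOULD_BLOCK instead of looping;
 * the caller is expected to wait for writability (POLLOUT/EPOLLOUT) and then
 * resume from buf + *written.
 */
static enum write_status try_write(int fd, const char *buf, size_t len,
                                   size_t *written)
{
    *written = 0;
    while (*written < len) {
        ssize_t n = write(fd, buf + *written, len - *written);
        if (n > 0) {
            *written += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                     /* interrupted: safe to retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return WRITE_WOULD_BLOCK;     /* send buffer full: do not spin */
        return WRITE_FAILED;              /* hard error, errno says why */
    }
    return WRITE_DONE;
}

/*
 * Demo: keep writing to a non-blocking socketpair whose peer never reads.
 * Returns the status that ended the loop.
 */
static enum write_status demo_fill_send_buffer(void)
{
    int sv[2];
    char chunk[4096] = {0};
    size_t written = 0;
    enum write_status st = WRITE_DONE;

    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    rc = fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL, 0) | O_NONBLOCK);
    assert(rc == 0);

    /* Bounded loop so the demo terminates even if the buffer never fills. */
    for (int i = 0; i < 100000 && st == WRITE_DONE; i++)
        st = try_write(sv[0], chunk, sizeof chunk, &written);

    close(sv[0]);
    close(sv[1]);
    return st;
}
```

Calling demo_fill_send_buffer() should end with WRITE_WOULD_BLOCK once the 
send buffer is full, which is the moment my loop kept calling SSL_write() 
instead of returning.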

Well, it was educational at least...


[1] https://github.com/alanxz/rabbitmq-c/issues/586





Re: Blocking on a non-blocking socket?

2024-05-23 Thread Wiebe Cazemier via openssl-users
Hi Detlef,

- Original Message -
> From: "Detlef Vollmann" 
> To: openssl-users@openssl.org
> Sent: Friday, 24 May, 2024 12:02:37
> Subject: Re: Blocking on a non-blocking socket?
> 
> That's correct, but if I understand Matt correctly, this isn't the case.
> The idea of SSL_MODE_AUTO_RETRY is that if there's data, but it isn't
> application data but some kind of handshake data, then SSL_read doesn't
> return (after handling the handshake data), but immediately retries.
> If this retry fails with EWOULDBLOCK (or actually BIO_read returns 0),
> then SSL_read returns with 0 and SSL_WANT_READ.

Wouldn't the option then have to be called 'read more than one record at a 
time'? To me, 'retry' is a bit of a misnomer in that description.

Tracing the code, whether to retry seems to be decided by 
BIO_fd_non_fatal_error(), which checks errno for EWOULDBLOCK. See [1] and [2].

Wiebe


[1] https://github.com/openssl/openssl/blob/b9e084f139c53ce133e66aba2f523c680141c0e6/crypto/bio/bss_fd.c#L226
[2] https://github.com/openssl/openssl/blob/b9e084f139c53ce133e66aba2f523c680141c0e6/crypto/bio/bss_fd.c#L113


Re: Blocking on a non-blocking socket?

2024-05-23 Thread Wiebe Cazemier via openssl-users
Hi Matt,

- Original Message -
> From: "Matt Caswell" 
> To: openssl-users@openssl.org
> Sent: Friday, 24 May, 2024 00:26:28
> Subject: Re: Blocking on a non-blocking socket?

> Not quite.
> 
> When you call SSL_read() it is because you are hoping to read
> application data.
> 
> OpenSSL will go ahead and attempt to read a record from the socket. If
> there is no data (and you are using a non-blocking socket), or only a
> partial record available then the SSL_read() call will fail and indicate
> SSL_ERROR_WANT_READ.
> 
> If a full record is available it will process it. If the record contains
> application data then the SSL_read() call will return successfully and
> provide the application data to the application.
> 
> If the record contains non-application data (i.e. some TLS protocol
> message like a key update, or new session ticket) then, with
> SSL_MODE_AUTO_RETRY on it will automatically try and read another record
> (and the above process repeats). 
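
To make the on/off difference concrete, here is a toy model of the steps 
quoted above. It is not OpenSSL code: record_queue, model_ssl_read and the 
enums are invented for this sketch, and the behaviour with the mode switched 
off (returning a want-read indication after one non-application record) is an 
assumption based on my reading of the SSL_CTX_set_mode documentation, not 
something stated in the quoted text.

```c
#include <assert.h>
#include <stddef.h>

enum record_type { REC_HANDSHAKE, REC_APP_DATA };
enum read_result { READ_APP_DATA, READ_WANT_READ };

/* Complete records that have already arrived on the (non-blocking) socket. */
struct record_queue {
    const enum record_type *records;
    size_t count;
    size_t next;    /* index of the next unread record */
};

/* Toy model of a single read call on a non-blocking socket. */
static enum read_result model_ssl_read(struct record_queue *q, int auto_retry)
{
    assert(q != NULL);
    for (;;) {
        if (q->next == q->count)
            return READ_WANT_READ;    /* nothing buffered: report, never wait */

        enum record_type t = q->records[q->next++];
        if (t == REC_APP_DATA)
            return READ_APP_DATA;     /* application data goes to the caller */

        /* A non-application record (key update, session ticket, ...) was
         * consumed internally. */
        if (!auto_retry)
            return READ_WANT_READ;    /* assumed: caller has to call again */
        /* auto_retry on: immediately try the next record in the same call */
    }
}
```

With two handshake records followed by one application-data record queued, 
the model returns application data from the first call when auto_retry is 1, 
and needs three calls (the first two reporting want-read) when it is 0. In 
neither case does it wait for data that is not there.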

Can you show me in the code where that is? It seems the callers of BIO_read() 
[1] are responsible for doing the retry, because the reader functions abort 
when retry is set. There are many such callers, for x509, evp, b64, etc. But 
the code is kind of hard to trace, because it's all calls through 
bio_method_st.bread function pointers.

My main concern is that, if it gets EWOULDBLOCK, there is (almost) no sense in 
retrying, because in the 100 microseconds or so that have passed there is 
likely still no data. Plus, is there a limit on how often it's retried? If the 
connection is broken (packet loss, so nobody is aware) in the middle of 
rekeying, it can retry all it wants, but nothing will ever come. If it does 
that, then at some point, reads on the socket would fail with ETIMEDOUT, which 
is what I'm seeing.


[1] https://github.com/openssl/openssl/blob/b9e084f139c53ce133e66aba2f523c680141c0e6/crypto/bio/bio_lib.c#L303


Re: Blocking on a non-blocking socket?

2024-05-23 Thread Wiebe Cazemier via openssl-users
Hi Neil,

- Original Message -
> From: "Neil Horman" 
> To: "Wiebe Cazemier" 
> Cc: "udhayakumar" , openssl-users@openssl.org
> Sent: Thursday, 23 May, 2024 23:42:18
> Subject: Re: Blocking on a non-blocking socket?

> from:
> https://www.openssl.org/docs/man1.0.2/man3/SSL_CTX_set_mode.html

> SSL_MODE_AUTO_RETRY in non-blocking mode should cause SSL_read/SSL_write to
> return -1 with an error code of WANT_READ/WANT_WRITE until such time as the
> re-negotiation has completed. I need to confirm that's the case in the code,
> but it seems to be. If the underlying socket is in non-blocking mode, there
> should be no way for calls to block in SSL_read/SSL_write on the socket
> read/write system call.

I still don't really see what difference SSL_MODE_AUTO_RETRY being on or off 
makes in non-blocking mode.

The person at [1] seems to have had a similar issue, and was convinced clearing 
SSL_MODE_AUTO_RETRY fixed it. But I agree, I don't know how it could be. 
OpenSSL would have to remove the O_NONBLOCK, or do select/poll, and I can't 
find it doing that.

I hope it happens again soon and I'm around to attach a debugger.

Regards,

Wiebe


[1] https://github.com/alanxz/rabbitmq-c/issues/586


Re: Blocking on a non-blocking socket?

2024-05-23 Thread Wiebe Cazemier via openssl-users
- Original Message -
> From: "Neil Horman" 
> To: "udhayakumar" 
> Cc: "Wiebe Cazemier" , openssl-users@openssl.org
> Sent: Thursday, 23 May, 2024 22:05:22
> Subject: Re: Blocking on a non-blocking socket?

> Do you have a stack trace of the thread hung in this state? That would confirm
> what's going on here.
> Neil

Hi Neil, 

No, I don't. I wasn't there to attach a debugger. It recovered before I could 
do that. And despite a lot of effort, I can't reproduce it either.

But in general, what does SSL_MODE_AUTO_RETRY on/off change in non-blocking 
mode? The documentation is too vague for me. It says:

> Setting SSL_MODE_AUTO_RETRY for a nonblocking BIO will process 
> non-application data records until either no more data is available or an 
> application data record has been processed.

But how is that different from disabling SSL_MODE_AUTO_RETRY?

Regards,

Wiebe


Blocking on a non-blocking socket?

2024-05-22 Thread Wiebe Cazemier via openssl-users
Hi List,

I have a very obscure problem with an application using O_NONBLOCK still 
blocking. Over the course of a year of running with hundreds of thousands of 
clients, it has happened twice over the last month that a worker thread froze. 
It's a long story, but I'm pretty sure it's not a deadlock or spinning event 
loop or something, primarily because the application recovers after about 20 
minutes with a client erroring out with ETIMEDOUT. Coincidentally, that 20 
minutes matches the timeout description of the tcp man page [1].

It really looks like a non-blocking socket is still blocking. I found a report 
of a similar problem ([2]), but what they think SSL_MODE_AUTO_RETRY does 
doesn't match the documentation.

So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on 
(which is the default since 1.1.1) can block? Looking at the source code, I 
don't see any calls to fcntl() that remove O_NONBLOCK.

My IO method is SSL_read() and SSL_write() with an SSL object given to 
SSL_set_fd().

The only SSL mode I change from the default is 
SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER, which I set.

There are two primary deployments of this application, one with OpenSSL 1.1.1 
and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a 
coincidence.

Side question, is it a problem to call SSL_set_fd() before using fcntl() to set 
the fd to O_NONBLOCK? I ask, because the docs say "The BIO and hence the SSL 
engine inherit the behaviour of fd. If fd is non-blocking, the ssl will also 
have non-blocking behaviour.". The 'inherit' may be a key word here; not sure 
when it's done.
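
A small sketch related to the ordering question, again with plain POSIX calls 
only. O_NONBLOCK is a file status flag on the open file description, so every 
read or write through any descriptor that refers to it sees the flag, no 
matter when it was set. Assuming the socket BIO created by SSL_set_fd() 
merely remembers the descriptor number and calls read/write on it (an 
assumption on my part, not checked against the OpenSSL source here), the 
order should not matter. The helper names set_nonblocking and 
demo_late_nonblock_errno are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Add O_NONBLOCK to the file status flags of fd. Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Demo: remember a descriptor first, set O_NONBLOCK afterwards, then read
 * from an empty socket through the remembered descriptor.
 * Returns the errno of that read, or 0 if the read did not fail.
 */
static int demo_late_nonblock_errno(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    /* Stand-in for "someone stored the fd before it became non-blocking".
     * A dup() is a stricter test than keeping the same number: both
     * descriptors share one open file description and hence its flags. */
    int stored_fd = dup(sv[0]);
    assert(stored_fd != -1);

    rc = set_nonblocking(sv[0]);      /* flag is set only now */
    assert(rc == 0);

    char byte;
    ssize_t n = read(stored_fd, &byte, 1);   /* no data was ever sent */
    int err = (n == -1) ? errno : 0;

    close(stored_fd);
    close(sv[0]);
    close(sv[1]);
    return err;
}
```

On a POSIX system the read should fail right away with EAGAIN (or 
EWOULDBLOCK) instead of blocking, even though the descriptor was stored 
before the flag was set.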

Regards,

Wiebe Cazemier



[1] https://man7.org/linux/man-pages/man7/tcp.7.html
[2] https://github.com/alanxz/rabbitmq-c/issues/586