Re: Blocking on a non-blocking socket?
- Original Message -
> From: "Wiebe Cazemier"
> To: openssl-users@openssl.org
> Sent: Thursday, 23 May, 2024 12:22:31
> Subject: Blocking on a non-blocking socket?
>
> Hi List,
>
> I have a very obscure problem with an application using O_NONBLOCK still blocking. Over the course of a year of running with hundreds of thousands of clients, it has happened twice over the last month that a worker thread froze. It's a long story, but I'm pretty sure it's not a deadlock or spinning event loop or something, primarily because the application recovers after about 20 minutes with a client erroring out with ETIMEDOUT. Coincidentally, that 20 minutes matches the timeout description of the tcp man page [1].
>
> It really looks like a non-blocking socket is still blocking. I found something with a similar problem ([2]), but what they think of SSL_MODE_AUTO_RETRY does not match the documentation.
>
> So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on (which is default since 1.1.1) can block? Looking at the source code, I don't see any calls to fcntl() that remove the O_NONBLOCK.
>
> My IO method is SSL_read() and SSL_write() with an SSL object given to SSL_set_fd().
>
> The only SSL mode I change from the default is that I set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.
>
> There are two primary deployments of this application, one with OpenSSL 1.1.1 and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a coincidence.
>
> Side question, is it a problem to set SSL_set_fd() before using fcntl to set the fd to O_NONBLOCK? I ask, because the docs say "The BIO and hence the SSL engine inherit the behaviour of fd. If fd is non-blocking, the ssl will also have non-blocking behaviour.". The 'inherit' may be a key word here; not sure when it's done.
>
> Regards,
>
> Wiebe Cazemier

As a follow-up: the fault did turn out to be my own, as I imagine the one in [1] is.

They describe SSL_MODE_AUTO_RETRY as 'attempts to renegotiate a broken SSL connection', but all SSL_MODE_AUTO_RETRY really does is let a single read call process more than one record, instead of returning after a non-application-data record.

Despite what I thought before, my code actually did have an unfortunate edge case where a while loop was spinning on SSL_write() when there was no room in the socket. This would eventually fail with ETIMEDOUT.

Well, it was educational at least...

[1] https://github.com/alanxz/rabbitmq-c/issues/586
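The edge case described in this follow-up (a loop retrying a write while the socket has no room) can be illustrated without OpenSSL. What follows is a minimal sketch on a plain non-blocking AF_UNIX socket pair, not the code from the application in question; `make_nonblocking_pair`, `fill_until_would_block` and `wait_writable` are hypothetical helper names. It shows that a write on a full non-blocking socket returns immediately with EAGAIN, so a bare retry loop spins, and that waiting for POLLOUT with poll() is one common alternative.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a connected socket pair with O_NONBLOCK set on both ends.
 * Returns 0 on success, -1 on error. */
int make_nonblocking_pair(int sv[2])
{
    int i;

    assert(sv != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int flags = fcntl(sv[i], F_GETFL);
        if (flags < 0 || fcntl(sv[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

/* Write to a non-blocking fd until the kernel buffer is full.
 * Returns the number of bytes accepted, or -1 on an unexpected error.
 * On a return >= 0 the last write() failed with EAGAIN/EWOULDBLOCK:
 * this is the point where a bare retry loop would start spinning. */
long fill_until_would_block(int fd)
{
    char chunk[4096];
    long total = 0;

    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = write(fd, chunk, sizeof(chunk));
        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;
        return -1;
    }
}

/* Wait up to timeout_ms for fd to become writable.
 * Returns 1 if writable, 0 on timeout, -1 on error. */
int wait_writable(int fd, int timeout_ms)
{
    struct pollfd p;
    int rc;

    p.fd = fd;
    p.events = POLLOUT;
    p.revents = 0;
    do {
        rc = poll(&p, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return -1;
    return (rc > 0 && (p.revents & POLLOUT)) ? 1 : 0;
}
```

With OpenSSL the same shape would apply to SSL_write() returning a failure with SSL_ERROR_WANT_WRITE: return to the event loop and retry once the socket is reported writable, rather than calling again immediately.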
Re: Blocking on a non-blocking socket?
Hi Detlef,

- Original Message -
> From: "Detlef Vollmann"
> To: openssl-users@openssl.org
> Sent: Friday, 24 May, 2024 12:02:37
> Subject: Re: Blocking on a non-blocking socket?
>
> That's correct, but if I understand Matt correctly, this isn't the case. The idea of SSL_MODE_AUTO_RETRY is that if there's data, but it isn't application data but some kind of handshake data, then SSL_read doesn't return (after handling the handshake data), but immediately retries. If this retry fails with EWOULDBLOCK (or actually BIO_read returns 0), then SSL_read returns with 0 and SSL_WANT_READ.

Wouldn't the option then have to be called 'read more than one record at a time'? To me, 'retry' is a bit of a misnomer in that description.

Tracing the code, whether a retry is considered seems to be decided by BIO_fd_non_fatal_error(), which looks at EWOULDBLOCK. See [1] and [2].

Wiebe

[1] https://github.com/openssl/openssl/blob/b9e084f139c53ce133e66aba2f523c680141c0e6/crypto/bio/bss_fd.c#L226
[2] https://github.com/openssl/openssl/blob/b9e084f139c53ce133e66aba2f523c680141c0e6/crypto/bio/bss_fd.c#L113
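The idea of classifying an errno value as "retry later" versus "real failure" can be sketched as below. This is a simplified, hypothetical helper (`is_transient_io_error` is not an OpenSSL function), and the set of errno values it accepts is an assumption chosen for illustration; the OpenSSL function linked in this message may accept a different set.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: returns 1 if err means "try again later",
 * 0 if it should be treated as a real failure.
 * Written with ifs rather than a switch because EAGAIN and EWOULDBLOCK
 * have the same value on Linux, which would give duplicate case labels. */
int is_transient_io_error(int err)
{
    if (err == EAGAIN || err == EWOULDBLOCK)
        return 1;   /* no data or no buffer space right now */
    if (err == EINTR)
        return 1;   /* interrupted by a signal before any transfer */
    return 0;       /* e.g. ETIMEDOUT, ECONNRESET, EPIPE */
}

/* Usage: the ETIMEDOUT seen in this thread is classified as fatal. */
void example_usage(void)
{
    assert(is_transient_io_error(EAGAIN));
    assert(!is_transient_io_error(ETIMEDOUT));
}
```

Note that "transient" here only means the caller may try again later. It says nothing about when, which is why a non-blocking caller normally waits for readiness before the next attempt.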
Re: Blocking on a non-blocking socket?
Hi Matt,

- Original Message -
> From: "Matt Caswell"
> To: openssl-users@openssl.org
> Sent: Friday, 24 May, 2024 00:26:28
> Subject: Re: Blocking on a non-blocking socket?
>
> Not quite.
>
> When you call SSL_read() it is because you are hoping to read application data.
>
> OpenSSL will go ahead and attempt to read a record from the socket. If there is no data (and you are using a non-blocking socket), or only a partial record available then the SSL_read() call will fail and indicate SSL_ERROR_WANT_READ.
>
> If a full record is available it will process it. If the record contains application data then the SSL_read() call will return successfully and provide the application data to the application.
>
> If the record contains non-application data (i.e. some TLS protocol message like a key update, or new session ticket) then, with SSL_MODE_AUTO_RETRY on it will automatically try and read another record (and the above process repeats).

Can you show me in the code where that is? It seems the callers of BIO_read() [1] are responsible for doing the retry, because the reader functions abort when retry is set. There are many such callers: x509, evp, b64, etc. The code is also kind of hard to trace, because it all goes through bio_method_st.bread function pointers.

My main concern is that if it gets an EWOULDBLOCK, there is (almost) no sense in retrying, because in the 100 microseconds or so that have passed, there is likely still no data. Plus, is there a limit on how often it retries? If the connection is broken (packet loss, so nobody is aware) in the middle of rekeying, it can retry all it wants, but nothing will ever come. If it does that, then at some point reads on the socket would fail with ETIMEDOUT, which is what I'm seeing.

[1] https://github.com/openssl/openssl/blob/b9e084f139c53ce133e66aba2f523c680141c0e6/crypto/bio/bio_lib.c#L303
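The read behaviour Matt describes can be captured in a toy model. The sketch below is not OpenSSL code: the types, names, and the `auto_retry` field are all hypothetical, and it ignores partial records and internal buffering. It only models the point that, on a non-blocking connection, the mode decides whether one call keeps going past a non-application record, and that in neither case does the call wait when no record is available.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these types and names are hypothetical, not OpenSSL's. */
enum rec_type { REC_HANDSHAKE, REC_APPDATA };
enum read_result { READ_APPDATA, READ_WANT_READ };

struct toy_conn {
    const enum rec_type *records; /* complete records already received */
    size_t count;                 /* number of records in the array */
    size_t next;                  /* index of the next unread record */
    int auto_retry;               /* models SSL_MODE_AUTO_RETRY */
};

/* Models one read call on a non-blocking connection. */
enum read_result toy_read(struct toy_conn *c)
{
    assert(c != NULL);
    for (;;) {
        enum rec_type t;

        if (c->next >= c->count)
            return READ_WANT_READ;  /* nothing available: never waits */
        t = c->records[c->next++];
        if (t == REC_APPDATA)
            return READ_APPDATA;    /* hand the data to the caller */
        /* A non-application record was processed internally. */
        if (!c->auto_retry)
            return READ_WANT_READ;  /* report back after one record */
        /* auto_retry set: loop and try the next record in the same call */
    }
}
```

In this model, a queue of two handshake records followed by one application-data record takes a single call with `auto_retry` set, and three calls with it cleared. The loop ends as soon as the queue is empty, so there is no retry count to limit.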
Re: Blocking on a non-blocking socket?
Hi Neil,

- Original Message -
> From: "Neil Horman"
> To: "Wiebe Cazemier"
> Cc: "udhayakumar", openssl-users@openssl.org
> Sent: Thursday, 23 May, 2024 23:42:18
> Subject: Re: Blocking on a non-blocking socket?
>
> from: https://www.openssl.org/docs/man1.0.2/man3/SSL_CTX_set_mode.html
>
> SSL_MODE_AUTO_RETRY in non-blocking mode should cause SSL_read/SSL_write to return -1 with an error code of WANT_READ/WANT_WRITE until such time as the re-negotiation has completed. I need to confirm that's the case in the code, but it seems to be. If the underlying socket is in non-blocking mode, there should be no way for calls to block in SSL_read/SSL_write on the socket read/write system call.

I still don't really see what the difference is between SSL_MODE_AUTO_RETRY being on or off in non-blocking mode.

The person at [1] seems to have had a similar issue, and was convinced that clearing SSL_MODE_AUTO_RETRY fixed it. But I agree, I don't know how it could be. OpenSSL would have to remove the O_NONBLOCK, or do select/poll, and I can't find it doing that.

I hope it happens again soon and I'm around to attach a debugger.

Regards,

Wiebe

[1] https://github.com/alanxz/rabbitmq-c/issues/586
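One way to rule out "something cleared O_NONBLOCK" is to query the flag directly. The sketch below uses only POSIX calls; the helper names are hypothetical. It also shows that O_NONBLOCK belongs to the open file description, so setting it after a second descriptor for the same socket already exists still affects reads through that descriptor. Whether this carries over to an fd passed to SSL_set_fd() rests on the assumption that the socket BIO just issues read/write calls on the stored descriptor; that is not verified here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if O_NONBLOCK is set on fd, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}

/* Sets O_NONBLOCK on fd, preserving other status flags. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Demonstration: the flag is set only after a second descriptor for the
 * same socket exists, yet a read through that descriptor does not block.
 * Returns 1 if read() failed with EAGAIN/EWOULDBLOCK as expected, else 0. */
int late_nonblock_takes_effect(void)
{
    int sv[2], alias, ok = 0;
    char byte;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 0;
    alias = dup(sv[0]);   /* stands in for "fd already handed over" */
    if (alias >= 0 && set_nonblocking(sv[0]) == 0) {
        assert(is_nonblocking(alias) == 1);  /* shared file description */
        ok = read(alias, &byte, 1) < 0 &&
             (errno == EAGAIN || errno == EWOULDBLOCK);
    }
    if (alias >= 0)
        close(alias);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Calling `is_nonblocking()` on the connection's fd from a debugger or a periodic check would show directly whether the flag is still set when a thread appears stuck.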
Re: Blocking on a non-blocking socket?
- Original Message -
> From: "Neil Horman"
> To: "udhayakumar"
> Cc: "Wiebe Cazemier", openssl-users@openssl.org
> Sent: Thursday, 23 May, 2024 22:05:22
> Subject: Re: Blocking on a non-blocking socket?
>
> do you have a stack trace of the thread hung in this state? That would confirm what's going on here
>
> Neil

Hi Neil,

No, I don't. I wasn't there to attach a debugger. It recovered before I could do that. And despite a lot of effort, I can't reproduce it either.

But in general, what does SSL_MODE_AUTO_RETRY on/off change in non-blocking mode? The documentation is too vague for me. It says:

> Setting SSL_MODE_AUTO_RETRY for a nonblocking BIO will process non-application data records until either no more data is available or an application data record has been processed.

But how is that different from disabling SSL_MODE_AUTO_RETRY?

Regards,

Wiebe
Blocking on a non-blocking socket?
Hi List,

I have a very obscure problem with an application using O_NONBLOCK still blocking. Over the course of a year of running with hundreds of thousands of clients, it has happened twice over the last month that a worker thread froze. It's a long story, but I'm pretty sure it's not a deadlock or spinning event loop or something, primarily because the application recovers after about 20 minutes with a client erroring out with ETIMEDOUT. Coincidentally, that 20 minutes matches the timeout description of the tcp man page [1].

It really looks like a non-blocking socket is still blocking. I found something with a similar problem ([2]), but what they think of SSL_MODE_AUTO_RETRY does not match the documentation.

So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on (which is default since 1.1.1) can block? Looking at the source code, I don't see any calls to fcntl() that remove the O_NONBLOCK.

My IO method is SSL_read() and SSL_write() with an SSL object given to SSL_set_fd().

The only SSL mode I change from the default is that I set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.

There are two primary deployments of this application, one with OpenSSL 1.1.1 and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a coincidence.

Side question, is it a problem to set SSL_set_fd() before using fcntl to set the fd to O_NONBLOCK? I ask, because the docs say "The BIO and hence the SSL engine inherit the behaviour of fd. If fd is non-blocking, the ssl will also have non-blocking behaviour.". The 'inherit' may be a key word here; not sure when it's done.

Regards,

Wiebe Cazemier

[1] https://man7.org/linux/man-pages/man7/tcp.7.html
[2] https://github.com/alanxz/rabbitmq-c/issues/586
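The "about 20 minutes" figure can be sanity-checked with a little arithmetic. The sketch below is a rough model, not kernel code: it assumes the Linux defaults of tcp_retries2 = 15, a minimum retransmission timeout of 200 ms and a maximum of 120 s, with the timeout doubling after each retransmission. `retransmit_timeout_ms` is a hypothetical helper name. In practice the starting timeout depends on the measured round-trip time, so this gives a lower bound.

```c
#include <assert.h>

/* Hypothetical helper: lower-bound time in milliseconds before a TCP
 * connection with unacknowledged data gives up, given `retries`
 * retransmissions and a timeout that starts at rto_min_ms, doubles after
 * every attempt, and is capped at rto_max_ms. */
long retransmit_timeout_ms(int retries, long rto_min_ms, long rto_max_ms)
{
    long total = 0;
    long rto = rto_min_ms;
    int i;

    assert(retries >= 0 && rto_min_ms > 0 && rto_max_ms >= rto_min_ms);
    /* retries + 1 waits: the original transmission and each
     * retransmission are each followed by one timeout interval. */
    for (i = 0; i <= retries; i++) {
        total += rto;
        rto = (rto * 2 > rto_max_ms) ? rto_max_ms : rto * 2;
    }
    return total;
}
```

Under those assumptions, `retransmit_timeout_ms(15, 200, 120000)` returns 924600, that is 924.6 seconds or a little over 15 minutes. A connection with a larger round-trip time starts from a larger timeout, so a recovery after about 20 minutes is in the range this model would predict.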