RE: problems with too many ssl_read and ssl_write errors

2021-08-26 Thread Michael Wojcik
Please reply to the list rather than to me directly.

> From: Kamala Ayyar 
> Sent: Thursday, 26 August, 2021 08:57

> We call the WSAGetLastError immediately after SSL_ERROR_SYSCALL and we get
> the WSAETIMEDOUT

OK. This wasn't entirely clear to me from your previous message. So you are 
getting a network-stack timeout on a sockets operation; this isn't a TLS 
protocol issue or anything else at a level above the network stack.

> We also call the ERR_print_errors(bio); but it displays a blank line.  We call
> ERR_clear_error() before the SSL_read as mentioned in the manual.

I'm not sure why that might be happening. It may be that OpenSSL doesn't log 
any error messages in this case; I'd have to look at the OpenSSL source code to 
figure that out.

> The ERR_print_errors() does not print anything. Is the error getting cleared
> because we called the WSAGetLastError()?

That shouldn't affect the OpenSSL error list.

> Is there an order in which the Windows WSAGetLastError() should be called
> before SSL_get_error()?

I don't believe so. They should be independent. The OpenSSL error list is 
maintained by OpenSSL; WSAGetLastError retrieves the Winsock error code. The 
two don't share data.

> We will try changing some of the timeouts on either side and try.

Make sure it's stack timeouts you're changing: calls to setsockopt, or 
Registry settings if you're not overriding them on your sockets. 
Application-level timeouts aren't the issue here.
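
For reference, a stack-level timeout of the kind meant here could be set roughly as follows. This is only a sketch: the helper name set_stack_timeouts and the sock_t typedef are made up for illustration and come from neither OpenSSL nor Winsock. Note that Winsock expects the option value as a DWORD in milliseconds, while POSIX stacks expect a struct timeval.

```c
#include <stdint.h>
#ifdef _WIN32
#  include <winsock2.h>
typedef SOCKET sock_t;            /* illustrative typedef */
#else
#  include <sys/socket.h>
#  include <sys/time.h>
typedef int sock_t;               /* illustrative typedef */
#endif

/* Hypothetical helper: sets the stack-level receive and send timeouts
 * on an already-created socket. Returns 0 on success, -1 on failure. */
int set_stack_timeouts(sock_t s, uint32_t timeout_ms)
{
#ifdef _WIN32
    DWORD tv = timeout_ms;        /* Winsock: DWORD, in milliseconds */
    const char *opt = (const char *)&tv;
    int len = (int)sizeof tv;
#else
    struct timeval tv;            /* POSIX: struct timeval */
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    const void *opt = &tv;
    socklen_t len = sizeof tv;
#endif
    if (setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, opt, len) != 0)
        return -1;
    if (setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, opt, len) != 0)
        return -1;
    return 0;
}
```

With something like set_stack_timeouts(sock, 30000) applied to a socket (the 30-second figure is arbitrary), a blocked receive or send on Windows should fail with WSAETIMEDOUT once the timeout expires, and OpenSSL would then report that as SSL_ERROR_SYSCALL.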

You may need to involve a network administrator to look at network interface 
statistics, check wire traces to see if receive windows are closed, and look 
for interference from middleboxes such as routers and firewall appliances or 
from application firewalls, IDSes, and so on. These sorts of issues are not 
uncommon when there are load balancers, traffic-inspecting firewalls, or the 
like interfering with network traffic.

--
Michael Wojcik


RE: problems with too many ssl_read and ssl_write errors

2021-08-25 Thread Michael Wojcik
> From: Kamala Ayyar  
> Sent: Monday, 23 August, 2021 09:22

> We get the SSL_ERROR_SYSCALL from SSL_Read and SSL_Write quite often.

You'll get SSL_ERROR_SYSCALL any time OpenSSL makes a system call (including, 
on Windows, a Winsock call) and gets an error.

> It seems the handshake is done correctly and over a period of time (few hours
> to 2-3 days random) the SSL_Read /SSL_Write fails.  We do not get the
> WSAEWOULDBLOCK error code

What is the underlying error, then? Are you logging the result of 
WSAGetLastError immediately after you get SSL_ERROR_SYSCALL? What about the SSL 
error stack (with ERR_print_errors_fp or similar)?

> nor the OpenSSL's version of SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE
> error.

SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE are not related to WSAEWOULDBLOCK, 
so I'm not sure why you're mentioning them here.

> We get WSAETIMEDOUT on Receive more often and a few times on the Send.

That's typically the case; generally speaking, a timeout is more likely when 
receiving (where you are at the mercy of the peer sending data) than when 
sending (where you simply need the peer to open the receive window and then ACK 
the sent data, both of which are often possible even if the application is not 
behaving, depending on the amount of data and other variables).

> We are not using SO_KEEPALIVE but using application specific heartbeat TO to
> keep the socket alive.

That could certainly cause send or receive timeouts on the socket if the peer 
becomes unresponsive. The same is true of any application-data transmission, of 
course.
 
> Based on blogs and googling we have seen that OpenSSL quite often issues a
> SSL_ERROR_SYSCALL when a Timeout is encountered 

Yes, that's what it should do, if "when a timeout is encountered" means "a 
socket-API function returns an error due to a timeout". SSL_ERROR_SYSCALL means 
exactly that: a system call returned an error.

I suspect one of the following:

- A client application is hanging (or blocking for some other reason), and 
consequently:
  - Not sending data, so the server's not receiving data until it times out, or
  - Not receiving data that the server is sending; that will cause its receive 
window to fill, and eventually the server's send will time out.

- Network issues are transiently preventing data and/or ACK reception by one 
side or the other. That will also eventually lead to timeouts.

-- 
Michael Wojcik


Re: problems with too many ssl_read and ssl_write errors

2021-08-23 Thread Jakob Bohm via openssl-users

For the below symptoms, I would recommend watching the application
port with Wireshark.

This should show any TLS protocol deviations and any problems in
handling and establishing the TCP connections.

On 2021-08-19 00:38, David Bowers via openssl-users wrote:


  * We have a server that has around 2025 clients connected at any
    instant.
  * Our application creates a Server/Listener socket that is then
    converted into a secure socket using the OpenSSL library. This is
    compiled and built in a Windows x64 environment. We also built
    OpenSSL for Windows. The Listener socket is created with a
    default backlog of 500. The Accept socket is a non-blocking socket
    and waits for connections.
  * Every Client makes a regular blocking connection to the Server.
    The Server accepts the connection, after which the Client socket is
    converted to a secure socket using the OpenSSL library.
  * The connections are coming in at a rate of about 10 connections
    per second, though we are not sure about this number.
  * We are able to connect to all the clients in a few minutes and it
    stays like that for some time. There is a constant exchange of
    messages between Server (COS) and clients without issues.
  * The application logic is to keep trying to connect after every
    timeout.
  * After maybe a few hours/days we see the clients dropping
    connections. The logs indicate the SSL_read or SSL_write on the
    Server fails for a client with SSL error number 5
    (SSL_ERROR_SYSCALL) and the equivalent Windows error of
    WSAETIMEDOUT. We then observe the WSAECONNRESET as the Client
    closed connection. We see this behavior for multiple sites.
  * The number of Clients disconnected starts increasing and we see
    the logs in the Client where the server refuses any more
    connections from Clients (10061 - WSAECONNREFUSED). There is
    nothing to indicate this state in the server logs. Our theory is
    the backlog is filled and Server refusing further connections.
  * We are trying to find why we get the SSL_read/SSL_write error as
    it is a blocking socket. We cannot switch to a non-blocking socket
    due to platform and application limitations.

Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded



Re: problems with too many ssl_read and ssl_write errors

2021-08-23 Thread Kamala Ayyar
Hello Michael,

Thank you very much for your detailed response.

We previously had checked the Registry settings for TCPIP Parameters and
have been using the default values. I also ran the PowerShell script for
the ephemeral ports and you are correct: the ports are not being exhausted,
as the same inbound port is used for the clients. We did get CLOSE_WAIT
and TIME_WAIT states once in a while using the netstat command, but most
of the time the connections were ESTABLISHED.

We get the SSL_ERROR_SYSCALL from SSL_read and SSL_write quite often. We
never got this error while using SSL_connect on the Client or SSL_accept
on the server side. It seems the handshake is done correctly and over a
period of time (a few hours to 2-3 days, random) the SSL_read/SSL_write
fails. We do not get the WSAEWOULDBLOCK error code nor OpenSSL's
version of it, the SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE error.
We get WSAETIMEDOUT on Receive more often and a few times on the Send. We
are not using SO_KEEPALIVE but using application specific heartbeat TO to
keep the socket alive.

Thank you again for the response; we now have a direction to check and
will probably tweak any timeouts on the application side. We are mainly
concerned about the SSL_ERROR_SYSCALL we get quite often on
SSL_read/SSL_write, where the Windows error code is WSAETIMEDOUT. Based on
blogs and googling we have seen that OpenSSL quite often issues a
SSL_ERROR_SYSCALL when a timeout is encountered
(https://github.com/openssl/openssl/issues/12416 and similar posts).
When we restart our server application, everything gets reset and
connections get established. We have looked at the Windows Server event
logs, which have not told us much.

Thanks
Kamala

*Kamala  Ayyar*




RE: problems with too many ssl_read and ssl_write errors

2021-08-19 Thread Michael Wojcik
> From: openssl-users  On Behalf Of David 
> Bowers via openssl-users
> Sent: Wednesday, 18 August, 2021 16:38

I don't think this is OpenSSL-related, but at this point it's not clear what 
the issue is.

> * After maybe a few hours/days we see the clients dropping connections. The
> logs indicate the SSL_read or SSL_write on the Server fails for a client with
> SSL error number 5 (SSL_ERROR_SYSCALL) and the equivalent Windows error of
> WSAETIMEDOUT. We then observe the WSAECONNRESET as the Client closed
> connection. We see this behavior for multiple sites.

I assume this is a Server-edition version of Windows and you're not trying to 
support that kind of connection load on a desktop edition.

What's set in the Registry under 
HKLM\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters? In particular I'd be 
suspicious of SynAttackProtect and NetworkThrottlingIndex (which shouldn't be 
set on Server, but you never know).

Many online references will suggest altering settings that affect the 
ephemeral-port space, such as TcpTimedWaitDelay, but those are irrelevant on 
the server side (since the connection tuples will use the server port, not an 
ephemeral port, for the server side).

Many of the settings under the TCPIP/Performance key are undocumented. This 
page describes a number of them:

https://forums.alliedmods.net/showpost.php?s=5fedba9ea66557ccea3bfee9e192aaf4&p=1744400&postcount=1

It also discusses a number of netsh commands for TCP/IP tuning.

> * The number of Clients disconnected starts increasing and we see the logs in
> the Client where the server refuses any more connections from Clients (10061 -
> WSAECONNREFUSED). There is nothing to indicate this state in the server logs.
> Our theory is the backlog is filled and Server refusing further connections.

That's possible. Windows, unlike BSD-based stacks, sends an RST when the listen 
queue is full. (BSD-based stacks simply discard the inbound SYN, which is a 
better choice for a number of reasons. Windows did this wrong and stubbornly 
refuses to change.)

You say you're specifying a backlog of 500 in the call to listen(). Microsoft 
recommends just passing SOMAXCONN and letting the provider set a "suitable" 
value. Worth trying.
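
A minimal sketch of that suggestion, assuming a plain IPv4 listener. The names open_listener, sock_t, BAD_SOCK, and close_sock are illustrative only and not taken from your application. On Windows the process is assumed to have already called WSAStartup.

```c
#include <string.h>
#ifdef _WIN32
#  include <winsock2.h>
#  include <ws2tcpip.h>
typedef SOCKET sock_t;            /* illustrative names throughout */
#  define BAD_SOCK INVALID_SOCKET
#  define close_sock closesocket
#else
#  include <arpa/inet.h>
#  include <netinet/in.h>
#  include <sys/socket.h>
#  include <unistd.h>
typedef int sock_t;
#  define BAD_SOCK (-1)
#  define close_sock close
#endif

/* Hypothetical helper: opens a TCP listener on the given IPv4 port
 * (host byte order; 0 lets the stack pick a free port). The backlog
 * is SOMAXCONN, so the provider chooses the queue length rather than
 * the application hard-coding a figure such as 500.
 * Returns BAD_SOCK on failure. */
sock_t open_listener(unsigned short port)
{
    sock_t s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (s == BAD_SOCK)
        return BAD_SOCK;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(s, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(s, SOMAXCONN) != 0) {
        close_sock(s);
        return BAD_SOCK;
    }
    return s;
}
```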

But this appears to be a secondary issue. The primary one seems to be that for 
whatever reason you get an increasing number of conversation failures, and then 
the client's aggressive retry behavior means you get a cascade of connection 
flooding until the listen queues are full. The clients ought to be changed to 
use random backoff or another strategy that avoids flooding the server, but at 
this point that seems to be addressing a symptom rather than the underlying 
problem.
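
To illustrate the random-backoff idea for the clients, here is one possible shape: capped exponential backoff with the delay drawn at random below the ceiling. This is a sketch, not a recommendation of specific values. Both function names are invented for the example, and rand() stands in for whatever random source the client already has.

```c
#include <stdint.h>
#include <stdlib.h>

/* Upper bound, in milliseconds, on the wait before retry number
 * `attempt` (0-based): base_ms doubled once per attempt, capped at
 * cap_ms. */
uint32_t backoff_ceiling_ms(unsigned attempt, uint32_t base_ms,
                            uint32_t cap_ms)
{
    uint64_t ceiling = base_ms;
    while (attempt-- > 0 && ceiling < cap_ms)
        ceiling *= 2;             /* stays below 2^33, so no overflow */
    return (uint32_t)(ceiling < cap_ms ? ceiling : cap_ms);
}

/* Picks the actual delay somewhere in [0, ceiling], so that clients
 * which failed at the same moment do not all reconnect at the same
 * moment. rand() is a placeholder; seed it per client. */
uint32_t backoff_delay_ms(unsigned attempt, uint32_t base_ms,
                          uint32_t cap_ms)
{
    uint32_t ceiling = backoff_ceiling_ms(attempt, base_ms, cap_ms);
    return (uint32_t)((uint64_t)rand() % ((uint64_t)ceiling + 1));
}
```

A client would sleep for backoff_delay_ms(n, base, cap) before its nth reconnect attempt and reset n to zero after a successful connection. With a base of 1000 and a cap of 60000, for example, the ceiling grows 1 s, 2 s, 4 s, and so on up to 60 s.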

> * We are trying to find why we get the SSL_read/SSL_write error as it is a
> blocking socket. We cannot switch to a non-blocking socket due to platform and
> application limitations.

You said you're specifically getting SSL_ERROR_SYSCALL from SSL_read and 
SSL_write. That has nothing to do with whether the socket is in blocking mode 
-- system calls on blocking sockets can certainly return errors. I don't 
understand this question.

There are any number of reasons why the server's ability to handle this load 
might be compromised: network congestion, bufferbloat, load on the CPU or NIC 
(particularly if TCP offload is enabled to the NIC), contention for DMA, other 
application I/O, and so on. Years ago, I had one customer who had similar 
problems which turned out to be due to intermittent failures in a bad DRAM 
module in the server. Distributed computing is inherently fragile.

But in my experience, this sort of problem is most often due to one or more of:

- Application-logic errors or design issues. Are you multiplexing all these 
blocking sockets, or running a thread per conversation, or something else?

- Middlebox problems. Routers, load balancers, firewall appliances, and so 
forth frequently cause issues.

- Application firewalls and other "anti-malware" software (much of which is 
rubbish) running on the server.

WSAETIMEDOUT on a send operation, assuming OpenSSL didn't need to do a receive 
under the covers for TLS-protocol reasons, could mean that a client app isn't 
doing its receives and consequently its receive window has filled; or it could 
mean that something is interfering with the delivery of network traffic in one 
direction or the other.

WSAETIMEDOUT on a receive, though, again assuming OpenSSL didn't need to send 
under the covers, implies that something set a receive timeout on the socket, 
or that a keepalive wasn't responded to in the required time. Are you setting a 
receive timeout (typically with SO_RCVTIMEO)? Are you setting SO_KEEPALIVE? 
What about SIO_KEEPALIVE_VALS? If you're not setting SIO_KEEPALIVE_VALS, what 
are KeepAliveTime and KeepAliveInterval set to in the Registry? (See the MSDN 
docs for SO_KEEPALIVE.)
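
If it isn't obvious from the application source which of these options end up in effect, they can be read back from a live socket. The following is a rough sketch of that check. The sock_diag_t struct and read_sock_diag function are invented names, and only the receive timeout and the keepalive flag are covered.

```c
#ifdef _WIN32
#  include <winsock2.h>
typedef SOCKET sock_t;            /* illustrative typedef */
#else
#  include <sys/socket.h>
#  include <sys/time.h>
typedef int sock_t;               /* illustrative typedef */
#endif

/* Hypothetical record of the two settings asked about above. */
typedef struct {
    int  keepalive_on;            /* nonzero if SO_KEEPALIVE is enabled */
    long rcv_timeout_ms;          /* SO_RCVTIMEO; 0 means no timeout */
} sock_diag_t;

/* Reads the options back from the stack. Returns 0 on success. */
int read_sock_diag(sock_t s, sock_diag_t *out)
{
#ifdef _WIN32
    BOOL ka = FALSE;
    int ka_len = (int)sizeof ka;
    DWORD to = 0;                 /* Winsock reports milliseconds */
    int to_len = (int)sizeof to;
    if (getsockopt(s, SOL_SOCKET, SO_KEEPALIVE, (char *)&ka, &ka_len) != 0)
        return -1;
    if (getsockopt(s, SOL_SOCKET, SO_RCVTIMEO, (char *)&to, &to_len) != 0)
        return -1;
    out->keepalive_on = (ka != 0);
    out->rcv_timeout_ms = (long)to;
#else
    int ka = 0;
    socklen_t ka_len = sizeof ka;
    struct timeval tv;            /* POSIX reports a struct timeval */
    socklen_t tv_len = sizeof tv;
    if (getsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &ka, &ka_len) != 0)
        return -1;
    if (getsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, &tv_len) != 0)
        return -1;
    out->keepalive_on = (ka != 0);
    out->rcv_timeout_ms = (long)tv.tv_sec * 1000L + (long)(tv.tv_usec / 1000);
#endif
    return 0;
}
```

Logging these two values for a connection at accept time, and again when it fails, would show whether a receive timeout or keepalive was ever configured on that socket.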

Has the system administrator analyzed the Windows event logs and the network 
statisti

problems with too many ssl_read and ssl_write errors

2021-08-18 Thread David Bowers via openssl-users
  *   We have a server that has around 2025 clients connected at any instant.
  *   Our application creates a Server/Listener socket that is then converted 
into a secure socket using the OpenSSL library. This is compiled and built in a 
Windows x64 environment. We also built OpenSSL for Windows. The Listener socket 
is created with a default backlog of 500. The Accept socket is a non-blocking 
socket and waits for connections.
  *   Every Client makes a regular blocking connection to the Server. The 
Server accepts the connection, after which the Client socket is converted to a 
secure socket using the OpenSSL library.
  *   The connections are coming in at a rate of about 10 connections per 
second, though we are not sure about this number.
  *   We are able to connect to all the clients in a few minutes and it stays 
like that for some time. There is a constant exchange of messages between 
Server (COS) and clients without issues.
  *   The application logic is to keep trying to connect after every timeout.
  *   After maybe a few hours/days we see the clients dropping connections. 
The logs indicate the SSL_read or SSL_write on the Server fails for a client 
with SSL error number 5 (SSL_ERROR_SYSCALL) and the equivalent Windows error of 
WSAETIMEDOUT. We then observe the WSAECONNRESET as the Client closed 
connection. We see this behavior for multiple sites.
  *   The number of Clients disconnected starts increasing and we see the logs 
in the Client where the server refuses any more connections from Clients 
(10061 - WSAECONNREFUSED). There is nothing to indicate this state in the 
server logs. Our theory is the backlog is filled and Server refusing further 
connections.
  *   We are trying to find why we get the SSL_read/SSL_write error as it is a 
blocking socket. We cannot switch to a non-blocking socket due to platform and 
application limitations.