Re: Stale connection reuse in the async client
I learned something very important, which explains why I'm seeing a lot of
reports of TCP resets specifically:
https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
> For each TCP request that a client makes through a Network Load Balancer,
> the state of that connection is tracked. If no data is sent through the
> connection by either the client or target for longer than the idle timeout,
> the connection is no longer tracked. If a client or target sends data after
> the idle timeout period elapses, the client receives a TCP RST packet to
> indicate that the connection is no longer valid.
>
> The default idle timeout value for TCP flows is 350 seconds, but can be
> updated to any value between 60-6000 seconds. Clients or targets can use
> TCP keepalive packets to restart the idle timeout. Keepalive packets sent
> to maintain TLS connections can't contain data or payload.
This is nasty: the client can't possibly know that it has a stale
connection until it sends a request, and then the error it gets
("Connection reset by peer") is both highly generic and (unlike
RequestNotExecutedException) not obviously safe to retry. The new TCP
Keep-Alive options should eliminate
this failure mode, and as I write this I'm publishing a change to enable a
five-second keep-alive interval on all clients. This will also reduce the
occurrence of the (hypothesized) Lambda-specific race condition, since
there's no connection closure race if the connections don't get closed in
the first place.
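
For anyone who wants to see what these options look like at the socket level, here is a minimal sketch using only the JDK (`jdk.net.ExtendedSocketOptions`, Java 11+). The class name `KeepAliveSketch`, the helper `enableKeepAlive`, and the idle/interval/probe values in `main` are my own illustrative assumptions; this is not the HttpClient configuration API and not the exact change being published.

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveSketch {

    /**
     * Turns on SO_KEEPALIVE and, if the platform supports them, the per-socket
     * keep-alive timers. Returns true only if all three timers were applied.
     * (Hypothetical helper for illustration.)
     */
    static boolean enableKeepAlive(SocketChannel channel, int idleSeconds,
                                   int intervalSeconds, int probeCount) throws IOException {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        // The extended options are platform-specific, so check before setting them.
        if (!channel.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !channel.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !channel.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false;
        }
        // Seconds of idleness before the kernel sends the first probe.
        channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        // Seconds between subsequent probes.
        channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        // Unanswered probes before the kernel declares the connection dead.
        channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            // Assumed values: 5 s idle, 5 s interval, 3 probes.
            boolean tuned = enableKeepAlive(channel, 5, 5, 3);
            System.out.println("keep-alive timers applied: " + tuned);
        }
    }
}
```

Any idle time well under the load balancer's idle timeout (350 seconds by default, per the quote above) should keep the flow tracked; the small values here are only for illustration.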
Thanks for the PR, I'll test my reproducer against it. An issue I noticed
is that it's apparently not possible to read bytes _and_ endOfStream in a
single read operation, which for us means that we can't discover that the
connection has been closed until the next event loop iteration (and by then
the connection might have been leased out again). Would it be safe to
perform a second read that returns either 0 bytes or -1 (endOfStream)?
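
To make the question concrete, here is a rough sketch of the "second read" idea against a plain JDK `ReadableByteChannel`. The names `EagerEofSketch`, `ReadResult`, and `readAndProbe` are hypothetical, and the sketch assumes a non-blocking channel; whether the same approach is safe inside the duplexer is exactly what I'm asking.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class EagerEofSketch {

    /** Outcome of one read pass: bytes delivered, and whether end-of-stream was seen. */
    static final class ReadResult {
        final int bytesRead;
        final boolean endOfStream;

        ReadResult(int bytesRead, boolean endOfStream) {
            this.bytesRead = bytesRead;
            this.endOfStream = endOfStream;
        }
    }

    /**
     * Reads once; if that read produced data and the buffer still has room,
     * reads a second time so that a pending end-of-stream is noticed in the
     * same pass. Assumes the channel is non-blocking: on a blocking channel
     * the second read would stall instead of returning 0.
     */
    static ReadResult readAndProbe(ReadableByteChannel channel, ByteBuffer buf) throws IOException {
        int first = channel.read(buf);
        if (first < 0) {
            return new ReadResult(0, true);
        }
        // With a full buffer, read() returns 0 and cannot reveal end-of-stream.
        if (first == 0 || !buf.hasRemaining()) {
            return new ReadResult(first, false);
        }
        int second = channel.read(buf);
        if (second < 0) {
            return new ReadResult(first, true);
        }
        return new ReadResult(first + second, false);
    }
}
```

The probe costs one extra system call per read pass, and it only helps when the FIN has already arrived by the time the data is read.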
On Thu, Aug 7, 2025 at 9:20 AM Oleg Kalnichevski wrote:
>
> >>
> >> What I'd like to know is:
> >>
> >> 1. Can we do anything to improve this race condition?
> >
>
> Please try this change-set:
>
> https://github.com/apache/httpcomponents-core/pull/543
>
> It should reduce the window of this race condition somewhat.
>
> Oleg
Re: Stale connection reuse in the async client
> What I'd like to know is:
>
> 1. Can we do anything to improve this race condition?

Please try this change-set:

https://github.com/apache/httpcomponents-core/pull/543

It should reduce the window of this race condition somewhat.

Oleg
Re: Stale connection reuse in the async client
On 2025-08-06 04:29, Ryan Schmitt wrote:
> Since rolling out 5.5 (which includes my change for
> `validateAfterInactivity`, which was previously interpreted as a connection
> TTL for async HTTP 1.1 connections), I've been getting reports of
> intermittent request failures which appear to be caused by stale connection
> reuse. The async client is actually much more vulnerable to this issue than
> we realized, and I'm trying to understand why.
>
> First, let me state my understanding of how server-initiated connection
> closure works. A FIN from the server wakes up the selector, which fires an
> input event that is handled by the code in AbstractHttp1StreamDuplexer,
> which "reads" the end-of-stream. This then initiates graceful shutdown of
> the connection. I believe this works by the client closing its end of the
> connection, which is an output event (the client has to send FIN, as well
> as close_notify if TLS). Once this is handled, the duplexer transitions the
> connection state to SHUTDOWN and closes the InternalDataChannel, which (for
> a graceful shutdown) adds itself to the closedSessions queue. The IOReactor
> processes this queue in processClosedSessions(), which will fail any
> requests on the connection and remove it from the pool.
>
> The basic issue seems to be that since connection reuse does not go
> through the IOReactor, there's a race condition between the IOReactor's
> event loop (which drives all the processing and bookkeeping for remote
> connection closure) and request execution (which can draw a doomed
> connection from the connection pool). I wrote a test that sends requests to
> a server, which sends a response and then immediately closes the connection
> (without any `Connection: close` header). Part of the idea here is to
> simulate an async client running in AWS Lambda, where the JVM is suspended
> between invocations: if invocations are intermittent, connections will be
> remotely closed while the JVM is suspended, and then during the next
> invocation the IOReactor will race with the client request in just this way.
>
> What I found with this test setup is that sending requests one at a time
> mostly works, but sometimes results in an exception (typically
> RequestNotExecutedException). Setting a breakpoint on
> PoolEntry::discardConnection, I can see that the client usually detects the
> stale connection (the closure has been processed by the Duplexer):
>
> at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
> at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739)
> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208)
> at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157)
> at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153)
> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128)
> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120)
> at org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148)
> at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336)
>
> But sometimes, the connection closure is not processed in time, in which
> case I see this stack trace, where the IOReactor fails the request and
> discards the conn pool entry:
>
> at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
> at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724)
> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148)
> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180)
> at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363)
> at org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261)
> at org.apache.hc.client5.http.impl.async.ContentCompressionAsyncExec$1.failed(ContentCompressionAsyncExec.java:160)
> at org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:203)
> at org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:297)
> at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:135)
> at org.apache.hc.core5.http.nio.command.RequestExecutionCommand.failed(RequestExecutionCommand.java:101)
> at org.apache.hc.core5.http.nio.command.CommandSupport.cancelCommands(CommandSupport.java:68)
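
For readers who want to reproduce the scenario, the test server described in the quoted message (respond, then close immediately, with no `Connection: close` header) might look roughly like the sketch below. It uses only JDK sockets; the class name `AbruptCloseServer` and all details are assumptions for illustration, not the actual reproducer used in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class AbruptCloseServer implements AutoCloseable {

    private final ServerSocket serverSocket;

    public AbruptCloseServer() throws IOException {
        // Port 0 lets the OS pick a free port on the loopback interface.
        serverSocket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread acceptor = new Thread(this::acceptLoop, "abrupt-close-server");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    public int port() {
        return serverSocket.getLocalPort();
    }

    private void acceptLoop() {
        while (!serverSocket.isClosed()) {
            try (Socket socket = serverSocket.accept()) {
                readRequestHead(socket.getInputStream());
                byte[] body = "ok".getBytes(StandardCharsets.US_ASCII);
                // Deliberately no "Connection: close" header: the client is
                // given no hint that the connection is about to go away.
                String head = "HTTP/1.1 200 OK\r\n"
                        + "Content-Type: text/plain\r\n"
                        + "Content-Length: " + body.length + "\r\n\r\n";
                OutputStream out = socket.getOutputStream();
                out.write(head.getBytes(StandardCharsets.US_ASCII));
                out.write(body);
                out.flush();
                // try-with-resources closes the socket right after the response.
            } catch (IOException e) {
                // accept() fails once the server socket is closed; other
                // per-connection errors are ignored in this sketch.
            }
        }
    }

    /** Consumes the request up to the blank line; assumes requests carry no body. */
    private static void readRequestHead(InputStream in) throws IOException {
        int matched = 0; // how many bytes of "\r\n\r\n" have been seen in a row
        int b;
        while (matched < 4 && (b = in.read()) != -1) {
            if (b == (matched % 2 == 0 ? '\r' : '\n')) {
                matched++;
            } else {
                matched = (b == '\r') ? 1 : 0;
            }
        }
    }

    @Override
    public void close() throws IOException {
        serverSocket.close();
    }
}
```

Pointing the async client at `http://localhost:<port>/` and sending requests one at a time should exercise the same window: each pooled connection is already half-closed by the server by the time it can be leased again.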
Re: Stale connection reuse in the async client
Incidentally, I just spotted this commit:

https://github.com/apache/httpcomponents-client/commit/99d4a5e081f31616a2558e72824ed8cf52198596

I didn't even know that Java had these socket options. This will be a big
help, especially for Lambda: since TCP keep-alive is implemented by the
kernel, it will continue to operate even while the JVM is suspended. This is
why the public documentation advises the use of keepalive:

https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

On Tue, Aug 5, 2025 at 7:29 PM Ryan Schmitt wrote:
> Since rolling out 5.5 (which includes my change for
> `validateAfterInactivity`, which was previously interpreted as a connection
> TTL for async HTTP 1.1 connections), I've been getting reports of
> intermittent request failures which appear to be caused by stale connection
> reuse. The async client is actually much more vulnerable to this issue than
> we realized, and I'm trying to understand why.
>
> First, let me state my understanding of how server-initiated connection
> closure works. A FIN from the server wakes up the selector, which fires an
> input event that is handled by the code in AbstractHttp1StreamDuplexer,
> which "reads" the end-of-stream. This then initiates graceful shutdown of
> the connection. I believe this works by the client closing its end of the
> connection, which is an output event (the client has to send FIN, as well
> as close_notify if TLS). Once this is handled, the duplexer transitions the
> connection state to SHUTDOWN and closes the InternalDataChannel, which (for
> a graceful shutdown) adds itself to the closedSessions queue. The IOReactor
> processes this queue in processClosedSessions(), which will fail any
> requests on the connection and remove it from the pool.
>
> The basic issue seems to be that since connection reuse does not go
> through the IOReactor, there's a race condition between the IOReactor's
> event loop (which drives all the processing and bookkeeping for remote
> connection closure) and request execution (which can draw a doomed
> connection from the connection pool). I wrote a test that sends requests to
> a server, which sends a response and then immediately closes the connection
> (without any `Connection: close` header). Part of the idea here is to
> simulate an async client running in AWS Lambda, where the JVM is suspended
> between invocations: if invocations are intermittent, connections will be
> remotely closed while the JVM is suspended, and then during the next
> invocation the IOReactor will race with the client request in just this way.
>
> What I found with this test setup is that sending requests one at a time
> mostly works, but sometimes results in an exception (typically
> RequestNotExecutedException). Setting a breakpoint on
> PoolEntry::discardConnection, I can see that the client usually detects the
> stale connection (the closure has been processed by the Duplexer):
>
>> at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
>> at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739)
>> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208)
>> at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157)
>> at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153)
>> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128)
>> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120)
>> at org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148)
>> at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336)
>
> But sometimes, the connection closure is not processed in time, in which
> case I see this stack trace, where the IOReactor fails the request and
> discards the conn pool entry:
>
>> at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
>> at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724)
>> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148)
>> at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180)
>> at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363)
>> at org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261)
>> at or
