Re: Stale connection reuse in the async client

2025-08-07 Thread Ryan Schmitt
I learned something very important, which explains why I'm seeing a lot of
reports of TCP resets specifically:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

> For each TCP request that a client makes through a Network Load Balancer,
> the state of that connection is tracked. If no data is sent through the
> connection by either the client or target for longer than the idle timeout,
> the connection is no longer tracked. If a client or target sends data after
> the idle timeout period elapses, the client receives a TCP RST packet to
> indicate that the connection is no longer valid.
>
> The default idle timeout value for TCP flows is 350 seconds, but can be
> updated to any value between 60-6000 seconds. Clients or targets can use
> TCP keepalive packets to restart the idle timeout. Keepalive packets sent
> to maintain TLS connections can't contain data or payload.


This is nasty: the client can't possibly know that it has a stale
connection until it sends a request, and then the error it gets
("Connection reset by peer") is both highly generic and (unlike
RequestNotExecutedException) not
obviously safe to retry on. The new TCP Keep-Alive options should eliminate
this failure mode, and as I write this I'm publishing a change to enable a
five-second keep-alive interval on all clients. This will also reduce the
occurrence of the (hypothesized) Lambda-specific race condition, since
there's no connection closure race if the connections don't get closed in
the first place.
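For reference, this is roughly what such a configuration looks like on the async client. This is a sketch only: the TCP keep-alive builder methods shown here assume a recent HttpCore 5.x (check your version's `IOReactorConfig.Builder`), and the values are illustrative rather than the exact ones in my change:

```java
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.core5.reactor.IOReactorConfig;

public class KeepAliveClientConfig {
    public static void main(String[] args) {
        // Kernel-level keep-alive: the probes are sent by the OS, so they
        // refresh the NLB's idle-timeout tracking even while the JVM is
        // suspended between Lambda invocations.
        IOReactorConfig ioReactorConfig = IOReactorConfig.custom()
                .setSoKeepAlive(true)   // SO_KEEPALIVE on all client sockets
                .setTcpKeepIdle(5)      // seconds of idle before the first probe
                .setTcpKeepInterval(5)  // seconds between probes
                .setTcpKeepCount(3)     // unanswered probes before the kernel drops the connection
                .build();

        CloseableHttpAsyncClient client = HttpAsyncClients.custom()
                .setIOReactorConfig(ioReactorConfig)
                .build();
        client.start();
    }
}
```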

Thanks for the PR, I'll test my reproducer against it. An issue I noticed
is that it's apparently not possible to read bytes _and_ endOfStream in a
single read operation, which for us means that we can't discover that the
connection has been closed until the next event loop iteration (and by then
the connection might have been leased out again). Would it be safe to
perform a second read that returns either 0 bytes or -1 (endOfStream)?
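For comparison, plain JDK blocking sockets show the same split: when the payload and the FIN arrive together, the first read surfaces only the bytes, and it takes a follow-up read to observe end-of-stream. A loopback sketch (class and helper names are mine):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class TwoReadDemo {
    // Returns {bytesFromFirstRead, resultOfSecondRead}. The peer writes two
    // bytes and closes, so both the data and the FIN are pending before we
    // start reading; only the second read reports end-of-stream (-1).
    static int[] readTwice() throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket peer = server.accept()) {
            OutputStream out = peer.getOutputStream();
            out.write(new byte[] {'h', 'i'});
            out.flush();
            peer.close(); // data and FIN are now both in flight
            Thread.sleep(100); // let both arrive before we start reading
            InputStream in = client.getInputStream();
            byte[] buf = new byte[16];
            int first = in.read(buf);  // returns the buffered payload
            int second = in.read(buf); // returns -1: end-of-stream
            return new int[] { first, second };
        }
    }

    public static void main(String[] args) throws Exception {
        int[] r = readTwice();
        System.out.println("first read: " + r[0] + ", second read: " + r[1]);
    }
}
```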

On Thu, Aug 7, 2025 at 9:20 AM Oleg Kalnichevski  wrote:

> >> What I'd like to know is:
> >>
> >> 1. Can we do anything to improve this race condition?
>
> Please try this change-set:
>
> https://github.com/apache/httpcomponents-core/pull/543
>
> It should reduce the window of this race condition somewhat.
>
> Oleg
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


Re: Stale connection reuse in the async client

2025-08-07 Thread Oleg Kalnichevski





> What I'd like to know is:
>
> 1. Can we do anything to improve this race condition?
Please try this change-set:

https://github.com/apache/httpcomponents-core/pull/543

It should reduce the window of this race condition somewhat.

Oleg




Re: Stale connection reuse in the async client

2025-08-06 Thread Ryan Schmitt
Incidentally, I just spotted this commit:

https://github.com/apache/httpcomponents-client/commit/99d4a5e081f31616a2558e72824ed8cf52198596

I didn't even know that Java had these socket options. This will be a big
help, especially for Lambda: since TCP keep-alive is implemented by the
kernel, it will continue to operate even while the JVM is suspended. This
is why the public documentation advises the use of keepalive:

https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
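For anyone else who missed them, the JDK-level knobs look roughly like this. A sketch: `TCP_KEEPIDLE`/`TCP_KEEPINTERVAL`/`TCP_KEEPCOUNT` live in `jdk.net.ExtendedSocketOptions` (JDK 11+) and are platform-dependent, hence the `supportedOptions()` guard; class and method names are mine:

```java
import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveOptions {
    // Kernel-level keep-alive: once enabled, the OS sends the probes, so they
    // keep flowing even while the JVM is suspended between invocations.
    static void enableKeepAlive(Socket socket) throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        // The tuning knobs are platform-dependent, hence the guard.
        if (socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 5);     // idle seconds before the first probe
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 5); // seconds between probes
            socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 3);    // unanswered probes before teardown
        }
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            enableKeepAlive(socket);
            System.out.println("SO_KEEPALIVE=" + socket.getOption(StandardSocketOptions.SO_KEEPALIVE));
        }
    }
}
```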

On Tue, Aug 5, 2025 at 7:29 PM Ryan Schmitt  wrote:

> Since rolling out 5.5 (which includes my change for
> `validateAfterInactivity`, which was previously interpreted as a connection
> TTL for async HTTP 1.1 connections), I've been getting reports of
> intermittent request failures which appear to be caused by stale connection
> reuse. The async client is actually much more vulnerable to this issue than
> we realized, and I'm trying to understand why.

Stale connection reuse in the async client

2025-08-05 Thread Ryan Schmitt
Since rolling out 5.5 (which includes my change for
`validateAfterInactivity`, which was previously interpreted as a connection
TTL for async HTTP 1.1 connections), I've been getting reports of
intermittent request failures which appear to be caused by stale connection
reuse. The async client is actually much more vulnerable to this issue than
we realized, and I'm trying to understand why.

First, let me state my understanding of how server-initiated connection
closure works. A FIN from the server wakes up the selector, which fires an
input event that is handled by the code in AbstractHttp1StreamDuplexer,
which "reads" the end-of-stream. This then initiates graceful shutdown of
the connection. I believe this works by the client closing its end of the
connection, which is an output event (the client has to send FIN, as well
as close_notify if TLS). Once this is handled, the duplexer transitions the
connection state to SHUTDOWN and closes the InternalDataChannel, which (for
a graceful shutdown) adds itself to the closedSessions queue. The IOReactor
processes this queue in processClosedSessions(), which will fail any
requests on the connection and remove it from the pool.
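This is consistent with how the JDK surfaces a peer close: the FIN is delivered as a readiness-to-read event, and the subsequent read() returns -1 for end-of-stream. A minimal loopback sketch (names are mine, not HttpCore internals):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class FinWakesSelector {
    // Connects to a loopback "server" that closes immediately, then shows the
    // FIN arriving as a read event whose read() returns -1 (end-of-stream).
    static int readAfterPeerClose() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 Selector selector = Selector.open()) {
                client.configureBlocking(false);
                client.register(selector, SelectionKey.OP_READ);
                try (SocketChannel accepted = server.accept()) {
                    // closing the accepted side sends a FIN to the client
                }
                selector.select(5000); // the FIN fires an input event
                return client.read(ByteBuffer.allocate(64)); // -1: end-of-stream
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read returned " + readAfterPeerClose());
    }
}
```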

The basic issue seems to be that since connection reuse does not go through
the IOReactor, there's a race condition between the IOReactor's event loop
(which drives all the processing and bookkeeping for remote connection
closure) and request execution (which can draw a doomed connection from the
connection pool). I wrote a test that sends requests to a server, which
sends a response and then immediately closes the connection (without any
`Connection: close` header). Part of the idea here is to simulate an async
client running in AWS Lambda, where the JVM is suspended between
invocations: if invocations are intermittent, connections will be remotely
closed while the JVM is suspended, and then during the next invocation the
IOReactor will race with the client request in just this way.
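The test server can be sketched in a few lines of plain JDK code: it sends a complete response with no `Connection: close` header, then closes the socket immediately (class and method names are hypothetical):

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class AbruptCloseServer {
    // Serves exactly one connection: writes a complete HTTP/1.1 response with
    // no "Connection: close" header, then closes the socket right away.
    static void serveOnce(ServerSocket server) throws Exception {
        try (Socket socket = server.accept()) {
            // Read (and ignore) the request so the close below is a clean FIN
            // rather than an RST caused by discarding unread data.
            socket.getInputStream().read(new byte[1024]);
            String body = "ok";
            String response = "HTTP/1.1 200 OK\r\n"
                    + "Content-Length: " + body.length() + "\r\n"
                    + "\r\n" + body;
            socket.getOutputStream().write(response.getBytes(StandardCharsets.US_ASCII));
        } // closed here: FIN, with no header to warn the client
    }

    // Plays the client: sends a request, drains the response, and returns the
    // result of the next read on the now-closed connection.
    static int probe() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread t = new Thread(() -> {
                try {
                    serveOnce(server);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            t.start();
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
                client.getOutputStream().write(
                        "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
                InputStream in = client.getInputStream();
                byte[] buf = new byte[1024];
                while (in.read(buf) > 0) {
                    // drain the response until end-of-stream
                }
                return in.read(buf); // reads at end-of-stream keep returning -1
            } finally {
                t.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read after response: " + probe());
    }
}
```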

What I found with this test setup is that sending requests one at a time
mostly works, but sometimes results in an exception (typically
RequestNotExecutedException). Setting a breakpoint on
PoolEntry::discardConnection, I can see that the client usually detects the
stale connection (the closure has been processed by the Duplexer):

    at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
    at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739)
    at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208)
    at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157)
    at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153)
    at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128)
    at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120)
    at org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148)
    at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336)


But sometimes, the connection closure is not processed in time, in which
case I see this stack trace, where the IOReactor fails the request and
discards the conn pool entry:

    at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
    at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724)
    at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148)
    at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180)
    at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363)
    at org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261)
    at org.apache.hc.client5.http.impl.async.ContentCompressionAsyncExec$1.failed(ContentCompressionAsyncExec.java:160)
    at org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:203)
    at org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:297)
    at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:135)
    at org.apache.hc.core5.http.nio.command.RequestExecutionCommand.failed(RequestExecutionCommand.java:101)
    at org.apache.hc.core5.http.nio.command.CommandSupport.cancelCommands(CommandSupport.java:68)