[ https://issues.apache.org/jira/browse/JAMES-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Tellier updated JAMES-3676:
----------------------------------
    Description: 
h3. Symptoms

Recently I found these delightful logs on one of my production environments:

{code:java}
        
{
  "timestamp": "2021-11-25T04:19:56.897Z",
  "level": "ERROR",
  "thread": "imapserver-executor-560",
  "mdc": {
    "fetchedData": "FetchData{flags=false, uid=true, internalDate=false, 
size=true, envelope=false, body=false, bodyStructure=false, setSeen=false, 
bodyElements=[BodyFetchElement{firstOctet=null, numberOfOctets=null, 
name=BODY[], sectionType=CONTENT, fieldNames=Optional.empty}], modSeq=false, 
changedSince=-1, vanished=false}",
    "protocol": "IMAP",
    "useUid": "true",
    "action": "FETCH",
    "sessionId": "SID-cdkbuyectzcc",
    "user": "x...@linagora.com"
  },
  "logger": "org.apache.james.imap.processor.AbstractMailboxProcessor",
  "message": "Unexpected error during IMAP processing",
  "context": "default",
  "exception": "software.amazon.awssdk.core.exception.SdkClientException: 
Unable to execute HTTP request: Acquire operation took longer than the 
configured maximum time. This indicates that a request cannot get a connection 
from the pool within the specified maximum time. This can be due to high 
request rate.\nConsider taking any of the following actions to mitigate the 
issue: increase max connections, increase acquire timeout, or slowing the 
request rate.\nIncreasing the max connections can increase client throughput 
(unless the network interface is already fully utilized), but can eventually 
start to hit operation system limitations on the number of file descriptors 
used by the process. If you already are fully utilizing your network interface 
or cannot further increase your connection count, increasing the acquire 
timeout gives extra time for requests to acquire a connection before timing 
out. If the connections doesn't free up, the subsequent requests will still 
timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing 
out your requests so that large traffic bursts cannot overload the client, 
being more efficient with the number of times you need to call AWS, or by 
increasing the number of hosts sending requests.\n\tat 
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)\n\tat
 
software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown 
Source)\n\tat 
software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown 
Source)\n\tat 
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:104)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown 
Source)\n\tat 
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onError(MakeAsyncHttpRequestStage.java:158)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:300)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:172)\n\tat
 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\tat
 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)\n\tat
 
io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)\n\tat
 io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)\n\tat 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)\n\tat
 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\tat
 io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)\n\tat 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat
 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat 
java.base/java.lang.Thread.run(Unknown Source)\nCaused by: java.lang.Throwable: 
Acquire operation took longer than the configured maximum time. This indicates 
that a request cannot get a connection from the pool within the specified 
maximum time. This can be due to high request rate.\nConsider taking any of the 
following actions to mitigate the issue: increase max connections, increase 
acquire timeout, or slowing the request rate.\nIncreasing the max connections 
can increase client throughput (unless the network interface is already fully 
utilized), but can eventually start to hit operation system limitations on the 
number of file descriptors used by the process. If you already are fully 
utilizing your network interface or cannot further increase your connection 
count, increasing the acquire timeout gives extra time for requests to acquire 
a connection before timing out. If the connections doesn't free up, the 
subsequent requests will still timeout.\nIf the above mechanisms are not able 
to fix the issue, try smoothing out your requests so that large traffic bursts 
cannot overload the client, being more efficient with the number of times you 
need to call AWS, or by increasing the number of hosts sending requests.\n\tat 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.decorateException(NettyRequestExecutor.java:306)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:299)\n\t...
 11 common frames omitted\nCaused by: java.util.concurrent.TimeoutException: 
Acquire operation took longer than 10000 milliseconds.\n\tat 
software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)\n\tat
 
org.glowroot.agent.plugin.executor.RunnableWrapper.run(RunnableWrapper.java:35)\n\tat
 io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)\n\t...
 6 common frames omitted\n"
}
{code}

TL;DR: the S3 driver cannot get an HTTP connection from the pool and fails randomly, which is very inconvenient.

I attempted a reboot, which temporarily solved the issue, but we eventually ran out of HTTP connections again after some time...

h3. Diagnostic

This sounds like a leak to me!

I quickly wrote a unit test reproducing the exact same issue by partially reading many blobs. Closing the resulting streams or not proved to have no effect.
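
For illustration, the reproduction boils down to something like the sketch below. The BlobStore wiring and the exact read signature are written from memory and may differ from the actual James test code; the setup of the S3-backed store and the saved test blob is elided.

{code:java}
import java.io.InputStream;

import org.apache.james.blob.api.BlobId;
import org.apache.james.blob.api.BlobStore;
import org.apache.james.blob.api.BucketName;
import org.junit.jupiter.api.Test;

class PartialReadLeakTest {
    // Wiring of the S3-backed BlobStore and of the saved test blob is elided.
    private BlobStore blobStore;
    private BucketName bucket;
    private BlobId blobId;

    @Test
    void partialReadsShouldNotExhaustTheS3ConnectionPool() throws Exception {
        // With a connection pool of, say, 50 connections, this loop ends up
        // failing with "Acquire operation took longer than..." once the
        // leaked connections exhaust the pool.
        for (int i = 0; i < 1000; i++) {
            try (InputStream stream = blobStore.read(bucket, blobId)) {
                stream.read(); // partial read: never consume the blob fully
            }
        }
    }
}
{code}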

This pointed me to a defect in ReactorUtils.toInputStream, which was not cancelling the related publisher when closing the stream: only full consumption was releasing the subscription! Once again, an easy unit test properly demonstrates the issue.
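
A minimal sketch of such a test, assuming ReactorUtils.toInputStream accepts a Flux<ByteBuffer> (the actual James signature and package may differ slightly):

{code:java}
import static org.assertj.core.api.Assertions.assertThat;

import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.james.util.ReactorUtils; // assumed location of the helper
import org.junit.jupiter.api.Test;

import reactor.core.publisher.Flux;

class ToInputStreamCancellationTest {
    @Test
    void closeShouldCancelTheUnderlyingPublisher() throws Exception {
        AtomicBoolean cancelled = new AtomicBoolean(false);

        // An endless source: full consumption never happens, so only
        // cancellation can release the subscription (and thus the connection).
        Flux<ByteBuffer> source = Flux.<ByteBuffer>generate(sink -> sink.next(ByteBuffer.wrap(new byte[] {0x42})))
            .doOnCancel(() -> cancelled.set(true));

        try (InputStream stream = ReactorUtils.toInputStream(source)) {
            stream.read(); // partial read only
        }

        // Before the fix this fails: close() never reached the subscription.
        assertThat(cancelled.get()).isTrue();
    }
}
{code}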

Propagating the "close" from the SequencedInputStream back to the subscribed publisher cannot be done with an intermediate Iterator representation, but adopting a Java 8 Stream instead allows propagating the closure and cancelling the subscription if need be. This approach solves both failing tests.
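
This is not the actual James implementation, but a simplified sketch of the idea: Reactor's Flux#toStream hands back a Stream whose close() cancels the underlying subscription, so the InputStream only needs to close that Stream.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.stream.Stream;

import reactor.core.publisher.Flux;

class StreamBackedInputStream extends InputStream {
    private final Stream<ByteBuffer> backingStream;
    private final Iterator<ByteBuffer> iterator;
    private ByteBuffer current = ByteBuffer.allocate(0);

    StreamBackedInputStream(Flux<ByteBuffer> source) {
        // Flux#toStream subscribes lazily and cancels the subscription
        // when the returned Stream is closed.
        this.backingStream = source.toStream();
        this.iterator = backingStream.iterator();
    }

    @Override
    public int read() throws IOException {
        while (!current.hasRemaining()) {
            if (!iterator.hasNext()) {
                return -1;
            }
            current = iterator.next();
        }
        return current.get() & 0xFF;
    }

    @Override
    public void close() {
        // The key difference with a plain Iterator: Stream#close reaches the
        // subscription and cancels it, releasing the pooled S3 connection.
        backingStream.close();
    }
}
{code}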

Possible causes that "revealed" this issue include:
  - S3 driver update <3
  - 

h3. Complementary work

We might benefit from investing in tools for automatic leak detection.

I found some Java agents for this: https://file-leak-detector.kohsuke.org/ could do the job!
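
For instance, such an agent could be attached to the James JVM via the standard -javaagent flag; agent-specific options (if any) should be taken from the file-leak-detector documentation rather than from this sketch:

{code:bash}
java -javaagent:/path/to/file-leak-detector.jar \
     <usual James launch command and options>
{code}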


> S3 blobStore leaks HTTP connections when reading streams partially (even when closing them)
> -------------------------------------------------------------------------------------------
>
>                 Key: JAMES-3676
>                 URL: https://issues.apache.org/jira/browse/JAMES-3676
>             Project: James Server
>          Issue Type: Improvement
>          Components: Blob
>    Affects Versions: 3.6.0
>            Reporter: Benoit Tellier
>            Priority: Major
>              Labels: bug
>             Fix For: 3.7.0, 3.6.1
>


