[ https://issues.apache.org/jira/browse/JAMES-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448978#comment-17448978 ]
Benoit Tellier commented on JAMES-3676:
---------------------------------------

3.7.0 PR: https://github.com/apache/james-project/pull/762

> S3 blobStore leaks HTTP connections when reading streams partially (even when closing them)
> --------------------------------------------------------------------------------------------
>
>                 Key: JAMES-3676
>                 URL: https://issues.apache.org/jira/browse/JAMES-3676
>             Project: James Server
>          Issue Type: Improvement
>          Components: Blob
>    Affects Versions: 3.6.0
>            Reporter: Benoit Tellier
>            Priority: Major
>              Labels: bug
>             Fix For: 3.7.0, 3.6.1
>
>
> h3. Symptoms
>
> Recently I found these delightful logs on one of my production environments:
>
> {code:java}
> {
>   "timestamp": "2021-11-25T04:19:56.897Z",
>   "level": "ERROR",
>   "thread": "imapserver-executor-560",
>   "mdc": {
>     "fetchedData": "FetchData{flags=false, uid=true, internalDate=false, size=true, envelope=false, body=false, bodyStructure=false, setSeen=false, bodyElements=[BodyFetchElement{firstOctet=null, numberOfOctets=null, name=BODY[], sectionType=CONTENT, fieldNames=Optional.empty}], modSeq=false, changedSince=-1, vanished=false}",
>     "protocol": "IMAP",
>     "useUid": "true",
>     "action": "FETCH",
>     "sessionId": "SID-cdkbuyectzcc",
>     "user": "x...@linagora.com"
>   },
>   "logger": "org.apache.james.imap.processor.AbstractMailboxProcessor",
>   "message": "Unexpected error during IMAP processing",
>   "context": "default",
>   "exception": "software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
> Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
> Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.
> If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.
> 	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
> 	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)
> 	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)
> 	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)
> 	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)
> 	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)
> 	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
> 	at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)
> 	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
> 	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:104)
> 	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
> 	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
> 	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onError(MakeAsyncHttpRequestStage.java:158)
> 	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:300)
> 	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:172)
> 	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> 	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
> 	at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)
> 	at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)
> 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.Throwable: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
> Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
> Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.
> If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.
> 	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.decorateException(NettyRequestExecutor.java:306)
> 	at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:299)
> 	... 11 common frames omitted
> Caused by: java.util.concurrent.TimeoutException: Acquire operation took longer than 10000 milliseconds.
> 	at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)
> 	at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)
> 	at org.glowroot.agent.plugin.executor.RunnableWrapper.run(RunnableWrapper.java:35)
> 	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
> 	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
> 	... 6 common frames omitted"
> }
> {code}
>
> TL;DR: the S3 driver cannot get an HTTP connection from the pool and fails randomly, which is very inconvenient.
>
> I attempted a reboot, which temporarily solved the issue, but we eventually ran out of HTTP connections again after some time...
>
> h3. Diagnostic
>
> This sounds like a leak to me!
>
> I quickly wrote a unit test reproducing the exact same issue by partially reading many blobs. Closing the resulting streams or not proved to have no effect.
>
> This pointed me to a defect within ReactorUtils.toInputStream: it was not cancelling the related publisher when closing the stream; only full consumption was releasing the subscription! Once again, an easy unit test properly demonstrates the issue.
>
> Propagating the "close" from the SequencedInputStream back to the subscribed publisher cannot be done with an iterator intermediate representation, but adopting a Java 8 Stream instead allows propagating the close and cancelling the subscription if need be. This approach solves both failing tests.
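>
> To make the reproduction concrete, here is a minimal sketch of such a test driver (illustrative only, not the actual James unit test; it assumes the BlobStore#read(BucketName, BlobId) signature and an already-saved blob):
>
> {code:java}
> import java.io.InputStream;
>
> import org.apache.james.blob.api.BlobId;
> import org.apache.james.blob.api.BlobStore;
> import org.apache.james.blob.api.BucketName;
>
> public class PartialReadLeakRepro {
>     // Partially read the same blob many times, closing each stream. Before the
>     // fix, every iteration left one pooled HTTP connection un-released,
>     // eventually producing the acquire timeout shown in the log above.
>     static void partiallyReadManyBlobs(BlobStore blobStore, BucketName bucket, BlobId blobId) throws Exception {
>         byte[] buffer = new byte[128];
>         for (int i = 0; i < 1000; i++) {
>             try (InputStream in = blobStore.read(bucket, blobId)) {
>                 in.read(buffer); // stop after the first bytes: the publisher is never drained
>             } // close() is supposed to cancel the subscription and free the connection
>         }
>     }
> }
> {code}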
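>
> The fix direction can be sketched as follows (again illustrative, not the actual ReactorUtils patch; James works with its own chunk types and helpers). Flux#toStream exposes a close() hook that cancels the subscription, which an iterator intermediate representation does not offer:
>
> {code:java}
> import java.io.ByteArrayInputStream;
> import java.io.InputStream;
> import java.io.SequenceInputStream;
> import java.util.Enumeration;
> import java.util.Iterator;
> import java.util.stream.Stream;
>
> import org.reactivestreams.Publisher;
> import reactor.core.publisher.Flux;
>
> public class StreamBackedInputStream {
>     public static InputStream toInputStream(Publisher<byte[]> chunks) {
>         // Flux#toStream wraps the publisher in a java.util.stream.Stream whose
>         // close() cancels the underlying subscription.
>         Stream<byte[]> stream = Flux.from(chunks).toStream();
>         Iterator<byte[]> iterator = stream.iterator();
>
>         Enumeration<InputStream> parts = new Enumeration<InputStream>() {
>             @Override
>             public boolean hasMoreElements() {
>                 return iterator.hasNext();
>             }
>
>             @Override
>             public InputStream nextElement() {
>                 return new ByteArrayInputStream(iterator.next());
>             }
>         };
>
>         return new SequenceInputStream(parts) {
>             @Override
>             public void close() {
>                 // Propagate close to the Stream: the subscription gets cancelled
>                 // even on a partial read, so the S3 driver can return the
>                 // connection to its pool. (The default SequenceInputStream.close()
>                 // would instead walk the whole enumeration and drain the publisher.)
>                 stream.close();
>             }
>         };
>     }
> }
> {code}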
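>
> For reference, the mitigations suggested in the stack trace (bigger pool, longer acquire timeout) map to the following AWS SDK v2 settings. With a genuine leak they only postpone pool exhaustion, but they may still be worth tuning once the leak is fixed (the values below are illustrative):
>
> {code:java}
> import java.time.Duration;
>
> import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
> import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
> import software.amazon.awssdk.services.s3.S3AsyncClient;
>
> public class S3ClientTuning {
>     static S3AsyncClient buildClient() {
>         SdkAsyncHttpClient httpClient = NettyNioAsyncHttpClient.builder()
>             .maxConcurrency(256)                                  // "increase max connections"
>             .connectionAcquisitionTimeout(Duration.ofSeconds(30)) // the log shows a 10s acquire timeout
>             .build();
>
>         // Region and credentials resolution omitted for brevity.
>         return S3AsyncClient.builder()
>             .httpClient(httpClient)
>             .build();
>     }
> }
> {code}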
>
> Possible causes that "revealed" this issue include:
> - S3 driver update <3
> - Additional usage of the BlobStore stream API for JMAP blob uploads?
>
> h3. Complementary work
>
> We might benefit from investing in tools for automatic leak detection.
>
> I found some Java agents to do so: https://file-leak-detector.kohsuke.org/ could do the job!
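>
> For instance, attaching the agent to the James JVM should look something like this (the jar path, option names, and james-server.jar are placeholders to double-check against the agent's documentation):
>
> {code}
> java -javaagent:path/to/file-leak-detector.jar=http=19999,threshold=200 \
>      -jar james-server.jar
> {code}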