[ https://issues.apache.org/jira/browse/JAMES-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583472#comment-17583472 ]
Benoit Tellier commented on JAMES-3793:
---------------------------------------

Ok, I made progress toward diagnosing this. The problem happened again today. High traffic generated by a single user consisted of FETCH 1:* BODYSTRUCTURE. The underlying Cassandra storage issued 100 parallel reads toward the ObjectStore for each of those, quickly overbooking the pool of 100 HTTP connections configured for the S3 service. At that point the FETCHes start failing, the MUA keeps retrying, putting further stress on the system. The number of in-memory messages skyrockets, especially if Reactor fails to cancel the S3 requests in a timely manner.

Some lessons learned:
 - Avoid loading too many full messages at once for a single IMAP request.
 - In particular, batchsizes.properties should be configured so that the S3 driver cannot be saturated by a single IMAP request.
 - Being slightly more memory efficient in the S3 driver would help.
 - Investigate how cancellation works in the S3 driver (unit tests). Something like: create 100 requests that you immediately cancel, and see if you go down the OOM path (see the sketch at the end of this message)...
 - Performance tests to measure our throughput for FETCH BODYSTRUCTURE on big messages.
 - Decrease the default values for batch sizes.

> OOM when loading a very large object from S3?
> ---------------------------------------------
>
>                 Key: JAMES-3793
>                 URL: https://issues.apache.org/jira/browse/JAMES-3793
>             Project: James Server
>          Issue Type: Bug
>            Reporter: Benoit Tellier
>            Priority: Major
>
> h2. What?
> We encountered a recurring OutOfMemory exception on one of our production deployments.
> Memory dump analysis was inconclusive, which tends to disqualify an explanation based on a memory leak (only 300MB of objects on the heap a few minutes after the OOM).
> A careful log analysis led to what seems to be the "original OOM":
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> 	at java.base/java.util.Arrays.copyOf(Unknown Source)
> 	at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
> 	at org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown Source)
> 	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
> 	at reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
> 	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
> 	at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
> 	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 	at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>
> Following this OOM the application is in a zombie state: unresponsive, throwing OOMs without stack traces, with Cassandra queries that never finish, unable to obtain a RabbitMQ connection, and having issues within the S3 driver... This sounds like a reactive programming limitation that prevents the Java platform from handling the OOM like it should (crash the app, take a dump, etc.).
> A quick, partial audit of our dataset found several emails/attachments exceeding 100MB (we might very well have some larger data!).
> Thus the current explanation is that somehow we successfully saved a very big mail in S3 and now get OOMs whenever someone tries to read it (as the S3 blob store DAO does defensive copies).
>
> h2. Possible actions
> This is an ongoing event, so our understanding of it can still evolve; yet, as it raises interesting fixes that are hard to understand without the related context, I decided to share it here anyway. I will report upcoming developments here.
> Our first action is to confirm the current diagnostic:
>  - Further audit our datasets to find large items
>  - Deploy a patched version of James that rejects and logs S3 objects larger than 50MB
> Yet our current understanding leads to interesting questions...
>
> *Is it a good idea to load big objects from S3 into our memory?*
> As a preliminary answer, upon email reads we are using `byte[]` for simplicity (no resource management, full view of the data). Changing this is not in the scope of this ticket, as it is likely a major rework with many unforeseen impacts. (I don't want to open that Pandora's box...)
> SMTP, IMAP, JMAP, and the mailet container all have configuration preventing sending/saving/receiving/uploading too big a mail/attachment/blob, so we likely have a convincing defense line at the protocol level. Yet this can be defeated by bad configuration (in our case JMAP was not checking the size of sent emails...), by history (rules were not the same in the past, so we ingested too big a mail back then), or by 'malicious action' (if all it takes to crash James is to replace a 1 MB mail by a 1 GB mail...). It thus sounds interesting to me to have additional protection at the data access layer, and be able to (optionally) configure S3 to not load objects of, say, more than 50 MB. This could be added within the blob.properties file. Something like:
> {code:java}
> # Maximum size of blobs allowed to be loaded as a byte array. Allows preventing the loading of too large objects into memory (which can cause OutOfMemoryException).
> # Optional, defaults to no limit being enforced. This is a size in bytes. Supported units are B, K, M, G, T, defaulting to B.
> max.blob.inmemory.size=50M
> {code}
> As an operator this would give me some peace of mind: knowing that James won't attempt to load GB-large emails into memory and would fail early, without heading into the OOM realm and all the related stability issues it brings.
> Also, the incriminated code path (`BytesWrapper::asByteArray`) does a defensive copy, but there is an alternative: `BytesWrapper::asByteArrayUnsafe`. The S3 driver guarantees not to mutate the byte[], which sounds good enough given that James doesn't mutate it either. Preventing needless copies of MB-large mails won't solve the core issue, but it would definitely give a nice performance boost as well as decrease the impact of handling very large emails...
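To make the cancellation bullet from the comment above concrete, here is a minimal, hypothetical probe. It is not part of the actual James test suite: the readLargeBlob supplier stands in for a real call such as Mono.from(s3BlobStoreDAO.readBytes(bucket, bigBlobId)), and the class, field, and fake-read names are made up; only the "create 100 requests that you immediately cancel" idea comes from the comment itself.

{code:java}
import java.time.Duration;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import reactor.core.Disposable;
import reactor.core.publisher.Mono;

public class CancellationProbe {

    // Stand-in for the real reactive S3 read, e.g. () -> Mono.from(s3BlobStoreDAO.readBytes(bucket, bigBlobId))
    private final Supplier<Mono<byte[]>> readLargeBlob;
    private final AtomicInteger completedReads = new AtomicInteger();

    public CancellationProbe(Supplier<Mono<byte[]>> readLargeBlob) {
        this.readLargeBlob = readLargeBlob;
    }

    public void run() throws InterruptedException {
        // Start 100 reads of a large object and cancel every subscription immediately.
        List<Disposable> inFlight = IntStream.range(0, 100)
            .mapToObj(i -> readLargeBlob.get()
                .doOnNext(bytes -> completedReads.incrementAndGet())
                .subscribe())
            .collect(Collectors.toList());

        inFlight.forEach(Disposable::dispose);

        // Leave some time for the driver to abort the underlying HTTP requests.
        Thread.sleep(Duration.ofSeconds(3).toMillis());

        // If cancellation propagates to the HTTP layer, few (ideally zero) reads complete,
        // the connection pool frees up quickly, and heap usage stays flat; if it does not,
        // this is the path that leads to the OOM described above.
        System.out.println("Reads completed despite cancellation: " + completedReads.get());
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake read for local experimentation: a slow Mono that only materializes
        // a 100MB array if the subscription survives long enough to complete.
        Supplier<Mono<byte[]>> fakeRead = () ->
            Mono.delay(Duration.ofSeconds(2)).map(tick -> new byte[100 * 1024 * 1024]);

        new CancellationProbe(fakeRead).run();
    }
}
{code}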
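And a rough sketch of the two ideas from the quoted description (the max.blob.inmemory.size guard and asByteArrayUnsafe), assuming only what the stack trace shows about the read path. InMemoryBlobGuard, ObjectTooLargeException, and the way the limit is injected are illustrative names, not the real S3BlobStoreDAO code; the 50M limit would come from the proposed blob.properties entry after parsing.

{code:java}
import java.util.Optional;

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class InMemoryBlobGuard {

    public static class ObjectTooLargeException extends RuntimeException {
        public ObjectTooLargeException(String message) {
            super(message);
        }
    }

    // e.g. parsed from max.blob.inmemory.size=50M; empty means no limit enforced
    private final Optional<Long> maxInMemorySizeBytes;

    public InMemoryBlobGuard(Optional<Long> maxInMemorySizeBytes) {
        this.maxInMemorySizeBytes = maxInMemorySizeBytes;
    }

    public byte[] toBytes(ResponseBytes<GetObjectResponse> responseBytes) {
        long contentLength = responseBytes.response().contentLength();

        // Fail early and explicitly instead of drifting toward an OutOfMemoryError.
        maxInMemorySizeBytes
            .filter(limit -> contentLength > limit)
            .ifPresent(limit -> {
                throw new ObjectTooLargeException(
                    "Refusing to load a " + contentLength + " byte object in memory (limit: " + limit + " bytes)");
            });

        // The S3 driver guarantees not to mutate the returned array and James does not
        // mutate it either, so the unsafe accessor avoids one full copy of the payload.
        return responseBytes.asByteArrayUnsafe();
    }
}
{code}

Whether such a check belongs in the DAO itself or closer to the configuration layer is open; the point of the sketch is only to show failing fast before the defensive copy happens.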