[ https://issues.apache.org/jira/browse/JAMES-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574744#comment-17574744 ]
Karsten Otto commented on JAMES-3793:
-------------------------------------

We occasionally observed an unresponsive James as well; now I wonder if it was this issue. Anyway, is there a trick to have Reactor let James crash "properly" on OOM? As an operator, I can handle crashed processes easily, while an unresponsive one might sneak past our monitoring until other issues pile up later.

> OOM when loading a very large object from S3?
> ---------------------------------------------
>
>                 Key: JAMES-3793
>                 URL: https://issues.apache.org/jira/browse/JAMES-3793
>             Project: James Server
>          Issue Type: Bug
>            Reporter: Benoit Tellier
>            Priority: Major
>
> h2. What?
>
> We encountered recurring OutOfMemory exceptions on one of our production deployments.
>
> Memory dump analysis was inconclusive, which tends to rule out an explanation based on a memory leak (only 300 MB of objects on the heap a few minutes after the OOM).
>
> A careful log analysis led us to what seems to be the "original OOM":
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> at java.base/java.util.Arrays.copyOf(Unknown Source)
> at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
> at org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown Source)
> at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
> at reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
> at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
> at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
> at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> Following this OOM the application is in a zombie state: unresponsive, throwing OOMs without stack traces, with Cassandra queries that never finish, unable to obtain a RabbitMQ connection, and having issues within the S3 driver... This sounds like a limitation of reactive programming that prevents the Java platform from handling the OOM the way it should (crash the app, take a dump, etc.).
>
> A quick, partial audit of our dataset found several emails/attachments exceeding 100 MB (we might very well have some larger data!).
>
> Thus the current explanation is that we somehow successfully saved a very big mail in S3 and now get OOMs whenever something tries to read it (as the S3 blob store DAO does defensive copies).
>
> h2. Possible actions
>
> This is an ongoing event, so our understanding of it may still evolve; yet since it raises interesting fixes that are hard to understand without the related context, I decided to share it here anyway. I will report upcoming developments here.
>
> Our first action is to confirm the current diagnosis:
> - Further audit our datasets to find large items
> - Deploy a patched version of James that rejects and logs S3 objects larger than 50 MB
>
> Yet our current understanding leads to interesting questions...
>
> *Is it a good idea to load big objects from S3 into our memory?*
>
> As a preliminary answer, upon email reads we are using `byte[]` for simplicity (no resource management, full view of the data).
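To make that byte[] read path concrete, here is a minimal sketch of a fully in-memory S3 read built on the AWS SDK v2 async client and Reactor, mirroring the shape of the stack trace above. This is an illustration, not the actual S3BlobStoreDAO code; everything except the SDK and Reactor calls is made up for the example.

{code:java}
import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Illustration only: not the actual S3BlobStoreDAO code.
public class InMemoryS3Read {
    private final S3AsyncClient client;

    public InMemoryS3Read(S3AsyncClient client) {
        this.client = client;
    }

    public Mono<byte[]> readBytes(String bucket, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();

        // AsyncResponseTransformer.toBytes() buffers the whole object on the heap,
        // and asByteArray() then makes a defensive copy of that buffer; that copy is
        // the Arrays.copyOf frame that blows up in the stack trace for huge blobs.
        return Mono.fromFuture(() -> client.getObject(request, AsyncResponseTransformer.toBytes()))
                .map(ResponseBytes::asByteArray);
    }
}
{code}

Note that a single read holds the blob on the heap twice, once in the SDK's buffer and once in the defensive copy, so one multi-hundred-MB object can be enough to exhaust a modest heap.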
> Changing this byte[] approach is not in the scope of this ticket, as it is likely a major rework with many unforeseen impacts. (I don't want to open that Pandora's box...)
>
> SMTP, IMAP, JMAP, and the mailet container all have configuration preventing sending/saving/receiving/uploading too big a mail/attachment/blob, so we likely have a convincing line of defense at the protocol level. Yet this can be defeated by bad configuration (in our case JMAP was not checking the size of sent emails....), by history (the rules were not the same in the past, so we ingested overly big mails back then), or by 'malicious action' (if all it takes to crash James is to replace a 1 MB mail by a 1 GB mail....). It thus sounds interesting to me to have additional protection at the data access layer, and to be able to (optionally) configure the S3 blob store to not load objects of, say, more than 50 MB. This could be added within the blob.properties file. Something like:
> {code:java}
> # Maximum size of blobs allowed to be loaded as a byte array. Allows preventing
> # too large objects from being loaded into memory (which can cause OutOfMemoryException).
> # Optional, defaults to no limit being enforced. This is a size in bytes.
> # Supported units are B, K, M, G, T, defaulting to B.
> max.blob.inmemory.size=50M
> {code}
> As an operator this would give me some peace of mind, knowing that James won't attempt to load GB-large emails into memory and would instead fail early, without heading into the OOM realm and all the related stability issues it brings.
>
> Also, the incriminated code path (`BytesWrapper::asByteArray`) does a defensive copy, but there is an alternative: `BytesWrapper::asByteArrayUnsafe`. The S3 driver guarantees not to mutate the byte[], which sounds good enough given that James doesn't mutate it either. Avoiding needless copies of multi-MB mails won't solve the core issue, but it would definitely give a nice performance boost as well as decrease the impact of handling very large emails...
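To make the quoted proposal concrete, here is a rough sketch of how an optional in-memory size limit could be combined with the asByteArrayUnsafe() variant on the read path. Only the max.blob.inmemory.size property and asByteArrayUnsafe() come from the proposal itself; the class, the exception, the HEAD-based check, and the way the limit is wired in are assumptions made for illustration, not the actual James implementation.

{code:java}
import java.util.Optional;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

// Sketch of an optional size guard plus copy-free read. Class and exception names
// are made up; only max.blob.inmemory.size (the configured limit) and
// asByteArrayUnsafe() come from the proposal above.
public class GuardedS3Read {

    public static class BlobTooLargeException extends RuntimeException {
        BlobTooLargeException(String bucket, String key, long size, long limit) {
            super(String.format("Blob %s/%s is %d bytes, above the %d byte in-memory limit",
                    bucket, key, size, limit));
        }
    }

    private final S3AsyncClient client;
    // Would be parsed from the proposed 'max.blob.inmemory.size' entry in blob.properties;
    // an empty Optional means no limit is enforced.
    private final Optional<Long> maxInMemorySize;

    public GuardedS3Read(S3AsyncClient client, Optional<Long> maxInMemorySize) {
        this.client = client;
        this.maxInMemorySize = maxInMemorySize;
    }

    public Mono<byte[]> readBytes(String bucket, String key) {
        // Reject oversized blobs before fetching the body, then read without the extra copy.
        return checkSize(bucket, key)
                .then(fetchUnsafe(bucket, key));
    }

    private Mono<Void> checkSize(String bucket, String key) {
        if (maxInMemorySize.isEmpty()) {
            return Mono.empty();
        }
        long limit = maxInMemorySize.get();
        HeadObjectRequest head = HeadObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();
        // HEAD the object so we can fail early on its Content-Length, before any buffering.
        return Mono.fromFuture(() -> client.headObject(head))
                .flatMap(response -> {
                    long size = response.contentLength();
                    if (size > limit) {
                        return Mono.error(new BlobTooLargeException(bucket, key, size, limit));
                    }
                    return Mono.empty();
                });
    }

    private Mono<byte[]> fetchUnsafe(String bucket, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();
        return Mono.fromFuture(() -> client.getObject(request, AsyncResponseTransformer.toBytes()))
                // asByteArrayUnsafe() hands back the buffered array without the defensive copy;
                // fine as long as neither the SDK nor James mutates it, as argued above.
                .map(ResponseBytes::asByteArrayUnsafe);
    }
}
{code}

The HEAD request costs an extra round trip, and a blob could in principle change between the HEAD and the GET; checking Content-Length on the GET response itself before buffering the body would avoid both, but that requires a custom AsyncResponseTransformer, so the HEAD variant keeps the sketch short.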