Benoit Tellier created JAMES-3793:
-------------------------------------

             Summary: OOM when loading a very large object from S3?
                 Key: JAMES-3793
                 URL: https://issues.apache.org/jira/browse/JAMES-3793
             Project: James Server
          Issue Type: Bug
            Reporter: Benoit Tellier


h2. What?

We encountered recurring OutOfMemory exceptions on one of our production 
deployments.

Memory dump analysis was inconclusive, which tends to rule out an explanation 
based on a memory leak (only 300MB of objects on the heap a few minutes after 
the OOM).

A careful log analysis led us to what seems to be the "original" OOM:

{code:java}
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Unknown Source)
at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
at 
org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown
 Source)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
at 
reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
 Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
{code}
 
Following this OOM the application is left in a zombie state: unresponsive, 
throwing OOMs without stacktraces, with Cassandra queries that never finish, 
unable to obtain a RabbitMQ connection, and having issues within the S3 
driver... This sounds like a limitation of reactive programming, which prevents 
the Java platform from handling the OOM the way it should (crash the app, take 
a dump, etc.).

A quick, partial audit of our dataset found several emails/attachments 
exceeding 100MB (we might very well have some larger data!).

Thus the current explanation is that somehow we successfully saved a very big 
mail in S3 and now get OOMs whenever something tries to read it (the S3 blob 
store DAO also makes defensive copies).
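
For reference, the kind of read path the stack trace points at looks roughly 
like the sketch below. This is a simplified illustration using the AWS SDK v2 
async client and Reactor, not the actual S3BlobStoreDAO code; `bucketName` and 
`key` are placeholders.

{code:java}
import java.util.concurrent.CompletableFuture;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

class ReadPathSketch {
    private final S3AsyncClient client;
    private final String bucketName;

    ReadPathSketch(S3AsyncClient client, String bucketName) {
        this.client = client;
        this.bucketName = bucketName;
    }

    Mono<byte[]> readBytes(String key) {
        CompletableFuture<ResponseBytes<GetObjectResponse>> future = client.getObject(
            GetObjectRequest.builder().bucket(bucketName).key(key).build(),
            AsyncResponseTransformer.toBytes());

        return Mono.fromFuture(future)
            // The SDK has already buffered the whole object in memory at this point;
            // asByteArray() then makes an extra defensive copy (the Arrays.copyOf
            // frame in the stack trace above), doubling the footprint of the blob.
            .map(ResponseBytes::asByteArray);
    }
}
{code}

In other words the full object is buffered by the SDK, then duplicated, so a 
single very large blob is enough to exhaust the heap.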

h2. Possible actions

This is an ongoing event, so our understanding of it may still evolve. Yet as 
it raises interesting fixes that are hard to understand without the related 
context, I decided to share it here anyway. I will report upcoming developments 
here.

Our first action is to confirm the current diagnosis:
  - Further audit our datasets to find large items
  - Deploy a patched version of James that rejects and logs S3 objects larger 
than 50MB (see the sketch after this list)
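
Such a "reject and log" guard could look something like the following. This is 
a hypothetical sketch (class and method names are made up, not the actual patch 
we deployed): do a HEAD request first, so the size is known before a single 
byte of the body is buffered.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

class BlobSizeGuard {
    private static final Logger LOGGER = LoggerFactory.getLogger(BlobSizeGuard.class);
    private static final long MAX_BYTES = 50L * 1024 * 1024; // the 50MB cut-off

    private final S3AsyncClient client;
    private final String bucketName;

    BlobSizeGuard(S3AsyncClient client, String bucketName) {
        this.client = client;
        this.bucketName = bucketName;
    }

    // HEAD the object first: oversized blobs can be rejected and logged without
    // ever loading their body, so there is no risk of blowing up the heap.
    Mono<Long> assertSmallEnough(String key) {
        return Mono.fromFuture(client.headObject(
                HeadObjectRequest.builder().bucket(bucketName).key(key).build()))
            .map(HeadObjectResponse::contentLength)
            .flatMap(size -> {
                if (size > MAX_BYTES) {
                    LOGGER.error("Blob {} is {} bytes, refusing to load it into memory", key, size);
                    return Mono.<Long>error(new IllegalArgumentException(
                        "Blob too big to be loaded in memory: " + key));
                }
                return Mono.just(size);
            });
    }
}
{code}

The extra round trip per read is the price for never buffering the body of an 
oversized object.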

Yet our current understanding leads to interesting questions...

*Is it a good idea to load big objects from S3 into our memory?*

As a preliminary answer, upon email reads we are using `byte[]` for simplicity 
(no resource management, full view of the data). Changing this is out of the 
scope of this ticket, as it is likely a major rework with many unforeseen 
impacts. (I don't want to open that Pandora's box...)

SMTP, IMAP, JMAP, and the mailet container all have configuration preventing 
sending/saving/receiving/uploading too big a mail/attachment/blob, so we likely 
have a convincing defense line at the protocol level. Yet this can be defeated 
by bad configuration (in our case JMAP was not checking the size of sent 
emails...), by history (rules were not the same in the past, so we ingested 
overly large mails back then), or by 'malicious action' (if all it takes to 
crash James is to replace a 1 MB mail by a 1 GB mail...). It thus sounds 
interesting to me to have additional protection at the data access layer, and 
to be able to (optionally) configure the S3 blob store to not load objects 
larger than, say, 50 MB. This could be added within the blob.properties file.

Something like:

{code:java}
# Maximum size of blobs allowed to be loaded as a byte array. Allows preventing
# the loading of too large objects into memory (which can cause OutOfMemoryError).
# Optional, defaults to no limit being enforced. This is a size in bytes.
# Supported units are B, K, M, G, T (defaults to B).
max.blob.inmemory.size=50M
{code}
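
Enforcing this also requires parsing the human-readable size. James likely 
already has a helper for that; the standalone sketch below (hypothetical class 
name) is purely illustrative of the intended semantics.

{code:java}
import java.util.Optional;

// Hypothetical parser for the proposed property value: accepts a raw byte count
// ("52428800") or a suffixed value ("50M", "2G"...), the suffix defaulting to
// bytes as described in the configuration comment above.
class InMemorySizeLimit {
    static Optional<Long> parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty(); // property absent -> no limit enforced
        }
        String value = raw.trim().toUpperCase();
        char unit = value.charAt(value.length() - 1);
        long multiplier;
        switch (unit) {
            case 'B': multiplier = 1L; break;
            case 'K': multiplier = 1024L; break;
            case 'M': multiplier = 1024L * 1024; break;
            case 'G': multiplier = 1024L * 1024 * 1024; break;
            case 'T': multiplier = 1024L * 1024 * 1024 * 1024; break;
            default:
                // no unit suffix: the whole string is a plain byte count
                return Optional.of(Long.parseLong(value));
        }
        return Optional.of(Long.parseLong(value.substring(0, value.length() - 1)) * multiplier);
    }
}
{code}

An empty or absent value would then mean "no limit enforced", matching the 
default described above.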

As an operator this would give me some peace of mind, knowing that James won't 
attempt to load GB-large emails into memory and would instead fail early, 
without heading into OOM territory and all the related stability issues it 
brings.

Also, the incriminated code path (`BytesWrapper::asByteArray`) does a defensive 
copy, but there is an alternative: `BytesWrapper::asByteArrayUnsafe`. The S3 
driver guarantees not to mutate the byte[], which sounds good enough given that 
James does not mutate it either. Avoiding needless copies of MB-large mails 
won't solve the core issue, but it would definitely give a nice performance 
boost as well as decrease the impact of handling very large emails...
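
Concretely the change would be a one-liner in the read path. The sketch below 
(hypothetical helper, simplified `Mono<ResponseBytes<...>>` shapes) contrasts 
the two variants:

{code:java}
import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

class CopyVsNoCopy {
    // Current behaviour: asByteArray() copies the SDK's internal buffer (the
    // Arrays.copyOf frame in the stack trace), so a 1 GB object transiently
    // needs roughly 2 GB of heap.
    static Mono<byte[]> withDefensiveCopy(Mono<ResponseBytes<GetObjectResponse>> response) {
        return response.map(ResponseBytes::asByteArray);
    }

    // Alternative: asByteArrayUnsafe() hands back the internal array without
    // copying it. Only safe as long as nobody mutates the returned array
    // afterwards, which matches how James uses the blob content.
    static Mono<byte[]> withoutCopy(Mono<ResponseBytes<GetObjectResponse>> response) {
        return response.map(ResponseBytes::asByteArrayUnsafe);
    }
}
{code}

The unsafe variant trades the copy for the contract that the returned array is 
never modified afterwards.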



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
