[ https://issues.apache.org/jira/browse/JAMES-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17574744#comment-17574744 ]
Karsten Otto commented on JAMES-3793:
-------------------------------------

We occasionally observed an unresponsive James as well; now I wonder if it was this issue. Anyway, is there a trick to have Reactor let James crash "properly" on OOM? As an operator, I can handle crashed processes easily, while an unresponsive one might sneak past our monitoring until other issues pile up later.

> OOM when loading a very large object from S3?
> ---------------------------------------------
>
>                 Key: JAMES-3793
>                 URL: https://issues.apache.org/jira/browse/JAMES-3793
>             Project: James Server
>          Issue Type: Bug
>            Reporter: Benoit Tellier
>            Priority: Major
>
> h2. What?
>
> We encountered recurring OutOfMemory exceptions on one of our production deployments.
>
> Memory dump analysis was inconclusive, which tends to rule out an explanation based on a memory leak (only 300 MB of objects on the heap a few minutes after the OOM).
>
> A careful log analysis led us to what seems to be the "original OOM":
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> at java.base/java.util.Arrays.copyOf(Unknown Source)
> at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
> at org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown Source)
> at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
> at reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
> at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
> at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
> at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> Following this OOM the application is in a zombie state: unresponsive, throwing OOMs without stack traces, with Cassandra queries that never finish, unable to obtain a RabbitMQ connection, and having issues within the S3 driver... This sounds like a limitation of reactive programming that prevents the Java platform from handling the OOM the way it should (crash the app, take a dump, etc.).
>
> A quick, partial audit of our dataset found several emails/attachments exceeding 100 MB (we might very well have some larger data!).
>
> Thus the current explanation is that we somehow successfully saved a very big mail in S3 and now get OOMs whenever something tries to read it (as the S3 blob store DAO does defensive copies).
>
> h2. Possible actions
>
> This is an ongoing event, so our understanding of it may still evolve; yet since it raises interesting fixes that are hard to understand without the related context, I decided to share it here anyway. I will report upcoming developments here.
>
> Our first action is to confirm the current diagnosis:
> - Further audit our datasets to find large items
> - Deploy a patched version of James that rejects and logs S3 objects larger than 50 MB
>
> Yet our current understanding leads to interesting questions...
>
> *Is it a good idea to load big objects from S3 into our memory?*
>
> As a preliminary answer, upon email reads we are using `byte[]` for simplicity (no resource management, full view of the data).
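To make that byte[] read path concrete, here is a minimal sketch of a fully in-memory S3 read built on the AWS SDK v2 async client and Reactor, mirroring the shape of the stack trace above. This is an illustration, not the actual S3BlobStoreDAO code; everything except the SDK and Reactor calls is made up for the example.

{code:java}
import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Illustration only: not the actual S3BlobStoreDAO code.
public class InMemoryS3Read {
    private final S3AsyncClient client;

    public InMemoryS3Read(S3AsyncClient client) {
        this.client = client;
    }

    public Mono<byte[]> readBytes(String bucket, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();

        // AsyncResponseTransformer.toBytes() buffers the whole object on the heap,
        // and asByteArray() then makes a defensive copy of that buffer; that copy is
        // the Arrays.copyOf frame that blows up in the stack trace for huge blobs.
        return Mono.fromFuture(() -> client.getObject(request, AsyncResponseTransformer.toBytes()))
                .map(ResponseBytes::asByteArray);
    }
}
{code}

Note that a single read holds the blob on the heap twice, once in the SDK's buffer and once in the defensive copy, so one multi-hundred-MB object can be enough to exhaust a modest heap.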
> Changing this byte[] approach is not in the scope of this ticket, as it is likely a major rework with many unforeseen impacts. (I don't want to open that Pandora's box...)
>
> SMTP, IMAP, JMAP, and the mailet container all have configuration preventing sending/saving/receiving/uploading too big a mail/attachment/blob, so we likely have a convincing line of defense at the protocol level. Yet this can be defeated by bad configuration (in our case JMAP was not checking the size of sent emails....), by history (the rules were not the same in the past, so we ingested overly big mails back then), or by 'malicious action' (if all it takes to crash James is to replace a 1 MB mail by a 1 GB mail....). It thus sounds interesting to me to have additional protection at the data access layer, and to be able to (optionally) configure the S3 blob store to not load objects of, say, more than 50 MB. This could be added within the blob.properties file. Something like:
> {code:java}
> # Maximum size of blobs allowed to be loaded as a byte array. Allows preventing
> # too large objects from being loaded into memory (which can cause OutOfMemoryException).
> # Optional, defaults to no limit being enforced. This is a size in bytes.
> # Supported units are B, K, M, G, T, defaulting to B.
> max.blob.inmemory.size=50M
> {code}
> As an operator this would give me some peace of mind, knowing that James won't attempt to load GB-large emails into memory and would instead fail early, without heading into the OOM realm and all the related stability issues it brings.
>
> Also, the incriminated code path (`BytesWrapper::asByteArray`) does a defensive copy, but there is an alternative: `BytesWrapper::asByteArrayUnsafe`. The S3 driver guarantees not to mutate the byte[], which sounds good enough given that James doesn't mutate it either. Avoiding needless copies of multi-MB mails won't solve the core issue, but it would definitely give a nice performance boost as well as decrease the impact of handling very large emails...
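To make the quoted proposal concrete, here is a rough sketch of how an optional in-memory size limit could be combined with the asByteArrayUnsafe() variant on the read path. Only the max.blob.inmemory.size property and asByteArrayUnsafe() come from the proposal itself; the class, the exception, the HEAD-based check, and the way the limit is wired in are assumptions made for illustration, not the actual James implementation.

{code:java}
import java.util.Optional;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

// Sketch of an optional size guard plus copy-free read. Class and exception names
// are made up; only max.blob.inmemory.size (the configured limit) and
// asByteArrayUnsafe() come from the proposal above.
public class GuardedS3Read {

    public static class BlobTooLargeException extends RuntimeException {
        BlobTooLargeException(String bucket, String key, long size, long limit) {
            super(String.format("Blob %s/%s is %d bytes, above the %d byte in-memory limit",
                    bucket, key, size, limit));
        }
    }

    private final S3AsyncClient client;
    // Would be parsed from the proposed 'max.blob.inmemory.size' entry in blob.properties;
    // an empty Optional means no limit is enforced.
    private final Optional<Long> maxInMemorySize;

    public GuardedS3Read(S3AsyncClient client, Optional<Long> maxInMemorySize) {
        this.client = client;
        this.maxInMemorySize = maxInMemorySize;
    }

    public Mono<byte[]> readBytes(String bucket, String key) {
        // Reject oversized blobs before fetching the body, then read without the extra copy.
        return checkSize(bucket, key)
                .then(fetchUnsafe(bucket, key));
    }

    private Mono<Void> checkSize(String bucket, String key) {
        if (maxInMemorySize.isEmpty()) {
            return Mono.empty();
        }
        long limit = maxInMemorySize.get();
        HeadObjectRequest head = HeadObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();
        // HEAD the object so we can fail early on its Content-Length, before any buffering.
        return Mono.fromFuture(() -> client.headObject(head))
                .flatMap(response -> {
                    long size = response.contentLength();
                    if (size > limit) {
                        return Mono.error(new BlobTooLargeException(bucket, key, size, limit));
                    }
                    return Mono.empty();
                });
    }

    private Mono<byte[]> fetchUnsafe(String bucket, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();
        return Mono.fromFuture(() -> client.getObject(request, AsyncResponseTransformer.toBytes()))
                // asByteArrayUnsafe() hands back the buffered array without the defensive copy;
                // fine as long as neither the SDK nor James mutates it, as argued above.
                .map(ResponseBytes::asByteArrayUnsafe);
    }
}
{code}

The HEAD request costs an extra round trip, and a blob could in principle change between the HEAD and the GET; checking Content-Length on the GET response itself before buffering the body would avoid both, but that requires a custom AsyncResponseTransformer, so the HEAD variant keeps the sketch short.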