[ https://issues.apache.org/jira/browse/JAMES-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier closed JAMES-3810.
---------------------------------
    Resolution: Fixed

> Performance issues with RabbitMQMailQueue browsing
> --------------------------------------------------
>
>                 Key: JAMES-3810
>                 URL: https://issues.apache.org/jira/browse/JAMES-3810
>             Project: James Server
>          Issue Type: Improvement
>          Components: Queue, rabbitmq
>    Affects Versions: 3.7.0
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. What
> Upon incidents, the update of the browse start can be suspended if some old mails are stuck in the mail queue.
> Given enough time, browsing / browse start updates will fail (too much data loaded at once).
> Periodic browse start updates would then pile up until they create performance issues.
> h3. The fix
> - Allow better resilience when browsing large amounts of data.
> - Handle Cassandra exceptions correctly upon dequeue (nack the items).
> - Finally, provide a health check that WARNs when the browse start is out of date.
> h3. What to do when the browse start is out of date
> - Identify the items still present in the queue.
> - Remove them from the queue.
> - Let everything catch up.
> Alternatively, if old content can be discarded, an aggressive manual update of the browse start is an option.
> h3. Alternative
> Disabling the Cassandra part of the RabbitMQ mail queue is OK for a MDA, and the RabbitMQ mailQueue is unfit for a MX.
> Thus it would sound legitimate to just disable it.
> h3. Interesting Gitter discussion
> I leave it here as this is a very interesting discussion, explaining causes, effects, solutions, quick fixes, etc...
> {code:java}
> @iljaweis
> Hello, I'm currently investigating some performance problems around the Mail Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4.
> From reading https://github.com/apache/james-project/blob/master/src/adr/0031-distributed-mail-queue.md and https://github.com/apache/james-project/blob/master/src/adr/0032-distributed-mail-queue-cleanup.md it is my understanding that the content of the tables "contentstart" and "browsestart" indicates the oldest dates where we should start processing entries from enqueuedmailsv4.
> However, the content of both contentstart and browsestart for all queues is stuck at "2022-07-12 14:00:00" and does not change. The result is that we're querying lots of "old" data from enqueuedmailsv4, but our queues (judging from RabbitMQ) appear to be basically empty all the time. The value of updateBrowseStartPace is 1000, the default. We're using distributed-pop3-app. Is this behaviour expected, or if not, is there something I could look at? Thanks.
> @iljaweis
> Hmm. Looks like James uses LIMIT 5000, but we're already at 17435 rows for that timerangestart. Looks like we need to increase bucketCount, but is there a way to deal with these old rows?
> @chibenwa
> Hello.
> Emails between contentStart and browseStart are eligible for cleanup (meaning deletion from the objectStore, then from the Cassandra mailQueue view).
> BrowseStart matches (modulo the duration of a slice) the oldest mail still in the queue. ContentStart matches the first email whose content is not deleted. So contentStart <= browseStart.
> > Looks like James uses LIMIT 5000
> I bet you are looking at logs on the Cassandra server, most likely upon massive query rates. That limit does not correspond to any code on the James side (cf. EnqueuedMailsDAO); nothing to worry about, as this is the default page size (the driver requests large reads in pages of 5000).
> > I'm currently investigating some performance problems around the Mail Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4.
> This is rather insufficient as a description for me to understand what happens, but (knowing the design of the underlying solution) let me take a bet:
> You are having performance issues as massive load queries are done by the RabbitMQ mail queue. This massive load is due to a full content read of the mail queue content every time the browse start is read (it reads everything from 12/07, so slice * bucket, and returns loads of data).
> The root cause is the browseStart being "stuck" on the 12/07. This could be because an email from the 12/07 had not been properly processed.
> Things I would check include:
> - Dead letter queues on the RabbitMQ management API. If they are not empty, audit them, and consider re-routing their content to resume delivery. (Deserialization failures, or a missing object in S3, can be the root cause; messages that will always fail to get dequeued are discarded into a dead letter queue to prevent an infinite loop while not losing data.)
> - The content of the mail queue, by browsing it. If this is not doable, audit the content of the oldest slice in enqueuedMailV4 and check it against deletedMailsV2. We are looking for content within enqueuedMailV4 that is not marked as deleted.
> > The value of updateBrowseStartPace is 1000
> During the audit phase, until you know why "browseStart" is stuck, I recommend increasing it to an unreasonably high value to effectively disable browseStart updates. This will prevent the high query rates that are hurting you from happening.
> You could also consider a higher value based on your traffic. Are you delivering more than 1000 emails per slicing period?
> > Looks like we need to increase bucketCount
> Won't help. Unless you scale out your Cassandra cluster and want to spread the write load of the mail queue.
> Here the way to go is to audit why browse start updates are stuck.
> > Is this behaviour expected, or if not, is there something I could look at? Thanks.
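[Editor's note] The slice * bucket layout referred to above can be sketched with a toy model. This is illustrative only, not James's actual `CassandraMailQueueView` implementation: the class, method names, and the hashing scheme are assumptions; only the default 1-hour slice comes from the discussion.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Toy model of the distributed mail queue's time-slice/bucket layout.
// Illustrative only: real James code may differ.
public class SliceSketch {

    // An enqueued mail is stored under (sliceStart, bucket): the slice is
    // derived from the enqueue time (default slice duration: 1 hour).
    static Instant sliceStart(Instant enqueueTime) {
        return enqueueTime.truncatedTo(ChronoUnit.HOURS);
    }

    // The bucket spreads write load across partitions (hypothetical hashing).
    static int bucket(String enqueueId, int bucketCount) {
        return Math.floorMod(enqueueId.hashCode(), bucketCount);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2022-07-12T14:35:00Z");
        // Browsing scans every (slice, bucket) pair from browseStart onward,
        // so a stuck browseStart means slice * bucket reads of old data;
        // raising bucketCount only spreads writes, it does not move browseStart.
        System.out.println(sliceStart(t)); // prints 2022-07-12T14:00:00Z
        System.out.println(bucket("mail-1", 4));
    }
}
```

This also makes chibenwa's point concrete: a larger bucketCount multiplies the number of partitions scanned per slice, which helps write distribution but does nothing for a browse start stuck weeks in the past.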
> As explained before, that is a "not surprising" edge case. I'm interested to confirm the diagnostic, and also to find the root cause.
> Regards
> Benoit
> @chibenwa
> Had a look. By default the browse start update does 256 slice reads, which is likely too much. We could likely decrease this number for it to be less aggressive.
> @chibenwa
> Also we could think of a healthcheck warning when the browseStart is older than, say, 1 week.
> @iljaweis
> Hello, yes, you guessed exactly right what the situation is. Sorry for being slightly unclear. Browsing the queue (using the web admin endpoint) timed out, so I checked for enqueueIds from enqueuedmailsv4 that are not in deletedmailsv2 and found a lot of them. That day we actually had a problem with a backend, and a lot of mails suffered exceptions during queue handling. The mails were moved to /var/mail/error, but also a bounce was created. The mails are still sitting in enqueuedmailsv4 with state=error.
> @iljaweis
> There isn't anything in any of the dead letter queues.
> @iljaweis
> Right now I'm looking for any exceptions that point to one of the possible failures you suggested.
> @chibenwa
> Ok, so... I would manually "jump" the browse start past them (say to ~1 week ago).
> Don't touch the contentStart.
> Then the content cleanup will kick in and clean up the dangling data.
> @chibenwa
> > Right now I'm looking for any exceptions that point to one of the possible failures you suggested.
> Speak to me more about this outage...
> Like: massive disturbance in the force on the Cassandra cluster side?
> ```
> Flux<? extends MailQueue.MailQueueItem> deQueue() {
>     return Flux.using(receiverProvider::createReceiver,
>         receiver -> receiver.consumeManualAck(this.name.toWorkQueueName().asString(),
>             new ConsumeOptions().qos(this.prefetchCount.asInt())),
>         Receiver::close)
>         .filter(getResponse -> getResponse.getBody() != null)
>         .flatMapSequential(this::loadItem)
>         .concatMap(this::filterIfDeleted);
> }
>
> private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
>     return mailQueueView.isPresent(item.getEnqueueId())
>         .handle((isPresent, sink) -> {
>             if (isPresent) {
>                 sink.next(item);
>             } else {
>                 item.done(true);
>                 sink.complete();
>             }
>         });
> }
> ```
> That would cause the "filterIfDeleted" method to fail?
> And if this fails, it does not look like the error handling is in place to correctly nack the corresponding message. That message will thus sit unacknowledged, and once the unacknowledged count exceeds the prefetch of 20, no subsequent deliveries are made until James is restarted...
> Sounds like a known story.
> That does not explain "emails being dropped" though.
> @iljaweis
> Would it be possible to just "skip" the bad timeslice by manually adjusting the browsestart/contentstart? We actually don't care about any of those old mails any more.
> @iljaweis
> Ah, sorry, didn't see you already answered.
> @iljaweis
> The outage was a problem with our S3 backend. We got a large number of "Exception calling LocalDelivery: org.apache.james.queue.api.MailQueue$MailQueueException: Timeout enqueueing" during the exact period. I checked some of the mails from enqueuedmailsv4 not in deletedmailsv2 and they ran into exactly that.
> @iljaweis
> I will skip the nasty day by adjusting the browsestart table.
> @iljaweis
> About updateBrowseStartPace: the slice is still the default of 1 hour, but we're processing more than 1000 mails during that period.
> @chibenwa
> > I will skip the nasty day by adjusting the browsestart table.
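[Editor's note] The stall that chibenwa describes (messages neither acked nor nacked pinning the channel once they reach the prefetch limit) can be illustrated with a toy model. This is not James code and not the RabbitMQ client API; it is just a sketch of the prefetch (QoS) semantics under the assumption that a failing consumer never settles its deliveries.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of RabbitMQ manual-ack consumption under a prefetch (QoS) limit.
// Not James code: it only illustrates the failure mode where items that are
// neither acked nor nacked eventually stall the consumer entirely.
public class PrefetchStallSketch {

    // Returns how many messages get delivered before the broker stops
    // delivering, when the consumer never acks nor nacks anything.
    static int deliveredBeforeStall(int prefetch, int backlog) {
        Queue<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            ready.add(i);
        }
        int unacked = 0;
        int delivered = 0;
        // The broker delivers only while the unacknowledged count is below
        // the prefetch; a consumer that swallows errors never reduces it.
        while (!ready.isEmpty() && unacked < prefetch) {
            ready.poll();
            delivered++;
            unacked++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        // With the prefetch of 20 from the discussion, delivery halts after
        // 20 messages even though 100 are waiting, until a restart requeues
        // the unacknowledged ones.
        System.out.println(deliveredBeforeStall(20, 100)); // prints 20
    }
}
```

This is why the fix nacks items on Cassandra errors: a nack settles the delivery, keeps the unacknowledged count below the prefetch, and lets consumption continue.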
> Update to something reasonably recent or you won't be able to update the browse start automatically. 1 week to one month.
> > About updateBrowseStartPace: the slice is still the default of 1 hour, but we're processing more than 1000 mails during that period.
> That should be OK: you then just wasted resources trying to update the browse start too many times during that period. No other harmful consequences.
> @chibenwa
> WIP PR apache/james-project#1142
> @iljaweis
> I've set the values in browsestart to '2022-08-12 06:00:00.000000+0000' and let it run over night.
> @iljaweis
> Spool there is now at '2022-08-23 05:00:00' but outgoing is still at '2022-08-12 06:00:00'. No progress for contentstart.
> @iljaweis
> Can't see any obvious exception. I could try and add some glowroot instrumentation; would org.apache.james.queue.rabbitmq.view.cassandra be the correct package?
> @iljaweis
> Ah, now both browsestart and contentstart for "spool" have caught up, but no movement for "outgoing". Perhaps it's caused by https://issues.apache.org/jira/browse/JAMES-3791, because we have ongoing problems with syntactically broken recipient addresses. We'll deploy a fix for that in the next days.
> @chibenwa
> contentStart catch-up on browseStart has a 1/updatePace chance on every mail.
> - It is random.
> - Spool likely has more traffic than outgoing.
> I'd just take a coffee and relax: if it worked for your spool queue, when triggered it will work for your outgoing queue too.
> @iljaweis
> Will do that coffee-and-relax thing, and wait for our bugfix deployment anyway. Thank you for your help! :-)
> @iljaweis
> That coffee thing worked. We're up to date with contentstart.
> @chibenwa
> I'm opening a JIRA to preserve this discussion and contextualize the above proposed PR.
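[Editor's note] The "take a coffee and relax" advice follows from the 1/updatePace chance per mail that chibenwa mentions: a low-traffic queue like "outgoing" simply sees fewer dice rolls than "spool". A hedged sketch of that arithmetic, assuming independent trials and the default updateBrowseStartPace of 1000 from the discussion:

```java
// Toy model of the probabilistic contentStart catch-up: per the discussion,
// every dequeued mail has a 1/updatePace chance of triggering an update, so
// low-traffic queues lag behind high-traffic ones purely by chance.
public class CatchUpSketch {

    // Probability that no catch-up has fired after processing n mails,
    // assuming independent trials with probability 1/updatePace each.
    static double chanceOfNoUpdate(int updatePace, int mailsProcessed) {
        return Math.pow(1.0 - 1.0 / updatePace, mailsProcessed);
    }

    public static void main(String[] args) {
        // With updatePace = 1000: a queue that saw only 100 mails has a ~90%
        // chance of not having moved yet, while one that processed 5000 mails
        // has almost certainly caught up.
        System.out.println(chanceOfNoUpdate(1000, 100));
        System.out.println(chanceOfNoUpdate(1000, 5000));
    }
}
```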
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org