[ https://issues.apache.org/jira/browse/JAMES-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier closed JAMES-3810.
---------------------------------
    Resolution: Fixed

> Performance issues with RabbitMQMailQueue browsing
> --------------------------------------------------
>
>                 Key: JAMES-3810
>                 URL: https://issues.apache.org/jira/browse/JAMES-3810
>             Project: James Server
>          Issue Type: Improvement
>          Components: Queue, rabbitmq
>    Affects Versions: 3.7.0
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. What
> Upon incidents, the update of the browse start can be suspended if some old mails are stuck in the mail queue.
> Given enough time, browsing / browse start updates will fail (too much data loaded at once).
> Periodic browse start updates would then pile up until they create performance issues.
> h3. The fix
> - Allow better resilience when browsing large amounts of data.
> - Handle Cassandra exceptions correctly upon dequeue (nack the items).
> - Finally, provide a health check that WARNs when the browse start is out of date.
> h3. What to do when the browse start is out of date
> - Identify the items still present in the queue.
> - Remove them from the queue.
> - Let everything catch up.
> Alternatively, if old content can be discarded, an aggressive manual update of the browse start is an option.
> h3. Alternative
> Disabling the Cassandra part of the RabbitMQ mail queue is OK for a MDA, and the RabbitMQ mailQueue is unfit for a MX.
> Thus it would sound legitimate to just disable it.
> h3. Interesting Gitter discussion
> I leave it here as this is a very interesting discussion, explaining causes, effects, solutions, quick fixes, etc...
> {code:java}
> @iljaweis
> Hello, I'm currently investigating some performance problems around the Mail Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4.
> From reading https://github.com/apache/james-project/blob/master/src/adr/0031-distributed-mail-queue.md and https://github.com/apache/james-project/blob/master/src/adr/0032-distributed-mail-queue-cleanup.md it is my understanding that the content of the tables "contentstart" and "browsestart" indicates the oldest dates where we should start processing entries from enqueuedmailsv4.
> However, the content of both contentstart and browsestart for all queues is stuck at "2022-07-12 14:00:00" and does not change. The result is that we're querying lots of "old" data from enqueuedmailsv4, but our queues (judging from RabbitMQ) appear to be basically empty all the time. The value of updateBrowseStartPace is 1000, the default. We're using distributed-pop3-app. Is this behaviour expected, or if not, is there something I could look at? Thanks.
> @iljaweis
> Hmm. Looks like James uses LIMIT 5000, but we're already at 17435 rows for that timerangestart. Looks like we need to increase bucketCount, but is there a way to deal with these old rows?
> @chibenwa
> Hello.
> Emails between contentStart and browseStart are eligible for cleanup (meaning deletion from the objectStore, then from the Cassandra mailQueue view).
> BrowseStart matches (modulo the duration of a slice) the oldest mail still in the queue. ContentStart matches the first email whose content is not deleted. So contentStart <= browseStart.
> > Looks like James uses LIMIT 5000
> I bet you are looking at logs on the Cassandra server, most likely upon massive query rates. That limit does not correspond to any code on the James side (cf. EnqueuedMailsDAO); nothing to worry about, as this is the default page size (the driver requests large reads in pages of 5000).
> > I'm currently investigating some performance problems around the Mail Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4.
> This is rather insufficient as a description for me to understand what happens, but (knowing the design of the underlying solution) let me take a bet:
> You are having performance issues as massive load queries are done by the RabbitMQ mail queue. This massive load is due to a full content read of the mail queue content every time the browse start is read (it reads everything from 12/07, so slice * bucket, and returns loads of data).
> The root cause is the browseStart being "stuck" on the 12/07. This could be because an email from the 12/07 had not been properly processed.
> Things I would check include:
> - Dead letter queues on the RabbitMQ management API. If they are not empty, audit them, and consider re-routing their content to resume delivery. (Deserialization failures, or a missing object in S3, can be the root cause; messages that will always fail to get dequeued are discarded into a dead letter queue to prevent an infinite loop while not losing data.)
> - The content of the mail queue, by browsing it. If this is not doable, audit the content of the oldest slice in enqueuedMailV4 and check it against deletedMailsV2. We are looking for content within enqueuedMailV4 that is not marked as deleted.
> > The value of updateBrowseStartPace is 1000
> During the audit phase, until you know why "browseStart" is stuck, I recommend increasing it to an unreasonably high value to effectively disable browseStart updates. This will prevent the high query rates that are hurting you from happening.
> You could also consider a higher value based on your traffic. Are you delivering more than 1000 emails per slicing period?
> > Looks like we need to increase bucketCount
> Won't help. Unless you scale out your Cassandra cluster and want to spread the write load of the mail queue.
> Here the way to go is to audit why browse start updates are stuck.
> > Is this behaviour expected, or if not, is there something I could look at? Thanks.
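[Editor's note] The slice * bucket layout referred to above can be sketched with a toy model. This is illustrative only, not James's actual `CassandraMailQueueView` implementation: the class, method names, and the hashing scheme are assumptions; only the default 1-hour slice comes from the discussion.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Toy model of the distributed mail queue's time-slice/bucket layout.
// Illustrative only: real James code may differ.
public class SliceSketch {

    // An enqueued mail is stored under (sliceStart, bucket): the slice is
    // derived from the enqueue time (default slice duration: 1 hour).
    static Instant sliceStart(Instant enqueueTime) {
        return enqueueTime.truncatedTo(ChronoUnit.HOURS);
    }

    // The bucket spreads write load across partitions (hypothetical hashing).
    static int bucket(String enqueueId, int bucketCount) {
        return Math.floorMod(enqueueId.hashCode(), bucketCount);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2022-07-12T14:35:00Z");
        // Browsing scans every (slice, bucket) pair from browseStart onward,
        // so a stuck browseStart means slice * bucket reads of old data;
        // raising bucketCount only spreads writes, it does not move browseStart.
        System.out.println(sliceStart(t)); // prints 2022-07-12T14:00:00Z
        System.out.println(bucket("mail-1", 4));
    }
}
```

This also makes chibenwa's point concrete: a larger bucketCount multiplies the number of partitions scanned per slice, which helps write distribution but does nothing for a browse start stuck weeks in the past.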
> As explained before, that is a "not surprising" edge case. I'm interested to confirm the diagnostic, and also to find the root cause.
> Regards
> Benoit
> @chibenwa
> Had a look. By default the browse start update does 256 slice reads, which is likely too much. We could likely decrease this number for it to be less aggressive.
> @chibenwa
> Also we could think of a healthcheck warning when the browseStart is older than, say, 1 week.
> @iljaweis
> Hello, yes, you guessed exactly right what the situation is. Sorry for being slightly unclear. Browsing the queue (using the web admin endpoint) timed out, so I checked for enqueueIds from enqueuedmailsv4 that are not in deletedmailsv2 and found a lot of them. That day we actually had a problem with a backend, and a lot of mails suffered exceptions during queue handling. The mails were moved to /var/mail/error, but also a bounce was created. The mails are still sitting in enqueuedmailsv4 with state=error.
> @iljaweis
> There isn't anything in any of the dead letter queues.
> @iljaweis
> Right now I'm looking for any exceptions that point to one of the possible failures you suggested.
> @chibenwa
> Ok, so... I would manually "jump" the browse start past them (say to ~1 week ago).
> Don't touch the contentStart.
> Then the content cleanup will kick in and clean up the dangling data.
> @chibenwa
> > Right now I'm looking for any exceptions that point to one of the possible failures you suggested.
> Speak to me more about this outage...
> Like: massive disturbance in the force on the Cassandra cluster side?
> ```
> Flux<? extends MailQueue.MailQueueItem> deQueue() {
>     return Flux.using(receiverProvider::createReceiver,
>         receiver -> receiver.consumeManualAck(this.name.toWorkQueueName().asString(),
>             new ConsumeOptions().qos(this.prefetchCount.asInt())),
>         Receiver::close)
>         .filter(getResponse -> getResponse.getBody() != null)
>         .flatMapSequential(this::loadItem)
>         .concatMap(this::filterIfDeleted);
> }
>
> private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
>     return mailQueueView.isPresent(item.getEnqueueId())
>         .handle((isPresent, sink) -> {
>             if (isPresent) {
>                 sink.next(item);
>             } else {
>                 item.done(true);
>                 sink.complete();
>             }
>         });
> }
> ```
> That would cause the "filterIfDeleted" method to fail?
> And if this fails, it does not look like the error handling is in place to correctly nack the corresponding message. That message will thus sit unacknowledged, and once the unacknowledged count exceeds the prefetch of 20, no subsequent deliveries are made until James is restarted...
> Sounds like a known story.
> That does not explain "emails being dropped" though.
> @iljaweis
> Would it be possible to just "skip" the bad timeslice by manually adjusting the browsestart/contentstart? We actually don't care about any of those old mails any more.
> @iljaweis
> Ah, sorry, didn't see you already answered.
> @iljaweis
> The outage was a problem with our S3 backend. We got a large number of "Exception calling LocalDelivery: org.apache.james.queue.api.MailQueue$MailQueueException: Timeout enqueueing" during the exact period. I checked some of the mails from enqueuedmailsv4 not in deletedmailsv2 and they ran into exactly that.
> @iljaweis
> I will skip the nasty day by adjusting the browsestart table.
> @iljaweis
> About updateBrowseStartPace: the slice is still the default of 1 hour, but we're processing more than 1000 mails during that period.
> @chibenwa
> > I will skip the nasty day by adjusting the browsestart table.
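[Editor's note] The stall that chibenwa describes (messages neither acked nor nacked pinning the channel once they reach the prefetch limit) can be illustrated with a toy model. This is not James code and not the RabbitMQ client API; it is just a sketch of the prefetch (QoS) semantics under the assumption that a failing consumer never settles its deliveries.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of RabbitMQ manual-ack consumption under a prefetch (QoS) limit.
// Not James code: it only illustrates the failure mode where items that are
// neither acked nor nacked eventually stall the consumer entirely.
public class PrefetchStallSketch {

    // Returns how many messages get delivered before the broker stops
    // delivering, when the consumer never acks nor nacks anything.
    static int deliveredBeforeStall(int prefetch, int backlog) {
        Queue<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < backlog; i++) {
            ready.add(i);
        }
        int unacked = 0;
        int delivered = 0;
        // The broker delivers only while the unacknowledged count is below
        // the prefetch; a consumer that swallows errors never reduces it.
        while (!ready.isEmpty() && unacked < prefetch) {
            ready.poll();
            delivered++;
            unacked++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        // With the prefetch of 20 from the discussion, delivery halts after
        // 20 messages even though 100 are waiting, until a restart requeues
        // the unacknowledged ones.
        System.out.println(deliveredBeforeStall(20, 100)); // prints 20
    }
}
```

This is why the fix nacks items on Cassandra errors: a nack settles the delivery, keeps the unacknowledged count below the prefetch, and lets consumption continue.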
> Update to something reasonably recent or you won't be able to update the browse start automatically. 1 week to one month.
> > About updateBrowseStartPace: the slice is still the default of 1 hour, but we're processing more than 1000 mails during that period.
> That should be OK: you then just wasted resources trying to update the browse start too many times during that period. No other harmful consequences.
> @chibenwa
> WIP PR apache/james-project#1142
> @iljaweis
> I've set the values in browsestart to '2022-08-12 06:00:00.000000+0000' and let it run over night.
> @iljaweis
> Spool there is now at '2022-08-23 05:00:00' but outgoing is still at '2022-08-12 06:00:00'. No progress for contentstart.
> @iljaweis
> Can't see any obvious exception. I could try and add some glowroot instrumentation; would org.apache.james.queue.rabbitmq.view.cassandra be the correct package?
> @iljaweis
> Ah, now both browsestart and contentstart for "spool" have caught up, but no movement for "outgoing". Perhaps it's caused by https://issues.apache.org/jira/browse/JAMES-3791, because we have ongoing problems with syntactically broken recipient addresses. We'll deploy a fix for that in the next days.
> @chibenwa
> contentStart catch-up on browseStart has a 1/updatePace chance on every mail.
> - It is random.
> - Spool likely has more traffic than outgoing.
> I'd just take a coffee and relax: if it worked for your spool queue, when triggered it will work for your outgoing queue too.
> @iljaweis
> Will do that coffee-and-relax thing, and wait for our bugfix deployment anyway. Thank you for your help! :-)
> @iljaweis
> That coffee thing worked. We're up to date with contentstart.
> @chibenwa
> I'm opening a JIRA to preserve this discussion and contextualize the above proposed PR.
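[Editor's note] The "take a coffee and relax" advice follows from the 1/updatePace chance per mail that chibenwa mentions: a low-traffic queue like "outgoing" simply sees fewer dice rolls than "spool". A hedged sketch of that arithmetic, assuming independent trials and the default updateBrowseStartPace of 1000 from the discussion:

```java
// Toy model of the probabilistic contentStart catch-up: per the discussion,
// every dequeued mail has a 1/updatePace chance of triggering an update, so
// low-traffic queues lag behind high-traffic ones purely by chance.
public class CatchUpSketch {

    // Probability that no catch-up has fired after processing n mails,
    // assuming independent trials with probability 1/updatePace each.
    static double chanceOfNoUpdate(int updatePace, int mailsProcessed) {
        return Math.pow(1.0 - 1.0 / updatePace, mailsProcessed);
    }

    public static void main(String[] args) {
        // With updatePace = 1000: a queue that saw only 100 mails has a ~90%
        // chance of not having moved yet, while one that processed 5000 mails
        // has almost certainly caught up.
        System.out.println(chanceOfNoUpdate(1000, 100));
        System.out.println(chanceOfNoUpdate(1000, 5000));
    }
}
```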
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org