Benoit Tellier created JAMES-3810:
-------------------------------------

             Summary: Performance issues with RabbitMQMailQueue browsing
                 Key: JAMES-3810
                 URL: https://issues.apache.org/jira/browse/JAMES-3810
             Project: James Server
          Issue Type: Improvement
          Components: Queue, rabbitmq
    Affects Versions: 3.7.0
            Reporter: Benoit Tellier
             Fix For: 3.8.0


h3. What

Upon incidents, the browse start update process can stall if some old mails are 
stuck in the mail queue.

Given enough time, browsing and browse start updates will fail because too much 
data is loaded at once.

Periodic browse start updates would then pile up until they create performance 
issues.

h3. The fix

 - Allow better resilience when browsing large amounts of data.
 - Handle Cassandra exceptions correctly upon dequeue (nack the affected items); see the sketch after this list.
 - Finally, provide a health check that WARNs when the browse start is out of date.
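
A minimal sketch of the nack-on-error handling mentioned in the second item above, extending the filterIfDeleted snippet quoted in the Gitter discussion below. The onErrorResume branch and the assumption that done(false) rejects (nacks) the underlying RabbitMQ delivery are illustrative only, not the actual change of the proposed PR.

{code:java}
// Sketch only: a failing Cassandra lookup should nack the delivery instead of
// leaving it unacknowledged until the prefetch window is exhausted.
private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
    return mailQueueView.isPresent(item.getEnqueueId())
        .<RabbitMQMailQueueItem>handle((isPresent, sink) -> {
            if (isPresent) {
                sink.next(item);
            } else {
                item.done(true);
                sink.complete();
            }
        })
        .onErrorResume(e -> {
            // Assumption: done(false) rejects the delivery so RabbitMQ can redeliver
            // or dead-letter it rather than keep it pending against the prefetch.
            item.done(false);
            return Mono.error(e);
        });
}
{code}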

h3. What to do when the health check reports an out-of-date browse start

 - Identify the items still present in the queue (e.g. by browsing it; see the sketch after this list)
 - Remove them from the queue
 - Let everything catch up
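
As a rough illustration of these steps, the webadmin mail queue routes can be used to browse the queue and then delete the stuck items. The webadmin address, queue name, endpoint paths and the sender filter below are assumptions to check against the webadmin documentation of the deployed James version.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AuditStuckQueueItems {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String webAdmin = "http://127.0.0.1:8000"; // assumed webadmin address

        // 1. Browse the head of the queue to identify the items still present
        HttpResponse<String> browse = client.send(
            HttpRequest.newBuilder(URI.create(webAdmin + "/mailQueues/outgoing/mails?limit=50"))
                .GET()
                .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(browse.body());

        // 2. Remove the offending mails, here filtered by sender (assumed filter)
        client.send(
            HttpRequest.newBuilder(URI.create(webAdmin + "/mailQueues/outgoing/mails?sender=stuck@example.com"))
                .DELETE()
                .build(),
            HttpResponse.BodyHandlers.ofString());

        // 3. Then let the browse start and content start catch up on their own
    }
}
{code}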

Alternatively, if old content can be discarded, an aggressive manual update of 
the browse start can be an option, as sketched below.
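
A minimal sketch of such a manual update, mirroring what is done later in the quoted discussion: jump the browse start past the stuck slices, and do not touch the content start (it catches up on its own). The keyspace, table and column names are assumptions inferred from the discussion and must be checked against the actual Cassandra schema.

{code:java}
import java.time.Instant;

import com.datastax.oss.driver.api.core.CqlSession;

public class AdvanceBrowseStart {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .withKeyspace("apache_james") // assumed keyspace name
                .build()) {
            // Set the browse start to a reasonably recent point in time
            // (roughly 1 week to 1 month in the past), skipping the old, stuck slices.
            session.execute(
                "UPDATE browseStart SET browseStart = ? WHERE queueName = ?",
                Instant.parse("2022-08-12T06:00:00Z"),
                "outgoing");
        }
    }
}
{code}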

h3. Alternative

Disabling the Cassandra part of the RabbitMQ mail queue is OK for an MDA, and the 
RabbitMQ mailQueue is unfit for an MX anyway.

Thus it would seem legitimate to simply disable it.

h3. Interesting Gitter discussion

I leave it here as this is a very interesting discussion, explaining causes, 
effects, solutions, quick fixes, etc.

{code:java}
@iljaweis
Hello, I'm currently investigating some performance problems around the Mail 
Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4. From reading 
https://github.com/apache/james-project/blob/master/src/adr/0031-distributed-mail-queue.md
 and 
https://github.com/apache/james-project/blob/master/src/adr/0032-distributed-mail-queue-cleanup.md
 it is my understanding that the content of the tables "contentstart" and 
"browsestart" indicate the oldest dates where we should start processing 
entries from enqueuedmailsv4. However, the content of both contentstart and 
browsestart for all queues is stuck at "2022-07-12 14:00:00" and does not 
change. The result is that we're querying lots of "old" data from 
enqueuedmailsv4, but our queues (judging from RabbitMQ) appear to be basically 
empty all the time. The value of updateBrowseStartPace is 1000, the default. 
We're using distributed-pop3-app. Is this behaviour expected, or if not, is 
there something I could look at? Thanks.


@iljaweis
Hmm. Looks like James uses LIMIT 5000, but we're already at 17435 rows for that 
timerangestart. Looks like we need to increase bucketCount, but is there a way 
to deal with these old rows?

@chibenwa
Hello.
Emails between contentStart and browseStart are eligible for cleanup (meaning 
deletion from the objectStore, then from the Cassandra mailQueue view). BrowseStart 
matches (modulo the duration of slices) the oldest mail still in the queue. 
ContentStart matches the first email whose content is not deleted. So 
contentStart <= browseStart.
> Looks like James uses LIMIT 5000
I bet you are looking at logs on the Cassandra server, most likely under massive 
query rates. That limit does not correspond to any code on the James side (cf. 
EnqueuedMailsDAO), but it is nothing to worry about as this is the default page 
size (the driver requests large reads in pages of 5000).
> I'm currently investigating some performance problems around the Mail Queue 
> in RabbitMQ and Cassandra, especially enqueuedmailsv4. 
This is rather insufficient as a description for me to understand what happens, 
but (knowing the design of the underlying solution) let me take a bet:
You are having performance issues because massive load queries are issued by the 
RabbitMQ mail queue. This massive load comes from a full read of the mail queue 
content every time a browse starts from the browse start (everything from 12/07, 
so slices * buckets, returning loads of data).
The root cause is the browseStart being "stuck" on 12/07. This could be 
because an email from 12/07 had not been properly processed.
Things I would check include:
 - Dead letter queues in the RabbitMQ management API. If they are not empty, 
audit them, and consider re-routing their content to resume delivery.
(Deserialization failures, or a missing object in S3, can be the root cause; 
messages that will always fail to be dequeued are discarded into a dead letter 
queue to prevent an infinite loop while not losing data.)
 - Check the content of the mail queue by browsing it. If this is not doable, 
audit the content of the oldest slice in enqueuedMailV4 and check it against 
deletedMailsV2. We are looking for content within enqueuedMailV4 that is not 
marked as deleted.
> The value of updateBrowseStartPace is 1000
During the audit phase, until you know why "browseStart" is stuck, I recommend 
increasing it to an unreasonably high value to effectively disable browseStart 
updates. This will prevent the high query rates that are hurting you.
You could also consider a higher value based on your traffic: are you delivering 
more than 1000 emails per slicing period?
> Looks like we need to increase bucketCount
It won't help, unless you scale out your Cassandra cluster and want to spread the 
write load of the mail queue.
Here the way to go is to audit why browse start updates are stuck.
> Is this behaviour expected, or if not, is there something I could look at? 
> Thanks.
As explained before, that is a "not surprising" edge case. I'm interested in 
confirming the diagnosis, and also in finding the root cause.
Regards
Benoit

@chibenwa
Had a look. By default the browse start update does 256 slice reads, which is 
likely too much. We could decrease this number to make it less aggressive.

@chibenwa
Also we could think of a healthcheck warning when the browseStart is older than 
say 1 week.

@iljaweis
Hello, yes, you guessed exactly right what the situation is. Sorry for being 
slightly unclear. Browsing the queue (using the web admin endpoint) timed out, 
so I checked for enqueueids from enqueuedmailsv4 that are not in deletedmailsv2 
and found a lot of them. That day we actually had a problem with a backend, and 
a lot of mails suffered exceptions during queue handling. The mails were moved 
to /var/mail/error, but a bounce was also created. The mails are still sitting 
in enqueuedmailsv4 with state=error.

@iljaweis
There isn't anything in any of the dead letter queues

@iljaweis
Right now I'm looking for any exceptions that point to one of the possible 
failures you suggested.

@chibenwa
Ok, so... I would manually "jump" the browse start past them (say to ~1 week 
ago).
Don't touch the contentStart.
Then the content cleanup will kick in and clean up the dangling data.

@chibenwa
> Right now I'm looking for any exceptions that point to one of the possible 
> failures you suggested.
Tell me more about this outage...
Like: a massive disturbance in the force on the Cassandra cluster side?
```
    Flux<? extends MailQueue.MailQueueItem> deQueue() {
        return Flux.using(receiverProvider::createReceiver,
                receiver -> receiver.consumeManualAck(this.name.toWorkQueueName().asString(),
                    new ConsumeOptions().qos(this.prefetchCount.asInt())),
                Receiver::close)
            .filter(getResponse -> getResponse.getBody() != null)
            .flatMapSequential(this::loadItem)
            .concatMap(this::filterIfDeleted);
    }

    private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
        return mailQueueView.isPresent(item.getEnqueueId())
            .handle((isPresent, sink) -> {
                if (isPresent) {
                    sink.next(item);
                } else {
                    item.done(true);
                    sink.complete();
                }
            });
    }
```
That would cause the "filterIfDeleted" method to fail?
And if this fails, it does not look like the error handling is in place to nack 
the corresponding message correctly. That message will thus sit unacknowledged, 
and once the unacknowledged count exceeds the prefetch of 20, no subsequent 
deliveries are made until James is restarted...
Sounds like a known story.
That does not explain "emails being dropped" though.

@iljaweis
Would it be possible to just "skip" the bad timeslice by manually adjusting the 
browsestart/contentstart? We actually don't care about any of those old mails 
any more.

@iljaweis
Ah, sorry, didn't see you already answered.

@iljaweis
The outage was a problem with our S3 backend. We got a large number of 
"Exception calling LocalDelivery: 
org.apache.james.queue.api.MailQueue$MailQueueException: Timeout enqueueing" 
errors during the exact period. I checked some of the mails from enqueuedmailsv4 
not in deletedmailsv2 and they ran into exactly that.

@iljaweis
I will skip the nasty day by adjusting the browsestart table.

@iljaweis
About updateBrowseStartPace: Slice is still the default of 1 hour, but we're 
processing more than 1000 mails during that period.

@chibenwa
> I will skip the nasty day by adjusting the browsestart table.
Update it to something reasonably recent (1 week to one month old) or you won't 
be able to update the browse start automatically.
> About updateBrowseStartPace: Slice is still the default of 1 hour, but we're 
> processing more than 1000 mails during that period.
That should be OK: you then just wasted resources trying to update the browse 
start too many times during that period. No other harmful consequences.

@chibenwa
WIP PR apache/james-project#1142

@iljaweis
I've set the values in browsestart to '2022-08-12 06:00:00.000000+0000' and let 
it run overnight.

@iljaweis
Spool there is now at '2022-08-23 05:00:00' but outgoing is still at '2022-08-12 
06:00:00'. No progress for contentstart.

@iljaweis
Can't see any obvious exception. I could try and add some glowroot 
instrumentation, would org.apache.james.queue.rabbitmq.view.cassandra be the 
correct package?

@iljaweis
Ah, now both browsestart and contentstart for "spool" have caught up, but no 
movement for "outgoing". Perhaps it's caused by 
https://issues.apache.org/jira/browse/JAMES-3791, because we have ongoing 
problems with syntactically broken recipient addresses. We'll deploy a fix for 
that in the coming days.

@chibenwa
contentStart catches up on browseStart with a 1/updatePace chance on every mail.
 - It is random.
 - Spool likely has more traffic than outgoing.
I'd just take a coffee and relax: if it worked for your spool, it will work for 
your outgoing queue too once triggered.

@iljaweis
Will do that coffee and relax thing, and wait for our bugfix deployment anyway. 
Thank you for your help! :-)

@iljaweis
That coffee thing worked. We're up to date with contentstart.

@chibenwa
I'm opening a JIRA to preserve this discussion and contextualize the above 
proposed PR.

{code}
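
Following the healthcheck idea discussed above, here is a rough sketch of what such a check could look like. It assumes the org.apache.james.core.healthcheck contract (a componentName() plus a reactive check()) and uses a placeholder reader function instead of the real browse start DAO; it is not the implementation of the proposed PR.

{code:java}
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

import org.apache.james.core.healthcheck.ComponentName;
import org.apache.james.core.healthcheck.HealthCheck;
import org.apache.james.core.healthcheck.Result;

import reactor.core.publisher.Mono;

public class BrowseStartHealthCheck implements HealthCheck {
    private static final ComponentName COMPONENT_NAME = new ComponentName("RabbitMQMailQueueBrowseStart");
    private static final Duration GRACE_PERIOD = Duration.ofDays(7); // WARN when older than ~1 week

    private final Supplier<Mono<Instant>> browseStartReader; // placeholder for a read of the browseStart table
    private final Clock clock;

    public BrowseStartHealthCheck(Supplier<Mono<Instant>> browseStartReader, Clock clock) {
        this.browseStartReader = browseStartReader;
        this.clock = clock;
    }

    @Override
    public ComponentName componentName() {
        return COMPONENT_NAME;
    }

    @Override
    public Mono<Result> check() {
        Instant threshold = clock.instant().minus(GRACE_PERIOD);
        return browseStartReader.get()
            .map(browseStart -> browseStart.isBefore(threshold)
                ? Result.degraded(COMPONENT_NAME, "Browse start is older than " + GRACE_PERIOD)
                : Result.healthy(COMPONENT_NAME));
    }
}
{code}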



