Hello guys!
Some people in the community may have run into the same issue that some
of us have had for a while with RabbitMQ when using the distributed-app:
James would sometimes stop consuming a queue, and we would just restart
it manually.
We encountered the case with the TaskManagerWorkQueue when running a
heavy task on it that takes a few hours, with other tasks coming in and
piling up in the queue, waiting to be consumed. We could then observe
that James would nack the messages in the queue (telling RabbitMQ it
will process them later, ideally after finishing its current task). But
after 30 minutes, we could see James stop consuming the queue, with
all items going back to the ready state.
The reason is that RabbitMQ has a timeout on consuming items, as a
safety measure. If the consumer fails to ack a message within a certain
time (30 minutes by default), then it closes the channel with a
`PRECONDITION_FAILED` channel exception:
https://www.rabbitmq.com/consumers.html#acknowledgement-timeout
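To make the detection side a bit more concrete, here is a minimal sketch in plain Java of how such a channel closure could be recognized. It is only an illustration, not what James does today: `AckTimeoutDetector` and `isAckTimeout` are hypothetical names, 406 is the AMQP 0-9-1 reply code for `PRECONDITION_FAILED`, and the matching on the reply text assumes the broker mentions "delivery acknowledgement" and "timed out" in its message, which may differ between RabbitMQ versions. With the RabbitMQ Java client, the code and text would typically come from the `AMQP.Channel.Close` reason carried by the `ShutdownSignalException`.

```java
import java.util.Locale;

// Hypothetical helper: decides whether a channel closure looks like
// RabbitMQ's delivery acknowledgement timeout.
final class AckTimeoutDetector {
    // AMQP 0-9-1 reply code for PRECONDITION_FAILED.
    static final int PRECONDITION_FAILED = 406;

    private AckTimeoutDetector() {
    }

    // replyCode and replyText are the values carried by the channel close reason.
    // The text matching is an assumption about the broker's wording.
    static boolean isAckTimeout(int replyCode, String replyText) {
        if (replyCode != PRECONDITION_FAILED || replyText == null) {
            return false;
        }
        String text = replyText.toLowerCase(Locale.ROOT);
        return text.contains("delivery acknowledgement") && text.contains("timed out");
    }
}
```

Matching on the reply code first keeps other `PRECONDITION_FAILED` causes (for example a queue redeclared with different arguments) from being mistaken for an ack timeout.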
We think that James could also sometimes fail, for some reason, to ack
a message properly and then lose its consumer on that queue, like we
had in the past with the mail queue.
Knowing that, we can take action, like reconnecting when we detect
such an issue on the channel.
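As a rough idea of what "reconnecting" could look like, here is a small sketch in plain Java of a retry loop with exponential backoff around a re-subscription action. Everything in it is hypothetical (`ResubscribePolicy`, `Subscription`, the delays) and it does not describe the code in James or in the PR mentioned below. The `subscribe()` callback stands for whatever opens a fresh channel and registers the consumer again.

```java
// Hypothetical sketch: retry a re-subscription with exponential backoff.
final class ResubscribePolicy {
    // Stands for "open a new channel and start consuming the queue again".
    interface Subscription {
        void subscribe() throws Exception;
    }

    private final int maxAttempts;
    private final long initialDelayMs;
    private final long maxDelayMs;

    ResubscribePolicy(int maxAttempts, long initialDelayMs, long maxDelayMs) {
        this.maxAttempts = maxAttempts;
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Doubles the initial delay once per attempt, capped at maxMs.
    static long backoffDelayMs(int attempt, long initialMs, long maxMs) {
        long delay = initialMs;
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMs);
    }

    // Returns true once the subscription succeeds, false if every attempt
    // failed or the thread was interrupted while waiting.
    boolean resubscribe(Subscription subscription) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                subscription.subscribe();
                return true;
            } catch (Exception e) {
                try {
                    Thread.sleep(backoffDelayMs(attempt, initialDelayMs, maxDelayMs));
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Such a policy would be triggered from the channel shutdown callback, for example `new ResubscribePolicy(5, 1000, 30000).resubscribe(...)`. The backoff is there so that a broker that is still unavailable does not get hammered with reconnection attempts.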
More details in the JIRA ticket:
https://issues.apache.org/jira/browse/JAMES-3955
Benoit seems to have taken a shot at resuming consumption on queues
that lose their consumer as well, if some people want to check it out:
https://github.com/apache/james-project/pull/1778
If there are some RabbitMQ experts in the community who have
better ideas or other suggestions, don't hesitate!
Thanks and cheers guys,
Rene.