Hello guys!

Some people in the community may have hit the same issue that some of us have had for a while with RabbitMQ when using the distributed-app: James would sometimes stop consuming a queue, and we would just restart James manually.

We encountered this with the TaskManagerWorkQueue when running a heavy task on it that takes a few hours, while other tasks kept coming in and piling up in the queue, waiting to be consumed. We could then observe that James would nack the messages in the queue (telling RabbitMQ it will process them later, ideally after finishing its current task). But after 30 minutes, we could see James stop consuming the queue and all items go back to the ready state.

The reason is that RabbitMQ enforces a timeout on delivery acknowledgement, as a safety measure. If the consumer fails to ack a message within a certain time (30 minutes by default), RabbitMQ closes the channel with a `PRECONDITION_FAILED` channel exception: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

From there, we think that James could sometimes also fail, for some reason, to ack a message properly, and then lose its consumer on that queue, like we had in the past with the mail queue.

Knowing this, we can take action, such as reconnecting and consuming again when we detect such an issue on the channel.
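
To make the idea concrete, here is a minimal sketch of the detection part in plain Java. It is not the code from James or from any pull request, and it does not use the RabbitMQ client: the `Resubscriber` hook, the method names and the queue name are all hypothetical. The only facts it relies on are that AMQP 0-9-1 uses reply code 406 for `PRECONDITION_FAILED`, and the assumption that the broker's close message for an ack timeout contains the words "timed out" (the exact wording may differ between RabbitMQ versions).

```java
final class ResubscribeSketch {
    /** AMQP 0-9-1 reply code for PRECONDITION_FAILED. */
    static final int PRECONDITION_FAILED = 406;

    /** Hypothetical hook: opens a fresh channel and starts consuming the queue again. */
    interface Resubscriber {
        void resubscribe(String queueName);
    }

    /**
     * Returns true when a channel close looks like an acknowledgement timeout.
     * Matching on "timed out" is an assumption about the broker's message text.
     */
    static boolean isAckTimeout(int replyCode, String replyText) {
        return replyCode == PRECONDITION_FAILED
            && replyText != null
            && replyText.contains("timed out");
    }

    /**
     * To be called when a consuming channel is closed.
     * Returns true if a re-subscription was triggered, false otherwise.
     */
    static boolean onChannelClosed(String queueName, int replyCode, String replyText,
                                   Resubscriber resubscriber) {
        if (isAckTimeout(replyCode, replyText)) {
            resubscriber.resubscribe(queueName);
            return true;
        }
        // Other close reasons are left to the existing error handling.
        return false;
    }
}
```

In a real integration, the reply code and text would come from whatever shutdown notification the client library gives us for the channel, and `resubscribe` would re-create the channel and the consumer. Whether to also re-subscribe on other close reasons is a separate design choice.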

More details in the JIRA ticket: https://issues.apache.org/jira/browse/JAMES-3955

Benoit seems to have taken a shot at resuming consumption on queues that lose their consumer as well, if some people want to check it out: https://github.com/apache/james-project/pull/1778

If there are RabbitMQ experts in the community who have better ideas or other suggestions, don't hesitate!

Thanks and cheers guys,

Rene.
