René Cordier created JAMES-3955:
-----------------------------------

             Summary: James stops consuming sometimes RabbitMQ queue
                 Key: JAMES-3955
                 URL: https://issues.apache.org/jira/browse/JAMES-3955
             Project: James Server
          Issue Type: Improvement
          Components: rabbitmq
            Reporter: René Cordier


We have sometimes had trouble with RabbitMQ in production environments, where 
James would stop consuming some queues (like the mail queue). We never really 
understood why, and would just restart James when it happened.

Recently I had a similar issue, but with the TaskManagerWorkQueue, and this 
time we managed to reproduce the problem manually. We have a task that runs at 
night and can take a long time to complete. With some other tasks scheduled 
as well, we could observe the following pattern:

While the heavy task is being executed by James, the others pile up in the 
TaskManagerWorkQueue. They are left unacked by James, meaning it has told 
RabbitMQ that it will process them later (as James executes one task at a 
time). However, 30 minutes after the first item in the queue became unacked, 
we could see James stop consuming the queue, and all items go back to the 
ready state.

After looking into the RabbitMQ configuration, I found this: 
[https://www.rabbitmq.com/consumers.html#acknowledgement-timeout]

RabbitMQ closes the channel with a `PRECONDITION_FAILED` channel exception 
when it detects that a delivered item (here the first one left unacked) has 
not been acknowledged within 30 minutes, the default timeout. This matches 
what we observed.
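
To make the rule concrete, here is a minimal sketch of the check as I 
understand it from the documentation. The class and method names are 
hypothetical and the timestamp is made up; this is not James or RabbitMQ 
code, just a model of the fact that the clock starts at delivery and not when 
James starts working on the item.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the delivery acknowledgement timeout, for illustration only.
final class AckTimeoutModel {
    /** RabbitMQ's default delivery acknowledgement timeout (30 minutes). */
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(30);

    /** True once a delivery has stayed unacknowledged for longer than the timeout. */
    static boolean isOverdue(Instant deliveredAt, Instant now, Duration timeout) {
        return Duration.between(deliveredAt, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        // Made-up delivery time of the oldest unacked item.
        Instant deliveredAt = Instant.parse("2023-01-01T00:00:00Z");
        // 29 minutes after delivery: still inside the window.
        System.out.println(isOverdue(deliveredAt, deliveredAt.plus(Duration.ofMinutes(29)), DEFAULT_TIMEOUT));
        // 31 minutes after delivery: past the window, the channel is at risk of being closed.
        System.out.println(isOverdue(deliveredAt, deliveredAt.plus(Duration.ofMinutes(31)), DEFAULT_TIMEOUT));
    }
}
```

With one task at a time, an item delivered early but queued behind a long 
task can cross that limit without James ever having started to process it.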

From this I guess we can deduce that when we had a similar issue with the 
mail queue, James may have failed to consume a message properly, or failed to 
acknowledge it for some reason, and got the channel closed by RabbitMQ.

From there, there are some actions we can take to prevent this:
 * adding error logs when the channel gets closed on such an exception
 * trying to reopen the channel when such an exception occurs, at least on 
important queues like the task manager queue, the mail queue and the event bus
 * potentially auditing as well whether in some cases we do not ack/nack the 
message back
 * making it possible to increase the consumer timeout of the above queues 
with the `x-consumer-timeout` queue argument (this would require running 
RabbitMQ 3.12 at least)
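
Two of the actions above can be sketched without touching the client wiring. 
This is only a rough sketch under assumptions: the class and method names are 
hypothetical, and matching on the reply text is a guess, since the exact 
wording depends on the RabbitMQ version. With the RabbitMQ Java client, the 
reply code and text would come from the close reason of the 
ShutdownSignalException, and the map would be passed as the arguments 
parameter of queueDeclare.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helpers, not existing James code.
final class ConsumerTimeoutSupport {
    /** AMQP 0-9-1 reply code for PRECONDITION_FAILED. */
    static final int PRECONDITION_FAILED = 406;

    /**
     * Heuristic to decide whether a channel close deserves an error log as an
     * acknowledgement timeout. The loose match on the reply text is an assumption.
     */
    static boolean looksLikeAckTimeout(int replyCode, String replyText) {
        return replyCode == PRECONDITION_FAILED
            && replyText != null
            && replyText.toLowerCase(Locale.ROOT).contains("timed out");
    }

    /** Queue arguments carrying x-consumer-timeout, expressed in milliseconds. */
    static Map<String, Object> queueArguments(Duration consumerTimeout) {
        Map<String, Object> arguments = new HashMap<>();
        arguments.put("x-consumer-timeout", consumerTimeout.toMillis());
        return arguments;
    }
}
```

Keep in mind that RabbitMQ rejects the redeclaration of an existing queue with 
different arguments, so queues that already exist would need a migration step 
before this argument could be applied to them.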

For now we can also increase that timeout in rabbitmq.conf to mitigate the 
problem.
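
As an illustration, the broker-wide setting is `consumer_timeout`, expressed 
in milliseconds. The one hour value below is only an example, not a 
recommendation; it would need to be longer than our longest task.

```ini
# rabbitmq.conf
# Delivery acknowledgement timeout in milliseconds (default: 1800000, i.e. 30 minutes).
# One hour here is an arbitrary example value.
consumer_timeout = 3600000
```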



