Tran Hong Quan created JAMES-4027:
-------------------------------------

             Summary: Make all queues on Rabbitmq quorum queue when option enabled
                 Key: JAMES-4027
                 URL: https://issues.apache.org/jira/browse/JAMES-4027
             Project: James Server
          Issue Type: Bug
          Components: eventbus, Queue, rabbitmq
            Reporter: Tran Hong Quan


Today, when the quorum option is enabled, only some queues are declared as 
quorum queues. Others are not (e.g. the event bus notification queues and the 
Task Manager's termination queues).

On a James deployment that uses quorum queues with a 3-node RabbitMQ cluster, 
James does not tolerate the outage of a single RabbitMQ node.

I tried to reproduce what happens and here is my theory: 

The RabbitMQ node that stores the notification queues goes down.
-> James can not publish messages to RabbitMQ, which causes e.g. IMAP SELECT, 
STORE, APPEND, UNSELECT ... commands to fail.
-> James keeps retrying the failed publishes (the retry for Group registration 
seems to rely on a classic queue too) and queues other IMAP requests.
-> The IMAP server queue fills up and the exception 
`The IMAP server has reached its maximum capacity` is thrown.
-> James IMAP becomes a zombie, which leads to cascading failures.
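
The chain above can be illustrated with a small, purely hypothetical simulation. 
It is not James code: `BoundedRequestQueue`, `publish`, `drain` and `simulate` 
are invented names, and the capacity and retry counts are arbitrary. It only 
shows how a request that never manages to publish its event keeps its slot, so 
that later requests are eventually rejected.

```python
from collections import deque


class BoundedRequestQueue:
    """Hypothetical stand-in for the IMAP server's bounded request queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def submit(self, request):
        # Reject new work once every slot is taken.
        if len(self.pending) >= self.capacity:
            raise RuntimeError("The IMAP server has reached its maximum capacity")
        self.pending.append(request)


def publish(broker_up):
    """Hypothetical publish to the notification queue; fails while its node is down."""
    if not broker_up:
        raise ConnectionError("notification queue unavailable")


def drain(queue, broker_up, max_retries=3):
    """A request leaves the queue only once its event has been published."""
    completed = 0
    while queue.pending:
        for _ in range(max_retries):
            try:
                publish(broker_up)
            except ConnectionError:
                continue  # retry the same request
            queue.pending.popleft()
            completed += 1
            break
        else:
            # Retries exhausted: the head request keeps its slot and blocks the rest.
            break
    return completed


def simulate(broker_up, incoming=10, capacity=4):
    """Return how many of the incoming requests were rejected."""
    queue = BoundedRequestQueue(capacity)
    rejected = 0
    for i in range(incoming):
        try:
            queue.submit(f"SELECT-{i}")
        except RuntimeError:
            rejected += 1
        drain(queue, broker_up)
    return rejected
```

With the broker reachable, `simulate(True)` rejects nothing. With it down, the 
first four requests occupy every slot and the remaining six are rejected.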


James needs to be more fault-tolerant in this case.

I propose we use quorum queues for all the queues when 
`quorum.queues.enable=true`, so the queues are still available even when a 
RabbitMQ node is down and James keeps functioning well.
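
As a rough sketch of what the proposal amounts to at queue declaration time: 
RabbitMQ selects the queue type through the optional `x-queue-type` argument, 
and `x-quorum-initial-group-size` sets the initial number of replicas. The 
helper names `queue_arguments` and `tolerated_node_failures` and the 
`replication_factor` parameter are assumptions made for illustration; they are 
not taken from the James code base or from the POC.

```python
def queue_arguments(quorum_queues_enabled, replication_factor=3):
    """Build the optional arguments passed when declaring any James queue.

    Hypothetical helper: the same arguments would be applied to every queue,
    not only to a subset of them.
    """
    args = {}
    if quorum_queues_enabled:
        args["x-queue-type"] = "quorum"
        args["x-quorum-initial-group-size"] = replication_factor
    return args


def tolerated_node_failures(replicas):
    """A quorum queue stays available while a majority of its replicas is up."""
    return (replicas - 1) // 2
```

With three replicas, `tolerated_node_failures(3)` is 1, which matches the 
3-node cluster described above: one node can be lost without the queue becoming 
unavailable. Note that RabbitMQ requires quorum queues to be durable and 
non-exclusive, so any queue currently declared otherwise would also need its 
declaration adjusted.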

We did a POC [here|https://github.com/apache/james-project/pull/2191], and 
using quorum queues for all the queues made James more fault tolerant, as 
expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
