Tran Hong Quan created JAMES-4027:
-------------------------------------

             Summary: Make all queues on Rabbitmq quorum queue when option enabled
                 Key: JAMES-4027
                 URL: https://issues.apache.org/jira/browse/JAMES-4027
             Project: James Server
          Issue Type: Bug
          Components: eventbus, Queue, rabbitmq
            Reporter: Tran Hong Quan

Today, when the quorum option is enabled, only some queues are declared as quorum queues. Others, such as the event bus notification queues and the Task Manager's termination queues, remain classic queues.

On a James deployment that uses quorum queues with a 3-node RabbitMQ cluster, James does not tolerate the loss of a single RabbitMQ node.

I tried to reproduce what happens, and here is my theory:

The RabbitMQ node that stores the notification queues is down
-> James cannot publish messages to RabbitMQ, which causes e.g. IMAP SELECT, STORE, APPEND, UNSELECT ... commands to fail
-> James keeps retrying the failed publishes (including the retry for Group registration, which seems to rely on a classic queue too) and queues up other IMAP requests.
-> The IMAP server queue fills up and the exception `The IMAP server has reached its maximum capacity` is thrown.
-> James IMAP becomes a zombie, and the failures cascade.

James needs to be more fault-tolerant in this case.

I propose that we use quorum queues for all queues when `quorum.queues.enable=true`, so that the queues remain available even when a RabbitMQ node is down and James keeps functioning well.

We did a POC [here|https://github.com/apache/james-project/pull/2191], and with quorum queues everywhere James was more fault tolerant, as expected.
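
To illustrate the proposal, here is a minimal sketch of how the optional arguments for a queue declaration could be derived from the quorum setting. It is not the code from the POC: the class `QueueArgumentsSketch`, the method `build` and the `replicationFactor` parameter are hypothetical names introduced for this example. The only RabbitMQ-specific parts are the `x-queue-type` and `x-quorum-initial-group-size` queue arguments.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of James.
final class QueueArgumentsSketch {
    private QueueArgumentsSketch() {
    }

    // Builds the optional arguments map for a queue declaration.
    // quorumQueuesEnabled mirrors the `quorum.queues.enable` option;
    // replicationFactor is an assumed setting for the initial number of replicas.
    static Map<String, Object> build(boolean quorumQueuesEnabled, int replicationFactor) {
        Map<String, Object> arguments = new HashMap<>();
        if (quorumQueuesEnabled) {
            if (replicationFactor < 1) {
                throw new IllegalArgumentException("replicationFactor must be at least 1");
            }
            // Ask RabbitMQ for a replicated quorum queue instead of a classic queue.
            arguments.put("x-queue-type", "quorum");
            // Initial number of replicas, e.g. 3 on a 3-node cluster.
            arguments.put("x-quorum-initial-group-size", replicationFactor);
        }
        // When the option is disabled the map stays empty and the queue type is left to the broker.
        return arguments;
    }
}
```

With the RabbitMQ Java client, such a map would be passed as the last parameter of `channel.queueDeclare(name, true, false, false, arguments)`. Quorum queues have to be durable and cannot be exclusive, so queues that are currently declared as non-durable or exclusive would need their declaration flags changed as well, not just the arguments.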