Hi folks,

We recently hit an issue on a deployment backed by a RabbitMQ cluster: the outage of a RabbitMQ node (for about 1 hour) left the James service more or less down too.
I created a Jira ticket to report the issue: https://issues.apache.org/jira/projects/JAMES/issues/JAMES-4027

More details below for those who have not read the Jira ticket yet.

Today, when the quorum option is enabled, only some queues are quorum queues, not all: for example, the event bus notification queues and the Task Manager's termination queues are still classic queues.

I tried to reproduce the issue, and here is my theory:

The RabbitMQ node that stores the notification queues goes down.
-> James cannot publish messages to RabbitMQ, which causes IMAP commands such as SELECT, STORE, APPEND, UNSELECT ... to fail.
-> James keeps retrying the failed publishes (the retry for Group registration seems to rely on a classic queue too) and queues other IMAP requests in the meantime.
-> The IMAP server queue becomes full and the exception `The IMAP server has reached its maximum capacity` is thrown.
-> James IMAP becomes a zombie, and the failures cascade.

James needs to be more fault-tolerant in this case. We think that making all RabbitMQ queues quorum queues when `quorum.queues.enable=true` would help James tolerate this scenario.

We investigated a POC at https://github.com/apache/james-project/pull/2191, and with all queues declared as quorum queues James was more fault-tolerant, as expected.

With full quorum queues, James is a bit slower, but performance is still fine, and that cost is likely the price of making James more reliable. With Redis-backed event bus notifications, performance is better than with RabbitMQ quorum queues for notifications.

What do you think about making all RabbitMQ queues quorum queues when the option is enabled? Feedback and reviews are very welcome.

Thanks for reading.

Quan
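
P.S. For anyone not familiar with the option discussed above, here is a minimal sketch of the relevant part of `rabbitmq.properties`. Only `quorum.queues.enable` comes from this thread. The replication factor line (both the key name and the value) is an assumption added for illustration, so please verify it against the James documentation for your version.

```properties
# rabbitmq.properties (sketch only, not a complete file)

# From this thread: today this turns only some queues into quorum queues.
# The proposal is that it should apply to all queues James declares.
quorum.queues.enable=true

# Assumption: key name and value are illustrative, check the James docs.
# Quorum queues replicate across cluster nodes, so the factor should not
# exceed the number of RabbitMQ nodes.
quorum.queues.replication.factor=3
```

The point of the proposal is that no James behaviour should depend on a queue hosted on a single RabbitMQ node once this flag is on.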