Hi Quan,

First, thanks for the work done on this topic.
I know some members of the community (Karsten?) have already done significant work on it, though more oriented toward the POP3 server.
This work is of course welcome, as it would result in higher reliability for the IMAP / JMAP components.

> What do you think about making all queues on Rabbitmq quorum queue when option enabled?

On the principle, +1.

In practice, that is slightly harder for the event bus notification queue...

- We can likely afford losing some of those pub/sub messages?
- The queue is tied to a connection, thus if the node/connection goes down it can be recreated elsewhere?
- We would need to come up with a cleanup strategy in order to eventually delete queues left hanging around.
- Also, how relevant is this RabbitMQ-backed pub/sub implementation when compared with the work done with Redis?

IMO the event bus notification was the main blocker to achieving decent HA with RabbitMQ. Do you share this point of view?

Best regards,

Benoit TELLIER

On 15/04/2024 09:53, Quan tran hong wrote:
Hi folks,

Recently we encountered an issue on a deployment that used a RabbitMQ cluster, where a RabbitMQ node outage (for about 1 hour) more or less forced the James service down too. I created a Jira ticket to report the issue: https://issues.apache.org/jira/projects/JAMES/issues/JAMES-4027

More details below for those who have not read the Jira ticket yet:

Today, when the quorum option is enabled, only some queues are quorum queues, not all (e.g. the event bus notification queues and the Task Manager's termination queues are not).

I tried to reproduce the issue and here is my theory:

The RabbitMQ node that stores the notification queues is down
-> James can not publish messages to RabbitMQ, which causes e.g. IMAP SELECT, STORE, APPEND, UNSELECT ... commands to fail
-> James keeps retrying the failed publishes (retry for Group registration, which seems to rely on the classic queue too) and queues other IMAP requests in the meantime
-> The IMAP server queue becomes full and the exception `The IMAP server has reached its maximum capacity` is thrown
-> James IMAP becomes a zombie, leading to cascading failures.

James needs to be more fault tolerant in this case.

We think making all queues on RabbitMQ quorum queues when `quorum.queues.enable=true` would help James be more fault tolerant in that scenario. We investigated a POC at https://github.com/apache/james-project/pull/2191 and the full quorum queues helped James be more fault tolerant, as expected.

Once full quorum queues are used, James performance is a bit slower but still fine, and that cost is likely needed to make James more reliable. If we use Redis-backed event bus notifications, the performance is better than with the RabbitMQ notification quorum queues.

What do you think about making all queues on RabbitMQ quorum queues when the option is enabled?

Feedback and review are very welcome.

Thanks for reading.

Quan
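
To make the discussion a bit more concrete, here is a minimal sketch of the queue arguments involved. It is not James code and not what the POC does: the class and method names are hypothetical, and it only assumes the standard RabbitMQ optional arguments `x-queue-type` (to request a quorum queue) and `x-expires` (to let the broker delete a queue that has been unused for a given time, one possible cleanup strategy for queues left hanging around).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of James).
class QueueArgumentsSketch {

    // Builds the optional arguments for a queue declaration.
    // quorumQueuesEnabled mirrors the `quorum.queues.enable` option discussed above.
    // expiresAfterMillis is an assumed cleanup knob: when positive, the broker
    // is asked to delete the queue once it has been unused for that long.
    public static Map<String, Object> queueArguments(boolean quorumQueuesEnabled,
                                                     int expiresAfterMillis) {
        Map<String, Object> arguments = new HashMap<>();
        if (quorumQueuesEnabled) {
            // Standard RabbitMQ argument selecting the quorum queue type.
            arguments.put("x-queue-type", "quorum");
        }
        if (expiresAfterMillis > 0) {
            // Standard RabbitMQ argument: queue expiry, in milliseconds.
            arguments.put("x-expires", expiresAfterMillis);
        }
        return arguments;
    }

    public static void main(String[] args) {
        // Example: quorum queue that expires after one hour without use.
        System.out.println(queueArguments(true, 3_600_000));
        // Example: option disabled, no expiry, so no extra arguments.
        System.out.println(queueArguments(false, 0));
    }
}
```

With the RabbitMQ Java client, such a map would be passed as the last parameter of `Channel.queueDeclare(queue, durable, exclusive, autoDelete, arguments)`. Since quorum queues have to be durable and cannot be exclusive, a queue tied to a connection can not simply be switched over, which is why some explicit cleanup such as an expiry would be needed.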