Hi folks,

We recently hit an issue on a deployment that uses a RabbitMQ cluster:
an outage of one RabbitMQ node (for about 1 hour) more or less took the
James service down too.

I created a Jira ticket to report the issue:
https://issues.apache.org/jira/projects/JAMES/issues/JAMES-4027

More details below for those who have not read the Jira ticket yet:

Today, when the quorum option is enabled, only some queues are declared as
quorum queues. Others, e.g. the event bus notification queues and the Task
Manager's termination queues, are not.

I tried to reproduce the issue and here is my theory:

The RabbitMQ node that stores the notification queues goes down
-> James cannot publish messages to RabbitMQ, which causes e.g. IMAP SELECT,
STORE, APPEND, UNSELECT ... commands to fail.
-> James keeps retrying the failed publishes (retry for Group registration,
which seems to rely on a classic queue too) and queues other IMAP
requests in the meantime.
-> The IMAP server queue fills up and the exception `The IMAP server
has reached its maximum capacity` is thrown.
-> James IMAP becomes a zombie, and the failures cascade.
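
To make the third step of the theory concrete, here is a toy model (not
James code; the class, method and request names are made up for
illustration). It assumes a bounded queue of pending requests and that no
request completes while publishing is stuck in retries, so every request
beyond the capacity is rejected:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ImapBacklogModel {
    /**
     * Counts how many of `incoming` requests are rejected by a bounded
     * queue of size `capacity` when nothing is ever drained from it
     * (the assumption being that workers are stuck retrying publishes).
     */
    static int rejectedWhileStalled(int capacity, int incoming) {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (int i = 0; i < incoming; i++) {
            // offer() returns false instead of blocking when the queue is full
            if (!pending.offer("request-" + i)) {
                rejected++;
            }
        }
        return rejected;
    }
}
```

In this model the rejections stand in for the "maximum capacity" error
above; the real IMAP server is of course more involved.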

James needs to be more fault-tolerant in this case.

We think declaring all RabbitMQ queues as quorum queues when
`quorum.queues.enable=true` would help James be more fault tolerant in that
scenario.
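
As a rough sketch of the idea (not the actual James code; the class and
method names here are hypothetical), every queue declaration could build
its arguments through one shared helper, so that the option applies to all
queues rather than only some. `x-queue-type` is the standard RabbitMQ
queue argument that selects the queue type:

```java
import java.util.HashMap;
import java.util.Map;

public class QueueArgs {
    /**
     * Hypothetical helper: returns the declaration arguments for a queue.
     * When the quorum option is enabled, the queue is requested as a
     * quorum queue; otherwise the broker default type is used.
     */
    static Map<String, Object> queueArguments(boolean quorumEnabled) {
        Map<String, Object> args = new HashMap<>();
        if (quorumEnabled) {
            args.put("x-queue-type", "quorum");
        }
        return args;
    }
}
```

With the RabbitMQ Java client, such a map would be passed as the last
parameter of `Channel.queueDeclare(queue, durable, exclusive, autoDelete,
arguments)`. Note that RabbitMQ requires quorum queues to be durable and
non-exclusive, so queues declared otherwise would need their other flags
changed too.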

We tried this in a POC at https://github.com/apache/james-project/pull/2191
and, as expected, using quorum queues everywhere made James more fault
tolerant.

With quorum queues used everywhere, James is a bit slower but performance
is still fine, and that cost is likely worth paying to make James more
reliable.

If we use Redis-backed event bus notifications instead, the performance is
better than with RabbitMQ quorum queues for notifications.

What do you think about making all RabbitMQ queues quorum queues when the
option is enabled? Feedback and review are very welcome.

Thanks for reading.

Quan
