Hello James devs,

I spent a bit of time digging into RabbitMQ performance and stability.
A few weeks ago I was surprised to discover the amount of work performed by the play.json library, and could not quite explain why it was using 3% of CPU time, making it the largest CPU consumer for mailbox events. RabbitMQ acks account for another 1.20% of CPU time.

Investigating the RabbitMQ event bus, I realized that events are routed to all group queues, then dispatched and deserialized, and only then applied if relevant. Given 200 events/s, and given that the JMAP server has 10 groups, we end up deserializing 2000 events/s, even when an event is irrelevant for a group.

As I recall, we wanted the (event, group) pair to be the unit of retry. A noble design goal.

I think parallelizing groups is a non-goal: this kind of optimization would not improve response time, as the processing is asynchronous and runs in the background, and it makes little sense at thousands of requests per second.

However, having each event copied into one queue per group is likely sub-optimal.

I think the design can be improved by transmitting, in the nominal case, only one message for all groups. The receiver will then try to execute the listeners of all groups.

We can keep retries for individual groups (with their dedicated exchanges and queues): upon failure, we republish to the retry exchange of the failing listener.

This makes the upgrade path easy too, as the group queues keep being consumed. One would just need to do some unbindings...

Note that such an evolution would:

- also enable us, if we want, to enforce some execution order for listeners, opening the way to fixing things like JAMES-3561 <https://issues.apache.org/jira/browse/JAMES-3561> ...
- serve as an inspiration for future event bus implementations like the Pulsar one, hence getting feedback on the existing design is IMO useful.

I will create a JIRA ticket holding the design proposal (schema) and how it differs from the previous one, as well as some RabbitMQ management screenshots.

Cheers,

Benoit
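
PS: to make the proposal above a bit more concrete, here is a minimal Java sketch of the nominal path: one message for all groups, deserialized once, with a republish to the retry exchange of the failing group only. It is an in-memory simulation under my own assumptions, not James or RabbitMQ client code: `SingleDispatchSketch`, `GroupListener`, `retryExchanges` and the `group-N` names are all hypothetical, and the real design may well differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory simulation only. All names are hypothetical; nothing here is
// James or RabbitMQ client API.
final class SingleDispatchSketch {

    // Stand-in for a listener registered under a group.
    interface GroupListener {
        String group();
        boolean isRelevant(String event);
        void handle(String event) throws Exception;
    }

    // Stand-in for the per-group retry exchanges: group name -> republished payloads.
    final Map<String, List<String>> retryExchanges = new LinkedHashMap<>();
    int deserializations = 0;
    private final List<GroupListener> listeners;

    SingleDispatchSketch(List<GroupListener> listeners) {
        this.listeners = listeners;
    }

    // Nominal path: one message serves all groups and is deserialized once.
    void onMessage(String payload) {
        String event = deserialize(payload);
        for (GroupListener listener : listeners) { // list order is execution order
            if (!listener.isRelevant(event)) {
                continue;
            }
            try {
                listener.handle(event);
            } catch (Exception e) {
                // Only the failing group is retried, through its own exchange.
                retryExchanges
                    .computeIfAbsent(listener.group(), g -> new ArrayList<>())
                    .add(payload);
            }
        }
    }

    private String deserialize(String payload) {
        deserializations++;    // counted to compare with one deserialization per group
        return payload.trim(); // placeholder for the real JSON deserialization
    }

    // Builds `groups` listeners (the one at index 3 always fails) and
    // dispatches `events` messages through them.
    static SingleDispatchSketch demo(int groups, int events) {
        List<GroupListener> listeners = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            final String name = "group-" + i;
            final boolean failing = (i == 3);
            listeners.add(new GroupListener() {
                public String group() { return name; }
                public boolean isRelevant(String event) { return true; }
                public void handle(String event) throws Exception {
                    if (failing) {
                        throw new Exception(name + " failed on " + event);
                    }
                }
            });
        }
        SingleDispatchSketch bus = new SingleDispatchSketch(listeners);
        for (int i = 0; i < events; i++) {
            bus.onMessage("event-" + i);
        }
        return bus;
    }

    public static void main(String[] args) {
        SingleDispatchSketch bus = demo(10, 200);
        System.out.println("deserializations: " + bus.deserializations);
        System.out.println("retries for group-3: " + bus.retryExchanges.get("group-3").size());
    }
}
```

With 10 groups and 200 events, this simulation deserializes 200 times instead of 2000, and only the failing group ends up with messages in its retry exchange. Iterating over an ordered list is also what would allow enforcing an execution order between listeners.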