[ 
https://issues.apache.org/jira/browse/JAMES-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Tellier closed JAMES-3749.
---------------------------------
    Resolution: Fixed

> Better metrics for RabbitMQ
> ---------------------------
>
>                 Key: JAMES-3749
>                 URL: https://issues.apache.org/jira/browse/JAMES-3749
>             Project: James Server
>          Issue Type: Improvement
>          Components: Metrics, rabbitmq
>    Affects Versions: 3.7.0
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>
> To my surprise, IMAP performance tests were highly limited by RabbitMQ.
> We lacked decent metrics on RabbitMQ / the event bus to clearly audit this.
> I added a few metrics; here are the results:
> {code:java}
> name=rabbit-acquire, count=52615, min=0.010816, max=2197.815295, 
> mean=14.692384926275777, stddev=84.45677147245601, p50=0.075775, 
> p75=0.203775, p95=63.176703, p98=199.229439, p99=375.390207, 
> p999=1216.348159, m1_rate=203.7148778352276, m5_rate=119.5444112071225, 
> m15_rate=51.27213833578196, mean_rate=83.30809197633765, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-dispatch, count=27858, min=0.333824, max=2365.587455, 
> mean=54.42362489080336, stddev=132.51032578954067, p50=15.466495, 
> p75=43.253759, p95=229.638143, p98=432.013311, p99=633.339903, 
> p999=1753.219071, m1_rate=109.2818061840995, m5_rate=70.86014994542329, 
> m15_rate=40.57835311284083, mean_rate=104.18175115750351, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-register, count=2976, min=9.633792, max=5603.590143, 
> mean=179.2821071827957, stddev=508.63321381095363, p50=50.331647, 
> p75=103.809023, p95=687.865855, p98=2013.265919, p99=3003.121663, 
> p999=5100.273663, m1_rate=3.7740876538564017, m5_rate=9.38568671432365, 
> m15_rate=9.574694038543646, mean_rate=11.166515283444058, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-release, count=52600, min=6.64E-4, max=131.596287, 
> mean=0.12847917764258554, stddev=1.7017175408795067, p50=0.006111, 
> p75=0.010303, p95=0.035583, p98=1.269759, p99=2.719743, p999=17.432575, 
> m1_rate=204.4922821701763, m5_rate=136.5136219818052, 
> m15_rate=81.73827938617151, mean_rate=197.1284478914709, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-unregister, count=449, min=10.878976, max=2466.250751, 
> mean=190.00389787082403, stddev=380.5671338872364, p50=51.118079, 
> p75=135.266303, p95=1010.827263, p98=1702.887423, p99=1912.602623, 
> p999=2466.250751, m1_rate=9.012783767745082, m5_rate=5.543710748795059, 
> m15_rate=4.918174526687269, mean_rate=19.889486577715797, 
> rate_unit=events/second, duration_unit=milliseconds
> {code}
> Analysis:
>  - dispatch takes a really long time (mean 54ms, p99 633ms) and negatively 
> impacts all other operations
>  - the channel pool was undersized (contention to get a channel)
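
The contention claim can be sanity-checked with Little's law (average items in flight = arrival rate x average time in system). The sketch below is only a rough estimate: it assumes each operation holds exactly one channel for its whole measured duration, which the issue does not state, and it reuses the m1_rate and mean values from the dump above. The class and method names are made up for illustration.

```java
import java.util.Locale;

class ChannelPoolBudget {
    /**
     * Little's law: average number of channels busy at once, assuming (hypothetically)
     * that one channel is held for the full duration of each operation.
     */
    static double busyChannels(double opsPerSecond, double meanMillis) {
        return opsPerSecond * meanMillis / 1000.0;
    }

    public static void main(String[] args) {
        // m1_rate and mean copied from the first metrics dump in the issue.
        double dispatch = busyChannels(109.2818061840995, 54.42362489080336);
        double register = busyChannels(3.7740876538564017, 179.2821071827957);
        double unregister = busyChannels(9.012783767745082, 190.00389787082403);

        System.out.printf(Locale.ROOT, "dispatch:   %.2f channels busy on average%n", dispatch);
        System.out.printf(Locale.ROOT, "register:   %.2f channels busy on average%n", register);
        System.out.printf(Locale.ROOT, "unregister: %.2f channels busy on average%n", unregister);
        System.out.printf(Locale.ROOT, "total:      %.2f channels busy on average%n",
                dispatch + register + unregister);
    }
}
```

Under that assumption, a pool whose size is close to the printed total would be saturated most of the time, which is consistent with the long tail seen on rabbit-acquire.
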
> I tried out the following:
>  - https://issues.apache.org/jira/browse/JAMES-3747 reactive implementation 
> for the RabbitMQ channel pool.
> Better reactive code, but not a game changer to be honest.
>  - Shorter routing key (don't include the full FQDN) -> small performance 
> gains...
>  - Disable publish confirms: game changer! Dispatch went from a mean of 50ms+ 
> and a p99 of 500ms+ to a mean of 1ms and a p99 of 8ms. All other metrics 
> (bind / unbind) improved as well. Contention to acquire a channel is 
> effectively gone.
>  - Turning off durability on notification channels unlocked further gains.
> {code:java}
> name=rabbit-acquire, count=1380387, min=0.005824, max=132.120575, 
> mean=0.120752084973272, stddev=0.47552513438388405, p50=0.056831, 
> p75=0.096767, p95=0.354303, p98=0.692223, p99=1.122303, p999=4.915199, 
> m1_rate=1.6804637345701686E-238, m5_rate=9.96453889901133E-47, 
> m15_rate=9.66533044449863E-15, mean_rate=32.71739356870932, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-dispatch, count=757489, min=0.063232, max=245.366783, 
> mean=0.6006688857527964, stddev=1.0950763726058712, p50=0.456703, 
> p75=0.610303, p95=1.310719, p98=2.064383, p99=2.949119, p999=9.764863, 
> m1_rate=9.8844762264434E-239, m5_rate=5.952316454575153E-47, 
> m15_rate=5.761608528831366E-15, mean_rate=17.954172496560656, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-register, count=18810, min=3.6864, max=1317.011455, 
> mean=21.051024209250397, stddev=41.433421890807836, p50=12.058623, 
> p75=19.529727, p95=66.322431, p98=106.954751, p99=155.189247, 
> p999=507.510783, m1_rate=1.5731326523774949E-260, 
> m5_rate=2.4205436310646514E-54, m15_rate=3.643125827717571E-18, 
> mean_rate=0.4463666568673792, rate_unit=events/second, 
> duration_unit=milliseconds
> name=rabbit-release, count=1380385, min=6.6E-4, max=131.596287, 
> mean=0.00816027294848901, stddev=0.12039944630360619, p50=0.006143, 
> p75=0.009087, p95=0.013183, p98=0.021759, p99=0.034047, p999=0.230399, 
> m1_rate=1.6486782760303405E-238, m5_rate=9.925571881590765E-47, 
> m15_rate=9.652806233445519E-15, mean_rate=32.71825157947973, 
> rate_unit=events/second, duration_unit=milliseconds
> name=rabbit-unregister, count=18810, min=3.11296, max=1761.607679, 
> mean=28.082487391812865, stddev=64.54031379534226, p50=12.582911, 
> p75=20.971519, p95=100.139007, p98=192.937983, p99=287.309823, 
> p999=805.306367, m1_rate=7.66299030904417E-260, 
> m5_rate=1.8971818595887257E-53, m15_rate=8.637433599041584E-18, 
> mean_rate=0.44693388443373244, rate_unit=events/second, 
> duration_unit=milliseconds
> {code}
> I was able to double the request count in my IMAP benches and still get a 
> 3-fold latency reduction.
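
To put numbers on the improvement, the reporter lines can be parsed and compared field by field. This is a minimal sketch, not part of the James code base: the class and method names are invented for illustration, and the two input strings are abridged copies of the rabbit-dispatch lines from the first and second dumps above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class MetricLineCompare {
    /** Parses a "key=value, key=value, ..." reporter line into an ordered map. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : line.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        return fields;
    }

    /** Returns before/after for a numeric field, i.e. how many times smaller the value became. */
    static double ratio(Map<String, String> before, Map<String, String> after, String field) {
        return Double.parseDouble(before.get(field)) / Double.parseDouble(after.get(field));
    }

    public static void main(String[] args) {
        // Abridged from the dumps in the issue: only name, count, mean and p99 are kept.
        Map<String, String> before = parse(
                "name=rabbit-dispatch, count=27858, mean=54.42362489080336, p99=633.339903");
        Map<String, String> after = parse(
                "name=rabbit-dispatch, count=757489, mean=0.6006688857527964, p99=2.949119");

        System.out.printf(Locale.ROOT, "%s: mean reduced %.1fx, p99 reduced %.1fx%n",
                before.get("name"), ratio(before, after, "mean"), ratio(before, after, "p99"));
    }
}
```

The same comparison works for any other field present in both lines, for example p95 or p999, since the parser keeps every key it finds.
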
> h3. Proposals
>  - Offer an option to disable publish confirms. This new James 3.7.0 
> behaviour brings nice resiliency semantics but is definitely harmful for 
> scalability. Some users may want to turn it off.
>  - Offer a way to turn off durability on notifications. Notifications are 
> likely not critical, and their loss is acceptable.
>  - Add those cool RabbitMQ metrics.
> And of course, invest in an alternative to RabbitMQ that does not force us to 
> choose between throughput and safety. Thoughts: Pulsar.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org
