Benoit Tellier created JAMES-3749:
-------------------------------------

             Summary: Better metrics for RabbitMQ
                 Key: JAMES-3749
                 URL: https://issues.apache.org/jira/browse/JAMES-3749
             Project: James Server
          Issue Type: Improvement
          Components: Metrics, rabbitmq
    Affects Versions: 3.7.0
            Reporter: Benoit Tellier
             Fix For: 3.8.0


To my surprise, IMAP performance tests were highly limited by RabbitMQ.

We lacked decent metrics on RabbitMQ / the event bus to clearly audit this.

I added a few metrics; here are the results:

{code:java}
name=rabbit-acquire, count=52615, min=0.010816, max=2197.815295, 
mean=14.692384926275777, stddev=84.45677147245601, p50=0.075775, p75=0.203775, 
p95=63.176703, p98=199.229439, p99=375.390207, p999=1216.348159, 
m1_rate=203.7148778352276, m5_rate=119.5444112071225, 
m15_rate=51.27213833578196, mean_rate=83.30809197633765, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-dispatch, count=27858, min=0.333824, max=2365.587455, 
mean=54.42362489080336, stddev=132.51032578954067, p50=15.466495, 
p75=43.253759, p95=229.638143, p98=432.013311, p99=633.339903, 
p999=1753.219071, m1_rate=109.2818061840995, m5_rate=70.86014994542329, 
m15_rate=40.57835311284083, mean_rate=104.18175115750351, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-register, count=2976, min=9.633792, max=5603.590143, 
mean=179.2821071827957, stddev=508.63321381095363, p50=50.331647, 
p75=103.809023, p95=687.865855, p98=2013.265919, p99=3003.121663, 
p999=5100.273663, m1_rate=3.7740876538564017, m5_rate=9.38568671432365, 
m15_rate=9.574694038543646, mean_rate=11.166515283444058, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-release, count=52600, min=6.64E-4, max=131.596287, 
mean=0.12847917764258554, stddev=1.7017175408795067, p50=0.006111, 
p75=0.010303, p95=0.035583, p98=1.269759, p99=2.719743, p999=17.432575, 
m1_rate=204.4922821701763, m5_rate=136.5136219818052, 
m15_rate=81.73827938617151, mean_rate=197.1284478914709, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-unregister, count=449, min=10.878976, max=2466.250751, 
mean=190.00389787082403, stddev=380.5671338872364, p50=51.118079, 
p75=135.266303, p95=1010.827263, p98=1702.887423, p99=1912.602623, 
p999=2466.250751, m1_rate=9.012783767745082, m5_rate=5.543710748795059, 
m15_rate=4.918174526687269, mean_rate=19.889486577715797, 
rate_unit=events/second, duration_unit=milliseconds
{code}

Analysis:

 - dispatch is very slow (mean 54ms, p99 633ms) and negatively impacts all 
other operations
 - the channel pool was undersized (contention to get a channel: rabbit-acquire 
p99 is 375ms)
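
One rough way to sanity-check the pool sizing point is Little's law (average 
operations in flight = arrival rate x mean latency). The sketch below applies 
it to the mean_rate and mean values from the dump above. It assumes each 
dispatch / register / unregister holds one channel for its whole duration, 
which the metrics alone do not confirm, and the class and method names are 
made up for illustration. mean_rate is averaged over the lifetime of each 
timer, so treat the result as an order of magnitude only.

```java
/** Rough channel-demand estimate via Little's law: L = lambda * W. */
final class ChannelDemandEstimate {

    /** Average number of operations in flight for a rate (events/s) and a mean latency (ms). */
    static double inFlight(double eventsPerSecond, double meanLatencyMillis) {
        return eventsPerSecond * (meanLatencyMillis / 1000.0);
    }

    public static void main(String[] args) {
        // mean_rate and mean copied from the first metrics dump.
        double dispatch = inFlight(104.18175115750351, 54.42362489080336);
        double register = inFlight(11.166515283444058, 179.2821071827957);
        double unregister = inFlight(19.889486577715797, 190.00389787082403);

        // Assumption: one channel is held per in-flight operation.
        double total = dispatch + register + unregister;
        System.out.printf("dispatch=%.2f register=%.2f unregister=%.2f total=%.2f%n",
                dispatch, register, unregister, total);
    }
}
```

With these figures the estimate comes to roughly 11 operations in flight on 
average, before accounting for bursts, so a small pool would plausibly see 
the contention reported by rabbit-acquire.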

I tried the following:

 - https://issues.apache.org/jira/browse/JAMES-3747 reactive implementation for 
the RabbitMQ channel pool.

Better reactive code but not a game changer to be honest.

 - Shorter routing key (don't include the full FQDN) -> small performance 
gains...

 - Disable publish confirms: game changer! Dispatch went from mean 50ms+, p99 
500ms+ to mean 1ms, p99 8ms. All other metrics (bind / unbind) improved as 
well, and contention to acquire a channel is effectively gone.

 - Turning off durability on notification channels unlocked further gains.

{code:java}
name=rabbit-acquire, count=1380387, min=0.005824, max=132.120575, 
mean=0.120752084973272, stddev=0.47552513438388405, p50=0.056831, p75=0.096767, 
p95=0.354303, p98=0.692223, p99=1.122303, p999=4.915199, 
m1_rate=1.6804637345701686E-238, m5_rate=9.96453889901133E-47, 
m15_rate=9.66533044449863E-15, mean_rate=32.71739356870932, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-dispatch, count=757489, min=0.063232, max=245.366783, 
mean=0.6006688857527964, stddev=1.0950763726058712, p50=0.456703, p75=0.610303, 
p95=1.310719, p98=2.064383, p99=2.949119, p999=9.764863, 
m1_rate=9.8844762264434E-239, m5_rate=5.952316454575153E-47, 
m15_rate=5.761608528831366E-15, mean_rate=17.954172496560656, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-register, count=18810, min=3.6864, max=1317.011455, 
mean=21.051024209250397, stddev=41.433421890807836, p50=12.058623, 
p75=19.529727, p95=66.322431, p98=106.954751, p99=155.189247, p999=507.510783, 
m1_rate=1.5731326523774949E-260, m5_rate=2.4205436310646514E-54, 
m15_rate=3.643125827717571E-18, mean_rate=0.4463666568673792, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-release, count=1380385, min=6.6E-4, max=131.596287, 
mean=0.00816027294848901, stddev=0.12039944630360619, p50=0.006143, 
p75=0.009087, p95=0.013183, p98=0.021759, p99=0.034047, p999=0.230399, 
m1_rate=1.6486782760303405E-238, m5_rate=9.925571881590765E-47, 
m15_rate=9.652806233445519E-15, mean_rate=32.71825157947973, 
rate_unit=events/second, duration_unit=milliseconds

name=rabbit-unregister, count=18810, min=3.11296, max=1761.607679, 
mean=28.082487391812865, stddev=64.54031379534226, p50=12.582911, 
p75=20.971519, p95=100.139007, p98=192.937983, p99=287.309823, p999=805.306367, 
m1_rate=7.66299030904417E-260, m5_rate=1.8971818595887257E-53, 
m15_rate=8.637433599041584E-18, mean_rate=0.44693388443373244, 
rate_unit=events/second, duration_unit=milliseconds
{code}
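
To compare two dumps like the ones in this message without eyeballing them, 
a small helper can parse the "key=value, ..." reporter lines and compute the 
before/after ratio per field. This is a minimal sketch: the class and method 
names are hypothetical, and the input lines in main are trimmed to the 
rabbit-dispatch fields actually used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Compares one field of two "key=value, key=value" metric lines. */
final class MetricLineComparison {

    /** Parses "k1=v1, k2=v2, ..." into an ordered map of raw string values. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.split(",")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq > 0) {
                fields.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return fields;
    }

    /** Ratio before/after for one numeric field; a value above 1 means the field shrank. */
    static double reduction(Map<String, String> before, Map<String, String> after, String field) {
        return Double.parseDouble(before.get(field)) / Double.parseDouble(after.get(field));
    }

    public static void main(String[] args) {
        // rabbit-dispatch, first run (publish confirms on, durable notifications).
        Map<String, String> before =
                parse("name=rabbit-dispatch, mean=54.42362489080336, p99=633.339903");
        // rabbit-dispatch, second run (confirms disabled, durability off).
        Map<String, String> after =
                parse("name=rabbit-dispatch, mean=0.6006688857527964, p99=2.949119");

        System.out.printf("mean reduced %.1fx, p99 reduced %.1fx%n",
                reduction(before, after, "mean"), reduction(before, after, "p99"));
    }
}
```

For rabbit-dispatch this gives a mean about 90 times lower and a p99 about 
215 times lower between the two runs.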

I was able to double the request count in my IMAP benches and still get a 
3-fold latency reduction.

h3. Proposals

 - Offer an option to disable publish confirms. This new James 3.7.0 behaviour 
brings cool resiliency semantics but is definitely harmful for scalability. We 
can imagine some users wanting to turn it off.
 - Offer a way to turn off durability on notifications. Notifications are 
likely not critical, and their loss is acceptable.
 - Add those cool RabbitMQ metrics.
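
As a sketch of what the first two proposals could look like on the 
configuration side, the snippet below reads two boolean switches from a 
java.util.Properties object. The property keys and the EventBusTuning class 
are hypothetical names chosen for illustration, not existing James 
configuration. Both switches default to true so that the current 3.7.0 
behaviour (confirms on, durable notifications) is kept unless an operator 
opts out.

```java
import java.util.Properties;

/** Holds the two proposed opt-out switches. Names are illustrative only. */
final class EventBusTuning {
    // Hypothetical property keys, not taken from any existing configuration file.
    static final String PUBLISH_CONFIRM_KEY = "event.bus.publish.confirm.enabled";
    static final String NOTIFICATION_DURABILITY_KEY = "event.bus.notification.durability.enabled";

    final boolean publishConfirmEnabled;
    final boolean notificationDurabilityEnabled;

    EventBusTuning(boolean publishConfirmEnabled, boolean notificationDurabilityEnabled) {
        this.publishConfirmEnabled = publishConfirmEnabled;
        this.notificationDurabilityEnabled = notificationDurabilityEnabled;
    }

    /**
     * Reads both switches, defaulting to true (the safe setting) when a key is absent.
     * Boolean.parseBoolean reads anything other than "true" (case-insensitive) as false.
     */
    static EventBusTuning from(Properties props) {
        return new EventBusTuning(
                Boolean.parseBoolean(props.getProperty(PUBLISH_CONFIRM_KEY, "true")),
                Boolean.parseBoolean(props.getProperty(NOTIFICATION_DURABILITY_KEY, "true")));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Operator trades resiliency for throughput on publish confirms only.
        props.setProperty(PUBLISH_CONFIRM_KEY, "false");

        EventBusTuning tuning = EventBusTuning.from(props);
        System.out.println("publish confirms enabled: " + tuning.publishConfirmEnabled);
        System.out.println("notification durability enabled: " + tuning.notificationDurabilityEnabled);
    }
}
```

Keeping the two switches separate lets users give up durability on 
notifications while still keeping publish confirms, or the other way round.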

And of course, invest in an alternative to RabbitMQ that does not force us to 
choose between throughput and safety. Thoughts: Pulsar.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

