Hello,

We have a cluster with a 2-master-2-slave (2m-2s) architecture. TPS ranges from
1,000 to 6,000, with typically about 200 million messages per 24 hours.
The commit log is about 200 GB and can be fully cached in memory.

Our servers (one RocketMQ broker each, 4 servers in this cluster):
Xeon E5-2620 (6 cores) x 2 (24 threads total)
256 GB DDR3 1600 MHz
240 GB Intel SSD 530 (SSDSC2BW240H6) x 6 (RAID 1)

To improve producing performance (mainly to reduce the number of network
packets) we started to use the batch sending feature, and we began to receive
many "slow log" entries after that (batch size is 50~70 messages, 100~1000 B
per message body).
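For reference, our producer-side batching looks roughly like this (a sketch only: the group name, topic, and name server address are placeholders, and it needs the rocketmq-client dependency and a running broker):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class BatchSendSketch {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("example_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // placeholder name server address
        producer.start();

        // Build one batch of 50~70 small messages (100~1000 B bodies),
        // all for the same topic, as batch sending requires.
        List<Message> batch = new ArrayList<>();
        for (int i = 0; i < 60; i++) {
            batch.add(new Message("ExampleTopic", ("body-" + i).getBytes()));
        }

        // One network round trip carries the whole batch.
        SendResult result = producer.send(batch);
        System.out.println(result.getSendStatus());

        producer.shutdown();
    }
}
```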

After some tuning (raising osPageCacheBusyTimeOutMills and
waitTimeMillsInSendQueue to 500 ms, etc.) we only increased the "time cost"
printed in the slow log.
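Concretely, the broker.conf changes we tried were (a sketch, assuming RocketMQ 4.x property names):

```properties
# A put is flagged "page cache busy" if it holds the put lock longer than this (ms)
osPageCacheBusyTimeOutMills=500
# Max time (ms) a send request may wait in the broker's send queue before rejection
waitTimeMillsInSendQueue=500
```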

The page cache RT in the broker log before using batch sending:

[PAGECACHERT] TotalPut 50650, PutMessageDistributeTime [<=0ms]:44250 
[0~10ms]:5865 [10~50ms]:514 [50~100ms]:10 [100~200ms]:11 [200~500ms]:0 
[500ms~1s]:0 [1~2s]:0 [2~3s]:0 [3~4s]:0 [4~5s]:0 [5~10s]:0 [10s~]:0 

And after:

[PAGECACHERT] TotalPut 1970, PutMessageDistributeTime [<=0ms]:1428 [0~10ms]:459 
[10~50ms]:4 [50~100ms]:2 [100~200ms]:5 [200~500ms]:10 [500ms~1s]:10 [1~2s]:25 
[2~3s]:24 [3~4s]:3 [4~5s]:0 [5~10s]:0 [10s~]:0



After reading CommitLog I found that the whole batch is appended under a
single lock, so it makes sense that the average execution time of a PUT
increases. But we also receive some "in lock" slow logs, which trigger flow
control for a while:

2018-06-14 21:02:42 WARN SendMessageThread_102 - [NOTIFYME]putMessages in lock 
cost time(ms)=1373, bodyLength=49380 
AppendMessageResult=AppendMessageResult{status=PUT_OK, 
wroteOffset=68238809866237, wroteBytes=54420, 
msgId='0A0A63CF00002A9F00003E1016059FFD,0A0A63CF00002A9F00003E101605A37D,0A0A63CF00002A9F00003E101605A6DE,0A0A63CF00002A9F00003E101605A8C3,0A0A63CF00002A9F00003E101605ABA3,0A0A63CF00002A9F00003E101605AE81,0A0A63CF00002A9F00003E101605B066,0A0A63CF00002A9F00003E101605B24B,0A0A63CF00002A9F00003E101605B4CC,0A0A63CF00002A9F00003E101605B7F5,0A0A63CF00002A9F00003E101605BB78,0A0A63CF00002A9F00003E101605BE70,0A0A63CF00002A9F00003E101605C19F,0A0A63CF00002A9F00003E101605C4BF,0A0A63CF00002A9F00003E101605C78B,0A0A63CF00002A9F00003E101605CAE4,0A0A63CF00002A9F00003E101605CDC4,0A0A63CF00002A9F00003E101605D206,0A0A63CF00002A9F00003E101605D581,0A0A63CF00002A9F00003E101605D83B,0A0A63CF00002A9F00003E101605DB5B,0A0A63CF00002A9F00003E101605DED0,0A0A63CF00002A9F00003E101605E1AA,0A0A63CF00002A9F00003E101605E54D,0A0A63CF00002A9F00003E101605E8C7,0A0A63CF00002A9F00003E101605EB92,0A0A63CF00002A9F00003E101605EF0C,0A0A63CF00002A9F00003E101605F0D8,0A0A63CF00002A9F00003E101605F346,0A0A63CF00002A9F00003E101605F5FB,0A0A63CF00002A9F00003E101605F8D6,0A0A63CF00002A9F00003E101605FC51,0A0A63CF00002A9F00003E101605FFD1,0A0A63CF00002A9F00003E101606019D,0A0A63CF00002A9F00003E1016060509,0A0A63CF00002A9F00003E1016060816,0A0A63CF00002A9F00003E1016060B7A,0A0A63CF00002A9F00003E1016060E61,0A0A63CF00002A9F00003E10160611DF,0A0A63CF00002A9F00003E1016061558,0A0A63CF00002A9F00003E10160618A6,0A0A63CF00002A9F00003E1016061B68,0A0A63CF00002A9F00003E1016061E6B,0A0A63CF00002A9F00003E10160621E5,0A0A63CF00002A9F00003E10160624A2,0A0A63CF00002A9F00003E101606281D,0A0A63CF00002A9F00003E1016062AE3,0A0A63CF00002A9F00003E1016062E60,0A0A63CF00002A9F00003E101606313E,0A0A63CF00002A9F00003E101606343A,0A0A63CF00002A9F00003E10160637B5,0A0A63CF00002A9F00003E1016063AE3,0A0A63CF00002A9F00003E1016063E67,0A0A63CF00002A9F00003E1016064218,0A0A63CF00002A9F00003E10160644E0,0A0A63CF00002A9F00003E101606487A,0A0A63CF00002A9F00003E1016064BF4,0A0A63CF00002A9F00003E1016064E61,0A0A63CF00002A9F00003E101606512E,0A0A63CF00002A9F00003E1016065489,0A0A63CF00002
A9F00003E101606574F,0A0A63CF00002A9F00003E1016065AE3,0A0A63CF00002A9F00003E1016065CC8,0A0A63CF00002A9F00003E1016065FED,0A0A63CF00002A9F00003E10160662E2,0A0A63CF00002A9F00003E10160664C7,0A0A63CF00002A9F00003E10160667F6,0A0A63CF00002A9F00003E1016066B81,0A0A63CF00002A9F00003E1016066E54,0A0A63CF00002A9F00003E1016067132',
 storeTimestamp=1528981361218, logicsOffset=5784993480, pagecacheRT=1373, 
msgNum=70}
2018-06-14 21:02:42 WARN SendMessageThread_102 - not in lock eclipse 
time(ms)=1374, bodyLength=49380
2018-06-14 21:02:42 WARN SendMessageThread_79 - not in lock eclipse 
time(ms)=688, bodyLength=274
2018-06-14 21:02:42 WARN SendMessageThread_84 - not in lock eclipse 
time(ms)=562, bodyLength=48242
2018-06-14 21:02:42 WARN SendMessageThread_107 - not in lock eclipse 
time(ms)=839, bodyLength=274
2018-06-14 21:02:42 WARN SendMessageThread_97 - not in lock eclipse 
time(ms)=1199, bodyLength=49469
2018-06-14 21:02:42 WARN SendMessageThread_111 - not in lock eclipse 
time(ms)=1307, bodyLength=47329
2018-06-14 21:02:42 WARN SendMessageThread_91 - not in lock eclipse 
time(ms)=786, bodyLength=275
2018-06-14 21:02:42 WARN SendMessageThread_67 - not in lock eclipse 
time(ms)=807, bodyLength=575
2018-06-14 21:02:42 WARN SendMessageThread_65 - not in lock eclipse 
time(ms)=1326, bodyLength=47294
2018-06-14 21:02:43 INFO FlushRealTimeService - Flush data to disk costs 2271 ms
2018-06-14 21:02:43 WARN SendMessageThread_116 - not in lock eclipse 
time(ms)=1542, bodyLength=50039
2018-06-14 21:02:43 WARN SendMessageThread_119 - not in lock eclipse 
time(ms)=1202, bodyLength=50024
2018-06-14 21:02:43 WARN SendMessageThread_55 - not in lock eclipse 
time(ms)=1436, bodyLength=907
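The single-lock behavior described above can be illustrated with a minimal sketch (not the actual CommitLog code): the put lock is acquired once and every message in the batch is appended before it is released, so the "in lock cost time" grows with batch size.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of CommitLog-style batch appending: the entire batch is
// written while a single putMessageLock is held, serializing all producers.
public class BatchAppendSketch {
    private final ReentrantLock putMessageLock = new ReentrantLock();
    private final StringBuilder commitLog = new StringBuilder();

    // Returns the elapsed in-lock time in ms, analogous to "in lock cost time(ms)".
    public long appendBatch(List<String> batch) {
        long start = System.currentTimeMillis();
        putMessageLock.lock(); // one lock acquisition for the whole batch
        try {
            for (String msg : batch) {
                commitLog.append(msg); // every message is appended inside the lock
            }
        } finally {
            putMessageLock.unlock();
        }
        return System.currentTimeMillis() - start;
    }

    public int logLength() {
        return commitLog.length();
    }
}
```

A batch of 70 messages therefore holds the lock roughly 70 times longer than a single put, which is why the per-PUT latency distribution shifts right even when throughput is unchanged.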

So the questions are:
1. Is batch sending recommended?
2. Since it may be "slower" than simple sending, should I increase the in-lock timeout?
3. How can we solve this performance problem (many REJECTREQUEST responses are
generated at the producers)? Should we add a new master to the cluster?


Best regards,

Jason
