[ https://issues.apache.org/jira/browse/ROCKETMQ-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394750#comment-16394750 ]
Shilin Lu commented on ROCKETMQ-384: ------------------------------------ I think ntp offset is not a root cause.There will be 2S offset between different machines.But in this case,it is in a same machine,i think it can be ignore.But in this case ,i think may be gc is a root cause. > broker flow control is abnormal when the machine's physical time drift > ---------------------------------------------------------------------- > > Key: ROCKETMQ-384 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-384 > Project: Apache RocketMQ > Issue Type: Improvement > Components: rocketmq-broker > Affects Versions: 4.2.0 > Environment: operating system: CentOS6.0 > hardware: 8C8G VM > version: rocketmq-4.2.0 > broker config: default 2m-2s-async config > producer qps: 4000 > pre message byte: 10 bytes > Reporter: liyuzhou > Assignee: yukon > Priority: Major > > When I did a performance test for broker,I found large numbers of > exception like this: > send error com.alibaba.rocketmq.client.exception.MQBrokerException: CODE: 2 > DESC: [REJECTREQUEST]system busy, start flow control for a while > After I have read the related source code,I think this exception occurs > when the broker think its OS is busy, it will refuse the increasing request. > But I saw the server's monitor, the JVM gc is normal(normally 10ms), max cost > 50ms and performed at a frequency of 10s once, and my server's CPU and disk > IO also is health.But my server's NTP offset occasionally drift more than > 2s.So I think the pyhsical time caused the flow control. > related code: > {code:java} > CommitLog.java > public PutMessageResult putMessage(final MessageExtBrokerInner msg) { > ... ... > long beginLockTimestamp = > this.defaultMessageStore.getSystemClock().now(); //now() is > System.currentTimeMillis() > this.beginTimeInLock = beginLockTimestamp; > ... ... > } > DefaultMessageStore.java > public boolean isOSPageCacheBusy() { > long begin = this.getCommitLog().getBeginTimeInLock(); > long diff = this.systemClock.now() - begin; > return diff < 10000000 > && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills(); > } > {code} > Assume the first request run into the CommitLog's method putMessage, the > physical clock is 0s, the server's physical clock drift to 2s, then the > second request run into the isOSPageCacheBusy in order to check if the system > is busy, so the second request is rejected by the broker for the clock drift. > So should we replace System.currentTimeMillis() with System.nanoTime() to > reduce the abnormal request rejection? -- This message was sent by Atlassian JIRA (v7.6.3#76005)