[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172755#comment-16172755 ] ASF GitHub Bot commented on ROCKETMQ-272: - Github user evthoriz commented on the issue: https://github.com/apache/incubator-rocketmq/pull/153 @dongeforever thanks for your reply. I tried, but it seems that there's no easy way to mock a test for this, since the ha mechanism involves lots of salve replicating behavior. Do you have any good ideas? > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144692#comment-16144692 ] ASF GitHub Bot commented on ROCKETMQ-272: - Github user evthoriz commented on the issue: https://github.com/apache/incubator-rocketmq/pull/153 @vongosling @zhouxinyu @shroman @lizhanhui Anybody willing to review this pr? > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141277#comment-16141277 ] ASF GitHub Bot commented on ROCKETMQ-272: - Github user evthoriz commented on the issue: https://github.com/apache/incubator-rocketmq/pull/153 @Jaskey Another key condition is the MASTER-SLAVE replicating mode, which is SYNC_MASTER in the case. However, the option `syncFlushTimeout` is useless not only for big messages, but also those small messages. Small Messages just not trigger the bug, but it still exists. > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141225#comment-16141225 ] ASF GitHub Bot commented on ROCKETMQ-272: - Github user evthoriz commented on the issue: https://github.com/apache/incubator-rocketmq/pull/153 @Jaskey This issue can be easily reproduced with the config I provided in Jira. Please take a look at the description there. > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141135#comment-16141135 ] ASF GitHub Bot commented on ROCKETMQ-272: - Github user Jaskey commented on the issue: https://github.com/apache/incubator-rocketmq/pull/153 @evthoriz Would you please also point out that what kind of scenario will this issue be reproduced? > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141062#comment-16141062 ] ASF GitHub Bot commented on ROCKETMQ-272: - Github user evthoriz commented on the issue: https://github.com/apache/incubator-rocketmq/pull/153 @vongosling @zhouxinyu The CI environment is not correctly set. Would you guys have a loot at that? > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139963#comment-16139963 ] ASF GitHub Bot commented on ROCKETMQ-272: - GitHub user evthoriz opened a pull request: https://github.com/apache/incubator-rocketmq/pull/153 [ROCKETMQ-272] Fix sync slave timeout when using SYNC_MASTER Jira: https://issues.apache.org/jira/browse/ROCKETMQ-272 The timeout logic doesn't work correctly. Thread waiting in GroupTransferService may frequently waked up by ReadSocketService in HAConnection. So the transfer logic may return soon and wake up the thread waiting for the HA handling, which will make the timeout value in HA handling useless. This patch repairs the timeout logic in syncing, and also introduces an option `syncSlaveTimeout` in `MessageStoreConfig` to distinguish from the disk flush timeout option. You can merge this pull request into a Git repository by running: $ git pull https://github.com/evthoriz/incubator-rocketmq debug-ha Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-rocketmq/pull/153.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #153 commit 6f2501a24a701368b6213fd5acb3355ebdaafeb6 Author: evthorizDate: 2017-08-24T11:50:20Z [ROCKETMQ-272] Fix sync slave timeout when using SYNC_MASTER > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public void handleHA(AppendMessageResult result, PutMessageResult > putMessageResult, MessageExt messageExt) { > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (messageExt.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > GroupCommitRequest request = new > GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes()); > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + messageExt.getTopic() + " tags: " > + messageExt.getTags() + " client address: " + > messageExt.getBornHostNameString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ROCKETMQ-272) The config `syncFlushTimeout` doesn't work for SYNC_MASTER
[ https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136453#comment-16136453 ] Yu Kaiyuan commented on ROCKETMQ-272: - Relevant log: *Proudcer* {code:java} 2017-08-22 14:26:25,420 INFO main com.rmq.example.TestProducer 28: Message content size = 512028 2017-08-22 14:26:25,887 INFO main com.rmq.example.TestProducer 35: sendResult = SendResult [sendStatus=FLUSH_SLAVE_TIMEOUT, msgId=0A05121B278412A3A3806CAB9B7D, offsetMsgId=0ADD49222A9F11B26A50, messageQueue=MessageQueue [topic=TopicTestBigMessage, brokerName=broker-a, queueId=0], queueOffset=1] {code} *Broker* {code:java} 2017-08-22 14:26:25 WARN GroupTransferService - transfer messsage to slave timeout, 297196971 2017-08-22 14:26:25 ERROR SendMessageThread_1 - do sync transfer other node, wait return, but failed, topic: TopicTestBigMessage tags: null client address: 10.5.18.27 {code} > The config `syncFlushTimeout` doesn't work for SYNC_MASTER > -- > > Key: ROCKETMQ-272 > URL: https://issues.apache.org/jira/browse/ROCKETMQ-272 > Project: Apache RocketMQ > Issue Type: Bug > Components: rocketmq-broker >Affects Versions: 4.1.0-incubating >Reporter: Yu Kaiyuan >Assignee: yukon > > It's quite frequent to get result as `sendStatus=FLUSH_SLAVE_TIMEOUT` when > sending big messages(>500k) in SYNC_MASTER/SLAVE scenario. > The timeout value used by the sync process currently as I found, is the > config `syncFlushTimeout`. And its default value is 5000 milliseconds. > But it shows that producer get the result as `FLUSH_SLAVE_TIMEOUT` less than > 1 second. > So why does the config not work as expected? > Relevant code: > {code:java} > // CommitLog.java > public PutMessageResult putMessage(final MessageExtBrokerInner msg) { > { > // Synchronous write double > if (BrokerRole.SYNC_MASTER == > this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) { > HAService service = this.defaultMessageStore.getHaService(); > if (msg.isWaitStoreMsgOK()) { > // Determine whether to wait > if (service.isSlaveOK(result.getWroteOffset() + > result.getWroteBytes())) { > if (null == request) { > request = new GroupCommitRequest(result.getWroteOffset() > + result.getWroteBytes()); > } > service.putRequest(request); > service.getWaitNotifyObject().wakeupAll(); > boolean flushOK = > // TODO > > request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout()); > if (!flushOK) { > log.error("do sync transfer other node, wait return, but > failed, topic: " + msg.getTopic() + " tags: " > + msg.getTags() + " client address: " + > msg.getBornHostString()); > > putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT); > } > } > // Slave problem > else { > // Tell the producer, slave not available > > putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE); > } > } > } > return putMessageResult; > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)