Hi Attila Wind,



Could you confirm whether the failed node was completely shut down? If it was, 
the nameserver removes the failed node from the route info, and producers pick 
up the updated route info (and drop the dead node) within their nameserver 
polling interval, which is 30 seconds by default 
(org.apache.rocketmq.client.ClientConfig#pollNameServerInterval). In addition, 
I suggest upgrading the client to as recent a version as possible.
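
As a rough illustration, the two client-side settings mentioned in this thread 
can be set on the producer as in the configuration sketch below. The group name, 
the nameserver addresses and the 10-second value are placeholders I made up for 
the example, not recommendations. The code needs the rocketmq-client jar on the 
classpath and does not start the producer.

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class ProducerTimeoutSketch {
    public static void main(String[] args) {
        // Hypothetical group name and nameserver addresses, for illustration only.
        DefaultMQProducer producer = new DefaultMQProducer("example_producer_group");
        producer.setNamesrvAddr("namesrv-1:9876;namesrv-2:9876");

        // How often the client polls the nameserver for route info, in milliseconds.
        // The default is 30 seconds; 10 seconds here is an assumed example value.
        producer.setPollNameServerInterval(10_000);

        // Per-send timeout in milliseconds (the 3-second wait observed by the producers).
        producer.setSendMsgTimeout(3_000);

        // Print the effective values; producer.start() is intentionally not called.
        System.out.println("pollNameServerInterval=" + producer.getPollNameServerInterval());
        System.out.println("sendMsgTimeout=" + producer.getSendMsgTimeout());
    }
}
```

The settings must be applied before producer.start() is called.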




If you don't want the failed node to join the cluster immediately when it comes 
back online, you can edit the configuration file (broker.conf) and set 
brokerPermission = 4 (write disabled, read enabled) before starting it, and 
then use the mqadmin updateBrokerConfig command to change brokerPermission to 6 
(write and read enabled) once it has been online for 5 minutes.
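
To make the two values less magic, here is a small self-contained sketch of how 
they can be read as bit flags. It assumes the permission is a bit mask in which 
2 means write and 4 means read, which matches the values 4 and 6 above. The 
class and method names are hypothetical and are not part of the RocketMQ API.

```java
// Hypothetical local helper: decodes a brokerPermission value as a bit mask.
// Assumption: bit value 2 = write, bit value 4 = read (so 6 = read + write).
final class BrokerPermissionSketch {
    static final int PERM_WRITE = 1 << 1; // 2
    static final int PERM_READ = 1 << 2;  // 4

    static boolean isReadable(int perm) {
        return (perm & PERM_READ) == PERM_READ;
    }

    static boolean isWritable(int perm) {
        return (perm & PERM_WRITE) == PERM_WRITE;
    }

    static String describe(int perm) {
        return "brokerPermission=" + perm
                + " readable=" + isReadable(perm)
                + " writable=" + isWritable(perm);
    }

    public static void main(String[] args) {
        System.out.println(describe(4)); // value used before the node goes online
        System.out.println(describe(6)); // value restored after the warm-up period
    }
}
```

With brokerPermission = 4 the broker can still serve consumers reading existing 
messages, while producers do not write new messages to it.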




Regards, 

RongtongJin










At 2022-05-02 00:10:17, "Attila Wind" <attilaw@swf.technology> wrote:

Hi RMQ Users,

We are running a 3 node Rocket MQ Cluster - version 4.7.0, only master nodes.
Our app language is Java and we are using 
org.apache.rocketmq:rocketmq-client:4.2.0

Recently we had an outage. One of the nodes went down due to hardware failure. 
The node was unavailable for 50 minutes.

What we noticed during this time:

1. About 1/3 of the message producers (those that tried to send to the dead 
   node) started to wait 3 seconds. Then they retried another node and the 
   message was produced successfully.
2. This behavior lasted for 15 minutes. After 15 minutes it looked like no 
   producer tried to send messages to the failed node anymore.
3. After 50 minutes, when the failed node returned, it immediately started to 
   get messages from the producers again.

So actually we realized there are 2 timeouts here.

The first, the 3-second timeout, we believe we found here: 
org.apache.rocketmq.client.producer.DefaultMQProducer.sendMsgTimeout
That's fine.

But the second, the 15-minute timeout (after which the failed node is eventually 
marked as dead), we could not find anywhere...
We also looked into the RocketMQ Nameserver code, because our idea was that in 
the end it could be the Nameserver that marks the node dead, but no luck. 
:-(

Our goal would be to shorten this 15-minute timeout if possible. Given the 3rd 
observation above (when the node came back it rejoined the cluster seamlessly), 
we believe something like 5 minutes would be much better for our app.


Does anyone know whether this 15-minute timeout can be changed and, if so, 
how and where?

thanks!


--
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932

