Re: timeout settings related question

Attila Wind Sun, 01 May 2022 23:26:17 -0700

Thanks RongtongJin for the answers!

The failed node had a hardware failure, so it went down immediately and, of course, without a graceful shutdown.

Are you saying that the Nameserver should eliminate this node from the routing within 30 seconds in such a case? Do I understand that correctly?

Because if so, then you might be right, and this could be a rocketmq-client problem. The producers were still trying that node for 15 minutes; this is a fact we observed. Maybe it is the rocketmq-client that did not update its routing info from the Nameservers? It could be...

One more piece of information. We analyzed the broker and nameserver logs (of course) after the outage, around the time we lost the node. Surprisingly, we could not find a single line anywhere saying "hey, this node is lost" or anything similar. We are using the default log config that came with 4.7.0 and did not tailor anything there. So unfortunately these logs did not make us any smarter, and we could not prove or disprove anything about what the nameservers/brokers did when their fellow node went down :-(


thanks again,

Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932


On 02.05.2022 04:30, jinrongtong wrote:


Hi Attila Wind,

I want to know whether the failed node was completely shut down. If it was, the nameserver will weed out the failed node, and the maximum time for producers to perceive the new route info and weed the dead node out is 30 seconds by default (org.apache.rocketmq.client.ClientConfig#pollNameServerInterval). In addition, I suggest upgrading the client to as recent a version as possible.
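
To illustrate why a producer can keep selecting a dead broker until its next route poll, here is a minimal Java sketch of a client-side route cache. It is not RocketMQ's implementation: the class RouteCacheSketch, its methods, and the broker names are all hypothetical, and a Supplier stands in for the name server lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical model of a client-side route cache. This is NOT RocketMQ code;
// every name here is invented for illustration.
class RouteCacheSketch {
    private final Supplier<List<String>> nameServerView; // stands in for a name server lookup
    private final List<String> cachedBrokers = new ArrayList<>();
    private int nextIndex = 0;

    RouteCacheSketch(Supplier<List<String>> nameServerView) {
        this.nameServerView = nameServerView;
        refresh(); // initial route fetch
    }

    // One poll cycle: replace the cached list with the name server's current view.
    // A real client would run this on a timer (every 30 s by default, per the text above).
    synchronized void refresh() {
        cachedBrokers.clear();
        cachedBrokers.addAll(nameServerView.get());
    }

    // Round-robin over whatever the cache holds, stale or not.
    // Assumes the cache contains at least one broker.
    synchronized String pickBroker() {
        String chosen = cachedBrokers.get(nextIndex % cachedBrokers.size());
        nextIndex++;
        return chosen;
    }

    public static void main(String[] args) {
        List<String> view = new ArrayList<>(Arrays.asList("broker-a", "broker-b", "broker-c"));
        RouteCacheSketch cache = new RouteCacheSketch(() -> view);
        view.remove("broker-b");                 // name server drops the dead node
        System.out.println(cache.pickBroker());  // cache is stale: broker-b is still a candidate
        System.out.println(cache.pickBroker());
        cache.refresh();                         // the next poll picks up the new route
        System.out.println(cache.pickBroker());
    }
}
```

In this model the dead broker stays selectable only until the next refresh() call, so a window much longer than the poll interval would suggest that the refresh is not happening or that the name server's own view is stale.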

If you don't want the failed node to rejoin the cluster immediately when it comes back online, you can modify the configuration file (broker.conf) and set brokerPermission = 4 (write disabled, read enabled) before bringing it online, and then use the mqadmin tool's updateBrokerConfig command to change brokerPermission to 6 (write and read enabled) once it has been online for 5 minutes.
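
As a rough sketch of that procedure, the fragment below shows the broker.conf line to set before restarting the recovered node. Only brokerPermission and its two values come from the text above; where the file lives and how the broker is restarted depend on your deployment.

```properties
# broker.conf on the recovered node, edited before the broker is started
# 4 = write disabled, read enabled
brokerPermission=4
```

Once the node has been up long enough, the permission can be raised at runtime with something along the lines of `mqadmin updateBrokerConfig -n <namesrvAddr> -b <brokerAddr> -k brokerPermission -v 6`. The two addresses are placeholders, and the option letters are worth checking against `mqadmin help updateBrokerConfig` for your version.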



Regards,

RongtongJin




At 2022-05-02 00:10:17, "Attila Wind" <attilaw@swf.technology> wrote:

    Hi RMQ Users,

    We are running a 3-node RocketMQ cluster - version 4.7.0, only
    master nodes.
    Our app language is Java and we are using
    org.apache.rocketmq:rocketmq-client:4.2.0

    Recently we had an outage. One of the nodes went down due to
    hardware failure. The node was unavailable for 50 minutes.

    What we noticed during this time was:

      * ~ 1/3 of the message producers started to wait 3 seconds -
        those ones who wanted to produce the message towards the dead
        node.
        Then they retried another node and the message was produced
        successfully.
      * The above behavior was in place for 15 minutes - after 15
        minutes it looked like no producer tried to send messages to
        the failed node anymore
      * After 50 minutes, when the failed node returned, it
        immediately started to get messages from the producers again

    So actually we realized there are 2 timeouts here.

    The first, the 3 seconds timeout I believe we found it here:
    org.apache.rocketmq.client.producer.DefaultMQProducer.sendMsgTimeout
    That's fine.

    *But the 2nd, the 15 minutes timeout (when failed node is marked
    as dead eventually) we could not find anywhere...*
    We also tried to take a look into the RocketMQ Nameserver code
    because our idea was at the end it could be the Nameserver who
    marks that node dead but no luck. :-(

    Our goal would be to shorten this 15 minutes timeout if possible
    (given the 3rd observation from above that when the node came back
    it joined the cluster back seamlessly we believe something like 5
    minutes would be much better for our App)

    *Does anyone maybe know if changing this 15 minutes timeout is
    possible and if yes then how/where?*

    thanks!

--Attila Wind


    http://www.linkedin.com/in/attilaw
    Mobile: +49 176 43556932
