Thank you, RongtongJin!

Aha!! Based on your log pattern below I found the entries in namesrv.log! For some reason - even though the machine clocks are accurate and use the appropriate TZ - the timestamps in namesrv.log are shifted by +6 hours. So we were simply looking at the wrong place... arrrggghhhh

And you are right! Based on this I can confirm the Nameservers dropped the failed node very quickly - within ~30 secs.

So it looks like, for some reason, the producers were not updated with the modified routing info from the Nameservers quickly enough, but only much later.

thank you for your help!

From this point on we really have to focus on this delay and figure out why the routing info was updated with such a big latency.

I will come back to this thread if we figure this out, so others searching the archives in the future can find the updates too.

cheers

Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932


On 02.05.2022 09:49, jinrongtong wrote:
Hi Attila Wind, the Nameserver will eliminate the failed node from the routing immediately if the channel is destroyed, and the client will update the routing every 30 secs. So maybe the client didn't update the routing.

If the channel was destroyed, we can find log entries (by default in ~/logs/rocketmqlogs/namesrv.log) like "the broker's channel destroyed, {}, clean it's data structure at once" or "remove brokerAddr[{}, {}] from brokerAddrTable, because channel destroyed".

At 2022-05-02 14:26:03, "Attila Wind" <attilaw@swf.technology> wrote:

    Thanks RongtongJin for the answers!

    The failed node had a hardware failure, so it went down immediately
    and without a graceful shutdown, of course.

    Are you saying that the Nameserver should eliminate this node from
    the routing within 30 secs in such a case? Do I get that correctly?

    Because if I do, then from that point you might be right and this
    could be a rocketmq-client problem. I mean, the producers were
    still trying that node for 15 minutes - this is a fact we saw.
    Maybe it is the rocketmq-client which did not update the routing
    info from the Nameservers? It could be...

    Just one more piece of info here. We analyzed the broker and
    nameserver logs (of course) after the outage, around the timestamp
    we lost the node. But surprisingly we could not see a single line
    anywhere along the lines of "hey, this node is lost" or a similar
    message. We are using the default log config that came with 4.7.0
    and did not tailor anything there.
    So unfortunately we are none the wiser from these logs and could
    not prove/disprove anything about what the nameservers/brokers did
    when their fellow node went down :-(

    thanks again,

    Attila Wind

    http://www.linkedin.com/in/attilaw
    Mobile: +49 176 43556932


    On 02.05.2022 04:30, jinrongtong wrote:

    Hi Attila Wind,


    I want to know whether the failed node was completely shut down.
    If it was completely shut down, the nameserver will weed out the
    failed node, and the maximum time for producers to perceive the
    new route info and weed the dead node out is 30 seconds by default
    (org.apache.rocketmq.client.ClientConfig#pollNameServerInterval).
    In addition, I suggest upgrading the client version as much as
    possible.
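
    For example, the producer can shorten this interval, since
    DefaultMQProducer inherits it from ClientConfig - a minimal
    sketch (the group name and nameserver address below are just
    placeholders):

        import org.apache.rocketmq.client.exception.MQClientException;
        import org.apache.rocketmq.client.producer.DefaultMQProducer;

        public class RouteRefreshExample {
            public static void main(String[] args) throws MQClientException {
                // Placeholder group name and nameserver address.
                DefaultMQProducer producer =
                        new DefaultMQProducer("example_producer_group");
                producer.setNamesrvAddr("127.0.0.1:9876");

                // Poll the nameservers for fresh route info every 10 secs
                // instead of the default 30 secs (value in milliseconds).
                producer.setPollNameServerInterval(10 * 1000);

                producer.start();
            }
        }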


    If you don't want the failed node to join the cluster immediately
    when it comes online again, you can modify the configuration file
    (broker.conf) and set brokerPermission=4 (write disabled, read
    enabled) before bringing it online, and then use the mqadmin
    updateBrokerConfig command to change brokerPermission to 6 (write
    and read enabled) 5 minutes after it is online.
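
    For illustration, the sequence could look like this (a sketch;
    the broker and nameserver addresses below are just placeholders):

        # broker.conf on the returning node, before starting it:
        brokerPermission=4

        # ~5 minutes after it is back online, re-enable writes:
        sh mqadmin updateBrokerConfig -b 192.168.0.10:10911 \
            -n 192.168.0.20:9876 -k brokerPermission -v 6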


    Regards,

    RongtongJin




    At 2022-05-02 00:10:17, "Attila Wind" <attilaw@swf.technology> wrote:

        Hi RMQ Users,

        We are running a 3-node RocketMQ cluster - version 4.7.0,
        master nodes only.
        Our app language is Java and we are using
        org.apache.rocketmq:rocketmq-client:4.2.0

        Recently we had an outage. One of the nodes went down due to
        hardware failure. The node was unavailable for 50 minutes.

        What we noticed during this time was:

          * ~1/3 of the message producers started to wait 3 seconds
            - the ones that wanted to produce a message towards the
            dead node.
            Then they retried another node and the message was
            produced successfully.
          * The above behavior was in place for 15 minutes - after 15
            minutes it looked like no producer tried to send messages
            to the failed node anymore
          * After 50 minutes, when the failed node returned, it
            immediately started to get messages from the producers again

        So actually we realized there are 2 timeouts here.

        The first one, the 3-second timeout, I believe we found here:
        org.apache.rocketmq.client.producer.DefaultMQProducer#sendMsgTimeout
        That's fine.
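
        For reference, this knob (and what I believe is the related
        send retry count) can be set on the producer - a minimal
        sketch, the group name is a placeholder:

            import org.apache.rocketmq.client.producer.DefaultMQProducer;

            public class SendTimeoutExample {
                public static void main(String[] args) {
                    DefaultMQProducer producer =
                            new DefaultMQProducer("example_producer_group");
                    // Per-send timeout, 3000 ms by default
                    // (DefaultMQProducer#sendMsgTimeout).
                    producer.setSendMsgTimeout(3000);
                    // How many additional brokers a sync send tries on
                    // failure, 2 by default (retryTimesWhenSendFailed).
                    producer.setRetryTimesWhenSendFailed(2);
                }
            }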

        *But the 2nd one, the 15-minute timeout (after which the
        failed node is eventually marked as dead), we could not find
        anywhere...*
        We also tried to take a look at the RocketMQ Nameserver code,
        because our idea was that in the end it could be the
        Nameserver that marks the node dead, but no luck. :-(

        Our goal would be to shorten this 15-minute timeout if
        possible (given the 3rd observation above - that when the
        node came back it rejoined the cluster seamlessly - we
        believe something like 5 minutes would be much better for
        our app).

        *Does anyone know whether changing this 15-minute timeout is
        possible, and if yes, how/where?*

        thanks!

        -- Attila Wind

        http://www.linkedin.com/in/attilaw
        Mobile: +49 176 43556932





