Can you share the logs?

2018-07-02 20:54 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH <[email protected]>:
> Ok, I did the following PoC real quick.
>
> 1. Two nodes, started. And joined. Topology snapshot servers=2.
> 2. On one node, I blocked the Ignite ports (47500, 47100 etc).
> 3. Then after failureDetectionTimeout, it logged NODE_FAILED, and
>    Topology snapshot servers=1 on each node.
> 4. Then after 10-15 seconds, I unblocked those ports.
> 5. Then after a few seconds, both nodes logged Node joined, and
>    Topology snapshot servers=2.
>
> So it’s the same node, same ID, because the JVM is still up and running. And it
> looks like it doesn’t forget.
>
> Can this “10-15 seconds” be any length of time? Even after 1-2 hours, if the
> node comes back, can it rejoin?
>
> *From:* Evgenii Zhuravlev [mailto:[email protected]]
> *Sent:* Monday, July 02, 2018 1:25 PM
> *To:* [email protected]
> *Subject:* Re: How long Ignite retries upon NODE_FAILED events
>
> If the cluster has already decided that the node failed, the node will be
> stopped once it tries to reconnect to the cluster with the same ID.
>
> 2018-07-02 18:37 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH <[email protected]>:
>
> Yes, failureDetectionTimeout determines how long it waits before marking a
> node as failed. But my question is: after such a node failure has happened,
> what happens when that failed node becomes reachable on the network again
> (in less than failureDetectionTimeout)?
>
> *From:* Evgenii Zhuravlev [mailto:[email protected]]
> *Sent:* Monday, July 02, 2018 11:05 AM
> *To:* [email protected]
> *Subject:* Re: How long Ignite retries upon NODE_FAILED events
>
> Hi,
>
> By default, Ignite uses a mechanism that can be configured using
> failureDetectionTimeout:
> https://apacheignite.readme.io/v2.5/docs/tcpip-discovery#section-failure-detection-timeout
>
> Evgenii
>
> 2018-07-02 16:40 GMT+03:00 HEWA WIDANA GAMAGE, SUBASH <[email protected]>:
>
> Hi team,
>
> For example, let’s say one of the nodes is not down (JVM is up), but the
> network is not reachable from/to it. Then the rest of the nodes will see
> NODE_FAILED and start working as normal with a reduced cluster size. If the
> network from/to that failed node becomes normal again after X minutes, then:
>
> - Will the other nodes discover it, or will that node be able to figure it
>   out?
> - How long can X be at max? Is there a max retry count or timeout? (I have
>   seen the joinTimeout param in discovery, but that seems only applicable at
>   startup, like how long it should pause when starting the node to let it
>   join the others.)
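
When collecting the logs, it may also help to record from the other node whether the blocked ports were really unreachable at steps 2 and 4 of the PoC. Below is a minimal sketch using only the JDK, with no Ignite API involved. The class name PortProbe, the helper isReachable, the default host and the 2-second connect timeout are all assumptions made for illustration; only the port numbers 47500 and 47100 come from the PoC above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable: all count as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the other node's host as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Ports taken from the PoC in this thread.
        int[] ports = {47500, 47100};
        for (int port : ports) {
            System.out.printf("%s:%d reachable=%b%n",
                host, port, isReachable(host, port, 2000));
        }
    }
}
```

Running it once while the ports are blocked and once after unblocking gives reachability results that can be lined up against the NODE_FAILED and Node joined entries in the logs. It only shows that a TCP connection can be opened; it says nothing about what Ignite discovery does with that connection.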
