Hi,

Once the new coordinator is elected, it becomes responsible for disconnecting nodes that stop sending heartbeats. It waits 8 times the configured nifi.cluster.protocol.heartbeat.interval before it disconnects a node. Can you confirm that this amount of time has elapsed?
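As a rough illustration of that timeout, here is a minimal Python sketch that reads the heartbeat interval from the text of a nifi.properties file and multiplies it by 8. The multiplier comes from the paragraph above; the helper names, the simplified unit table, and the "5 sec" fallback are assumptions made for this example and are not taken from NiFi's code.

```python
import re

# Multiplier stated in the reply above: 8 missed heartbeat intervals.
MISSED_HEARTBEAT_MULTIPLIER = 8

# Simplified unit table (assumption: NiFi accepts more spellings than these).
_UNIT_SECONDS = {
    "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
}


def parse_period(value):
    """Convert a period such as '5 sec' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse time period: {value!r}")
    amount, unit = match.groups()
    if unit.lower() not in _UNIT_SECONDS:
        raise ValueError(f"unsupported unit in this sketch: {unit!r}")
    return int(amount) * _UNIT_SECONDS[unit.lower()]


def disconnect_threshold_seconds(properties_text, default="5 sec"):
    """Return 8 x heartbeat interval, in seconds.

    `default` is an assumed fallback used when the property is absent or empty.
    """
    interval = default
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not key=value pairs
        key, _, value = line.partition("=")
        if key.strip() == "nifi.cluster.protocol.heartbeat.interval":
            interval = value.strip() or default
    return MISSED_HEARTBEAT_MULTIPLIER * parse_period(interval)


if __name__ == "__main__":
    sample = "nifi.cluster.protocol.heartbeat.interval=5 sec\n"
    print(disconnect_threshold_seconds(sample))
```

With an interval of 5 sec the threshold works out to 40 seconds, so a node killed less than that long ago would still be expected to show as CONNECTED.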
Did you see any messages containing "Have not received a heartbeat from node in" or "Failed to remove heartbeat for" during this time?

Can you describe your environment a little more? Are you running an external or embedded ZooKeeper?

Can you enable debug-level logging for this package?

org.apache.nifi.cluster.coordination.heartbeat

Thanks,
Matt

On Thu, May 18, 2017 at 7:30 AM, ddewaele <[email protected]> wrote:
> Hi,
>
> I have a NiFi cluster up and running and I'm testing various failover scenarios.
>
> I have 2 nodes in the cluster:
>
> - centos-a : Coordinator node / primary
> - centos-b : Cluster node
>
> I noticed in 1 of the scenarios when I killed the Cluster Coordinator node, that the following happened:
>
> centos-b couldn't contact the coordinator anymore and became the new coordinator / primary node (as expected):
>
> Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused (Connection refused)
> This node has been elected Leader for Role 'Primary Node'
> This node has been elected Leader for Role 'Cluster Coordinator'
>
> When attempting to access the UI on centos-b, I got the following error:
>
> 2017-05-18 11:18:49,368 WARN [Replicate Request Thread-2] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET /nifi-api/flow/current-user to centos-a:8080 due to {}
>
> If my understanding is correct, NiFi will try to replicate to connected nodes in the cluster. Here, centos-a was killed a while back and should have been disconnected, but as far as NiFi was concerned it was still connected.
>
> As a result I cannot access the UI anymore (due to the replication error), but I can look up the cluster info via the REST API. And sure enough, it still sees centos-a as being CONNECTED.
>
> {
>     "cluster": {
>         "generated": "11:20:13 UTC",
>         "nodes": [
>             {
>                 "activeThreadCount": 0,
>                 "address": "centos-b",
>                 "apiPort": 8080,
>                 "events": [
>                     {
>                         "category": "INFO",
>                         "message": "Node Status changed from CONNECTING to CONNECTED",
>                         "timestamp": "05/18/2017 11:17:31 UTC"
>                     },
>                     {
>                         "category": "INFO",
>                         "message": "Node Status changed from [Unknown Node] to CONNECTING",
>                         "timestamp": "05/18/2017 11:17:27 UTC"
>                     }
>                 ],
>                 "heartbeat": "05/18/2017 11:20:09 UTC",
>                 "nodeId": "a5bce78d-23ea-4435-a0dd-4b731459f1b9",
>                 "nodeStartTime": "05/18/2017 11:17:25 UTC",
>                 "queued": "8,492 / 13.22 MB",
>                 "roles": [
>                     "Primary Node",
>                     "Cluster Coordinator"
>                 ],
>                 "status": "CONNECTED"
>             },
>             {
>                 "address": "centos-a",
>                 "apiPort": 8080,
>                 "events": [],
>                 "nodeId": "b89e8418-4b7f-4743-bdf4-4a08a92f3892",
>                 "roles": [],
>                 "status": "CONNECTED"
>             }
>         ]
>     }
> }
>
> When centos-a was brought back online, I noticed the following status change:
>
> Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=15] to NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTING, updateId=19]
>
> So it went from connected -> connecting.
>
> It clearly missed the disconnected step here.
>
> When shutting down the centos-a node using nifi.sh stop, it goes into the DISCONNECTED state:
>
> Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=12] to NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect Code=Node was Shutdown, Disconnect Reason=Node was Shutdown, updateId=13]
>
> How can I debug this further, and can somebody provide some additional insights?
> I have seen nodes getting disconnected due to missing heartbeats:
>
> Status of centos-a:8080 changed from NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=10] to NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from node in 41 seconds, updateId=11]
>
> But sometimes it doesn't seem to detect this, and NiFi keeps on thinking it is CONNECTED, despite not having received heartbeats in ages.
>
> Any ideas?
>
> --
> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942.html
> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
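
To spot the situation described in the quoted message (a node reported as CONNECTED with no recent heartbeat), a small check over the cluster JSON may help. The following is a minimal Python sketch, not part of NiFi: it takes the already-parsed response in the shape shown in the quoted output and lists nodes whose status is CONNECTED but whose heartbeat is missing or older than a chosen threshold. The function name, the threshold parameter, and the timestamp format string are assumptions based only on that sample output.

```python
from datetime import datetime, timezone

# Timestamp layout inferred from the sample, e.g. "05/18/2017 11:20:09 UTC".
HEARTBEAT_FORMAT = "%m/%d/%Y %H:%M:%S UTC"


def find_suspect_nodes(cluster_json, now, max_age_seconds):
    """Return (address, reason) for CONNECTED nodes without a fresh heartbeat.

    cluster_json    -- parsed JSON shaped like the cluster output in the thread
    now             -- timezone-aware datetime to compare heartbeats against
    max_age_seconds -- hypothetical threshold, e.g. 8 x the heartbeat interval
    """
    suspects = []
    for node in cluster_json["cluster"]["nodes"]:
        if node.get("status") != "CONNECTED":
            continue  # only nodes the coordinator still believes are connected
        raw = node.get("heartbeat")
        if raw is None:
            suspects.append((node["address"], "no heartbeat recorded"))
            continue
        last = datetime.strptime(raw, HEARTBEAT_FORMAT).replace(tzinfo=timezone.utc)
        age = (now - last).total_seconds()
        if age > max_age_seconds:
            suspects.append((node["address"], f"last heartbeat {age:.0f}s ago"))
    return suspects


if __name__ == "__main__":
    # Trimmed copy of the two nodes from the quoted output.
    sample = {
        "cluster": {
            "nodes": [
                {"address": "centos-b", "status": "CONNECTED",
                 "heartbeat": "05/18/2017 11:20:09 UTC"},
                {"address": "centos-a", "status": "CONNECTED"},
            ]
        }
    }
    now = datetime(2017, 5, 18, 11, 20, 13, tzinfo=timezone.utc)
    print(find_suspect_nodes(sample, now, max_age_seconds=40))
```

Run against the trimmed sample, this flags only centos-a, which is CONNECTED but carries no heartbeat field at all.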
