Re: Nifi Cluster fails to disconnect node when node was killed

Matt Gilman Thu, 18 May 2017 05:28:14 -0700

Hi,

Once the new coordinator is elected, it is responsible for disconnecting
nodes due to lack of heartbeat. It will wait 8 times the
configured nifi.cluster.protocol.heartbeat.interval before the node
is disconnected. Can you confirm that this amount of time has elapsed?
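
As a rough sketch of that arithmetic (not NiFi code: the helper name and the
small set of units it accepts are assumptions for illustration, and the
"5 sec" value is what I believe the shipped default interval to be):

```python
import re

# Hypothetical helper, not part of NiFi. It parses a duration string such as
# "5 sec" and returns how long the coordinator waits before disconnecting a
# node: 8 times the configured heartbeat interval.
# Only a few common units are handled here; this is an illustrative subset.
_UNIT_SECONDS = {"ms": 0.001, "sec": 1, "secs": 1, "min": 60, "mins": 60}


def disconnect_threshold_seconds(interval, multiplier=8):
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]+)\s*", interval)
    if match is None:
        raise ValueError(f"unrecognized duration: {interval!r}")
    value = int(match.group(1))
    unit = match.group(2).lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return value * _UNIT_SECONDS[unit] * multiplier


# Value of nifi.cluster.protocol.heartbeat.interval (assumed here).
print(disconnect_threshold_seconds("5 sec"))
```

With a 5 second interval the node should be disconnected after roughly 40
seconds without a heartbeat.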


Did you see any messages containing "Have not received a heartbeat from
node in" or "Failed to remove heartbeat for" during this time? Can you
describe your environment a little more? Are you running an external or
embedded zookeeper?

Can you enable debug level logging for this package?
org.apache.nifi.cluster.coordination.heartbeat
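
A sketch of what that could look like, assuming the stock logback setup and
that the logger is added inside the <configuration> element of
conf/logback.xml (adjust to your install):

```xml
<!-- Assumed location: conf/logback.xml, next to the existing <logger> entries. -->
<logger name="org.apache.nifi.cluster.coordination.heartbeat" level="DEBUG"/>
```

In a default install I believe this file is rescanned periodically, so a
restart should not be needed, but please verify that on your setup.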

Thanks

Matt

On Thu, May 18, 2017 at 7:30 AM, ddewaele <[email protected]> wrote:

> Hi,
>
> I have a NiFi cluster up and running and I'm testing various failover
> scenarios.
>
> I have 2 nodes in the cluster :
>
> - centos-a : Coordinator node / primary
> - centos-b : Cluster node
>
> I noticed in 1 of the scenarios when I killed the Cluster Coordinator node,
> that the following happened :
>
> centos-b couldn't contact the coordinator anymore and became the new
> coordinator / primary node. (as expected) :
>
> Failed to send heartbeat due to:
> org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message
> to Cluster Coordinator due to: java.net.ConnectException: Connection
> refused
> (Connection refused)
> This node has been elected Leader for Role 'Primary Node'
> This node has been elected Leader for Role 'Cluster Coordinator'
>
> When attempting to access the UI on centos-b, I got the following error :
>
> 2017-05-18 11:18:49,368 WARN [Replicate Request Thread-2]
> o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET
> /nifi-api/flow/current-user to centos-a:8080 due to {}
>
> If my understanding is correct, NiFi will try to replicate to connected
> nodes in the cluster. Here, centos-a was killed a while back and should have
> been disconnected, but as far as NiFi was concerned it was still connected.
>
> As a result I cannot access the UI anymore (due to the replication error),
> but I can look up the cluster info via the REST API. And sure enough, it
> still sees centos-a as being CONNECTED.
>
> {
>     "cluster": {
>         "generated": "11:20:13 UTC",
>         "nodes": [
>             {
>                 "activeThreadCount": 0,
>                 "address": "centos-b",
>                 "apiPort": 8080,
>                 "events": [
>                     {
>                         "category": "INFO",
>                         "message": "Node Status changed from CONNECTING to CONNECTED",
>                         "timestamp": "05/18/2017 11:17:31 UTC"
>                     },
>                     {
>                         "category": "INFO",
>                         "message": "Node Status changed from [Unknown Node] to CONNECTING",
>                         "timestamp": "05/18/2017 11:17:27 UTC"
>                     }
>                 ],
>                 "heartbeat": "05/18/2017 11:20:09 UTC",
>                 "nodeId": "a5bce78d-23ea-4435-a0dd-4b731459f1b9",
>                 "nodeStartTime": "05/18/2017 11:17:25 UTC",
>                 "queued": "8,492 / 13.22 MB",
>                 "roles": [
>                     "Primary Node",
>                     "Cluster Coordinator"
>                 ],
>                 "status": "CONNECTED"
>             },
>             {
>                 "address": "centos-a",
>                 "apiPort": 8080,
>                 "events": [],
>                 "nodeId": "b89e8418-4b7f-4743-bdf4-4a08a92f3892",
>                 "roles": [],
>                 "status": "CONNECTED"
>             }
>         ]
>     }
> }
>
> When centos-a was brought back online, I noticed the following status change:
>
> Status of centos-a:8080 changed from
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=15]
> to
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTING, updateId=19]
>
> So it went from connected -> connecting.
>
> It clearly missed the disconnected step here.
>
> When shutting down the centos-a node using nifi.sh stop, it goes into the
> DISCONNECTED state :
>
> Status of centos-a:8080 changed from
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=12]
> to
> NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect
> Code=Node was Shutdown, Disconnect Reason=Node was Shutdown, updateId=13]
>
> How can I debug this further, and can somebody provide some additional
> insights? I have seen nodes getting disconnected due to missing heartbeats:
>
> Status of centos-a:8080 changed from
> NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=10]
> to
> NodeConnectionStatus[nodeId=centos-a:8080, state=DISCONNECTED, Disconnect
> Code=Lack of Heartbeat, Disconnect Reason=Have not received a heartbeat from
> node in 41 seconds, updateId=11]
>
> But sometimes it doesn't seem to detect this, and NiFi keeps on thinking it
> is CONNECTED, despite not having received heartbeats in ages.
>
> Any ideas ?
>
>
>
> --
> View this message in context:
> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942.html
> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
>
