Thanks for the insight, Matt.

It's a disaster recovery issue, not something I plan on doing on
purpose.  Unfortunately it seems to be a single point of failure.  I can
see no way to resolve the issue other than to blow everything away
and start a new cluster.
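
For reference, the disconnect-then-delete sequence Matt describes in the
quoted message below could be scripted roughly as follows.  This is only
a sketch under assumptions: it assumes an unsecured cluster, that the
1.x REST API exposes nodes under /controller/cluster/nodes/{id}, and
that a PUT with status DISCONNECTING requests the disconnect.  The base
URL and the helper names are made up for illustration, so check the
REST API docs for your version.  Whether this succeeds for a killed
node that is stuck in CONNECTED is exactly the open question in this
thread.

```python
import json
import urllib.request

# Assumption: REST base URL of a node that is still running.
BASE_URL = "http://centos-b:8080/nifi-api"


def build_disconnect_request(node_id, base_url=BASE_URL):
    """Build the PUT request that asks the cluster to disconnect a node."""
    # Assumed payload shape: a node entity carrying the requested status.
    body = {"node": {"nodeId": node_id, "status": "DISCONNECTING"}}
    return urllib.request.Request(
        url=f"{base_url}/controller/cluster/nodes/{node_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_delete_request(node_id, base_url=BASE_URL):
    """Build the DELETE request that removes an already disconnected node."""
    return urllib.request.Request(
        url=f"{base_url}/controller/cluster/nodes/{node_id}",
        method="DELETE",
    )


def remove_node(node_id, base_url=BASE_URL):
    """Disconnect first, then delete. This performs real HTTP calls."""
    requests = (
        build_disconnect_request(node_id, base_url),
        build_delete_request(node_id, base_url),
    )
    for request in requests:
        with urllib.request.urlopen(request) as response:
            print(request.get_method(), request.full_url, response.status)
```

Calling remove_node("1780fde7-c2f4-469c-9884-fe843eac5b73") would issue
the PUT and then the DELETE.  The delete may still be rejected if the
node has not reached DISCONNECTED by the time it is sent, so a real
script would probably poll the node state between the two calls.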

On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <[email protected]>
wrote:

> Neil,
>
> Disconnecting a node prior to removal is the correct process. It appears
> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
> indicating that deleting a connected node did not work. This process does
> not work because the node needs to be disconnected first. The JIRA was
> addressed by restoring the check that a node is disconnected prior to
> deletion.
>
> Hopefully the JIRA I filed earlier today [2] will address the phantom node
> you were seeing. Until then, can you update your workaround to disconnect
> the node in question prior to deletion?
>
> Thanks
>
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-3295
> [2] https://issues.apache.org/jira/browse/NIFI-3933
>
> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
> <neil.derraugh@intellifylearning.com> wrote:
>
>> Pretty sure this is the problem I was describing in the "Phantom Node"
>> thread recently.
>>
>> If I kill non-primary nodes the cluster remains healthy despite the lost
>> nodes.  The terminated nodes end up with a DISCONNECTED status.
>>
>> If I kill the primary it winds up with a CONNECTED status, but a new
>> primary/cluster coordinator gets elected too.
>>
>> Additionally it seems in 1.2.0 that the REST API no longer supports
>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>> current state = CONNECTED).  So right now I don't have a workaround and
>> have to kill all the nodes and start over.
>>
>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> Just looking through this thread now. I believe that I understand the
>>> problem. I have updated the JIRA with details about what I think is the
>>> problem and a potential remedy for the problem.
>>>
>>> Thanks
>>> -Mark
>>>
>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <[email protected]> wrote:
>>> >
>>> > Thanks for the additional details. They will be helpful when working
>>> > the JIRA. All nodes, including the coordinator, heartbeat to the active
>>> > coordinator. This means that the coordinator effectively heartbeats to
>>> > itself. It appears, based on your log messages, that this is not
>>> > happening. Because no heartbeats were received from any node, the lack
>>> > of heartbeats from the terminated node is not considered.
>>> >
>>> > Matt
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On May 18, 2017, at 8:30 AM, ddewaele <[email protected]> wrote:
>>> >>
>>> >> Found something interesting in the centos-b debug logging....
>>> >>
>>> >> After centos-a (the coordinator) is killed, centos-b takes over.
>>> >> Notice how it "Will not disconnect any nodes due to lack of heartbeat"
>>> >> and how it still sees centos-a as connected despite the fact that
>>> >> there are no heartbeats anymore.
>>> >>
>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >>
>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Difference: []
>>> >>
>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >>
>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Difference: []
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
>>>
>>>
>>
>
