Thanks for the insight, Matt. This is a disaster recovery issue, not something I plan on doing on purpose. Unfortunately, it seems to be a single point of failure. I can see no way to resolve the issue other than to blow everything away and start a new cluster.
On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <[email protected]> wrote:

> Neil,
>
> Disconnecting a node prior to removal is the correct process. It appears
> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
> indicating that deleting a connected node did not work. This process does
> not work because the node needs to be disconnected first. The JIRA was
> addressed by restoring the check that a node is disconnected prior to
> deletion.
>
> Hopefully the JIRA I filed earlier today [2] will address the phantom node
> you were seeing. Until then, can you update your workaround to disconnect
> the node in question prior to deletion?
>
> Thanks
>
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-3295
> [2] https://issues.apache.org/jira/browse/NIFI-3933
>
> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
> <neil.derraugh@intellifylearning.com> wrote:
>
>> Pretty sure this is the problem I was describing in the "Phantom Node"
>> thread recently.
>>
>> If I kill non-primary nodes the cluster remains healthy despite the lost
>> nodes. The terminated nodes end up with a DISCONNECTED status.
>>
>> If I kill the primary it winds up with a CONNECTED status, but a new
>> primary/cluster coordinator gets elected too.
>>
>> Additionally it seems in 1.2.0 that the REST API no longer supports
>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>> current state = CONNECTED). So right now I don't have a workaround and
>> have to kill all the nodes and start over.
>>
>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Just looking through this thread now. I believe that I understand the
>>> problem. I have updated the JIRA with details about what I think is the
>>> problem and a potential remedy for the problem.
>>>
>>> Thanks
>>> -Mark
>>>
>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <[email protected]> wrote:
>>> >
>>> > Thanks for the additional details. They will be helpful when working
>>> > the JIRA. All nodes, including the coordinator, heartbeat to the active
>>> > coordinator. This means that the coordinator effectively heartbeats to
>>> > itself. It appears, based on your log messages, that this is not
>>> > happening. Because no heartbeats were received from any node, the lack
>>> > of heartbeats from the terminated node is not considered.
>>> >
>>> > Matt
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On May 18, 2017, at 8:30 AM, ddewaele <[email protected]> wrote:
>>> >>
>>> >> Found something interesting in the centos-b debug logging....
>>> >>
>>> >> after centos-a (the coordinator) is killed centos-b takes over. Notice how
>>> >> it "Will not disconnect any nodes due to lack of heartbeat" and how it still
>>> >> sees centos-a as connected despite the fact that there are no heartbeats
>>> >> anymore.
>>> >>
>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat
>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >>
>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Difference: []
>>> >>
>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >>
>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> Difference: []
>>> >>
>>> >> --
>>> >> View this message in context: http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
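
For anyone reading this thread later, here is a rough toy model of the guard Matt describes in the quoted text ("Because no heartbeats were received from any node, the lack of heartbeats from the terminated node is not considered"). It is only a sketch: the class, method names, and timeout value are hypothetical, it is not taken from the NiFi source, and it does not try to reproduce the full bug tracked in NIFI-3933.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitorModel:
    """Toy model of a heartbeat monitor; NOT NiFi's actual implementation."""

    timeout_seconds: float
    node_states: dict = field(default_factory=dict)     # node id -> state string
    last_heartbeat: dict = field(default_factory=dict)  # node id -> last heartbeat time
    new_heartbeats: int = 0                             # heartbeats since last check

    def register(self, node_id: str, now: float) -> None:
        # Seed a node as CONNECTED without counting it as a new heartbeat.
        self.node_states[node_id] = "CONNECTED"
        self.last_heartbeat[node_id] = now

    def receive_heartbeat(self, node_id: str, now: float) -> None:
        self.node_states[node_id] = "CONNECTED"
        self.last_heartbeat[node_id] = now
        self.new_heartbeats += 1

    def check(self, now: float) -> list:
        # The guard from the thread: with no new heartbeats at all, the
        # monitor skips the staleness check, so nothing gets disconnected.
        if self.new_heartbeats == 0:
            return []
        self.new_heartbeats = 0
        stale = [
            node_id
            for node_id, state in self.node_states.items()
            if state == "CONNECTED"
            and now - self.last_heartbeat[node_id] > self.timeout_seconds
        ]
        for node_id in stale:
            self.node_states[node_id] = "DISCONNECTED"
        return stale


monitor = HeartbeatMonitorModel(timeout_seconds=10.0)
monitor.register("centos-a:8080", now=0.0)
monitor.register("centos-b:8080", now=0.0)

# centos-a is killed; nobody heartbeats, so the check is skipped entirely.
skipped = monitor.check(now=100.0)

# Once any heartbeat arrives, this toy model does evaluate staleness.
monitor.receive_heartbeat("centos-b:8080", now=101.0)
disconnected = monitor.check(now=102.0)
```

In this model the killed node stays CONNECTED for as long as no heartbeat at all reaches the monitor, which matches the "Will not disconnect any nodes due to lack of heartbeat" message in the log. Whether the real monitor behaves like the second half of the example after a coordinator change is exactly what the JIRA is about, so treat that part as an assumption.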
