That's the whole problem from my perspective: it stays CONNECTED. It never becomes DISCONNECTED. You can't delete it from the API in 1.2.0.
That's why I said it was a single point of failure. The exact semantics of calling it a single point of failure might be debatable, but the fact that the cluster can't be modified and/or gracefully shut down (afaik) is what I was referring to.

On Fri, May 19, 2017 at 12:40 PM, Joe Witt <[email protected]> wrote:

> I believe at the state you describe that down node is now considered
> disconnected. The cluster behavior prohibits you from making changes when
> it knows not all members of the cluster can honor the change. If you
> are sure you want to make the changes anyway and move on without that node
> you should be able to remove it/delete it from the cluster. Now you have a
> cluster of two connected nodes and you can make changes.
>
> On May 19, 2017 12:23 PM, "Neil Derraugh" <neil.derraugh@intellifylearning.com> wrote:
>
>> That's fair. But for the sake of total clarity on my own part, after one
>> of these disaster scenarios with a newly quorum-elected primary things
>> cannot be driven through the UI and at least through parts of the REST API.
>>
>> I just ran through the following. We have 3 nodes A, B, C with A
>> primary, and A becomes unreachable without first disconnecting. Then B and
>> C may (I haven't verified) continue operating the flow they had in the
>> cluster's last "good" state. But they do elect a new primary, as per the
>> REST nifi-api/controller/cluster response. But now the flow can't be
>> changed, and in some cases it can't be reported on either, i.e. some GETs
>> fail, like nifi-api/flow/process-groups/root.
>>
>> Are we describing the same behavior?
>>
>> On Fri, May 19, 2017 at 11:12 AM, Joe Witt <[email protected]> wrote:
>>
>>> If there is no longer a quorum then we cannot drive things from the UI
>>> but the cluster remaining is intact from a functioning point of view
>>> other than being able to assign a primary to handle the one-off items.
>>>
>>> On Fri, May 19, 2017 at 11:04 AM, Neil Derraugh
>>> <[email protected]> wrote:
>>> > Hi Joe,
>>> >
>>> > Maybe I'm missing something, but if the primary node suffers a network
>>> > partition or container/vm/machine loss or becomes otherwise unreachable then
>>> > the cluster is unusable, at least from the UI.
>>> >
>>> > If that's not so please correct me.
>>> >
>>> > Thanks,
>>> > Neil
>>> >
>>> > On Thu, May 18, 2017 at 9:56 PM, Joe Witt <[email protected]> wrote:
>>> >>
>>> >> Neil,
>>> >>
>>> >> Want to make sure I understand what you're saying. What are you stating
>>> >> is a single point of failure?
>>> >>
>>> >> Thanks
>>> >> Joe
>>> >>
>>> >> On Thu, May 18, 2017 at 5:27 PM, Neil Derraugh
>>> >> <[email protected]> wrote:
>>> >> > Thanks for the insight Matt.
>>> >> >
>>> >> > It's a disaster recovery issue. It's not something I plan on doing on
>>> >> > purpose. It seems it is a single point of failure unfortunately. I can see
>>> >> > no other way to resolve the issue other than to blow everything away and
>>> >> > start a new cluster.
>>> >> >
>>> >> > On Thu, May 18, 2017 at 2:49 PM, Matt Gilman <[email protected]> wrote:
>>> >> >>
>>> >> >> Neil,
>>> >> >>
>>> >> >> Disconnecting a node prior to removal is the correct process. It appears
>>> >> >> that the check was lost going from 0.x to 1.x. Folks reported this JIRA [1]
>>> >> >> indicating that deleting a connected node did not work. This process does
>>> >> >> not work because the node needs to be disconnected first. The JIRA was
>>> >> >> addressed by restoring the check that a node is disconnected prior to
>>> >> >> deletion.
>>> >> >>
>>> >> >> Hopefully the JIRA I filed earlier today [2] will address the phantom node
>>> >> >> you were seeing. Until then, can you update your workaround to disconnect
>>> >> >> the node in question prior to deletion?
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >> Matt
>>> >> >>
>>> >> >> [1] https://issues.apache.org/jira/browse/NIFI-3295
>>> >> >> [2] https://issues.apache.org/jira/browse/NIFI-3933
>>> >> >>
>>> >> >> On Thu, May 18, 2017 at 12:29 PM, Neil Derraugh
>>> >> >> <[email protected]> wrote:
>>> >> >>>
>>> >> >>> Pretty sure this is the problem I was describing in the "Phantom Node"
>>> >> >>> thread recently.
>>> >> >>>
>>> >> >>> If I kill non-primary nodes the cluster remains healthy despite the lost
>>> >> >>> nodes. The terminated nodes end up with a DISCONNECTED status.
>>> >> >>>
>>> >> >>> If I kill the primary it winds up with a CONNECTED status, but a new
>>> >> >>> primary/cluster coordinator gets elected too.
>>> >> >>>
>>> >> >>> Additionally it seems in 1.2.0 that the REST API no longer supports
>>> >> >>> deleting a node in a CONNECTED state (Cannot remove Node with ID
>>> >> >>> 1780fde7-c2f4-469c-9884-fe843eac5b73 because it is not disconnected,
>>> >> >>> current state = CONNECTED). So right now I don't have a workaround and have
>>> >> >>> to kill all the nodes and start over.
>>> >> >>>
>>> >> >>> On Thu, May 18, 2017 at 11:20 AM, Mark Payne <[email protected]> wrote:
>>> >> >>>>
>>> >> >>>> Hello,
>>> >> >>>>
>>> >> >>>> Just looking through this thread now. I believe that I understand the
>>> >> >>>> problem. I have updated the JIRA with details about what I think is the
>>> >> >>>> problem and a potential remedy for the problem.
>>> >> >>>>
>>> >> >>>> Thanks
>>> >> >>>> -Mark
>>> >> >>>>
>>> >> >>>> > On May 18, 2017, at 9:49 AM, Matt Gilman <[email protected]> wrote:
>>> >> >>>> >
>>> >> >>>> > Thanks for the additional details. They will be helpful when working
>>> >> >>>> > the JIRA. All nodes, including the coordinator, heartbeat to the active
>>> >> >>>> > coordinator.
>>> >> >>>> > This means that the coordinator effectively heartbeats to
>>> >> >>>> > itself. It appears, based on your log messages, that this is not
>>> >> >>>> > happening. Because no heartbeats were received from any node, the lack
>>> >> >>>> > of heartbeats from the terminated node is not considered.
>>> >> >>>> >
>>> >> >>>> > Matt
>>> >> >>>> >
>>> >> >>>> > Sent from my iPhone
>>> >> >>>> >
>>> >> >>>> >> On May 18, 2017, at 8:30 AM, ddewaele <[email protected]> wrote:
>>> >> >>>> >>
>>> >> >>>> >> Found something interesting in the centos-b debug logging....
>>> >> >>>> >>
>>> >> >>>> >> After centos-a (the coordinator) is killed centos-b takes over. Notice how
>>> >> >>>> >> it "Will not disconnect any nodes due to lack of heartbeat" and how
>>> >> >>>> >> it still sees centos-a as connected despite the fact that there are no
>>> >> >>>> >> heartbeats anymore.
>>> >> >>>> >>
>>> >> >>>> >> 2017-05-18 12:41:38,010 INFO [Leader Election Notification Thread-2] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
>>> >> >>>> >> 2017-05-18 12:41:38,010 DEBUG [Leader Election Notification Thread-2] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Purging old heartbeats
>>> >> >>>> >> 2017-05-18 12:41:38,014 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node has been elected Primary Node
>>> >> >>>> >> 2017-05-18 12:41:38,353 DEBUG [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Received no new heartbeats.
>>> >> >>>> >> Will not disconnect any nodes due to lack of heartbeat
>>> >> >>>> >> 2017-05-18 12:41:41,336 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>> >> >>>> >> 2017-05-18 12:41:41,337 DEBUG [Process Cluster Protocol Request-3] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >> >>>> >>
>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> >>>> >> Difference: []
>>> >> >>>> >>
>>> >> >>>> >> 2017-05-18 12:41:41,337 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 410e7db5-8bb0-4f97-8ee8-fc8647c54959 (type=HEARTBEAT, length=2341 bytes) from centos-b:8080 in 3 millis
>>> >> >>>> >> 2017-05-18 12:41:41,339 INFO [Clustering Tasks Thread-2] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-18 12:41:41,330 and sent to centos-b:10001 at 2017-05-18 12:41:41,339; send took 8 millis
>>> >> >>>> >> 2017-05-18 12:41:43,354 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 1 heartbeats in 93276 nanos
>>> >> >>>> >> 2017-05-18
>>> >> >>>> >> 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor Received new heartbeat from centos-b:8080
>>> >> >>>> >> 2017-05-18 12:41:46,346 DEBUG [Process Cluster Protocol Request-4] o.a.n.c.c.h.ClusterProtocolHeartbeatMonitor
>>> >> >>>> >>
>>> >> >>>> >> Calculated diff between current cluster status and node cluster status as follows:
>>> >> >>>> >> Node: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> >>>> >> Self: [NodeConnectionStatus[nodeId=centos-b:8080, state=CONNECTED, updateId=45], NodeConnectionStatus[nodeId=centos-a:8080, state=CONNECTED, updateId=42]]
>>> >> >>>> >> Difference: []
>>> >> >>>> >>
>>> >> >>>> >> --
>>> >> >>>> >> View this message in context:
>>> >> >>>> >> http://apache-nifi-users-list.2361937.n4.nabble.com/Nifi-Cluster-fails-to-disconnect-node-when-node-was-killed-tp1942p1950.html
>>> >> >>>> >> Sent from the Apache NiFi Users List mailing list archive at Nabble.com.
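
For anyone trying to picture why the killed coordinator stays CONNECTED while a killed non-primary node ends up DISCONNECTED, here is a rough toy model in Python. It is not NiFi's code. The class name `ToyHeartbeatMonitor`, the 8 second timeout, and the rule that only nodes with a recorded heartbeat get examined are all assumptions chosen to be consistent with the log lines quoted in this thread ("Purging old heartbeats", "Received no new heartbeats. Will not disconnect any nodes due to lack of heartbeat").

```python
from dataclasses import dataclass, field
from typing import Dict

CONNECTED = "CONNECTED"
DISCONNECTED = "DISCONNECTED"


@dataclass
class ToyHeartbeatMonitor:
    """Hypothetical, simplified coordinator-side heartbeat check (not NiFi code)."""
    timeout_s: float
    states: Dict[str, str] = field(default_factory=dict)       # node id -> state
    last_seen: Dict[str, float] = field(default_factory=dict)  # node id -> last heartbeat time
    new_since_check: bool = False

    def purge(self) -> None:
        # Models "Purging old heartbeats" on election: states survive, heartbeats do not.
        self.last_seen.clear()
        self.new_since_check = False

    def receive(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now
        self.states[node_id] = CONNECTED
        self.new_since_check = True

    def check(self, now: float) -> bool:
        # Guard from the log: with no new heartbeats, nobody is disconnected.
        if not self.new_since_check:
            return False
        self.new_since_check = False
        # Assumption: only nodes that have a recorded heartbeat are examined.
        for node_id, seen in self.last_seen.items():
            if self.states.get(node_id) == CONNECTED and now - seen > self.timeout_s:
                self.states[node_id] = DISCONNECTED
        return True


def kill_coordinator_scenario() -> Dict[str, str]:
    # centos-a (coordinator) dies; centos-b is elected and inherits the node states.
    m = ToyHeartbeatMonitor(timeout_s=8.0)
    m.states = {"centos-a:8080": CONNECTED, "centos-b:8080": CONNECTED}
    m.purge()
    m.check(now=0.0)  # no new heartbeats yet, so nothing is disconnected
    for t in (3.0, 8.0, 13.0, 18.0):
        m.receive("centos-b:8080", now=t)  # the new coordinator heartbeats to itself
        m.check(now=t + 2.0)
    return m.states


def kill_worker_scenario() -> Dict[str, str]:
    # centos-a stays coordinator; centos-b dies after its first heartbeat.
    m = ToyHeartbeatMonitor(timeout_s=8.0)
    m.receive("centos-a:8080", now=0.0)
    m.receive("centos-b:8080", now=0.0)
    m.check(now=1.0)
    for t in (5.0, 10.0, 15.0):
        m.receive("centos-a:8080", now=t)
        m.check(now=t + 1.0)
    return m.states
```

In this model `kill_coordinator_scenario()` leaves centos-a stuck in CONNECTED because the new coordinator never holds a heartbeat for it to time out, while `kill_worker_scenario()` moves centos-b to DISCONNECTED once its last heartbeat ages past the timeout. Whether the real monitor behaves this way is for the JIRA to settle; this only illustrates the asymmetry reported above.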
