I have a 3-node NiFi (1.11.4) cluster running in Kubernetes as a StatefulSet, 
using an external ZooKeeper ensemble (also 3 nodes) to manage cluster state.
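For context, the cluster/ZooKeeper section of my nifi.properties looks roughly 
like this (the nifi-0/nifi-hs hostnames are illustrative of the StatefulSet 
naming in my namespace; the ZooKeeper connect string matches the ensemble that 
shows up in the logs below):

    nifi.cluster.is.node=true
    # stable per-pod DNS name from the headless service (illustrative)
    nifi.cluster.node.address=nifi-0.nifi-hs.ki.svc.cluster.local
    nifi.cluster.node.protocol.port=11443
    nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
    nifi.zookeeper.root.node=/nifi
    nifi.cluster.flow.election.max.wait.time=1 min
    nifi.cluster.flow.election.max.candidates=3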

Whenever even one node (pod/container) goes down or is restarted, it can throw 
the whole cluster into a bad state that forces me to restart ALL of the pods in 
order to recover.  This seems wrong.  The problem appears to be that when the 
primary node goes away, the remaining two nodes never try to take over.  
Instead, I have to restart all of them individually until one of them becomes 
the primary, after which the other two eventually join and sync up.
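When the cluster is wedged like this, I've been inspecting the leader election 
znodes directly to see whether any node is holding the roles, along these lines 
(assuming the default nifi.zookeeper.root.node of /nifi):

    zkCli.sh -server zk-0.zk-hs.ki.svc.cluster.local:2181 ls /nifi/leaders

My understanding is that NiFi registers the 'Cluster Coordinator' and 'Primary 
Node' roles under that path via Curator, so empty role znodes while nodes are 
still up would confirm that the survivors never re-entered the election.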

When one of the nodes refuses to sync up, I often see the errors below in its 
log, and the only way to get it back into the cluster is to restart it.  The 
node showing these errors never seems to be able to rejoin or resync with the 
other two nodes.
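Before restarting, I usually check what the rest of the cluster thinks of the 
stuck node via the REST API, which lists each node's connection status 
(sketch; my cluster is unsecured and the hostname is illustrative, a secured 
cluster would need credentials):

    curl http://nifi-0.nifi-hs.ki.svc.cluster.local:8080/nifi-api/controller/cluster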

2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
    at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
    at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
    at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
    at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
    at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
    ... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered
