I have a 3-node NiFi (1.11.4) cluster in a Kubernetes environment (as a
StatefulSet) using an external ZooKeeper ensemble (also 3 nodes) to manage
state. Whenever even one node (pod/container) goes down or is restarted, it can
throw the whole cluster into a bad state that forces me to restart ALL of the
pods in order to recover. This seems wrong. The problem appears to be that when
the primary node goes away, the remaining 2 nodes never try to take over.
Instead, I have to restart all of them individually until one of them becomes
the primary; the other 2 then eventually join and sync up.
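For reference, these are the nifi.properties settings that govern clustering and leader election (property names are from the NiFi admin guide; the values shown are illustrative, not necessarily my exact config — the ZooKeeper hosts match the ensemble in the logs below):

```properties
# Cluster node settings (names per the NiFi admin guide; values illustrative)
nifi.cluster.is.node=true
nifi.cluster.node.connection.timeout=5 secs
nifi.cluster.node.read.timeout=5 secs
# How long nodes wait during flow election before a Cluster Coordinator is chosen
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=

# ZooKeeper ensemble used for Cluster Coordinator / Primary Node election
nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.session.timeout=3 secs
```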
When one of the nodes refuses to sync up, I often see the errors below in its
log, and the only way to get it back into the cluster is to restart it. The
node showing these errors never seems to be able to rejoin or resync with the
other 2 nodes.
2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
        at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
        at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
        at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
        at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
        at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
        ... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered