Hey Mark, thanks for your answer. We have always run an external ZooKeeper, but on the same hosts as the NiFi instances. While researching the load balancing problem in version 1.13.2, however, we decided to switch to the embedded ZooKeeper, because the documentation gives no hint at all that an external ZooKeeper instance should be used in a production environment. So we simply thought we should get rid of that possible source of error, like incompatible versions etc.
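To make the question a bit more concrete, here is roughly the setup I have in mind: a three node ZooKeeper ensemble on dedicated hosts. All hostnames and paths below are just placeholders, not our real config:

```properties
# zoo.cfg on each of the three dedicated ZooKeeper hosts
# (hostnames and paths are placeholders, not our real environment)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.contoso.com:2888:3888
server.2=zk2.contoso.com:2888:3888
server.3=zk3.contoso.com:2888:3888

# nifi.properties on each NiFi node: disable the embedded server
# and point the clients at the external ensemble
nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk1.contoso.com:2181,zk2.contoso.com:2181,zk3.contoso.com:2181
```

Each ZooKeeper host would additionally get a myid file (containing 1, 2 or 3) in its dataDir. Would something like this be the recommended direction?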
So just to be clear: is it officially advisable to run an external ZooKeeper instance on separate servers? If so, which version should we use? And how many servers should a ZooKeeper cluster have to manage 3-5 NiFi hosts?

I gotta say, I really appreciate all the work this community puts into helping others. I hope I can give something back some time.

--- Original Message ---
From: Mark Payne <marka...@hotmail.com>
Date: 17.08.2021 16:13:29
To: "users@nifi.apache.org" <users@nifi.apache.org>, Axel Schwarz <axelkop...@emailn.de>
Subject: Re: Disfunctional cluster with version 1.13.2

> Axel,
>
> So a few things to consider here.
>
> The NullPointer that you're showing there in the logs was fixed in 1.14.0 [1].
>
> Secondly, there is the question of why the ZooKeeper client is disconnecting. Given that you're running an embedded zookeeper server, it could be for any number of reasons - low java heap, too high CPU usage, over-utilized disks, etc. The embedded zookeeper is a convenience that should be used for testing & playing around locally but should never be used for production use - you should always depend on an external zookeeper for any production use case, running on a different machine (or set of machines). I suspect moving to an external zookeeper alone will be enough to help significantly here.
>
> But if you run into issues after moving to an external zookeeper, would recommend running "bin/nifi.sh diagnostics diag1.txt" when you're having trouble. This will gather all sorts of information that you should be able to share, including garbage collection information, how many times the cluster coordinator is changing, etc. that can be helpful in diagnosing cluster-related issues.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-8394
>
>> On Aug 17, 2021, at 1:18 AM, Axel Schwarz <axelkop...@emailn.de> wrote:
>>
>> Hey Pierre,
>>
>> we're currently using the embedded ZK.
>> So as far as I know this is version 3.5.6.
>>
>> --- Original Message ---
>> From: Pierre Villard <pierre.villard...@gmail.com>
>> Date: 16.08.2021 16:59:22
>> To: users@nifi.apache.org, Axel Schwarz <axelkop...@emailn.de>
>> Subject: Re: Disfunctional cluster with version 1.13.2
>>
>>> Hi,
>>>
>>> What's the version of ZK?
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On Thu, Aug 12, 2021 at 09:55, Axel Schwarz <axelkop...@emailn.de> wrote:
>>>
>>> Dear all,
>>>
>>> after successfully battling the load balancing and installing version 1.13.2 again in our 3 node production environment, we experienced another failure in the cluster, resulting in a complete cut-off of the flow just 1.5 h after the update.
>>> We noticed it just by trying to access the web interface, which immediately after login showed something like:
>>>
>>> "Cannot replicate request to Node nifiHost1.contoso.com:8443 because the node is not connected"
>>>
>>> There was nothing we could do through the web interface aside from staring at this message, and when looking at the live logs there was nothing suspicious. The log moved on as if nothing had happened.
>>> After a restart of the cluster everything was working fine again, but we saw that the entire flow had not been working for some period of time. This alone is really uncool, as we are running a cluster for exactly that reason: the flow should keep working, even if some node decides to malfunction for whatever reason.
>>>
>>> Digging a little deeper into the logs showed two noticeable problems:
>>>
>>> 1. The ZooKeeper is restarting every few minutes.
Which in the > log always > >> > >>> looks like this: > >>> > >>> ------------------------------------(nifiHost1.contoso.com) > > >>> nifi-app.log------------------------------------------------------------------------------------ > > >> > >>> > >>> 2021-08-11 12:02:39,187 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> 2021-08-11 12:02:39,187 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer shutting down > >>> 2021-08-11 12:02:39,194 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.z.server.FinalRequestProcessor shutdown of request processor > complete > >> > >>> 2021-08-11 12:02:39,196 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.z.server.SyncRequestProcessor Shutting down > >>> 2021-08-11 12:02:39,196 INFO [SyncThread:1] > >>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited! 
> > >>> 2021-08-11 12:02:39,199 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to > 4000 > >>> 2021-08-11 12:02:39,200 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to > 40000 > >> > >>> 2021-08-11 12:02:39,200 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime > 2000 > >> > >>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir > >>> /opt/nifi/logs/zookeeper/version-2 snapdir > >>> /opt/nifi/state/zookeeper/version-2 > >>> 2021-08-11 12:02:39,276 ERROR [Timer-Driven Process Thread-27-EventThread] > > >> > >>> org.apache.zookeeper.ClientCnxn Error while calling watcher > > >>> java.lang.NullPointerException: null > >>> at > >>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535) > > >> > >>> at > >>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) > > >> > >>> 2021-08-11 12:02:39,277 INFO [main-EventThread] > >>> o.a.c.f.state.ConnectionStateManager State change: SUSPENDED > > >>> 2021-08-11 12:02:40,001 INFO [main-EventThread] > >>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED > > >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] > > >>> o.a.n.c.l.e.CuratorLeaderElectionManager > >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 > > >> > >>> Connection State changed to SUSPENDED > >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] > > >>> o.a.n.c.l.e.CuratorLeaderElectionManager > >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e > > >> > >>> Connection State changed to SUSPENDED > >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] > > >>> 
o.a.n.c.l.e.CuratorLeaderElectionManager > >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1d37e350 > > >> > >>> Connection State changed to RECONNECTED > >>> 2021-08-11 12:02:40,006 INFO [Curator-ConnectionStateManager-0] > > >>> o.a.n.c.l.e.CuratorLeaderElectionManager > >>> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1908642e > > >> > >>> Connection State changed to RECONNECTED > >>> 2021-08-11 12:02:40,014 INFO [main-EventThread] > >>> o.a.c.framework.imps.EnsembleTracker New config event received: > {} > >>> 2021-08-11 12:02:40,127 INFO [Clustering Tasks Thread-2] > >>> o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2021-08-11 > > >> > >>> 12:02:39,176 and sent to nifiHost3.contoso.com:54321 at 2021-08-11 > > >>> 12:02:40,127; send took 951 millis > >>> 2021-08-11 12:02:40,186 ERROR [Timer-Driven Process Thread-27-EventThread] > > >> > >>> org.apache.zookeeper.ClientCnxn Error while calling watcher > > >>> java.lang.NullPointerException: null > >>> at > >>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535) > > >> > >>> at > >>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) > > >> > >>> > >>> > >>> ------------------------------------------------------------------------------------------------------------------------------------------------------------ > > >> > >>> > >>> This happens, as mentioned, every few minutes and looks like > this when > >> > >>> condensed through grep: > >>> > >>> ------------------------------------(nifiHost1.contoso.com) > nifi-app.log > >> > >>> | grep "ZooKeeperServer Shutting > >>> down"--------------------------------------------- > >>> > >>> ./nifi-app.log:4237:2021-08-11 12:02:39,187 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:4378:2021-08-11 12:03:22,084 INFO > >>> 
[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:4447:2021-08-11 12:03:57,639 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:4563:2021-08-11 12:04:18,921 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:4629:2021-08-11 12:04:33,199 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:4650:2021-08-11 12:05:03,939 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:4801:2021-08-11 12:05:39,348 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:5327:2021-08-11 12:06:56,736 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:7250:2021-08-11 12:21:41,464 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:7703:2021-08-11 12:24:54,244 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8148:2021-08-11 12:27:21,093 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8190:2021-08-11 12:27:37,820 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8350:2021-08-11 12:28:37,417 INFO > >>> 
[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8797:2021-08-11 12:33:25,184 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8813:2021-08-11 12:33:39,672 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8971:2021-08-11 12:34:22,003 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:8984:2021-08-11 12:34:36,511 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:9349:2021-08-11 12:36:29,802 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:9376:2021-08-11 12:36:41,541 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:9491:2021-08-11 12:37:12,390 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:9504:2021-08-11 12:37:25,888 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:9676:2021-08-11 12:39:22,775 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:9842:2021-08-11 12:40:49,396 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:10251:2021-08-11 12:43:38,807 INFO > >>> 
[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:10397:2021-08-11 12:44:36,326 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:11405:2021-08-11 12:52:26,384 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:11629:2021-08-11 12:53:33,054 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:11827:2021-08-11 12:54:43,783 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:11990:2021-08-11 12:56:23,389 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:12085:2021-08-11 12:57:21,410 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:12823:2021-08-11 13:01:23,674 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:13087:2021-08-11 13:04:44,524 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:14610:2021-08-11 13:16:15,432 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:14635:2021-08-11 13:16:37,724 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:14751:2021-08-11 13:17:38,169 INFO > >>> 
[QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:15136:2021-08-11 13:17:59,547 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:15203:2021-08-11 13:18:30,695 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:15516:2021-08-11 13:21:43,631 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:15743:2021-08-11 13:22:49,630 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:15950:2021-08-11 13:23:50,574 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:16035:2021-08-11 13:24:30,913 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:16396:2021-08-11 13:28:01,873 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:16470:2021-08-11 13:29:12,550 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:16541:2021-08-11 13:29:38,490 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:16568:2021-08-11 13:29:53,554 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> ./nifi-app.log:16678:2021-08-11 13:31:13,437 INFO > >>> 
>>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> ./nifi-app.log:16705:2021-08-11 13:31:28,195 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
>>>
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> So my first question is: is this behaviour of ZooKeeper normal? Well, it is just an INFO, but I would love to hear that this is a completely normal phenomenon. :D
>>>
>>> 2. After all these ZooKeeper restarts, NiFi finally throws the following error along with the "last" ZooKeeper restart:
>>>
>>> ------------------------------------(nifiHost1.contoso.com) nifi-app.log------------------------------------------------------------------------------------
>>>
>>> 2021-08-11 13:31:13,083 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Shutting down
>>> 2021-08-11 13:31:13,437 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer shutting down
>>> 2021-08-11 13:31:13,439 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.FinalRequestProcessor shutdown of request processor complete
>>> 2021-08-11 13:31:13,440 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.z.server.SyncRequestProcessor Shutting down
>>> 2021-08-11 13:31:13,440 INFO [SyncThread:1] o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited!
> > >>> 2021-08-11 13:31:13,450 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to > 4000 > >>> 2021-08-11 13:31:13,450 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to > 40000 > >> > >>> 2021-08-11 13:31:13,450 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer Created server with tickTime > 2000 > >> > >>> minSessionTimeout 4000 maxSessionTimeout 40000 datadir > >>> /opt/nifi/logs/zookeeper/version-2 snapdir > >>> /opt/nifi/state/zookeeper/version-2 > >>> 2021-08-11 13:31:13,462 ERROR [Load-Balance Server Thread-5] > > >>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate > with > >> Peer > >>> nifiHost2.contoso.com/192.168.1.11:19500 > >>> java.io.EOFException: Expected to receive a Transaction Completion > > >>> Indicator from Peer nifiHost2.contoso.com but encountered EOF > > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299) > > >> > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275) > > >> > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133) > > >> > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162) > > >> > >>> at java.lang.Thread.run(Thread.java:748) > >>> 2021-08-11 13:31:13,464 ERROR [Load-Balance Server Thread-6] > > >>> o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate > with > >> Peer > >>> nifiHost2.contoso.com/192.168.1.11:19502 > >>> java.io.EOFException: Expected to receive a 
Transaction Completion > > >>> Indicator from Peer nifiHost2.contoso.com but encountered EOF > > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.completeTransaction(StandardLoadBalanceProtocol.java:299) > > >> > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:275) > > >> > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.StandardLoadBalanceProtocol.receiveFlowFiles(StandardLoadBalanceProtocol.java:133) > > >> > >>> at > >>> org.apache.nifi.controller.queue.clustered.server.ConnectionLoadBalanceServer$CommunicateAction.run(ConnectionLoadBalanceServer.java:162) > > >> > >>> at java.lang.Thread.run(Thread.java:748) > >>> 2021-08-11 13:31:13,589 INFO [Process Cluster Protocol Request-1] > > >>> o.a.n.c.c.node.NodeClusterCoordinator Status of nifiHost1.contoso.com:8443 > > >> > >>> changed from NodeConnectionStatus[nodeId=nifiHost1.contoso.com:8443, > > >> > >>> state=CONNECTED, updateId=16] to NodeConnectionStatus[nodeId= > > >>> nifiHost1.contoso.com:8443, state=CONNECTING, updateId=16] > >>> 2021-08-11 13:31:13,793 INFO [Process Cluster Protocol Request-7] > > >>> o.a.n.c.p.impl.SocketProtocolListener Finished processing request > > >>> 88980d85-b956-4713-9686-50d76a407871 (type=RECONNECTION_REQUEST, > > >>> length=2209040 bytes) from nifiHost1.contoso.com:8443 in 142 > millis > >> > >>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] > >>> o.a.nifi.controller.StandardFlowService Processing reconnection > request > >> > >>> from cluster coordinator. 
> >>> 2021-08-11 13:31:13,794 INFO [Reconnect to Cluster] > >>> o.a.n.c.l.e.CuratorLeaderElectionManager Election Listener for > Role > >> Primary > >>> Node disabled > >>> 2021-08-11 13:31:13,802 INFO [main-EventThread] > >>> o.a.c.f.state.ConnectionStateManager State change: RECONNECTED > > >>> 2021-08-11 13:31:28,195 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer Shutting down > >>> 2021-08-11 13:31:28,196 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.zookeeper.server.ZooKeeperServer shutting down > >>> 2021-08-11 13:31:28,198 INFO [pool-22-thread-1] > >>> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint > of FlowFile > >> > >>> Repository > >>> 2021-08-11 13:31:28,202 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.z.server.FinalRequestProcessor shutdown of request processor > complete > >> > >>> 2021-08-11 13:31:28,203 INFO > >>> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] > > >> > >>> o.a.z.server.SyncRequestProcessor Shutting down > >>> 2021-08-11 13:31:28,203 INFO [SyncThread:1] > >>> o.a.z.server.SyncRequestProcessor SyncRequestProcessor exited! > > >>> 2021-08-11 13:31:28,204 INFO [Timer-Driven Process Thread-21] > > >>> o.a.n.p.store.WriteAheadStorePartition Successfully rolled over > Event > >> > >>> Writer for Provenance Event Store Partition[directory=/opt/pr] > due to > >> > >>> MAX_TIME_REACHED. Event File was 6.7 MB and contained 1626 events. 
>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.wali.SequentialAccessWriteAheadLog Checkpointed Write-Ahead Log with 10 Records and 0 Swap Files in 8 milliseconds (Stop-the-world time = 1 milliseconds), max Transaction ID 27772177643
>>> 2021-08-11 13:31:28,208 INFO [pool-22-thread-1] o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile Repository with 10 records in 8 milliseconds
>>> 2021-08-11 13:31:28,295 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> 2021-08-11 13:31:29,808 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to 4000
>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to 40000
>>> 2021-08-11 13:31:29,809 INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled)] o.a.zookeeper.server.ZooKeeperServer Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/nifi/logs/zookeeper/version-2 snapdir /opt/nifi/state/zookeeper/version-2
>>> 2021-08-11 13:31:32,291 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>>
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> After that, the log continues with normal messages about heartbeats and provenance events, but no errors occur anymore. That was exactly the time the whole flow cut off.
>>> I have to add that we never experienced this flow cut-off behaviour with version 1.12.1, although the ZooKeeper restarts occur there as well, and as often as seen above.
>>> But when we first updated to version 1.13.2 back in June, we experienced this flow cut-off multiple times, besides the non-functional load balancing. Back then I thought the two problems were connected, but this time load balancing was fully functional and the cut-off happened nevertheless.
>>>
>>> I'm thankful for any hint. :)