We have a 2-node NiFi cluster running with 3 ZooKeeper instances (replicated) in a Docker Swarm cluster.
Most of the time the cluster operates fine, but from time to time we notice that NiFi stops processing messages completely. It eventually resumes after a while (sometimes after a couple of seconds, sometimes after a couple of minutes).

When I run

    grep o.a.n.c.l.e.CuratorLeaderElectionManager /srv/nifi/logs/nifi-app.log

on the primary node, I see a lot of SUSPENDED / RECONNECTED messages. Likewise, on the other node I see similar messages.

The only real exceptions I'm seeing in the logs are these:

I also see this on the UI from time to time:

    com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out

Is there anything I can do to debug this further? Is it normal to see that many connection state changes? (The logs are full of them.)

The solution is running on 3 VMs, using Docker Swarm. NiFi is running on 2 of those 3 VMs. We have a ZooKeeper setup running on all 3 VMs. I don't see any errors in the ZooKeeper logs.
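
To put a number on "that many connection state changes", here is a rough sketch that tallies the state changes per minute from nifi-app.log. It assumes each log line starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp and that the state appears as the literal word SUSPENDED, RECONNECTED or LOST. The function name count_state_changes is just something made up for this example, so adjust the pattern if your logback layout differs.

```python
import re
import sys
from collections import Counter

# Assumption: lines begin with "YYYY-MM-DD HH:MM:SS,mmm" and the connection
# state is logged as one of the words SUSPENDED, RECONNECTED or LOST.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}.*\b(SUSPENDED|RECONNECTED|LOST)\b"
)


def count_state_changes(lines):
    """Return a Counter keyed by (minute, state) for leader-election log lines."""
    counts = Counter()
    for line in lines:
        # Only look at lines from the leader election manager.
        if "CuratorLeaderElectionManager" not in line:
            continue
        match = LINE_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_states.py /srv/nifi/logs/nifi-app.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (minute, state), n in sorted(count_state_changes(handle).items()):
            print(f"{minute} {state:<12} {n}")
```

Running this on both NiFi nodes and comparing the busiest minutes with the times processing stalled should show whether the stalls line up with the ZooKeeper connection drops.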
