We saw better stability after switching to bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.
Your properties have nifi-0 in several places, so just to double check: are the relevant properties changed for each of the instances within your statefulset? For example:

* nifi.remote.input.host
* nifi.cluster.node.address
* nifi.web.https.host

And are you using a separate (non-wildcard) certificate for each node? Do you have liveness/readiness probes set on your nifi sts? And are you using a headless service[1] to manage the cluster during startup?

[1] https://kubernetes.io/docs/concepts/services-networking/service/#headless-services

Cheers,

Chris Sampson

On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <[email protected]> wrote:

> Zookeeper is from the docker hub zookeeper:3.5.7 image.
>
> Below is our nifi.properties (with secrets and hostnames modified).
>
> thanks!
> - Wyllys
>
> nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
> nifi.flow.configuration.archive.enabled=true
> nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
> nifi.flow.configuration.archive.max.time=30 days
> nifi.flow.configuration.archive.max.storage=500 MB
> nifi.flow.configuration.archive.max.count=
> nifi.flowcontroller.autoResumeState=false
> nifi.flowcontroller.graceful.shutdown.period=10 sec
> nifi.flowservice.writedelay.interval=500 ms
> nifi.administrative.yield.duration=30 sec
> nifi.bored.yield.duration=10 millis
> nifi.queue.backpressure.count=10000
> nifi.queue.backpressure.size=1 GB
> nifi.authorizer.configuration.file=./conf/authorizers.xml
> nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
> nifi.templates.directory=/opt/nifi/nifi-current/templates
> nifi.ui.banner.text=KI Nifi Cluster
> nifi.ui.autorefresh.interval=30 sec
> nifi.nar.library.directory=./lib
> nifi.nar.library.autoload.directory=./extensions
> nifi.nar.working.directory=./work/nar/
> nifi.documentation.working.directory=./work/docs/components
> nifi.state.management.configuration.file=./conf/state-management.xml
> nifi.state.management.provider.local=local-provider
> nifi.state.management.provider.cluster=zk-provider
> nifi.state.management.embedded.zookeeper.start=false
> nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
>
> nifi.database.directory=./database_repository
> nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=./flowfile_repository
> nifi.flowfile.repository.partitions=256
> nifi.flowfile.repository.checkpoint.interval=2 mins
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.encryption.key.provider.implementation=
> nifi.flowfile.repository.encryption.key.provider.location=
> nifi.flowfile.repository.encryption.key.id=
> nifi.flowfile.repository.encryption.key=
>
> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
> nifi.queue.swap.threshold=20000
> nifi.swap.in.period=5 sec
> nifi.swap.in.threads=1
> nifi.swap.out.period=5 sec
> nifi.swap.out.threads=4
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=1 MB
> nifi.content.claim.max.flow.files=100
> nifi.content.repository.directory.default=./content_repository
> nifi.content.repository.archive.max.retention.period=12 hours
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=true
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
> nifi.content.repository.encryption.key.provider.implementation=
> nifi.content.repository.encryption.key.provider.location=
> nifi.content.repository.encryption.key.id=
> nifi.content.repository.encryption.key=
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
> nifi.provenance.repository.debug.frequency=1_000_000
> nifi.provenance.repository.encryption.key.provider.implementation=
> nifi.provenance.repository.encryption.key.provider.location=
> nifi.provenance.repository.encryption.key.id=
> nifi.provenance.repository.encryption.key=
> nifi.provenance.repository.directory.default=./provenance_repository
> nifi.provenance.repository.max.storage.time=7 days
> nifi.provenance.repository.max.storage.size=100 GB
> nifi.provenance.repository.rollover.time=120 secs
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=2
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=4 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=2
> nifi.provenance.repository.buffer.size=100000
>
> nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
> nifi.components.status.repository.buffer.size=1440
> nifi.components.status.snapshot.frequency=1 min
>
> nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.remote.input.secure=true
> nifi.remote.input.socket.port=10000
> nifi.remote.input.http.enabled=true
> nifi.remote.input.http.transaction.ttl=30 sec
> nifi.remote.contents.cache.expiration=30 secs
>
> nifi.web.war.directory=./lib
> nifi.web.http.host=
> nifi.web.http.port=
> nifi.web.http.network.interface.default=
> nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.web.https.port=8080
> nifi.web.https.network.interface.default=
> nifi.web.jetty.working.directory=./work/jetty
> nifi.web.jetty.threads=200
> nifi.web.max.header.size=16 KB
> nifi.web.proxy.context.path=/nifi-api,/nifi
> nifi.web.proxy.host=ingress.ourdomain.com
>
> nifi.sensitive.props.key=
> nifi.sensitive.props.key.protected=
> nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
> nifi.sensitive.props.provider=BC
> nifi.sensitive.props.additional.keys=
>
> nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
> nifi.security.keystoreType=jks
> nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
> nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
> nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
> nifi.security.truststoreType=jks
> nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.authorizer=managed-authorizer
> nifi.security.user.login.identity.provider=
> nifi.security.ocsp.responder.url=
> nifi.security.ocsp.responder.certificate=
>
> nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
> nifi.security.user.oidc.connect.timeout=15 secs
> nifi.security.user.oidc.read.timeout=15 secs
> nifi.security.user.oidc.client.id=nifi
> nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.oidc.preferred.jwsalgorithm=RS512
> nifi.security.user.oidc.additional.scopes=
> nifi.security.user.oidc.claim.identifying.user=
>
> nifi.security.user.knox.url=
> nifi.security.user.knox.publicKey=
> nifi.security.user.knox.cookieName=hadoop-jwt
> nifi.security.user.knox.audiences=
>
> nifi.cluster.protocol.heartbeat.interval=30 secs
> nifi.cluster.protocol.is.secure=true
>
> nifi.cluster.is.node=true
> nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.node.protocol.port=2882
> nifi.cluster.node.protocol.threads=40
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=120 secs
> nifi.cluster.node.read.timeout=120 secs
> nifi.cluster.node.max.concurrent.requests=100
> nifi.cluster.firewall.file=
> nifi.cluster.flow.election.max.wait.time=5 mins
> nifi.cluster.flow.election.max.candidates=
>
> nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.load.balance.port=6342
> nifi.cluster.load.balance.connections.per.node=4
> nifi.cluster.load.balance.max.thread.count=8
> nifi.cluster.load.balance.comms.timeout=30 sec
>
> nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
> nifi.zookeeper.connect.timeout=30 secs
> nifi.zookeeper.session.timeout=30 secs
> nifi.zookeeper.root.node=/nifi
> nifi.zookeeper.auth.type=
> nifi.zookeeper.kerberos.removeHostFromPrincipal=
> nifi.zookeeper.kerberos.removeRealmFromPrincipal=
>
> nifi.kerberos.krb5.file=
> nifi.kerberos.service.principal=
> nifi.kerberos.service.keytab.location=
> nifi.kerberos.spnego.principal=
> nifi.kerberos.spnego.keytab.location=
> nifi.kerberos.spnego.authentication.expiration=12 hours
>
> nifi.variable.registry.properties=
>
> nifi.analytics.predict.enabled=false
> nifi.analytics.predict.interval=3 mins
> nifi.analytics.query.interval=5 mins
> nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
> nifi.analytics.connection.model.score.name=rSquared
> nifi.analytics.connection.model.score.threshold=.90
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Tuesday, September 29, 2020 12:41 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Also, which version of zookeeper and what image (I've found different versions and images provided better stability)?
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 17:34 Sushil Kumar, <[email protected]> wrote:
>
> Hello Wyll
>
> It may be helpful if you can send nifi.properties.
>
> Thanks
> Sushil Kumar
>
> On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <[email protected]> wrote:
>
> I have a 3-node Nifi (1.11.4) cluster in a kubernetes environment (as a StatefulSet) using an external zookeeper (3 nodes also) to manage state.
>
> Whenever even 1 node (pod/container) goes down or is restarted, it can throw the whole cluster into a bad state that forces me to restart ALL of the pods in order to recover. This seems wrong. The problem seems to be that when the primary node goes away, the remaining 2 nodes don't ever try to take over. Instead, I have to restart all of them individually until one of them becomes the primary, then the other 2 eventually join and sync up.
>
> When one of the nodes is refusing to sync up, I often see these errors in the log, and the only way to get it back into the cluster is to restart it. The node showing the errors below never seems to be able to rejoin or resync with the other 2 nodes.
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
>         at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
>         at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
>         at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
>         ... 4 common frames omitted
> 2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
> 2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
> 2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
> 2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
> 2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
> 2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
> 2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
> 2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
> 2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered
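[Editor's note] The headless service mentioned in the first reply above could look roughly like the following sketch. The `ki` namespace, the service name `nifi`, and the port numbers are taken from the pasted nifi.properties; the pod label and everything else are assumptions, not a tested manifest from the thread:

```yaml
# Headless Service: clusterIP None makes DNS resolve to individual pods,
# giving each replica a stable name like nifi-0.nifi.ki.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: nifi
  namespace: ki
spec:
  clusterIP: None        # headless
  selector:
    app: nifi            # assumed pod label
  ports:
    - name: https
      port: 8080         # nifi.web.https.port
    - name: cluster
      port: 2882         # nifi.cluster.node.protocol.port
    - name: load-balance
      port: 6342         # nifi.cluster.load.balance.port
```

The StatefulSet would reference this via `serviceName: nifi` so that the per-pod DNS names match the hostnames used in nifi.properties.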
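[Editor's note] The per-node properties the first reply asks about (the pasted config hardcodes nifi-0 everywhere) are often rewritten at pod startup. A minimal sketch of such an init step, assuming the `.nifi.ki.svc.cluster.local` service domain from the config above; the function name and file locations are illustrative, not from the thread:

```shell
#!/bin/sh
# Rewrite the host-specific properties so each StatefulSet replica
# advertises its own DNS name instead of nifi-0's.
patch_node_props() {  # usage: patch_node_props <properties-file> <fqdn>
  props_file=$1
  fqdn=$2
  for prop in nifi.remote.input.host nifi.cluster.node.address \
              nifi.web.https.host nifi.cluster.load.balance.host; do
    # Replace the value of each per-node property in place.
    sed -i "s|^${prop}=.*|${prop}=${fqdn}|" "$props_file"
  done
}

# In an init container, each pod computes its own fully qualified name,
# e.g. nifi-1.nifi.ki.svc.cluster.local for the second replica.
FQDN="$(hostname).nifi.ki.svc.cluster.local"
```

A pod would then call `patch_node_props /opt/nifi/nifi-current/conf/nifi.properties "$FQDN"` before starting NiFi, so that nifi-1 and nifi-2 no longer advertise nifi-0's address.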
