Did you say that the same line of code works fine for secured clusters too? I ask because nifi-toolkit has a separate set of parameters for certificates and everything else related to secure clusters.
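For what it's worth, from the toolkit guide I believe those security parameters can also be supplied via a properties file passed to cli.sh with -p, so the same get-nodes call should work against a secured cluster. A sketch, with illustrative paths and placeholder passwords (not values from your setup):

    # cli.properties - connection/TLS settings for the NiFi Toolkit CLI
    baseUrl=https://nifi-0.nifi.ki.svc.cluster.local:8080
    keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
    keystoreType=JKS
    keystorePasswd=changeit
    keyPasswd=changeit
    truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
    truststoreType=JKS
    truststorePasswd=changeit
    proxiedEntity=

invoked as:

    $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json -p /path/to/cli.properties

Is that what you meant, or does it work without any of that?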
On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <[email protected]> wrote:

> I found that instead of dealing with nifi client certificate hell, the
> nifi-toolkit cli.sh will work just fine for testing the readiness of the
> cluster. Here is my readiness script, which seems to work just fine in
> Kubernetes with the apache/nifi Docker container version 1.12.1:
>
> #!/bin/bash
>
> $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
> if [ $? -ne 0 ]; then
>     cat /tmp/cluster.state
>     exit 1
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
>     echo "Node not found with CONNECTED state. Full cluster state:"
>     jq . /tmp/cluster.state
>     exit 1
> fi
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Thursday, October 1, 2020 9:03 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> For info, the probes we currently use for our StatefulSet Pods are:
>
> - livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080);
>   a minimal sketch follows the readinessProbe example below
> - readinessProbe - exec command to curl the nifi-api/controller/cluster
>   endpoint to check the node's cluster connection status, e.g.:
>
> readinessProbe:
>   exec:
>     command:
>       - bash
>       - -c
>       - |
>         if [ "${SECURE}" = "true" ]; then
>           INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
>
>           curl -v \
>             --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
>             --cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
>             --key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
>             https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
>         else
>           curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
>         fi
>
>         STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
>
>         if [[ ! $STATUS = "CONNECTED" ]]; then
>           echo "Node not found with CONNECTED state. Full cluster state:"
>           jq . /tmp/cluster.state
>           exit 1
>         fi
>
> Note that INITIAL_ADMIN is the CN of a user with appropriate permissions
> to call the endpoint and for whom our Pod contains a set of certificate
> files in the indicated locations (generated from NiFi Toolkit in an
> init-container before the Pod starts); the jq utility was added into our
> customised version of the apache/nifi Docker Image.
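> For completeness, the livenessProbe mentioned above is just a plain TCP
> check against the instance port; a minimal sketch (port as in the example
> above, timings illustrative rather than our exact values):
>
> livenessProbe:
>   tcpSocket:
>     port: 8080
>   initialDelaySeconds: 60
>   periodSeconds: 30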
> ---
> *Chris Sampson*
> IT Consultant
> [email protected]
> <https://www.naimuri.com/>
>
>
> On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <[email protected]> wrote:
>
> Thanks for following up and filing the issue. Unfortunately, I don't have
> any of the logs from the original issue, since I have restarted and
> rebooted my containers many times since then.
> ------------------------------
> *From:* Mark Payne <[email protected]>
> *Sent:* Wednesday, September 30, 2020 11:21 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Thanks Wyll,
>
> I created a Jira [1] to address this. The NullPointerException that you
> show in the stack trace will prevent the node from reconnecting to the
> cluster. Unfortunately, it's a bug that needs to be addressed. It's
> possible that you may find a way to work around the issue, but I can't
> tell you off the top of my head what that would be.
>
> Can you check the logs for anything else from the StandardFlowService
> class? That may help to understand why the null value is getting
> returned, causing the NullPointerException that you're seeing.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-7866
>
> On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <[email protected]> wrote:
>
> 1.11.4
> ------------------------------
> *From:* Mark Payne <[email protected]>
> *Sent:* Wednesday, September 30, 2020 11:02 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Wyll,
>
> What version of nifi are you running?
>
> Thanks
> -Mark
>
> On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <[email protected]> wrote:
>
> - Yes - the host-specific parameters on the different instances are
>   configured correctly (nifi-0, nifi-1, nifi-2)
> - Yes - we have a separate certificate for each node and the keystores
>   are configured correctly.
> - Yes - we have a headless service in front of the STS cluster
> - No - I don't think there is an explicit liveness or readiness probe
>   defined for the STS; perhaps I need to add one. Do you have an example?
>
> -Wyllys
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Tuesday, September 29, 2020 3:21 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> We started to have more stability when we switched to
> bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.
>
> Your properties have nifi-0 in several places, so just to double check:
> are the relevant properties changed for each of the instances within
> your StatefulSet? For example:
> * nifi.remote.input.host
> * nifi.cluster.node.address
> * nifi.web.https.host
>
> Yes
>
> And are you using a separate (non-wildcard) certificate for each node?
>
> Do you have liveness/readiness probes set on your nifi sts?
>
> And are you using a headless service [1] to manage the cluster during
> startup? (A minimal sketch of such a Service follows below.)
>
> [1]
> https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
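> "Headless" here just means a Service with clusterIP: None, which gives
> each StatefulSet Pod a stable DNS name of the form
> <pod>.<service>.<namespace>.svc.cluster.local (e.g.
> nifi-0.nifi.ki.svc.cluster.local, matching your properties). A minimal
> sketch, with illustrative names and labels:
>
> apiVersion: v1
> kind: Service
> metadata:
>   name: nifi
> spec:
>   clusterIP: None
>   selector:
>     app: nifi
>   ports:
>     - name: https
>       port: 8080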
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll <[email protected]> wrote:
>
> Zookeeper is from the docker hub zookeeper:3.5.7 image.
>
> Below is our nifi.properties (with secrets and hostnames modified).
>
> thanks!
> - Wyllys
>
> nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
> nifi.flow.configuration.archive.enabled=true
> nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
> nifi.flow.configuration.archive.max.time=30 days
> nifi.flow.configuration.archive.max.storage=500 MB
> nifi.flow.configuration.archive.max.count=
> nifi.flowcontroller.autoResumeState=false
> nifi.flowcontroller.graceful.shutdown.period=10 sec
> nifi.flowservice.writedelay.interval=500 ms
> nifi.administrative.yield.duration=30 sec
> nifi.bored.yield.duration=10 millis
> nifi.queue.backpressure.count=10000
> nifi.queue.backpressure.size=1 GB
>
> nifi.authorizer.configuration.file=./conf/authorizers.xml
> nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
> nifi.templates.directory=/opt/nifi/nifi-current/templates
> nifi.ui.banner.text=KI Nifi Cluster
> nifi.ui.autorefresh.interval=30 sec
> nifi.nar.library.directory=./lib
> nifi.nar.library.autoload.directory=./extensions
> nifi.nar.working.directory=./work/nar/
> nifi.documentation.working.directory=./work/docs/components
>
> nifi.state.management.configuration.file=./conf/state-management.xml
> nifi.state.management.provider.local=local-provider
> nifi.state.management.provider.cluster=zk-provider
> nifi.state.management.embedded.zookeeper.start=false
> nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
>
> nifi.database.directory=./database_repository
> nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=./flowfile_repository
> nifi.flowfile.repository.partitions=256
> nifi.flowfile.repository.checkpoint.interval=2 mins
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.encryption.key.provider.implementation=
> nifi.flowfile.repository.encryption.key.provider.location=
> nifi.flowfile.repository.encryption.key.id=
> nifi.flowfile.repository.encryption.key=
>
> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
> nifi.queue.swap.threshold=20000
> nifi.swap.in.period=5 sec
> nifi.swap.in.threads=1
> nifi.swap.out.period=5 sec
> nifi.swap.out.threads=4
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=1 MB
> nifi.content.claim.max.flow.files=100
> nifi.content.repository.directory.default=./content_repository
> nifi.content.repository.archive.max.retention.period=12 hours
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=true
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
> nifi.content.repository.encryption.key.provider.implementation=
> nifi.content.repository.encryption.key.provider.location=
> nifi.content.repository.encryption.key.id=
> nifi.content.repository.encryption.key=
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
> nifi.provenance.repository.debug.frequency=1_000_000
> nifi.provenance.repository.encryption.key.provider.implementation=
> nifi.provenance.repository.encryption.key.provider.location=
> nifi.provenance.repository.encryption.key.id=
> nifi.provenance.repository.encryption.key=
> nifi.provenance.repository.directory.default=./provenance_repository
> nifi.provenance.repository.max.storage.time=7 days
> nifi.provenance.repository.max.storage.size=100 GB
> nifi.provenance.repository.rollover.time=120 secs
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=2
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=4 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=2
> nifi.provenance.repository.buffer.size=100000
>
> nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
> nifi.components.status.repository.buffer.size=1440
> nifi.components.status.snapshot.frequency=1 min
>
> nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.remote.input.secure=true
> nifi.remote.input.socket.port=10000
> nifi.remote.input.http.enabled=true
> nifi.remote.input.http.transaction.ttl=30 sec
> nifi.remote.contents.cache.expiration=30 secs
>
> nifi.web.war.directory=./lib
> nifi.web.http.host=
> nifi.web.http.port=
> nifi.web.http.network.interface.default=
> nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.web.https.port=8080
> nifi.web.https.network.interface.default=
> nifi.web.jetty.working.directory=./work/jetty
> nifi.web.jetty.threads=200
> nifi.web.max.header.size=16 KB
> nifi.web.proxy.context.path=/nifi-api,/nifi
> nifi.web.proxy.host=ingress.ourdomain.com
>
> nifi.sensitive.props.key=
> nifi.sensitive.props.key.protected=
> nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
> nifi.sensitive.props.provider=BC
> nifi.sensitive.props.additional.keys=
>
> nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
> nifi.security.keystoreType=jks
> nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
> nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
> nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
> nifi.security.truststoreType=jks
> nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.authorizer=managed-authorizer
> nifi.security.user.login.identity.provider=
> nifi.security.ocsp.responder.url=
> nifi.security.ocsp.responder.certificate=
>
> nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
> nifi.security.user.oidc.connect.timeout=15 secs
> nifi.security.user.oidc.read.timeout=15 secs
> nifi.security.user.oidc.client.id=nifi
> nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.oidc.preferred.jwsalgorithm=RS512
> nifi.security.user.oidc.additional.scopes=
> nifi.security.user.oidc.claim.identifying.user=
>
> nifi.security.user.knox.url=
> nifi.security.user.knox.publicKey=
> nifi.security.user.knox.cookieName=hadoop-jwt
> nifi.security.user.knox.audiences=
>
> nifi.cluster.protocol.heartbeat.interval=30 secs
> nifi.cluster.protocol.is.secure=true
>
> nifi.cluster.is.node=true
> nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.node.protocol.port=2882
> nifi.cluster.node.protocol.threads=40
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=120 secs
> nifi.cluster.node.read.timeout=120 secs
> nifi.cluster.node.max.concurrent.requests=100
> nifi.cluster.firewall.file=
> nifi.cluster.flow.election.max.wait.time=5 mins
> nifi.cluster.flow.election.max.candidates=
>
> nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.load.balance.port=6342
> nifi.cluster.load.balance.connections.per.node=4
> nifi.cluster.load.balance.max.thread.count=8
> nifi.cluster.load.balance.comms.timeout=30 sec
>
> nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
> nifi.zookeeper.connect.timeout=30 secs
> nifi.zookeeper.session.timeout=30 secs
> nifi.zookeeper.root.node=/nifi
> nifi.zookeeper.auth.type=
> nifi.zookeeper.kerberos.removeHostFromPrincipal=
> nifi.zookeeper.kerberos.removeRealmFromPrincipal=
>
> nifi.kerberos.krb5.file=
> nifi.kerberos.service.principal=
> nifi.kerberos.service.keytab.location=
> nifi.kerberos.spnego.principal=
> nifi.kerberos.spnego.keytab.location=
> nifi.kerberos.spnego.authentication.expiration=12 hours
>
> nifi.variable.registry.properties=
>
> nifi.analytics.predict.enabled=false
> nifi.analytics.predict.interval=3 mins
> nifi.analytics.query.interval=5 mins
> nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
> nifi.analytics.connection.model.score.name=rSquared
> nifi.analytics.connection.model.score.threshold=.90
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Tuesday, September 29, 2020 12:41 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Also, which version of zookeeper and what image (I've found that some
> versions and images are more stable than others)?
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 17:34 Sushil Kumar <[email protected]> wrote:
>
> Hello Wyll
>
> It may be helpful if you can send nifi.properties.
>
> Thanks
> Sushil Kumar
>
> On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <[email protected]> wrote:
>
> I have a 3-node Nifi (1.11.4) cluster in a Kubernetes environment (as a
> StatefulSet) using external zookeeper (3 nodes also) to manage state.
>
> Whenever even 1 node (pod/container) goes down or is restarted, it can
> throw the whole cluster into a bad state that forces me to restart ALL
> of the pods in order to recover. This seems wrong. The problem seems to
> be that when the primary node goes away, the remaining 2 nodes don't
> ever try to take over. Instead, I have to restart all of them
> individually until one of them becomes the primary; then the other 2
> eventually join and sync up.
>
> When one of the nodes is refusing to sync up, I often see the errors
> below in the log, and the only way to get it back into the cluster is
> to restart it. The node showing these errors never seems to be able to
> rejoin or resync with the other 2 nodes.
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
>         at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
>         at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
>         at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
>         ... 4 common frames omitted
> 2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
> 2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
> 2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
> 2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
> 2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
> 2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
> 2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
> 2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
> 2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered
