I found that instead of dealing with NiFi client-certificate hell, the
nifi-toolkit cli.sh will work just fine for testing the readiness of the
cluster. Here is my readiness script, which seems to work just fine within
Kubernetes with the apache/nifi Docker container, version 1.12.1:
#!/bin/bash
# Ask the cluster for its node list via the NiFi Toolkit CLI
$NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
if [ $? -ne 0 ]; then
    cat /tmp/cluster.state
    exit 1
fi
# Look up this node's entry (by FQDN, or localhost) and check it is CONNECTED
STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
if [[ ! $STATUS = "CONNECTED" ]]; then
    echo "Node not found with CONNECTED state. Full cluster state:"
    jq . /tmp/cluster.state
    exit 1
fi
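For anyone wanting to use this from Kubernetes, here is a minimal sketch of wiring it in as an exec readinessProbe (the script path /opt/nifi/scripts/readiness.sh and the timing values are assumptions for illustration, not from my actual manifest):
readinessProbe:
  exec:
    command:
      - bash
      - /opt/nifi/scripts/readiness.sh   # assumption: the script above, baked into the image
  initialDelaySeconds: 60    # assumption: give NiFi time to start and join the cluster
  periodSeconds: 20
  failureThreshold: 3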
________________________________
From: Chris Sampson <[email protected]>
Sent: Thursday, October 1, 2020 9:03 AM
To: [email protected] <[email protected]>
Subject: Re: Clustered nifi issues
For info, the probes we currently use for our StatefulSet Pods are:
* livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080)
* readinessProbe - exec command to curl the nifi-api/controller/cluster
endpoint to check the node's cluster connection status, e.g.:
readinessProbe:
  exec:
    command:
      - bash
      - -c
      - |
        if [ "${SECURE}" = "true" ]; then
          INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
          curl -v \
            --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
            --cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
            --key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
            https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
        else
          curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
        fi
        STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
        if [[ ! $STATUS = "CONNECTED" ]]; then
          echo "Node not found with CONNECTED state. Full cluster state:"
          jq . /tmp/cluster.state
          exit 1
        fi
Note that INITIAL_ADMIN is the CN of a user with appropriate permissions to
call the endpoint, and for whom our Pod contains a set of certificate files in
the indicated locations (generated by NiFi Toolkit in an init-container
before the Pod starts); the jq utility was added into our customised version of
the apache/nifi Docker image.
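For reference, the livenessProbe mentioned at the top of this message is just a TCP check against the instance port; a minimal sketch (assuming port 8080, as in the example above, and illustrative timing values) would be:
livenessProbe:
  tcpSocket:
    port: 8080             # the NiFi instance port
  initialDelaySeconds: 60  # assumption: allow time for NiFi to start listening
  periodSeconds: 30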
---
Chris Sampson
IT Consultant
[email protected]
On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <[email protected]> wrote:
Thanks for following up and filing the issue. Unfortunately, I don't have any of
the logs from the original issue, because I have since restarted and rebooted my
containers many times.
________________________________
From: Mark Payne <[email protected]>
Sent: Wednesday, September 30, 2020 11:21 AM
To: [email protected] <[email protected]>
Subject: Re: Clustered nifi issues
Thanks Wyll,
I created a Jira [1] to address this. The NullPointer that you show in the
stack trace will prevent the node from reconnecting to the cluster.
Unfortunately, it’s a bug that needs to be addressed. It’s possible that you
may find a way to work around the issue, but I can’t tell you off the top of my
head what that would be.
Can you check the logs for anything else from the StandardFlowService class?
That may help to understand why the null value is getting returned, causing the
NullPointerException that you’re seeing.
Thanks
-Mark
[1] https://issues.apache.org/jira/browse/NIFI-7866
On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll
<[email protected]> wrote:
1.11.4
________________________________
From: Mark Payne <[email protected]>
Sent: Wednesday, September 30, 2020 11:02 AM
To: [email protected] <[email protected]>
Subject: Re: Clustered nifi issues
Wyll,
What version of nifi are you running?
Thanks
-Mark
On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll
<[email protected]> wrote:
* Yes - the host-specific parameters on the different instances are
configured correctly (nifi-0, nifi-1, nifi-2)
* Yes - we have a separate certificate for each node and the keystores are
configured correctly.
* Yes - we have a headless service in front of the STS cluster
* No - I don't think there is an explicit liveness or readiness probe
defined for the STS; perhaps I need to add one. Do you have an example?
-Wyllys
________________________________
From: Chris Sampson <[email protected]>
Sent: Tuesday, September 29, 2020 3:21 PM
To: [email protected] <[email protected]>
Subject: Re: Clustered nifi issues
We started to have more stability when we switched to bitnami/zookeeper:3.5.7,
but I suspect that's a red herring here.
Your properties have nifi-0 in several places, so just to double-check: are the
relevant properties changed for each of the instances within your StatefulSet?
(One way to handle this is sketched after the list below.)
For example:
* nifi.remote.input.host
* nifi.cluster.node.address
* nifi.web.https.host
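Each instance needs its own values for these; one approach, sketched here under the assumption that nifi.properties is templated at container start-up, is to substitute the Pod's FQDN in an entrypoint or init script (the prop_replace helper is defined inline for illustration, not part of the stock image):
# sketch: derive per-pod addresses from the Pod's FQDN before NiFi starts
FQDN=$(hostname -f)
prop_replace() {
  # replace the value of property $1 with $2 in nifi.properties
  sed -i "s|^${1}=.*$|${1}=${2}|" "${NIFI_HOME}/conf/nifi.properties"
}
prop_replace nifi.remote.input.host "${FQDN}"
prop_replace nifi.cluster.node.address "${FQDN}"
prop_replace nifi.web.https.host "${FQDN}"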
And are you using a separate (non-wildcard) certificate for each node?
Do you have liveness/readiness probes set on your nifi sts?
And are you using a headless service[1] to manage the cluster during startup?
[1]
https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
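A minimal headless Service for the StatefulSet might look like this (a sketch only; the name, namespace, labels and ports are assumptions based on the addresses in your properties):
apiVersion: v1
kind: Service
metadata:
  name: nifi            # must match the StatefulSet's serviceName
  namespace: ki         # assumption: from the *.nifi.ki.svc.cluster.local DNS names
spec:
  clusterIP: None       # headless: DNS resolves to the individual Pod IPs
  selector:
    app: nifi           # assumption: the label on the NiFi Pods
  ports:
    - name: https
      port: 8080
    - name: cluster
      port: 2882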
Cheers,
Chris Sampson
On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll, <[email protected]> wrote:
Zookeeper is from the docker hub zookeeper:3.5.7 image.
Below is our nifi.properties (with secrets and hostnames modified).
thanks!
- Wyllys
nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
nifi.flow.configuration.archive.enabled=true
nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
nifi.flow.configuration.archive.max.time=30 days
nifi.flow.configuration.archive.max.storage=500 MB
nifi.flow.configuration.archive.max.count=
nifi.flowcontroller.autoResumeState=false
nifi.flowcontroller.graceful.shutdown.period=10 sec
nifi.flowservice.writedelay.interval=500 ms
nifi.administrative.yield.duration=30 sec
nifi.bored.yield.duration=10 millis
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB
nifi.authorizer.configuration.file=./conf/authorizers.xml
nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
nifi.templates.directory=/opt/nifi/nifi-current/templates
nifi.ui.banner.text=KI Nifi Cluster
nifi.ui.autorefresh.interval=30 sec
nifi.nar.library.directory=./lib
nifi.nar.library.autoload.directory=./extensions
nifi.nar.working.directory=./work/nar/
nifi.documentation.working.directory=./work/docs/components
nifi.state.management.configuration.file=./conf/state-management.xml
nifi.state.management.provider.local=local-provider
nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.database.directory=./database_repository
nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
nifi.flowfile.repository.directory=./flowfile_repository
nifi.flowfile.repository.partitions=256
nifi.flowfile.repository.checkpoint.interval=2 mins
nifi.flowfile.repository.always.sync=false
nifi.flowfile.repository.encryption.key.provider.implementation=
nifi.flowfile.repository.encryption.key.provider.location=
nifi.flowfile.repository.encryption.key.id=
nifi.flowfile.repository.encryption.key=
nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4
nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=../nifi-content-viewer/
nifi.content.repository.encryption.key.provider.implementation=
nifi.content.repository.encryption.key.provider.location=
nifi.content.repository.encryption.key.id=
nifi.content.repository.encryption.key=
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.debug.frequency=1_000_000
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=7 days
nifi.provenance.repository.max.storage.size=100 GB
nifi.provenance.repository.rollover.time=120 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
nifi.provenance.repository.indexed.attributes=
nifi.provenance.repository.index.shard.size=4 GB
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
nifi.provenance.repository.buffer.size=100000
nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
nifi.components.status.repository.buffer.size=1440
nifi.components.status.snapshot.frequency=1 min
nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10000
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec
nifi.remote.contents.cache.expiration=30 secs
nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=
nifi.web.http.network.interface.default=
nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
nifi.web.https.port=8080
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=/nifi-api,/nifi
nifi.web.proxy.host=ingress.ourdomain.com
nifi.sensitive.props.key=
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.provider=BC
nifi.sensitive.props.additional.keys=
nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
nifi.security.keystoreType=jks
nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
nifi.security.truststoreType=jks
nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
nifi.security.user.authorizer=managed-authorizer
nifi.security.user.login.identity.provider=
nifi.security.ocsp.responder.url=
nifi.security.ocsp.responder.certificate=
nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
nifi.security.user.oidc.connect.timeout=15 secs
nifi.security.user.oidc.read.timeout=15 secs
nifi.security.user.oidc.client.id=nifi
nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
nifi.security.user.oidc.preferred.jwsalgorithm=RS512
nifi.security.user.oidc.additional.scopes=
nifi.security.user.oidc.claim.identifying.user=
nifi.security.user.knox.url=
nifi.security.user.knox.publicKey=
nifi.security.user.knox.cookieName=hadoop-jwt
nifi.security.user.knox.audiences=
nifi.cluster.protocol.heartbeat.interval=30 secs
nifi.cluster.protocol.is.secure=true
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.node.protocol.port=2882
nifi.cluster.node.protocol.threads=40
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=120 secs
nifi.cluster.node.read.timeout=120 secs
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec
nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
nifi.zookeeper.connect.timeout=30 secs
nifi.zookeeper.session.timeout=30 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.auth.type=
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=
nifi.kerberos.krb5.file=
nifi.kerberos.service.principal=
nifi.kerberos.service.keytab.location=
nifi.kerberos.spnego.principal=
nifi.kerberos.spnego.keytab.location=
nifi.kerberos.spnego.authentication.expiration=12 hours
nifi.variable.registry.properties=
nifi.analytics.predict.enabled=false
nifi.analytics.predict.interval=3 mins
nifi.analytics.query.interval=5 mins
nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
nifi.analytics.connection.model.score.name=rSquared
nifi.analytics.connection.model.score.threshold=.90
________________________________
From: Chris Sampson <[email protected]>
Sent: Tuesday, September 29, 2020 12:41 PM
To: [email protected] <[email protected]>
Subject: Re: Clustered nifi issues
Also, which version of ZooKeeper, and which image? (I've found that some
versions and images provide better stability than others.)
Cheers,
Chris Sampson
On Tue, 29 Sep 2020, 17:34 Sushil Kumar,
<[email protected]> wrote:
Hello Wyll
It may be helpful if you can send nifi.properties.
Thanks
Sushil Kumar
On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll
<[email protected]> wrote:
I have a 3-node NiFi (1.11.4) cluster in a Kubernetes environment (as a
StatefulSet) using an external ZooKeeper ensemble (also 3 nodes) to manage state.
Whenever even one node (pod/container) goes down or is restarted, it can throw
the whole cluster into a bad state that forces me to restart ALL of the pods in
order to recover. This seems wrong. The problem seems to be that when the
primary node goes away, the remaining 2 nodes never try to take over.
Instead, I have to restart all of them individually until one of them becomes
the primary, then the other 2 eventually join and sync up.
When one of the nodes is refusing to sync up, I often see these errors in the
log and the only way to get it back into the cluster is to restart it. The
node showing the errors below never seems to be able to rejoin or resync with
the other 2 nodes.
2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
        at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
        at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
        at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
        at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
        at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
        ... 4 common frames omitted
2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered