Did you say that the same line of code works fine for secured clusters too? I ask because nifi-toolkit has a separate set of parameters for certificates and everything else related to secure clusters.
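For what it's worth, from the toolkit guide I believe those security parameters can also be supplied via a properties file passed to cli.sh with -p, so the same get-nodes call should work against a secured cluster. A sketch, with illustrative paths and placeholder passwords (not values from your setup):

    # cli.properties - connection/TLS settings for the NiFi Toolkit CLI
    baseUrl=https://nifi-0.nifi.ki.svc.cluster.local:8080
    keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
    keystoreType=JKS
    keystorePasswd=changeit
    keyPasswd=changeit
    truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
    truststoreType=JKS
    truststorePasswd=changeit
    proxiedEntity=

invoked as:

    $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json -p /path/to/cli.properties

Is that what you meant, or does it work without any of that?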
On Tue, Oct 13, 2020 at 12:14 PM Wyll Ingersoll <[email protected]> wrote:

> I found that instead of dealing with nifi client certificate hell, the
> nifi-toolkit cli.sh will work just fine for testing the readiness of the
> cluster. Here is my readiness script, which seems to work just fine in
> Kubernetes with the apache/nifi Docker container version 1.12.1:
>
> #!/bin/bash
>
> $NIFI_TOOLKIT_HOME/bin/cli.sh nifi get-nodes -ot json > /tmp/cluster.state
> if [ $? -ne 0 ]; then
>     cat /tmp/cluster.state
>     exit 1
> fi
>
> STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
>
> if [[ ! $STATUS = "CONNECTED" ]]; then
>     echo "Node not found with CONNECTED state. Full cluster state:"
>     jq . /tmp/cluster.state
>     exit 1
> fi
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Thursday, October 1, 2020 9:03 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> For info, the probes we currently use for our StatefulSet Pods are:
>
> - livenessProbe - tcpSocket to ping the NiFi instance port (e.g. 8080);
>   a minimal sketch follows the readinessProbe example below
> - readinessProbe - exec command to curl the nifi-api/controller/cluster
>   endpoint to check the node's cluster connection status, e.g.:
>
> readinessProbe:
>   exec:
>     command:
>       - bash
>       - -c
>       - |
>         if [ "${SECURE}" = "true" ]; then
>           INITIAL_ADMIN_SLUG=$(echo "${INITIAL_ADMIN}" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
>
>           curl -v \
>             --cacert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/nifi-cert.pem \
>             --cert ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-cert.pem \
>             --key ${NIFI_HOME}/data/conf/certs/${INITIAL_ADMIN_SLUG}/${INITIAL_ADMIN_SLUG}-key.pem \
>             https://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
>         else
>           curl -kv http://$(hostname -f):8080/nifi-api/controller/cluster > /tmp/cluster.state
>         fi
>
>         STATUS=$(jq -r ".cluster.nodes[] | select((.address==\"$(hostname -f)\") or .address==\"localhost\") | .status" /tmp/cluster.state)
>
>         if [[ ! $STATUS = "CONNECTED" ]]; then
>           echo "Node not found with CONNECTED state. Full cluster state:"
>           jq . /tmp/cluster.state
>           exit 1
>         fi
>
> Note that INITIAL_ADMIN is the CN of a user with appropriate permissions
> to call the endpoint and for whom our Pod contains a set of certificate
> files in the indicated locations (generated from NiFi Toolkit in an
> init-container before the Pod starts); the jq utility was added into our
> customised version of the apache/nifi Docker Image.
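> For completeness, the livenessProbe mentioned above is just a plain TCP
> check against the instance port; a minimal sketch (port as in the example
> above, timings illustrative rather than our exact values):
>
> livenessProbe:
>   tcpSocket:
>     port: 8080
>   initialDelaySeconds: 60
>   periodSeconds: 30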
> ---
> *Chris Sampson*
> IT Consultant
> [email protected]
> <https://www.naimuri.com/>
>
>
> On Wed, 30 Sep 2020 at 16:43, Wyll Ingersoll <[email protected]> wrote:
>
> Thanks for following up and filing the issue. Unfortunately, I don't have
> any of the logs from the original issue, since I have restarted and
> rebooted my containers many times since then.
> ------------------------------
> *From:* Mark Payne <[email protected]>
> *Sent:* Wednesday, September 30, 2020 11:21 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Thanks Wyll,
>
> I created a Jira [1] to address this. The NullPointerException that you
> show in the stack trace will prevent the node from reconnecting to the
> cluster. Unfortunately, it's a bug that needs to be addressed. It's
> possible that you may find a way to work around the issue, but I can't
> tell you off the top of my head what that would be.
>
> Can you check the logs for anything else from the StandardFlowService
> class? That may help to understand why the null value is getting
> returned, causing the NullPointerException that you're seeing.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-7866
>
> On Sep 30, 2020, at 11:03 AM, Wyll Ingersoll <[email protected]> wrote:
>
> 1.11.4
> ------------------------------
> *From:* Mark Payne <[email protected]>
> *Sent:* Wednesday, September 30, 2020 11:02 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Wyll,
>
> What version of nifi are you running?
>
> Thanks
> -Mark
>
> On Sep 30, 2020, at 10:33 AM, Wyll Ingersoll <[email protected]> wrote:
>
> - Yes - the host-specific parameters on the different instances are
>   configured correctly (nifi-0, nifi-1, nifi-2)
> - Yes - we have a separate certificate for each node and the keystores
>   are configured correctly.
> - Yes - we have a headless service in front of the STS cluster
> - No - I don't think there is an explicit liveness or readiness probe
>   defined for the STS; perhaps I need to add one. Do you have an example?
>
> -Wyllys
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Tuesday, September 29, 2020 3:21 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> We started to have more stability when we switched to
> bitnami/zookeeper:3.5.7, but I suspect that's a red herring here.
>
> Your properties have nifi-0 in several places, so just to double check:
> are the relevant properties changed for each of the instances within
> your StatefulSet? For example:
> * nifi.remote.input.host
> * nifi.cluster.node.address
> * nifi.web.https.host
>
> Yes
>
> And are you using a separate (non-wildcard) certificate for each node?
>
> Do you have liveness/readiness probes set on your nifi sts?
>
> And are you using a headless service [1] to manage the cluster during
> startup? (A minimal sketch of such a Service follows below.)
>
> [1]
> https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
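> "Headless" here just means a Service with clusterIP: None, which gives
> each StatefulSet Pod a stable DNS name of the form
> <pod>.<service>.<namespace>.svc.cluster.local (e.g.
> nifi-0.nifi.ki.svc.cluster.local, matching your properties). A minimal
> sketch, with illustrative names and labels:
>
> apiVersion: v1
> kind: Service
> metadata:
>   name: nifi
> spec:
>   clusterIP: None
>   selector:
>     app: nifi
>   ports:
>     - name: https
>       port: 8080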
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 18:48 Wyll Ingersoll <[email protected]> wrote:
>
> Zookeeper is from the docker hub zookeeper:3.5.7 image.
>
> Below is our nifi.properties (with secrets and hostnames modified).
>
> thanks!
> - Wyllys
>
> nifi.flow.configuration.file=/opt/nifi/nifi-current/latest_flow/nifi-0/flow.xml.gz
> nifi.flow.configuration.archive.enabled=true
> nifi.flow.configuration.archive.dir=/opt/nifi/nifi-current/archives
> nifi.flow.configuration.archive.max.time=30 days
> nifi.flow.configuration.archive.max.storage=500 MB
> nifi.flow.configuration.archive.max.count=
> nifi.flowcontroller.autoResumeState=false
> nifi.flowcontroller.graceful.shutdown.period=10 sec
> nifi.flowservice.writedelay.interval=500 ms
> nifi.administrative.yield.duration=30 sec
> nifi.bored.yield.duration=10 millis
> nifi.queue.backpressure.count=10000
> nifi.queue.backpressure.size=1 GB
>
> nifi.authorizer.configuration.file=./conf/authorizers.xml
> nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
> nifi.templates.directory=/opt/nifi/nifi-current/templates
> nifi.ui.banner.text=KI Nifi Cluster
> nifi.ui.autorefresh.interval=30 sec
> nifi.nar.library.directory=./lib
> nifi.nar.library.autoload.directory=./extensions
> nifi.nar.working.directory=./work/nar/
> nifi.documentation.working.directory=./work/docs/components
>
> nifi.state.management.configuration.file=./conf/state-management.xml
> nifi.state.management.provider.local=local-provider
> nifi.state.management.provider.cluster=zk-provider
> nifi.state.management.embedded.zookeeper.start=false
> nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
>
> nifi.database.directory=./database_repository
> nifi.h2.url.append=;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE
>
> nifi.flowfile.repository.implementation=org.apache.nifi.controller.repository.WriteAheadFlowFileRepository
> nifi.flowfile.repository.wal.implementation=org.apache.nifi.wali.SequentialAccessWriteAheadLog
> nifi.flowfile.repository.directory=./flowfile_repository
> nifi.flowfile.repository.partitions=256
> nifi.flowfile.repository.checkpoint.interval=2 mins
> nifi.flowfile.repository.always.sync=false
> nifi.flowfile.repository.encryption.key.provider.implementation=
> nifi.flowfile.repository.encryption.key.provider.location=
> nifi.flowfile.repository.encryption.key.id=
> nifi.flowfile.repository.encryption.key=
>
> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
> nifi.queue.swap.threshold=20000
> nifi.swap.in.period=5 sec
> nifi.swap.in.threads=1
> nifi.swap.out.period=5 sec
> nifi.swap.out.threads=4
>
> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> nifi.content.claim.max.appendable.size=1 MB
> nifi.content.claim.max.flow.files=100
> nifi.content.repository.directory.default=./content_repository
> nifi.content.repository.archive.max.retention.period=12 hours
> nifi.content.repository.archive.max.usage.percentage=50%
> nifi.content.repository.archive.enabled=true
> nifi.content.repository.always.sync=false
> nifi.content.viewer.url=../nifi-content-viewer/
> nifi.content.repository.encryption.key.provider.implementation=
> nifi.content.repository.encryption.key.provider.location=
> nifi.content.repository.encryption.key.id=
> nifi.content.repository.encryption.key=
>
> nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
> nifi.provenance.repository.debug.frequency=1_000_000
> nifi.provenance.repository.encryption.key.provider.implementation=
> nifi.provenance.repository.encryption.key.provider.location=
> nifi.provenance.repository.encryption.key.id=
> nifi.provenance.repository.encryption.key=
> nifi.provenance.repository.directory.default=./provenance_repository
> nifi.provenance.repository.max.storage.time=7 days
> nifi.provenance.repository.max.storage.size=100 GB
> nifi.provenance.repository.rollover.time=120 secs
> nifi.provenance.repository.rollover.size=100 MB
> nifi.provenance.repository.query.threads=2
> nifi.provenance.repository.index.threads=2
> nifi.provenance.repository.compress.on.rollover=true
> nifi.provenance.repository.always.sync=false
> nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
> nifi.provenance.repository.indexed.attributes=
> nifi.provenance.repository.index.shard.size=4 GB
> nifi.provenance.repository.max.attribute.length=65536
> nifi.provenance.repository.concurrent.merge.threads=2
> nifi.provenance.repository.buffer.size=100000
>
> nifi.components.status.repository.implementation=org.apache.nifi.controller.status.history.VolatileComponentStatusRepository
> nifi.components.status.repository.buffer.size=1440
> nifi.components.status.snapshot.frequency=1 min
>
> nifi.remote.input.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.remote.input.secure=true
> nifi.remote.input.socket.port=10000
> nifi.remote.input.http.enabled=true
> nifi.remote.input.http.transaction.ttl=30 sec
> nifi.remote.contents.cache.expiration=30 secs
>
> nifi.web.war.directory=./lib
> nifi.web.http.host=
> nifi.web.http.port=
> nifi.web.http.network.interface.default=
> nifi.web.https.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.web.https.port=8080
> nifi.web.https.network.interface.default=
> nifi.web.jetty.working.directory=./work/jetty
> nifi.web.jetty.threads=200
> nifi.web.max.header.size=16 KB
> nifi.web.proxy.context.path=/nifi-api,/nifi
> nifi.web.proxy.host=ingress.ourdomain.com
>
> nifi.sensitive.props.key=
> nifi.sensitive.props.key.protected=
> nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
> nifi.sensitive.props.provider=BC
> nifi.sensitive.props.additional.keys=
>
> nifi.security.keystore=/opt/nifi/nifi-current/security/nifi-0.keystore.jks
> nifi.security.keystoreType=jks
> nifi.security.keystorePasswd=XXXXXXXXXXXXXXXX
> nifi.security.keyPasswd=XXXXXXXXXXXXXXXXX
> nifi.security.truststore=/opt/nifi/nifi-current/security/nifi-0.truststore.jks
> nifi.security.truststoreType=jks
> nifi.security.truststorePasswd=XXXXXXXXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.authorizer=managed-authorizer
> nifi.security.user.login.identity.provider=
> nifi.security.ocsp.responder.url=
> nifi.security.ocsp.responder.certificate=
>
> nifi.security.user.oidc.discovery.url=https://keycloak-server-address/auth/realms/Test/.well-known/openid-configuration
> nifi.security.user.oidc.connect.timeout=15 secs
> nifi.security.user.oidc.read.timeout=15 secs
> nifi.security.user.oidc.client.id=nifi
> nifi.security.user.oidc.client.secret=XXXXXXXXXXXXXXXXXXXXX
> nifi.security.user.oidc.preferred.jwsalgorithm=RS512
> nifi.security.user.oidc.additional.scopes=
> nifi.security.user.oidc.claim.identifying.user=
>
> nifi.security.user.knox.url=
> nifi.security.user.knox.publicKey=
> nifi.security.user.knox.cookieName=hadoop-jwt
> nifi.security.user.knox.audiences=
>
> nifi.cluster.protocol.heartbeat.interval=30 secs
> nifi.cluster.protocol.is.secure=true
>
> nifi.cluster.is.node=true
> nifi.cluster.node.address=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.node.protocol.port=2882
> nifi.cluster.node.protocol.threads=40
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=120 secs
> nifi.cluster.node.read.timeout=120 secs
> nifi.cluster.node.max.concurrent.requests=100
> nifi.cluster.firewall.file=
> nifi.cluster.flow.election.max.wait.time=5 mins
> nifi.cluster.flow.election.max.candidates=
>
> nifi.cluster.load.balance.host=nifi-0.nifi.ki.svc.cluster.local
> nifi.cluster.load.balance.port=6342
> nifi.cluster.load.balance.connections.per.node=4
> nifi.cluster.load.balance.max.thread.count=8
> nifi.cluster.load.balance.comms.timeout=30 sec
>
> nifi.zookeeper.connect.string=zk-0.zk-hs.ki.svc.cluster.local:2181,zk-1.zk-hs.ki.svc.cluster.local:2181,zk-2.zk-hs.ki.svc.cluster.local:2181
> nifi.zookeeper.connect.timeout=30 secs
> nifi.zookeeper.session.timeout=30 secs
> nifi.zookeeper.root.node=/nifi
> nifi.zookeeper.auth.type=
> nifi.zookeeper.kerberos.removeHostFromPrincipal=
> nifi.zookeeper.kerberos.removeRealmFromPrincipal=
>
> nifi.kerberos.krb5.file=
> nifi.kerberos.service.principal=
> nifi.kerberos.service.keytab.location=
> nifi.kerberos.spnego.principal=
> nifi.kerberos.spnego.keytab.location=
> nifi.kerberos.spnego.authentication.expiration=12 hours
>
> nifi.variable.registry.properties=
>
> nifi.analytics.predict.enabled=false
> nifi.analytics.predict.interval=3 mins
> nifi.analytics.query.interval=5 mins
> nifi.analytics.connection.model.implementation=org.apache.nifi.controller.status.analytics.models.OrdinaryLeastSquares
> nifi.analytics.connection.model.score.name=rSquared
> nifi.analytics.connection.model.score.threshold=.90
>
> ------------------------------
> *From:* Chris Sampson <[email protected]>
> *Sent:* Tuesday, September 29, 2020 12:41 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Clustered nifi issues
>
> Also, which version of zookeeper and what image (I've found that some
> versions and images are more stable than others)?
>
> Cheers,
>
> Chris Sampson
>
> On Tue, 29 Sep 2020, 17:34 Sushil Kumar <[email protected]> wrote:
>
> Hello Wyll
>
> It may be helpful if you can send nifi.properties.
>
> Thanks
> Sushil Kumar
>
> On Tue, Sep 29, 2020 at 7:58 AM Wyll Ingersoll <[email protected]> wrote:
>
> I have a 3-node Nifi (1.11.4) cluster in a Kubernetes environment (as a
> StatefulSet) using external zookeeper (3 nodes also) to manage state.
>
> Whenever even 1 node (pod/container) goes down or is restarted, it can
> throw the whole cluster into a bad state that forces me to restart ALL
> of the pods in order to recover. This seems wrong. The problem seems to
> be that when the primary node goes away, the remaining 2 nodes don't
> ever try to take over. Instead, I have to restart all of them
> individually until one of them becomes the primary; then the other 2
> eventually join and sync up.
>
> When one of the nodes is refusing to sync up, I often see the errors
> below in the log, and the only way to get it back into the cluster is
> to restart it. The node showing these errors never seems to be able to
> rejoin or resync with the other 2 nodes.
> 2020-09-29 10:18:53,324 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
> org.apache.nifi.cluster.ConnectionException: Failed to connect node to cluster due to: java.lang.NullPointerException
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1035)
>         at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:668)
>         at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:109)
>         at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:415)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: null
>         at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:989)
>         ... 4 common frames omitted
> 2020-09-29 10:18:53,326 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Starting
> 2020-09-29 10:18:53,327 INFO [Reconnect to Cluster] org.apache.zookeeper.ClientCnxnSocket jute.maxbuffer value is 4194304 Bytes
> 2020-09-29 10:18:53,328 INFO [Reconnect to Cluster] o.a.c.f.imps.CuratorFrameworkImpl Default schema
> 2020-09-29 10:18:53,807 INFO [Reconnect to Cluster-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
> 2020-09-29 10:18:53,809 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
> 2020-09-29 10:18:53,810 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
> 2020-09-29 10:18:53,813 INFO [Reconnect to Cluster-EventThread] o.a.c.framework.imps.EnsembleTracker New config event received: {server.1=zk-0.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=zk-2.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=zk-1.zk-hs.ki.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
> 2020-09-29 10:18:54,323 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Primary Node' becuase that role is not registered
> 2020-09-29 10:18:54,324 INFO [Reconnect to Cluster] o.a.n.c.l.e.CuratorLeaderElectionManager Cannot unregister Leader Election Role 'Cluster Coordinator' becuase that role is not registered
