I have NiFi 1.0 deployed as a non-secure cluster across 3 nodes.
I have a flow pipeline that reads from a Kafka topic using ConsumeKafka and
kicks off an ExecuteStreamCommand-mediated job based on attributes included
in the notification message.
What I observe is that jobs are being kicked off and they complete
successfully on 2 of the nodes. The 3rd node however never seems to make
progress on any of the jobs scheduled on it.
I do see the node receiving the notification messages (based on PutRiemann
events posted when a message is received by ConsumeKafka) but thereafter
there is no progress at all. The consequence is that the queue in front of
the ExecuteStreamCommand processor keeps growing whenever a job is
scheduled on the 'stuck' node.
I don't see anything obvious in the nifi-app logs on any of the nodes
that helps me get insight into what is afoot. I figured that some state is
out-of-sync on the stuck node and decided to restart it. When that node
went down, the queue in front of the ExecuteStreamCommand immediately went
to 0 (I happened to be watching using the UI on one of the other nodes).
When that node came back up, the queue was restored to the value it had
prior to the restart.
I am looking for debugging hints / ideas to help get insight into what is
really going on.