Thanks for your response.
As it turned out, it was a different issue and involved the java utility
that ExecuteStreamCommand was kicking off. The java utility creates a tmp
directory that happened to clash with another java program running on the
one node where I was having these problems. The java utility was failing to
start due to a permission issue on that tmp directory.
Running the java utility on its own enabled me to see this problem. After I
changed the java command line to include a '-Djava.io.tmpdir' to address
this situation, things work fine.
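For anyone hitting a similar clash, here is a minimal sketch (the class name and
the example path in the comments are hypothetical, not the actual utility) of how
setting -Djava.io.tmpdir on the java command line redirects where the JVM puts
its temp files:

```java
// Hypothetical stand-in for the real utility. Launching it with
//   java -Djava.io.tmpdir=/opt/myutil/tmp TmpDirCheck
// makes the JVM place its temp files under that private directory
// instead of the shared default, avoiding the clash between processes.
import java.io.File;
import java.io.IOException;

public class TmpDirCheck {
    public static void main(String[] args) throws IOException {
        // The JVM reads java.io.tmpdir once at startup.
        String tmpDir = System.getProperty("java.io.tmpdir");
        System.out.println("java.io.tmpdir = " + tmpDir);

        // Temp files created via the standard API land under that directory,
        // so two utilities given different tmpdirs can no longer collide.
        File scratch = File.createTempFile("myutil-", ".tmp");
        System.out.println("scratch file: " + scratch.getAbsolutePath());
        scratch.deleteOnExit();
    }
}
```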
Still not sure why ExecuteStreamCommand was not able to break out and
'return' in this situation, or why it was unable to direct the output of the
failed attempt to launch the java utility into the file assigned as the
recipient of the flowfile's output stream.
On Thu, Dec 1, 2016 at 1:26 PM, Bryan Bende <bbe...@gmail.com> wrote:
> I think the behavior you saw with the queue going to 0 is expected
> behavior... when you are looking at the UI it is showing the aggregated
> view of all the nodes in the cluster, so if one node has flow files in the
> queue and that node goes down while you are looking at the UI on another
> node, that number of flow files will no longer be visible in the UI.
> How many flow files are in the queue before ExecuteStreamCommand when you
> see it not making any progress?
> There was a bug in 1.0 where, if the number of flow files in a queue was
> evenly divisible by the swap size, those flow files would be swapped out and
> never swapped back in, leaving them sitting there.
> The JIRA was: https://issues.apache.org/jira/browse/NIFI-2754
> Additionally you could try upgrading to the recently released 1.1 release
> to see if the same behavior occurs.
> On Thu, Dec 1, 2016 at 3:39 PM, A.B. Srinivasan <ab.sriniva...@gmail.com> wrote:
>> I have NiFi 1.0 deployed as a non-secure cluster across 3 nodes.
>> I have a flow pipeline that reads from a Kafka topic using ConsumeKafka
>> and kicks off an ExecuteStreamCommand-mediated job based on attributes
>> included in the notification message.
>> What I observe is that jobs are being kicked off and they complete
>> successfully on 2 of the nodes. The 3rd node, however, never seems to make
>> progress on any of the jobs scheduled on it.
>> I do see the node receiving the notification messages (based on
>> PutRiemann events posted when a message is received by ConsumeKafka), but
>> thereafter there is no progress at all. The consequence is that the queue
>> in front of the ExecuteStreamCommand processor keeps growing whenever a job
>> is scheduled on the 'stuck' node.
>> I don't see anything obvious in the nifi-app logs on any of the nodes
>> that would give me insight into what is going on. I figured that some
>> state was out-of-sync on the stuck node and decided to restart it. When that
>> node went down, the queue in front of the ExecuteStreamCommand immediately
>> went to 0 (I happened to be watching via the UI on one of the other
>> nodes). When the node came back up, the queue was restored to the value it
>> had prior to the restart.
>> I am looking for debugging hints / ideas to help get insight into what is
>> really going on.