Re: Help needed in debugging why one of the nodes in a cluster is not making any progress ..
Hi Bryan,

Thanks for your response.

As it turned out, it was a different issue, and it involved the Java utility that ExecuteStreamCommand was kicking off. The Java utility creates a tmp directory that happened to clash with another Java program running on the one node where I was having these problems. The utility was failing to get off the ground due to a permission issue on that tmp directory. Running the Java utility on its own enabled me to see this problem.

After I changed the java command line to include a '-Djava.io.tmpdir' setting to address this situation, things work fine.

I am still not sure why ExecuteStreamCommand was not able to break out and 'return' in this situation, or why it was unable to direct the output of the failed attempt to kick off the Java utility into the file assigned as the recipient of the output stream of the flowfile.

Cheers,
A.B.

On Thu, Dec 1, 2016 at 1:26 PM, Bryan Bende wrote:

> Hello,
>
> I think the behavior you saw with the queue going to 0 is expected
> behavior. When you are looking at the UI, it is showing the aggregated
> view of all the nodes in the cluster, so if one node has flow files in
> the queue and that node goes down while you are in the UI on another
> node, that number of flow files would no longer be visible in the UI.
>
> How many flow files are in the queue before ExecuteStreamCommand when
> you see it not making any progress?
>
> There was a bug in 1.0 where if the number of flow files in a queue was
> evenly divisible by the swap size, then those flow files would be
> swapped out and never swapped back in, and would be sitting there.
>
> The JIRA was: https://issues.apache.org/jira/browse/NIFI-2754
>
> Additionally, you could try upgrading to the recently released 1.1 to
> see if the same behavior occurs.
>
> -Bryan
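For anyone hitting the same symptom, the tmp directory problem described in the reply above can be checked outside NiFi with a small standalone program. This is only a sketch, assuming the failing utility creates its scratch directory under `java.io.tmpdir`; the class name `TmpDirCheck` and the `scratch-` prefix are made up for illustration and are not part of the original utility.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic, not part of NiFi or the original utility.
class TmpDirCheck {

    /** Returns the directory this JVM uses for temporary files. */
    static Path tmpDir() {
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    /** Tries to create, then delete, a scratch directory under the given parent. */
    static boolean canCreateScratchDir(Path parent) {
        try {
            Path scratch = Files.createTempDirectory(parent, "scratch-");
            Files.delete(scratch);
            return true;
        } catch (IOException | SecurityException e) {
            // A permission problem or a missing parent directory ends up here.
            System.err.println("Cannot create scratch dir under " + parent + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        Path dir = tmpDir();
        System.out.println("java.io.tmpdir = " + dir);
        System.out.println("writable = " + canCreateScratchDir(dir));
    }
}
```

Running it as the same OS user that runs NiFi, once without and once with `-Djava.io.tmpdir=<some private directory>` (the directory is your choice), should show whether the default tmp location is usable on the affected node.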
Re: Help needed in debugging why one of the nodes in a cluster is not making any progress ..
Hello,

I think the behavior you saw with the queue going to 0 is expected behavior. When you are looking at the UI, it is showing the aggregated view of all the nodes in the cluster, so if one node has flow files in the queue and that node goes down while you are in the UI on another node, that number of flow files would no longer be visible in the UI.

How many flow files are in the queue before ExecuteStreamCommand when you see it not making any progress?

There was a bug in 1.0 where if the number of flow files in a queue was evenly divisible by the swap size, then those flow files would be swapped out and never swapped back in, and would be sitting there.

The JIRA was: https://issues.apache.org/jira/browse/NIFI-2754

Additionally, you could try upgrading to the recently released 1.1 to see if the same behavior occurs.

-Bryan

On Thu, Dec 1, 2016 at 3:39 PM, A.B. Srinivasan wrote:

> Folks,
>
> I have a NiFi 1.0 deployed in a non-secure cluster across 3 nodes.
>
> I have a flow pipeline that reads from a Kafka topic using ConsumeKafka
> and kicks off an ExecuteStreamCommand mediated job based on attributes
> included in the notification message.
>
> What I observe is that jobs are being kicked off and they complete
> successfully on 2 of the nodes. The 3rd node however never seems to make
> progress on any of the jobs scheduled on it.
> I do see the node receiving the notification messages (based on
> PutRiemann events posted when a message is received by ConsumeKafka) but
> thereafter there is no progress at all. The consequence is that the
> queue in front of the ExecuteStreamCommand processor keeps growing
> whenever a job is scheduled on the 'stuck' node.
>
> I don't see anything obvious to me in the nifi-app logs on any of the
> nodes that helps me get insight into what is afoot. I figured that some
> state is out-of-sync on the stuck node and decided to restart it. When
> that node went down, the queue in front of the ExecuteStreamCommand
> immediately went to 0 (I happened to be watching using the UI on one of
> the other nodes). When that node came back up, the queue was restored to
> the value it had prior to the restart.
>
> I am looking for debugging hints / ideas to help get insight into what
> is really going on.
>
> Thanks,
> A.B.
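The condition mentioned above for the 1.0 swap bug is plain arithmetic, so it is easy to test a queue count against it. The sketch below only checks whether a count is an exact multiple of a swap size; it does not reproduce NiFi's swap logic, and the value 10,000 used for the swap size is an assumption for illustration, so substitute whatever applies to your installation.

```java
// Hypothetical helper, not NiFi code: checks the "exact multiple" condition only.
class SwapBoundaryCheck {

    /** True when queued is a positive exact multiple of swapSize. */
    static boolean isExactMultiple(long queued, long swapSize) {
        if (swapSize <= 0) {
            throw new IllegalArgumentException("swapSize must be positive");
        }
        return queued > 0 && queued % swapSize == 0;
    }

    public static void main(String[] args) {
        long swapSize = 10_000L; // assumed value, for illustration only
        long[] samples = {9_999L, 10_000L, 20_000L, 25_000L};
        for (long queued : samples) {
            System.out.println(queued + " -> " + isExactMultiple(queued, swapSize));
        }
    }
}
```

If the count shown on the stuck queue satisfies this check, the bug in the JIRA is worth a closer look; if it does not, the cause is probably elsewhere.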
Help needed in debugging why one of the nodes in a cluster is not making any progress ..
Folks,

I have a NiFi 1.0 deployed in a non-secure cluster across 3 nodes.

I have a flow pipeline that reads from a Kafka topic using ConsumeKafka and kicks off an ExecuteStreamCommand mediated job based on attributes included in the notification message.

What I observe is that jobs are being kicked off and they complete successfully on 2 of the nodes. The 3rd node however never seems to make progress on any of the jobs scheduled on it. I do see the node receiving the notification messages (based on PutRiemann events posted when a message is received by ConsumeKafka) but thereafter there is no progress at all. The consequence is that the queue in front of the ExecuteStreamCommand processor keeps growing whenever a job is scheduled on the 'stuck' node.

I don't see anything obvious to me in the nifi-app logs on any of the nodes that helps me get insight into what is afoot. I figured that some state is out-of-sync on the stuck node and decided to restart it. When that node went down, the queue in front of the ExecuteStreamCommand immediately went to 0 (I happened to be watching using the UI on one of the other nodes). When that node came back up, the queue was restored to the value it had prior to the restart.

I am looking for debugging hints / ideas to help get insight into what is really going on.

Thanks,
A.B.