Re: Nifi cluster nodes regularly stop processing any flowfiles

Mark Payne Mon, 01 Aug 2016 08:55:15 -0700

Aaron,

Thanks for getting that to us quickly! It is extremely useful.


Joe,

I do indeed believe this is the same thing. I was in the middle of typing a 
response, but you beat me to it!

Thanks
-Mark


> On Aug 1, 2016, at 11:49 AM, Joe Witt <[email protected]> wrote:
> 
> Aaron, Mark,
> 
> In looking at the thread-dump provided it looks to me like this is the
> same as what was reported and addressed in
> https://issues.apache.org/jira/browse/NIFI-2395
> 
> The fix for this has not yet been released but it is slated to end up on
> an 0.x and 1.0 release line.
> 
> Mark do you agree it is the same thing by looking at the logs?
> 
> Thanks
> Joe
> 
> On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <[email protected]> wrote:
>> Alright, here you go for one of the nodes!
>> 
>> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]> wrote:
>>> 
>>> Aaron,
>>> 
>>> Any time you find that NiFi has stopped performing its work, the best thing
>>> to do is to take a thread dump and send it to the mailing list. This allows
>>> us to determine exactly what is happening, so we know what action is being
>>> performed that prevents any other progress.
>>> 
>>> To do this, you can go to the NiFi node that is not performing and run the
>>> command:
>>> 
>>> bin/nifi.sh dump thread-dump.txt
>>> 
>>> This will generate a file named thread-dump.txt that you can send to us.
>>> 
>>> Thanks!
>>> -Mark
>>> 
>>> 
>>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]> wrote:
>>> 
>>> I've been trying different things to try to fix my NiFi freeze problems,
>>> and it seems the most frequent reason that my cluster gets stuck and stops
>>> processing has to do with network related processors.  My data enters the
>>> environment from Kafka and leaves via a site-to-site output port.  After
>>> some time processing (sometimes a few minutes, sometimes a few hours) one of
>>> those will start logging connection errors, and then that node will stop
>>> processing any flowfiles across all processors.
>>> 
>>> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to RHEL7
>>> (although RHEL seems to be happier).  I've tried restricting threads to less
>>> than the number of available cores on each node, different heap sizes, and
>>> different garbage collectors.  So far none of that has prevented the
>>> problem, unfortunately.
>>> 
>>> I'm not quite ready to build all custom processors for my flow logic...
>>> most of it is straightforward attribute routing, text replacement, and
>>> flowfile merging.
>>> 
>>> What are other things that I could try, or just be doing wrong that could
>>> lead to this?  I'm happy to keep trying suggestions and changes; I really
>>> want this to work!
>>> 
>>> Thanks,
>>> -Aaron
>>> 
>>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]> wrote:
>>>> 
>>>> Aaron,
>>>> 
>>>> I ran into an issue where the Execute Stream Command (ESC) processor with
>>>> many threads would run a legacy script that would hang if the incoming file
>>>> was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
>>>> malformed data randomly streamed through it. Eventually I ran out of threads
>>>> as the system was just waiting for a thread to become available.
>>>> 
>>>> It was apparent in the processor statistics where the flowfiles-out
>>>> statistic would eventually step down to zero as threads became stuck.
>>>> 
>>>> It might be worth trying InvokeScriptedProcessor or building custom
>>>> processors as they provide a means to handle these inconsistencies more
>>>> gracefully.
>>>> 
>>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>>>> 
>>>> Thanks,
>>>> Lee
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <[email protected]>
>>>> wrote:
>>>>> 
>>>>> Hi Mark,
>>>>> 
>>>>> I've been using the G1 garbage collector.  I brought the nodes down to
>>>>> 8GB heap and let it run overnight, but processing still got stuck and
>>>>> required NiFi to be restarted on all nodes.  It took longer to happen, but
>>>>> they went down after a few hours.  Are there any other things I can look
>>>>> into?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> -Aaron
>>>>> 
>>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]>
>>>>> wrote:
>>>>>> 
>>>>>> Aaron,
>>>>>> 
>>>>>> My guess would be that you are hitting a Full Garbage Collection. With
>>>>>> such a huge Java heap, that will cause a "stop the world" pause for quite a
>>>>>> long time.
>>>>>> Which garbage collector are you using? Have you tried reducing the heap
>>>>>> from 48 GB to say 4 or 8 GB?
>>>>>> 
>>>>>> Thanks
>>>>>> -Mark
>>>>>> 
>>>>>> 
>>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm having an issue with a small (two node) NiFi cluster where the
>>>>>>> nodes will stop processing any queued flowfiles.  I haven't seen any error
>>>>>>> messages logged related to it, and when attempting to restart the service,
>>>>>>> NiFi doesn't respond and the script forcibly kills it.  This causes multiple
>>>>>>> flowfile versions to hang around, and generally makes me feel like it might
>>>>>>> be causing data loss.
>>>>>>> 
>>>>>>> I'm running the web UI on a different box, and when things stop
>>>>>>> working, it stops showing changes to counts in any queues, and the thread
>>>>>>> count never changes.  It still thinks the nodes are connecting and
>>>>>>> responding, though.
>>>>>>> 
>>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>>>>>>> the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>>>>>>> event threads to 4.  Install is on the current Amazon Linux AMI and using
>>>>>>> OpenJDK 1.8.0.91 x64.
>>>>>>> 
>>>>>>> Any ideas, other debug steps, or changes that I can try?  I'm running
>>>>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>>>>>> versions.  The higher the flowfile volume I push through, the faster this
>>>>>>> happens.
>>>>>>> 
>>>>>>> Thanks for any help there is to give!
>>>>>>> 
>>>>>>> -Aaron Longfield
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>
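
The thread-dump step described earlier in this thread produces a text file
that can be skimmed before it is sent to the list. Below is a minimal sketch
of such a check in Python. It assumes each thread in the dump starts with a
header line of the form "thread name" Id=42 STATE ..., which is how Java's
ThreadInfo.toString() formats threads; the file written by bin/nifi.sh dump
may be laid out differently, so the pattern may need adjusting. The function
names and the sample input are hypothetical and are not part of NiFi.

```python
import re
from collections import Counter

# Assumed header layout (not confirmed by the thread):
#   "Timer-Driven Process Thread-1" Id=41 BLOCKED on java.lang.Object@...
HEADER = re.compile(
    r'^"(?P<name>[^"]*)".*?\bId=(?P<id>\d+)\s+(?P<state>[A-Z_]+)'
)


def summarize_dump(text):
    """Count threads per state and collect the names of BLOCKED threads."""
    states = Counter()
    blocked = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match is None:
            continue  # stack frames, blank lines, anything not a header
        state = match.group("state")
        states[state] += 1
        if state == "BLOCKED":
            blocked.append(match.group("name"))
    return states, blocked


def summarize_file(path):
    """Convenience wrapper for a dump file such as thread-dump.txt."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize_dump(handle.read())


# Made-up sample input for illustration only; not taken from a real dump.
SAMPLE = '''"Timer-Driven Process Thread-1" Id=41 BLOCKED on java.lang.Object@1a2b3c
    at example.Hypothetical.method(Hypothetical.java:10)

"Timer-Driven Process Thread-2" Id=42 RUNNABLE
    at example.Hypothetical.other(Hypothetical.java:20)

"Timer-Driven Process Thread-3" Id=43 TIMED_WAITING on java.lang.Object@4d5e6f
'''

if __name__ == "__main__":
    states, blocked = summarize_dump(SAMPLE)
    for state, count in sorted(states.items()):
        print(state, count)
    print("blocked threads:", blocked)
```

To try it on a real dump, call summarize_file("thread-dump.txt") on the node
where the dump was taken. A large number of BLOCKED or WAITING threads in the
processing pool is only a hint about where to look; the full dump is still
what the list needs to diagnose the problem.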
