Re: Nifi cluster nodes regularly stop processing any flowfiles

Joe Witt Mon, 01 Aug 2016 08:50:22 -0700

Aaron, Mark,

In looking at the thread-dump provided it looks to me like this is the
same as what was reported and addressed in
https://issues.apache.org/jira/browse/NIFI-2395


The fix for this has not yet been released, but it is slated to end up on
the 0.x and 1.0 release lines.

Mark, do you agree it is the same thing, judging by the logs?

Thanks
Joe

On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <[email protected]> wrote:
> Alright, here you go for one of the nodes!
>
> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]> wrote:
>>
>> Aaron,
>>
>> Any time that you find NiFi has stopped performing its work, the best thing
>> to do is to perform a thread-dump and send it to the mailing list. This
>> allows us to determine what exactly is happening, so we know what action is
>> being performed that prevents any other progress.
>>
>> To do this, you can go to the NiFi node that is not performing and run the
>> command:
>>
>> bin/nifi.sh dump thread-dump.txt
>>
>> This will generate a file named thread-dump.txt that you can send to us.
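
A single dump shows only one instant, so it can help to capture a few dumps some seconds apart and compare them. The sketch below wraps the `bin/nifi.sh dump` command from above in a small shell function. The function name `take_dumps`, its arguments, and the numbered output file names are illustrative assumptions, not part of NiFi.

```shell
#!/bin/sh
# Sketch: capture several thread dumps a few seconds apart, so threads that
# stay stuck can be told apart from threads that are merely busy.
# take_dumps and its three arguments are hypothetical, not NiFi settings.
take_dumps() {
    nifi_home="$1"   # directory that contains bin/nifi.sh
    count="$2"       # number of dumps to take
    interval="$3"    # seconds to wait between dumps
    i=1
    while [ "$i" -le "$count" ]; do
        # 'bin/nifi.sh dump <file>' is the command given in the message above
        "$nifi_home/bin/nifi.sh" dump "thread-dump-$i.txt" || return 1
        # Wait between dumps, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, `take_dumps /opt/nifi 3 10` (the install path is hypothetical) would write thread-dump-1.txt through thread-dump-3.txt into the current directory, ten seconds apart.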
>>
>> Thanks!
>> -Mark
>>
>>
>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]> wrote:
>>
>> I've been trying different things to try to fix my NiFi freeze problems,
>> and it seems the most frequent reason that my cluster gets stuck and stops
>> processing has to do with network related processors.  My data enters the
>> environment from Kafka and leaves via a site-to-site output port.  After
>> some time processing (sometimes a few minutes, sometimes a few hours) one of
>> those will start logging connection errors, and then that node will stop
>> processing any flowfiles across all processors.
>>
>> So far, this has followed me from 0.6.1 to 0.7.0, and from Amazon Linux to RHEL7
>> (although RHEL seems to be happier).  I've tried restricting threads to less
>> than the number of available cores on each node, different heap sizes, and
>> different garbage collectors.  So far none of that has prevented the
>> problem, unfortunately.
>>
>> I'm not quite ready to build all custom processors for my flow logic...
>> most of it is straightforward attribute routing, text replacement, and
>> flowfile merging.
>>
>> What other things could I try, or what might I be doing wrong that could
>> lead to this?  I'm happy to keep trying suggestions and changes; I really
>> want this to work!
>>
>> Thanks,
>> -Aaron
>>
>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]> wrote:
>>>
>>> Aaron,
>>>
>>> I ran into an issue where the Execute Stream Command (ESC) processor with
>>> many threads would run a legacy script that would hang if the incoming file
>>> was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
>>> malformed data randomly streamed through it. Eventually I ran out of threads
>>> as the system was just waiting for a thread to become available.
>>>
>>> It was apparent in the processor statistics where the flowfiles-out
>>> statistic would eventually step down to zero as threads became stuck.
>>>
>>> It might be worth trying InvokeScriptedProcessor or building custom
>>> processors as they provide a means to handle these inconsistencies more
>>> gracefully.
>>>
>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>>>
>>> Thanks,
>>> Lee
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <[email protected]>
>>> wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> I've been using the G1 garbage collector.  I brought the nodes down to
>>>> 8GB heap and let it run overnight, but processing still got stuck and
>>>> required NiFi to be restarted on all nodes.  It took longer to happen, but
>>>> they went down after a few hours.  Are there any other things I can look
>>>> into?
>>>>
>>>> Thanks!
>>>>
>>>> -Aaron
>>>>
>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]>
>>>> wrote:
>>>>>
>>>>> Aaron,
>>>>>
>>>>> My guess would be that you are hitting a Full Garbage Collection. With
>>>>> such a huge Java heap, that will cause a "stop the world" pause for quite
>>>>> a long time.
>>>>> Which garbage collector are you using? Have you tried reducing the heap
>>>>> from 48 GB to say 4 or 8 GB?
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>>
>>>>> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <[email protected]>
>>>>> > wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I'm having an issue with a small (two node) NiFi cluster where the
>>>>> > nodes will stop processing any queued flowfiles.  I haven't seen any
>>>>> > error messages logged related to it, and when attempting to restart
>>>>> > the service, NiFi doesn't respond and the script forcibly kills it.
>>>>> > This causes multiple flowfile versions to hang around, and generally
>>>>> > makes me feel like it might be causing data loss.
>>>>> >
>>>>> > I'm running the web UI on a different box, and when things stop
>>>>> > working, it stops showing changes to counts in any queues, and the
>>>>> > thread count never changes.  It still thinks the nodes are connecting and
>>>>> > responding, though.
>>>>> >
>>>>> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>>>>> > the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>>>>> > event threads to 4.  Install is on the current Amazon Linux AMI and
>>>>> > using OpenJDK 1.8.0.91 x64.
>>>>> >
>>>>> > Any ideas, other debug steps, or changes that I can try?  I'm running
>>>>> > 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>>>> > versions.  The higher the flowfile volume I push through, the faster
>>>>> > this happens.
>>>>> >
>>>>> > Thanks for any help there is to give!
>>>>> >
>>>>> > -Aaron Longfield
>>>>>
>>>>
>>>
>>
>>
>
