Re: Nifi cluster nodes regularly stop processing any flowfiles

Aaron Longfield Mon, 01 Aug 2016 12:31:03 -0700

Great, glad there's already a fixed bug for it!  Is there anything I can try
to work around it for now, or at least to get longer processing times
between restarts?


-Aaron

On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <[email protected]> wrote:

> Aaron,
>
> Thanks for getting that to us quickly! It is extremely useful.
>
> Joe,
>
> I do indeed believe this is the same thing. I was in the middle of typing
> a response, but you beat me to it!
>
> Thanks
> -Mark
>
>
> > On Aug 1, 2016, at 11:49 AM, Joe Witt <[email protected]> wrote:
> >
> > Aaron, Mark,
> >
> > In looking at the thread-dump provided it looks to me like this is the
> > same as what was reported and addressed in
> > https://issues.apache.org/jira/browse/NIFI-2395
> >
> > The fix for this has not yet been released, but it is slated to end up
> > on both the 0.x and 1.0 release lines.
> >
> > Mark, do you agree it is the same thing, looking at the logs?
> >
> > Thanks
> > Joe
> >
> > On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <[email protected]>
> wrote:
> >> Alright, here you go for one of the nodes!
> >>
> >> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]>
> wrote:
> >>>
> >>> Aaron,
> >>>
> >>> Any time that you find NiFi has stopped performing its work, the best
> >>> thing to do is to perform a thread-dump and send it to the mailing
> >>> list. This allows us to determine what exactly is happening, so we
> >>> know what action is being performed that prevents any other progress.
> >>>
> >>> To do this, you can go to the NiFi node that is not performing and run
> >>> the command:
> >>>
> >>> bin/nifi.sh dump thread-dump.txt
> >>>
> >>> This will generate a file named thread-dump.txt that you can send to us.
> >>>
> >>> Thanks!
> >>> -Mark
> >>>
> >>>
> >>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]>
> wrote:
> >>>
> >>> I've been trying different things to fix my NiFi freeze problems, and
> >>> it seems the most frequent reason that my cluster gets stuck and stops
> >>> processing has to do with network-related processors.  My data enters
> >>> the environment from Kafka and leaves via a site-to-site output port.
> >>> After some time processing (sometimes a few minutes, sometimes a few
> >>> hours) one of those will start logging connection errors, and then that
> >>> node will stop processing any flowfiles across all processors.
> >>>
> >>> So far, this has followed me from 0.6.1 to 0.7.0, and from Amazon Linux
> >>> to RHEL7 (although RHEL seems to be happier).  I've tried restricting
> >>> threads to fewer than the number of available cores on each node,
> >>> different heap sizes, and different garbage collectors.  So far none of
> >>> that has prevented the problem, unfortunately.
> >>>
> >>> I'm not quite ready to build all custom processors for my flow logic...
> >>> most of it is straightforward attribute routing, text replacement, and
> >>> flowfile merging.
> >>>
> >>> What other things could I try, or what might I be doing wrong that could
> >>> lead to this?  I'm happy to keep trying suggestions and changes; I
> >>> really want this to work!
> >>>
> >>> Thanks,
> >>> -Aaron
> >>>
> >>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]> wrote:
> >>>>
> >>>> Aaron,
> >>>>
> >>>> I ran into an issue where the Execute Stream Command (ESC) processor
> >>>> with many threads would run a legacy script that would hang if the
> >>>> incoming file was 'inconsistent'.  It appeared that ESC slowly
> >>>> collected stuck threads as malformed data randomly streamed through
> >>>> it. Eventually I ran out of threads, as the system was just waiting
> >>>> for a thread to become available.
> >>>>
> >>>> It was apparent in the processor statistics where the flowfiles-out
> >>>> statistic would eventually step down to zero as threads became stuck.
> >>>>
> >>>> It might be worth trying InvokeScriptedProcessor or building custom
> >>>> processors, as they provide a means to handle these inconsistencies
> >>>> more gracefully.
> >>>>
> >>>>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
> >>>>
> >>>> Thanks,
> >>>> Lee
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <
> [email protected]>
> >>>> wrote:
> >>>>>
> >>>>> Hi Mark,
> >>>>>
> >>>>> I've been using the G1 garbage collector.  I brought the nodes down
> >>>>> to 8GB heap and let it run overnight, but processing still got stuck
> >>>>> and required NiFi to be restarted on all nodes.  It took longer to
> >>>>> happen, but they went down after a few hours.  Are there any other
> >>>>> things I can look into?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> -Aaron
> >>>>>
> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>> Aaron,
> >>>>>>
> >>>>>> My guess would be that you are hitting a Full Garbage Collection.
> >>>>>> With such a huge Java heap, that will cause a "stop the world" pause
> >>>>>> for quite a long time.
> >>>>>> Which garbage collector are you using? Have you tried reducing the
> >>>>>> heap from 48 GB to say 4 or 8 GB?
> >>>>>>
> >>>>>> Thanks
> >>>>>> -Mark
> >>>>>>
> >>>>>>
> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <
> [email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm having an issue with a small (two node) NiFi cluster where the
> >>>>>>> nodes will stop processing any queued flowfiles.  I haven't seen
> >>>>>>> any error messages logged related to it, and when attempting to
> >>>>>>> restart the service, NiFi doesn't respond and the script forcibly
> >>>>>>> kills it.  This causes multiple flowfile versions to hang around,
> >>>>>>> and generally makes me feel like it might be causing data loss.
> >>>>>>>
> >>>>>>> I'm running the web UI on a different box, and when things stop
> >>>>>>> working, it stops showing changes to counts in any queues, and the
> >>>>>>> thread count never changes.  It still thinks the nodes are connecting
> >>>>>>> and responding, though.
> >>>>>>>
> >>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB given
> >>>>>>> to the NiFi JVM in bootstrap.conf.  I have timer threads limited to
> >>>>>>> 12, and event threads to 4.  Install is on the current Amazon Linux
> >>>>>>> AMI and using OpenJDK 1.8.0.91 x64.
> >>>>>>>
> >>>>>>> Any ideas, other debug steps, or changes that I can try?  I'm
> >>>>>>> running 0.7.0, having upgraded from 0.6.1, but this has been
> >>>>>>> occurring with both versions.  The higher the flowfile volume I
> >>>>>>> push through, the faster this happens.
> >>>>>>>
> >>>>>>> Thanks for any help there is to give!
> >>>>>>>
> >>>>>>> -Aaron Longfield
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>
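
For anyone following along with the thread-dump step quoted above (bin/nifi.sh dump thread-dump.txt), here is a rough sketch of how one might get a quick overview of the file before sending it to the list. It counts threads per state so that a pile-up of BLOCKED or WAITING threads stands out. This is only an illustration: the header-line pattern is an assumption modeled on Java's ThreadInfo text format ("name" Id=N STATE ...), the sample text and the summarize_dump helper are hypothetical, and a real dump may need a different pattern.

```python
import re
from collections import Counter

# ASSUMPTION: each thread starts with a header line such as
#   "Timer-Driven Process Thread-3" Id=57 BLOCKED on java.lang.Object@1a2b
# Adjust the pattern if your thread-dump.txt uses a different layout.
HEADER = re.compile(r'^"(?P<name>[^"]+)"\s+Id=\d+\s+(?P<state>[A-Z_]+)')


def summarize_dump(text):
    """Return (Counter of thread states, {state: [thread names]})."""
    counts = Counter()
    names = {}
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            state = match.group("state")
            counts[state] += 1
            names.setdefault(state, []).append(match.group("name"))
    return counts, names


# Hypothetical sample input, not taken from a real NiFi dump.
SAMPLE = '''"Timer-Driven Process Thread-1" Id=41 BLOCKED on java.lang.Object@1a2b
    at example.Hypothetical.method(Hypothetical.java:10)
"Timer-Driven Process Thread-2" Id=42 BLOCKED on java.lang.Object@1a2b
"NiFi Web Server-17" Id=17 RUNNABLE
'''

if __name__ == "__main__":
    counts, names = summarize_dump(SAMPLE)
    # Print the most common states first, with the threads in each state.
    for state, total in counts.most_common():
        print(state, total, names[state])
```

To try it on a real file, replace SAMPLE with the contents of thread-dump.txt, for example open("thread-dump.txt").read(). The summary is no substitute for the full dump, which is still what the developers need to see.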
