Re: Nifi cluster nodes regularly stop processing any flowfiles

Aaron Longfield Mon, 01 Aug 2016 13:04:28 -0700

Joe,

Sure, I can give that a go.  Any serious bugs that I might run across with
that branch that should make me worried about running it on a production
flow?


-Aaron

On Mon, Aug 1, 2016 at 4:01 PM, Joe Witt <[email protected]> wrote:

> Aaron,
>
> It doesn't look like the 0.x version of that patch has been created
> yet.  Any chance you could build master (slated for upcoming 1.x
> release) and try that?
>
> Thanks
> Joe
>
> On Mon, Aug 1, 2016 at 3:30 PM, Aaron Longfield <[email protected]>
> wrote:
> > Great, glad there's already a fixed bug for it!  Is there anything I can
> > try to work around it for now, or at least just get longer processing
> > times between restarts?
> >
> > -Aaron
> >
> > On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <[email protected]>
> wrote:
> >>
> >> Aaron,
> >>
> >> Thanks for getting that to us quickly! It is extremely useful.
> >>
> >> Joe,
> >>
> >> I do indeed believe this is the same thing. I was in the middle of
> typing
> >> a response, but you beat me to it!
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >> > On Aug 1, 2016, at 11:49 AM, Joe Witt <[email protected]> wrote:
> >> >
> >> > Aaron, Mark,
> >> >
> >> > In looking at the thread-dump provided it looks to me like this is the
> >> > same as what was reported and addressed in
> >> > https://issues.apache.org/jira/browse/NIFI-2395
> >> >
> >> > The fix for this has not yet been released but it is slated to end up on
> >> > an 0.x and 1.0 release line.
> >> >
> >> > Mark do you agree it is the same thing by looking at the logs?
> >> >
> >> > Thanks
> >> > Joe
> >> >
> >> > On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <
> [email protected]>
> >> > wrote:
> >> >> Alright, here you go for one of the nodes!
> >> >>
> >> >> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]>
> >> >> wrote:
> >> >>>
> >> >>> Aaron,
> >> >>>
> >> >>> Any time that you find that NiFi has stopped performing its work, the
> >> >>> best thing to do is to perform a thread-dump and send it to the
> >> >>> mailing list. This allows us to determine what exactly is happening,
> >> >>> so we know what action is being performed that prevents any other
> >> >>> progress.
> >> >>>
> >> >>> To do this, you can go to the NiFi node that is not performing and
> run
> >> >>> the
> >> >>> command:
> >> >>>
> >> >>> bin/nifi.sh dump thread-dump.txt
> >> >>>
> >> >>> This will generate a file named thread-dump.txt that you can send to
> >> >>> us.
> >> >>>
> >> >>> Thanks!
> >> >>> -Mark
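
Once thread-dump.txt exists, a quick tally of thread states can show at a glance whether many threads are BLOCKED or WAITING before the file is sent to the list. The Python below is a minimal sketch and not part of NiFi. It assumes each thread's state appears either on a header line that starts with the quoted thread name or on a "java.lang.Thread.State:" line; the exact layout of the file written by nifi.sh is an assumption here, and the function name and sample text are hypothetical.

```python
import re
from collections import Counter

# The six java.lang.Thread.State names. Matching is case-sensitive, so
# lowercase words such as "waiting on condition" are not counted.
_STATE = re.compile(r"\b(NEW|RUNNABLE|BLOCKED|TIMED_WAITING|WAITING|TERMINATED)\b")


def tally_thread_states(dump_text: str) -> Counter:
    """Count thread states found in the text of a JVM thread dump.

    Assumption (not confirmed by the thread): the state is printed either
    on a header line beginning with the quoted thread name, or on a
    'java.lang.Thread.State:' line. Stack-frame lines are ignored.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"') or "java.lang.Thread.State:" in stripped:
            match = _STATE.search(stripped)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample text; a real run would read thread-dump.txt instead.
    sample = (
        '"Timer-Driven Process Thread-1" Id=41 BLOCKED on java.lang.Object@1a2b\n'
        "    at example.Hypothetical.method(Hypothetical.java:10)\n"
        '"Timer-Driven Process Thread-2" Id=42 TIMED_WAITING on java.lang.Object@3c4d\n'
    )
    for state, count in tally_thread_states(sample).most_common():
        print(state, count)
```

A large BLOCKED count relative to the configured timer-driven thread limit would be consistent with the "no progress" symptom described in this thread, but the full dump is still what the developers need to see.
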
> >> >>>
> >> >>>
> >> >>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]>
> >> >>> wrote:
> >> >>>
> >> >>> I've been trying different things to try to fix my NiFi freeze
> >> >>> problems,
> >> >>> and it seems the most frequent reason that my cluster gets stuck and
> >> >>> stops
> >> >>> processing has to do with network related processors.  My data
> enters
> >> >>> the
> >> >>> environment from Kafka and leaves via a site-to-site output port.
> >> >>> After
> >> >>> some time processing (sometimes a few minutes, sometimes a few
> hours)
> >> >>> one of
> >> >>> those will start logging connection errors, and then that node will
> >> >>> stop
> >> >>> processing any flowfiles across all processors.
> >> >>>
> >> >>> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to
> >> >>> RHEL7
> >> >>> (although RHEL seems to be happier).  I've tried restricting threads
> >> >>> to less
> >> >>> than the number of available cores on each node, different heap
> sizes,
> >> >>> and
> >> >>> different garbage collectors.  So far none of that has prevented the
> >> >>> problem, unfortunately.
> >> >>>
> >> >>> I'm not quite ready to build all custom processors for my flow
> >> >>> logic...
> >> >>> most of it is straightforward attribute routing, text replacement,
> and
> >> >>> flowfile merging.
> >> >>>
> >> >>> What are other things that I could try, or just be doing wrong that
> >> >>> could
> >> >>> lead to this?  I'm happy to keep trying suggestions and changes; I
> >> >>> really
> >> >>> want this to work!
> >> >>>
> >> >>> Thanks,
> >> >>> -Aaron
> >> >>>
> >> >>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]>
> wrote:
> >> >>>>
> >> >>>> Aaron,
> >> >>>>
> >> >>>> I ran into an issue where the Execute Stream Command (ESC)
> processor
> >> >>>> with
> >> >>>> many threads would run a legacy script that would hang if the
> >> >>>> incoming file
> >> >>>> was 'inconsistent'.  It appeared that ESC slowly collected stuck
> >> >>>> threads as
> >> >>>> malformed data randomly streamed through it. Eventually I ran out
> of
> >> >>>> threads
> >> >>>> as the system was just waiting for a thread to become available.
> >> >>>>
> >> >>>> It was apparent in the processor statistics where the flowfiles-out
> >> >>>> statistic would eventually step down to zero as threads became
> stuck.
> >> >>>>
> >> >>>> It might be worth trying InvokeScriptedProcessor or building custom
> >> >>>> processors as they provide a means to handle these inconsistencies
> >> >>>> more
> >> >>>> gracefully.
> >> >>>>
> >> >>>>
> >> >>>>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Lee
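
The failure mode Lee describes, worker threads pinned one by one by an external script that never returns, can be illustrated outside NiFi. The Python below is only a sketch of one possible reading of "handle these inconsistencies more gracefully": bound the script's run time so a hung invocation gives its thread back. It is not NiFi or InvokeScriptedProcessor code, and the function name, commands, and timeout values are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(command, payload: bytes, timeout_seconds: float):
    """Run an external command, feeding payload on stdin.

    Returns (ok, stdout_bytes). ok is False when the command exits with a
    non-zero status or does not finish within timeout_seconds.
    """
    try:
        completed = subprocess.run(
            command,
            input=payload,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising, so the
        # calling thread is free again instead of waiting forever.
        return False, b""
    return completed.returncode == 0, completed.stdout


if __name__ == "__main__":
    # Stand-ins for a well-behaved script and one that hangs on bad input.
    well_behaved = [sys.executable, "-c",
                    "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    hangs = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(well_behaved, b"ok", 30.0))
    print(run_with_timeout(hangs, b"", 0.5))
```

A caller could route a timed-out input to a failure path rather than retrying it, which keeps malformed data from slowly consuming every available thread.
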
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield
> >> >>>> <[email protected]>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Hi Mark,
> >> >>>>>
> >> >>>>> I've been using the G1 garbage collector.  I brought the nodes down
> >> >>>>> to 8GB heap and let it run overnight, but processing still got
> >> >>>>> stuck and required NiFi to be restarted on all nodes.  It took
> >> >>>>> longer to happen, but they went down after a few hours.  Are there
> >> >>>>> any other things I can look into?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>>
> >> >>>>> -Aaron
> >> >>>>>
> >> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]
> >
> >> >>>>> wrote:
> >> >>>>>>
> >> >>>>>> Aaron,
> >> >>>>>>
> >> >>>>>> My guess would be that you are hitting a Full Garbage Collection.
> >> >>>>>> With
> >> >>>>>> such a huge Java heap, that will cause a "stop the world" pause
> for
> >> >>>>>> quite a
> >> >>>>>> long time.
> >> >>>>>> Which garbage collector are you using? Have you tried reducing
> the
> >> >>>>>> heap
> >> >>>>>> from 48 GB to say 4 or 8 GB?
> >> >>>>>>
> >> >>>>>> Thanks
> >> >>>>>> -Mark
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield
> >> >>>>>>> <[email protected]>
> >> >>>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> I'm having an issue with a small (two node) NiFi cluster where
> the
> >> >>>>>>> nodes will stop processing any queued flowfiles.  I haven't seen
> >> >>>>>>> any error
> >> >>>>>>> messages logged related to it, and when attempting to restart
> the
> >> >>>>>>> service,
> >> >>>>>>> NiFi doesn't respond and the script forcibly kills it.  This
> >> >>>>>>> causes multiple flowfile versions to hang around, and generally
> >> >>>>>>> makes me feel like it might be causing data loss.
> >> >>>>>>>
> >> >>>>>>> I'm running the web UI on a different box, and when things stop
> >> >>>>>>> working, it stops showing changes to counts in any queues, and
> the
> >> >>>>>>> thread
> >> >>>>>>> count never changes.  It still thinks the nodes are connecting
> and
> >> >>>>>>> responding, though.
> >> >>>>>>>
> >> >>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB
> given
> >> >>>>>>> to
> >> >>>>>>> the NiFi JVM in bootstrap.conf.  I have timer threads limited to
> >> >>>>>>> 12, and
> >> >>>>>>> event threads to 4.  Install is on the current Amazon Linux AMI
> >> >>>>>>> and using
> >> >>>>>>> OpenJDK 1.8.0.91 x64.
> >> >>>>>>>
> >> >>>>>>> Any ideas, other debug steps, or changes that I can try?  I'm
> >> >>>>>>> running
> >> >>>>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring
> >> >>>>>>> with both
> >> >>>>>>> versions.  The higher the flowfile volume I push through, the
> >> >>>>>>> faster this
> >> >>>>>>> happens.
> >> >>>>>>>
> >> >>>>>>> Thanks for any help there is to give!
> >> >>>>>>>
> >> >>>>>>> -Aaron Longfield
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >>
> >
>
