Great, glad there's already a fixed bug for it! Is there anything I can try to work around it for now, or at least just get longer processing times between restarts?
-Aaron

On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <[email protected]> wrote:
> Aaron,
>
> Thanks for getting that to us quickly! It is extremely useful.
>
> Joe,
>
> I do indeed believe this is the same thing. I was in the middle of typing
> a response, but you beat me to it!
>
> Thanks
> -Mark
>
> On Aug 1, 2016, at 11:49 AM, Joe Witt <[email protected]> wrote:
> >
> > Aaron, Mark,
> >
> > In looking at the thread-dump provided it looks to me like this is the
> > same as what was reported and addressed in
> > https://issues.apache.org/jira/browse/NIFI-2395
> >
> > The fix for this has not yet been released but it is slated to end up on
> > an 0.x and 1.0 release line.
> >
> > Mark, do you agree it is the same thing by looking at the logs?
> >
> > Thanks
> > Joe
> >
> > On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <[email protected]> wrote:
> >> Alright, here you go for one of the nodes!
> >>
> >> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]> wrote:
> >>>
> >>> Aaron,
> >>>
> >>> Any time that you find NiFi stop performing its work, the best thing
> >>> to do is to perform a thread-dump and send it to the mailing list.
> >>> This allows us to determine what exactly is happening, so we know
> >>> what action is being performed that prevents any other progress.
> >>>
> >>> To do this, you can go to the NiFi node that is not performing and
> >>> run the command:
> >>>
> >>> bin/nifi.sh dump thread-dump.txt
> >>>
> >>> This will generate a file named thread-dump.txt that you can send to us.
> >>>
> >>> Thanks!
> >>> -Mark
> >>>
> >>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]> wrote:
> >>>
> >>> I've been trying different things to try to fix my NiFi freeze
> >>> problems, and it seems the most frequent reason that my cluster gets
> >>> stuck and stops processing has to do with network-related processors.
> >>> My data enters the environment from Kafka and leaves via a
> >>> site-to-site output port. After some time processing (sometimes a few
> >>> minutes, sometimes a few hours) one of those will start logging
> >>> connection errors, and then that node will stop processing any
> >>> flowfiles across all processors.
> >>>
> >>> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to
> >>> RHEL7 (although RHEL seems to be happier). I've tried restricting
> >>> threads to less than the number of available cores on each node,
> >>> different heap sizes, and different garbage collectors. So far none
> >>> of that has prevented the problem, unfortunately.
> >>>
> >>> I'm not quite ready to build all custom processors for my flow logic...
> >>> most of it is straightforward attribute routing, text replacement, and
> >>> flowfile merging.
> >>>
> >>> What are other things that I could try, or just be doing wrong that
> >>> could lead to this? I'm happy to keep trying suggestions and changes;
> >>> I really want this to work!
> >>>
> >>> Thanks,
> >>> -Aaron
> >>>
> >>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]> wrote:
> >>>>
> >>>> Aaron,
> >>>>
> >>>> I ran into an issue where the Execute Stream Command (ESC) processor
> >>>> with many threads would run a legacy script that would hang if the
> >>>> incoming file was 'inconsistent'. It appeared that ESC slowly
> >>>> collected stuck threads as malformed data randomly streamed through
> >>>> it. Eventually I ran out of threads as the system was just waiting
> >>>> for a thread to become available.
> >>>>
> >>>> It was apparent in the processor statistics where the flowfiles-out
> >>>> statistic would eventually step down to zero as threads became stuck.
> >>>>
> >>>> It might be worth trying InvokeScriptedProcessor or building custom
> >>>> processors as they provide a means to handle these inconsistencies
> >>>> more gracefully.
> >>>>
> >>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
> >>>>
> >>>> Thanks,
> >>>> Lee
> >>>>
> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <[email protected]> wrote:
> >>>>>
> >>>>> Hi Mark,
> >>>>>
> >>>>> I've been using the G1 garbage collector. I brought the nodes down
> >>>>> to 8GB heap and let it run overnight, but processing still got
> >>>>> stuck, requiring NiFi to be restarted on all nodes. It took longer
> >>>>> to happen, but they went down after a few hours. Are there any
> >>>>> other things I can look into?
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> -Aaron
> >>>>>
> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]> wrote:
> >>>>>>
> >>>>>> Aaron,
> >>>>>>
> >>>>>> My guess would be that you are hitting a Full Garbage Collection.
> >>>>>> With such a huge Java heap, that will cause a "stop the world"
> >>>>>> pause for quite a long time.
> >>>>>> Which garbage collector are you using? Have you tried reducing the
> >>>>>> heap from 48 GB to say 4 or 8 GB?
> >>>>>>
> >>>>>> Thanks
> >>>>>> -Mark
> >>>>>>
> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm having an issue with a small (two node) NiFi cluster where
> >>>>>>> the nodes will stop processing any queued flowfiles. I haven't
> >>>>>>> seen any error messages logged related to it, and when attempting
> >>>>>>> to restart the service, NiFi doesn't respond and the script
> >>>>>>> forcibly kills it. This causes multiple flowfile versions to hang
> >>>>>>> around, and generally makes me feel like it might be causing data
> >>>>>>> loss.
> >>>>>>>
> >>>>>>> I'm running the web UI on a different box, and when things stop
> >>>>>>> working, it stops showing changes to counts in any queues, and
> >>>>>>> the thread count never changes. It still thinks the nodes are
> >>>>>>> connecting and responding, though.
> >>>>>>>
> >>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB
> >>>>>>> given to the NiFi JVM in bootstrap.conf. I have timer threads
> >>>>>>> limited to 12, and event threads to 4. Install is on the current
> >>>>>>> Amazon Linux AMI and using OpenJDK 1.8.0.91 x64.
> >>>>>>>
> >>>>>>> Any ideas, other debug steps, or changes that I can try? I'm
> >>>>>>> running 0.7.0, having upgraded from 0.6.1, but this has been
> >>>>>>> occurring with both versions. The higher the flowfile volume I
> >>>>>>> push through, the faster this happens.
> >>>>>>>
> >>>>>>> Thanks for any help there is to give!
> >>>>>>>
> >>>>>>> -Aaron Longfield
