Aaron,

Thanks for getting that to us quickly! It is extremely useful.

Joe,

I do indeed believe this is the same thing. I was in the middle of typing a
response, but you beat me to it!

Thanks
-Mark

> On Aug 1, 2016, at 11:49 AM, Joe Witt <[email protected]> wrote:
>
> Aaron, Mark,
>
> In looking at the thread-dump provided it looks to me like this is the
> same as what was reported and addressed in
> https://issues.apache.org/jira/browse/NIFI-2395
>
> The fix for this has not yet been released but it is slated to end up on
> a 0.x and 1.0 release line.
>
> Mark, do you agree it is the same thing by looking at the logs?
>
> Thanks
> Joe
>
> On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <[email protected]> wrote:
>> Alright, here you go for one of the nodes!
>>
>> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]> wrote:
>>>
>>> Aaron,
>>>
>>> Any time that you find NiFi has stopped performing its work, the best
>>> thing to do is to perform a thread-dump and send it to the mailing list.
>>> This allows us to determine what exactly is happening, so we know what
>>> action is being performed that prevents any other progress.
>>>
>>> To do this, you can go to the NiFi node that is not performing and run the
>>> command:
>>>
>>> bin/nifi.sh dump thread-dump.txt
>>>
>>> This will generate a file named thread-dump.txt that you can send to us.
>>>
>>> Thanks!
>>> -Mark
>>>
>>>
>>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]> wrote:
>>>
>>> I've been trying different things to try to fix my NiFi freeze problems,
>>> and it seems the most frequent reason that my cluster gets stuck and stops
>>> processing has to do with network-related processors. My data enters the
>>> environment from Kafka and leaves via a site-to-site output port. After
>>> some time processing (sometimes a few minutes, sometimes a few hours) one of
>>> those will start logging connection errors, and then that node will stop
>>> processing any flowfiles across all processors.
>>>
>>> So far, this followed me from 0.6.1 to 0.7.0, and from Amazon Linux to RHEL7
>>> (although RHEL seems to be happier). I've tried restricting threads to fewer
>>> than the number of available cores on each node, different heap sizes, and
>>> different garbage collectors. So far none of that has prevented the
>>> problem, unfortunately.
>>>
>>> I'm not quite ready to build all custom processors for my flow logic...
>>> most of it is straightforward attribute routing, text replacement, and
>>> flowfile merging.
>>>
>>> What are other things that I could try, or just be doing wrong that could
>>> lead to this? I'm happy to keep trying suggestions and changes; I really
>>> want this to work!
>>>
>>> Thanks,
>>> -Aaron
>>>
>>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]> wrote:
>>>>
>>>> Aaron,
>>>>
>>>> I ran into an issue where the Execute Stream Command (ESC) processor with
>>>> many threads would run a legacy script that would hang if the incoming file
>>>> was 'inconsistent'. It appeared that ESC slowly collected stuck threads as
>>>> malformed data randomly streamed through it. Eventually I ran out of
>>>> threads as the system was just waiting for a thread to become available.
>>>>
>>>> It was apparent in the processor statistics where the flowfiles-out
>>>> statistic would eventually step down to zero as threads became stuck.
>>>>
>>>> It might be worth trying InvokeScriptedProcessor or building custom
>>>> processors as they provide a means to handle these inconsistencies more
>>>> gracefully.
>>>>
>>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>>>>
>>>> Thanks,
>>>> Lee
>>>>
>>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> I've been using the G1 garbage collector.
>>>>> I brought the nodes down to 8GB heap and let it run overnight, but
>>>>> processing still got stuck, requiring NiFi to be restarted on all nodes.
>>>>> It took longer to happen, but they went down after a few hours. Are
>>>>> there any other things I can look into?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Aaron
>>>>>
>>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Aaron,
>>>>>>
>>>>>> My guess would be that you are hitting a Full Garbage Collection. With
>>>>>> such a huge Java heap, that will cause a "stop the world" pause for
>>>>>> quite a long time.
>>>>>> Which garbage collector are you using? Have you tried reducing the heap
>>>>>> from 48 GB to, say, 4 or 8 GB?
>>>>>>
>>>>>> Thanks
>>>>>> -Mark
>>>>>>
>>>>>>
>>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm having an issue with a small (two node) NiFi cluster where the
>>>>>>> nodes will stop processing any queued flowfiles. I haven't seen any
>>>>>>> error messages logged related to it, and when attempting to restart
>>>>>>> the service, NiFi doesn't respond and the script forcibly kills it.
>>>>>>> This causes multiple flowfile versions to hang around, and generally
>>>>>>> makes me feel like it might be causing data loss.
>>>>>>>
>>>>>>> I'm running the web UI on a different box, and when things stop
>>>>>>> working, it stops showing changes to counts in any queues, and the
>>>>>>> thread count never changes. It still thinks the nodes are connecting
>>>>>>> and responding, though.
>>>>>>>
>>>>>>> My environment is two 8-CPU systems w/ 60GB memory with 48GB given to
>>>>>>> the NiFi JVM in bootstrap.conf. I have timer threads limited to 12, and
>>>>>>> event threads to 4. Install is on the current Amazon Linux AMI and
>>>>>>> using OpenJDK 1.8.0.91 x64.
>>>>>>>
>>>>>>> Any ideas, other debug steps, or changes that I can try? I'm running
>>>>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>>>>>> versions. The higher the flowfile volume I push through, the faster
>>>>>>> this happens.
>>>>>>>
>>>>>>> Thanks for any help there is to give!
>>>>>>>
>>>>>>> -Aaron Longfield
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
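
For anyone following the thread-dump advice above, a quick first look at the
file before sending it might be to count how many threads are in each state.
The sketch below is only an illustration: the function name
count_thread_states is made up for this example, and it assumes each thread
in thread-dump.txt starts with a header line of the form
"thread name" Id=123 STATE. If your dump uses a different layout, the
pattern would need adjusting.

```python
import re
from collections import Counter

# ASSUMPTION: each thread starts with a header line such as
#   "Timer-Driven Process Thread-1" Id=52 BLOCKED on java.lang.Object@1a2b
# This layout is assumed for illustration, not taken from the thread.
HEADER = re.compile(r'^"(?P<name>[^"]+)"\s+Id=\d+\s+(?P<state>[A-Z_]+)')


def count_thread_states(dump_text):
    """Return a Counter mapping thread state to number of threads (hypothetical helper)."""
    states = Counter()
    for line in dump_text.splitlines():
        match = HEADER.match(line)
        if match:
            # Only header lines are counted; stack frames are skipped.
            states[match.group("state")] += 1
    return states
```

To try it, read the file produced by bin/nifi.sh dump thread-dump.txt and
pass its contents to count_thread_states. A large share of BLOCKED or WAITING
threads may be a hint about where to look, but it is not a diagnosis on its
own; the full dump is still what the list needs.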
