Joe,

Sure, I can give that a go. Are there any serious bugs I might run across on that branch that should make me worried about running it on a production flow?
-Aaron

On Mon, Aug 1, 2016 at 4:01 PM, Joe Witt <[email protected]> wrote:

> Aaron,
>
> It doesn't look like the 0.x version of that patch has been created
> yet. Any chance you could build master (slated for the upcoming 1.x
> release) and try that?
>
> Thanks
> Joe
>
> On Mon, Aug 1, 2016 at 3:30 PM, Aaron Longfield <[email protected]> wrote:
> > Great, glad there's already a fixed bug for it! Is there anything I can
> > try to work around it for now, or at least just get longer processing
> > times between restarts?
> >
> > -Aaron
> >
> > On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <[email protected]> wrote:
> >>
> >> Aaron,
> >>
> >> Thanks for getting that to us quickly! It is extremely useful.
> >>
> >> Joe,
> >>
> >> I do indeed believe this is the same thing. I was in the middle of
> >> typing a response, but you beat me to it!
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >> > On Aug 1, 2016, at 11:49 AM, Joe Witt <[email protected]> wrote:
> >> >
> >> > Aaron, Mark,
> >> >
> >> > In looking at the thread-dump provided it looks to me like this is the
> >> > same as what was reported and addressed in
> >> > https://issues.apache.org/jira/browse/NIFI-2395
> >> >
> >> > The fix for this has not yet been released but it is slated to end up
> >> > on an 0.x and 1.0 release line.
> >> >
> >> > Mark do you agree it is the same thing by looking at the logs?
> >> >
> >> > Thanks
> >> > Joe
> >> >
> >> > On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <[email protected]>
> >> > wrote:
> >> >> Alright, here you go for one of the nodes!
> >> >>
> >> >> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <[email protected]>
> >> >> wrote:
> >> >>>
> >> >>> Aaron,
> >> >>>
> >> >>> Any time that you find NiFi stop performing its work, the best thing
> >> >>> to do is to perform a thread-dump and send it to the mailing list.
> >> >>> This allows us to determine exactly what is happening, so we know
> >> >>> what action is being performed that prevents any other progress.
> >> >>>
> >> >>> To do this, you can go to the NiFi node that is not performing and
> >> >>> run the command:
> >> >>>
> >> >>> bin/nifi.sh dump thread-dump.txt
> >> >>>
> >> >>> This will generate a file named thread-dump.txt that you can send to
> >> >>> us.
> >> >>>
> >> >>> Thanks!
> >> >>> -Mark
> >> >>>
> >> >>>
> >> >>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <[email protected]>
> >> >>> wrote:
> >> >>>
> >> >>> I've been trying different things to try to fix my NiFi freeze
> >> >>> problems, and it seems the most frequent reason that my cluster gets
> >> >>> stuck and stops processing has to do with network-related processors.
> >> >>> My data enters the environment from Kafka and leaves via a
> >> >>> site-to-site output port. After some time processing (sometimes a few
> >> >>> minutes, sometimes a few hours) one of those will start logging
> >> >>> connection errors, and then that node will stop processing any
> >> >>> flowfiles across all processors.
> >> >>>
> >> >>> So far, this followed me from 0.6.1 to 0.7.0, and from Amazon Linux
> >> >>> to RHEL7 (although RHEL seems to be happier). I've tried restricting
> >> >>> threads to less than the number of available cores on each node,
> >> >>> different heap sizes, and different garbage collectors. So far none
> >> >>> of that has prevented the problem, unfortunately.
> >> >>>
> >> >>> I'm not quite ready to build all custom processors for my flow
> >> >>> logic... most of it is straightforward attribute routing, text
> >> >>> replacement, and flowfile merging.
> >> >>>
> >> >>> What are other things that I could try, or just be doing wrong, that
> >> >>> could lead to this?
> >> >>> I'm happy to keep trying suggestions and changes; I really want
> >> >>> this to work!
> >> >>>
> >> >>> Thanks,
> >> >>> -Aaron
> >> >>>
> >> >>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <[email protected]> wrote:
> >> >>>>
> >> >>>> Aaron,
> >> >>>>
> >> >>>> I ran into an issue where the Execute Stream Command (ESC) processor
> >> >>>> with many threads would run a legacy script that would hang if the
> >> >>>> incoming file was 'inconsistent'. It appeared that ESC slowly
> >> >>>> collected stuck threads as malformed data randomly streamed through
> >> >>>> it. Eventually I ran out of threads as the system was just waiting
> >> >>>> for a thread to become available.
> >> >>>>
> >> >>>> It was apparent in the processor statistics where the flowfiles-out
> >> >>>> statistic would eventually step down to zero as threads became stuck.
> >> >>>>
> >> >>>> It might be worth trying InvokeScriptedProcessor or building custom
> >> >>>> processors as they provide a means to handle these inconsistencies
> >> >>>> more gracefully.
> >> >>>>
> >> >>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Lee
> >> >>>>
> >> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield
> >> >>>> <[email protected]> wrote:
> >> >>>>>
> >> >>>>> Hi Mark,
> >> >>>>>
> >> >>>>> I've been using the G1 garbage collector. I brought the nodes down
> >> >>>>> to 8GB heap and let it run overnight, but processing still got stuck
> >> >>>>> and required NiFi to be restarted on all nodes. It took longer to
> >> >>>>> happen, but they went down after a few hours. Are there any other
> >> >>>>> things I can look into?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>>
> >> >>>>> -Aaron
> >> >>>>>
> >> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <[email protected]>
> >> >>>>> wrote:
> >> >>>>>>
> >> >>>>>> Aaron,
> >> >>>>>>
> >> >>>>>> My guess would be that you are hitting a Full Garbage Collection.
> >> >>>>>> With such a huge Java heap, that will cause a "stop the world" pause
> >> >>>>>> for quite a long time.
> >> >>>>>> Which garbage collector are you using? Have you tried reducing the
> >> >>>>>> heap from 48 GB to say 4 or 8 GB?
> >> >>>>>>
> >> >>>>>> Thanks
> >> >>>>>> -Mark
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield
> >> >>>>>>> <[email protected]> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> I'm having an issue with a small (two node) NiFi cluster where the
> >> >>>>>>> nodes will stop processing any queued flowfiles. I haven't seen
> >> >>>>>>> any error messages logged related to it, and when attempting to
> >> >>>>>>> restart the service, NiFi doesn't respond and the script forcibly
> >> >>>>>>> kills it. This causes multiple flowfile versions to hang around,
> >> >>>>>>> and generally makes me feel like it might be causing data loss.
> >> >>>>>>>
> >> >>>>>>> I'm running the web UI on a different box, and when things stop
> >> >>>>>>> working, it stops showing changes to counts in any queues, and the
> >> >>>>>>> thread count never changes. It still thinks the nodes are
> >> >>>>>>> connecting and responding, though.
> >> >>>>>>>
> >> >>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB given
> >> >>>>>>> to the NiFi JVM in bootstrap.conf. I have timer threads limited to
> >> >>>>>>> 12, and event threads to 4. Install is on the current Amazon Linux
> >> >>>>>>> AMI and using OpenJDK 1.8.0.91 x64.
> >> >>>>>>>
> >> >>>>>>> Any ideas, other debug steps, or changes that I can try? I'm
> >> >>>>>>> running 0.7.0, having upgraded from 0.6.1, but this has been
> >> >>>>>>> occurring with both versions. The higher the flowfile volume I
> >> >>>>>>> push through, the faster this happens.
> >> >>>>>>>
> >> >>>>>>> Thanks for any help there is to give!
> >> >>>>>>>
> >> >>>>>>> -Aaron Longfield
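
For anyone triaging a dump like the one requested earlier in this thread, here is a minimal sketch in Python of tallying how many threads sit in each Java thread state, which can hint at whether a node has run out of free threads. It rests on an assumption that may not match your file: that each thread begins with a header line starting with a double-quoted thread name followed by its state. The helper name count_thread_states is hypothetical and not part of NiFi; only the six state names come from java.lang.Thread.State.

```python
import re
from collections import Counter

# The six values of java.lang.Thread.State. The word boundaries keep
# WAITING from matching inside TIMED_WAITING (the underscore is a word
# character, so there is no boundary before the W).
STATE_RE = re.compile(
    r"\b(NEW|RUNNABLE|BLOCKED|TIMED_WAITING|WAITING|TERMINATED)\b"
)


def count_thread_states(dump_text):
    """Tally thread states found on thread header lines.

    Assumed layout (not verified against any particular NiFi version):
    every thread starts with a line such as
        "Timer-Driven Process Thread-3" Id=52 BLOCKED on ...
    Lines that do not start with a double quote (stack frames, blank
    lines) are ignored.
    """
    counts = Counter()
    for line in dump_text.splitlines():
        if not line.startswith('"'):
            continue
        # Skip past the quoted thread name so that a state-like word
        # inside the name is not counted.
        name_end = line.find('"', 1)
        rest = line[name_end + 1:]
        match = STATE_RE.search(rest)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example use against the file produced by "bin/nifi.sh dump thread-dump.txt":
#
#     with open("thread-dump.txt") as handle:
#         for state, n in count_thread_states(handle.read()).most_common():
#             print(state, n)
```

A large share of BLOCKED or WAITING threads is only a starting point for reading the dump; the stack frames under each header still have to be inspected to see what the threads are waiting on.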
