Nick,

Good news: I was able to reproduce this, and I am fairly confident that as long as you increase the swap threshold above 10k you shouldn't see this problem anymore.
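For reference, the change Bryan suggests is a one-line edit in nifi.properties (20000 is the default value he mentions later in the thread):

```properties
# nifi.properties: raise the queue swap threshold back to the default
nifi.queue.swap.threshold=20000
```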
I created this JIRA, which further describes what is happening:
https://issues.apache.org/jira/browse/NIFI-3250

Thanks,

Bryan

On Thu, Dec 22, 2016 at 12:59 PM, Bryan Bende <[email protected]> wrote:

> Nick,
>
> Thanks for reporting back.
>
> Just to confirm the scenario: you ran overnight without any stalling
> happening, and then, while nothing was stalled, you stopped and started
> the GeoEnrichIP processor, which then didn't consume anything from the
> incoming queue? Or were things already stalled from overnight, and you
> stopped and started the processor to see if it would start processing
> again?
>
> I noticed in your nifi.properties you lowered the swap threshold to 1k;
> the default is 20k. Was there a specific reason for lowering it so much?
> Would you be able to do another test putting that back to 20k?
>
> The way swapping works is that when the active queue for a processor
> reaches the threshold (1k in your case), it starts putting any additional
> flow files onto a separate swap queue, and when the swap queue reaches
> 10k it starts writing these swapped flow files to disk in batches of 10k.
>
> I wouldn't expect setting the threshold to 1k to cause no processing to
> happen, but it will definitely cause a lot of extra work, because as soon
> as 10k flow files are swapped back in, you are already over the 1k
> threshold again.
>
> One other thing to check would be whether any heavy garbage collection is
> happening during these stalls. You could probably connect JVisualVM to
> one of your NiFi JVM processes and see if the GC activity graph is
> spiking up.
>
> -Bryan
>
> On Thu, Dec 22, 2016 at 11:36 AM, Nick Carenza <[email protected]> wrote:
>
>> I replaced the Kafka processor with PublishKafka_0_10. It didn't start
>> consuming from the stalled queue. I cleared all the queues again and it
>> ran overnight without stalling, longer than it has before.
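To make the numbers in Bryan's swap explanation concrete, here is a small Python sketch. It only models the arithmetic he describes (threshold, swap queue, 10k batches); the function and variable names are hypothetical and this is not NiFi's actual implementation:

```python
# Illustrative model of the swap behaviour described above (hypothetical
# helper, not NiFi code): flow files beyond the active-queue threshold go
# to a swap queue, and swapped flow files move to/from disk in 10k batches.

SWAP_BATCH = 10_000

def enqueue(active_count, incoming, threshold):
    """Split `incoming` flow files between the active queue and swap."""
    room = max(0, threshold - active_count)
    to_active = min(incoming, room)
    return active_count + to_active, incoming - to_active

# With the default 20k threshold, 15k incoming flow files all stay active:
print(enqueue(0, 15_000, 20_000))   # (15000, 0)

# With a 1k threshold, 14k of them are swapped out, and as soon as one
# 10k batch is swapped back in, the queue is far over the threshold again:
active, swapped = enqueue(0, 15_000, 1_000)
print(active, swapped)              # 1000 14000
print(active + SWAP_BATCH > 1_000)  # True
```

This is the churn Bryan describes: with a 1k threshold, every 10k swap-in immediately re-triggers swap-out.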
>> I stopped and started the GeoEnrichIP processor just now to see if it
>> would stall, and it did. I should be able to restart a processor like
>> that, right, and it should start consuming the queue again? As soon as I
>> clear the stalled queue, whether or not it's full, it starts flowing
>> again.
>>
>> Thanks,
>> Nick
>>
>> On Wed, Dec 21, 2016 at 11:34 AM, Bryan Bende <[email protected]> wrote:
>>
>>> Thanks for the info.
>>>
>>> Since your Kafka broker is 0.10.1, I would be curious whether you
>>> experience the same behavior switching to PublishKafka_0_10.
>>>
>>> The Kafka processors line up like this:
>>>
>>> GetKafka/PutKafka use the 0.8.x Kafka client
>>> ConsumeKafka/PublishKafka use the 0.9.x Kafka client
>>> ConsumeKafka_0_10/PublishKafka_0_10 use the 0.10.x Kafka client
>>>
>>> In some cases it is possible to use a version of the client with a
>>> different version of the broker, but it usually works best to use the
>>> client that goes with the broker.
>>>
>>> I'm wondering if your PutKafka processor is getting stuck somehow,
>>> which then causes back-pressure to build up all the way back to your
>>> TCP processor, since it looked like all your queues were filled up.
>>>
>>> It is entirely possible that there is something else going on, but
>>> maybe we can eliminate the Kafka processor from the list of possible
>>> problems by testing with PublishKafka_0_10.
>>>
>>> -Bryan
>>>
>>> On Wed, Dec 21, 2016 at 2:25 PM, Nick Carenza <[email protected]> wrote:
>>>
>>>> Hey Bryan,
>>>>
>>>> Thanks for taking the time!
>>>>
>>>> - This is NiFi 1.1.0. I had the same troubles on 1.0.0 and upgraded
>>>>   recently with the hope there was a fix for the issue.
>>>> - Kafka is version 2.11-0.10.1.0.
>>>> - I am using the PutKafka processor.
>>>>
>>>> - Nick
>>>>
>>>> On Wed, Dec 21, 2016 at 11:19 AM, Bryan Bende <[email protected]> wrote:
>>>>
>>>>> Hey Nick,
>>>>>
>>>>> Sorry to hear about these troubles. A couple of questions...
>>>>>
>>>>> - What version of NiFi is this?
>>>>> - What version of Kafka are you using?
>>>>> - Which Kafka processor in NiFi are you using? It looks like
>>>>>   PutKafka, but just confirming.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Bryan
>>>>>
>>>>> On Wed, Dec 21, 2016 at 2:00 PM, Nick Carenza <[email protected]> wrote:
>>>>>
>>>>>> I am running into an issue where a processor will stop receiving
>>>>>> flow files from its queue.
>>>>>>
>>>>>> flow: tcp --(100,000)--> evaljsonpath --(100,000)--> geoip
>>>>>> --(100,000)--> putkafka
>>>>>>
>>>>>> This time, putkafka is the processor that has stopped receiving
>>>>>> flowfiles.
>>>>>>
>>>>>> When I try to list the queue, I get a message that says the queue
>>>>>> has no flow files in it. I checked the HTTP request, and the
>>>>>> response says there are 100,000 flow files in the queue, but the
>>>>>> flowFileSummaries array is empty.
>>>>>>
>>>>>>> GET /nifi-api/flowfile-queues/1d72b81f-0159-1000-d09b-dc33e81b35c2/listing-requests/22754339-0159-1000-2dc9-07db09366132 HTTP/1.1
>>>>>>>
>>>>>>> {
>>>>>>>   "listingRequest": {
>>>>>>>     "id": "22754339-0159-1000-2dc9-07db09366132",
>>>>>>>     "uri": "http://ipaddress:8080/nifi-api/flowfile-queues/1d72b81f-0159-1000-d09b-dc33e81b35c2/listing-requests/22754339-0159-1000-2dc9-07db09366132",
>>>>>>>     "submissionTime": "12/21/2016 17:37:07.385 UTC",
>>>>>>>     "lastUpdated": "17:37:07 UTC",
>>>>>>>     "percentCompleted": 100,
>>>>>>>     "finished": true,
>>>>>>>     "maxResults": 100,
>>>>>>>     "state": "Completed successfully",
>>>>>>>     "queueSize": {
>>>>>>>       "byteCount": 288609476,
>>>>>>>       "objectCount": 100000
>>>>>>>     },
>>>>>>>     "flowFileSummaries": [],
>>>>>>>     "sourceRunning": true,
>>>>>>>     "destinationRunning": true
>>>>>>>   }
>>>>>>> }
>>>>>>
>>>>>> I tried stopping and starting all the processors, replacing the
>>>>>> putkafka with a new duplicate putkafka processor and moving the
>>>>>> queue over to it, and restarting kafka itself.
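As an aside, the inconsistency in the listing response above (a completed listing with a non-zero objectCount but an empty flowFileSummaries array) is easy to detect programmatically. A small Python sketch, where the helper name is hypothetical and only the JSON fields shown in the response above are assumed:

```python
import json

def listing_is_inconsistent(body: str) -> bool:
    """True if a finished listing reports queued flow files but no summaries."""
    req = json.loads(body)["listingRequest"]
    return (req["finished"]
            and req["queueSize"]["objectCount"] > 0
            and not req["flowFileSummaries"])

# Trimmed version of the response shown above:
response = """{
  "listingRequest": {
    "finished": true,
    "state": "Completed successfully",
    "queueSize": {"byteCount": 288609476, "objectCount": 100000},
    "flowFileSummaries": []
  }
}"""

print(listing_is_inconsistent(response))  # True
```

A check like this could be run against the listing-request endpoint to flag stalled queues automatically instead of eyeballing the raw JSON.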
>>>>>> I ran a dump with all the processors "running".
>>>>>>
>>>>>> Since this is not running in a production environment, as a last
>>>>>> resort I cleared the queue, and then everything started flowing
>>>>>> again.
>>>>>>
>>>>>> I have experienced this issue many times since I began evaluating
>>>>>> NiFi. I have heard of others having great success with it, so I am
>>>>>> convinced I have misconfigured something. I have tried to provide
>>>>>> any relevant configuration information here:
>>>>>>
>>>>>> # nifi.properties
>>>>>> nifi.version=1.1.0
>>>>>> nifi.flowcontroller.autoResumeState=true
>>>>>> nifi.flowcontroller.graceful.shutdown.period=10 sec
>>>>>> nifi.flowservice.writedelay.interval=500 ms
>>>>>> nifi.administrative.yield.duration=30 sec
>>>>>> nifi.bored.yield.duration=10 millis
>>>>>> nifi.state.management.provider.local=local-provider
>>>>>> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
>>>>>> nifi.queue.swap.threshold=1000
>>>>>> nifi.swap.in.period=5 sec
>>>>>> nifi.swap.in.threads=1
>>>>>> nifi.swap.out.period=5 sec
>>>>>> nifi.swap.out.threads=4
>>>>>> nifi.cluster.is.node=false
>>>>>> nifi.build.tag=nifi-1.1.0-RC2
>>>>>> nifi.build.branch=NIFI-3100-rc2
>>>>>> nifi.build.revision=f61e42c
>>>>>> nifi.build.timestamp=2016-11-26T04:39:37Z
>>>>>>
>>>>>> # JVM memory settings
>>>>>> java.arg.2=-Xms28g
>>>>>> java.arg.3=-Xmx28g
>>>>>> java.arg.13=-XX:+UseG1GC
>>>>>>
>>>>>> controller settings:
>>>>>> timer driven thread count: 10-20 (I have tried values from 10 to 20
>>>>>> and still experience the issue)
>>>>>> event driven thread count: 5 (haven't touched)
>>>>>>
>>>>>> processors:
>>>>>> concurrency: 1-20 (I have tried values from 1 to 20 and still
>>>>>> experience the issue)
>>>>>> scheduling: timer driven (run-schedule: 0, run-duration: 0)
>>>>>>
>>>>>> queues:
>>>>>> backpressure flowfile count: 100,000
>>>>>> backpressure flowfile size: 1G
>>>>>>
>>>>>> machine:
>>>>>> 128g ram
>>>>>> 20 cpu
>>>>>> disk: 3T
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Really I have 2 questions:
>>>>>>
>>>>>> 1. Why is this happening?
>>>>>> 2. Once the flow is in this state, how can I get it flowing again
>>>>>>    without losing flow files?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
