@Mark - thanks for that note. I hadn't tried restarting. When I did that,
the performance dropped back down. So I'm back to the drawing board.

@Phillip - I didn't have any other services/components/dataflows going. It
was just those 2 processors going (I tried to remove every variable I could
to make it as controlled as possible). And during the week I ran that test,
there wasn't any slowdown at all. Even when I turned on the rest of the
dataflows (~2500 components total), everything was performing as expected.
There is very, very little variability in data volumes, so I don't have any
reason to believe that is the cause of the slowdown.

I'm going to see what kind of NiFi diagnostics and related output I can
gather.

Is there anywhere that explains the output of nifi.sh dump and
nifi.sh diagnostics?
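
(For reference, here is roughly how I've been invoking them from the NiFi home
directory - the output file names are just placeholders I picked:

  ./bin/nifi.sh dump thread-dump.txt
  ./bin/nifi.sh diagnostics diagnostics.txt

The first writes a thread dump and the second a broader diagnostics report to
the given file; if I understand it right, omitting the file name sends the
output to the logs instead.)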

Thanks all for the help.

-Aaron

On Fri, Jan 12, 2024 at 11:45 AM Phillip Lord <[email protected]>
wrote:

> Ditto...
>
> @Aaron... so outside of the GenerateFlowFile -> PutFile, were there
> additional components/dataflows handling data at the same time as the
> "stress-test"?  These all share the same thread pool.  So depending
> upon your dataflow footprint and any variability in data volumes...
> 20 timer-driven threads could be exhausted pretty quickly.  This might
> cause not only your "stress-test" to slow down but your other flows as well,
> since components might be waiting for available threads to do their jobs.
>
> Thanks,
> Phil
>
> On Thu, Jan 11, 2024 at 3:44 PM Mark Payne <[email protected]> wrote:
>
>> Aaron,
>>
>> Interestingly, up to version 1.21 of NiFi, if you increased the size of
>> the thread pool, it took effect immediately. But if you decreased the size of
>> the thread pool, the decrease didn’t take effect until you restarted NiFi. So
>> that’s probably why you’re seeing the behavior you are. Even though you
>> reset it to 10 or 20, it’s still running at 40.
>>
>> This was done due to issues with Java many years ago, where decreasing
>> the thread pool size caused problems.  So just recently we updated NiFi to
>> immediately scale down the thread pools as well.
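>>
>> Illustrative only - this isn't NiFi's actual scheduler code, just a minimal
>> Java sketch of the java.util.concurrent resize calls involved, using the
>> same numbers as above:
>>
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.ScheduledThreadPoolExecutor;
>>
>> public class PoolResizeSketch {
>>     public static void main(String[] args) {
>>         // stand-in for a timer-driven scheduling pool sized at 20
>>         ScheduledThreadPoolExecutor pool =
>>                 (ScheduledThreadPoolExecutor) Executors.newScheduledThreadPool(20);
>>
>>         // growing takes effect immediately: new workers can be started
>>         // as soon as queued tasks need them
>>         pool.setCorePoolSize(40);
>>
>>         // shrinking only lets excess workers exit once they go idle;
>>         // older NiFi versions simply deferred the smaller size until restart
>>         pool.setCorePoolSize(20);
>>
>>         pool.shutdown();
>>     }
>> }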
>>
>> Thanks
>> -Mark
>>
>>
>> On Jan 11, 2024, at 1:35 PM, Aaron Rich <[email protected]> wrote:
>>
>> So the good news is it's working now. I know what I did but I don't know
>> why it worked so I'm hoping others can enlighten me based on what I did.
>>
>> TL;DR - "turn it off/turn it on" for Max Timer Driven Thread Count fixed
>> performance. Max Timer Driven Thread Count was set to 20. I changed it to
>> 30 - performance increased. I changed it again to 40 - it increased. I moved
>> it back to 20 - performance was still up, back at what it originally was
>> before the slowdown ever happened.
>>
>> (this is long to give background and details)
>> NiFi version: 1.19.1
>>
>> NiFi was deployed into a Kubernetes cluster as a single instance - no
>> NiFi clustering. We set a CPU request of 4 and limit of 8, and a memory
>> request of 8 and limit of 12. The repos are all volume-mounted out to SSD.
>>
>> The original deployment was as described above, and Max Timer Driven
>> Thread Count was set to 20. We ran a very simple data flow
>> (GenerateFlowFile -> PutFile) as fast as possible to stress things as much
>> as we could before starting our other data flows. That ran for a week with
>> no issue doing 20K/5m.
>> We turned on the other data flows and everything was processing as
>> expected, good throughput rates and things were happy.
>> Then, after 3 days, the throughput dropped DRAMATICALLY (an
>> UpdateAttribute went from 11K/5m to 350/5m). The data being processed
>> did not change in volume/cadence/velocity/etc.
>> Rancher Cluster Explorer dashboards didn't show any resource standing out
>> as limiting or constraining.
>> Tried restarting the workload in Kubernetes, and the data flows were slow
>> right from the start - so there wasn't a ramp-up or any degradation over
>> time - it was just slow from the beginning.
>> Tried removing all the repos/state so NiFi came up clean, in case it was
>> the historical data that was the issue - still slow from the start.
>> Tried changing the node in the Kube cluster in case the node was bad - still
>> slow from the start.
>> Removed the CPU limit (allowing NiFi to potentially use all 16 cores on the
>> node) from the deployment to see if there was CPU throttling happening that
>> I wasn't able to see on the Grafana dashboards - still slow from the start.
>> While NiFi was running, I changed the Max Timer Driven Thread Count from
>> 20->30, and performance picked up. Changed it again from 30->40, and
>> performance picked up. I changed it from 40->10, and performance stayed up.
>> I changed it from 10->20, and performance stayed up, at the original level
>> from before the slowdown ever happened.
>>
>> So at the end of the day, the Max Timer Driven Thread Count is at exactly
>> what it was before, but the performance changed. It's like something was
>> "stuck". It's very, very odd to me to see things be fine, degrade for days
>> and through multiple environment changes/debugging, and then return to fine
>> when I change a parameter to a different value and back to the original
>> value. Effectively, I "turned it off/turned it on" with the Max Timer Driven
>> Thread Count value.
>>
>> My question is - what is happening under the hood when the Max Timer
>> Driven Thread Count is changed? What does that affect? Is there something I
>> could look at from Kubernetes' side potentially that would relate to that
>> value?
>>
>> Could an internal NiFi thread have gotten stuck, and changing that value
>> rebuilt the thread pool? If that is even possible, is there any way to know
>> what caused the thread to "get stuck" in the first place?
>>
>> Any insight would be greatly appreciated!
>>
>> Thanks so much for all the suggestions and help on this.
>>
>> -Aaron
>>
>>
>>
>> On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich <[email protected]> wrote:
>>
>>> Hi Joe,
>>>
>>> Nothing is load balanced - it's all basic queues.
>>>
>>> Mark,
>>> I'm using NiFi 1.19.1.
>>>
>>> nifi.performance.tracking.percentage sounds like exactly what I might
>>> need. I'll give that a shot.
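>>>
>>> From what I can tell, that's just a nifi.properties entry - something like
>>> the line below, where the exact percentage is my guess at a starting point:
>>>
>>> # conf/nifi.properties - sample a percentage of processor runs for
>>> # performance tracking (the default of 0 disables it)
>>> nifi.performance.tracking.percentage=10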
>>>
>>> Richard,
>>> I hadn't looked at the rotating logs and/or cleared them out. I'll give
>>> that a shot too.
>>>
>>> Thank you all. Please keep the suggestions coming.
>>>
>>> -Aaron
>>>
>>> On Wed, Jan 10, 2024 at 1:34 PM Richard Beare <[email protected]>
>>> wrote:
>>>
>>>> I had a similar-sounding issue, although not in a Kube cluster. NiFi
>>>> was running in a Docker container and the issue was the log rotation
>>>> interacting with the log file being mounted from the host. The mounted log
>>>> file was not deleted on rotation, meaning that once rotation was triggered
>>>> by log file size it would be continually triggered because the new log file
>>>> was never emptied. The clue was that the content of rotated logfiles was
>>>> mostly the same, with only a small number of messages appended to each new
>>>> one. Rotating multi GB logs was enough to destroy performance, especially
>>>> if it was being triggered frequently by debug messages.
>>>>
>>>> On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Joe,
>>>>>
>>>>> It's pretty fixed-size objects at a fixed interval - one 5 MB-ish
>>>>> file that we break down into individual rows.
>>>>>
>>>>> I went so far as to create a "stress test" where I have a
>>>>> GenerateFlowFile (creating a fixed 100k file, in batches of 1000, every
>>>>> 0.1s) feeding right into a PutFile. I wanted to see the sustained max. It
>>>>> was very stable and fast for over a week running - but now it's extremely
>>>>> slow. That was about as simple a data flow as I could think of that would
>>>>> still hit all the different resources (CPU, memory, disk).
>>>>>
>>>>> I was thinking too that maybe it was memory, but it's slow right at the
>>>>> start when starting NiFi. I would expect memory issues to make it slower
>>>>> over time, and the stress test showed it wasn't something that was
>>>>> fluctuating over time.
>>>>>
>>>>> I'm happy to make other flows that anyone can suggest to help
>>>>> troubleshoot and diagnose the issue.
>>>>>
>>>>> Lars,
>>>>>
>>>>> We haven't changed it between when performance was good and now when
>>>>> it's slow. That is what is throwing me - nothing changed from a NiFi
>>>>> configuration standpoint.
>>>>> My guess is we are having some throttling/resource contention from our
>>>>> provider but I can't determine what/where/how. The Grafana cluster
>>>>> dashboards I have don't indicate issues. If there are suggestions for
>>>>> specific cluster metrics to plot/dashboards to use, I'm happy to build
>>>>> them and contribute them back (I do have a dashboard I need to figure out
>>>>> how to share for creating the "status history" plots in Grafana).
>>>>> The repos aren't full, and I even tried blowing them away just to see
>>>>> if that made a difference.
>>>>> I'm not seeing anything new in the logs that indicates an issue... but
>>>>> maybe I'm missing it, so I will try to look again.
>>>>>
>>>>> By chance, are there any low-level debugging metrics/observability/etc.
>>>>> that would show how long things like writing to the repository disks are
>>>>> taking? There is a part of me that feels this could be a disk I/O resource
>>>>> issue, but I don't know how I can verify whether that is or isn't the issue.
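>>>>>
>>>>> (On the disk I/O angle, one generic check I might try from the node itself,
>>>>> assuming the sysstat tools are available on the host or in the image:
>>>>>
>>>>> iostat -dx 5   # extended per-device stats every 5s; high await/%util on
>>>>>                # the repo volumes would point at disk contention
>>>>>
>>>>> That's just a standard Linux check, nothing NiFi-specific.)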
>>>>>
>>>>> Thank you all for the help and suggestions - please keep them coming
>>>>> as I'm grasping at straws right now.
>>>>>
>>>>> -Aaron
>>>>>
>>>>>
>>>>> On Wed, Jan 10, 2024 at 10:10 AM Joe Witt <[email protected]> wrote:
>>>>>
>>>>>> Aaron,
>>>>>>
>>>>>> The usual suspects are memory consumption leading to high GC leading
>>>>>> to lower performance over time, or back pressure in the flow, etc. But
>>>>>> your description does not really fit either exactly. Does your flow see
>>>>>> a mix of large objects and smaller objects?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> I’m running into an odd issue and hoping someone can point me in the
>>>>>>> right direction.
>>>>>>>
>>>>>>>
>>>>>>> I have NiFi 1.19 deployed in a Kube cluster with all the
>>>>>>> repositories volume-mounted out. It was processing great, with
>>>>>>> processors like UpdateAttribute sending through 15K/5m and PutFile
>>>>>>> sending through 3K/5m.
>>>>>>>
>>>>>>>
>>>>>>> With nothing changing in the deployment, the performance has dropped
>>>>>>> to UpdateAttribute doing 350/5m and PutFile doing 200/5m.
>>>>>>>
>>>>>>>
>>>>>>> I’m trying to determine what resource is suddenly dropping our
>>>>>>> performance like this. I don’t see anything on the Kube monitoring that
>>>>>>> stands out and I have restarted, cleaned repos, changed nodes but 
>>>>>>> nothing
>>>>>>> is helping.
>>>>>>>
>>>>>>>
>>>>>>> I was hoping there is something from the NiFi POV that can help
>>>>>>> identify the limiting resource. I'm not sure if there is additional
>>>>>>> diagnostic/debug/etc information available beyond the node status 
>>>>>>> graphs.
>>>>>>>
>>>>>>>
>>>>>>> Any help would be greatly appreciated.
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>
>>
