Ditto... @Aaron... so outside of the GenerateFlowFile -> PutFile, were there additional components/dataflows handling data at the same time as the "stress-test"? These will all share the same thread pool. So depending upon your dataflow footprint and any variability in data volumes... 20 timer-driven threads could be exhausted pretty quickly. That might cause not only your "stress-test" but your other flows to slow down as well, since components would be waiting for available threads to do their jobs.
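If you want to confirm whether the pool is actually saturated, the status bar at the top of the canvas shows the active thread count, or you can poll it over the REST API and compare it to the configured max. A rough sketch in Python - it assumes an unsecured instance reachable at http://localhost:8080/nifi-api (host, port, and lack of auth are all assumptions; a secured instance would need an Authorization header):

import json
import time
import urllib.request

BASE = "http://localhost:8080/nifi-api"  # assumption: adjust for your deployment

def get_json(path):
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

for _ in range(6):  # sample for about a minute
    status = get_json("/flow/status")["controllerStatus"]
    config = get_json("/controller/config")["component"]
    print(f"active threads: {status['activeThreadCount']} / "
          f"max timer driven: {config['maxTimerDrivenThreadCount']}, "
          f"queued: {status['queued']}")
    time.sleep(10)

If the active thread count sits pinned at the configured max while queues back up, thread starvation is the likely culprit.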
Thanks,
Phil

On Thu, Jan 11, 2024 at 3:44 PM Mark Payne <[email protected]> wrote:

> Aaron,
>
> Interestingly, up to version 1.21 of NiFi, if you increased the size of the thread pool, it increased immediately. But if you decreased the size of the thread pool, the decrease didn't take effect until you restarted NiFi. So that's probably why you're seeing the behavior you are. Even though you reset it to 10 or 20, it's still running at 40.
>
> This was done due to issues with Java many years ago, where it caused problems to decrease the thread pool size. So just recently we updated NiFi to immediately scale down the thread pools as well.
>
> Thanks
> -Mark
>
> On Jan 11, 2024, at 1:35 PM, Aaron Rich <[email protected]> wrote:
>
> So the good news is it's working now. I know what I did, but I don't know why it worked, so I'm hoping others can enlighten me based on what I did.
>
> TL;DR - "turn it off/turn it on" for Max Timer Driven Thread Count fixed performance. Max Timer Driven Thread Count was set to 20. I changed it to 30 - performance increased. I changed it again to 40 - it increased. I moved it back to 20 - performance was still up, at what it originally was before ever slowing down.
>
> (This is long to give background and details.)
> NiFi version: 1.19.1
>
> NiFi was deployed into a Kubernetes cluster as a single instance - no NiFi clustering. We set a CPU request of 4 and limit of 8, and a memory request of 8 and limit of 12. The repos are all volume mounted out to SSD.
>
> The original deployment was as described above, and Max Timer Driven Thread Count was set to 20. We ran a very simple data flow (GenerateFlowFile -> PutFile) AFAP to try to stress things as much as possible before starting our other data flows. That ran for a week with no issue, doing 20K/5m.
> We turned on the other data flows and everything was processing as expected, with good throughput rates, and things were happy.
> Then after 3 days the throughput dropped DRAMATICALLY (instead of 11K/5m in an UpdateAttribute, it went to 350/5m). The data being processed did not change in volume/cadence/velocity/etc.
> Rancher Cluster Explorer dashboards didn't show any resources standing out as limiting or constraining.
> Tried restarting the workload in Kubernetes, and the data flows were slow right from the start - so there wasn't a ramp-up or any degradation over time - it was just slow to begin with.
> Tried removing all the repos/state so NiFi came up clean, in case it was the historical data that was the issue - still slow from the start.
> Tried changing the node in the Kube cluster in case the node was bad - still slow from the start.
> Removed the CPU limit (allowing NiFi to potentially use all 16 cores on the node) from the deployment to see if there was CPU throttling happening that I wasn't able to see on the Grafana dashboards - still slow from the start.
> While NiFi was running, I changed the Max Timer Driven Thread Count from 20->30, and performance picked up. Changed it again from 30->40, and performance picked up. I changed it from 40->10, and performance stayed up. I changed it from 10->20, and performance stayed up, back at the original level from before the slowdown ever happened.
>
> So at the end of the day, the Max Timer Driven Thread Count is at exactly what it was before, but the performance changed. It's like something was "stuck". It's very, very odd to me to see things be fine, degrade for days and through multiple environment changes/debugging, and then return to fine when I change a parameter to a different value -> back to the original value.
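For reference, that same read/bump/restore cycle can be scripted against the REST API, which is essentially what the Controller Settings dialog does when you change the value in the UI. A rough sketch in Python - it assumes an unsecured instance at http://localhost:8080/nifi-api and the 1.x /controller/config entity shape (worth double-checking against your version); on a secured instance you'd add a bearer token header:

import json
import urllib.request

BASE = "http://localhost:8080/nifi-api"  # assumption: adjust for your deployment

def get_config():
    with urllib.request.urlopen(BASE + "/controller/config") as resp:
        return json.load(resp)

def set_max_timer_threads(count):
    entity = get_config()  # the current revision is needed for the update
    entity["component"]["maxTimerDrivenThreadCount"] = count
    req = urllib.request.Request(
        BASE + "/controller/config",
        data=json.dumps(entity).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["component"]["maxTimerDrivenThreadCount"]

print("before:", get_config()["component"]["maxTimerDrivenThreadCount"])
print("after :", set_max_timer_threads(20))  # set back to whatever your baseline is

Handy if you ever need to "kick" the pool again without clicking through the UI.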
> Effectively, I "turned it off/turned it on" with the Max Timer Driven Thread Count value.
>
> My question is - what is happening under the hood when the Max Timer Driven Thread Count is changed? What does that affect? Is there something I could look at from Kubernetes' side, potentially, that would relate to that value?
>
> Could an internal NiFi thread have gotten stuck, and changing that value rebuilt the thread pool - if that is even possible? And if so, is there any way to know what caused the thread to "get stuck" in the first place?
>
> Any insight would be greatly appreciated!
>
> Thanks so much for all the suggestions and help on this.
>
> -Aaron
>
> On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich <[email protected]> wrote:
>
>> Hi Joe,
>>
>> Nothing is load balanced - it's all basic queues.
>>
>> Mark,
>> I'm using NiFi 1.19.1.
>>
>> nifi.performance.tracking.percentage sounds like exactly what I might need. I'll give that a shot.
>>
>> Richard,
>> I hadn't looked at the rotating logs and/or cleared them out. I'll give that a shot too.
>>
>> Thank you all. Please keep the suggestions coming.
>>
>> -Aaron
>>
>> On Wed, Jan 10, 2024 at 1:34 PM Richard Beare <[email protected]> wrote:
>>
>>> I had a similar sounding issue, although not in a Kube cluster. NiFi was running in a Docker container, and the issue was the log rotation interacting with the log file being mounted from the host. The mounted log file was not deleted on rotation, meaning that once rotation was triggered by log file size it would be continually triggered, because the new log file was never emptied. The clue was that the content of the rotated logfiles was mostly the same, with only a small number of messages appended to each new one. Rotating multi-GB logs was enough to destroy performance, especially if it was being triggered frequently by debug messages.
>>>
>>> On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich <[email protected]> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> It's pretty fixed-size objects at a fixed interval - one 5 MB-ish file, which we break down into individual rows.
>>>>
>>>> I went so far as to create a "stress test" where I have a GenerateFlowFile (creating a fixed 100k file, in batches of 1000, every 0.1s) feeding right into a PutFile. I wanted to see the sustained max. It was very stable and fast for over a week of running - but now it's extremely slow. That was about as simple a data flow as I could think of to hit all the different resources (CPU, memory, etc.).
>>>>
>>>> I was thinking too that maybe it was memory, but it's slow right at the start when starting NiFi. I would expect memory to cause it to get slower over time, and the stress test showed it wasn't something that was fluctuating over time.
>>>>
>>>> I'm happy to make other flows that anyone can suggest to help troubleshoot/diagnose the issue.
>>>>
>>>> Lars,
>>>>
>>>> We haven't changed it between when performance was good and now when it's slow. That is what is throwing me - nothing changed from a NiFi configuration standpoint.
>>>> My guess is we are having some throttling/resource contention from our provider, but I can't determine what/where/how. The Grafana cluster dashboards I have don't indicate issues. If there are suggestions for specific cluster metrics to plot or dashboards to use, I'm happy to build them and contribute them back (I do have a dashboard I need to figure out how to share for creating the "status history" plots in Grafana).
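On Richard's log-rotation point above - a quick way to spot that pattern is to list the nifi-app logs by size: if the active log is still multi-GB right after a rotation, or the rotated files are all roughly the same size, rotation is being re-triggered on a file that never gets truncated. A small sketch in Python; the log directory path is an assumption based on the standard image layout, so adjust it to your mount:

import glob
import os

LOG_DIR = "/opt/nifi/nifi-current/logs"  # assumption: adjust to your volume mount

# list all app log files (active + rotated) with their sizes
for path in sorted(glob.glob(os.path.join(LOG_DIR, "nifi-app*.log*"))):
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{size_mb:10.1f} MB  {path}")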
>>>> The repos aren't full, and I even tried blowing them away just to see if that made a difference.
>>>> I'm not seeing anything new in the logs that indicates an issue... but maybe I'm missing it, so I will try to look again.
>>>>
>>>> By chance, are there any low-level debugging metrics/observability/etc. that would show how long things like writing to the repository disks are taking? There is a part of me that feels this could be a disk I/O resource issue, but I don't know how I can verify that it is/isn't the issue.
>>>>
>>>> Thank you all for the help and suggestions - please keep them coming, as I'm grasping at straws right now.
>>>>
>>>> -Aaron
>>>>
>>>> On Wed, Jan 10, 2024 at 10:10 AM Joe Witt <[email protected]> wrote:
>>>>
>>>>> Aaron,
>>>>>
>>>>> The usual suspects are memory consumption leading to high GC leading to lower performance over time, or back pressure in the flow, etc. But your description does not really fit either exactly. Does your flow see a mix of large objects and smaller objects?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm running into an odd issue and hoping someone can point me in the right direction.
>>>>>>
>>>>>> I have NiFi 1.19 deployed in a Kube cluster with all the repositories volume mounted out. It was processing great, with processors like UpdateAttribute sending through 15K/5m and PutFile sending through 3K/5m.
>>>>>>
>>>>>> With nothing changing in the deployment, the performance has dropped to UpdateAttribute doing 350/5m and PutFile doing 200/5m.
>>>>>>
>>>>>> I'm trying to determine what resource is suddenly dropping our performance like this. I don't see anything on the Kube monitoring that stands out, and I have restarted, cleaned repos, and changed nodes, but nothing is helping.
>>>>>>
>>>>>> I was hoping there is something from the NiFi POV that can help identify the limiting resource. I'm not sure if there is additional diagnostic/debug/etc. information available beyond the node status graphs.
>>>>>>
>>>>>> Any help would be greatly appreciated.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> -Aaron
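Re: the earlier question about measuring how long repository writes take - beyond the nifi.performance.tracking.percentage property Mark suggested, a crude out-of-band sanity check is to time small fsync'd writes on each mounted repo volume and compare against a known-good disk (or the same check on another node). A sketch in Python; the repository paths below are assumptions and should be pointed at your actual mounts:

import os
import time

# assumption: default repo locations inside the container; point these at your mounts
REPO_DIRS = [
    "/opt/nifi/nifi-current/content_repository",
    "/opt/nifi/nifi-current/flowfile_repository",
    "/opt/nifi/nifi-current/provenance_repository",
]

def avg_fsync_ms(directory, size=4096, rounds=50):
    """Average time to write and fsync a small file in the given directory."""
    path = os.path.join(directory, ".io_probe")
    payload = os.urandom(size)
    start = time.perf_counter()
    for _ in range(rounds):
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)
    return (time.perf_counter() - start) / rounds * 1000  # ms per write

for d in REPO_DIRS:
    if os.path.isdir(d):
        print(f"{d}: {avg_fsync_ms(d):.2f} ms per 4 KB fsync'd write")

If one volume comes back an order of magnitude slower than the others (or than a local disk), that would point at storage throttling from the provider rather than anything inside NiFi.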
