Aaron,

Interestingly, up to version 1.21 of NiFi, if you increased the size of the 
thread pool, the increase took effect immediately. But if you decreased the size 
of the thread pool, the decrease didn't take effect until you restarted NiFi. So 
that's probably why you're seeing the behavior you are: even though you reset it 
to 10 or 20, it's still running at 40.

This was done due to issues with Java many years ago, where decreasing the 
thread pool size caused problems. So just recently we updated NiFi to 
immediately scale down the thread pools as well.
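
For anyone curious, the timer-driven scheduler is backed by a standard 
java.util.concurrent pool, so the behavior looks roughly like the sketch below 
(this is not NiFi's actual scheduler code, just the JDK API it builds on):

    import java.util.concurrent.ScheduledThreadPoolExecutor;

    // Rough sketch of runtime pool resizing with the plain JDK API
    // (not NiFi's actual scheduler code).
    public class PoolResizeSketch {
        public static void main(String[] args) {
            ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(20);

            // Growing the pool takes effect right away: the executor starts
            // additional threads as work arrives for them.
            pool.setCorePoolSize(40);

            // Shrinking only lowers the target size. Modern JVMs release the
            // excess threads once they go idle, but older JVMs did not always
            // reclaim them - the kind of issue that led NiFi to defer the
            // shrink until restart.
            pool.setCorePoolSize(10);

            pool.shutdown();
        }
    }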

Thanks
-Mark


On Jan 11, 2024, at 1:35 PM, Aaron Rich <[email protected]> wrote:

So the good news is it's working now. I know what I did, but I don't know why 
it worked, so I'm hoping others can enlighten me based on what I did.

TL;DR - "turn it off/turn in on" for Max Timer Driven Thread Count fixed 
performance. Max Timer Driven Thread Count was set to 20. I changed it to 30 - 
performance increased. I changed to more to 40 - it increased. I moved it back 
to 20 - performance was still up and what it originally was before ever slowing 
down.

(this is long to give background and details)
NiFi version: 1.19.1

NiFi was deployed into a Kubernetes cluster as a single instance - no NiFi 
clustering. We set a CPU request of 4 and a limit of 8, and a memory request of 
8 and a limit of 12. The repos are all volume mounted out to SSD.

The original deployment was as described above and Max Timer Driven Thread 
Count was set to 20. We ran a very simple data flow (GenerateFlowFile->PutFile) 
AFAP to try to stress as much as possible before starting our other data flows. 
That ran for a week with no issue doing 20K/5m.
We turned on the other data flows and everything was processing as expected, 
good throughput rates and things were happy.
Then, after 3 days, the throughput dropped DRAMATICALLY (instead of 11K/5m in 
an UpdateAttribute, it went to 350/5m). The data being processed did not change 
in volume/cadence/velocity/etc.
Rancher Cluster Explorer dashboards didn't show any resources standing out as 
limiting or constraining.
Tried restarting the workload in Kubernetes, and the data flows were slow right 
from the start - so there wasn't a ramp-up or any degradation over time - it was 
just slow to begin with.
Tried removing all the repos/state so NiFi came up clean, in case it was the 
historical data that was the issue - still slow from the start.
Tried changing the node in the Kube cluster in case the node was bad - still 
slow from the start.
Removed the CPU limit from the deployment (allowing NiFi to potentially use all 
16 cores on the node) to see if there was CPU throttling happening that I wasn't 
able to see on the Grafana dashboards (see the throttling check sketched after 
these steps) - still slow from the start.
While NiFi was running, I changed the Max Timer Driven Thread Count from 20->30, 
and performance picked up. Changed it again from 30->40, and performance picked 
up. I changed from 40->10, and performance stayed up. I changed from 10->20, and 
performance stayed up and was at the original amount from before the slowdown 
ever happened.
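
(The throttling check referenced above: one way to confirm or rule out CPU 
throttling from inside the pod, independent of the Grafana dashboards, is to 
read the kernel's cgroup counters directly - roughly the sketch below. The 
cgroup paths are an assumption about the container image, nothing 
NiFi-specific.)

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Rough sketch: print the kernel's CPU throttling counters for this
    // container. Paths assume a typical cgroup v2 (/sys/fs/cgroup/cpu.stat)
    // or cgroup v1 (/sys/fs/cgroup/cpu/cpu.stat) layout - adjust as needed.
    public class ThrottleCheck {
        public static void main(String[] args) throws Exception {
            for (Path p : List.of(Path.of("/sys/fs/cgroup/cpu.stat"),
                                  Path.of("/sys/fs/cgroup/cpu/cpu.stat"))) {
                if (Files.exists(p)) {
                    // nr_throttled / throttled_usec (or throttled_time on v1)
                    // growing over time means the pod is hitting its CPU limit.
                    System.out.println(p + ":");
                    Files.readAllLines(p).forEach(System.out::println);
                }
            }
        }
    }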

So at the end of the day, the Max Timer Driven Thread Count is at exactly what 
it was before, but the performance changed. It's like something was "stuck". 
It's very, very odd to me to see things be fine, degrade for days and through 
multiple environment changes/debugging, and then return to fine when I change a 
parameter to a different value and back to the original value. Effectively, I 
"turned it off/turned it on" with the Max Timer Driven Thread Count value.

My question is - what is happening under the hood when the Max Timer Driven 
Thread Count is changed? What does that affect? Is there something I could look 
at from Kubernetes' side potentially that would relate to that value?

Could an internal NiFi thread have gotten stuck, and changing that value 
rebuilt the thread pool? If that is even possible, is there any way to know what 
caused the thread to "get stuck" in the first place?
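
(For the record, one way to answer that the next time it happens is to capture 
thread dumps while the flow is slow - via jstack or programmatically, roughly as 
sketched below - and look for timer-driven threads parked in the same place 
across several dumps. This uses only the standard JVM ThreadMXBean API, nothing 
NiFi-specific.)

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Rough sketch: dump every JVM thread with its state, stack, and lock
    // info. Threads whose names mention "Timer-Driven" and that sit BLOCKED
    // (or parked in the same frame) across several dumps are the suspects.
    public class ThreadDumpSketch {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.println(info);
            }
        }
    }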

Any insight would be greatly appreciated!

Thanks so much for all the suggestions and help on this.

-Aaron



On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich 
<[email protected]<mailto:[email protected]>> wrote:
Hi Joe,

Nothing is load balanced - it's all basic queues.

Mark,
I'm using NiFi 1.19.1.

nifi.performance.tracking.percentage sounds like exactly what I might need. I'll 
give that a shot.

Richard,
I hadn't looked at the rotating logs and/or cleared them out. I'll give that a 
shot too.

Thank you all. Please keep the suggestions coming.

-Aaron

On Wed, Jan 10, 2024 at 1:34 PM Richard Beare 
<[email protected]<mailto:[email protected]>> wrote:
I had a similar-sounding issue, although not in a Kube cluster. NiFi was 
running in a Docker container and the issue was the log rotation interacting 
with the log file being mounted from the host. The mounted log file was not 
deleted on rotation, meaning that once rotation was triggered by log file size 
it would be continually triggered because the new log file was never emptied. 
The clue was that the content of rotated logfiles was mostly the same, with 
only a small number of messages appended to each new one. Rotating multi GB 
logs was enough to destroy performance, especially if it was being triggered 
frequently by debug messages.
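
If you want to check quickly for that symptom, listing the rotated log files 
and their sizes is enough - a rough sketch, assuming the default logs directory 
and nifi-app naming:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Rough sketch: list the nifi-app log files and their sizes. A pile of
    // rotated files that are all roughly the same large size is the symptom
    // described above. Assumes the default logs directory and naming.
    public class LogRotationCheck {
        public static void main(String[] args) throws IOException {
            try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(Path.of("logs"), "nifi-app*")) {
                for (Path file : files) {
                    System.out.printf("%12d  %s%n", Files.size(file), file.getFileName());
                }
            }
        }
    }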

On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich 
<[email protected]<mailto:[email protected]>> wrote:
Hi Joe,

It's pretty fixed-size objects at a fixed interval - one 5 MB-ish file, which 
we break down into individual rows.

I went so far as to create a "stress test" where I have a generateFlow( 
creating a fix, 100k fille, in batches of 1000, every .1s) feeding right into a 
putFile. I wanted to see the sustained max. It was very stable, fast for over a 
week running - but now it's extremely slow. That was able as simple of a data 
flow I could think of to hit all the different resources (CPU, memory

I was thinking, too, that maybe it was memory, but it's slow right at the start 
when starting NiFi. I would expect memory to cause it to be slower over time, 
and the stress test showed it wasn't something that was fluctuating over time.

I'm happy to make other flows that anyone can suggest to help troubleshoot and 
diagnose the issue.

Lars,

We haven't changed it between when performance was good and now when it's slow. 
That is what is throwing me - nothing changed from a NiFi configuration 
standpoint.
My guess is we are having some throttling/resource contention from our provider 
but I can't determine what/where/how. The Grafana cluster dashboards I have 
don't indicate issues. If there are suggestions for specific cluster metrics to 
plot/dashboards to use, I'm happy to build them and contribute them back (I do 
have a dashboard I need to figure out how to share for creating the "status 
history" plots in Grafana).
The repos aren't full, and I even tried blowing them away just to see if that 
made a difference.
I'm not seeing anything new in the logs that indicates an issue... but maybe I'm 
missing it, so I will look again.

By chance, are there any low-level debugging metrics/observability/etc. that 
would show how long things like writing to the repository disks take? There is a 
part of me that feels this could be a disk I/O resource issue, but I don't know 
how I can verify that is/isn't the issue.
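
(To frame the kind of low-level check I have in mind: simply timing a forced 
write onto each repository volume from inside the pod, roughly like the sketch 
below. The directory names are examples, not necessarily our actual layout.)

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Rough sketch: time a forced (fsync'd) 1 MB write on each repository
    // volume. The directory names are examples - point them at the actual
    // mounted repo paths.
    public class DiskLatencyProbe {
        public static void main(String[] args) throws IOException {
            String[] repos = {"content_repository", "flowfile_repository", "provenance_repository"};
            for (String dir : repos) {
                Path probe = Path.of(dir, "io-probe.tmp");
                long start = System.nanoTime();
                try (FileChannel ch = FileChannel.open(probe,
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                    ch.write(ByteBuffer.allocate(1024 * 1024));
                    ch.force(true); // flush to the device, not just the page cache
                }
                System.out.println(dir + ": " + (System.nanoTime() - start) / 1_000 + " us");
                Files.deleteIfExists(probe);
            }
        }
    }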

Thank you all for the help and suggestions - please keep them coming as I'm 
grasping at straws right now.

-Aaron


On Wed, Jan 10, 2024 at 10:10 AM Joe Witt 
<[email protected]<mailto:[email protected]>> wrote:
Aaron,

The usual suspects are memory consumption leading to high GC leading to lower 
performance over time, or back pressure in the flow, etc. But your description 
does not really fit either exactly. Does your flow see a mix of large objects 
and smaller objects?
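
One quick way to check the GC angle is to sample the JVM's garbage collector 
beans a few minutes apart and see whether the accumulated collection time is 
climbing while throughput is low - a rough sketch using only the standard 
management API:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Rough sketch: print cumulative GC counts and accumulated pause time.
    // Sampling this a few minutes apart shows whether the JVM is spending
    // its time in GC while throughput is low.
    public class GcCheck {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                        + ", totalTimeMs=" + gc.getCollectionTime());
            }
        }
    }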

Thanks

On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich 
<[email protected]<mailto:[email protected]>> wrote:
Hi all,

I’m running into an odd issue and hoping someone can point me in the right 
direction.

I have NiFi 1.19 deployed in a Kube cluster with all the repositories volume 
mounted out. It was processing great with processors like UpdateAttribute 
sending through 15K/5m and PutFile sending through 3K/5m.

With nothing changing in the deployment, the performance has dropped to 
UpdateAttribute doing 350/5m and PutFile doing 200/5m.

I’m trying to determine what resource is suddenly dropping our performance like 
this. I don’t see anything on the Kube monitoring that stands out and I have 
restarted, cleaned repos, changed nodes but nothing is helping.

I was hoping there is something from the NiFi POV that can help identify the 
limiting resource. I'm not sure if there is additional diagnostic/debug/etc 
information available beyond the node status graphs.

Any help would be greatly appreciated.

Thanks.

-Aaron
