So the good news is it's working now. I know what I did, but I don't know why it worked, so I'm hoping others can enlighten me based on what I did.
TL;DR - "turn it off/turn it on" for Max Timer Driven Thread Count fixed performance. Max Timer Driven Thread Count was set to 20. I changed it to 30 - performance increased. I changed it again to 40 - it increased. I moved it back to 20 - performance was still up and back to what it originally was before it ever slowed down.

(This is long to give background and details.)

NiFi version: 1.19.1

NiFi was deployed into a Kubernetes cluster as a single instance - no NiFi clustering. We set a CPU request of 4 and limit of 8, and a memory request of 8 and limit of 12. The repos are all volume-mounted out to SSD.

The original deployment was as described above, with Max Timer Driven Thread Count set to 20. We ran a very simple data flow (GenerateFlowFile -> PutFile) as fast as possible to stress things as much as we could before starting our other data flows. That ran for a week with no issue, doing 20K/5m. We turned on the other data flows and everything was processing as expected - good throughput rates and things were happy.

Then, after 3 days, the throughput dropped DRAMATICALLY (an UpdateAttribute that had been doing 11K/5m went to 350/5m). The data being processed did not change in volume/cadence/velocity/etc., and the Rancher Cluster Explorer dashboards didn't show any resource standing out as limiting or constraining.

Things I tried:
- Restarted the workload in Kubernetes - the data flows were slow right from the start, so there wasn't a ramp-up or any degradation over time; it was just slow to begin with.
- Removed all the repos/state so NiFi came up clean, in case the historical data was the issue - still slow from the start.
- Changed the node in the Kube cluster in case the node was bad - still slow from the start.
- Removed the CPU limit from the deployment (allowing NiFi to potentially use all 16 cores on the node) to see if there was CPU throttling happening that I wasn't able to see on the Grafana dashboards - still slow from the start.

While NiFi was running, I changed the Max Timer Driven Thread Count from 20 -> 30: performance picked up. Changed it again from 30 -> 40: performance picked up. I changed it from 40 -> 10: performance stayed up. I changed it from 10 -> 20: performance stayed up and was back at the original level from before the slowdown ever happened.

So at the end of the day, the Max Timer Driven Thread Count is exactly what it was before, but the performance changed. It's like something was "stuck". It's very, very odd to me to see things be fine, degrade for days through multiple environment changes and rounds of debugging, and then return to fine when I change a parameter to a different value and back to the original value. Effectively, I "turned it off/turned it on" with the Max Timer Driven Thread Count value.

My question is: what is happening under the hood when the Max Timer Driven Thread Count is changed? What does that affect? Is there something I could look at from the Kubernetes side that would relate to that value? Could an internal NiFi thread have gotten stuck, and changing that value rebuilt the thread pool - if that is even possible? And if it is possible, is there any way to know what caused the thread to "get stuck" in the first place?

Any insight would be greatly appreciated! Thanks so much for all the suggestions and help on this.

-Aaron
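P.S. To make the "rebuilt the thread pool" question concrete, here is roughly what I'm picturing - a minimal, hypothetical Java sketch that assumes the timer-driven scheduler sits on top of something like a shared ScheduledThreadPoolExecutor. Everything here (the class name, the task, the numbers) is made up for illustration; it's my guess at the mechanism, not NiFi's actual code:

    import java.util.concurrent.ScheduledThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class ThreadPoolResizeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical stand-in for the timer-driven scheduler: a shared
            // scheduled pool whose size plays the role of Max Timer Driven
            // Thread Count.
            ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(20);

            // A fake "processor" task scheduled repeatedly, like onTrigger work.
            Runnable task = () -> {
                try {
                    Thread.sleep(10); // pretend to do a little work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            for (int i = 0; i < 50; i++) {
                pool.scheduleAtFixedRate(task, 0, 100, TimeUnit.MILLISECONDS);
            }

            // My guess at what changing the setting does: resize the live pool
            // rather than rebuild it. For a ScheduledThreadPoolExecutor the core
            // size is the effective pool size; raising it allows new worker
            // threads to be created, and lowering it retires the excess workers
            // once they go idle.
            pool.setCorePoolSize(30); // 20 -> 30
            pool.setCorePoolSize(40); // 30 -> 40
            pool.setCorePoolSize(20); // back to 20

            Thread.sleep(500);
            System.out.println("core=" + pool.getCorePoolSize()
                    + " active=" + pool.getActiveCount()
                    + " largest=" + pool.getLargestPoolSize());
            pool.shutdownNow();
        }
    }

If that is really all it is, then bumping the value up would have added fresh worker threads alongside whatever was wedged, and a thread dump taken while it was slow (e.g. nifi.sh dump, or jstack against the NiFi PID) would probably have shown where the original workers were blocked - which is what I wish I had captured.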
On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich <[email protected]> wrote:

> Hi Joe,
>
> Nothing is load balanced - it's all basic queues.
>
> Mark,
> I'm using NiFi 1.19.1.
>
> nifi.performance.tracking.percentage sounds like exactly what I might need. I'll give that a shot.
>
> Richard,
> I hadn't looked at the rotating logs and/or cleared them out. I'll give that a shot too.
>
> Thank you all. Please keep the suggestions coming.
>
> -Aaron
>
> On Wed, Jan 10, 2024 at 1:34 PM Richard Beare <[email protected]> wrote:
>
>> I had a similar-sounding issue, although not in a Kube cluster. NiFi was running in a Docker container, and the issue was the log rotation interacting with the log file being mounted from the host. The mounted log file was not deleted on rotation, meaning that once rotation was triggered by log file size it would be continually triggered, because the new log file was never emptied. The clue was that the content of the rotated log files was mostly the same, with only a small number of messages appended to each new one. Rotating multi-GB logs was enough to destroy performance, especially if it was being triggered frequently by debug messages.
>>
>> On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich <[email protected]> wrote:
>>
>>> Hi Joe,
>>>
>>> It's pretty fixed-size objects at a fixed interval - one 5 MB-ish file that we break down into individual rows.
>>>
>>> I went so far as to create a "stress test" where I have a GenerateFlowFile (creating a fixed 100k file, in batches of 1000, every 0.1s) feeding right into a PutFile. I wanted to see the sustained max. It was very stable and fast for over a week of running - but now it's extremely slow. That was about as simple a data flow as I could think of to hit all the different resources (CPU, memory, etc.).
>>>
>>> I was thinking too that maybe it was memory, but it's slow right at the start when starting NiFi. I would expect memory to cause it to get slower over time, and the stress test showed it wasn't something that was fluctuating over time.
>>>
>>> I'm happy to make other flows that anyone can suggest to help troubleshoot/diagnose the issue.
>>>
>>> Lars,
>>>
>>> We haven't changed it between when performance was good and now when it's slow. That is what is throwing me - nothing changed from a NiFi configuration standpoint.
>>> My guess is we are having some throttling/resource contention from our provider, but I can't determine what/where/how. The Grafana cluster dashboards I have don't indicate issues. If there are suggestions for specific cluster metrics to plot or dashboards to use, I'm happy to build them and contribute them back (I do have a dashboard I need to figure out how to share for creating the "status history" plots in Grafana).
>>> The repos aren't full, and I even tried blowing them away just to see if that made a difference.
>>> I'm not seeing anything new in the logs that indicates an issue... but maybe I'm missing it, so I will try to look again.
>>>
>>> By chance, are there any low-level debugging metrics/observability/etc. that would show how long things like writes to the repository disks are taking? There is a part of me that feels this could be a disk I/O resource issue, but I don't know how I can verify that is/isn't the issue.
>>>
>>> Thank you all for the help and suggestions - please keep them coming, as I'm grasping at straws right now.
>>>
>>> -Aaron
>>>
>>> On Wed, Jan 10, 2024 at 10:10 AM Joe Witt <[email protected]> wrote:
>>>
>>>> Aaron,
>>>>
>>>> The usual suspects are memory consumption leading to high GC leading to lower performance over time, or back pressure in the flow, etc. But your description does not really fit either exactly. Does your flow see a mix of large objects and smaller objects?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I’m running into an odd issue and hoping someone can point me in the right direction.
>>>>>
>>>>> I have NiFi 1.19 deployed in a Kube cluster with all the repositories volume-mounted out. It was processing great, with processors like UpdateAttribute sending through 15K/5m and PutFile sending through 3K/5m.
>>>>>
>>>>> With nothing changing in the deployment, the performance has dropped to UpdateAttribute doing 350/5m and PutFile 200/5m.
>>>>>
>>>>> I’m trying to determine what resource is suddenly dropping our performance like this. I don’t see anything on the Kube monitoring that stands out, and I have restarted, cleaned the repos, and changed nodes, but nothing is helping.
>>>>>
>>>>> I was hoping there is something from the NiFi point of view that can help identify the limiting resource. I'm not sure if there is additional diagnostic/debug/etc. information available beyond the node status graphs.
>>>>>
>>>>> Any help would be greatly appreciated.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Aaron
