@Mark - thanks for that note. I hadn't tried restarting. When I did that, the performance dropped back down. So I'm back to the drawing board.
@Phillip - I didn't have any other services/components/dataflows going. It was just those 2 processors (I tried to remove every variable I could to make it as controlled as possible). And during the week I ran that test, there wasn't any slowdown at all. Even when I turned on the rest of the dataflows (~2500 components total), everything was performing as expected. There is very, very little variability in data volumes, so I don't have any reason to believe that is the cause of the slowdown.

I'm going to try to see what kind of NiFi diagnostics and such I can get. Is there anywhere that explains the output of nifi.sh dump and nifi.sh diagnostics?

Thanks all for the help.

-Aaron

On Fri, Jan 12, 2024 at 11:45 AM Phillip Lord <[email protected]> wrote:

> Ditto...
>
> @Aaron... so outside of the GenerateFlowFile -> PutFile, were there additional components/dataflows handling data at the same time as the "stress-test"? These will all share the same thread pool. So depending upon your dataflow footprint and any variability regarding data volumes... 20 timer-driven threads could be exhausted pretty quickly. This might cause not only your "stress-test" to slow down but your other flows as well, as components might be waiting for available threads to do their jobs.
>
> Thanks,
> Phil
>
> On Thu, Jan 11, 2024 at 3:44 PM Mark Payne <[email protected]> wrote:
>
>> Aaron,
>>
>> Interestingly, up to version 1.21 of NiFi, if you increased the size of the thread pool, it increased immediately. But if you decreased the size of the thread pool, the decrease didn't take effect until you restarted NiFi. So that's probably why you're seeing the behavior you are. Even though you reset it to 10 or 20, it's still running at 40.
>>
>> This was done due to issues with Java many years ago, where it caused problems to decrease the thread pool size. So just recently we updated NiFi to immediately scale down the thread pools as well.
>>
>> Thanks
>> -Mark
>>
>>
>> On Jan 11, 2024, at 1:35 PM, Aaron Rich <[email protected]> wrote:
>>
>> So the good news is it's working now. I know what I did, but I don't know why it worked, so I'm hoping others can enlighten me based on what I did.
>>
>> TL;DR - "turn it off/turn it on" for Max Timer Driven Thread Count fixed performance. Max Timer Driven Thread Count was set to 20. I changed it to 30 - performance increased. I changed it again to 40 - it increased. I moved it back to 20 - performance was still up, at what it originally was before it ever slowed down.
>>
>> (this is long to give background and details)
>> NiFi version: 1.19.1
>>
>> NiFi was deployed into a Kubernetes cluster as a single instance - no NiFi clustering. We set a CPU request of 4 and limit of 8, and a memory request of 8 and limit of 12. The repos are all volume-mounted out to SSD.
>>
>> The original deployment was as described above, and Max Timer Driven Thread Count was set to 20. We ran a very simple data flow (GenerateFlowFile -> PutFile) as fast as possible to try to stress things as much as possible before starting our other data flows. That ran for a week with no issue doing 20K/5m.
>> We turned on the other data flows and everything was processing as expected, good throughput rates, and things were happy.
>> Then after 3 days the throughput dropped DRAMATICALLY (instead of 11K/5m in an UpdateAttribute, it went to 350/5m). The data being processed did not change in volume/cadence/velocity/etc.
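For anyone wondering what that pre-1.21 behavior looks like concretely, here is a minimal Java sketch. It is hypothetical (not NiFi's actual code) and only illustrates the semantics Mark describes: increases are applied to the running scheduler immediately, while decreases are deferred until a restart rebuilds the pool at the configured size.

import java.util.concurrent.ScheduledThreadPoolExecutor;

// Hypothetical illustration of the pre-1.21 resize semantics described above.
public class TimerDrivenPoolSizer {

    private final ScheduledThreadPoolExecutor executor;

    public TimerDrivenPoolSizer(final int configuredThreads) {
        this.executor = new ScheduledThreadPoolExecutor(configuredThreads);
    }

    public void setMaxTimerDrivenThreadCount(final int requested) {
        final int current = executor.getCorePoolSize();
        if (requested > current) {
            // Growing the pool takes effect right away on the live executor.
            executor.setCorePoolSize(requested);
        }
        // Shrinking was historically skipped (reducing the pool size caused problems
        // with older JVMs), so a lower value only took effect after a restart
        // re-created the executor with the configured size. Newer NiFi versions
        // apply the decrease immediately as well.
    }
}

That would explain why dropping the value from 40 back to 20 left the instance effectively running with 40 threads until the restart.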
>> Rancher Cluster Explorer dashboards didn't show resources standing out as limiting or constraining.
>> Tried restarting the workload in Kubernetes, and data flows were slow right from the start - so there wasn't a ramp up or any degradation over time - it was just slow to begin with.
>> Tried removing all the repos/state so NiFi came up clean, in case it was the historical data that was the issue - still slow from the start.
>> Tried changing the node in the Kube cluster in case the node was bad - still slow from the start.
>> Removed the CPU limit (allowing NiFi to potentially use all 16 cores on the node) from the deployment to see if there was CPU throttling happening that I wasn't able to see on the Grafana dashboards - still slow from the start.
>> While NiFi was running, I changed the Max Timer Driven Thread Count from 20->30, and performance picked up. Changed it again from 30->40, and performance picked up. I changed it from 40->10, and performance stayed up. I changed it from 10->20, and performance stayed up, at the original amount before the slowdown ever happened.
>>
>> So at the end of the day, the Max Timer Driven Thread Count is at exactly what it was before, but the performance changed. It's like something was "stuck". It's very, very odd to me to see things be fine, degrade for days and through multiple environment changes/debugging, and then return to fine when I change a parameter to a different value and back to the original value. Effectively, I "turned it off/turned it on" with the Max Timer Driven Thread Count value.
>>
>> My question is - what is happening under the hood when the Max Timer Driven Thread Count is changed? What does that affect? Is there something I could look at from Kubernetes' side potentially that would relate to that value?
>>
>> Could an internal NiFi thread have gotten stuck, and changing that value rebuilt the thread pool? If that is even possible, is there any way to know what caused the thread to "get stuck" in the first place?
>>
>> Any insight would be greatly appreciated!
>>
>> Thanks so much for all the suggestions and help on this.
>>
>> -Aaron
>>
>>
>>
>> On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich <[email protected]> wrote:
>>
>>> Hi Joe,
>>>
>>> Nothing is load balanced - it's all basic queues.
>>>
>>> Mark,
>>> I'm using NiFi 1.19.1.
>>>
>>> nifi.performance.tracking.percentage sounds like exactly what I might need. I'll give that a shot.
>>>
>>> Richard,
>>> I hadn't looked at the rotating logs and/or cleared them out. I'll give that a shot too.
>>>
>>> Thank you all. Please keep the suggestions coming.
>>>
>>> -Aaron
>>>
>>> On Wed, Jan 10, 2024 at 1:34 PM Richard Beare <[email protected]> wrote:
>>>
>>>> I had a similar-sounding issue, although not in a Kube cluster. NiFi was running in a Docker container, and the issue was the log rotation interacting with the log file being mounted from the host. The mounted log file was not deleted on rotation, meaning that once rotation was triggered by log file size it would be continually triggered, because the new log file was never emptied. The clue was that the content of rotated logfiles was mostly the same, with only a small number of messages appended to each new one. Rotating multi-GB logs was enough to destroy performance, especially if it was being triggered frequently by debug messages.
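On the earlier questions about nifi.sh dump and whether a timer-driven thread could have gotten stuck: the dump output is essentially a Java thread dump, so the usual approach is to take two dumps a minute or so apart and look for threads sitting in the same stack frame in both. The Java sketch below does the same comparison programmatically. It is only an illustration: it has to run inside the NiFi JVM (for example from a scripted processor) to see NiFi's threads, and the "Timer-Driven" name prefix is an assumption about how the worker threads are named.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Illustrative stuck-thread check: sample the top stack frame of each
// Timer-Driven thread twice and flag any thread that has not moved.
public class StuckThreadCheck {

    public static void main(String[] args) throws InterruptedException {
        final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        final Map<Long, String> first = topFrames(threads);
        Thread.sleep(60_000);   // a busy thread should move on; a stuck one won't
        final Map<Long, String> second = topFrames(threads);

        for (Map.Entry<Long, String> entry : first.entrySet()) {
            final String later = second.get(entry.getKey());
            if (entry.getValue().equals(later)) {
                System.out.println("Possibly stuck thread " + entry.getKey() + " at " + later);
            }
        }
    }

    private static Map<Long, String> topFrames(final ThreadMXBean threads) {
        final Map<Long, String> frames = new HashMap<>();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info.getThreadName().startsWith("Timer-Driven") && info.getStackTrace().length > 0) {
                frames.put(info.getThreadId(), info.getStackTrace()[0].toString());
            }
        }
        return frames;
    }
}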
>>>>
>>>> On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich <[email protected]> wrote:
>>>>
>>>>> Hi Joe,
>>>>>
>>>>> It's pretty fixed-size objects at a fixed interval - one 5MB-ish file, which we break down to individual rows.
>>>>>
>>>>> I went so far as to create a "stress test" where I have a GenerateFlowFile (creating a fixed 100k file, in batches of 1000, every 0.1s) feeding right into a PutFile. I wanted to see the sustained max. It was very stable and fast for over a week running - but now it's extremely slow. That was about as simple a data flow as I could think of to hit all the different resources (CPU, memory, etc.).
>>>>>
>>>>> I was thinking too, maybe it was memory, but it's slow right at the start when starting NiFi. I would expect memory to cause it to get slower over time, and the stress test showed it wasn't something that was fluctuating over time.
>>>>>
>>>>> I'm happy to make other flows that anyone can suggest to help troubleshoot/diagnose the issue.
>>>>>
>>>>> Lars,
>>>>>
>>>>> We haven't changed it between when performance was good and now when it's slow. That is what is throwing me - nothing changed from a NiFi configuration standpoint.
>>>>> My guess is we are having some throttling/resource contention from our provider, but I can't determine what/where/how. The Grafana cluster dashboards I have don't indicate issues. If there are suggestions for specific cluster metrics to plot/dashboards to use, I'm happy to build them and contribute them back (I do have a dashboard I need to figure out how to share for creating the "status history" plots in Grafana).
>>>>> The repos aren't full, and I even tried blowing them away just to see if that made a difference.
>>>>> I'm not seeing anything new in the logs that indicates an issue... but maybe I'm missing it, so I will try to look again.
>>>>>
>>>>> By chance, are there any low-level debugging metrics/observability/etc. that would show how long things like writing to the repository disks is taking? There is a part of me that feels this could be a disk I/O resource issue, but I don't know how I can verify that is/isn't the issue.
>>>>>
>>>>> Thank you all for the help and suggestions - please keep them coming, as I'm grasping at straws right now.
>>>>>
>>>>> -Aaron
>>>>>
>>>>>
>>>>> On Wed, Jan 10, 2024 at 10:10 AM Joe Witt <[email protected]> wrote:
>>>>>
>>>>>> Aaron,
>>>>>>
>>>>>> The usual suspects are memory consumption leading to high GC leading to lower performance over time, or back pressure in the flow, etc. But your description does not really fit either exactly. Does your flow see a mix of large objects and smaller objects?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm running into an odd issue and hoping someone can point me in the right direction.
>>>>>>>
>>>>>>> I have NiFi 1.19 deployed in a Kube cluster with all the repositories volume-mounted out. It was processing great, with processors like UpdateAttribute sending through 15K/5m and PutFile sending through 3K/5m.
>>>>>>>
>>>>>>> With nothing changing in the deployment, the performance has dropped to UpdateAttribute doing 350/5m and PutFile to 200/5m.
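On the disk I/O question above: aside from the nifi.performance.tracking.percentage property mentioned earlier in the thread, one crude standalone check is to time flushed writes against the same volume the content repository is mounted on and compare against a node you trust. Below is a rough Java sketch of that idea; the repository path is just a placeholder, so point it at whatever path the repo volume is actually mounted on.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Rough write-latency probe for the volume backing a NiFi repository.
// Not a NiFi feature - just a way to see whether the storage itself is slow or throttled.
public class RepoDiskProbe {

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real mount point as the first argument.
        final Path dir = Path.of(args.length > 0 ? args[0] : "/opt/nifi/content_repository");
        final byte[] chunk = new byte[1024 * 1024];            // 1 MiB per write

        for (int i = 0; i < 10; i++) {
            final Path file = dir.resolve("io-probe-" + i + ".tmp");
            final long start = System.nanoTime();
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                for (int j = 0; j < 5; j++) {                  // ~5 MB total, roughly the file size in the flow
                    channel.write(ByteBuffer.wrap(chunk));
                }
                channel.force(true);                           // include the flush in the timing
            }
            final long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("write " + i + ": " + millis + " ms");
            Files.deleteIfExists(file);
        }
    }
}

If these numbers are an order of magnitude worse than on a healthy node, that points at storage throttling rather than anything inside NiFi.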
>>>>>>>
>>>>>>> I'm trying to determine what resource is suddenly dropping our performance like this. I don't see anything on the Kube monitoring that stands out, and I have restarted, cleaned repos, and changed nodes, but nothing is helping.
>>>>>>>
>>>>>>> I was hoping there is something from the NiFi POV that can help identify the limiting resource. I'm not sure if there is additional diagnostic/debug/etc. information available beyond the node status graphs.
>>>>>>>
>>>>>>> Any help would be greatly appreciated.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> -Aaron
>>>>>>
>>
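One more concrete check for the "limiting resource" question, since high GC overhead is one of the usual suspects Joe mentions above: sample the JVM's cumulative garbage-collection time over a window and see what fraction of wall-clock time it consumes. A minimal Java sketch follows; it has to run inside the NiFi JVM (e.g. via a scripted processor) to measure NiFi's collectors rather than its own.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal GC-pressure check: compare cumulative collection time across a sample window.
public class GcPressureCheck {

    public static void main(String[] args) throws InterruptedException {
        final long gcBefore = totalGcMillis();
        final long start = System.currentTimeMillis();

        Thread.sleep(60_000);                                  // one-minute sample window

        final long gcMillis = totalGcMillis() - gcBefore;
        final long elapsed = System.currentTimeMillis() - start;
        System.out.printf("GC used %d ms of %d ms (%.1f%%)%n",
                gcMillis, elapsed, 100.0 * gcMillis / elapsed);
    }

    private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());      // cumulative ms spent collecting
        }
        return total;
    }
}

A low single-digit percentage is unremarkable; a sustained figure well above that would point back toward the memory/GC explanation rather than threads or disk.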
