Ditto... @Aaron... so outside of the GenerateFlowFile -> PutFile, were there additional components/dataflows handling data at the same time as the "stress-test"? These will all share the same thread pool. So depending upon your dataflow footprint and any variability in data volumes... 20 timer-driven threads could be exhausted pretty quickly. That might cause not only your "stress-test" but your other flows to slow down as well, since components would be waiting for available threads to do their jobs.
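If you want to confirm whether the pool is actually saturated, the status bar at the top of the canvas shows the active thread count, or you can poll it over the REST API and compare it to the configured max. A rough sketch in Python - it assumes an unsecured instance reachable at http://localhost:8080/nifi-api (host, port, and lack of auth are all assumptions; a secured instance would need an Authorization header):

import json
import time
import urllib.request

BASE = "http://localhost:8080/nifi-api"  # assumption: adjust for your deployment

def get_json(path):
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

for _ in range(6):  # sample for about a minute
    status = get_json("/flow/status")["controllerStatus"]
    config = get_json("/controller/config")["component"]
    print(f"active threads: {status['activeThreadCount']} / "
          f"max timer driven: {config['maxTimerDrivenThreadCount']}, "
          f"queued: {status['queued']}")
    time.sleep(10)

If the active thread count sits pinned at the configured max while queues back up, thread starvation is the likely culprit.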
Thanks,
Phil

On Thu, Jan 11, 2024 at 3:44 PM Mark Payne <[email protected]> wrote:

> Aaron,
>
> Interestingly, up to version 1.21 of NiFi, if you increased the size of the thread pool, it increased immediately. But if you decreased the size of the thread pool, the decrease didn't take effect until you restarted NiFi. So that's probably why you're seeing the behavior you are. Even though you reset it to 10 or 20, it's still running at 40.
>
> This was done due to issues with Java many years ago, where it caused problems to decrease the thread pool size. So just recently we updated NiFi to immediately scale down the thread pools as well.
>
> Thanks
> -Mark
>
> On Jan 11, 2024, at 1:35 PM, Aaron Rich <[email protected]> wrote:
>
> So the good news is it's working now. I know what I did, but I don't know why it worked, so I'm hoping others can enlighten me based on what I did.
>
> TL;DR - "turn it off/turn it on" for Max Timer Driven Thread Count fixed performance. Max Timer Driven Thread Count was set to 20. I changed it to 30 - performance increased. I changed it again to 40 - it increased. I moved it back to 20 - performance was still up, at what it originally was before ever slowing down.
>
> (This is long to give background and details.)
> NiFi version: 1.19.1
>
> NiFi was deployed into a Kubernetes cluster as a single instance - no NiFi clustering. We set a CPU request of 4 and limit of 8, and a memory request of 8 and limit of 12. The repos are all volume mounted out to SSD.
>
> The original deployment was as described above, and Max Timer Driven Thread Count was set to 20. We ran a very simple data flow (GenerateFlowFile -> PutFile) AFAP to try to stress things as much as possible before starting our other data flows. That ran for a week with no issue, doing 20K/5m.
> We turned on the other data flows and everything was processing as expected, with good throughput rates, and things were happy.
> Then after 3 days the throughput dropped DRAMATICALLY (instead of 11K/5m in an UpdateAttribute, it went to 350/5m). The data being processed did not change in volume/cadence/velocity/etc.
> Rancher Cluster Explorer dashboards didn't show any resources standing out as limiting or constraining.
> Tried restarting the workload in Kubernetes, and the data flows were slow right from the start - so there wasn't a ramp-up or any degradation over time - it was just slow to begin with.
> Tried removing all the repos/state so NiFi came up clean, in case it was the historical data that was the issue - still slow from the start.
> Tried changing the node in the Kube cluster in case the node was bad - still slow from the start.
> Removed the CPU limit (allowing NiFi to potentially use all 16 cores on the node) from the deployment to see if there was CPU throttling happening that I wasn't able to see on the Grafana dashboards - still slow from the start.
> While NiFi was running, I changed the Max Timer Driven Thread Count from 20->30, and performance picked up. Changed it again from 30->40, and performance picked up. I changed it from 40->10, and performance stayed up. I changed it from 10->20, and performance stayed up, back at the original level from before the slowdown ever happened.
>
> So at the end of the day, the Max Timer Driven Thread Count is at exactly what it was before, but the performance changed. It's like something was "stuck". It's very, very odd to me to see things be fine, degrade for days and through multiple environment changes/debugging, and then return to fine when I change a parameter to a different value -> back to the original value.
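For reference, that same read/bump/restore cycle can be scripted against the REST API, which is essentially what the Controller Settings dialog does when you change the value in the UI. A rough sketch in Python - it assumes an unsecured instance at http://localhost:8080/nifi-api and the 1.x /controller/config entity shape (worth double-checking against your version); on a secured instance you'd add a bearer token header:

import json
import urllib.request

BASE = "http://localhost:8080/nifi-api"  # assumption: adjust for your deployment

def get_config():
    with urllib.request.urlopen(BASE + "/controller/config") as resp:
        return json.load(resp)

def set_max_timer_threads(count):
    entity = get_config()  # the current revision is needed for the update
    entity["component"]["maxTimerDrivenThreadCount"] = count
    req = urllib.request.Request(
        BASE + "/controller/config",
        data=json.dumps(entity).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["component"]["maxTimerDrivenThreadCount"]

print("before:", get_config()["component"]["maxTimerDrivenThreadCount"])
print("after :", set_max_timer_threads(20))  # set back to whatever your baseline is

Handy if you ever need to "kick" the pool again without clicking through the UI.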
> Effectively, I "turned it off/turned it on" with the Max Timer Driven Thread Count value.
>
> My question is - what is happening under the hood when the Max Timer Driven Thread Count is changed? What does that affect? Is there something I could look at from Kubernetes' side, potentially, that would relate to that value?
>
> Could an internal NiFi thread have gotten stuck, and changing that value rebuilt the thread pool - if that is even possible? And if so, is there any way to know what caused the thread to "get stuck" in the first place?
>
> Any insight would be greatly appreciated!
>
> Thanks so much for all the suggestions and help on this.
>
> -Aaron
>
> On Wed, Jan 10, 2024 at 1:54 PM Aaron Rich <[email protected]> wrote:
>
>> Hi Joe,
>>
>> Nothing is load balanced - it's all basic queues.
>>
>> Mark,
>> I'm using NiFi 1.19.1.
>>
>> nifi.performance.tracking.percentage sounds like exactly what I might need. I'll give that a shot.
>>
>> Richard,
>> I hadn't looked at the rotating logs and/or cleared them out. I'll give that a shot too.
>>
>> Thank you all. Please keep the suggestions coming.
>>
>> -Aaron
>>
>> On Wed, Jan 10, 2024 at 1:34 PM Richard Beare <[email protected]> wrote:
>>
>>> I had a similar sounding issue, although not in a Kube cluster. NiFi was running in a Docker container, and the issue was the log rotation interacting with the log file being mounted from the host. The mounted log file was not deleted on rotation, meaning that once rotation was triggered by log file size it would be continually triggered, because the new log file was never emptied. The clue was that the content of the rotated logfiles was mostly the same, with only a small number of messages appended to each new one. Rotating multi-GB logs was enough to destroy performance, especially if it was being triggered frequently by debug messages.
>>>
>>> On Thu, Jan 11, 2024 at 7:14 AM Aaron Rich <[email protected]> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> It's pretty fixed-size objects at a fixed interval - one 5 MB-ish file, which we break down into individual rows.
>>>>
>>>> I went so far as to create a "stress test" where I have a GenerateFlowFile (creating a fixed 100k file, in batches of 1000, every 0.1s) feeding right into a PutFile. I wanted to see the sustained max. It was very stable and fast for over a week of running - but now it's extremely slow. That was about as simple a data flow as I could think of to hit all the different resources (CPU, memory, etc.).
>>>>
>>>> I was thinking too that maybe it was memory, but it's slow right at the start when starting NiFi. I would expect memory to cause it to get slower over time, and the stress test showed it wasn't something that was fluctuating over time.
>>>>
>>>> I'm happy to make other flows that anyone can suggest to help troubleshoot/diagnose the issue.
>>>>
>>>> Lars,
>>>>
>>>> We haven't changed it between when performance was good and now when it's slow. That is what is throwing me - nothing changed from a NiFi configuration standpoint.
>>>> My guess is we are having some throttling/resource contention from our provider, but I can't determine what/where/how. The Grafana cluster dashboards I have don't indicate issues. If there are suggestions for specific cluster metrics to plot or dashboards to use, I'm happy to build them and contribute them back (I do have a dashboard I need to figure out how to share for creating the "status history" plots in Grafana).
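On Richard's log-rotation point above - a quick way to spot that pattern is to list the nifi-app logs by size: if the active log is still multi-GB right after a rotation, or the rotated files are all roughly the same size, rotation is being re-triggered on a file that never gets truncated. A small sketch in Python; the log directory path is an assumption based on the standard image layout, so adjust it to your mount:

import glob
import os

LOG_DIR = "/opt/nifi/nifi-current/logs"  # assumption: adjust to your volume mount

# list all app log files (active + rotated) with their sizes
for path in sorted(glob.glob(os.path.join(LOG_DIR, "nifi-app*.log*"))):
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{size_mb:10.1f} MB  {path}")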
>>>> The repos aren't full, and I even tried blowing them away just to see if that made a difference.
>>>> I'm not seeing anything new in the logs that indicates an issue... but maybe I'm missing it, so I will try to look again.
>>>>
>>>> By chance, are there any low-level debugging metrics/observability/etc. that would show how long things like writing to the repository disks are taking? There is a part of me that feels this could be a disk I/O resource issue, but I don't know how I can verify that it is/isn't the issue.
>>>>
>>>> Thank you all for the help and suggestions - please keep them coming, as I'm grasping at straws right now.
>>>>
>>>> -Aaron
>>>>
>>>> On Wed, Jan 10, 2024 at 10:10 AM Joe Witt <[email protected]> wrote:
>>>>
>>>>> Aaron,
>>>>>
>>>>> The usual suspects are memory consumption leading to high GC leading to lower performance over time, or back pressure in the flow, etc. But your description does not really fit either exactly. Does your flow see a mix of large objects and smaller objects?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Jan 10, 2024 at 10:07 AM Aaron Rich <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm running into an odd issue and hoping someone can point me in the right direction.
>>>>>>
>>>>>> I have NiFi 1.19 deployed in a Kube cluster with all the repositories volume mounted out. It was processing great, with processors like UpdateAttribute sending through 15K/5m and PutFile sending through 3K/5m.
>>>>>>
>>>>>> With nothing changing in the deployment, the performance has dropped to UpdateAttribute doing 350/5m and PutFile doing 200/5m.
>>>>>>
>>>>>> I'm trying to determine what resource is suddenly dropping our performance like this. I don't see anything on the Kube monitoring that stands out, and I have restarted, cleaned repos, and changed nodes, but nothing is helping.
>>>>>>
>>>>>> I was hoping there is something from the NiFi POV that can help identify the limiting resource. I'm not sure if there is additional diagnostic/debug/etc. information available beyond the node status graphs.
>>>>>>
>>>>>> Any help would be greatly appreciated.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> -Aaron
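Re: the earlier question about measuring how long repository writes take - beyond the nifi.performance.tracking.percentage property Mark suggested, a crude out-of-band sanity check is to time small fsync'd writes on each mounted repo volume and compare against a known-good disk (or the same check on another node). A sketch in Python; the repository paths below are assumptions and should be pointed at your actual mounts:

import os
import time

# assumption: default repo locations inside the container; point these at your mounts
REPO_DIRS = [
    "/opt/nifi/nifi-current/content_repository",
    "/opt/nifi/nifi-current/flowfile_repository",
    "/opt/nifi/nifi-current/provenance_repository",
]

def avg_fsync_ms(directory, size=4096, rounds=50):
    """Average time to write and fsync a small file in the given directory."""
    path = os.path.join(directory, ".io_probe")
    payload = os.urandom(size)
    start = time.perf_counter()
    for _ in range(rounds):
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)
    return (time.perf_counter() - start) / rounds * 1000  # ms per write

for d in REPO_DIRS:
    if os.path.isdir(d):
        print(f"{d}: {avg_fsync_ms(d):.2f} ms per 4 KB fsync'd write")

If one volume comes back an order of magnitude slower than the others (or than a local disk), that would point at storage throttling from the provider rather than anything inside NiFi.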
