My other issue is that load balancing does not appear to rebalance the existing queue. Perhaps I misunderstand how balancing is supposed to work and it only distributes new incoming flow files round robin? I can easily rebalance manually by disabling balancing on the connection and enabling it again, but after a while the cluster drifts back into the same situation: some nodes fall further and further behind while others remain fine.
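In case it helps anyone watch this happen: below is a minimal sketch of polling a single connection's per-node queue depth over the REST API. The endpoint path and field names (`nodewise=true`, `nodeSnapshots`, `flowFilesQueued`) are my reading of the NiFi 1.x REST API, so please verify them against your own instance before relying on this.

```python
"""Sketch: per-node queued-flow-file counts for one NiFi connection.

The endpoint and JSON field names here are assumptions based on the
NiFi 1.x REST API -- verify against your instance.
"""
import json
import urllib.request


def per_node_queue(status):
    """Extract {node address: flowFilesQueued} from a connection status entity."""
    snapshots = status["connectionStatus"]["nodeSnapshots"]
    return {s["address"]: s["statusSnapshot"]["flowFilesQueued"] for s in snapshots}


def fetch_status(base_url, connection_id):
    """Fetch nodewise status for a connection (hypothetical base_url/id)."""
    url = f"{base_url}/nifi-api/flow/connections/{connection_id}/status?nodewise=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Example with a synthetic response (a real one would come from fetch_status()):
sample = {
    "connectionStatus": {
        "nodeSnapshots": [
            {"address": "node-01", "statusSnapshot": {"flowFilesQueued": 120}},
            {"address": "node-02", "statusSnapshot": {"flowFilesQueued": 9800}},
        ]
    }
}
print(per_node_queue(sample))  # {'node-01': 120, 'node-02': 9800}
```

Polling this every minute or so makes it obvious whether the imbalance is in one connection's queue or spread across the flow.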
On Thu, Jan 28, 2021 at 8:22 PM Zilvinas Saltys <[email protected]> wrote:

> Hi Joe,
>
> Yes, it is the same issue. We have used your advice and reduced the number
> of threads on our large processors (fetch/compress/publish) to a minimum,
> then increased gradually to 4 until the processing rate became acceptable
> (about 2000 files per 5 min). This is a cluster of 25 nodes with 36 cores
> each.
>
> On Thu, Jan 28, 2021 at 8:19 PM Joe Witt <[email protected]> wrote:
>
>> I'm assuming this is also the same thing Maksym was asking about
>> yesterday. Let's try to keep the thread together as this gets discussed.
>>
>> On Thu, Jan 28, 2021 at 1:10 PM Pierre Villard
>> <[email protected]> wrote:
>>
>>> Hi Zilvinas,
>>>
>>> I'm afraid we would need more details to help you out here.
>>>
>>> My first question from quickly looking at the graph: there is a host
>>> (green line) where the number of queued flow files is more or less
>>> constantly growing. Where in the flow are the flow files accumulating
>>> for this node? What processor is creating back pressure? Do we have
>>> anything in the log for this node around the time when flow files
>>> start accumulating?
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On Fri, Jan 29, 2021 at 00:02, Zilvinas Saltys
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We run a 25-node NiFi cluster on version 1.12. We're processing about
>>>> 2000 files per 5 minutes, where each file is 100 to 500 megabytes.
>>>>
>>>> What I notice is that some workers degrade in performance and keep
>>>> accumulating a queued-files delay. See the attached screenshots
>>>> showing two hosts, one of which is degraded.
>>>>
>>>> One seemingly dead giveaway is that the degraded node starts doing
>>>> heavy, sustained disk read IO while the other node does almost none.
>>>> I ran iostat on those nodes and I know the read IOs are on the
>>>> content_repository directory.
>>>>
>>>> But it makes no sense to me how some of the nodes doing these heavy
>>>> tasks show no disk read IO. In this example I know that both nodes
>>>> are processing roughly the same number of files of the same size.
>>>>
>>>> The pipeline is fairly simple:
>>>> 1) Read from SQS
>>>> 2) Fetch file contents from S3
>>>> 3) Publish file contents to Kafka
>>>> 4) Compress file contents
>>>> 5) Put compressed contents back to S3
>>>>
>>>> To my understanding, all of these operations should require heavy
>>>> reads from local disk to fetch file contents from the content
>>>> repository. How is it possible that some nodes are processing lots
>>>> of files while showing no disk reads, and then suddenly spike in
>>>> disk reads and degrade?
>>>>
>>>> Any clues would be really helpful.
>>>> Thanks.
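For anyone wanting to reproduce the read-rate measurement without iostat: here is a minimal sketch of the same per-device sampling, assuming Linux and the documented /proc/diskstats layout (field 6 is sectors read, in 512-byte units). The device names and numbers in the example are synthetic.

```python
"""Sketch: per-device read throughput, roughly what iostat reports.

Assumes Linux /proc/diskstats; the kernel documents its read counter
(field 6, 0-based index 5) in 512-byte sectors.
"""
import time

SECTOR_BYTES = 512  # unit used by /proc/diskstats read/write sector counters


def read_sectors(path="/proc/diskstats"):
    """Return {device name: sectors read so far} from a diskstats file."""
    stats = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # fields: major minor name reads_completed reads_merged sectors_read ...
            stats[fields[2]] = int(fields[5])
    return stats


def read_rate(before, after, interval_s):
    """MB/s read per device between two samples taken interval_s apart."""
    return {
        dev: (after[dev] - before.get(dev, after[dev])) * SECTOR_BYTES / interval_s / 1e6
        for dev in after
    }


# Example with synthetic samples (real ones would come from two read_sectors()
# calls separated by time.sleep(interval_s)):
before = {"nvme0n1": 1_000_000}
after = {"nvme0n1": 1_000_000 + 400_000}  # 400k sectors read over 5 s
print(read_rate(before, after, 5))  # about 41 MB/s
```

Mapping the busy device back to the mount holding content_repository (e.g. via `df`) is what confirms where the reads are going.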
