My other issue is that load balancing does not appear to rebalance the existing queue. Perhaps I misunderstand how balancing is supposed to work and it only distributes new incoming flow files round robin? I can easily rebalance manually by disabling balancing on the connection and enabling it again, but after a while the cluster drifts back into the same situation: some nodes fall further and further behind while others remain fine.
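In case it helps anyone watch this happen: below is a minimal sketch of polling a single connection's per-node queue depth over the REST API. The endpoint path and field names (`nodewise=true`, `nodeSnapshots`, `flowFilesQueued`) are my reading of the NiFi 1.x REST API, so please verify them against your own instance before relying on this.

```python
"""Sketch: per-node queued-flow-file counts for one NiFi connection.

The endpoint and JSON field names here are assumptions based on the
NiFi 1.x REST API -- verify against your instance.
"""
import json
import urllib.request


def per_node_queue(status):
    """Extract {node address: flowFilesQueued} from a connection status entity."""
    snapshots = status["connectionStatus"]["nodeSnapshots"]
    return {s["address"]: s["statusSnapshot"]["flowFilesQueued"] for s in snapshots}


def fetch_status(base_url, connection_id):
    """Fetch nodewise status for a connection (hypothetical base_url/id)."""
    url = f"{base_url}/nifi-api/flow/connections/{connection_id}/status?nodewise=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Example with a synthetic response (a real one would come from fetch_status()):
sample = {
    "connectionStatus": {
        "nodeSnapshots": [
            {"address": "node-01", "statusSnapshot": {"flowFilesQueued": 120}},
            {"address": "node-02", "statusSnapshot": {"flowFilesQueued": 9800}},
        ]
    }
}
print(per_node_queue(sample))  # {'node-01': 120, 'node-02': 9800}
```

Polling this every minute or so makes it obvious whether the imbalance is in one connection's queue or spread across the flow.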
On Thu, Jan 28, 2021 at 8:22 PM Zilvinas Saltys <[email protected]> wrote:

> Hi Joe,
>
> Yes, it is the same issue. We have used your advice and reduced the number
> of threads on our large processors (fetch/compress/publish) to a minimum,
> then increased gradually to 4 until the processing rate became acceptable
> (about 2000 files per 5 min). This is a cluster of 25 nodes with 36 cores
> each.
>
> On Thu, Jan 28, 2021 at 8:19 PM Joe Witt <[email protected]> wrote:
>
>> I'm assuming this is also the same thing Maksym was asking about
>> yesterday. Let's try to keep the thread together as this gets discussed.
>>
>> On Thu, Jan 28, 2021 at 1:10 PM Pierre Villard
>> <[email protected]> wrote:
>>
>>> Hi Zilvinas,
>>>
>>> I'm afraid we would need more details to help you out here.
>>>
>>> My first question from quickly looking at the graph: there is a host
>>> (green line) where the number of queued flow files is more or less
>>> constantly growing. Where in the flow are the flow files accumulating
>>> for this node? What processor is creating back pressure? Do we have
>>> anything in the log for this node around the time when flow files
>>> start accumulating?
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On Fri, Jan 29, 2021 at 00:02, Zilvinas Saltys
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We run a 25-node NiFi cluster on version 1.12. We're processing about
>>>> 2000 files per 5 minutes, where each file is 100 to 500 megabytes.
>>>>
>>>> What I notice is that some workers degrade in performance and keep
>>>> accumulating a queued-files delay. See the attached screenshots
>>>> showing two hosts, one of which is degraded.
>>>>
>>>> One seemingly dead giveaway is that the degraded node starts doing
>>>> heavy, sustained disk read IO while the other node does almost none.
>>>> I ran iostat on those nodes and I know the read IOs are on the
>>>> content_repository directory.
>>>>
>>>> But it makes no sense to me how some of the nodes doing these heavy
>>>> tasks show no disk read IO. In this example I know that both nodes
>>>> are processing roughly the same number of files of the same size.
>>>>
>>>> The pipeline is fairly simple:
>>>> 1) Read from SQS
>>>> 2) Fetch file contents from S3
>>>> 3) Publish file contents to Kafka
>>>> 4) Compress file contents
>>>> 5) Put compressed contents back to S3
>>>>
>>>> To my understanding, all of these operations should require heavy
>>>> reads from local disk to fetch file contents from the content
>>>> repository. How is it possible that some nodes are processing lots
>>>> of files while showing no disk reads, and then suddenly spike in
>>>> disk reads and degrade?
>>>>
>>>> Any clues would be really helpful.
>>>> Thanks.
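For anyone wanting to reproduce the read-rate measurement without iostat: here is a minimal sketch of the same per-device sampling, assuming Linux and the documented /proc/diskstats layout (field 6 is sectors read, in 512-byte units). The device names and numbers in the example are synthetic.

```python
"""Sketch: per-device read throughput, roughly what iostat reports.

Assumes Linux /proc/diskstats; the kernel documents its read counter
(field 6, 0-based index 5) in 512-byte sectors.
"""
import time

SECTOR_BYTES = 512  # unit used by /proc/diskstats read/write sector counters


def read_sectors(path="/proc/diskstats"):
    """Return {device name: sectors read so far} from a diskstats file."""
    stats = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # fields: major minor name reads_completed reads_merged sectors_read ...
            stats[fields[2]] = int(fields[5])
    return stats


def read_rate(before, after, interval_s):
    """MB/s read per device between two samples taken interval_s apart."""
    return {
        dev: (after[dev] - before.get(dev, after[dev])) * SECTOR_BYTES / interval_s / 1e6
        for dev in after
    }


# Example with synthetic samples (real ones would come from two read_sectors()
# calls separated by time.sleep(interval_s)):
before = {"nvme0n1": 1_000_000}
after = {"nvme0n1": 1_000_000 + 400_000}  # 400k sectors read over 5 s
print(read_rate(before, after, 5))  # about 41 MB/s
```

Mapping the busy device back to the mount holding content_repository (e.g. via `df`) is what confirms where the reads are going.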
