Hi Joe,

Yes, it is the same issue. We followed your advice and reduced the number of threads on our heavy processors (fetch/compress/publish) to a minimum, then increased gradually to 4 until the processing rate became acceptable (about 2000 files per 5 minutes). This is a cluster of 25 nodes with 36 cores each.
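For reference, here is a quick back-of-the-envelope sketch of what the rate above works out to per node. All figures come from this thread (25 nodes, ~2000 files per 5 minutes, files of 100 to 500 MB); the script itself is only illustrative:

```python
# Back-of-the-envelope throughput for the cluster described above.
# Figures taken from the thread: 25 nodes, ~2000 files / 5 min,
# files of 100-500 MB each.

nodes = 25
files_per_5min = 2000
file_mb_min, file_mb_max = 100, 500

# Per-node file rate, assuming load is spread evenly across nodes.
files_per_node_per_min = files_per_5min / nodes / 5

# Per-node data rate needed just to fetch content, at the small and
# large ends of the stated file-size range.
mb_per_node_per_sec_min = files_per_node_per_min * file_mb_min / 60
mb_per_node_per_sec_max = files_per_node_per_min * file_mb_max / 60

print(f"{files_per_node_per_min:.1f} files/node/min")
print(f"~{mb_per_node_per_sec_min:.0f}-{mb_per_node_per_sec_max:.0f} "
      f"MB/s per node just to fetch content")
```

So each node is handling roughly 16 files per minute, on the order of tens to low hundreds of MB/s of content, which is why a handful of concurrent tasks per processor was enough once the bottleneck was removed.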
On Thu, Jan 28, 2021 at 8:19 PM Joe Witt <[email protected]> wrote:

> I'm assuming this is also the same thing Maksym was asking about
> yesterday. Let's try to keep the thread together as this gets discussed.
>
> On Thu, Jan 28, 2021 at 1:10 PM Pierre Villard <[email protected]> wrote:
>
>> Hi Zilvinas,
>>
>> I'm afraid we would need more details to help you out here.
>>
>> My first question from a quick look at the graph: there is a host (green
>> line) where the number of queued flow files is more or less constantly
>> growing. Where in the flow are the flow files accumulating for this node?
>> Which processor is creating back pressure? Is there anything in the log
>> for this node around the time the flow files start accumulating?
>>
>> Thanks,
>> Pierre
>>
>> On Fri, Jan 29, 2021 at 00:02, Zilvinas Saltys <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We run a 25-node NiFi cluster on version 1.12. We're processing about
>>> 2000 files per 5 minutes, where each file is 100 to 500 megabytes.
>>>
>>> What I notice is that some workers degrade in performance and keep
>>> accumulating a backlog of queued flow files. See the attached
>>> screenshots showing two hosts, one of which is degraded.
>>>
>>> One seemingly dead giveaway is that the degraded node starts doing
>>> heavy, intensive disk read IO while the other node keeps doing none. I
>>> ran iostat on those nodes and confirmed that the read IOs are on the
>>> content_repository directory. But it makes no sense to me that some of
>>> the nodes doing this heavy work show no disk read IO. In this example I
>>> know that both nodes are processing roughly the same number of files,
>>> of the same size.
>>> The pipeline is fairly simple:
>>>
>>> 1) Read from SQS
>>> 2) Fetch file contents from S3
>>> 3) Publish file contents to Kafka
>>> 4) Compress file contents
>>> 5) Put the compressed contents back to S3
>>>
>>> To my understanding, all of these operations should require heavy reads
>>> from local disk to fetch file contents from the content repository. How
>>> is it possible that some nodes process lots of files without showing
>>> any disk reads, and then suddenly spike in disk reads and degrade?
>>>
>>> Any clues would be really helpful.
>>> Thanks.
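To make the quoted question concrete: each file's content plausibly gets streamed out of the content repository several times in that pipeline (publish to Kafka, compress, put to S3). A rough sketch of the read volume that would imply per node, using the thread's numbers (the 3x read factor and the 300 MB average are my assumptions, not measurements):

```python
# Rough estimate of content_repository read volume implied by the
# quoted pipeline. Assumption (mine, not from the thread): steps 3-5
# each stream the file's content once, i.e. ~3 full reads per file.

nodes = 25
files_per_5min = 2000
avg_file_mb = 300            # assumed midpoint of the 100-500 MB range
reads_per_file = 3           # PublishKafka + Compress + PutS3 (assumed)

mb_read_per_node_per_sec = (
    files_per_5min / nodes * avg_file_mb * reads_per_file / (5 * 60)
)
print(f"~{mb_read_per_node_per_sec:.0f} MB/s of content-repo reads per node")
```

If a node processing this volume shows essentially zero disk reads in iostat, one plausible explanation is that those reads are being served from memory (e.g. the OS page cache) rather than from disk, though that is speculation on my part rather than something established in the thread.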
