Joe,

Absolutely, I can provide the configuration of every single processor. Could you point me to anything I can read to see how actual content can be cached in memory? Perhaps a link to GitHub. If there is a condition under which processors can avoid reading actual content from local disk, I would like to know about it. What I believe is happening is that the content is somehow cached in memory and does not need to be re-read from local disk, and the nodes that manage this perform blazingly fast. But as soon as a node is not able to do that, its performance plummets due to the large files and it starts accumulating a backlog.
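[Editor's note: the "cached in memory" behavior described above is consistent with the Linux page cache, which serves recently written or read file pages without touching the disk. A minimal, hypothetical Python sketch of the effect follows; the file size, scratch path, and chunk size are illustrative and have nothing to do with NiFi internals.]

```python
import os
import tempfile
import time

# Create a ~64 MB scratch file standing in for a content_repository claim.
# (Size and location are illustrative only.)
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(64 * 1024 * 1024))
os.close(fd)

def read_all(p):
    """Read the whole file in 8 MB chunks and return the elapsed seconds."""
    start = time.perf_counter()
    with open(p, "rb") as f:
        while f.read(8 * 1024 * 1024):
            pass
    return time.perf_counter() - start

# Ask the kernel to drop this file's cached pages so the first read is cold.
# posix_fadvise is Linux-specific; elsewhere the first read may already be warm.
if hasattr(os, "posix_fadvise"):
    rfd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(rfd, 0, 0, os.POSIX_FADV_DONTNEED)
    os.close(rfd)

cold = read_all(path)   # typically hits the disk
warm = read_all(path)   # typically served entirely from the page cache
os.remove(path)
print(f"cold read: {cold:.3f}s, warm read: {warm:.3f}s")
```

On a Linux box the warm read is usually far faster; the same mechanism would let a node stream a just-fetched 400 MB file to Kafka without any visible disk read IO.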
Would you be able to point me to anything in the code or documentation that would help me confirm this?

Thanks

On Thu, Jan 28, 2021 at 8:47 PM Joe Witt <[email protected]> wrote:

> Saltys
>
> It can be possible because those things can still be cached. The way this
> thing really works at scale can be quite awesome actually.
>
> However, we definitely want to help you understand what is happening, but
> the pictures alone don't cut it. We appreciate you have sensitivities/stuff
> you have to remove. But that is also a major factor in being able to help.
>
> We need details on how the processors are configured.
>
> Thanks
>
> On Thu, Jan 28, 2021 at 1:45 PM Zilvinas Saltys <[email protected]> wrote:
>
>> We're still on an old version of Kafka; that's why we're still using the
>> old processors.
>>
>> File sizes vary. Generally they are all within a ±100 MB range before
>> they are uncompressed. There can be some small files, but they are not the
>> majority. From logging I can see that all hosts are processing files of
>> all sizes.
>>
>> Our SQS processor runs on all nodes and takes 1 message at a time. We
>> force initial balancing this way.
>>
>> Any idea how a node can publish a 400 MB file to Kafka and not show any
>> disk read IO at the same time? How could something like that be possible?
>> Is there any way NiFi would not read the file out of the local content
>> repo but have it cached? Or could this just be the kernel caching the
>> entire content repo device?
>>
>> Thanks
>>
>> On Thu, Jan 28, 2021 at 8:39 PM Pierre Villard <[email protected]> wrote:
>>
>>> Not saying this is the issue, but is your Kafka cluster using Kafka
>>> 0.11? Looking at the screenshot, you're using the Kafka processors from
>>> the 0.11 bundle; you might want to look at the processors for Kafka 2.x
>>> instead.
>>>
>>> Are your files more or less evenly distributed in terms of size?
>>> I suppose your SQS processor is running on the primary node only? What
>>> node is that in the previous screenshot?
>>>
>>> Pierre
>>>
>>> On Fri, Jan 29, 2021 at 00:28, Zilvinas Saltys <[email protected]> wrote:
>>>
>>>> My other issue is that load balancing is not rebalancing the queue.
>>>> Perhaps I misunderstand how balancing should work and it only balances
>>>> new incoming files round-robin? I can easily rebalance manually by
>>>> disabling balancing and enabling it again, but after a while it gets
>>>> back to the same situation where some nodes get more and more delayed
>>>> while others remain fine.
>>>>
>>>> On Thu, Jan 28, 2021 at 8:22 PM Zilvinas Saltys <[email protected]> wrote:
>>>>
>>>>> Hi Joe,
>>>>>
>>>>> Yes, it is the same issue. We have followed your advice and reduced
>>>>> the number of threads on our large processors (fetch/compress/publish)
>>>>> to a minimum, then increased it gradually to 4 until the processing
>>>>> rate became acceptable (about 2000 files per 5 minutes). This is a
>>>>> cluster of 25 nodes with 36 cores each.
>>>>>
>>>>> On Thu, Jan 28, 2021 at 8:19 PM Joe Witt <[email protected]> wrote:
>>>>>
>>>>>> I'm assuming this is also the same thing Maksym was asking about
>>>>>> yesterday. Let's try to keep the thread together as this gets
>>>>>> discussed.
>>>>>>
>>>>>> On Thu, Jan 28, 2021 at 1:10 PM Pierre Villard <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Zilvinas,
>>>>>>>
>>>>>>> I'm afraid we would need more details to help you out here.
>>>>>>>
>>>>>>> My first question from quickly looking at the graph: there is a
>>>>>>> host (green line) where the number of queued flow files is more or
>>>>>>> less constantly growing. Where in the flow are the flow files
>>>>>>> accumulating for this node? What processor is creating back
>>>>>>> pressure? Do we have anything in the log for this node around the
>>>>>>> time the flow files start accumulating?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Pierre
>>>>>>>
>>>>>>> On Fri, Jan 29, 2021 at 00:02, Zilvinas Saltys <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We run a 25-node NiFi cluster on version 1.12. We're processing
>>>>>>>> about 2000 files per 5 minutes, where each file is 100 to 500
>>>>>>>> megabytes.
>>>>>>>>
>>>>>>>> What I notice is that some workers degrade in performance and keep
>>>>>>>> accumulating a queued-files delay. See the attached screenshots
>>>>>>>> showing two hosts, one of which is degraded.
>>>>>>>>
>>>>>>>> One seemingly dead giveaway is that the degraded node starts doing
>>>>>>>> heavy, intensive disk read IO while the other node does none. I ran
>>>>>>>> iostat on those nodes and I know that the read IOs are on the
>>>>>>>> content_repository directory. But it makes no sense to me how some
>>>>>>>> of the nodes doing these heavy tasks show no disk read IO. In this
>>>>>>>> example I know that both nodes are processing roughly the same
>>>>>>>> number of files, of the same sizes.
>>>>>>>>
>>>>>>>> The pipeline is somewhat simple:
>>>>>>>> 1) Read from SQS
>>>>>>>> 2) Fetch file contents from S3
>>>>>>>> 3) Publish file contents to Kafka
>>>>>>>> 4) Compress file contents
>>>>>>>> 5) Put compressed contents back to S3
>>>>>>>>
>>>>>>>> To my understanding, all of these operations should require heavy
>>>>>>>> reads from local disk to fetch file contents from the content
>>>>>>>> repository. How is it possible that some nodes process lots of
>>>>>>>> files while showing no disk reads, and then suddenly spike in disk
>>>>>>>> reads and degrade?
>>>>>>>>
>>>>>>>> Any clues would be really helpful.
>>>>>>>> Thanks.
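[Editor's note: to cross-check the iostat observation above, i.e. whether a node's reads are really landing on the device behind content_repository, one rough Linux-only check is to diff `/proc/diskstats` over a short interval. The device name below is a placeholder to replace with whichever device backs your content repository.]

```python
import time

def sectors_read(device):
    """Return total sectors read for `device` since boot, or None if absent.

    In /proc/diskstats, the third column is the device name and the sixth
    column is sectors read (sectors are 512 bytes regardless of the
    device's logical block size). Linux-only.
    """
    try:
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if len(parts) > 5 and parts[2] == device:
                    return int(parts[5])
    except FileNotFoundError:  # not on Linux
        return None
    return None

DEVICE = "sda"  # hypothetical: substitute the device backing content_repository

before = sectors_read(DEVICE)
time.sleep(1)
after = sectors_read(DEVICE)
if before is None or after is None:
    print(f"device {DEVICE} not found (or /proc/diskstats unavailable)")
else:
    print(f"~{(after - before) * 512 / 1024:.0f} KiB read from {DEVICE} in 1s")
```

A node whose NiFi process is pushing hundreds of megabytes through PublishKafka while this counter barely moves is almost certainly being served from the kernel page cache rather than the disk, which would explain the "no read IO" nodes in the screenshots.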
