The use case where we most often run into this problem involves extracting content from fairly large tarballs. They range from 80GB to the better part of 500GB and contain a huge number of files in the 250KB-1MB range; about 1.5M files per tarball is the norm.

(I am aware that this is a really bad way to get data into NiFi, but the upstream source has absolutely refused to change their export methodology.)
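For context, here is a rough sketch (plain Java with Apache Commons Compress, not our actual flow; the class name, the args[0] path, and the counters are just for illustration) of the entry-by-entry streaming that any consumer of these exports ends up doing:

// Rough sketch only: assumes commons-compress on the classpath and a locally
// readable tarball path passed as args[0]; it just streams through the
// archive one entry at a time rather than expanding it anywhere.
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class TarballSketch {
    public static void main(String[] args) throws IOException {
        Path tarball = Path.of(args[0]); // e.g. one of the 80GB-500GB exports

        long entryCount = 0;
        long totalBytes = 0;
        byte[] buf = new byte[64 * 1024];

        try (InputStream in = new BufferedInputStream(Files.newInputStream(tarball));
             TarArchiveInputStream tar = new TarArchiveInputStream(in)) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Each ~250KB-1MB entry becomes its own unit of work
                // (a FlowFile, in our case); here we just drain and count it.
                int read;
                while ((read = tar.read(buf)) != -1) {
                    totalBytes += read;
                }
                entryCount++;
            }
        }
        System.out.printf("entries=%d, bytes=%d%n", entryCount, totalBytes);
    }
}

Even with nothing buffered per entry, that still works out to roughly 1.5M individual pieces of content (FlowFiles, in our case) per tarball for the cluster to move around.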
On Tue, Sep 7, 2021 at 5:03 PM Joe Witt <[email protected]> wrote:
>
> Ryan
>
> If this is so easily replicated for you it should be trivially found and
> fixed most likely.
>
> Please share, for each node in your cluster, both a thread dump and heap dump
> within 30 mins of startup and again after 24 hours.
>
> This will allow us to see the delta and if there appears to be any sort of
> leak. If you cannot share these then you can do that analysis and share the
> results.
>
> Nobody should have to restart nodes to keep things healthy.
>
> Joe
>
> On Tue, Sep 7, 2021 at 12:58 PM Ryan Hendrickson
> <[email protected]> wrote:
>>
>> We have a daily cron job that restarts our nifi cluster to keep it in a good
>> state.
>>
>> On Mon, Sep 6, 2021 at 6:11 PM Mike Thomsen <[email protected]> wrote:
>>>
>>> > there is a ticket to overcome this (there is no ETA), but other details
>>> > might shed light on a different root cause.
>>>
>>> Good to know I'm not crazy, and it's in the TODO. Until then, it seems
>>> fixable by bouncing the box.
>>>
>>> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence <[email protected]> wrote:
>>> >
>>> > Hi Mike,
>>> >
>>> > I did a quick check on the round robin balancing and, based on what I
>>> > found, the reason for the issue must lie somewhere else, not directly
>>> > within it. The one thing I can think of is the scenario where one (or
>>> > more) nodes are significantly slower than the others. In these cases
>>> > it can happen that the nodes which are “running behind” block the other
>>> > nodes from a balancing perspective.
>>> >
>>> > Based on what you wrote this is a possible reason and there is a ticket
>>> > to overcome this (there is no ETA), but other details might shed light on
>>> > a different root cause.
>>> >
>>> > Regards,
>>> > Bence
>>> >
>>> > > On 2021. Sep 3., at 14:13, Mike Thomsen <[email protected]> wrote:
>>> > >
>>> > > We have a 5 node cluster, and sometimes I've noticed that round robin
>>> > > load balancing stops sending flowfiles to two of them; sometimes,
>>> > > toward the end of the data processing, it can get as low as a single node.
>>> > > Has anyone seen similar behavior?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Mike
