Joe, we were asked to open a ticket and add diagnostics there; we've summarized the diagnostics in https://issues.apache.org/jira/browse/NIFI-9056. Unfortunately, we can't export logs en masse, but if there is anything specific, or a series of lines we should be looking for, we can summarize and report back. More than happy to do this.

We've been dragged down by competing priorities, which is why we're stuck with the cron approach at the moment. We're not happy with the cron approach; it is in no way sustainable. We have it on our developer agenda to debug more next week. Our goal is to get you a consistently repeatable test scenario so the issue can be reproduced, or otherwise identify that it is an issue unique to our environment.
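For completeness, the workaround really is nothing more exotic than a scheduled restart of each node, along these lines (the schedule and install path here are illustrative, not our actual values):

    # illustrative only: restart the node nightly at 03:00; adjust the path to your install
    0 3 * * * /opt/nifi/bin/nifi.sh restart >> /var/log/nifi-restart.log 2>&1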
Mike,

That's pretty much our use case too. We get 200 MB tar.gz files streamed in constantly, and each tar.gz is unpacked into 100k-200k individual JSON files. We opened a couple of tickets in response to a few things we found:

1. Documentation of all sysadmin-configurable properties: https://issues.apache.org/jira/browse/NIFI-9029
   We opened this ticket after we realized the undocumented "nifi.content.repository.archive.backpressure.percentage" property was freezing the NiFis. Details are documented in the ticket; the P.S. below sketches the block of properties involved.

2. Warn users on the canvas when "nifi.content.repository.archive.backpressure.percentage" is exceeded: https://issues.apache.org/jira/browse/NIFI-9030
   We opened this because we couldn't find any indication that the percentage had been exceeded, i.e. no indication of the frozen state.

3. Content Repository Filling Up: https://issues.apache.org/jira/browse/NIFI-9056
   This is the ticket where we tried to take the conversation offline from the emails here to Jira and get some hard details documented (as best we can).

Thanks,
Ryan
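P.S. For anyone hitting the same thing, the block of nifi.properties involved looks roughly like the following. The values here are illustrative only (check your own files and the current admin guide); the key point, as we understand it, is that writes to the content repository block once disk usage crosses the backpressure percentage, which is the "frozen" behavior described above:

    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%
    # undocumented at the time we filed NIFI-9029
    nifi.content.repository.archive.backpressure.percentage=52%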
On Fri, Sep 10, 2021 at 8:07 AM Mike Thomsen <[email protected]> wrote:
> The use case where we most often run into this problem involves extracting content from tarballs of varying sizes that are fairly large. These tarballs vary in size from 80GB to the better part of 500GB and contain a ton of 250k-1MB files in them; about 1.5M files per tarball is the norm.
>
> (I am aware that this is a really bad way to get data for NiFi, but the upstream source has absolutely refused to change their export methodology.)
>
> On Tue, Sep 7, 2021 at 5:03 PM Joe Witt <[email protected]> wrote:
> >
> > Ryan
> >
> > If this is so easily replicated for you, it should most likely be trivially found and fixed.
> >
> > Please share, for each node in your cluster, both a thread dump and heap dump within 30 mins of startup and again after 24 hours.
> >
> > This will allow us to see the delta and whether there appears to be any sort of leak. If you cannot share these, then you can do that analysis and share the results.
> >
> > Nobody should have to restart nodes to keep things healthy.
> >
> > Joe
> >
> > On Tue, Sep 7, 2021 at 12:58 PM Ryan Hendrickson <[email protected]> wrote:
> >>
> >> We have a daily cron job that restarts our nifi cluster to keep it in a good state.
> >>
> >> On Mon, Sep 6, 2021 at 6:11 PM Mike Thomsen <[email protected]> wrote:
> >>>
> >>> > there is a ticket to overcome this (there is no ETA), but other details might shed light on a different root cause.
> >>>
> >>> Good to know I'm not crazy, and it's in the TODO. Until then, it seems fixable by bouncing the box.
> >>>
> >>> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence <[email protected]> wrote:
> >>> >
> >>> > Hi Mike,
> >>> >
> >>> > I did a quick check on the round robin balancing and, based on what I found, the reason for the issue must lie somewhere else, not directly within it. The one thing I can think of is the scenario where one (or more) nodes are significantly slower than the others. In these cases it can happen that the nodes that are "running behind" block the other nodes from a balancing perspective.
> >>> >
> >>> > Based on what you wrote this is a possible reason, and there is a ticket to overcome this (there is no ETA), but other details might shed light on a different root cause.
> >>> >
> >>> > Regards,
> >>> > Bence
> >>> >
> >>> > > On 2021. Sep 3., at 14:13, Mike Thomsen <[email protected]> wrote:
> >>> > >
> >>> > > We have a 5 node cluster, and sometimes I've noticed that round robin load balancing stops sending flowfiles to two of them, and sometimes toward the end of the data processing can get as low as a single node. Has anyone seen similar behavior?
> >>> > >
> >>> > > Thanks,
> >>> > >
> >>> > > Mike
> >>> >
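(A note for anyone gathering the thread and heap dumps Joe asked for above: standard tooling is enough. A minimal sketch, with the output paths and the NiFi JVM pid left as placeholders:

    # thread dump via NiFi's own script, written to the given file
    /opt/nifi/bin/nifi.sh dump /tmp/nifi-thread-dump.txt

    # or with JDK tools, against the NiFi JVM pid
    jstack <nifi-pid> > /tmp/nifi-thread-dump.txt
    jmap -dump:live,format=b,file=/tmp/nifi-heap.hprof <nifi-pid>

Repeat within 30 minutes of startup and again after 24 hours, per Joe's suggestion, so the two sets can be compared for a delta.)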
