Re: Round robin load balancing eventually stops using all nodes

Mike Thomsen Fri, 01 Apr 2022 05:53:43 -0700

I think I figured out how to get around this: partition-by-attribute
using UUID. About 10 minutes ago, I was down to 3/5 nodes on my
cluster. Switched the queues to that strategy, and the 3 full nodes
started sending work to the other two nodes without a restart.
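
For anyone wondering why this spreads the work out again, here is a
minimal Python sketch of the idea, not NiFi's actual code. It assumes
the attribute in question is the standard `uuid` core attribute, which
is unique per FlowFile, and that the partitioner hashes the attribute
value and takes the result modulo the node count. The names
`pick_node` and `distribution` and the choice of SHA-256 are
hypothetical, picked only for illustration.

```python
import hashlib
import uuid
from collections import Counter

NODE_COUNT = 5  # cluster size mentioned in this thread


def pick_node(attribute_value: str, node_count: int = NODE_COUNT) -> int:
    """Hypothetical partitioner: stable hash of the value, modulo node count."""
    digest = hashlib.sha256(attribute_value.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % node_count


def distribution(sample_size: int = 10_000) -> Counter:
    """Count how many simulated FlowFiles (random UUIDs) land on each node."""
    return Counter(pick_node(str(uuid.uuid4())) for _ in range(sample_size))


if __name__ == "__main__":
    for node, count in sorted(distribution().items()):
        print(f"node {node}: {count} flowfiles")
```

Because every value is unique and effectively random, each of the 5
nodes in the sketch ends up with roughly a fifth of the FlowFiles. The
trade-off, under the same assumption, is that the assignment ignores
how busy each node is.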


On Fri, Apr 1, 2022 at 7:44 AM Mike Thomsen <[email protected]> wrote:
>
> I think I forgot to mention early on that we're using embedded
> ZooKeeper. Could that be a factor in this behavior?
>
> Thanks,
>
> Mike
>
> On Fri, Apr 1, 2022 at 7:28 AM Mike Thomsen <[email protected]> wrote:
> >
> > When we talk about "slower nodes" here, are we referring to nodes that
> > are bogged down by data but of the same size as the rest of the
> > cluster or are we talking about a heterogeneous cluster?
> >
> > On Mon, Sep 27, 2021 at 12:07 PM Joe Witt <[email protected]> wrote:
> > >
> > > Ryan,
> > >
> > > Regarding NIFI-9236 the JIRA captures it well but sounds like there is
> > > now a better understanding of how it works and what options exist to
> > > better view details.
> > >
> > > > Regarding Load Balancing: NIFI-7081 is largely about the scenario
> > > > where, with load balancing enabled, slower nodes effectively set
> > > > the rate the whole cluster can sustain, because we don't have a fluid
> > > > load balancing strategy, which we should.  Such a strategy would allow
> > > > the fastest nodes to always take the most data.  We just need to
> > > > do that work.  No ETA.
> > >
> > > Thanks
> > >
> > > On Tue, Sep 21, 2021 at 2:18 PM Ryan Hendrickson
> > > <[email protected]> wrote:
> > > >
> > > > Joe - We're testing some scenarios.  Andrew captured some confusing 
> > > > behavior in the UI when enabling and disabling load balancing on a 
> > > > relationship: "Update UI for Clustered Connections" -- 
> > > > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-9236
> > > >
> > > > Question - When a FlowFile is Load Balanced from one node to another, 
> > > > is the entire Content Claim load balanced?  Or just the small portion 
> > > > necessary?
> > > >
> > > > Mike -
> > > > We found two tickets that are in the ballpark:
> > > >
> > > > 1.  Improve handling of Load Balanced Connections when one node is slow 
> > > >   --    https://issues.apache.org/jira/browse/NIFI-7081
> > > > 2.  NiFi FlowFiles stuck in queue when using Single Node load balance 
> > > > strategy   --    https://issues.apache.org/jira/browse/NIFI-8970
> > > >
> > > > > From @Simon's comment - we know we've seen underperforming nodes in a 
> > > > > cluster before.  We're discussing whether @Simon's comment is applicable 
> > > > > to the issue we're seeing:
> > > >           > "The one thing I can think of is the scenario where one (or 
> > > > more) nodes are significantly slower than the other ones. In these 
> > > > cases it might happen then the nodes are “running behind” blocks the 
> > > > other nodes from balancing perspective."
> > > >
> > > > @Simon - I'd like to understand the "blocks other nodes from balancing 
> > > > perspective" better if you have additional information.  We're trying 
> > > > to replicate this scenario.
> > > >
> > > > Thanks,
> > > > Ryan
> > > >
> > > > On Sat, Sep 18, 2021 at 3:45 PM Mike Thomsen <[email protected]> 
> > > > wrote:
> > > >>
> > > >> > there is a ticket to overcome this (there is no ETA),
> > > >>
> > > >> Do you know what the Jira # is?
> > > >>
> > > >> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence <[email protected]> 
> > > >> wrote:
> > > >> >
> > > >> > Hi Mike,
> > > >> >
> > > >> > I did a quick check on the round robin balancing and based on what I 
> > > >> > found the reason for the issue must lie somewhere else, not directly 
> > > >> > within it. The one thing I can think of is the scenario where one 
> > > > >> > (or more) nodes are significantly slower than the other ones. In 
> > > > >> > these cases it might happen that the nodes that are “running behind” 
> > > > >> > block the other nodes from a balancing perspective.
> > > >> >
> > > > >> > Based on what you wrote, this is a possible reason, and there is a 
> > > > >> > ticket to overcome this (there is no ETA), but other details might 
> > > > >> > shed light on a different root cause.
> > > >> >
> > > >> > Regards,
> > > >> > Bence
> > > >> >
> > > >> >
> > > >> >
> > > >> > > On 2021. Sep 3., at 14:13, Mike Thomsen <[email protected]> 
> > > >> > > wrote:
> > > >> > >
> > > > >> > > We have a 5 node cluster, and sometimes I've noticed that round robin
> > > > >> > > load balancing stops sending flowfiles to two of them, and sometimes
> > > > >> > > toward the end of the data processing can get as low as a single node.
> > > > >> > > Has anyone seen similar behavior?
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > Mike
> > > >> >
