I have a three-node cluster and I am trying to rewrite a dataflow that's
used in several places so that the common parts distribute the data across
the cluster in a more efficient, load-balanced way.  This is my first
experience with Remote Process Groups (RPGs), so I started from the basics
to work my way up, but I am barely out of the gate and already confused.

Here's the setup.  I have an input port on my root dataflow which points to
a LogMessage processor.  In another process group I have an RPG configured
with the three endpoints of the cluster separated by commas.  Feeding into
that is a GenerateFlowFile processor which is running every 5ms with 9
concurrent tasks on the primary node only.  Everything else has default
values.

When I start the dataflow it more or less works as expected, except that the
distribution of FlowFiles looks uneven.  That is, if I look at the Status
History of the LogMessage processor and select FlowFiles In, it looks
like the two non-primary nodes have the bulk of the FlowFiles moving
through them.  I can wrap my head around that.

But then I rewrote it to put a DistributeLoad processor in front of three
RPGs, one for each node in the cluster, and left it set to `round robin`.
The FlowFiles In graph on the LogMessage processor looks exactly the same
as before: the bulk of the FlowFiles In are on the two non-primary nodes.

In 5 minutes about 500K FlowFiles are processed.  The two non-primary
nodes process 234238 and 233089 of them, and the primary node processes
only 47597.
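
To put numbers on the imbalance, here is a quick sketch that works out each
node's share from the three counts above.  The node labels are made up for
illustration; only the counts come from the Status History.  It comes out
to roughly 45.5%, 45.3% and 9.2%, against the 33.3% each that an even
split would give.

```python
# FlowFiles In per node over 5 minutes, as read from Status History.
# The node labels are hypothetical; only the counts are real.
counts = {
    "non-primary A": 234238,
    "non-primary B": 233089,
    "primary": 47597,
}

total = sum(counts.values())
even_share = total / len(counts)  # what a perfectly even split would give each node

# Fraction of all FlowFiles that each node received.
shares = {node: n / total for node, n in counts.items()}

for node, n in counts.items():
    ratio = n / even_share
    print(f"{node:14s} {n:7d}  {shares[node]:6.2%}  ({ratio:.2f}x an even share)")
print(f"{'total':14s} {total:7d}")
```

So the primary node is seeing a bit over a quarter of what an even split
would give it, and each of the other two is seeing about a third more.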

What am I missing?  Why doesn't a round robin distribute them evenly?

Neil