Update: we have another solution path that seems to give a very useful
workaround for this challenge pre-1.8. It depends on having a messaging
queue that can be read by ConsumeAMQP.
Use ListFile to establish your set of zero-byte flowfiles -- essentially,
metadata about the files you want to read into the flow. All will be on one
cluster node because ListFile is configured to run on the Primary node only.
Manipulate the metadata as you wish, add attributes, then write to an
exchange / queue in your AMQP. All lightweight work that doesn't burden the
single node.
In a separate workflow path, a ConsumeAMQP processor configured for All
Nodes reads from that exchange/queue; it appears to distribute the incoming
flowfiles across the nodes in round-robin fashion. Then do the FetchFile to
retrieve the data itself. A rough sketch of the layout follows.
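Exchange and queue names below are placeholders, and note that ordinary
flowfile attributes don't cross the broker on their own -- one way to carry
the listing metadata is to serialize it into the message body and restore
it on the other side:

[Primary node only]
ListFile
  -> AttributesToJSON  (Destination: flowfile-content;
                        Attributes List: absolute.path, filename)
  -> PublishAMQP       (Exchange Name: nifi.listing; Routing Key: files)

[All nodes]
ConsumeAMQP            (Queue: files)
  -> EvaluateJsonPath  (Destination: flowfile-attribute;
                        absolute.path = $['absolute.path'];
                        filename = $.filename)
  -> FetchFile         (File to Fetch: ${absolute.path}/${filename})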
Not a solution path for everyone because it has that AMQP dependency, but
if you've got an AMQP broker anyway it's a reasonable alternative that
seems to work.

On Wed, Jul 10, 2019 at 9:47 AM James McMahon <[email protected]> wrote:

> Great advice, thank you Joe. I had not realized Batch Settings Count was
> even there, and so had left it unset. If any other folks are also still
> using pre-1.8, you can set this on your Remote Process Group under Manage
> Remote Ports; the parameter is Batch Settings Count. I still want to
> experiment to gauge the effect. For large flowfiles, I will keep it at 1
> to spread the load across all nodes as evenly as possible. For small
> flowfiles I will try a batch of 10. I suspect this will reduce the number
> of calls to the nodes and be more efficient in that respect. In the
> latter case with small flowfiles the skew in load distribution shouldn't
> be extreme.
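>
> For anyone hunting for it, roughly (the values are just what I plan to
> try, not recommendations):
>
>   Remote Process Group -> Manage Remote Ports -> (port) -> Batch Settings
>     Count:    1 for large flowfiles, 10 for small
>     Size:     (unset)
>     Duration: (unset)
>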
> This is fascinating, and fun to experiment with. Thanks again. -Jim
>
> On Wed, Jul 10, 2019 at 9:12 AM Joe Witt <[email protected]> wrote:
>
>> James
>>
>> Did you apply any specific batch settings on s2s?  By default it sends
>> large chunks of messages at once.  If you're testing on small scale you
>> might not see the distribution you would at typical/protracted scale.
>> Setting batch sizes smaller may be appropriate for your case, or it may
>> be better to just leave the defaults and observe for a longer period of
>> time.
>>
>> Thanks
>>
>> On Wed, Jul 10, 2019 at 9:04 AM James McMahon <[email protected]>
>> wrote:
>>
>>> Thank you Joe. We do intend to upgrade one day, but the enterprise
>>> options available to us bind us to 1.7 for the near term. So, do I
>>> understand your explanation correctly: the behavior exhibited through
>>> the first 4000 flowfiles is past performance that may not represent
>>> future results. It will do what it does, and I may find that node1 does
>>> get loaded as I work through flowfiles in steady state.
>>> Again, thanks.
>>>
>>> On Wed, Jul 10, 2019 at 8:41 AM Joe Witt <[email protected]> wrote:
>>>
>>>> James
>>>>
>>>> For distributing work across the cluster the load balanced connection
>>>> capability in NiFi 1.8 and beyond is the right answer - purpose built for
>>>> the job.  I'd strongly recommend upgrading to avoid use of s2s for this
>>>> scenario and instead use load balanced connections.  When using load
>>>> balanced or s2s you will want to observe the behavior at typical sustained
>>>> scale. Using either strategy all nodes will have an opportunity to receive
>>>> data.  However, backpressure/loading, configuration, and other factors
>>>> could mean that periodically a given node is not receiving data.  We have
>>>> seen a lot of folks become confused about s2s behaviors for cases like this
>>>> so I think you'll find load balanced connections are much better for this.
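>>>>
>>>> For anyone already on 1.8+, load balancing is configured directly on
>>>> the connection, roughly:
>>>>
>>>>   Connection -> Configure -> Settings tab
>>>>     Load Balance Strategy: Round robin
>>>>       (other options: Partition by attribute, Single node)
>>>>     Load Balance Compression: Do not compress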
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>>
>>>> On Wed, Jul 10, 2019 at 8:34 AM James McMahon <[email protected]>
>>>> wrote:
>>>>
>>>>> We are on 1.7.1.g and have just recently established our first
>>>>> clustered configuration. Using Pierre Villard's article from Feb 2017 (
>>>>> https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
>>>>> ) and a few other related technical articles to flesh out some details, we
>>>>> have gotten a ListFile / FetchFile to distribute load using Remote Process
>>>>> Group - almost.
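>>>>>
>>>>> Roughly, the layout (host names and the port name are placeholders):
>>>>>
>>>>>   ListFile (Execution: Primary node)
>>>>>     -> Remote Process Group
>>>>>        (URLs: http://node1:8080/nifi,http://node2:8080/nifi,
>>>>>               http://node3:8080/nifi,http://node4:8080/nifi)
>>>>>   Input Port "listed-files" on the root canvas
>>>>>     -> FetchFile (Execution: All nodes;
>>>>>                   File to Fetch: ${absolute.path}/${filename})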
>>>>>
>>>>> Downstream of the FetchFile running on all nodes, I connect a
>>>>> MonitorActivity processor simply so I can examine the flowfiles that
>>>>> result from the fetch in the queue that follows. In that queue one can
>>>>> look at the info for each flowfile; the Node Address field shows what
>>>>> appears to be the node on which the flowfile was processed.
>>>>>
>>>>> I have four nodes in my cluster - one primary, three not primary. In
>>>>> the queue listing I can see that flowfiles share common Position values
>>>>> in groups of three: three have Position 1, three have Position 2, and
>>>>> so on in a pattern that repeats throughout the entire queue. Within
>>>>> each Position group, the flowfiles have been distributed to node2,
>>>>> node3, and node4 - *but none at all to node1*.
>>>>>
>>>>> What would cause such behavior? How can I get my files to distribute
>>>>> across all four nodes?
>>>>> I should mention:
>>>>> 1. all four node URLs are in the RPG URL configuration parameter,
>>>>> delimited by commas.
>>>>> 2. node1 is currently assigned by my external Zookeeper as my Primary,
>>>>> and is where the ListFile processor executes.
>>>>> 3. all four nodes are granted access for "retrieve site-to-site
>>>>> details" in my Hamburger Menu, Access Policies.
>>>>> 4. all four nodes are granted access for "receive data via
>>>>> site-to-site" in the Access Policies for the RPG Input Port.
>>>>>
>>>>> My concern is that I am leaving nearly 25% of my available cluster
>>>>> capacity unused.
>>>>>
>>>>
