That’s a cool workaround!! Thanks for sharing. 

On Thu, 11 Jul 2019, at 16:57, James McMahon wrote:
> Update: we have another solution path that seems to give a very useful 
> workaround for this challenge pre-1.8. It depends on having a messaging queue 
> that can be read by ConsumeAMQP.
> Use ListFile to establish your set of zero-byte flowfiles -- essentially, 
> metadata about the files you want to read into the flow. All will be on one 
> cluster node because ListFile is configured to run on the Primary node only.
> Manipulate the metadata as you wish, add attributes, then write to an 
> exchange / queue in your AMQP. All lightweight work that doesn't burden the 
> single node.
> In a separate workflow path, a ConsumeAMQP processor configured for All Nodes 
> reads from that exchange/queue, and it appears to round-robin the flowfiles 
> across the cluster. Then do the FetchFile to retrieve the data itself.
> Not a solution path for all because it has that AMQP dependency, but if 
> you've got an AMQP broker anyway it's a reasonable alternative that seems to 
> work.
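The queue-based hand-off described above can be sketched with a toy model (stdlib only; the queue, node names, and attribute names here are illustrative stand-ins for the AMQP broker and flowfile attributes, not NiFi APIs):

```python
from collections import deque
from itertools import cycle

# Hypothetical stand-in for the AMQP queue: zero-byte flowfile metadata
# published by the Primary node (paths and attribute names are illustrative).
amqp_queue = deque(
    {"absolute.path": f"/data/in/file_{i}.dat", "batch.id": i} for i in range(12)
)

# ConsumeAMQP scheduled on All Nodes: each node takes the next message in
# turn, which yields the round-robin spread observed in the thread.
nodes = cycle(["node1", "node2", "node3", "node4"])
assignments = {}
while amqp_queue:
    node = next(nodes)
    meta = amqp_queue.popleft()
    assignments.setdefault(node, []).append(meta["absolute.path"])
    # Each node would then run FetchFile locally against absolute.path.

for node, files in sorted(assignments.items()):
    print(node, len(files))
```

With 12 metadata messages and four consumers, each node ends up fetching three files, which matches the even distribution James reports.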
> 
> On Wed, Jul 10, 2019 at 9:47 AM James McMahon <[email protected]> wrote:
>> Great advice, thank you Joe. I had not realized Batch Settings Count was 
>> even there, and so had left it unset. If any other folks are also still 
>> using pre-1.8, you can set it on your Remote Process Group under Manage 
>> Remote Ports; the parameter is Batch Settings Count. I still want to 
>> experiment to gauge the effect. For large flowfiles, I will keep it at 1 to 
>> spread the load across all nodes as evenly as possible. For small flowfiles 
>> I will try a batch of 10. I suspect this will reduce the number of calls to 
>> the nodes, and be more efficient in that respect. In the latter case, with 
>> small flowfiles, the skew in load distribution shouldn't be extreme.
>> This is fascinating, and fun to experiment with. Thanks again. -Jim
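As a back-of-the-envelope check on the batch-count tradeoff, here is a simplified model in which each s2s transaction carries up to `batch_count` flowfiles and successive transactions rotate across nodes (real s2s node selection also weighs node load, so this is only a sketch):

```python
from collections import Counter
from itertools import cycle
from math import ceil

def distribute(num_flowfiles, batch_count, node_names):
    """Toy model: successive transactions rotate across the nodes, each
    transaction carrying up to batch_count flowfiles."""
    nodes = cycle(node_names)
    per_node = Counter()
    remaining = num_flowfiles
    while remaining > 0:
        sent = min(batch_count, remaining)
        per_node[next(nodes)] += sent
        remaining -= sent
    return per_node

nodes = ["node1", "node2", "node3", "node4"]

# Batch count 1: one transaction per flowfile -- maximally even spread.
even = distribute(4000, 1, nodes)       # 1000 flowfiles per node

# Batch count 10: same spread at this scale, with 10x fewer transactions.
batched = distribute(4000, 10, nodes)   # still 1000 per node

print(even, batched)
print(ceil(4000 / 1), "vs", ceil(4000 / 10), "transactions")
```

At 4000 flowfiles both settings land 1000 per node, but batch count 10 cuts the transaction count from 4000 to 400, which is the efficiency Jim is after for small flowfiles.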
>> 
>> On Wed, Jul 10, 2019 at 9:12 AM Joe Witt <[email protected]> wrote:
>>> James
>>> 
>>> Did you apply any specific batch settings on s2s? By default it sends large 
>>> chunks of messages at once. If you're testing at small scale you might not 
>>> see the distribution you would at typical, sustained scale. Setting batch 
>>> sizes smaller may be appropriate for your case, or it may be better to 
>>> leave the defaults and observe over a longer period of time.
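Joe's small-scale caveat can be made concrete with a toy model. Assuming, purely for illustration, a default batch of 100 flowfiles per transaction and transactions rotating across nodes, a short test run never reaches the last node:

```python
from itertools import cycle

BATCH = 100          # assumed batch count, for illustration only
NODES = ["node1", "node2", "node3", "node4"]

received = {n: 0 for n in NODES}
node_iter = cycle(NODES)
remaining = 250      # a small-scale test: only 250 flowfiles

# Each transaction carries up to BATCH flowfiles to the next node in turn.
while remaining:
    sent = min(BATCH, remaining)
    received[next(node_iter)] += sent
    remaining -= sent

print(received)  # {'node1': 100, 'node2': 100, 'node3': 50, 'node4': 0}
```

Only three transactions occur, so one of the four nodes receives nothing at all -- exactly the kind of apparent imbalance that disappears at sustained scale.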
>>> 
>>> Thanks
>>> 
>>> On Wed, Jul 10, 2019 at 9:04 AM James McMahon <[email protected]> wrote:
>>>> Thank you Joe. We do intend to upgrade one day, but the enterprise options 
>>>> available to us bind us to 1.7 for the near term. So, do I understand your 
>>>> explanation correctly: the behavior exhibited through the first 4000 
>>>> flowfiles is past performance that may not represent future results. It 
>>>> will do what it does, and I may find that node1 does get loaded as I work 
>>>> through flowfiles in steady state. 
>>>> Again, thanks.
>>>> 
>>>> On Wed, Jul 10, 2019 at 8:41 AM Joe Witt <[email protected]> wrote:
>>>>> James
>>>>> 
>>>>> For distributing work across the cluster, the load-balanced connection 
>>>>> capability in NiFi 1.8 and beyond is the right answer - purpose-built for 
>>>>> the job. I'd strongly recommend upgrading to avoid use of s2s for this 
>>>>> scenario and instead use load-balanced connections. With either strategy 
>>>>> you will want to observe the behavior at typical sustained scale: all 
>>>>> nodes will have an opportunity to receive data, but backpressure, 
>>>>> loading, configuration, and other factors could mean that periodically a 
>>>>> given node is not receiving data. We have seen a lot of folks become 
>>>>> confused by s2s behavior in cases like this, so I think you'll find 
>>>>> load-balanced connections much better for this.
>>>>> 
>>>>> Thanks
>>>>> Joe
>>>>> 
>>>>> 
>>>>> On Wed, Jul 10, 2019 at 8:34 AM James McMahon <[email protected]> 
>>>>> wrote:
>>>>>> We are on 1.7.1.g and have just recently established our first clustered 
>>>>>> configuration. Using Pierre Villard's article from Feb 2017 
>>>>>> (https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/) 
>>>>>> and a few other related technical articles to flesh out some details, we 
>>>>>> have gotten a ListFile / FetchFile to distribute load using a Remote 
>>>>>> Process Group - almost. 
>>>>>> 
>>>>>> Downstream of the FetchFile running on all nodes, I connect a 
>>>>>> MonitorActivity processor simply to examine, in the following queue, the 
>>>>>> flowfiles that result from the fetch. In that queue one can view each 
>>>>>> flowfile's details, where the Node Address field appears to show the 
>>>>>> node on which the flowfile was processed.
>>>>>> 
>>>>>> I have four nodes in my cluster - one primary, three not primary. I can 
>>>>>> see in the queue listing that flowfiles share common Position values in 
>>>>>> groups of three: three have Position 1, three have Position 2, and so 
>>>>>> on, in a pattern that repeats throughout the entire queue. Within each 
>>>>>> Position group, the flowfiles have been distributed to node2, node3, and 
>>>>>> node4 - *but none at all to node1*. 
>>>>>> 
>>>>>> What would cause such behavior? How can I get my files to distribute 
>>>>>> across all four nodes?
>>>>>> I should mention:
>>>>>> 1. all four node URLs are in the RPG URL configuration parameter, 
>>>>>> delimited by commas.
>>>>>> 2. node1 is currently assigned by my external Zookeeper as my Primary, 
>>>>>> and is where the ListFile processor executes.
>>>>>> 3. all four nodes are granted access for "retrieve site-to-site details" 
>>>>>> in my Hamburger Menu, Access Policies.
>>>>>> 4. all four nodes are granted access for "receive data via site-to-site" 
>>>>>> in the Access Policies for the RPG Input Port.
>>>>>> 
>>>>>> My concern is that I am leaving nearly 25% of my available cluster 
>>>>>> capacity unused.
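One quick way to quantify the skew James describes is to tally the Node Address values from the queue listing. The listing below is illustrative data mirroring the pattern in the thread (Position groups of three, each spread over node2-node4):

```python
from collections import Counter

# Illustrative (Position, Node Address) pairs like those seen in the
# MonitorActivity queue -- node1 never appears, per the thread.
listing = [
    (pos, node)
    for pos in range(1, 5)
    for node in ("node2", "node3", "node4")
]

per_node = Counter(node for _, node in listing)
expected = {"node1", "node2", "node3", "node4"}
idle = expected - set(per_node)

print(per_node)  # node2/3/4 each appear; node1 never does
print(idle)      # the roughly 25% of cluster capacity going unused
```

Running this kind of tally against a real queue listing makes it easy to confirm whether a node is genuinely idle or merely under-represented in a small sample.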
