That’s a cool workaround!! Thanks for sharing.
On Thu, 11 Jul 2019, at 16:57, James McMahon wrote:
> Update: we have another solution path that seems to give a very useful
> workaround for this challenge pre-1.8. It depends on having a messaging
> queue that can be read by ConsumeAMQP.
> Use ListFile to establish your set of zero-byte flowfiles -- essentially,
> metadata about the files you want to read into the flow. All will be on one
> cluster node because ListFile is configured for Primary only.
> Manipulate the metadata as you wish, add attributes, then write to an
> exchange/queue in your AMQP broker. All lightweight work that doesn't
> burden the single node.
> In a separate workflow path, a ConsumeAMQP processor is configured to read
> from that exchange/queue, configured for All Nodes. It reads in and appears
> to round-robin the flowfiles. Then do the FetchFile to retrieve the data
> itself. It seems to be balancing the load in a round-robin fashion.
> Not a solution path for all because it has that AMQP dependency, but if
> you've got an AMQP broker anyway it's a reasonable alternative that seems
> to work.
>
> On Wed, Jul 10, 2019 at 9:47 AM James McMahon <[email protected]> wrote:
>> Great advice, thank you Joe. I had not realized Batch Settings Count was
>> even there, and so had left it unset. If any other folks are also still
>> using pre-1.8, you can set it on your Remote Process Group, under Manage
>> Remote Ports; the parameter is Batch Settings Count. I still want to
>> experiment to gauge the effect. For large flowfiles, I will keep it at 1
>> to spread the load across all nodes as evenly as possible. For small
>> flowfiles I will try a batch of 10. I suspect this will reduce the number
>> of calls to the nodes, and be more efficient in that respect. In the
>> latter case with small flowfiles the skew in load distribution shouldn't
>> be extreme.
>> This is fascinating, and fun to experiment with. Thanks again.
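To make the pre-1.8 workaround above concrete, here is a minimal Python sketch of the pattern, outside NiFi. The file paths, attribute names, and the in-memory `deque` standing in for the AMQP exchange/queue are all hypothetical illustrations, not anything from the actual flow; a real deployment would use ListFile, PublishAMQP, and ConsumeAMQP against a broker.

```python
import json
from collections import deque, defaultdict

# Hypothetical stand-in for the AMQP exchange/queue; in the real flow this
# is a broker that PublishAMQP writes to and ConsumeAMQP reads from.
queue = deque()

# Step 1 (primary node only): ListFile produces zero-byte flowfiles --
# metadata about the files to fetch. Publishing attributes is lightweight;
# no file content is read on the single node.
listed_files = [f"/data/in/file{i}.txt" for i in range(12)]  # sample paths
for path in listed_files:
    queue.append(json.dumps({"absolute.path": path, "priority": "normal"}))

# Step 2 (All Nodes): each node's ConsumeAMQP pulls from the shared queue;
# the broker hands messages out round-robin across the four consumers.
nodes = ["node1", "node2", "node3", "node4"]
assignments = defaultdict(list)
i = 0
while queue:
    node = nodes[i % len(nodes)]
    assignments[node].append(json.loads(queue.popleft())["absolute.path"])
    i += 1

# Each node would then run FetchFile on its share of the metadata.
for node in nodes:
    print(node, len(assignments[node]))
```

With 12 listed files and 4 nodes, each node ends up fetching 3 files, which is the even spread the workaround is after.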
-Jim
>>
>> On Wed, Jul 10, 2019 at 9:12 AM Joe Witt <[email protected]> wrote:
>>> James
>>>
>>> Did you apply any specific batch settings on s2s? By default it sends
>>> large chunks of messages at once. If you're testing at small scale you
>>> might not see the distribution you would at typical/protracted scale.
>>> Setting batch sizes smaller may be appropriate for your case, or just
>>> leaving the defaults and observing for a longer period of time may be
>>> better.
>>>
>>> Thanks
>>>
>>> On Wed, Jul 10, 2019 at 9:04 AM James McMahon <[email protected]> wrote:
>>>> Thank you Joe. We do one day intend to upgrade but are bound by the
>>>> enterprise options available to us to 1.7 for the near term. So, do I
>>>> understand your explanation correctly: the behavior exhibited through
>>>> the first 4000 flowfiles, like past performance, may not represent
>>>> future results. It will do what it does, and I may find that node1 does
>>>> get loaded as I work through flowfiles in steady state.
>>>> Again, thanks.
>>>>
>>>> On Wed, Jul 10, 2019 at 8:41 AM Joe Witt <[email protected]> wrote:
>>>>> James
>>>>>
>>>>> For distributing work across the cluster, the load-balanced connection
>>>>> capability in NiFi 1.8 and beyond is the right answer - purpose-built
>>>>> for the job. I'd strongly recommend upgrading to avoid use of s2s for
>>>>> this scenario and instead use load-balanced connections. When using
>>>>> load-balanced connections or s2s you will want to observe the behavior
>>>>> at typical sustained scale. Using either strategy, all nodes will have
>>>>> an opportunity to receive data. However, backpressure/loading,
>>>>> configuration, and other factors could mean that periodically a given
>>>>> node is not receiving data. We have seen a lot of folks become
>>>>> confused about s2s behaviors for cases like this, so I think you'll
>>>>> find load-balanced connections are much better for this.
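Joe's point about batch settings can be illustrated with a toy simulation. This is not NiFi's actual s2s transfer logic, just a hypothetical model assuming that the RPG sends `batch_count` flowfiles per transaction and that successive transactions rotate through the nodes:

```python
from collections import defaultdict

def distribute(num_flowfiles, batch_count, nodes):
    """Toy model: each transaction sends up to `batch_count` flowfiles to
    one node, and transactions rotate through the nodes in turn."""
    counts = defaultdict(int)
    sent = 0
    turn = 0
    while sent < num_flowfiles:
        node = nodes[turn % len(nodes)]
        batch = min(batch_count, num_flowfiles - sent)
        counts[node] += batch
        sent += batch
        turn += 1
    return dict(counts)

nodes = ["node1", "node2", "node3", "node4"]
# Batch Settings Count = 1: every flowfile is its own transaction,
# so 100 flowfiles spread evenly, 25 per node.
even = distribute(100, 1, nodes)
# Batch Settings Count = 10: only 10 transactions for the same 100
# flowfiles -- fewer calls to the nodes, but a chunkier distribution.
chunky = distribute(100, 10, nodes)
print(even)
print(chunky)
```

Under this model the batch-of-10 run leaves two nodes with 30 flowfiles and two with 20, which matches the intuition in the thread: larger batches reduce call overhead at the cost of some skew, and at small test scales the skew can look much worse than it would over a sustained run.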
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>>
>>>>> On Wed, Jul 10, 2019 at 8:34 AM James McMahon <[email protected]>
>>>>> wrote:
>>>>>> We are on 1.7.1.g and have just recently established our first
>>>>>> clustered configuration. Using Pierre Villard's article from Feb 2017
>>>>>> (https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/)
>>>>>> and a few other related technical articles to flesh out some details,
>>>>>> we have gotten a ListFile / FetchFile flow to distribute load using a
>>>>>> Remote Process Group - almost.
>>>>>>
>>>>>> Downstream of the FetchFile running on all nodes, I connect to a
>>>>>> MonitorActivity processor simply to examine the flowfiles that result
>>>>>> from the fetch, in that following queue. In that queue one can look
>>>>>> at the info for each flowfile and find what appears to be the node on
>>>>>> which the flowfile was processed, in the Node Address field.
>>>>>>
>>>>>> I have four nodes in my cluster - one primary, three not primary. I
>>>>>> can see in the queue listing that three flowfiles share common
>>>>>> Position values. Three have Position 1, three have Position 2, and so
>>>>>> on in a pattern that repeats throughout the entire queue. Within each
>>>>>> Position group, the flowfiles have been distributed to node2, node3,
>>>>>> and node4 - *but none at all to node1*.
>>>>>>
>>>>>> What would cause such behavior? How can I get my files to distribute
>>>>>> across all four nodes?
>>>>>> I should mention:
>>>>>> 1. All four node URLs are in the RPG URL configuration parameter,
>>>>>> delimited by commas.
>>>>>> 2. node1 is currently assigned by my external ZooKeeper as my
>>>>>> Primary, and is where the ListFile processor executes.
>>>>>> 3. All four nodes are granted access for "retrieve site-to-site
>>>>>> details" in my Hamburger Menu, Access Policies.
>>>>>> 4. All four nodes are granted access for "receive data via
>>>>>> site-to-site" in the Access Policies for the RPG Input Port.
>>>>>>
>>>>>> My concern is that I am leaving nearly 25% of my available cluster
>>>>>> capacity unused.
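The manual check described above - listing the queue and eyeballing the Node Address of each flowfile - can be automated as a tally. The sample queue listing below is hypothetical, shaped like the symptom in the thread (node1 never appears); in practice the per-flowfile info would come from the NiFi UI or the flowfile-queues REST endpoint:

```python
from collections import Counter

# Hypothetical queue listing: one dict per flowfile, mirroring the
# Position and Node Address fields visible in the queue's flowfile info.
queue_listing = [
    {"position": p, "node_address": n}
    for p in range(1, 5)
    for n in ("node2", "node3", "node4")  # node1 never appears
]

counts = Counter(ff["node_address"] for ff in queue_listing)
expected = {"node1", "node2", "node3", "node4"}
idle = expected - set(counts)

print("flowfiles per node:", dict(counts))
print("nodes receiving nothing:", sorted(idle))  # -> ['node1']
```

A tally like this makes the "nearly 25% of capacity unused" concern precise: with one of four nodes idle, the remaining three each carry a third of the load instead of a quarter.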
