Hey Mans,
To load balance and send FlowFiles to a process group or processor in the
same cluster, you still set up an RPG, but you just point the RPG at the
current cluster's NCM.
The NiFi Cluster Manager (NCM) is in charge of splitting the incoming FlowFiles
among the nodes. It knows the current load of each and splits up the files
accordingly.
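To make the load balancing concrete, here is a conceptual sketch only (this is not NiFi API code; the NCM does this internally, and the node names and starting loads are made-up illustration values). Each incoming FlowFile is routed to whichever node is currently least loaded, which is roughly the behavior described above:

```python
# Conceptual simulation of NCM-style load-aware distribution.
# NOT NiFi code: node names and starting loads are hypothetical.

def distribute(flowfiles, node_loads):
    """Assign each FlowFile to the currently least-loaded node."""
    assignments = {node: [] for node in node_loads}
    for ff in flowfiles:
        target = min(node_loads, key=node_loads.get)  # least-loaded node wins
        assignments[target].append(ff)
        node_loads[target] += 1  # that node is now a little busier
    return assignments

# node2 starts busiest, so it receives nothing until the others catch up
loads = {"node1": 0, "node2": 3, "node3": 1}
result = distribute([f"file{i}" for i in range(6)], loads)
```

The point of the sketch is just that distribution is load-aware rather than a fixed split: a busy node naturally receives fewer of the incoming FlowFiles.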
Hope that helps,
Joe
- - - - - -
Joseph Percivall
linkedin.com/in/Percivall
e: joeperciv...@yahoo.com
On Tuesday, October 20, 2015 9:34 AM, M Singh wrote:
Hi Matt:
The screenshot seems to be truncated after the FetchHDFS processor - but I am
not sure if that is important.
I have a question though - the ListHDFS processor running on a separate cluster
is producing one flow file for each file on HDFS, and from your comments it
appears that the RPG will load balance the flow files to its processor nodes so
that they process each flow file separately. Can the ListHDFS send the flow
files to processors or a process group in the same cluster that can then
fetch the data from HDFS? Also, you indicate that the input port is running on
each RPG node, so how do the nodes in the RPG coordinate splitting of the
incoming flow files among them?
Thanks again.
Mans
On Friday, October 16, 2015 3:16 PM, M Singh wrote:
Hi Matt:
Thanks for taking the time to describe and draw out the scenario for me. I will
go through your notes and the documentation to understand the concepts.
Thanks again for your generous support in helping me understand NiFi better.
Mans
On Thursday, October 15, 2015 12:53 PM, Matthew Clarke
wrote:
Mans,
I have attached a screenshot showing how the ListHDFS and FetchHDFS processors
would be configured in a NiFi cluster to achieve what we believe you're looking
to accomplish. At the end you will have each of your nodes fetching different
files from HDFS. These nodes will work on each of their files independently of
the other nodes. The NCM serves as your eyes into your cluster. Every
processor on your graph exists on every node. Unless specifically configured to
run 'on primary node' only, the processors all run on every node using the
configured values. Setting the 'concurrent tasks' on a processor will have the
effect of setting that number of concurrent tasks on that processor on every
node.
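One consequence of that per-node behavior is worth spelling out: the Concurrent Tasks setting multiplies across the cluster. A tiny sketch, with a made-up node count and task setting, shows the arithmetic:

```python
# Sketch: "Concurrent Tasks" is configured once per processor but applies
# on every node, so the cluster-wide thread count multiplies out.
# Both values below are hypothetical illustration numbers.

nodes = 4                 # nodes in the cluster
concurrent_tasks = 3      # Concurrent Tasks set on one processor

cluster_wide_threads = nodes * concurrent_tasks
print(cluster_wide_threads)  # 12 threads cluster-wide for that processor
```

So setting Concurrent Tasks to 1 still means one task per node, not one task total across the cluster.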
Thanks,
Matt
On Thu, Oct 15, 2015 at 12:17 PM, M Singh wrote:
Hi Mark:
>
>Thanks for your answers but being a newbie I am still not clear about some
>issues:
>
>
>Regarding hdfs multiple files:
>
>
>Typically, if you want to pull from HDFS and partition that data
>across the cluster, you would run ListHDFS on the Primary Node only, and then
>use Site-to-Site [1] to distribute
>that listing to all nodes in the cluster.
>
>
>Question - I believe that this requires sending the list of files to the NCM
>at the other site, which will take care of distributing it to its worker
>nodes. Do we send the list of files to the NCM as a single message that the
>NCM then splits, distributing one entry to each of the nodes, or should we
>send separate messages to the NCM, which will then send one message to each
>worker node? Also, if we send a single list of files to the NCM, does it
>send the same list to all its workers? If the NCM sends the same list, won't
>there be duplication of work?
>
>
>Regarding concurrent tasks -
>
>
>Question - How do they help in parallelizing the processing ?
>
>
>Regarding passing separate arguments to workers :
>
>
>Question - This is related to the above two, i.e., how do we partition the
>tasks across worker nodes in a cluster?
>
>
>Thanks again for your help.
>
>
>Mans
>
>On Wednesday, October 14, 2015 2:08 PM, Mark Payne
>wrote:
>
>
>
>Mans,
>
>
>Nodes in a cluster work independently from one another and do not know about
>each other. That is accurate.
>Each node in a cluster runs the same flow. Typically, if you want to pull from
>HDFS and partition that data
>across the cluster, you would run ListHDFS on the Primary Node only, and then
>use Site-to-Site [1] to distribute
>that listing to all nodes in the cluster. Each node would then pull the data
>that it is responsible to pull and begin
>working on it. We do realize that this is not ideal to have to setup this way,
>and it is something that we are working
>on so that it is much easier to have that listing automatically distributed
>across the cluster.
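The listing-distribution pattern described above can be sketched as a small simulation (this is not NiFi code: the paths and node names are hypothetical, and simple round-robin stands in for Site-to-Site's actual distribution logic):

```python
# Conceptual sketch of the ListHDFS -> Site-to-Site -> FetchHDFS pattern:
# one node produces the listing, the entries are spread across all nodes,
# and each node fetches only its own share. Paths/nodes are made up.

listing = [f"/data/part-{i:04d}" for i in range(10)]  # ListHDFS on primary node

nodes = ["node1", "node2", "node3"]
per_node = {n: [] for n in nodes}
for i, path in enumerate(listing):          # Site-to-Site spreads the entries
    per_node[nodes[i % len(nodes)]].append(path)

# Each node's FetchHDFS now pulls only its own files - no duplication.
```

The key property is that every listing entry lands on exactly one node, which is why no two nodes fetch the same HDFS file.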
>
>
>I'm not sure that I understand your #3 - how do we design the workflow so that
>the nodes work on one file at a time?
>For each Processor, you can configure how many threads (Concurrent Tasks) are
>to be used in the Scheduling tab
>of the Processor Configuration dialog. You can certainly configure that to run
>only a single Concurrent Task.
>This is the number of Concurrent Tasks that will run on each node in the
>cluster, not