Hey Mans,
To load balance and send FlowFiles to a process group or processor in the
same cluster, you still set up an RPG, but you just point the RPG at the
current cluster's NCM.
The NiFi Cluster Manager (NCM) is in charge of splitting the incoming FlowFiles
among the nodes. It knows the current load of each and splits up the files
accordingly.
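To make the load balancing concrete, here is a conceptual sketch only (this is not NiFi API code; the NCM does this internally, and the node names and starting loads are made-up illustration values). Each incoming FlowFile is routed to whichever node is currently least loaded, which is roughly the behavior described above:

```python
# Conceptual simulation of NCM-style load-aware distribution.
# NOT NiFi code: node names and starting loads are hypothetical.

def distribute(flowfiles, node_loads):
    """Assign each FlowFile to the currently least-loaded node."""
    assignments = {node: [] for node in node_loads}
    for ff in flowfiles:
        target = min(node_loads, key=node_loads.get)  # least-loaded node wins
        assignments[target].append(ff)
        node_loads[target] += 1  # that node is now a little busier
    return assignments

# node2 starts busiest, so it receives nothing until the others catch up
loads = {"node1": 0, "node2": 3, "node3": 1}
result = distribute([f"file{i}" for i in range(6)], loads)
```

The point of the sketch is just that distribution is load-aware rather than a fixed split: a busy node naturally receives fewer of the incoming FlowFiles.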
Hope that helps,
Joe
- - - - - -
Joseph Percivall
linkedin.com/in/Percivall
e: joeperciv...@yahoo.com
On Tuesday, October 20, 2015 9:34 AM, M Singh wrote:
Hi Matt:
The screenshot seems to be truncated after the FetchHDFS processor - but I am
not sure if that is important.
I have a question though - the ListHDFS processor running on a separate cluster
is producing one flow file for each file on HDFS, and from your comments it
appears that the RPG will load balance the flow files to its processor nodes so
that they process each flow file separately. Can the ListHDFS send the flow
files to processors or a process group in the same cluster that can then
fetch the data from HDFS? Also, you indicate that the input port is running on
each RPG node, so how do the nodes in the RPG coordinate splitting of the
incoming flow files among them?
Thanks again.
Mans
On Friday, October 16, 2015 3:16 PM, M Singh wrote:
Hi Matt:
Thanks for taking the time to describe and draw out the scenario for me. I will
go through your notes and the documentation to understand the concepts.
Thanks again for your generous support in helping me understand NiFi better.
Mans
On Thursday, October 15, 2015 12:53 PM, Matthew Clarke
wrote:
Mans,
I have attached a screenshot showing how the ListHDFS and FetchHDFS processors
would be configured in a NiFi cluster to achieve what we believe you're looking
to accomplish. At the end you will have each of your nodes fetching different
files from HDFS. These nodes will work on each of their files independently of
the other nodes. The NCM serves as your eyes into your cluster. Every
processor on your graph exists on every node. Unless specifically configured to
run 'on primary node' only, the processors all run on every node using the
configured values. Setting the 'concurrent tasks' on a processor will have the
effect of setting that number of concurrent tasks on that processor on every
node.
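One consequence of that per-node behavior is worth spelling out: the Concurrent Tasks setting multiplies across the cluster. A tiny sketch, with a made-up node count and task setting, shows the arithmetic:

```python
# Sketch: "Concurrent Tasks" is configured once per processor but applies
# on every node, so the cluster-wide thread count multiplies out.
# Both values below are hypothetical illustration numbers.

nodes = 4                 # nodes in the cluster
concurrent_tasks = 3      # Concurrent Tasks set on one processor

cluster_wide_threads = nodes * concurrent_tasks
print(cluster_wide_threads)  # 12 threads cluster-wide for that processor
```

So setting Concurrent Tasks to 1 still means one task per node, not one task total across the cluster.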
Thanks,
Matt
On Thu, Oct 15, 2015 at 12:17 PM, M Singh wrote:
Hi Mark:
>
>Thanks for your answers but being a newbie I am still not clear about some
>issues:
>
>
>Regarding hdfs multiple files:
>
>
>Typically, if you want to pull from HDFS and partition that data
>across the cluster, you would run ListHDFS on the Primary Node only, and then
>use Site-to-Site [1] to distribute
>that listing to all nodes in the cluster.
>
>
>Question - I believe that this requires sending the list of files to the NCM
>at the other site, which will take care of distributing it to its worker
>nodes. Do we send the list of files to the NCM as a single message that the
>NCM then splits, distributing one entry to each of the nodes, or should we
>send separate messages to the NCM, which will then send one message to each
>worker node? Also, if we send a single list of files to the NCM, does it
>send the same list to all its workers? If the NCM sends the same list, won't
>there be duplication of work?
>
>
>Regarding concurrent tasks -
>
>
>Question - How do they help in parallelizing the processing ?
>
>
>Regarding passing separate arguments to workers :
>
>
>Question - This is related to the above two, i.e., how do we partition the
>tasks across worker nodes in a cluster?
>
>
>Thanks again for your help.
>
>
>Mans
>
>On Wednesday, October 14, 2015 2:08 PM, Mark Payne
>wrote:
>
>
>
>Mans,
>
>
>Nodes in a cluster work independently from one another and do not know about
>each other. That is accurate.
>Each node in a cluster runs the same flow. Typically, if you want to pull from
>HDFS and partition that data
>across the cluster, you would run ListHDFS on the Primary Node only, and then
>use Site-to-Site [1] to distribute
>that listing to all nodes in the cluster. Each node would then pull the data
>that it is responsible to pull and begin
>working on it. We do realize that this is not ideal to have to setup this way,
>and it is something that we are working
>on so that it is much easier to have that listing automatically distributed
>across the cluster.
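The listing-distribution pattern described above can be sketched as a small simulation (this is not NiFi code: the paths and node names are hypothetical, and simple round-robin stands in for Site-to-Site's actual distribution logic):

```python
# Conceptual sketch of the ListHDFS -> Site-to-Site -> FetchHDFS pattern:
# one node produces the listing, the entries are spread across all nodes,
# and each node fetches only its own share. Paths/nodes are made up.

listing = [f"/data/part-{i:04d}" for i in range(10)]  # ListHDFS on primary node

nodes = ["node1", "node2", "node3"]
per_node = {n: [] for n in nodes}
for i, path in enumerate(listing):          # Site-to-Site spreads the entries
    per_node[nodes[i % len(nodes)]].append(path)

# Each node's FetchHDFS now pulls only its own files - no duplication.
```

The key property is that every listing entry lands on exactly one node, which is why no two nodes fetch the same HDFS file.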
>
>
>I'm not sure that I understand your #3 - how do we design the workflow so that
>the nodes work on one file at a time?
>For each Processor, you can configure how many threads (Concurrent Tasks) are
>to be used in the Scheduling tab
>of the Processor Configuration dialog. You can certainly configure that to run
>only a single Concurrent Task.
>This is the number of Concurrent Tasks that will run on each node in the
>cluster, not