Hi Mark:
Thanks for your answers but being a newbie I am still not clear about some 
issues:
Regarding hdfs multiple files:
Typically, if you want to pull from HDFS and partition that data across the 
cluster, you would run ListHDFS on the Primary Node only, and then use 
Site-to-Site [1] to distribute that listing to all nodes in the cluster. 
Question - I believe this requires sending the list of files to the NCM at the 
other site, which will take care of distributing it to its worker nodes. 
Do we send the list of files to the NCM as a single message that the NCM then 
splits, distributing one file to each node, or should we send separate messages 
to the NCM, which then sends one message to each worker node? Also, if we send 
a single list of files to the NCM, does it send the same list to all its 
workers? If the NCM sends the same list, won't there be duplication of work?
Regarding concurrent tasks:
Question - How do they help in parallelizing the processing?
Regarding passing separate arguments to workers:
Question - This is related to the above two, i.e., how do we partition the 
tasks across worker nodes in a cluster?
Thanks again for your help.
Mans


 


On Wednesday, October 14, 2015 2:08 PM, Mark Payne <[email protected]> wrote:
   

Mans,
Nodes in a cluster work independently from one another and do not know about 
each other. That is accurate. Each node in a cluster runs the same flow. 
Typically, if you want to pull from HDFS and partition that data across the 
cluster, you would run ListHDFS on the Primary Node only, and then use 
Site-to-Site [1] to distribute that listing to all nodes in the cluster. Each 
node would then pull the data that it is responsible for pulling and begin 
working on it. We do realize that it is not ideal to have to set it up this 
way, and it is something that we are working on so that it is much easier to 
have that listing automatically distributed across the cluster.
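To illustrate the idea of each node pulling only the data it is responsible for, here is a minimal Python sketch. It is not NiFi code and does not show how Site-to-Site actually assigns work. The function name partition_listing, the node names, and the file paths are all hypothetical, and round-robin is only an assumed assignment policy chosen for illustration.

```python
def partition_listing(paths, nodes):
    """Assign each listed path to exactly one node, round-robin.

    Illustration only: this is an assumed policy, not NiFi's behavior.
    """
    assignment = {node: [] for node in nodes}
    for i, path in enumerate(paths):
        # Each path goes to one node only, so no file is processed twice.
        assignment[nodes[i % len(nodes)]].append(path)
    return assignment


# Hypothetical listing, as a ListHDFS-style step on one node might produce.
listing = ["/data/a.txt", "/data/b.txt", "/data/c.txt", "/data/d.txt", "/data/e.txt"]
nodes = ["node1", "node2", "node3"]
plan = partition_listing(listing, nodes)
```

The point of the sketch is that the listing is produced once and each entry is handed to a single node, which is what avoids duplicated work.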
I'm not sure that I understand your #3 - how do we design the workflow so that 
the nodes work on one file at a time?
For each Processor, you can configure how many threads (Concurrent Tasks) are 
to be used in the Scheduling tab of the Processor Configuration dialog. You can 
certainly configure that to run only a single Concurrent Task. This is the 
number of Concurrent Tasks that will run on each node in the cluster, not the 
total number of concurrent tasks that would run across the entire cluster.
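As a quick worked example of the per-node setting, the small Python sketch below computes the upper bound on tasks running across the whole cluster. The function name max_cluster_tasks and the numbers are hypothetical, and the calculation assumes every node runs the same flow with the same Concurrent Tasks value.

```python
def max_cluster_tasks(node_count, concurrent_tasks_per_node):
    """Upper bound on simultaneous tasks for one processor across the cluster.

    Assumes the Concurrent Tasks value applies separately on every node.
    """
    return node_count * concurrent_tasks_per_node


# With one Concurrent Task per node on a hypothetical 3-node cluster,
# up to three files can still be in progress at once, one per node.
single_task_total = max_cluster_tasks(3, 1)
four_task_total = max_cluster_tasks(3, 4)
```

So setting Concurrent Tasks to 1 limits each node to one task at a time, while the cluster as a whole can still work on as many items as there are nodes.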
I am not sure that I understand your #4 either. Are you indicating that you 
want to configure each node in the cluster with a different value for a 
processor property?
Does this help?
Thanks
-Mark
[1] http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site


On Oct 14, 2015, at 4:49 PM, M Singh <[email protected]> wrote:
Hi:



A few questions about a NiFi cluster:
1. If we have multiple worker nodes in the cluster, do they partition the work 
   if the source allows partitioning (e.g., HDFS), or do all the nodes work on 
   the same data?
2. If the nodes partition the work, then how do they coordinate the work 
   distribution, recovery, etc.? From the documentation it appears that the 
   workers are not aware of each other.
3. If I need to process multiple files, how do we design the workflow so that 
   the nodes work on one file at a time?
4. If I have multiple arguments and need to pass one parameter to each worker, 
   how can I do that?
5. Is there any way to control how many workers are involved in processing the 
   flow?
6. Does specifying the number of threads in the processor distribute work 
   across multiple workers? Does it split the task across the threads, or is 
   that the responsibility of the application?
I tried to find some answers from the documentation and users list but could 
not get a clear picture.
Thanks
Mans



    



  
