Hi Mark:
Thanks for your answers, but being a newbie I am still not clear about some
issues:
Regarding multiple HDFS files:
Typically, if you want to pull from HDFS and partition that data across the
cluster, you would run ListHDFS on the Primary Node only, and then use
Site-to-Site [1] to distribute that listing to all nodes in the cluster.
Question - I believe this requires sending the list of files to the NCM at
the other site, which will take care of distributing it to its worker nodes.
Do we send the list of files to the NCM as a single message, which the NCM
then splits so that each node gets one part, or should we send separate
messages to the NCM, which then forwards one message to each worker node?
Also, if we send a single list of files to the NCM, does it send the same list
to all its workers? If the NCM sends the same list, won't there be duplication
of work?
Regarding concurrent tasks:
Question - How do they help in parallelizing the processing?
Regarding passing separate arguments to workers:
Question - This is related to the above two, i.e., how do we partition the
tasks across worker nodes in a cluster?
Thanks again for your help.
Mans
On Wednesday, October 14, 2015 2:08 PM, Mark Payne <[email protected]>
wrote:
Mans,
Nodes in a cluster work independently from one another and do not know about
each other. That is accurate. Each node in a cluster runs the same flow.
Typically, if you want to pull from HDFS and partition that data across the
cluster, you would run ListHDFS on the Primary Node only, and then use
Site-to-Site [1] to distribute that listing to all nodes in the cluster. Each
node would then pull the data that it is responsible for and begin working
on it. We do realize that it is not ideal to have to set it up this way, and
it is something that we are working on so that it is much easier to have that
listing automatically distributed across the cluster.
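
To make the partitioning idea concrete, here is a minimal conceptual sketch in
Python. It is not NiFi code and does not use any NiFi API. It only illustrates
that when each entry of a single listing is handed to exactly one node, no
file is processed twice. The node names, the paths, and the round-robin policy
are all assumptions made for illustration; how Site-to-Site actually balances
data between nodes may differ.

```python
from collections import defaultdict
from itertools import cycle


def distribute_listing(paths, nodes):
    """Assign each listed path to exactly one node (round-robin, assumed policy)."""
    assignment = defaultdict(list)
    for path, node in zip(paths, cycle(nodes)):
        assignment[node].append(path)
    return dict(assignment)


# Hypothetical listing produced once, on a single node.
listing = [f"/data/part-{i:05d}" for i in range(7)]
# Hypothetical cluster members.
nodes = ["node-1", "node-2", "node-3"]

assignment = distribute_listing(listing, nodes)
for node in nodes:
    print(node, assignment.get(node, []))
```

Because the listing is produced once and every entry goes to a single node,
each node only pulls the files it was handed.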
I'm not sure that I understand your #3 - how do we design the workflow so that
the nodes work on one file at a time?
For each Processor, you can configure how many threads (Concurrent Tasks) are
to be used in the Scheduling tab of the Processor Configuration dialog. You can
certainly configure that to run only a single Concurrent Task. This is the
number of Concurrent Tasks that will run on each node in the cluster, not the
total number of concurrent tasks that would run across the entire cluster.
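
The per-node meaning of that setting can be sketched as follows. This is a
conceptual Python illustration, not NiFi code: each hypothetical node runs its
own thread pool sized by the same setting, so the cluster-wide upper bound on
threads is the per-node value multiplied by the number of nodes. The node
names, the queues, and the process function are assumptions made for the
example.

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENT_TASKS = 2  # per-node setting, identical on every node


def process(path):
    """Stand-in for whatever work a processor does on one file."""
    return path.upper()


def run_node(queue, concurrent_tasks):
    """Process one node's queue with at most `concurrent_tasks` threads."""
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        return list(pool.map(process, queue))


# Hypothetical per-node work queues.
node_queues = {
    "node-1": ["/data/a", "/data/d"],
    "node-2": ["/data/b", "/data/e"],
    "node-3": ["/data/c"],
}

results = {node: run_node(q, CONCURRENT_TASKS) for node, q in node_queues.items()}
max_threads_cluster_wide = CONCURRENT_TASKS * len(node_queues)
print(results)
print("upper bound on threads across the cluster:", max_threads_cluster_wide)
```

Setting the value to 1 would make each node handle one item at a time, while
the cluster as a whole could still work on as many items as there are nodes.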
I am not sure that I understand your #4 either. Are you indicating that you
want to configure each node in the cluster with a different value for a
processor property?
Does this help?
Thanks
-Mark
[1] http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
On Oct 14, 2015, at 4:49 PM, M Singh <[email protected]> wrote:
Hi:
A few questions about NiFi cluster:
1. If we have multiple worker nodes in the cluster, do they partition the work
if the source allows partitioning - e.g. HDFS, or do all the nodes work on the
same data?
2. If the nodes partition the work, then how do they coordinate the work
distribution and recovery etc.? From the documentation it appears that the
workers are not aware of each other.
3. If I need to process multiple files - how do we design the work flow so
that the nodes work on one file at a time?
4. If I have multiple arguments and need to pass one parameter to each worker,
how can I do that?
5. Is there any way to control how many workers are involved in processing the
flow?
6. Does specifying the number of threads in the processor distribute work on
multiple workers? Does it split the task across the threads or is it the
responsibility of the application?
I tried to find some answers from the documentation and users list but could
not get a clear picture.
Thanks
Mans