Mans,

Nodes in a cluster work independently from one another and do not know about each other. That is accurate. Each node in a cluster runs the same flow. Typically, if you want to pull from HDFS and partition that data across the cluster, you would run ListHDFS on the Primary Node only, and then use Site-to-Site [1] to distribute that listing to all nodes in the cluster. Each node would then pull the data that it is responsible for and begin working on it. We realize that having to set it up this way is not ideal, and we are working on making it much easier to have that listing automatically distributed across the cluster.
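To make the list-then-distribute pattern concrete, here is a minimal Python sketch. It is not NiFi code: the function name, node names, and file paths are all hypothetical, and the round-robin assignment is only an assumption for illustration. How Site-to-Site actually spreads the listing across nodes is decided by NiFi and need not be round-robin.

```python
from itertools import cycle


def distribute_listing(listing, nodes):
    """Assign each listed path to exactly one node.

    Round-robin is an illustrative assumption, not NiFi's actual policy.
    """
    assignments = {node: [] for node in nodes}
    for path, node in zip(listing, cycle(nodes)):
        assignments[node].append(path)
    return assignments


# Hypothetical listing, as produced once on the Primary Node.
listing = ["/data/a.csv", "/data/b.csv", "/data/c.csv", "/data/d.csv", "/data/e.csv"]
# Hypothetical cluster nodes; every node runs the same flow.
nodes = ["node1", "node2", "node3"]

assignments = distribute_listing(listing, nodes)
for node, paths in assignments.items():
    # Each node fetches and processes only the paths it was handed.
    print(node, paths)
```

The point of the sketch is that the listing happens once, and each file ends up with exactly one node, so no two nodes pull the same data.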
I'm not sure that I understand your #3 - how do we design the workflow so that the nodes work on one file at a time? For each Processor, you can configure how many threads (Concurrent Tasks) are to be used in the Scheduling tab of the Processor Configuration dialog. You can certainly configure that to run only a single Concurrent Task. This is the number of Concurrent Tasks that will run on each node in the cluster, not the total number of concurrent tasks that would run across the entire cluster. For example, with 3 nodes and a single Concurrent Task each, up to 3 tasks can run across the cluster at once.

I am not sure that I understand your #4 either. Are you indicating that you want to configure each node in the cluster with a different value for a processor property?

Does this help?

Thanks
-Mark

[1] http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site

> On Oct 14, 2015, at 4:49 PM, M Singh <[email protected]> wrote:
>
> Hi:
>
> A few questions about NiFi cluster:
>
> 1. If we have multiple worker nodes in the cluster, do they partition the
> work if the source allows partitioning - eg: HDFS, or do all the nodes work
> on the same data ?
> 2. If the nodes partition the work, then how do they coordinate the work
> distribution and recovery etc ? From the documentation it appears that the
> workers are not aware of each other.
> 3. If I need to process multiple files - how do we design the work flow so
> that the nodes work on one file at a time ?
> 4. If I have multiple arguments and need to pass one parameter to each
> worker, how can I do that ?
> 5. Is there any way to control how many workers are involved in processing
> the flow ?
> 6. Does specifying the number of threads in the processor distribute work on
> multiple workers ? Does it split the task across the threads or is it the
> responsibility of the application ?
>
> I tried to find some answers from the documentation and users list but could
> not get a clear picture.
>
> Thanks
>
> Mans
