In my experience, if you set the first processor to run on "Primary Node", the rest of the flow directly connected to it will run on that node, regardless of how the subsequent processors are configured to run.
If you really want to distribute the flow, insert a Remote Process Group (RPG) after the first step. It will distribute the load across the cluster if you set the subsequent processors to run in "All Nodes" mode.

Thanks,
Ravi Papisetti

On 02/07/18, 9:10 AM, "Matt Burgess" <[email protected]> wrote:

Joe,

Only the first (source) processor needs to be set to Primary Node Only. Once that happens, the flow files will only proceed down the flow on the primary node, so step 5 will also only run on the primary node.

In order to redistribute the flow files among the cluster, you'll want a Remote Process Group to point back to an Input Port on your cluster, between steps 4 & 5. From that point on, the flow files will be distributed among the nodes and the downstream flow (steps 5-7) will run on all the nodes.

Regards,
Matt

On Mon, Jul 2, 2018 at 10:05 AM Joe Trite <[email protected]> wrote:
>
> I have a question/need confirmation about cluster execution. I have a 3-node NiFi 1.6 cluster. My use case is extracting data from Hive and depositing it into an RDBMS. Here is my flow.
>
> 1. SelectHiveQL - executes a "show partitions" command.
> 2. SplitText - splits the returned partitions (7) into individual flowFiles
> 3. ExtractText - populates a 'partition_info' attribute
> 4. UpdateAttribute - reformats the 'partition_info' into SQL syntax
> 5. SelectHiveQL - executes the "SELECT" against Hive with the provided 'partition_info' as the WHERE clause.
> 6. SplitAvro - chunks the data into bite-size pieces.
> 7. PutDatabaseRecord - INSERT into the db.
>
> Processors 1-4 are set to 'Primary Node' only. 5-7 are set to 'All Nodes'. All processors are set to 1 concurrent task.
>
> The question is around what happens in step 5. I see the 7 'partition_info' flowFiles in the queue after step 4 completes and they seem to get executed one at a time in step 5, at least from viewing the queue drain.
>
> I would expect that step 5 would execute on each of the nodes (3) and that I would see the queue drain in 3's. Is this assumption correct, and do I maybe have something misconfigured?
>
> I do see in the provenance data that all 3 nodes did process a flowFile, I am just expecting it to happen in parallel.
>
> I did see this article about distribution but don't think it is required for this use case to work:
> https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
>
> Thanks
> Joe
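
For anyone reading this thread later, the behavior discussed here can be illustrated with a rough toy model in Python. This is only a sketch and not NiFi code: the node names, the `run_flow` function, and the round-robin spreading are all assumptions made for illustration (a real Remote Process Group decides where to send flow files on its own terms). It only shows the idea that an "All Nodes" processor works on whatever sits in its own node's queue, so flow files created on the primary node stay there unless something redistributes them.

```python
from itertools import cycle

# Hypothetical 3-node cluster; "node1" is assumed to be the primary node.
NODES = ["node1", "node2", "node3"]
PRIMARY = "node1"


def run_flow(flow_files, redistribute):
    """Return how many flow files each node handles at step 5 (toy model)."""
    # Steps 1-4 run on the primary node only, so every flow file
    # starts out in the primary node's local queue.
    queues = {node: [] for node in NODES}
    queues[PRIMARY] = list(flow_files)

    if redistribute:
        # Stand-in for a Remote Process Group between steps 4 and 5.
        # Round-robin is an assumption chosen to keep the sketch simple.
        spread = {node: [] for node in NODES}
        targets = cycle(NODES)
        for flow_file in queues[PRIMARY]:
            spread[next(targets)].append(flow_file)
        queues = spread

    # Step 5 is set to "All Nodes": each node processes only its own queue.
    return {node: len(queue) for node, queue in queues.items() if queue}


if __name__ == "__main__":
    partitions = [f"partition_{i}" for i in range(7)]
    print(run_flow(partitions, redistribute=False))
    print(run_flow(partitions, redistribute=True))
```

With the 7 partition flow files from the original question, the first call reports all 7 on node1, and the second reports 3, 2 and 2 across the three nodes.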
