I have a question (or rather, I need confirmation) about cluster execution. I have a 3-node NiFi 1.6 cluster. My use case is extracting data from Hive and depositing it into an RDBMS. Here is my flow:

1. SelectHiveQL - executes a "show partitions" command.
2. SplitText - splits the returned partitions (7 of them) into individual flowFiles.
3. ExtractText - populates a 'partition_info' attribute.
4. UpdateAttribute - reformats the 'partition_info' into SQL syntax.
5. SelectHiveQL - executes the "SELECT" against Hive with the provided 'partition_info' as the WHERE clause.
6. SplitAvro - chunks the data into bite-size pieces.
7. PutDatabaseRecord - INSERTs into the db.

Processors 1-4 are set to 'Primary Node' only. Processors 5-7 are set to 'All Nodes'. All processors are set to 1 concurrent task.

My question is about what happens in step 5. I see the 7 'partition_info' flowFiles in the queue after step 4 completes, and they seem to get executed one at a time in step 5, at least judging from watching the queue drain. I would expect step 5 to execute on each of the 3 nodes, so that I would see the queue drain in threes. Is this assumption correct, and do I perhaps have something misconfigured? I do see in the provenance data that all 3 nodes processed a flowFile; I am just expecting it to happen in parallel.

I did see this article about distribution, but I don't think it is required for this use case to work: https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Thanks,
Joe
