I have a question (or rather, need confirmation) about cluster execution.  I have a
3-node NiFi 1.6 cluster.  My use case is extracting data from Hive and
depositing it into an RDBMS.  Here is my flow.

1. SelectHiveQL - executes a "show partitions" command.
2. SplitText - splits the returned partitions (7) into individual flowFiles
3. ExtractText - populates a 'partition_info' attribute
4. UpdateAttribute - reformats the 'partition_info' into SQL syntax
5. SelectHiveQL - executes the "SELECT" against hive with the provided
'partition_info' as the WHERE clause.
6. SplitAvro - chunks the data into bite-size pieces.
7. PutDatabaseRecord - INSERT into the db.
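
In case it helps, the reformatting in step 4 amounts to roughly the
following. This is only a Python sketch of the idea, not the actual
UpdateAttribute expression; the function name, the 'key=value/key=value'
partition format, and quoting every value as a string are assumptions
made for illustration.

```python
def partition_to_where(partition_info: str) -> str:
    """Turn a partition spec such as 'year=2018/month=01' into the body
    of a WHERE clause (hypothetical helper, for illustration only)."""
    clauses = []
    for part in partition_info.strip().split("/"):
        # Each segment is assumed to look like 'column=value'.
        key, _, value = part.partition("=")
        # Double any single quotes so the value stays a valid SQL literal.
        escaped = value.replace("'", "''")
        clauses.append(f"{key} = '{escaped}'")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(partition_to_where("year=2018/month=01"))
```

Step 5 then uses the resulting string as the WHERE clause of the SELECT.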

Processors 1-4 are set to 'Primary Node' only.  5-7 are set to 'All
Nodes'.  All processors are set to 1 concurrent task.

The question is about what happens in step 5.  I see the 7
'partition_info' flowFiles in the queue after step 4 completes, and they
seem to get executed one at a time in step 5, at least from watching the
queue drain.  I would expect step 5 to execute on each of the 3 nodes
and the queue to drain in 3's.  Is this assumption correct,
or do I have something misconfigured?

I do see in the provenance data that all 3 nodes did process a flowFile; I
am just expecting it to happen in parallel.

I did see this article about distribution but don't think it is required
for this use case to work:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Thanks
Joe
