Re: Clustered Flow Execution Help

Ravi Papisetti (rpapiset) Mon, 02 Jul 2018 08:00:42 -0700

In my experience, if you set the first processor to run on "Primary Node", the 
remaining flow that directly connects to it will run on that node, regardless 
of how the subsequent processors are configured to run.


If you really want to distribute the flow, insert an RPG (Remote Process 
Group) after the first step. It will distribute the load across the cluster, 
provided the subsequent processors are set to run in "All Nodes" mode.
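
To make the difference concrete, here is a rough toy model in Python. It is 
not NiFi code and calls no NiFi API: the node names, the choice of primary 
node, and the round-robin hand-out are all assumptions made for illustration 
(real site-to-site distribution need not be strictly round-robin). The 7 
flow files match the 7 partitions in Joe's flow.

```python
from collections import Counter
from itertools import cycle

NODES = ["node1", "node2", "node3"]  # hypothetical node names
PRIMARY = "node1"                    # assumed to be the elected primary node


def run_flow(num_flowfiles, redistribute):
    """Count how many flow files each node handles in the downstream steps."""
    # A source processor set to "Primary Node" creates every flow file there.
    locations = [PRIMARY] * num_flowfiles
    if redistribute:
        # Stand-in for an RPG pointing back at an Input Port on the same
        # cluster: flow files are handed out across the nodes. Round-robin
        # is an illustrative assumption, not NiFi's actual algorithm.
        targets = cycle(NODES)
        locations = [next(targets) for _ in locations]
    # Downstream "All Nodes" processors only see flow files queued locally.
    return Counter(locations)


without_rpg = run_flow(7, redistribute=False)
with_rpg = run_flow(7, redistribute=True)
print("without RPG:", dict(without_rpg))
print("with RPG:   ", dict(with_rpg))
```

Without the redistribution step, every flow file stays on the primary node 
no matter how the later processors are scheduled. With it, the work is 
spread across all three nodes.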

Thanks,
Ravi Papisetti

On 02/07/18, 9:10 AM, "Matt Burgess" <[email protected]> wrote:

    Joe,
    
    Only the first (source) processor needs to be set to Primary Node
    Only. Once that happens, the flow files will only proceed down the
    flow on the primary node, so step 5 will also only run on the primary
    node. In order to redistribute the flow files among the cluster,
    you'll want a Remote Process Group to point back to an Input Port on
    your cluster, between steps 4 & 5. From that point on, the flow files
    will be distributed among the nodes and the downstream flow (steps
    5-7) will run on all the nodes.
    
    Regards,
    Matt
    
    On Mon, Jul 2, 2018 at 10:05 AM Joe Trite <[email protected]> wrote:
    >
    > I have a question/need confirmation about cluster execution.  I have a 
3-node NiFi 1.6 cluster.  My use case is extracting data from Hive and 
depositing it into an RDBMS.  Here is my flow.
    >
    > 1. SelectHiveQL - executes a "show partitions" command.
    > 2. SplitText - splits the returned partitions (7) into individual flowFiles
    > 3. ExtractText - populates a 'partition_info' attribute
    > 4. UpdateAttribute - reformat the 'partition_info' into sql syntax
    > 5. SelectHiveQL - executes the "SELECT" against hive with the provided 
'partition_info' as the WHERE clause.
    > 6. SplitAvro - chunks the data into bite-size pieces.
    > 7. PutDatabaseRecord - INSERT into the db.
    >
    > Processors 1-4 are set to 'Primary Node' only.  5-7 are set to 'All 
Nodes'.  All processors are set to 1 concurrent task.
    >
    > The question is about what happens in step 5.  I see the 7 
'partition_info' flowFiles in the queue after step 4 completes, and they seem to 
get executed one at a time in step 5, at least from viewing the queue drain.  I 
would expect step 5 to execute on each of the nodes (3) and that I 
would see the queue drain in 3's. Is this assumption correct, and might I have 
something misconfigured?
    >
    > I do see in the provenance data that all 3 nodes did process a flowFile, 
I am just expecting it to happen in parallel.
    >
    > I did see this article about distribution but don't think it is required 
for this use case to work:
    > 
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
    >
    > Thanks
    > Joe
    >
    >
