Thanks Bryan. This is very helpful.

Regards,
Manish


From: Bryan Bende [mailto:[email protected]]
Sent: Tuesday, August 09, 2016 12:50 AM
To: [email protected]
Subject: Re: Processors in cluster mode

Hi Manish,

This post [1] has an overview of how to distribute data across your NiFi 
cluster.

In general though, NiFi runs the same flow on each node and the data needs to 
be divided across the nodes appropriately depending on the situation.
The only exception to running the same flow on every node is when a processor 
is scheduled to run Primary Node only.

Concurrent Tasks is the number of threads that will concurrently call a given 
instance of a processor. So if you have a processor "Foo" on a three-node 
cluster and set Concurrent Tasks to 2, there will be three instances of Foo, 
and each will have two threads calling its onTrigger method.
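
To make the arithmetic concrete, here is a small Python sketch. It is not NiFi 
code; FooInstance, on_trigger and run_node are made-up names for illustration. 
It models three nodes, each with its own instance of Foo and two threads 
calling into that instance:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

NODES = 3             # size of the cluster in the example
CONCURRENT_TASKS = 2  # the "Concurrent Tasks" setting on processor Foo


class FooInstance:
    """Hypothetical stand-in for one node's instance of processor Foo."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.lock = threading.Lock()
        self.calls = 0

    def on_trigger(self):
        # Every thread on this node calls the same instance, so state kept
        # on the instance has to be protected.
        with self.lock:
            self.calls += 1


def worker(instance, triggers):
    # One thread repeatedly triggering the processor instance.
    for _ in range(triggers):
        instance.on_trigger()


def run_node(instance, triggers_per_thread):
    # One pool per node, sized by the Concurrent Tasks setting.
    # Leaving the "with" block waits for all submitted work to finish.
    with ThreadPoolExecutor(max_workers=CONCURRENT_TASKS) as pool:
        for _ in range(CONCURRENT_TASKS):
            pool.submit(worker, instance, triggers_per_thread)


instances = [FooInstance(n) for n in range(NODES)]
for inst in instances:
    run_node(inst, triggers_per_thread=5)

total_threads = NODES * CONCURRENT_TASKS
print(len(instances), "instances,", total_threads, "threads in total")
```

The thing to notice is that the two threads on a node share one instance, so 
the setting multiplies the calls to onTrigger, not the number of instances.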

For some of your specific cases...

ListenTCP - You would have an instance of this processor on all three nodes 
and need the producer to send to all of them, or have a load balancer that 
supports TCP sitting in front of the nodes and have the producer send to the 
load balancer.
Get/Fetch File - These pick up files from the local filesystem so it would be 
up to the data producer to send/write files on each node of the cluster for 
each instance of this processor to pick up.
Distribute Load Processor - There will be a Distribute Load processor running 
on each node and operating on only the flow files on that node.
ExecuteSQL - Typically you would run this on primary node only. In an 
upcoming release there are going to be more options, with a 
ListDatabaseTable processor that can produce instructions that can be 
distributed across a cluster to your ExecuteSQL processor.
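
For the ListenTCP case, if you have the producer send to every node itself 
rather than through a load balancer, a simple round-robin is one way to do it. 
The Python sketch below is only an illustration: the host names and port are 
placeholders, and it assumes ListenTCP is reading newline-delimited messages.

```python
import itertools
import socket

# Hypothetical node addresses; replace with the hosts and port where
# ListenTCP is actually listening.
NODES = [("nifi-node1", 9999), ("nifi-node2", 9999), ("nifi-node3", 9999)]


def assign_round_robin(messages, nodes):
    """Pair each message with a node, cycling through the nodes in order."""
    return list(zip(itertools.cycle(nodes), messages))


def send(node, message):
    """Open a TCP connection to one node and send one newline-terminated message."""
    with socket.create_connection(node, timeout=5) as conn:
        conn.sendall(message.encode("utf-8") + b"\n")


plan = assign_round_robin(["m1", "m2", "m3", "m4"], NODES)
for node, message in plan:
    print(node[0], "<-", message)
    # send(node, message)  # uncomment once the addresses above are real
```

A real producer would keep connections open instead of reconnecting per 
message, and would need to decide what to do when a node is down.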

Hope that helps.

Thanks,

Bryan

[1] https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

On Mon, Aug 8, 2016 at 7:55 AM, Manish Gupta 8 <[email protected]> wrote:
Hi,

I am running a multi-node NiFi (0.7.0) cluster and trying to implement a 
streaming ingestion pipeline (about 200 MB/s at peak and around 30 MB/s during 
non-peak hours) that routes to different destinations (Kafka, Azure Storage, 
HDFS). The dataflow will expose a TCP port for incoming data and will also 
ingest files from a folder, database records, etc.

It would be great if someone can provide a link/doc that explains how 
processors can be expected to behave in a multi-node environment.
My doubts are about how some of the processors work in a clustered mode, and 
the meaning of concurrent tasks.

For example:


• ListenTCP:

  o When this processor is scheduled to run on a cluster (and not on the 
    primary node), does it mean I need to send data to all the individual 
    nodes manually, i.e. specify each node’s host name separately? If I 
    don’t specify hosts individually and only provide, let’s say, the 
    primary node’s host name from the producer, will all the other nodes 
    remain idle? Or does NiFi try to distribute the data to the other nodes 
    using some routing strategy? I am trying to increase the throughput and 
    am thinking of something like this as the data producer strategy:

    [inline image not preserved]

    And the consumer will simply be as follows:

    [inline image not preserved]

  o When I increase the number of concurrent tasks, does it make multiple 
    copies of the buffer/channel reader etc.? Or is it only the processing 
    that gets multiplied?

• Get / Fetch File:

  o Can we assume that when this processor is running on multiple nodes and 
    threads, the same file will never get pulled multiple times as a 
    flow-file?

• Distribute Load Processor:

  o When this processor is running on multiple nodes, will all the incoming 
    flow files go to each running instance? This question applies to any 
    processor that runs on a cluster and has to consume an incoming 
    flow-file. What’s the general routing strategy in NiFi when a processor 
    is running on multiple nodes?

• ExecuteSQL:

  o Will all the running instances on all the nodes be hitting the RDBMS to 
    fetch the data for the same query, leading to duplicates and heavy load 
    on the database?

Thanks,
Manish

