Processors in cluster mode

Manish Gupta 8 Mon, 08 Aug 2016 04:56:27 -0700

Hi,

I am running a multi-node NiFi (0.7.0) cluster and trying to implement a
streaming ingestion pipeline (200 MB/s at peak and around 30 MB/s during
off-peak hours) that routes data to different destinations (Kafka, Azure
Storage, HDFS). The dataflow will expose a TCP port for incoming data and will
also ingest files from a folder, database records, etc.


It would be great if someone could provide a link or document that explains
how processors are expected to behave in a multi-node environment.
My questions are about how some of the processors work in clustered mode, and
what concurrent tasks mean.

For example:


*         ListenTCP:

o   When this processor is scheduled to run on all nodes of a cluster (rather
than only on the primary node), do I need to send data to each node manually,
i.e. specify each node's host name separately? If I don't specify hosts
individually and the producer only uses, say, the primary node's host name,
will all the other nodes remain idle? Or does NiFi try to distribute the data
to the other nodes using some routing strategy? I am trying to increase the
throughput and am thinking of something like this as the data producer
strategy:


[image: proposed data producer strategy]



And the consumer will simply be as follows:

[image: consumer flow]

o   When I increase the number of concurrent tasks, does NiFi create multiple
copies of the buffer, channel reader, etc.? Or is only the processing
multiplied?

*         GetFile / FetchFile:

o   Can we assume that when this processor is running on multiple nodes and
threads, the same file will never be pulled more than once as a FlowFile?

*         DistributeLoad processor:

o   When this processor is running on multiple nodes, will every incoming
FlowFile go to the instance on each node? The same question applies to any
processor that runs on a cluster and consumes incoming FlowFiles: what is the
general routing strategy in NiFi when a processor is running on multiple
nodes?

*         ExecuteSQL:

o   Will the running instances on all the nodes hit the RDBMS to fetch data
for the same query, leading to duplicates and a heavy load on the database?
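
As an illustration of the ListenTCP producer strategy described above (the
producer addressing each node's host name separately), a minimal Python sketch
might look like the following. The host names, the port, and the
newline-delimited message framing are assumptions made for the example, not
details of the actual setup, and round-robin is only one possible way to
spread the load.

```python
import itertools
import socket

# Hypothetical ListenTCP endpoints, one per NiFi node (assumed names and port).
NODES = [("nifi-node1", 9000), ("nifi-node2", 9000), ("nifi-node3", 9000)]


def assign(messages, nodes):
    """Pair each message with a node address in round-robin order."""
    return list(zip(messages, itertools.cycle(nodes)))


def send_all(messages, nodes):
    """Open one connection per node and spread the messages across them.

    Messages are bytes; each one is terminated with a newline, which is
    assumed to be the delimiter the receiving side expects.
    """
    conns = {addr: socket.create_connection(addr) for addr in nodes}
    try:
        for msg, addr in assign(messages, nodes):
            conns[addr].sendall(msg + b"\n")
    finally:
        for conn in conns.values():
            conn.close()
```

With this approach every node receives a share of the traffic only because the
producer sends to it directly; calling send_all(batch, NODES) with a list of
byte strings would rotate through the three assumed endpoints.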

Thanks,
Manish
