Hi, I am running a multi-node NiFi (0.7.0) cluster and trying to implement a streaming ingestion pipeline (about 200 MB/s at peak and around 30 MB/s during non-peak hours) that routes to different destinations (Kafka, Azure Storage, HDFS). The dataflow will expose a TCP port for incoming data and will also ingest files from folders, database records, etc.
It would be great if someone could point me to a link or document that explains how processors can be expected to behave in a multi-node environment. My questions are about how some of the processors work in clustered mode, and about the meaning of concurrent tasks. For example:

* ListenTCP:
  o When this processor is scheduled to run on the whole cluster (and not only on the primary node), do I need to send data to each node individually, i.e. specify each node's host name separately? If I don't specify the hosts individually and the producer is only given, say, the primary node's host name, will all the other nodes remain idle? Or does NiFi try to distribute the data to the other nodes using some routing strategy? I am trying to increase throughput and am thinking of something like this as the data producer strategy:
    [inline image: producer diagram, not available]
    And the consumer would simply be as follows:
    [inline image: consumer diagram, not available]
  o When I increase the number of concurrent tasks, does it create multiple copies of the buffer/channel reader etc.? Or is it only the processing that gets multiplied?
* GetFile / FetchFile:
  o Can we assume that when this processor is running on multiple nodes and threads, the same file will never be pulled more than once as a flow file?
* DistributeLoad:
  o When this processor is running on multiple nodes, will every incoming flow file go to the instance on each node? This question applies to any processor that runs on a cluster and consumes incoming flow files: what is the general routing strategy in NiFi when a processor is running on multiple nodes?
* ExecuteSQL:
  o Will the running instances on all the nodes hit the RDBMS with the same query, leading to duplicates and heavy load on the database?

Thanks,
Manish
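To make the ListenTCP question concrete, the option of sending to each node's host name separately might look roughly like the sketch below. It is only an illustration of a producer spreading records round-robin across nodes, not a reconstruction of the diagrams above. The host names, port, and function names are hypothetical, and it assumes each node runs ListenTCP on the same port and accepts newline-delimited records.

```python
import itertools
import socket

# Hypothetical node addresses (assumption): one ListenTCP instance per NiFi node.
NODES = [
    ("nifi-node1.example.com", 9999),
    ("nifi-node2.example.com", 9999),
    ("nifi-node3.example.com", 9999),
]


def assign_round_robin(records, nodes):
    """Pair each record with a node, cycling through the nodes in order."""
    return list(zip(itertools.cycle(nodes), records))


def send_all(records, nodes=NODES, timeout=5.0):
    """Open one TCP connection per node and send newline-delimited records."""
    conns = {node: socket.create_connection(node, timeout=timeout) for node in nodes}
    try:
        for node, record in assign_round_robin(records, nodes):
            # Assumption: the receiver treats a newline as the record delimiter.
            conns[node].sendall(record + b"\n")
    finally:
        for conn in conns.values():
            conn.close()
```

Calling `send_all([b"event-1", b"event-2", b"event-3"])` would send one record to each of the three hypothetical nodes. The alternative being asked about is whether the producer can target a single host and leave the distribution to NiFi.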
