Thanks Bryan. This is very helpful. Regards, Manish
From: Bryan Bende [mailto:[email protected]] Sent: Tuesday, August 09, 2016 12:50 AM To: [email protected] Subject: Re: Processors in cluster mode Hi Manish, This post [1] has an overview of how to distribute data across your NiFi cluster. In general though, NiFi runs the same flow on each node and the data needs to be divided across the nodes appropriately depending on the situation. The only exception to running the same flow on every node is when a processor is scheduled to run Primary Node only. Concurrent Tasks is the number of threads that will concurrently call a given instance of a processor. So if you have processor "Foo" and a three node cluster, and set concurrent tasks to 2, there will be three instances of Foo and each will have two threads calling the onTrigger method. For some of your specific cases... ListenTCP - You would have an instance of this process on all three nodes and need the producer to send to all of them, or have a load balancer that supports TCP sitting in front of the nodes and have the producer send to the load balancer. Get/Fetch File - These pick up files from the local filesystem so it would be up to the data producer to send/write files on each node of the cluster for each instance of this processor to pick up. Distribute Load Processor - There will be a Distribute Load processor running on each node and operating on only the flow files on that node. ExecuteSQL - Typically you would run this on primary node only, or in an upcoming release there is going to be some more options with a ListDatabaseTable processor that can produce instructions than can be distributed across a cluster to your ExecuteSQL processor. Hope that helps. Thanks, Bryan [1] https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html On Mon, Aug 8, 2016 at 7:55 AM, Manish Gupta 8 <[email protected]<mailto:[email protected]>> wrote: Hi, I am running a multi-node NiFi (0.7.0) cluster and trying to implement a streaming ingestion pipeline (@ 200 MB/s at peak and around 30 MB/s at non-peak hours) and routing to different destinations (Kafka, Azure Storage, HDFS). The dataflow will be exposing a TCP port for incoming data and will also be ingesting files from folder, database records etc. It would be great if someone can provide a link/doc that explains how processors can be expected to behave in a multi-node environment. My doubts are about how some of the processors work in a clustered mode, and the meaning of concurrent tasks. For example: • ListenTCP: o When this processor is scheduled to run on a cluster (and not on the primary node), then does it mean I need to send data to all the individual nodes manually i.e. specify each node’s host names separately? If I don’t specify hosts individually and only provide let’s say primary node’s host name from producer, will all the other nodes remain idle? Or NiFi tries to distribute the data to other nodes using some routing strategy? I am trying to increase the throughput and thinking something like this as data producer strategy: [cid:[email protected]] And consumer will be simply as following: [cid:[email protected]] o When I increase the number of concurrent tasks, does it make multiple copies of buffer/channel reader etc.? Or is it only the processing which gets multiplied? • Get / Fetch File: o Can we assume that when this processor is running on multiple nodes and threads, the same file will never get pulled multiple times as a flow-file? • Distribute Load Processor: o When this processor is running on multiple nodes, will all the incoming flow files go to each instance of running node? And this question is for any processor that run on a cluster and has to consume an incoming flow-file? What’s the general routing strategy in NiFi when a processor is running on multiple node? • ExecuteSQL o Will all the running instances on all the nodes be hitting the RDBMS to fetch the data for the same query leading to duplicates, and heavy load on database? Thanks, Manish
