Hello, Bryan,

Thanks for the answer. You've understood me correctly. What I'm trying to achieve is to add some validation to the dataset. I fetch all the data with one query from the DB (I can't change this behavior), then I use the SplitAvro processor to split it into chunks of, say, 1000 records each. After that I want to treat each chunk independently: transform each record in the chunk according to my domain model, validate it, and save it. It is this transform-and-load work that I want to distribute across the cluster.
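Concretely, the per-chunk work I have in mind looks roughly like the sketch below (the field names and validation rules are just placeholders, not my real domain model, and the real flow writes to Cassandra rather than returning a list):

```python
# Illustrative sketch of the per-chunk transform/validate step.
# Field names and rules are placeholders for the real domain model.

def transform(record):
    # Map a raw DB record onto the domain model.
    return {
        "id": int(record["id"]),
        "name": record["name"].strip(),
    }

def validate(record):
    # Reject records that violate domain constraints.
    return record["id"] > 0 and record["name"] != ""

def process_chunk(chunk):
    # Transform every record in the chunk and keep only the valid
    # ones; the real flow would save these to Cassandra.
    transformed = [transform(r) for r in chunk]
    return [r for r in transformed if validate(r)]
```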
While reading about NiFi, I haven't found any information about flows like mine, and this worries me a little. Maybe I'm trying to do something that NiFi is not suited for. Is NiFi a suitable tool for processing large files, or should the actual processing work be done outside the NiFi flow?

2016-06-01 17:28 GMT+03:00 Bryan Bende <[email protected]>:

> Hello,
>
> This post [1] has a description of how to redistribute data within the
> same cluster. You are correct that it involves an RPG pointing back to
> the same cluster.
>
> One thing to keep in mind is that typically we do this with a List +
> Fetch pattern, where the List operation produces lightweight results,
> like the list of filenames to fetch, then redistributes those results,
> and the fetching happens in parallel.
> In your case, if I understand it correctly, you will have already
> fetched the data on the first node, and then have to transfer the
> actual data to the cluster nodes, which could have some overhead.
>
> It might require a custom processor to do this, but you might want to
> consider somehow determining what needs to be fetched after receiving
> the HTTP request, and redistributing that so each node can then fetch
> from the DB in parallel.
>
> Let me know if this doesn't make sense.
>
> -Bryan
>
> [1]
> https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
>
>
> On Wed, Jun 1, 2016 at 6:06 AM, Yuri Nikonovich <[email protected]>
> wrote:
>
>> Hi,
>> I have the following flow:
>> Receive HTTP request -> fetch data from DB -> split it into chunks of
>> fixed size -> process each chunk and save it to Cassandra.
>>
>> I've built the flow and it works perfectly on a non-clustered setup.
>> But when I configured a clustered setup, I found out that all the
>> heavy work is done on only one node. So if the flow has started on
>> node1, it will run to the end on node1. What I want to achieve is to
>> spread the data chunks fetched from the DB across the cluster in
>> order to process them in parallel, but it looks like NiFi doesn't
>> send flow files between nodes in a cluster.
>> As far as I understand, in order to make one node send data to
>> another node I should create a remote process group (RPG) and send
>> data to it. All the examples I could find on the Internet describe
>> RPGs for cluster-to-cluster or remote node-to-cluster communication.
>> So for my case, I assume I have to create an RPG pointing to the same
>> cluster. Could you please point me to a guide on how to do this?
>>
>>
>> --
>> Regards,
>> Nikanovich Yury
>>

--
Regards,
Yuri Nikonovich
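To make Bryan's List + Fetch suggestion above concrete: instead of fetching everything on one node, the first node could produce lightweight "what to fetch" descriptors (for example, primary-key ranges), redistribute those small flow files across the cluster, and let each node run its own bounded query. A minimal sketch of that idea, assuming an integer primary key (the table and column names are hypothetical):

```python
def list_ranges(min_id, max_id, chunk_size):
    # "List" step: produce lightweight descriptors of what to fetch,
    # one per chunk, instead of the data itself. These small results
    # are cheap to redistribute across the cluster.
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def fetch_query(id_range):
    # "Fetch" step: each node turns its descriptor into a bounded
    # query and pulls only its own slice from the DB, in parallel
    # with the other nodes.
    lo, hi = id_range
    return f"SELECT * FROM records WHERE id BETWEEN {lo} AND {hi}"
```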
