Joe,

This might be a good basis for a blog post or page on the wiki.

On Tue, Nov 10, 2015 at 9:28 PM, Joe Witt <[email protected]> wrote:
> Darren,
>
> In short, yes, I think NiFi can handle such a case in a generic sense
> quite well.
>
> Read on for the longer response...
>
> NiFi can process extremely large data, extremely large datasets,
> extremely small data at high rates, variable sized data, etc. It
> does this efficiently by design: the content repository supports
> pass-by-reference and copy-on-write behavior, and NiFi operates in
> a manner that allows disk caching benefits to really shine through.
>
> Now that said, if all that is of interest is pure 'processing' and
> having a general purpose processing framework, Storm, Spark, and others
> are focused solely on that space. NiFi is focused on the management of
> dataflows from wherever in your enterprise data is created, produced,
> etc. to and through processing systems and ultimately into storage
> systems like HDFS, NoSQL stores, and relational databases.
>
> So depending on what you're trying to do to these documents, be it
> feature extraction, transformation, etc., NiFi may be a great choice
> or NiFi may simply be the tool you use to feed this data into systems
> like Storm or Spark or others. You can absolutely parallelize the
> flow of data across a NiFi cluster. For producers we offer a library
> to interact with our site-to-site protocol, which will handle things
> like load balancing and failover and make it really easy to stream
> data to NiFi. Or NiFi itself could pull from your system if perhaps
> these documents are sitting as files or available via some other
> supported interface.
>
> NiFi can be configured to control the rate of processing, queue data,
> apply back-pressure, handle errors, and a number of other features
> that are beneficial to the dataflow management problem.
>
> NiFi supports making tradeoffs at key points in the flow for batch
> (time tolerant) or low latency (time sensitive)
> processing/distribution. Whether data arrives in a streaming or batch
> fashion and whether it must be delivered to systems in batch or
> streaming fashion is a concern that NiFi handles well so the various
> systems can be less coupled.
>
> Regarding its elasticity, I will state that NiFi is not elastic in the
> sense that it will (at this time) automatically provision additional
> nodes to take on the work load and then deprovision them as the load
> decreases. We will get there. But what we support are key
> capabilities like event driven processing with upper bounds on
> threads, back-pressure which can propagate to the source causing data
> to go to less loaded nodes, and so on. These are elements of
> elastic behavior but it is not elastic provisioning (as folks often
> mean).
>
> I hope this response is helpful. If any of this was unclear or you
> want to dive deeper just let us know.
>
> Thanks
> Joe
>
> On Tue, Nov 10, 2015 at 6:30 PM, Darren Govoni <[email protected]> wrote:
> > Hi,
> > I studied the NiFi website a bit and if I missed a key part, forgive me
> > for asking this question.
> > But I am wondering if or how NiFi can accommodate processing large data sets
> > with possibly compute intensive operations.
> > For example, if we have say 2 million documents, how does NiFi make
> > processing these documents efficient?
> > I understand the visual workflow and it's nice. How is that parallelized
> > across a data set?
> >
> > Do we submit all the documents to a cluster of flows (how many?) that
> > execute some number of documents simultaneously?
> > Does NiFi support batch processing? Is it elastic?
> >
> > Thanks.
>
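
For anyone turning this into a wiki page, here is a rough illustration of the "upper bounds on threads" and back-pressure ideas from Joe's reply. It is only a toy model in plain Python, not NiFi code or NiFi's API: the names run_flow, queue_limit, and num_workers are made up for this sketch, and the bounded queue simply stands in for a connection with a back-pressure threshold. The producer blocks whenever the queue is full, so a slow consumer pushes back on the source.

```python
import queue
import threading


def run_flow(num_docs, queue_limit, num_workers):
    """Push num_docs items through a bounded queue.

    Returns (number of documents processed, peak queue depth seen by the producer).
    All names here are hypothetical; this is not NiFi's API.
    """
    # Bounded queue: put() blocks once queue_limit items are waiting.
    connection = queue.Queue(maxsize=queue_limit)
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            doc = connection.get()
            if doc is None:  # sentinel: no more work for this worker
                return
            result = doc.upper()  # stand-in for real per-document processing
            with lock:
                processed.append(result)

    # Fixed-size worker pool: an upper bound on processing threads.
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    peak_depth = 0
    for i in range(num_docs):
        # Blocks while the queue is full: back-pressure on the producer.
        connection.put(f"doc-{i}")
        peak_depth = max(peak_depth, connection.qsize())

    # One sentinel per worker, queued behind all real documents.
    for _ in workers:
        connection.put(None)
    for w in workers:
        w.join()

    return len(processed), peak_depth


if __name__ == "__main__":
    count, peak = run_flow(num_docs=1000, queue_limit=10, num_workers=4)
    print(f"processed {count} documents, peak queue depth {peak}")
```

The point of the sketch is that the queue depth never exceeds queue_limit no matter how fast documents are offered. That is the "element of elastic behavior" described above, as opposed to provisioning more nodes.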
