Darren,

In short, yes, I think NiFi can handle such a case quite well in a generic sense.
Read on for the longer response...

NiFi can process extremely large pieces of data, extremely large datasets, extremely small data at high rates, variably sized data, and so on. It does this efficiently by design: the content repository supports pass-by-reference and copy-on-write behavior, and the system operates in a manner that lets the benefits of disk caching really shine through.

Now, that said, if all that is of interest is pure 'processing' in a general-purpose processing framework, then Storm, Spark, and others are focused solely on that space. NiFi is focused on managing dataflows from wherever in your enterprise data is created or produced, to and through processing systems, and ultimately into storage systems like HDFS, NoSQL stores, and relational databases. So depending on what you're trying to do to these documents, be it feature extraction, transformation, etc., NiFi may be a great choice, or NiFi may simply be the tool you use to feed the data into systems like Storm or Spark.

You can absolutely parallelize the flow of data across a NiFi cluster. For producers, we offer a library that speaks our site-to-site protocol; it handles things like load balancing and failover and makes it really easy to stream data to NiFi (there's a short sketch of it at the very end of this message). Alternatively, NiFi itself can pull from your system if the documents are sitting as files or are available via some other supported interface.

NiFi can be configured to control the rate of processing, queue data, apply back-pressure, and handle errors, and it offers a number of other features beneficial to the dataflow management problem. NiFi also supports making tradeoffs at key points in the flow between batch (time-tolerant) and low-latency (time-sensitive) processing and distribution. Whether data arrives in a streaming or batch fashion, and whether it must be delivered to downstream systems in a streaming or batch fashion, is a concern NiFi handles well, so the various systems can remain loosely coupled.

Regarding elasticity: NiFi is not elastic in the sense that it will (at this time) automatically provision additional nodes to take on the workload and then deprovision them as the load decreases. We will get there. What we do support are key capabilities like event-driven processing with upper bounds on threads, and back-pressure that can propagate to the source, causing data to go to less loaded nodes. These are elements of elastic behavior, but it is not elastic provisioning (as folks often mean).

I hope this response is helpful. If any of this was unclear or you want to dive deeper, just let us know.

Thanks
Joe

On Tue, Nov 10, 2015 at 6:30 PM, Darren Govoni <[email protected]> wrote:
> Hi,
> I studied the nifi website a bit and if I missed a key part, forgive me
> for asking this question.
> But I am wondering if or how nifi can accommodate processing large data
> sets with possibly compute intensive operations.
> For example, if we have say 2 million documents, how does nifi make
> processing these documents efficient?
> I understand the visual workflow and it's nice. How is that parallelized
> across a data set?
>
> Do we submit all the documents to a cluster of flows (how many?) that
> execute some number of documents simultaneously?
> Does nifi support batch processing? Is it elastic?
>
> Thanks.
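P.S. Since you asked how documents get submitted: pushing data to NiFi with the site-to-site client library looks roughly like the sketch below. Treat it as a minimal sketch rather than a definitive example: the URL, the input port name ("documents"), and the single hard-coded document are placeholder assumptions, and it presumes the nifi-site-to-site-client jar is on the classpath.

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.Map;

    import org.apache.nifi.remote.Transaction;
    import org.apache.nifi.remote.TransferDirection;
    import org.apache.nifi.remote.client.SiteToSiteClient;

    public class SendDocuments {
        public static void main(String[] args) throws Exception {
            // The client handles node discovery, load balancing, and
            // failover across the cluster for us.
            try (SiteToSiteClient client = new SiteToSiteClient.Builder()
                    .url("http://nifi-host:8080/nifi")  // placeholder URL
                    .portName("documents")              // placeholder Input Port name
                    .build()) {

                Transaction transaction = client.createTransaction(TransferDirection.SEND);

                // One send() per document; the attribute map rides along
                // as flowfile attributes.
                byte[] document = "example document".getBytes(StandardCharsets.UTF_8);
                Map<String, String> attributes = Collections.singletonMap("doc.id", "1");
                transaction.send(document, attributes);

                // Two-phase completion: confirm the checksum, then complete.
                transaction.confirm();
                transaction.complete();
            }
        }
    }

On the NiFi side you'd add an Input Port with a matching name on the root canvas and wire it into the rest of the flow; many producers can send concurrently and the cluster spreads the load across nodes.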
