Joe,
This might be a good basis for a blog post or page on the wiki.

On Tue, Nov 10, 2015 at 9:28 PM, Joe Witt <[email protected]> wrote:

> Darren,
>
> In short, yes I think NiFi can handle such a case in a generic sense quite
> well.
>
> Read on for the longer response...
>
> NiFi can process extremely large data, extremely large datasets,
> extremely small data at high rates, variably sized data, and so on.
> Its design makes this efficient: the content repository supports
> pass-by-reference and copy-on-write behavior, and it operates in a
> way that lets disk-caching benefits really shine through.
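The pass-by-reference and copy-on-write ideas above can be sketched in a few lines. This is a conceptual illustration, not NiFi's actual classes or API; all names here are invented:

```python
# Hypothetical sketch: FlowFiles carry a reference (a "claim") into an
# immutable content store, so routing and fan-out never copy bytes, and
# a "modification" writes a new claim rather than mutating the old one.

class ContentRepo:
    def __init__(self):
        self._store = {}           # claim id -> immutable bytes
        self._next = 0

    def write(self, data: bytes) -> int:
        claim = self._next
        self._next += 1
        self._store[claim] = data  # content is never mutated in place
        return claim

    def read(self, claim: int) -> bytes:
        return self._store[claim]

class FlowFile:
    def __init__(self, claim: int, attributes=None):
        self.claim = claim         # pass-by-reference: just a pointer
        self.attributes = dict(attributes or {})

def transform(repo: ContentRepo, ff: FlowFile, fn) -> FlowFile:
    # Copy-on-write: produce a new claim; the original stays intact,
    # so other FlowFiles referencing it (e.g. after fan-out) are safe.
    new_claim = repo.write(fn(repo.read(ff.claim)))
    return FlowFile(new_claim, ff.attributes)

repo = ContentRepo()
original = FlowFile(repo.write(b"hello"))
clone = FlowFile(original.claim)               # fan-out: no byte copy
upper = transform(repo, original, bytes.upper)

print(repo.read(clone.claim))   # b'hello'  (original content unchanged)
print(repo.read(upper.claim))   # b'HELLO'  (new claim)
```

Because the clone shares the original claim, "copying" a two-million-document flow for a second downstream path costs almost nothing.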
>
> Now, that said, if all you are interested in is pure 'processing'
> from a general-purpose processing framework, then Storm, Spark, and
> others are focused solely on that space.  NiFi is focused on managing
> dataflows from wherever in your enterprise data is created or
> produced, to and through processing systems, and ultimately into
> storage systems like HDFS, NoSQL stores, and relational databases.
>
> So depending on what you're trying to do to these documents, be it
> feature extraction, transformation, etc., NiFi may be a great choice,
> or NiFi may simply be the tool you use to feed this data into systems
> like Storm or Spark.  You can absolutely parallelize the flow of data
> across a NiFi cluster.  For producers we offer a library to interact
> with our site-to-site protocol, which handles things like load
> balancing and failover and makes it really easy to stream data to
> NiFi.  Or NiFi itself could pull from your system, if these documents
> are sitting as files or are available via some other supported
> interface.
>
> NiFi can be configured to control the rate of processing, queue data,
> apply back-pressure, and handle errors, along with a number of other
> capabilities beneficial to the dataflow management problem.
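The queuing and back-pressure behavior can be illustrated with a bounded queue between two workers. This is a conceptual sketch, not NiFi's API; in NiFi the equivalent is configured as back-pressure thresholds on a connection:

```python
# Conceptual sketch of back-pressure: two "processors" connected by a
# bounded queue. When the downstream consumer falls behind, the queue
# fills and the upstream producer blocks instead of overrunning memory.

import queue
import threading

connection = queue.Queue(maxsize=3)   # back-pressure object threshold
processed = []

def consumer():
    while True:
        item = connection.get()
        if item is None:              # sentinel: end of stream
            break
        processed.append(item)

t = threading.Thread(target=consumer)
t.start()

for doc in range(10):
    # put() blocks once 3 items are queued, so the producer can never
    # run faster than the consumer plus a bounded buffer.
    connection.put(doc)

connection.put(None)
t.join()
print(processed)   # [0, 1, 2, ..., 9]
```

In a NiFi cluster the same pressure can propagate upstream, which is what lets data be steered toward less loaded nodes.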
>
> NiFi supports making tradeoffs at key points in the flow between
> batch (time-tolerant) and low-latency (time-sensitive)
> processing/distribution.  Whether data arrives in a streaming or
> batch fashion, and whether it must be delivered to systems in a batch
> or streaming fashion, is a concern NiFi handles well, so the various
> systems can be less coupled.
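The batch-versus-streaming decoupling amounts to re-grouping data at a boundary. A minimal sketch, with invented names (NiFi itself does this kind of re-grouping with processors such as MergeContent, configured rather than coded):

```python
# Illustrative sketch: items stream in one at a time but are delivered
# downstream in batches bounded by count or age, decoupling the arrival
# style from the delivery style.

import time

def batch(stream, max_count=5, max_age_s=1.0):
    buf, started = [], time.monotonic()
    for item in stream:
        buf.append(item)
        age = time.monotonic() - started
        if len(buf) >= max_count or age >= max_age_s:
            yield buf                 # flush: full or old enough
            buf, started = [], time.monotonic()
    if buf:
        yield buf                     # flush whatever remains

batches = list(batch(range(12), max_count=5))
print(batches)   # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

The same boundary can be tuned the other way, emitting small batches quickly when latency matters more than throughput.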
>
> Regarding its elasticity, I will state that NiFi is not elastic in
> the sense that it will (at this time) automatically provision
> additional nodes to take on the workload and then deprovision them as
> the load decreases.  We will get there.  But what we do support are
> key capabilities like event-driven processing with upper bounds on
> threads, and back-pressure which can propagate to the source, causing
> data to go to less loaded nodes.  These are elements of elastic
> behavior, but it is not elastic provisioning (as folks often mean
> it).
>
> I hope this response is helpful.  If any of this was unclear or you
> want to dive deeper just let us know.
>
> Thanks
> Joe
>
> On Tue, Nov 10, 2015 at 6:30 PM, Darren Govoni <[email protected]>
> wrote:
> > Hi,
> >   I studied the nifi website a bit and if I missed a key part, forgive me
> > for asking this question.
> > But I am wondering if or how NiFi can accommodate processing large
> > data sets with possibly compute-intensive operations.
> > For example, if we have say 2 million documents, how does nifi make
> > processing these documents efficient?
> > I understand the visual workflow, and it's nice. How is that
> > parallelized across a data set?
> >
> > Do we submit all the documents to a cluster of flows (how many?) that
> > execute some number of documents simultaneously?
> > Does nifi support batch processing? Is it elastic?
> >
> > Thanks.
>
