Thanks, Joe. - Dmitry

On Sun, Mar 20, 2016 at 1:24 PM, Joe Witt <[email protected]> wrote:

Dmitry,

Good questions and things to think about. You are getting right into the heart of the framework. There are three repositories to consider, which I'll talk about here, though they are probably pretty well covered in the existing guides. I know Joe Percivall has a 'day in the life of a flow file' document that he is nearly ready to send out - that should help a lot!

Key terminology item: a "flow file" is the metadata about a 'thing' in the flow plus the content of that 'thing'. A flow file can represent an 'event', a 'record', a 'tuple', or 'a bag of bits'. The bottom line is that a flow file holds context about that thing and the data of that thing. People coming from different frames of reference call these things by various names - 'data', 'event', 'record', 'file', etc. - so keep that in mind.

So, three repositories to understand. I'll be intermingling the concept and the default implementation here a bit, so just keep that in mind.

1) Flow File Repository
- This is where the 'fact-of' a flow file lives. It holds things like the identifier of a flow file, its name, its entry date, and its map of attributes. The typical/default implementation is a write-ahead log which keeps track of the persistent state of these flow file objects. The key thing to realize is that this does *not* include the content of a flow file.

2) Content Repository
- This is where the 'bytes' of the thing live. Say you have a flow file that is a JSON document. The actual JSON bytes live here; the 'name' of that thing, and the things we know or have learned about it, live in the flow file repo. In the content repository, the default implementation is to persist the content to disk. No, we do not persist it every time a flow file moves from processor to processor - more on that in a minute. Also, the content never needs to be fully in memory - more on that in a minute too. Disks these days, and things like caching in Linux, are awesome, and the content repository is designed to take great advantage of that. The repository design also lets us take great advantage of things like copy on write (only ever make a new version of something once it is manipulated) and pass by reference (never clone the bytes; make and pass pointers instead). In this sense you can think of the content repository (at least the default implementation) as an immutable versioned content store. Very powerful and very fast.

3) Provenance Repository
- This is where the events describing what happened to the data live as it comes into, goes through, and leaves the flow. These events form the truth of what happened, and you can think of this repository as an index of events about what happened in the flow. It holds information which looks a lot like what you'd see in the Flow File Repository (no content), plus some nice relationship data, so you don't have to trudge through log files anymore to figure out what happened. What is cool too is that it has pointers to content. This is how we can let you click on content at any point of its life in the flow. Doing some complex transformation? Use provenance to click on the content before and after some important transformation event to prove it works. Flow not quite right yet? Use provenance to hit replay after you tweak the settings, and keep watching it evolve until you are sure it works. You can even do all this live in the flow on a dev copy (thanks again to copy on write/pass by reference), then, when you are ready, merge it in to be the production feed. In this way it is a lot like the mentality a developer has when using Git, just now with a fun UI.

Ok, so we've talked through a lot and I probably skipped major details. Feel free to ask as much as you want.

The last thing I'll mention for now is that all the interactions a developer has with these interfaces occur through the ProcessSession abstraction. This is how you can build processors that interact with these data objects as streams, and thus never need to load the whole content into memory. We can do things like compress or encrypt massive multi-GB objects and it never looks any different on the heap than a 1 KB message. That is because of the design of the ProcessSession and these repositories. Only when you manipulate content is the content repository engaged to make new versions, and it doesn't have to get rid of the previous version until nothing else references it - similar in concept to the design of the heap and garbage collection in Java. It just becomes a really nice model, and since we control when we age things off, we can let you do click-to-content in a natural/efficient way.
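
To make that concrete, here is a minimal sketch of such a processor, written against the org.apache.nifi.processor API; the class name, the relationship, and the choice of GZIP are illustrative only, not from this thread. The point is that the StreamCallback sees only streams, so the heap footprint is the same for a 1 KB message and a multi-GB object:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;
import java.util.zip.GZIPOutputStream;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class GzipContentSketch extends AbstractProcessor {

    // Illustrative relationship name, not from the thread.
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Compressed flow files")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // session.write() hands us the existing content as an InputStream
        // and a sink for the new content; the framework streams both, so
        // the whole payload is never loaded onto the heap at once.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                    final byte[] buffer = new byte[8192];
                    int len;
                    while ((len = in.read(buffer)) != -1) {
                        gzip.write(buffer, 0, len); // fixed-size buffer, constant memory
                    }
                }
            }
        });

        // The content repository now holds a new immutable version of the
        // bytes (copy on write); record the change as a provenance event.
        session.getProvenanceReporter().modifyContent(flowFile);
        session.transfer(flowFile, REL_SUCCESS);
    }
}

When session.write() completes, the flow file's metadata points at the new version of the content, and the old version sticks around until nothing (provenance replay included) references it.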
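A second sketch, with the same caveats, showing why attribute updates and clones are cheap: they only touch flow file metadata, never the bytes. This fragment assumes the same onTrigger setup as above, and REL_ORIGINAL and REL_COPY are hypothetical relationships:

FlowFile flowFile = session.get();
if (flowFile == null) {
    return;
}

// Touches only the flow file repository: a new metadata revision goes to
// the write-ahead log; the bytes in the content repository are untouched.
flowFile = session.putAttribute(flowFile, "mime.type", "application/json");

// Pass by reference: the clone is a new flow file whose metadata points at
// the *same* bytes in the content repository; nothing is copied.
FlowFile duplicate = session.clone(flowFile);

session.transfer(flowFile, REL_ORIGINAL);  // hypothetical relationship
session.transfer(duplicate, REL_COPY);     // hypothetical relationship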
Hopefully this helps a bit.

Thanks
Joe

On Sun, Mar 20, 2016 at 11:48 AM, Dmitry Goldenberg <[email protected]> wrote:

> I apologize if this is spelled out somewhere in the documentation.
>
> There is a certain amount of fuzziness around the notion of a FlowFile. Is this really always a file? Or is it a "document" or an "item" which may have a link to actual file/byte content, whether on disk or elsewhere? My noob-level understanding is that it's the latter - could someone confirm?
>
> Furthermore, when data is moving between Processors in a Dataflow, how is that done? Is the data streamed in memory? Is there a spill-to-disk option to configure how disk spillage would be done? Or do FlowFiles always get written to disk prior to being sent to the next destination?
>
> I would think that persisting to disk after every step would be quite expensive. Is that simply not what NiFi does?
>
> Thanks.
