Thanks, Joe. - Dmitry

On Sun, Mar 20, 2016 at 1:24 PM, Joe Witt <[email protected]> wrote:

Dmitry,

Good questions and things to think about. You are getting right into the heart of the framework. There are three repositories to consider, which I'll talk about here, though they are probably pretty well covered in the existing guides. I know Joe Percivall has a 'day in the life of a flow file' document that he is nearly ready to send out - that should help a lot!

Key terminology item: a "flow file" is the metadata about a 'thing' in the flow plus the content of that 'thing'. A flow file can represent an 'event', a 'record', a 'tuple', or 'a bag of bits'. The bottom line is that a flow file holds context about that thing and the data of that thing. People coming from different frames of reference call these things by various names - 'data', 'event', 'record', 'file', etc. - so keep that in mind.

So, three repositories to understand. I'll be intermingling the concept and the default implementation here a bit, so just keep that in mind.

1) Flow File Repository
- This is where the 'fact-of' a flow file lives. It holds things like the identifier of a flow file, its name, its entry date, and its map of attributes. The typical/default implementation is a write-ahead log which keeps track of the persistent state of these flow file objects. The key thing to realize is that this does *not* include the content of a flow file.

2) Content Repository
- This is where the 'bytes' of the thing live. Say you have a flow file that is a JSON document. The actual JSON bytes live here; the 'name' of that thing, and the things we know or have learned about it, live in the flow file repo. In the content repository, the default implementation is to persist the content to disk. No, we do not persist it every time a flow file moves from processor to processor - more on that in a minute. Also, the content never needs to be fully in memory - more on that in a minute too. Disks these days, and things like caching in Linux, are awesome, and the content repository is designed to take great advantage of that. The repository design also lets us take great advantage of things like copy on write (only ever make a new version of something once it is manipulated) and pass by reference (never clone the bytes; make and pass pointers instead). In this sense you can think of the content repository (at least the default implementation) as an immutable versioned content store. Very powerful and very fast.

3) Provenance Repository
- This is where the events describing what happened to the data live as it comes into, goes through, and leaves the flow. These events form the truth of what happened, and you can think of this repository as an index of events about what happened in the flow. It holds information which looks a lot like what you'd see in the Flow File Repository (no content), plus some nice relationship data, so you don't have to trudge through log files anymore to figure out what happened. What is cool too is that it has pointers to content. This is how we can let you click on content at any point of its life in the flow. Doing some complex transformation? Use provenance to click on the content before and after some important transformation event to prove it works. Flow not quite right yet? Use provenance to hit replay after you tweak the settings, and keep watching it evolve until you are sure it works. You can even do all this live in the flow on a dev copy (thanks again to copy on write/pass by reference), then, when you are ready, merge it in to be the production feed. In this way it is a lot like the mentality a developer has when using Git, just now with a fun UI.

Ok, so we've talked through a lot and I probably skipped major details. Feel free to ask as much as you want.

The last thing I'll mention for now is that all the interactions a developer has with these interfaces occur through the ProcessSession abstraction. This is how you can build processors that interact with these data objects as streams, and thus never need to load the whole content into memory. We can do things like compress or encrypt massive multi-GB objects and it never looks any different on the heap than a 1 KB message. That is because of the design of the ProcessSession and these repositories. Only when you manipulate content is the content repository engaged to make new versions, and it doesn't have to get rid of the previous version until nothing else references it - similar in concept to the design of the heap and garbage collection in Java. It just becomes a really nice model, and since we control when we age things off, we can let you do click-to-content in a natural/efficient way.
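
To make that concrete, here is a minimal sketch of such a processor, written against the org.apache.nifi.processor API; the class name, the relationship, and the choice of GZIP are illustrative only, not from this thread. The point is that the StreamCallback sees only streams, so the heap footprint is the same for a 1 KB message and a multi-GB object:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;
import java.util.zip.GZIPOutputStream;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class GzipContentSketch extends AbstractProcessor {

    // Illustrative relationship name, not from the thread.
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Compressed flow files")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // session.write() hands us the existing content as an InputStream
        // and a sink for the new content; the framework streams both, so
        // the whole payload is never loaded onto the heap at once.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                    final byte[] buffer = new byte[8192];
                    int len;
                    while ((len = in.read(buffer)) != -1) {
                        gzip.write(buffer, 0, len); // fixed-size buffer, constant memory
                    }
                }
            }
        });

        // The content repository now holds a new immutable version of the
        // bytes (copy on write); record the change as a provenance event.
        session.getProvenanceReporter().modifyContent(flowFile);
        session.transfer(flowFile, REL_SUCCESS);
    }
}

When session.write() completes, the flow file's metadata points at the new version of the content, and the old version sticks around until nothing (provenance replay included) references it.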
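A second sketch, with the same caveats, showing why attribute updates and clones are cheap: they only touch flow file metadata, never the bytes. This fragment assumes the same onTrigger setup as above, and REL_ORIGINAL and REL_COPY are hypothetical relationships:

FlowFile flowFile = session.get();
if (flowFile == null) {
    return;
}

// Touches only the flow file repository: a new metadata revision goes to
// the write-ahead log; the bytes in the content repository are untouched.
flowFile = session.putAttribute(flowFile, "mime.type", "application/json");

// Pass by reference: the clone is a new flow file whose metadata points at
// the *same* bytes in the content repository; nothing is copied.
FlowFile duplicate = session.clone(flowFile);

session.transfer(flowFile, REL_ORIGINAL);  // hypothetical relationship
session.transfer(duplicate, REL_COPY);     // hypothetical relationship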
Hopefully this helps a bit.

Thanks
Joe

On Sun, Mar 20, 2016 at 11:48 AM, Dmitry Goldenberg <[email protected]> wrote:

> I apologize if this is spelled out somewhere in the documentation.
>
> There is a certain amount of fuzziness around the notion of a FlowFile. Is this really always a file? Or is it a "document" or an "item" which may have a link to actual file/byte content, whether on disk or elsewhere? My noob-level understanding is that it's the latter - could someone confirm?
>
> Furthermore, when data is moving between Processors in a Dataflow, how is that done? Is the data streamed in memory? Is there a spill-to-disk option to configure how disk spillage would be done? Or do FlowFiles always get written to disk prior to being sent to the next destination?
>
> I would think that persisting to disk after every step would be quite expensive. Is that simply not what NiFi does?
>
> Thanks.
