Chris,

Not a problem, we're happy to answer questions :)

Re #1: There are two benefits to having multiple content repositories. The 
first, as you mentioned, is parallel reads and writes, which can be a 
tremendous performance improvement. The other benefit is simply that it 
provides you with more storage, in general. By default, NiFi "archives" the 
content when it's done with it instead of immediately deleting it. This allows 
you to go into your Provenance data and actually View/Download the data exactly 
as it was at that point in the flow. So this is extremely powerful because 
Provenance shows you the lineage 
(How did it get to this point?), the attributes (The context used to get to 
this point), and the data itself. Having all 3 of these pieces of information 
dramatically improves your ability to debug and understand what's happening - 
and gives you the ability to replay individual pieces of data from anywhere in 
the flow if it wasn't processed correctly. But, as you can imagine, storing all of this 
information can take a lot of - well, storage. So having multiple disks to 
store that on can be very helpful.
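
For reference, spreading the content repository across disks and tuning the 
archive might look something like the following in nifi.properties. This is 
only a sketch: the directory suffixes (disk1, disk2) and the /mnt paths are 
hypothetical, and the archive values are illustrative, not recommendations.

```properties
# Content repository spread over multiple disks (suffixes and paths are hypothetical)
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.directory.disk1=/mnt/disk1/content_repository
nifi.content.repository.directory.disk2=/mnt/disk2/content_repository

# Archiving of content that the flow has finished with
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
```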

Re #2: I don't know that I've used any SAN to back my repositories other than 
the EBS provided by Amazon EC2. In that environment, I found that having one or 
having multiple repos was essentially equivalent.

Re #3: Whether or not NiFi has disk contention is really dependent on the data 
rate. NiFi is pretty smart about how it handles file I/O so that it is able to 
write multiple FlowFiles to the same underlying file on disk and by default 
FlowFiles are sorted/prioritized in a queue such that they are the most 
efficient to read. That being said, if you're reading/writing hundreds of 
MB/sec then you're probably going to have some disk contention :) The number of 
flows you have running does not really factor in, though - one 
flow processing 100 MB/sec will result in approximately the same contention as 
10 flows each processing 10 MB/sec.
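
To illustrate the idea of writing multiple FlowFiles to the same underlying 
file, here is a rough Python sketch. It is not NiFi's actual code: the function 
names and the claim file name are made up for the example. Each payload is 
appended to one shared file and is later found again by its (offset, length).

```python
import os
import tempfile


def append_to_claim(claim_path, payloads):
    """Append each payload to one shared file; return (offset, length) per payload."""
    locations = []
    with open(claim_path, "ab") as claim:
        for payload in payloads:
            # Seeking to the end returns the absolute offset where this payload starts.
            offset = claim.seek(0, os.SEEK_END)
            claim.write(payload)
            locations.append((offset, len(payload)))
    return locations


def read_from_claim(claim_path, offset, length):
    """Read one payload back from the shared file using its offset and length."""
    with open(claim_path, "rb") as claim:
        claim.seek(offset)
        return claim.read(length)


def demo():
    # Hypothetical example data: three small "FlowFile" contents in one file.
    with tempfile.TemporaryDirectory() as tmp:
        claim_path = os.path.join(tmp, "claim-0")
        payloads = [b"first", b"second flowfile", b"3rd"]
        locations = append_to_claim(claim_path, payloads)
        contents = [read_from_claim(claim_path, off, n) for off, n in locations]
        return locations, contents
```

Writes like this are sequential appends to a single file, which is generally 
the cheapest kind of I/O for a disk to service.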

Also of note, you can assign multiple partitions to the Provenance Repository 
as well. If you are processing tons of very small FlowFiles, you may actually 
be better off using multiple partitions for the Provenance Repository than 
using multiple partitions for the content repository - or if you have the 
partitions free, use multiple for both.
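
A Provenance Repository split across partitions might be configured along these 
lines in nifi.properties. Again, just a sketch: the suffixes (prov1, prov2) and 
the /mnt paths are hypothetical.

```properties
# Provenance repository spread over multiple partitions (suffixes and paths are hypothetical)
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.directory.prov1=/mnt/disk3/provenance_repository
nifi.provenance.repository.directory.prov2=/mnt/disk4/provenance_repository
```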

Does this clear things up? Hopefully it doesn't muddy the water more, at least! 
:)

Thanks
-Mark


> On Dec 3, 2015, at 9:48 PM, Chris Lim <[email protected]> wrote:
> 
> Thanks Joe.
> 
> Following through the inquiries on multiple content repositories. I still 
> have a few more questions. :)
> 
> 1. Is it correct to say that the use case for having multiple content 
> repositories is to take advantage of parallel disk writes assuming that the 
> system has multiple bare metal disk drives mounted? Are there any other use 
> cases for doing multiple content repositories?
> 
> 2. In an enterprise environment wherein NiFi writes to a SAN (Storage Area 
> Network) does it make sense to have logical mounted volumes for the multiple 
> content repositories? Or are we better off having just one content 
> repository? Of course the assumption here is that we are dealing with 
> multiple files of 10 to 50 gigabytes in size.
> 
> 3. Will NiFi have disk contention issues in a scenario wherein we have 5 or 
> more independent flows on a single NiFi instance and all the flows are 
> involved in ETL?
> 
> Regards,
> Chris
> 
> 
> 
> On Fri, Nov 27, 2015 at 3:56 AM, Joe Witt <[email protected]> wrote:
> Chris,
> 
> It is something which occurs automatically and behind the scenes.
> Under normal circumstances there will be many FlowFiles written to the
> same content claim; they'll just each have different offsets.  It is
> more aligned with how disks work in terms of efficiently writing data,
> efficiently reading data, and efficiently deleting the entire claim
> (which is a file on disk).  Rather than a delete per FlowFile we
> delete once there are no more references to the entire claim.  Much
> faster.  And all of that is totally abstracted away from the
> perspective of someone writing extensions.  This bit, combined with
> the copy-on-write and pass-by-reference logic the content repository
> provides, is a key part of what makes NiFi efficient.
> 
> Thanks
> Joe
> 
> On Thu, Nov 26, 2015 at 1:40 AM, Chris Lim <[email protected]> wrote:
> > Thanks Mark.
> >
> > The answer on the content repository round-robin is perfect. :)
> >
> > It got me curious when you mentioned that one or more FlowFiles can be
> > written to the same Resource Claim. Is there a specific scenario wherein
> > this can occur? Under normal circumstances there is only one FlowFile
> > written to a Resource Claim?
> >
> > --
> > Chris
> >
> >
> > On Wed, Nov 25, 2015 at 9:39 PM, Mark Payne <[email protected]> wrote:
> >>
> >> Chris,
> >>
> >> In terms of round robin-ing between the repositories, yes, it follows a
> >> simple round-robin approach.
> >> In terms of sections within those containers, the answer is more of a
> >> "sort-of." Each FlowFile has what
> >> we refer to as a Resource Claim, which points to a location in the content
> >> repository. In the case of the
> >> FileSystemRepository (which is the default and almost all that's ever used
> >> right now), the Resource Claim
> >> maps to a file on disk. In order to be very efficient, we may write many
> >> FlowFiles to the same Resource Claim.
> >>
> >> Once we finish writing to a particular Resource Claim, we close the
> >> resources and create a new one for the next
> >> FlowFile. When we create these Resource Claims, we do so in a round-robin
> >> fashion across the different Sections
> >> of the content repository.
> >>
> >> Sorry, this is a fairly long-winded answer to such a seemingly simple
> >> question :) but I wasn't sure how much detail you were
> >> looking for. If anything is not clear, let us know.
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >> On Nov 25, 2015, at 5:12 AM, Chris Lim <[email protected]> wrote:
> >>
> >> Hi Guys,
> >>
> >> I am configuring our NiFi instance to have multiple content repositories
> >> specifically with the "nifi.content.repository.directory." property setting
> >> as mentioned in the Administrator's guide. Am I correct that flow file
> >> contents are written to the repository using a round-robin algorithm? Also,
> >> do the sections within a specific content repository follow the same
> >> round-robin algorithm?
> >>
> >> Thanks,
> >> Chris
> >>
> >>
> >
> 
