NiFi only writes data to disk when it is actually changing the data. It is very uncommon to have a 10-processor flow where all, or even most, are actually touching the content. You can look at the live status history data to see precisely how much content is being read from and written to disk. This makes it very easy to find the heavy users of the underlying content repository - and disk.
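[Editor's note: on the related disk-layout point discussed below, each NiFi repository location is set independently in nifi.properties, so the repositories can live on separate partitions. A sketch using the standard property names from nifi.properties - the mount paths are hypothetical:]

```properties
# Each repository can point at its own disk/partition (paths are examples).
nifi.flowfile.repository.directory=/disk1/flowfile_repository
nifi.content.repository.directory.default=/disk2/content_repository
nifi.provenance.repository.directory.default=/disk3/provenance_repository
```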
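[Editor's note: the status history behind that view is also exposed over the NiFi REST API (the exact endpoint path varies by version), so you can total a processor's content I/O yourself. A minimal sketch of that totaling step, assuming the response carries `aggregateSnapshots` entries with `bytesRead`/`bytesWritten` metrics - verify the field names against the JSON your NiFi version actually returns:]

```python
# Sketch: summing content-repository I/O from a processor's status-history
# JSON. The field names used here (aggregateSnapshots, statusMetrics,
# bytesRead, bytesWritten) are assumptions about the response shape --
# check them against the status-history output of your NiFi version.

def total_io(status_history):
    """Return (total bytes read, total bytes written) across snapshots."""
    read = written = 0
    for snap in status_history.get("aggregateSnapshots", []):
        metrics = snap.get("statusMetrics", {})
        read += metrics.get("bytesRead", 0)
        written += metrics.get("bytesWritten", 0)
    return read, written

# Hand-built sample (not real NiFi output), just to show the shape:
sample = {"aggregateSnapshots": [
    {"statusMetrics": {"bytesRead": 1024, "bytesWritten": 0}},
    {"statusMetrics": {"bytesRead": 2048, "bytesWritten": 512}},
]}
print(total_io(sample))  # (3072, 512)
```

A processor whose totals stay near zero is only routing references, not touching content, which is exactly the distinction Joe describes.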
Even in the case of reads you should generally benefit from pretty excellent disk/OS caching. Also, even if your flow forks data and sends it down multiple paths, it is not actually creating copies - just new references. NiFi will also automatically combine writes of events to the same file on disk within a short span of time and space. This too helps with the efficiency of disk utilization. The key point is that the efficiency of the content repository is pretty strong at this stage. If you're using a version of NiFi that is years old, then these things may not be true.

Now, the run duration suggestion is about the efficiency of the flow file repository, which is the bookkeeping of the flowfiles (not the content). We want you to be able to reduce how often we commit the session, so run duration lets you choose your tolerance for delay while we automatically batch sessions together.

So, the key is to keep in mind that there are a few repositories and things (depending on your configuration) that will use disk:

1) Content repository (the bytes of the things you're reading/writing)
2) FlowFile repository (information about the flow files and their attributes - no content)
3) Provenance repository
4) Logs

All of these can be on different partitions, and all can be spread across partitions and such.

To really help with this particular case, I think we'll need you to list out the processors involved (generically if necessary) and how much they read/write over a five-minute period in steady state. If there really is a chain of 10 processors and most are actually reading and writing content, we can talk about additional strategies, such as an alternative composition of processors that will be more efficient.

Thanks
Joe

On Wed, Oct 5, 2016 at 11:21 AM, Brett Tiplitz <[email protected]> wrote:
> I was always trying to understand the run duration. I'm good on the
> latency, so if it processes a bunch of events at once and my overall
> throughput is the same, then it's OK. I increased it to 100ms.
> But I looked at the bulk of my flow, and this feature was only on 1 of
> the 10 processors the data goes through.
>
> I realize that slowing the rate of commits seems bad, but even the big
> guys limit commits.
>
> On Wed, Oct 5, 2016 at 12:05 PM, Bryan Bende <[email protected]> wrote:
>>
>> Brett,
>>
>> One thing that could possibly improve the performance here, although
>> hard to say how much, is the concept of "Run Duration" on the processor
>> scheduling tab. This is only available on processors marked with the
>> @SupportsBatching annotation, so it depends what processors you are
>> using.
>>
>> By increasing the run duration, you let the framework batch together
>> all of the framework operations during that time period. The default
>> setting is 0, which means no batching, giving you the lowest latency
>> per flow file, but users can choose to sacrifice some latency for
>> higher throughput.
>>
>> I don't know enough about how provenance events are specifically
>> committed, but I believe they would be tied to the session commits so
>> that if a rollback occurred there wouldn't be unwanted events written.
>>
>> -Bryan
>>
>> On Wed, Oct 5, 2016 at 11:38 AM, Brett Tiplitz <[email protected]> wrote:
>>>
>>> James -
>>>
>>> I believe the complication for me is both the number of objects as
>>> well as the number of processors the data goes through. I talked with
>>> a few people, and it sounds like NiFi writes each event out to disk
>>> and then executes a commit, which really does have a major impact on
>>> the performance. I don't have the liberty of resolving the disk
>>> performance, though I think I will try moving the journals directory
>>> to /dev/shm. I know on reboot I'll lose data, but that is just like
>>> 1-2 times a year, so I think that loss is acceptable. Also, I'm not
>>> specifying anything about what data gets indexed, so it's whatever the
>>> default is.
>>>
>>> If I'm producing about 6000 (just a guess, though I think it's pretty
>>> large) events per second, it would be nice if there were an option not
>>> to perform a commit on every one of the 6000 items. In reality, I
>>> would say a commit should never occur more than once a second, and
>>> that is likely way too often.
>>>
>>> Last, is there a way to measure the actual provenance events going
>>> through, as I'm guessing at what it's actually doing here.
>>>
>>> brett
>>>
>>> On Fri, Sep 30, 2016 at 2:16 PM, James Wing <[email protected]> wrote:
>>>>
>>>> Brett,
>>>>
>>>> The default provenance store, PersistentProvenanceRepository, does
>>>> require I/O in proportion to flowfile events. Flowfiles with many
>>>> attributes, especially large attributes, are a frequent contributor
>>>> to provenance overload because attribute state is tracked in
>>>> provenance events. But this is different from flowfile content reads
>>>> and writes, which use the separate content repository. You might
>>>> consider moving the provenance repository to a separate disk for
>>>> additional I/O capacity.
>>>>
>>>> Does this sound relevant? Can you share some details of your flow
>>>> volumes and attribute sizes?
>>>>
>>>> nifi.provenance.repository.buffer.size is only used by the
>>>> VolatileProvenanceRepository implementation, an in-memory provenance
>>>> store. The property defines the size of the in-memory store. The
>>>> volatile store can avoid disk I/O issues, but at the expense of
>>>> reduced provenance functionality.
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>> On Thu, Sep 29, 2016 at 1:37 PM, Brett Tiplitz <[email protected]> wrote:
>>>>>
>>>>> I'm having a throughput problem when processing data with provenance
>>>>> recording enabled. I've pretty much disabled it, so I believe that
>>>>> is the source of my issue. On occasion, I get a message saying the
>>>>> flow is slowing due to provenance recording.
>>>>> I was running the out-of-the-box configuration for provenance.
>>>>>
>>>>> I believe the issue might be related to commit writes, though it's
>>>>> just a theory. There is a variable
>>>>> nifi.provenance.repository.buffer.size, though I don't see anything
>>>>> about what that does.
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> thanks,
>>>>>
>>>>> brett
>>>>>
>>>>> --
>>>>> Brett Tiplitz
>>>>> Systolic, Inc
>>>>
>>>
>>> --
>>> Brett Tiplitz
>>> Systolic, Inc
>>
>
> --
> Brett Tiplitz
> Systolic, Inc
