NiFi only writes data to disk when it is actually changing the data. It is very uncommon to have a 10-processor flow where all, or even most, are actually touching the content. You can look at the live status history data to see precisely how much content is being read from and written to disk. This makes it very easy to find the heavy users of the underlying content repository - and disk.
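[Editor's note: on the related disk-layout point discussed below, each NiFi repository location is set independently in nifi.properties, so the repositories can live on separate partitions. A sketch using the standard property names from nifi.properties - the mount paths are hypothetical:]

```properties
# Each repository can point at its own disk/partition (paths are examples).
nifi.flowfile.repository.directory=/disk1/flowfile_repository
nifi.content.repository.directory.default=/disk2/content_repository
nifi.provenance.repository.directory.default=/disk3/provenance_repository
```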
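[Editor's note: the status history behind that view is also exposed over the NiFi REST API (the exact endpoint path varies by version), so you can total a processor's content I/O yourself. A minimal sketch of that totaling step, assuming the response carries `aggregateSnapshots` entries with `bytesRead`/`bytesWritten` metrics - verify the field names against the JSON your NiFi version actually returns:]

```python
# Sketch: summing content-repository I/O from a processor's status-history
# JSON. The field names used here (aggregateSnapshots, statusMetrics,
# bytesRead, bytesWritten) are assumptions about the response shape --
# check them against the status-history output of your NiFi version.

def total_io(status_history):
    """Return (total bytes read, total bytes written) across snapshots."""
    read = written = 0
    for snap in status_history.get("aggregateSnapshots", []):
        metrics = snap.get("statusMetrics", {})
        read += metrics.get("bytesRead", 0)
        written += metrics.get("bytesWritten", 0)
    return read, written

# Hand-built sample (not real NiFi output), just to show the shape:
sample = {"aggregateSnapshots": [
    {"statusMetrics": {"bytesRead": 1024, "bytesWritten": 0}},
    {"statusMetrics": {"bytesRead": 2048, "bytesWritten": 512}},
]}
print(total_io(sample))  # (3072, 512)
```

A processor whose totals stay near zero is only routing references, not touching content, which is exactly the distinction Joe describes.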
Even in the case of reads you should generally benefit from pretty excellent disk/OS caching. Also, even if your flow forks data and sends it down multiple paths, it is not actually creating copies - just new references. NiFi will also automatically combine writes of events to the same file on disk within a short span of time and space. This too helps with the efficiency of disk utilization. The key point is that the efficiency of the content repository is pretty strong at this stage. If you're using a version of NiFi that is years old, then these things may not be true.

Now, the run duration suggestion is about the efficiency of the flow file repository, which is the bookkeeping of the flowfiles (not the content). We want you to be able to reduce how often we commit the session, so run duration lets you choose your tolerance for delay while we automatically batch sessions together.

So, the key is to keep in mind that there are a few repositories and things (depending on your configuration) that will use disk:

1) Content repository (the bytes of the things you're reading/writing)
2) FlowFile repository (information about the flow files and their attributes - no content)
3) Provenance repository
4) Logs

All of these can be on different partitions, and all can be spread across partitions and such.

To really help with this particular case, I think we'll need you to list out the processors involved (generically if necessary) and how much they read/write over a five-minute period in steady state. If there really is a chain of 10 processors and most are actually reading and writing content, we can talk about additional strategies, such as an alternative composition of processors that will be more efficient.

Thanks
Joe

On Wed, Oct 5, 2016 at 11:21 AM, Brett Tiplitz <[email protected]> wrote:
> I was always trying to understand the run duration. I'm good on the
> latency, so if it processes a bunch of events at once and my overall
> throughput is the same, then it's OK. I increased it to 100ms.
> But I looked at the bulk of my flow, and this feature was only on 1 of
> the 10 processors the data goes through.
>
> I realize that slowing the rate of commits seems bad, but even the big
> guys limit commits.
>
> On Wed, Oct 5, 2016 at 12:05 PM, Bryan Bende <[email protected]> wrote:
>>
>> Brett,
>>
>> One thing that could possibly improve the performance here, although
>> hard to say how much, is the concept of "Run Duration" on the processor
>> scheduling tab. This is only available on processors marked with the
>> @SupportsBatching annotation, so it depends what processors you are
>> using.
>>
>> By increasing the run duration, you let the framework batch together
>> all of the framework operations during that time period. The default
>> setting is 0, which means no batching, giving you the lowest latency
>> per flow file, but users can choose to sacrifice some latency for
>> higher throughput.
>>
>> I don't know enough about how provenance events are specifically
>> committed, but I believe they would be tied to the session commits so
>> that if a rollback occurred there wouldn't be unwanted events written.
>>
>> -Bryan
>>
>> On Wed, Oct 5, 2016 at 11:38 AM, Brett Tiplitz <[email protected]> wrote:
>>>
>>> James -
>>>
>>> I believe the complication for me is both the number of objects as
>>> well as the number of processors the data goes through. I talked with
>>> a few people, and it sounds like NiFi writes each event out to disk
>>> and then executes a commit, which really does have a major impact on
>>> the performance. I don't have the liberty of resolving the disk
>>> performance, though I think I will try moving the journals directory
>>> to /dev/shm. I know on reboot I'll lose data, but that is just like
>>> 1-2 times a year, so I think that loss is acceptable. Also, I'm not
>>> specifying anything about what data gets indexed, so it's whatever the
>>> default is.
>>>
>>> If I'm producing about 6000 (just a guess, though I think it's pretty
>>> large) events per second, it would be nice if there were an option not
>>> to perform a commit on every one of the 6000 items. In reality, I
>>> would say a commit should never occur more than once a second, and
>>> that is likely way too often.
>>>
>>> Last, is there a way to measure the actual provenance events going
>>> through, as I'm guessing at what it's actually doing here.
>>>
>>> brett
>>>
>>> On Fri, Sep 30, 2016 at 2:16 PM, James Wing <[email protected]> wrote:
>>>>
>>>> Brett,
>>>>
>>>> The default provenance store, PersistentProvenanceRepository, does
>>>> require I/O in proportion to flowfile events. Flowfiles with many
>>>> attributes, especially large attributes, are a frequent contributor
>>>> to provenance overload because attribute state is tracked in
>>>> provenance events. But this is different from flowfile content reads
>>>> and writes, which use the separate content repository. You might
>>>> consider moving the provenance repository to a separate disk for
>>>> additional I/O capacity.
>>>>
>>>> Does this sound relevant? Can you share some details of your flow
>>>> volumes and attribute sizes?
>>>>
>>>> nifi.provenance.repository.buffer.size is only used by the
>>>> VolatileProvenanceRepository implementation, an in-memory provenance
>>>> store. The property defines the size of the in-memory store. The
>>>> volatile store can avoid disk I/O issues, but at the expense of
>>>> reduced provenance functionality.
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>> On Thu, Sep 29, 2016 at 1:37 PM, Brett Tiplitz <[email protected]> wrote:
>>>>>
>>>>> I'm having a throughput problem when processing data with provenance
>>>>> recording enabled. I've pretty much disabled it, so I believe that
>>>>> is the source of my issue. On occasion, I get a message saying the
>>>>> flow is slowing due to provenance recording.
>>>>> I was running the out-of-the-box configuration for provenance.
>>>>>
>>>>> I believe the issue might be related to commit writes, though it's
>>>>> just a theory. There is a variable
>>>>> nifi.provenance.repository.buffer.size, though I don't see anything
>>>>> about what that does.
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> thanks,
>>>>>
>>>>> brett
>>>>>
>>>>> --
>>>>> Brett Tiplitz
>>>>> Systolic, Inc
>>>>
>>>
>>> --
>>> Brett Tiplitz
>>> Systolic, Inc
>>
>
> --
> Brett Tiplitz
> Systolic, Inc
