Brett,

One thing that could possibly improve performance here, although it's hard
to say by how much, is the "Run Duration" setting on the processor's
Scheduling tab. This is only available on processors marked with the
@SupportsBatching annotation, so it depends on which processors you are using.

Increasing the run duration lets the framework batch together all of the
framework operations during that time period. The default setting is 0,
which means no batching, giving you the lowest latency per flow file; users
can choose to sacrifice some latency for higher throughput.
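As a rough illustration of the trade-off (this is not NiFi's internal code,
just a stdlib-only sketch, and the names are made up), batching work over a
window before committing cuts the number of commit operations dramatically:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a "run duration" of 0 commits per item, while a
// larger window lets one commit cover many items.
public class BatchCommitSketch {
    int commits = 0;                       // number of commit() calls made
    final List<String> buffer = new ArrayList<>();
    final int batchSize;                   // stand-in for the run-duration window

    BatchCommitSketch(int batchSize) { this.batchSize = batchSize; }

    void process(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            commit();
        }
    }

    void commit() {                        // one commit covers everything buffered
        buffer.clear();
        commits++;
    }

    public static void main(String[] args) {
        BatchCommitSketch perItem = new BatchCommitSketch(1);     // like run duration 0
        BatchCommitSketch batched = new BatchCommitSketch(1000);  // like a larger window
        for (int i = 0; i < 6000; i++) {
            perItem.process("event-" + i);
            batched.process("event-" + i);
        }
        System.out.println(perItem.commits); // 6000 commits, lowest per-item latency
        System.out.println(batched.commits); // 6 commits, higher throughput
    }
}
```

The latency cost is that an item may sit in the buffer until the window
closes, which is exactly the latency-for-throughput trade described above.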

I don't know enough about how provenance events are specifically committed,
but I believe they would be tied to the session commits, so that if a
rollback occurred, no unwanted events would be written.

-Bryan


On Wed, Oct 5, 2016 at 11:38 AM, Brett Tiplitz <
[email protected]> wrote:

> James -
>
> I believe the complication for me is both the number of objects as well as
> the number of processors the data goes through.  I talked with a few people
> and it sounds like NiFi writes each event out to disk and then executes a
> commit, which really does have a major impact on performance.  I don't
> have the liberty of resolving the disk performance, though I think I will
> try moving the journals directory to /dev/shm.  I know on reboot I'll lose
> data, but that happens maybe 1-2 times a year, so I think that loss is
> acceptable.  Also, I'm not specifying anything about what data gets
> indexed, so it's whatever the default is.
>
> If I'm producing about 6000 (just a guess, though I think it's pretty
> large) events per second, it would be nice if there were an option not to
> perform a commit on every one of the 6000 items.  In reality, I would say a
> commit should never occur more than once a second, and even that is likely
> way too often.
>
> Lastly, is there a way to measure the actual provenance events going
> through?  I'm guessing at what it's actually doing here.
>
> brett
>
> On Fri, Sep 30, 2016 at 2:16 PM, James Wing <[email protected]> wrote:
>
>> Brett,
>>
>> The default provenance store, PersistentProvenanceRepository, does
>> require I/O in proportion to flowfile events.  Flowfiles with many
>> attributes, especially large attributes, are a frequent contributor to
>> provenance overload because attribute state is tracked in provenance
>> events.  But this is different from flowfile content reads and writes,
>> which use the separate content repository.  You might consider moving the
>> provenance repository to a separate disk for additional I/O capacity.
>>
>> Does this sound relevant?  Can you share some details of your flow
>> volumes and attribute sizes?
>>
>> nifi.provenance.repository.buffer.size is only used by the
>> VolatileProvenanceRepository implementation, an in-memory provenance
>> store.  The property defines the size of the in-memory store.  The volatile
>> store can avoid disk I/O issues, but at the expense of reduced provenance
>> functionality.
>>
>> Thanks,
>>
>> James
>>
>> On Thu, Sep 29, 2016 at 1:37 PM, Brett Tiplitz <
>> [email protected]> wrote:
>>
>>> I'm having a throughput problem when processing data with Provenance
>>> recording enabled.  I've pretty much disabled it, so I believe that is the
>>> source of my issue.  On occasion, I get a message saying the flow is
>>> slowing due to provenance recording.  I was running the out-of-the-box
>>> configuration for provenance.
>>>
>>> I believe the issue might be related to commit writes, though that's just
>>> a theory.  There is a property, nifi.provenance.repository.buffer.size,
>>> though I don't see any documentation about what it does.
>>>
>>> Any suggestions?
>>>
>>> thanks,
>>>
>>> brett
>>>
>>> --
>>> Brett Tiplitz
>>> Systolic, Inc
>>>
>>
>>
>
>
> --
> Brett Tiplitz
> Systolic, Inc
>
