You're right that the generation and indexing of provenance data
creates overhead.  We've put considerable effort into minimizing that
overhead, to the point where you should not have to think about it and
still get all the user experience and auditing gains it provides.
However, when you're talking about hundreds of thousands of events per
second, it can simply be more overhead than you can afford.  I don't
know if we have a JIRA for it yet, but it makes a lot of sense to allow
properly authorized folks to shut off generation of provenance events
at certain points of a flow.
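
In the meantime, the tuning Juan mentions below happens in
nifi.properties.  A rough sketch of the provenance-related entries
follows.  The property names are the ones listed in the administration
guide, but the values and the /mnt/provenance path are illustrative
assumptions only, not recommendations, so check them against the guide
for your NiFi version:

```properties
# Keep the provenance repository on its own disk
# (the /mnt/provenance mount point is a hypothetical example).
nifi.provenance.repository.directory.default=/mnt/provenance/provenance_repository

# Bound how much provenance data is retained (example values).
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB

# Control how often event files roll over and get indexed (example values).
nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.rollover.size=100 MB

# Threads used for indexing and querying; more index threads can help
# at high event rates, at the cost of CPU (example values).
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.query.threads=2

# Indexing fewer fields reduces indexing work (example field list).
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID
```

None of these turn provenance generation off; they only limit what it
costs to store and index the events.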

On Wed, Apr 19, 2017 at 5:34 PM, Juan Sequeiros <[email protected]> wrote:
> Simon,
>
> I feel that "a provenance event is emitted for each flowfile for each
> processor" is accurate, with the understanding that "each processor"
> means the unique processors the flowfile goes through.
>
> The provenance database is a Lucene database, and 1 million provenance
> events is not unreasonable.
> How well it copes depends on how you configure your NiFi, and a best
> practice is to store your provenance on its own disk.
>
> Many tweakable settings for provenance are in nifi.properties [1]
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
>
>
> On Wed, Apr 19, 2017 at 6:50 AM <[email protected]> wrote:
>>
>> Hi All,
>>
>> In some parts of the NiFi documentation, it is stated that a provenance
>> event is emitted for each flowfile for each processor. However elsewhere
>> it is stated that no provenance-event is generated for a flowfile sent
>> to the “success” output of a processor - which is true?
>>
>> And are there mechanisms for reducing the number of provenance events
>> generated by a NiFi flow? When a dataflow is processing large numbers of
>> events, it would seem to me that the generation of provenance events
>> will be the limiting factor for performance. When processing 1 million
>> records per day, generating 1 million provenance events (or worse) is
>> not helpful.
>>
>> Thanks in advance,
>>
>> Simon
