Re: Provenance Event performance

simon Thu, 20 Apr 2017 01:08:26 -0700

Thanks Joe, Juan,

Perhaps it would be useful to be able to generate provenance events for a _sample_ of flowfiles? E.g. every Nth flowfile created by a "data ingress" (GET* or LISTEN*) processor gets tracked? Or maybe better: every flowfile gets tracked with a probability of 1/N, to ensure that specific input patterns (e.g. every (N+1)th message is unusual) don't go unaudited.
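To make the difference between the two ideas concrete, here is a minimal sketch in Python. It is purely illustrative: `every_nth` and `with_probability` are hypothetical helper names, not anything that exists in NiFi, and the flowfile ids are just integers standing in for real flowfiles.

```python
import random


def every_nth(n):
    """Hypothetical sampler: track every n-th flowfile seen (deterministic)."""
    count = 0

    def should_track(_flowfile_id):
        nonlocal count
        count += 1
        return count % n == 0

    return should_track


def with_probability(p, rng=None):
    """Hypothetical sampler: track each flowfile independently with probability p."""
    rng = rng or random.Random()

    def should_track(_flowfile_id):
        return rng.random() < p

    return should_track


if __name__ == "__main__":
    # Assumed input pattern: ids with id % 10 == 5 are the "unusual" messages.
    ids = range(100_000)

    nth = every_nth(10)
    tracked_nth = [i for i in ids if nth(i)]

    prob = with_probability(0.1, random.Random(42))
    tracked_prob = [i for i in ids if prob(i)]

    # The deterministic sampler only ever sees ids ending in 9, so a periodic
    # pattern with the same period is never audited.
    print(sum(1 for i in tracked_nth if i % 10 == 5))
    # The probabilistic sampler picks up a share of the unusual ids.
    print(sum(1 for i in tracked_prob if i % 10 == 5))
```

Both samplers track roughly one flowfile in ten here, but only the probabilistic one is insensitive to periodic structure in the input, which is the reason for preferring it.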

I have seen users reporting problems on this email list where the provenance repository becomes full and everything stops. That is clearly not desirable, but neither is simply discarding the oldest provenance records in the repository; some flows are presumably more important than others. In particular, a single large-volume flow should presumably not cause provenance for other flows to be flushed. The admin-guide page you referenced below apparently does not allow provenance storage to be configured per-flow. Maybe the ability to configure "sampling" might help?

I'm developing a data import process right now for a customer; some datasources will be reasonably low-volume while others will be very high-volume. Sampling for high-volume flows might be useful, but tracking every flowfile is simply not practical. In addition, some datasources hold very confidential data; it doesn't seem desirable to record this at all, although AFAIK retaining this data in the NiFi content repository for unknown periods of time is unavoidable.


Thanks once again for your feedback!

Regards,
Simon

On 2017-04-19 23:36, Joe Witt wrote:

You're right that the generation and indexing of provenance data
creates overhead.  We've put considerable effort into minimizing that
overhead to a point where you should not have to think about it and
still get all the powerful user experience/auditing gains it provides.
However, when you're talking about 100s of thousands of events per
second it can simply be too much overhead to give up.  I don't know if
we have a JIRA for it yet but it makes a lot of sense to allow
properly authorized folks to shut off generation of provenance events
at certain points of a flow.
On Wed, Apr 19, 2017 at 5:34 PM, Juan Sequeiros <[email protected]> wrote:
Simon,

I feel that "provenance event is emitted for each flowfile for each
processor." is accurate, understanding that "each processor" means the
unique processors the flowfile goes through.
The provenance database is a Lucene database, and 1 million provenance
events is not unreasonable.
It would have to do with how you configure your NiFi, and a best
practice is to store your provenance on its own disk.

Many tweakable settings for provenance are in nifi.properties [1]
[1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
On Wed, Apr 19, 2017 at 6:50 AM <[email protected]> wrote:
Hi All,
In some parts of the NiFi documentation, it is stated that a provenance
event is emitted for each flowfile for each processor. However,
elsewhere it is stated that no provenance event is generated for a
flowfile sent to the “success” output of a processor - which is true?

And are there mechanisms for reducing the number of provenance events
generated by a NiFi flow? When a dataflow is processing large numbers of
events, it would seem to me that the generation of provenance events
will be the limiting factor for performance. When processing 1 million
records per day, generating 1 million provenance events (or worse) is
not helpful.

Thanks in advance,

Simon
