Perhaps it would be useful to be able to generate provenance events for a _sample_ of flowfiles? E.g. every Nth flowfile created by a "data ingress" (GET* or LISTEN*) processor gets tracked? Or maybe better: every flowfile gets tracked with a probability of 1/N, to ensure that specific input patterns (e.g. every (N+1)th message is unusual) don't go unaudited.
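
To make the difference between the two sampling ideas concrete, here is a minimal sketch in plain Python. It is not NiFi code and does not use any NiFi API; the function names, the list of strings standing in for flowfiles, and the input pattern (an unusual message at a fixed offset with the same period as the sampler) are all assumptions chosen purely for illustration.

```python
import random


def sample_every_nth(flowfiles, n):
    """Deterministic sampling: track items at positions 0, n, 2n, ..."""
    return [ff for i, ff in enumerate(flowfiles) if i % n == 0]


def sample_with_probability(flowfiles, p, rng=None):
    """Probabilistic sampling: track each item independently with probability p."""
    if rng is None:
        rng = random.Random()
    return [ff for ff in flowfiles if rng.random() < p]


# Hypothetical input: the messages at positions 1, 11, 21, ... are unusual.
N = 10
flowfiles = ["unusual" if i % N == 1 else "normal" for i in range(1000)]

nth = sample_every_nth(flowfiles, N)
prob = sample_with_probability(flowfiles, 1 / N, random.Random(42))

# The every-Nth sampler is locked to positions 0, 10, 20, ... and so never
# sees an unusual message; the probabilistic sampler has no such fixed blind spot.
print("every-Nth tracked unusual:", nth.count("unusual"))
print("probabilistic tracked unusual:", prob.count("unusual"))
```

With this input the every-Nth sampler tracks 100 flowfiles and none of the unusual ones, because its positions never line up with theirs. The probabilistic sampler gives each flowfile the same independent chance of being tracked, at the cost of a tracked count that varies from run to run.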

I have seen users reporting problems on this email list where the provenance repository becomes full and everything stops. That is clearly not desirable, but neither is simply discarding the oldest provenance records in the repository; some flows are presumably more important than others. In particular, a single large-volume flow should presumably not cause provenance for other flows to be flushed. The admin-guide page you referenced below apparently does not allow provenance storage to be configured per-flow. Maybe the ability to configure "sampling" might help?

I'm developing a data import process right now for a customer; some datasources will be reasonably low-volume while others will be very high-volume. Sampling for high-volume flows might be useful, but tracking each flowfile is simply not practical. In addition, some datasources hold very confidential data; it doesn't seem desirable to record this at all, although AFAIK retaining this data in the NiFi content repository for unknown periods of time is unavoidable.

You're right that the generation and indexing of provenance data
creates overhead.  We've put considerable effort into minimizing that
overhead to a point where you should not have to think about it and
still get all the powerful user experience/auditing gains it provides.
However, when you're talking about hundreds of thousands of events per
second it can simply be too much overhead to accept.  I don't know if
we have a JIRA for it yet but it makes a lot of sense to allow
properly authorized folks to shut off generation of provenance events
at certain points of a flow.

I feel that "a provenance event is emitted for each flowfile for each
processor" is accurate, with the understanding that "each processor" means
the unique processors the flowfile goes through.

The provenance database is a Lucene database and 1 million provenance events
is not unreasonable.
How well it copes depends on how you configure your NiFi, and a best practice is
to store your provenance on its own disk.
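
For a sense of scale, here is a back-of-the-envelope calculation in plain Python. It assumes events are spread evenly over the day and that each processor emits one event per flowfile; the five-processor flow is a hypothetical example, not a figure from this thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(records_per_day, events_per_record=1):
    """Average provenance event rate, assuming an even spread over the day."""
    return records_per_day * events_per_record / SECONDS_PER_DAY


# 1 million records per day, one provenance event per record.
print(round(events_per_second(1_000_000), 2))

# Hypothetical flow of 5 processors, each emitting one event per flowfile.
print(round(events_per_second(1_000_000, events_per_record=5), 2))
```

Under these assumptions, 1 million records per day averages out to roughly 12 events per second, or about 58 per second for the five-processor flow. Real traffic is usually bursty, so peak rates can be much higher than these averages.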

Many tweakable settings for provenance are listed at [1]


Hi All,

In some parts of the NiFi documentation, it is stated that a provenance event is emitted for each flowfile for each processor. However, elsewhere it is stated that no provenance event is generated for a flowfile sent
to the “success” output of a processor. Which is true?

And are there mechanisms for reducing the number of provenance events
generated by a NiFi flow? When a dataflow is processing large numbers of
events, it would seem to me that the generation of provenance events
will be the limiting factor for performance. When processing 1 million
records per day, generating 1 million provenance events (or worse) is
not helpful.

