Thanks Joe, Juan,
Perhaps it would be useful to be able to generate provenance events for
a _sample_ of flowfiles? E.g. every Nth flowfile created by a "data
ingress" (GET* or LISTEN*) processor gets tracked? Or maybe better:
every flowfile gets tracked with a probability of 1/N, to ensure that
specific input patterns (e.g. every (N+1)th message is unusual) don't go
unnoticed?
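To make the difference between the two sampling ideas concrete, here is a
minimal sketch in plain Python. It is not NiFi code and uses no NiFi API;
the names every_nth, with_probability and unusual_tracked are made up for
illustration, and the "unusual" input pattern is a hypothetical one chosen
so that its period lines up with the sampling period.

```python
import random

def every_nth(n):
    """Track exactly every n-th item (deterministic sampling)."""
    seen = 0
    def should_track():
        nonlocal seen
        seen += 1
        return seen % n == 0
    return should_track

def with_probability(p, rng=None):
    """Track each item independently with probability p (random sampling)."""
    rng = rng if rng is not None else random.Random()
    def should_track():
        return rng.random() < p
    return should_track

def unusual_tracked(should_track, total, is_unusual):
    """Count how many unusual items (numbered 1..total) end up tracked."""
    return sum(1 for i in range(1, total + 1)
               if should_track() and is_unusual(i))

# Hypothetical input: items 1, 5, 9, ... are unusual (period 4).
is_unusual = lambda i: i % 4 == 1

# Deterministic sampling with the same period only ever sees items 4, 8, 12, ...
missed = unusual_tracked(every_nth(4), 1000, is_unusual)
# Random sampling at the same average rate is not tied to any phase.
caught = unusual_tracked(with_probability(0.25, random.Random(0)), 1000, is_unusual)
print(missed, caught)
```

Both samplers track about a quarter of the items, but the deterministic
one can be blind to a periodic pattern that happens to share its period,
which is the reason for preferring the probabilistic variant.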
I have seen users reporting problems on this email list where the
provenance repository becomes full and everything stops. That is clearly
not desirable, but neither is simply discarding the oldest provenance
records in the repository; some flows are presumably more important than
others. In particular, a single large-volume flow should presumably not
cause provenance for other flows to be flushed. The admin-guide page you
referenced below apparently does not describe a way to configure
provenance storage per-flow. Maybe the ability to configure "sampling"
might help?
I'm developing a data import process right now for a customer; some
datasources will be reasonably low-volume while others will be very
high-volume. Sampling for high-volume flows might be useful, but
tracking every flowfile is simply not practical. In addition, some
datasources hold very confidential data; it doesn't seem desirable to
record this at all - although AFAIK retaining this data in the NiFi
content repository for unknown periods of time is unavoidable.
Thanks once again for your feedback!
On 2017-04-19 23:36, Joe Witt wrote:
You're right that the generation and indexing of provenance data
creates overhead. We've put considerable effort into minimizing that
overhead to a point where you should not have to think about it and
still get all the powerful user experience/auditing gains it provides.
However, when you're talking about 100s of thousands of events per
second it can simply be too much overhead to accept. I don't know if
we have a JIRA for it yet but it makes a lot of sense to allow
properly authorized folks to shut off generation of provenance events
at certain points of a flow.
On Wed, Apr 19, 2017 at 5:34 PM, Juan Sequeiros <helloj...@gmail.com>
I feel that "a provenance event is emitted for each flowfile for each
processor" is accurate, with the understanding that "each processor"
means the processors the flowfile goes through.
The provenance database is a Lucene database, and 1 million provenance
events is not unreasonable.
It would have to do with how you configure your NiFi, and a best
practice is to store your provenance on its own disk.
Many tweakable settings for provenance are in nifi.properties.
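For illustration, a fragment of nifi.properties covering the points
above might look like the following. The property names are the ones
listed in the NiFi admin guide; the values and the mount point are
assumptions for the sake of the example, not recommendations.

```properties
# nifi.properties (excerpt); values are illustrative only
# Keep the provenance repository on its own disk (hypothetical mount point)
nifi.provenance.repository.directory.default=/mnt/provenance/provenance_repository
# Upper bound on how long provenance events are kept
nifi.provenance.repository.max.storage.time=24 hours
# Upper bound on how much provenance data is kept
nifi.provenance.repository.max.storage.size=1 GB
```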
On Wed, Apr 19, 2017 at 6:50 AM <si...@vonos.net> wrote:
In some parts of the NiFi documentation, it is stated that a provenance
event is emitted for each flowfile for each processor. However, elsewhere
it is stated that no provenance-event is generated for a flowfile routed
to the “success” output of a processor - which is true?
And are there mechanisms for reducing the number of provenance events
generated by a NiFi flow? When a dataflow is processing large numbers of
events, it would seem to me that the generation of provenance events
will be the limiting factor for performance. When processing 1 million
records per day, generating 1 million provenance events (or worse) is
Thanks in advance,