You're right that the generation and indexing of provenance data creates overhead. We've put considerable effort into minimizing that overhead, to the point where you should not have to think about it and still get all the powerful user experience and auditing gains it provides. However, when you're talking about hundreds of thousands of events per second, it can simply be too much overhead to accept. I don't know if we have a JIRA for it yet, but it makes a lot of sense to allow properly authorized folks to shut off generation of provenance events at certain points of a flow.

On Wed, Apr 19, 2017 at 5:34 PM, Juan Sequeiros <[email protected]> wrote:
> Simon,
>
> I feel that " provenance event is emitted for each flowfile for each
> processor." is accurate understanding "each processor" means the unique
> processors the flowFile goes through.
>
> The provenance database is a lucene database and 1 million provenance events
> is not unreasonable.
> It would have to do with how you configure your NIFI and a best practice is
> to store your provenance on its own disk.
>
> Many tweak able settings for provenance are on nifi.properties [1]
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
>
>
> On Wed, Apr 19, 2017 at 6:50 AM <[email protected]> wrote:
>>
>> Hi All,
>>
>> In some parts of the NiFi documentation, it is stated that a provenance
>> event is emitted for each flowfile for each processor. However elsewhere
>> it is stated that no provenance-event is generated for a flowfile sent
>> to the “success” output of a processor - which is true?
>>
>> And are there mechanisms for reducing the number of provenance events
>> generated by a NiFi flow? When a dataflow is processing large numbers of
>> events, it would seem to me that the generation of provenance events
>> will be the limiting factor for performance. When processing 1 million
>> records per day, generating 1 million provenance events (or worse) is
>> not helpful..
>>
>> Thanks in advance,
>>
>> Simon
