Thanks Andy/Joe/Juan for the helpful replies. I have some follow-up
questions...

(1) 

I'm not quite sure how a flowfile-level audit trail can be used for
governance. Is it expected that some system will process the info as
follows? 

* for each provenance event
  * if (is_bad_record(event)) then raise alert  

If so, then which system is responsible for doing so?
(a) this facility is built in to NiFi and I just haven't noticed it;
(b) this facility will be built into NiFi in the future, but hasn't been
implemented yet;
(c) this can already be done via external open-source tool X (eg
X=atlas);
(d) this can already be done via commercial tool Y (eg Kylo or
Hortonworks DataFlow);
(e) other

And how would the "is_bad_record" logic be implemented? There is a
range of processors that can "exfiltrate" data; most of them are named
"Put*", but not all. And most of them will mark the flowfile as "done"
afterwards, but not all. So how can the author of such validation logic
know at which points in the dataflow the content "must be encrypted" or
"must have internal field A masked"? Are there tutorials on using
provenance events for data governance?
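
To make my question concrete, here is roughly how I imagine option (e)
or a custom check might look: a ReportingTask that polls the provenance
stream and applies the rule. This is only a sketch under my own
assumptions: the "governance.masked" attribute and the SEND-only rule
are invented for illustration, and the resume point is kept in memory
(so it is not restart-safe).

import java.io.IOException;
import java.util.List;

import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.provenance.ProvenanceEventType;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

public class GovernanceAuditTask extends AbstractReportingTask {

    private long lastEventId = 0L; // resume point across runs (in-memory only)

    @Override
    public void onTrigger(final ReportingContext context) {
        try {
            // Pull the next batch of provenance events from the repository.
            final List<ProvenanceEventRecord> events =
                    context.getEventAccess().getProvenanceEvents(lastEventId, 1000);

            for (final ProvenanceEventRecord event : events) {
                lastEventId = event.getEventId() + 1;
                if (isBadRecord(event)) {
                    // "raise alert" stands in for whatever alerting you use.
                    getLogger().error("governance violation: flowfile {} sent via {} to {}",
                            new Object[] { event.getFlowFileUuid(),
                                    event.getComponentType(), event.getTransitUri() });
                }
            }
        } catch (final IOException e) {
            getLogger().error("unable to read provenance events", e);
        }
    }

    // Hypothetical rule: a SEND event is "bad" unless some upstream
    // processor has set governance.masked=true on the flowfile.
    private boolean isBadRecord(final ProvenanceEventRecord event) {
        return event.getEventType() == ProvenanceEventType.SEND
                && !"true".equals(event.getAttributes().get("governance.masked"));
    }
}

But even if something like this works, it illustrates my concern below:
isBadRecord() has to know which attributes which processors set, so the
check is coupled to the flow it validates. (I gather the bundled
SiteToSiteProvenanceReportingTask instead ships the raw events to an
external system, which would seem to move the same coupling problem
there.)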

It seems to me that any such validation logic must be very tightly
coupled to the dataflow that it is validating. In that case, any change
to the dataflow definition must first be reviewed and the corresponding
validation logic updated if needed. But if every flow is already
reviewed by an authorised person, one accredited to write the
"governance validation logic", then I'm not sure what additional
benefit runtime checking of the provenance provides. Have I
misunderstood?

(2)

Thanks for the reminder about "replay". I work so often with Kafka-based
systems that I sometimes take it for granted. However, many "data
integration" tools (ETL suites, tools like Flume/Logstash, or
hand-written scripts) do indeed lack the ability to resend data on
demand (whether in production or in development).
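
(Side note for the archive: if I'm reading the NiFi REST API docs
correctly, a replay can also be triggered programmatically, not just
from the provenance UI; something like

  curl -X POST http://localhost:8080/nifi-api/provenance-events/12345/replays

where 12345 is the provenance event id, and host/port are illustrative.)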

(3)

Showing a graphical history (visualization) of a specific flowfile is
indeed useful for development and for explaining how an existing flow
works. However, it seems to me that this is only occasionally needed,
i.e. the ability to turn provenance events on and off would be useful,
and for that purpose they would only need to be enabled briefly, to
gather enough data for the demonstration.

Does "visualization" really benefit from permanently-on provenance
information? Enough to be worth the overhead? 
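
If tuning the overhead down is the intended answer, I assume it would
be done via the provenance-repository settings in nifi.properties; the
values below are illustrative guesses, not recommendations:

# keep events in memory only (cheap, but lost on restart) - enough for a demo
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
nifi.provenance.repository.buffer.size=100000

# or keep the persistent repository but bound its cost
# nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
# nifi.provenance.repository.max.storage.time=24 hours
# nifi.provenance.repository.max.storage.size=1 GB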

(4) 

Juan wrote: 'We use NiFi's data provenance capabilities to track the
life cycle of a "flowFile" / data object as it goes through its system
lifecycle.'

Juan, can you explain _why_ you want to track the life-cycle of a
flowfile, i.e. what benefits does your business receive from such
tracking (other than debugging workflows)?

And thanks for the link to
http://stackoverflow.com/questions/38948494/what-is-the-purpose-of-data-provenance-in-apache-nifi-processors


(5)

Regarding communication/explanations: yes, more documentation is always
needed :-). I do think a "why NiFi" section in the NiFi user guide
would be useful, and it would benefit from some information on the uses
of provenance events, given how much effort you have put into
developing the feature.

Regards, 

Simon 

On 2017-04-20 02:00, Andy LoPresto wrote:

> Simon, 
> 
> The provenance capability is definitely used by many users for governance and 
> regulatory purposes. For example, when dealing with geolocation data, many 
> countries regulate the export of this data outside their borders. With 
> provenance, you can provably demonstrate that every flowfile which contained 
> such data was properly redacted before exfil or was never sent outside the 
> country. Without flowfile-level event auditing, you would only be able to 
> demonstrate this for a flow model at a specific point in time, but with no 
> visibility into actual data history.  
> 
> Similar use cases exist for documenting the point at which data was 
> encrypted, routed to/received by an external system, or written to disk. Many 
> times in large enterprises, data traverses the responsibility boundaries of 
> multiple disparate teams, and there can be "misunderstandings" about when/if 
> data was properly sent/received. Not only does NiFi's provenance allow for 
> documentation, but as Juan mentioned, the replay feature allows the dropped 
> data to be re-sent immediately. The replay feature also allows for flow 
> sandboxing, as the same events can be replayed consistently through iterative 
> versions of a flow with very low "development latency" or cost.  
> 
> In addition, the granularity of the provenance events allows for compelling 
> visualization of the data lineage graph for each piece of data, with 
> time-based graph illustration to show logical flow movement.  
> 
> Your message shows that we can do a better job explaining to our community 
> the features that are available and how they can make your life easier. Many 
> of us have worked on the software for a number of years, and it's become so 
> familiar that we forget how to advertise what is old hat to us. Thanks for 
> pushing us to be better.  
> 
> Andy LoPresto 
> alopre...@apache.org 
> alopresto.apache@gmail.com 
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69 
> 
> On Apr 19, 2017, at 2:25 PM, Joe Witt <joe.w...@gmail.com> wrote: 
> 
> Additionally it is important to note that flow level changes are now
> exposed and available to reporting tasks as well.  It is envisioned
> this will be used to report to systems like Apache Atlas for that flow
> level metadata you describe but made far more powerful by combining it
> with event level lineage as well.
> 
> On Wed, Apr 19, 2017 at 5:23 PM, Juan Sequeiros <helloj...@gmail.com> wrote:
> Simon,
> 
> We use NiFi's data provenance capabilities to track the life cycle of a
> "flowFile" / data object as it goes through its system lifecycle (LINEAGE).
> We also use it for troubleshooting, as we can see the NiFi attributes
> (metadata) and its content (if configured).
> 
> You can also use provenance to "replay" your data at specific points during
> its dataflow life cycle.
> 
> Please reference similar answer given on stackoverflow by Joe Witt [1]
> I also recommend reading Apache NIFI in depth which has a good provenance
> section [2]
> 
> [1]
> http://stackoverflow.com/questions/38948494/what-is-the-purpose-of-data-provenance-in-apache-nifi-processors
> [2] https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> 
> On Wed, Apr 19, 2017 at 6:02 AM <si...@vonos.net> wrote:
> 
> Hi All,
> 
> Can someone explain to me the business-level use cases that "provenance
> events" are intended to solve?
> 
> I can see that they are useful for "flow developers" to debug problems.
> But is that their only use?
> 
> Can they be used to address some kinds of regulatory compliance
> requirements? Or data governance issues? Such problems, however, generally
> need information at the _flow_ level, not at the per-message level.
> 
> Thanks in advance,
> Simon
