Data Provenance @scale in Nifi

milind parikh Wed, 06 Jul 2016 22:28:18 -0700

I am relatively new to NiFi. I have written one processor in Java for NiFi,
which should give you a sense of how much I know about NiFi (which is
little).


I have a scenario with about 100k flow files a day, representing about
100m records, which need to be aggregated across 1m data points across
100 dimensions.

If, in my architecture, I split each initial flow file into records, write
them to Kafka (1000 records per flow file), and read them back in parallel,
how do I do data provenance for the aggregated values?

The use case I am interested in is showing how one of the data points
(out of the 1m) arrived at its daily aggregated value, which on average is
built from 100 records coming out of very few of the 100k files.

I can't expand the data provenance through the UI (1000 initial records)
and THEN through the 1m data points, nor can I traverse the 1m data points
in the UI to find my starting point.

I know the exact reference of the data point (it is a truncated version of
the SHA-1 of a complex but unique data point string).
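
To make the reference concrete, here is a minimal sketch of how such a key
could be derived. The class name, the UTF-8 encoding, and the 8-character
prefix length are all assumptions made for illustration; the actual
truncation length is not stated here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of NiFi: derives a lookup key for a data point.
class DataPointKey {

    // Returns the first `hexChars` hex characters of the SHA-1 digest of `dataPoint`.
    // A full SHA-1 digest is 20 bytes, i.e. 40 hex characters.
    static String truncatedSha1(String dataPoint, int hexChars) {
        if (hexChars < 0 || hexChars > 40) {
            throw new IllegalArgumentException("hexChars must be between 0 and 40");
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            // Assumption: the data point string is encoded as UTF-8 before hashing.
            byte[] digest = md.digest(dataPoint.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(40);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.substring(0, hexChars);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // "abc" stands in for the real, more complex data point string.
        System.out.println(truncatedSha1("abc", 8)); // prints a9993e36
    }
}
```

A key like this would have to be carried with each record or aggregate (for
example as an attribute) for any targeted lookup to work; whether that fits
this flow is part of the question.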

Is there a command-line equivalent of the UI that can be targeted more
precisely at one data point?

Thanks
Milind
