Re: Proposal: standard record metadata attributes for data sources

2018-05-24 Thread Otto Fowler
I commented on the PR, but I’ll add this to the thread here.

Wouldn’t something like this lend itself to a ReportingTask? If not the
current structure, then a similar structure for records?

That would allow the destination to do time-series analysis, etc.
That is not to say there isn’t a case for having it in the Flow as well.



Re: Proposal: standard record metadata attributes for data sources

2018-05-24 Thread Mike Thomsen
I wrote a processor that's inspired by one of the Groovy scripts we use at
that client. PR is here if anyone wants to take a look:

https://github.com/apache/nifi/pull/2737

It's called "RecordStats". It provides a general record count attribute,
and it also lets you specify record path operations to get stats on
individual field values. For example, if you have a field called
"department" you can do this:

department_count (prop name) => /department

as a dynamic property which will produce the following:

{
  "record_count": "100",
  "department": "75",
  "department.Engineering": "25",
  "department.Marketing": "10",
  "department.Operations": "25",
  "department.Finance": "15"
}

The scenario that led to this involves a lot of big queries and full
collection fetches from MongoDB, often as much as 80 GB at a time. They'd
rather see a little slowdown from examining those stats, and get accurate
counts, than see things go lightning fast without insight into exactly
what came out of those fetches.
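The counting logic behind a processor like that can be sketched in plain Java. This is an illustrative sketch only (no NiFi dependency; class and method names here are made up, and records are simplified to string maps rather than NiFi Record objects):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordStatsSketch {

    // Produce a total record count plus per-value tallies for one field,
    // mirroring the attribute map shown in the example above.
    static Map<String, String> stats(List<Map<String, String>> records, String field) {
        Map<String, Integer> perValue = new LinkedHashMap<>();
        int total = 0;
        int fieldHits = 0;
        for (Map<String, String> record : records) {
            total++;
            String value = record.get(field);
            if (value != null) {
                fieldHits++;
                // e.g. "department.Engineering" -> running count
                perValue.merge(field + "." + value, 1, Integer::sum);
            }
        }
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("record_count", String.valueOf(total));
        attrs.put(field, String.valueOf(fieldHits));
        perValue.forEach((k, v) -> attrs.put(k, String.valueOf(v)));
        return attrs;
    }
}
```

In the actual processor the field values would come from evaluating the configured record path against each record, but the aggregation step is essentially this.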




Re: Proposal: standard record metadata attributes for data sources

2018-05-15 Thread Koji Kawamura
Hi Mike,

I agree with the approach of enriching provenance events. To do so, we
can use several places to embed metadata:

- FlowFile attributes: automatically mapped to a provenance event, but
as Andy mentioned, we need to be careful not to put sensitive data there.
- Transit URI: when I developed the NiFi Atlas integration, I used this
as the primary source of what data a processor interacts with, e.g.
remote address, database, table, etc.
- The 'details' string: it might not be an ideal solution, but
ProvenanceReporter accepts an additional 'details' string. We can embed
whatever we want here.

I'd map the metadata you mentioned as follows:
1. Source system => Transit URI
2. Database/table/index/collection/etc. => Transit URI or FlowFile
attribute. I think it's fine to put these into attributes.
3. The lookup criteria that was used (similar to the "query attribute"
some already have) => 'details' string

What I learned from the Atlas integration is that it's really hard to
design a complete standard set of attributes. I'd suggest using what the
NiFi framework currently provides.
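To make the mapping concrete, here is a minimal plain-Java sketch (no NiFi dependency; the Mongo-style values and the helper name are illustrative) of how items 1 and 2 fold into a transit URI, with item 3 carried separately:

```java
public class TransitUriSketch {

    // Items 1 and 2 (source system, database/collection) combine into a
    // transit URI; item 3 (the lookup criteria) would be reported
    // separately, e.g. as the provenance event's 'details' string.
    static String transitUri(String sourceUrl, String database, String collection) {
        return sourceUrl + "/" + database + "." + collection;
    }
}
```

In a processor, the resulting URI and the query would then go to the framework's provenance reporter, with something along the lines of a fetch(flowFile, transitUri, details) call, though the exact call site depends on the processor.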

Thanks,

Koji



Re: Proposal: standard record metadata attributes for data sources

2018-05-14 Thread Andy LoPresto
Maybe an ADDINFO event or FORK event could be used and a new flowfile with the 
relevant attributes/content could be created. The flowfiles would be linked, 
but the “sensitive” information wouldn’t travel with the original.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



Re: Proposal: standard record metadata attributes for data sources

2018-05-14 Thread Mike Thomsen
Does the provenance system have the ability to add user-defined key/value
pairs to a flowfile's provenance record at a particular processor?



Re: Proposal: standard record metadata attributes for data sources

2018-05-14 Thread Andy LoPresto
I would actually propose that this is added to the provenance but not always 
put into the flowfile attributes. There are many scenarios in which the data 
retrieval should be separated from the analysis/follow-on, for visibility, 
responsibility, and security reasons. While I understand a separate 
UpdateAttribute processor could be put in the downstream flow to remove these 
attributes, I would push for not adding them by default as the more secure 
approach. Perhaps this could be configurable on the Get* processors via a 
boolean property, but I think doing it automatically by default introduces some 
serious concerns.


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



Re: Proposal: standard record metadata attributes for data sources

2018-05-13 Thread Mike Thomsen
@Joe @Matt

This is kinda related to the point that Joe made in the graph DB thread
about provenance. My thought here was that we need some standards for
enriching the metadata about what was fetched, so that no matter how you
store the provenance, you can query it for questions like when a data set
was loaded into NiFi, how many records went through a terminating
processor, etc. IMO this could help batch-oriented organizations feel
more at ease with something stream-oriented like NiFi.

On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen  wrote:

> I'd like to propose that all non-deprecated (or likely to be deprecated)
> Get/Fetch/Query processors get a standard convention for attributes that
> describe things like:
>
> 1. Source system.
> 2. Database/table/index/collection/etc.
> 3. The lookup criteria that was used (similar to the "query attribute"
> some already have).
>
> Using GetMongo as an example, it would add something like this:
>
> source.url=mongodb://localhost:27017
> source.database=testdb
> source.collection=test_collection
> source.query={ "username": "john.smith" }
> source.criteria.username=john.smith //GetMongo would parse the query and
> add this.
>
> We have a use case where a team is coming from an extremely batch-oriented
> view and really wants to know when "dataset X" was run. Our solution was to
> extract that from the result set because the dataset name is one of the
> fields in the JSON body.
>
> I think this would help expand what you can do out of the box with
> provenance tracking because it would provide a lot of useful information
> that could be stored in Solr or ES and then queried against terminating
> processors' DROP events to get a solid window into when jobs were run
> historically.
>
> Thoughts?
>
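The query-parsing step the proposal describes ("GetMongo would parse the query and add this") could look like the following plain-Java sketch. This assumes a flat query document with simple string-equality criteria only; there is no NiFi or Mongo dependency, and the class name is illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CriteriaSketch {

    // Matches "key": "value" pairs in a flat JSON query document.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]+)\"");

    // Derive source.criteria.* attributes from a simple equality query,
    // e.g. { "username": "john.smith" } -> source.criteria.username=john.smith
    static Map<String, String> criteriaAttributes(String query) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(query);
        while (m.find()) {
            attrs.put("source.criteria." + m.group(1), m.group(2));
        }
        return attrs;
    }
}
```

A real implementation would of course parse the query with a proper BSON/JSON parser and decide what to do with nested or non-equality criteria; the regex here just keeps the sketch self-contained.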