Re: Re: Processor to enrich attribute from external service

2016-09-03 Thread Andre
Uwe,

I will be happy to help.

Do you have an open PR or a GitHub repo with the code?

Cheers

On Sat, Sep 3, 2016 at 8:07 PM, Uwe Geercken  wrote:

> Matt,
>
> I worked a while ago on a processor based on Apache Velocity. I stopped
> working on it when the NAR packaging did not work and I was somewhat
> confused by the project layout. You helped me at that time, but there was
> still an error.
>
> I would like to pick up the work again, but I need help with the
> packaging. I am not very familiar with Maven.
>
> Would you have time to review that with me again?
>
> Rgds,
>
> Uwe
>


Re: Processor to enrich attribute from external service

2016-09-03 Thread Andre
Gunjan,

There are many ways of skinning this cat indeed... :-)

Another reasonably efficient strategy for those looking to perform high-speed
enrichment against structured, single-match data sources is to use the
QueryDNS processor (in regex or split parsing mode) together with a synthetic
DNS server acting as a bridge to a database or other data source.

It works pretty much like a dynamic anti-SPAM blacklist:

1. Spin up PowerDNS using the pipe backend, consulting the enrichment data
source so that it publishes the data you want to enrich against;
2. Point NiFi's QueryDNS at the PowerDNS instance as its name server (TXT
records work very well for data up to 253 characters);
3. Use the attributes added by QueryDNS to make decisions within NiFi.

The setup above scales out very well for small enrichment payloads, and you
can use multiple QueryDNS processors and PowerDNS sub-domains to perform
multiple stages of enrichment.

Also worth noting is that the use of DNS (UDP) allows the DFM to design
enrichment paths that "fail open" (via DNS query timeout) in case of an
outage affecting the data source.
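
For anyone curious what the PowerDNS side could look like, here is a rough,
untested sketch of a pipe backend (ABI version 1) written in Java. The qname
layout (keys published under a lookup sub-domain), the TXT payload format and
the static map standing in for the real data source are all illustrative; an
actual backend would query the enrichment database instead:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Minimal PowerDNS "pipe" backend sketch (ABI version 1).
// A real deployment would consult the enrichment data source (JDBC,
// key/value store, ...) instead of the static map below.
public class EnrichmentPipeBackend {

    public static void main(String[] args) throws Exception {
        Map<String, String> enrichment = new HashMap<>();
        // Hypothetical key published under a lookup sub-domain.
        enrichment.put("10.1.2.3.lookup.example.com", "customer=ACME;tier=gold");

        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("HELO")) {
                // Handshake: pdns sends "HELO\t1", we acknowledge.
                System.out.println("OK\tNiFi enrichment backend");
            } else if (line.startsWith("Q\t")) {
                // Query line: Q <qname> <qclass> <qtype> <id> <remote-ip>
                String[] f = line.split("\t");
                if (f.length >= 4) {
                    String qname = f[1];
                    String qtype = f[3];
                    String value = enrichment.get(qname.toLowerCase());
                    if (value != null && ("TXT".equals(qtype) || "ANY".equals(qtype))) {
                        // Answer: DATA <qname> <qclass> <qtype> <ttl> <id> <content>
                        System.out.println("DATA\t" + qname + "\tIN\tTXT\t60\t-1\t\"" + value + "\"");
                    }
                }
                System.out.println("END");
            } else {
                System.out.println("END");
            }
            System.out.flush();
        }
    }
}

QueryDNS then surfaces whatever lands in the TXT answer as flow file
attributes, and a query timeout simply means the enrichment step fails open
as described above.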

Cheers


On Sat, Sep 3, 2016 at 12:07 PM, Gunjan Dave <gunjanpiyushd...@gmail.com>
wrote:

> How I have handled this personally is to wrap the SQL processors with a
> HandleHttpRequest processor, essentially exposing the DB operation as a REST
> web service.
>
> Then you have the option of the fetchhttp processor appending the results
> to an attribute instead of the content, which is an option already
> available.
>
> With MongoDB, you need not do this additional wrapping, as it has a REST
> interface, so you can use that directly with the HTTP processor.
>
> On Sat, Sep 3, 2016, 4:28 AM Matt Burgess <mattyb...@apache.org> wrote:
>
>> Agreed.  Additionally, if we want to get fancy, we can work with
>> incoming flow files based on MIME type (JSON, XML, CSV) and have a
>> "Path" property to a field in the document. Then the processor could
>> replace inline the value in the document with the lookup value. If XML
>> files are coming in, the Path is an XPath expression. Same for JSON
>> and JSONPath, and CSV could be a column index (0-based, e.g.).
>>
>> I have something very similar (not the lookup, but the "Path" thing
>> for multiple file types) coming soon as a Jira case / PR ;) If that
>> proves useful, I could move it into a util or base class or something.
>>
>> Regards,
>> Matt
>>
>> On Fri, Sep 2, 2016 at 6:47 PM, Manish Gupta 8 <mgupt...@sapient.com>
>> wrote:
>> > I think the lookup processor should return data in a format that can be
>> > efficiently parsed/processed by NiFi expression language. For example –
>> > JSON. This would avoid using additional “Extract” type processor. All
>> the
>> > downstream processor can simply work with “jsonPath” for additional
>> lookup
>> > inside the attribute.
>> >
>> >
>> >
>> > Regards,
>> >
>> > Manish
>> >
>> >
>> >
>> > From: Matt Burgess [mailto:mattyb...@gmail.com]
>> > Sent: Friday, September 02, 2016 6:37 PM
>> >
>> >
>> > To: users@nifi.apache.org
>> > Subject: Re: Processor to enrich attribute from external service
>> >
>> >
>> >
>> > Manish,
>> >
>> >
>> >
>> > Some of the queries in those processors could bring back lots of data,
>> and
>> > putting them into an attribute could cause memory issues. Another
>> concern is
>> > when the result is binary data, such as ExecuteSQL returning an Avro
>> file.
>> > And since the return of these is a collection of records, these
>> processors
>> > are often followed by a Split processor to perform operations on
>> individual
>> > records.
>> >
>> >
>> >
>> > Having said that, if the return value is text and you'd like to
>> transfer it
>> > to an attribute, you can use ExtractText to put the content into an
>> > attribute. For small content (which is the appropriate use case), this
>> > should be pretty fast, and keeps the logic in a single processor
>> instead of
>> > duplicated (either logically or physically) across processors.
>> >
>> >
>> >
>> > By the way I'm very interested in an RDBMS lookup processor, but not
>> sure
>> > I'd have time in the short run to write it up. If someone takes a crack
>> at
>> > it, I recommend properties to pre-cache the table with a refresh
>> interval.
>> > This way if the lookup table doesn't change much and is no

Re: Processor to enrich attribute from external service

2016-09-02 Thread Gunjan Dave
In addition, Manish, if you have a larger dataflow to design, you'll start
facing the "difficult to interpret the flow visually" problem. Process groups
help, but if you have many process groups on the UI, you'll see the problem
very clearly.

For this, I am thinking of using the same approach: wrapping my logically
similar process groups behind a REST API and then using the HTTP processor to
invoke those groups. This, I think, will work as a sort of reference process
group until the actual concept of reference process groups is brought into
NiFi, which I believe is on the roadmap. I have not yet implemented this in
NiFi, so I am not sure it will actually work, but I think it should, as the
similar approach on the database side worked.




On Sat, Sep 3, 2016, 7:37 AM Gunjan Dave <gunjanpiyushd...@gmail.com> wrote:

> How I have handled this personally is to wrap the SQL processors with a
> HandleHttpRequest processor, essentially exposing the DB operation as a REST
> web service.
>
> Then you have the option of the fetchhttp processor appending the results
> to an attribute instead of the content, which is an option already
> available.
>
> With MongoDB, you need not do this additional wrapping, as it has a REST
> interface, so you can use that directly with the HTTP processor.
>
> On Sat, Sep 3, 2016, 4:28 AM Matt Burgess <mattyb...@apache.org> wrote:
>
>> Agreed.  Additionally, if we want to get fancy, we can work with
>> incoming flow files based on MIME type (JSON, XML, CSV) and have a
>> "Path" property to a field in the document. Then the processor could
>> replace inline the value in the document with the lookup value. If XML
>> files are coming in, the Path is an XPath expression. Same for JSON
>> and JSONPath, and CSV could be a column index (0-based, e.g.).
>>
>> I have something very similar (not the lookup, but the "Path" thing
>> for multiple file types) coming soon as a Jira case / PR ;) If that
>> proves useful, I could move it into a util or base class or something.
>>
>> Regards,
>> Matt
>>
>> On Fri, Sep 2, 2016 at 6:47 PM, Manish Gupta 8 <mgupt...@sapient.com>
>> wrote:
>> > I think the lookup processor should return data in a format that can be
>> > efficiently parsed/processed by NiFi expression language. For example –
>> > JSON. This would avoid using additional “Extract” type processor. All
>> the
>> > downstream processor can simply work with “jsonPath” for additional
>> lookup
>> > inside the attribute.
>> >
>> >
>> >
>> > Regards,
>> >
>> > Manish
>> >
>> >
>> >
>> > From: Matt Burgess [mailto:mattyb...@gmail.com]
>> > Sent: Friday, September 02, 2016 6:37 PM
>> >
>> >
>> > To: users@nifi.apache.org
>> > Subject: Re: Processor to enrich attribute from external service
>> >
>> >
>> >
>> > Manish,
>> >
>> >
>> >
>> > Some of the queries in those processors could bring back lots of data,
>> and
>> > putting them into an attribute could cause memory issues. Another
>> concern is
>> > when the result is binary data, such as ExecuteSQL returning an Avro
>> file.
>> > And since the return of these is a collection of records, these
>> processors
>> > are often followed by a Split processor to perform operations on
>> individual
>> > records.
>> >
>> >
>> >
>> > Having said that, if the return value is text and you'd like to
>> transfer it
>> > to an attribute, you can use ExtractText to put the content into an
>> > attribute. For small content (which is the appropriate use case), this
>> > should be pretty fast, and keeps the logic in a single processor
>> instead of
>> > duplicated (either logically or physically) across processors.
>> >
>> >
>> >
>> > By the way I'm very interested in an RDBMS lookup processor, but not
>> sure
>> > I'd have time in the short run to write it up. If someone takes a crack
>> at
>> > it, I recommend properties to pre-cache the table with a refresh
>> interval.
>> > This way if the lookup table doesn't change much and is not too big, it
>> > could be read into the processor's memory for super-fast lookups.
>> > Alternatively, a property could be a cache size, which would build a
>> subset
>> > of the table in memory as values are looked up. This is probably more
>> robust
>> > as it is bounded and if the size is set high enough for a sm

Re: Processor to enrich attribute from external service

2016-09-02 Thread Gunjan Dave
How I have handled this personally is to wrap the SQL processors with a
HandleHttpRequest processor, essentially exposing the DB operation as a REST
web service.

Then you have the option of the fetchhttp processor appending the results to
an attribute instead of the content, which is an option already available.

With MongoDB, you need not do this additional wrapping, as it has a REST
interface, so you can use that directly with the HTTP processor.
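
If it helps to see the wrapping idea from the caller's side, here is a small,
hypothetical Java sketch of the kind of request such a flow would serve (host,
port, path and parameter name are made up). Inside NiFi the equivalent call
would of course come from the HTTP processor itself, configured to place the
response into an attribute rather than the content:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch of a client calling a flow that wraps ExecuteSQL behind
// HandleHttpRequest/HandleHttpResponse. All endpoint details are placeholders.
public class LookupClientSketch {

    public static void main(String[] args) throws Exception {
        String key = "12345";
        URL url = new URL("http://nifi-host:8181/lookup?customerId="
                + URLEncoder.encode(key, "UTF-8"));

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(2000);  // fail fast if the lookup service is down
        conn.setReadTimeout(2000);

        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String respLine;
            while ((respLine = reader.readLine()) != null) {
                body.append(respLine);
            }
        }
        System.out.println("lookup result: " + body);
    }
}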

On Sat, Sep 3, 2016, 4:28 AM Matt Burgess <mattyb...@apache.org> wrote:

> Agreed.  Additionally, if we want to get fancy, we can work with
> incoming flow files based on MIME type (JSON, XML, CSV) and have a
> "Path" property to a field in the document. Then the processor could
> replace inline the value in the document with the lookup value. If XML
> files are coming in, the Path is an XPath expression. Same for JSON
> and JSONPath, and CSV could be a column index (0-based, e.g.).
>
> I have something very similar (not the lookup, but the "Path" thing
> for multiple file types) coming soon as a Jira case / PR ;) If that
> proves useful, I could move it into a util or base class or something.
>
> Regards,
> Matt
>
> On Fri, Sep 2, 2016 at 6:47 PM, Manish Gupta 8 <mgupt...@sapient.com>
> wrote:
> > I think the lookup processor should return data in a format that can be
> > efficiently parsed/processed by NiFi expression language. For example –
> > JSON. This would avoid using additional “Extract” type processor. All the
> > downstream processor can simply work with “jsonPath” for additional
> lookup
> > inside the attribute.
> >
> >
> >
> > Regards,
> >
> > Manish
> >
> >
> >
> > From: Matt Burgess [mailto:mattyb...@gmail.com]
> > Sent: Friday, September 02, 2016 6:37 PM
> >
> >
> > To: users@nifi.apache.org
> > Subject: Re: Processor to enrich attribute from external service
> >
> >
> >
> > Manish,
> >
> >
> >
> > Some of the queries in those processors could bring back lots of data,
> and
> > putting them into an attribute could cause memory issues. Another
> concern is
> > when the result is binary data, such as ExecuteSQL returning an Avro
> file.
> > And since the return of these is a collection of records, these
> processors
> > are often followed by a Split processor to perform operations on
> individual
> > records.
> >
> >
> >
> > Having said that, if the return value is text and you'd like to transfer
> it
> > to an attribute, you can use ExtractText to put the content into an
> > attribute. For small content (which is the appropriate use case), this
> > should be pretty fast, and keeps the logic in a single processor instead
> of
> > duplicated (either logically or physically) across processors.
> >
> >
> >
> > By the way I'm very interested in an RDBMS lookup processor, but not sure
> > I'd have time in the short run to write it up. If someone takes a crack
> at
> > it, I recommend properties to pre-cache the table with a refresh
> interval.
> > This way if the lookup table doesn't change much and is not too big, it
> > could be read into the processor's memory for super-fast lookups.
> > Alternatively, a property could be a cache size, which would build a
> subset
> > of the table in memory as values are looked up. This is probably more
> robust
> > as it is bounded and if the size is set high enough for a small table, it
> > would be read in its entirety. Still would want the cache refresh
> property
> > though.
> >
> >
> >
> > Cheers,
> >
> > Matt
> >
> >
> > On Sep 2, 2016, at 6:19 PM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> >
> > Thanks for the reply Joe. Just a thought – do you think it would be a
> good
> > idea for every Get processor (GetMongo, GetHBase etc.) to have 2
> additional
> > properties like:
> >
> > 1.  Result in Content or Result in Attribute
> >
> > 2.  Result Attribute Name (only applicable when “Result in
> Attribute” is
> > selected).
> >
> > But then all such processors should be able to accept incoming flowfile
> > (which they don’t as of now – being a “Get”).
> >
> >
> >
> > May be ExecuteSQL and FetchDistributeMapCache can be enhanced that way
> i.e.
> > have an option to specify the destination – content or attribute?
> >
> >
> >
> > Regards,
> >
> > Manish
> >
> >
> >
> > From: Joe Witt [mailto:joe.w...@gmail.com]
> > Sent: Friday, September 

Re: Processor to enrich attribute from external service

2016-09-02 Thread Matt Burgess
Agreed.  Additionally, if we want to get fancy, we can work with
incoming flow files based on MIME type (JSON, XML, CSV) and have a
"Path" property to a field in the document. Then the processor could
replace inline the value in the document with the lookup value. If XML
files are coming in, the Path is an XPath expression. Same for JSON
and JSONPath, and CSV could be a column index (0-based, e.g.).
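
As a rough illustration of the JSON branch of that idea (just a sketch using
the Jayway json-path library with a stubbed-out lookup, not the actual
processor):

import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;

// Resolve a "Path" property against the document, look up the current value
// and write the enriched value back in place.
public class JsonPathEnrichSketch {

    // Stand-in for the lookup against the external service.
    static String lookup(String key) {
        return "enriched-" + key;
    }

    public static void main(String[] args) {
        String json = "{\"customer\":{\"id\":\"42\",\"name\":null}}";
        String path = "$.customer.id";   // the processor's hypothetical "Path" property

        DocumentContext doc = JsonPath.parse(json);
        String current = doc.read(path, String.class);
        doc.set(path, lookup(current));  // replace the value inline

        System.out.println(doc.jsonString());
        // e.g. {"customer":{"id":"enriched-42","name":null}}
    }
}

The XML and CSV branches would follow the same pattern with an XPath
evaluation or a column index, respectively.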

I have something very similar (not the lookup, but the "Path" thing
for multiple file types) coming soon as a Jira case / PR ;) If that
proves useful, I could move it into a util or base class or something.

Regards,
Matt

On Fri, Sep 2, 2016 at 6:47 PM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> I think the lookup processor should return data in a format that can be
> efficiently parsed/processed by NiFi expression language. For example –
> JSON. This would avoid using additional “Extract” type processor. All the
> downstream processor can simply work with “jsonPath” for additional lookup
> inside the attribute.
>
>
>
> Regards,
>
> Manish
>
>
>
> From: Matt Burgess [mailto:mattyb...@gmail.com]
> Sent: Friday, September 02, 2016 6:37 PM
>
>
> To: users@nifi.apache.org
> Subject: Re: Processor to enrich attribute from external service
>
>
>
> Manish,
>
>
>
> Some of the queries in those processors could bring back lots of data, and
> putting them into an attribute could cause memory issues. Another concern is
> when the result is binary data, such as ExecuteSQL returning an Avro file.
> And since the return of these is a collection of records, these processors
> are often followed by a Split processor to perform operations on individual
> records.
>
>
>
> Having said that, if the return value is text and you'd like to transfer it
> to an attribute, you can use ExtractText to put the content into an
> attribute. For small content (which is the appropriate use case), this
> should be pretty fast, and keeps the logic in a single processor instead of
> duplicated (either logically or physically) across processors.
>
>
>
> By the way I'm very interested in an RDBMS lookup processor, but not sure
> I'd have time in the short run to write it up. If someone takes a crack at
> it, I recommend properties to pre-cache the table with a refresh interval.
> This way if the lookup table doesn't change much and is not too big, it
> could be read into the processor's memory for super-fast lookups.
> Alternatively, a property could be a cache size, which would build a subset
> of the table in memory as values are looked up. This is probably more robust
> as it is bounded and if the size is set high enough for a small table, it
> would be read in its entirety. Still would want the cache refresh property
> though.
>
>
>
> Cheers,
>
> Matt
>
>
> On Sep 2, 2016, at 6:19 PM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
>
> Thanks for the reply Joe. Just a thought – do you think it would be a good
> idea for every Get processor (GetMongo, GetHBase etc.) to have 2 additional
> properties like:
>
> 1.  Result in Content or Result in Attribute
>
> 2.  Result Attribute Name (only applicable when “Result in Attribute” is
> selected).
>
> But then all such processors should be able to accept incoming flowfile
> (which they don’t as of now – being a “Get”).
>
>
>
> May be ExecuteSQL and FetchDistributeMapCache can be enhanced that way i.e.
> have an option to specify the destination – content or attribute?
>
>
>
> Regards,
>
> Manish
>
>
>
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Friday, September 02, 2016 5:58 PM
> To: users@nifi.apache.org
> Subject: Re: Processor to enrich attribute from external service
>
>
>
> You would need to make a custom process for now.  I think we should have a
> nice controller service to generalize jdbc lookups which supports caching.
> And then a processor which leverages it.
>
> This comes up fairly often and is pretty straightforward from a design POV.
> Anyone want to take a stab at this?
>
>
>
> On Sep 2, 2016 4:47 PM, "Manish Gupta 8" <mgupt...@sapient.com> wrote:
>
> Hello Everyone,
>
>
>
> Is there a processor that we can use for updating/adding an attribute of an
> incoming flow file from some external service (say MongoDB or Couchbase or
> any RDBMS)? The processor will use the attribute of incoming flow file,
> query the external service, and simply modify/add an additional attribute of
> flow-file (without touching the flow file content).
>
>
>
> If we have to achieve this kind of “lookup” operation (but only to update
> attribute and not the content), what are the options in NiFi?
>
> Should we create a custom processor (may be by taking GetMongo processor and
> modifying its code to update an attribute with query result)?
>
>
>
> Thanks,
>
> Manish
>
>


RE: Processor to enrich attribute from external service

2016-09-02 Thread Manish Gupta 8
I think the lookup processor should return data in a format that can be
efficiently parsed/processed by the NiFi Expression Language, for example
JSON. This would avoid using an additional “Extract”-type processor. All the
downstream processors can simply work with “jsonPath” for additional lookups
inside the attribute.

Regards,
Manish

From: Matt Burgess [mailto:mattyb...@gmail.com]
Sent: Friday, September 02, 2016 6:37 PM
To: users@nifi.apache.org
Subject: Re: Processor to enrich attribute from external service

Manish,

Some of the queries in those processors could bring back lots of data, and 
putting them into an attribute could cause memory issues. Another concern is 
when the result is binary data, such as ExecuteSQL returning an Avro file. And 
since the return of these is a collection of records, these processors are 
often followed by a Split processor to perform operations on individual records.

Having said that, if the return value is text and you'd like to transfer it to 
an attribute, you can use ExtractText to put the content into an attribute. For 
small content (which is the appropriate use case), this should be pretty fast, 
and keeps the logic in a single processor instead of duplicated (either 
logically or physically) across processors.

By the way I'm very interested in an RDBMS lookup processor, but not sure I'd 
have time in the short run to write it up. If someone takes a crack at it, I 
recommend properties to pre-cache the table with a refresh interval. This way 
if the lookup table doesn't change much and is not too big, it could be read 
into the processor's memory for super-fast lookups. Alternatively, a property 
could be a cache size, which would build a subset of the table in memory as 
values are looked up. This is probably more robust as it is bounded and if the 
size is set high enough for a small table, it would be read in its entirety. 
Still would want the cache refresh property though.

Cheers,
Matt

On Sep 2, 2016, at 6:19 PM, Manish Gupta 8 
<mgupt...@sapient.com<mailto:mgupt...@sapient.com>> wrote:
Thanks for the reply Joe. Just a thought – do you think it would be a good idea 
for every Get processor (GetMongo, GetHBase etc.) to have 2 additional 
properties like:

1.  Result in Content or Result in Attribute

2.  Result Attribute Name (only applicable when “Result in Attribute” is 
selected).
But then all such processors should be able to accept incoming flowfile (which 
they don’t as of now – being a “Get”).

May be ExecuteSQL and FetchDistributeMapCache can be enhanced that way i.e. 
have an option to specify the destination – content or attribute?

Regards,
Manish

From: Joe Witt [mailto:joe.w...@gmail.com]
Sent: Friday, September 02, 2016 5:58 PM
To: users@nifi.apache.org<mailto:users@nifi.apache.org>
Subject: Re: Processor to enrich attribute from external service


You would need to make a custom process for now.  I think we should have a nice 
controller service to generalize jdbc lookups which supports caching.  And then 
a processor which leverages it.

This comes up fairly often and is pretty straightforward from a design POV.  
Anyone want to take a stab at this?

On Sep 2, 2016 4:47 PM, "Manish Gupta 8" 
<mgupt...@sapient.com<mailto:mgupt...@sapient.com>> wrote:
Hello Everyone,

Is there a processor that we can use for updating/adding an attribute of an 
incoming flow file from some external service (say MongoDB or Couchbase or any 
RDBMS)? The processor will use the attribute of incoming flow file, query the 
external service, and simply modify/add an additional attribute of flow-file 
(without touching the flow file content).

If we have to achieve this kind of “lookup” operation (but only to update 
attribute and not the content), what are the options in NiFi?
Should we create a custom processor (may be by taking GetMongo processor and 
modifying its code to update an attribute with query result)?

Thanks,
Manish



Re: Processor to enrich attribute from external service

2016-09-02 Thread Matt Burgess
Manish,

Some of the queries in those processors could bring back lots of data, and 
putting them into an attribute could cause memory issues. Another concern is 
when the result is binary data, such as ExecuteSQL returning an Avro file. And 
since the return of these is a collection of records, these processors are 
often followed by a Split processor to perform operations on individual records.

Having said that, if the return value is text and you'd like to transfer it to 
an attribute, you can use ExtractText to put the content into an attribute. For 
small content (which is the appropriate use case), this should be pretty fast, 
and keeps the logic in a single processor instead of duplicated (either 
logically or physically) across processors.

By the way I'm very interested in an RDBMS lookup processor, but not sure I'd 
have time in the short run to write it up. If someone takes a crack at it, I 
recommend properties to pre-cache the table with a refresh interval. This way 
if the lookup table doesn't change much and is not too big, it could be read 
into the processor's memory for super-fast lookups. Alternatively, a property 
could be a cache size, which would build a subset of the table in memory as 
values are looked up. This is probably more robust as it is bounded and if the 
size is set high enough for a small table, it would be read in its entirety. 
Still would want the cache refresh property though.
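
To make the caching idea concrete, a bounded, refreshable lookup along those
lines might look roughly like the sketch below (Guava cache, placeholder table
and column names; no claim this is how the eventual processor should be
structured). With the maximum size set high enough for a small table, the
whole table effectively ends up in memory, as described above:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.TimeUnit;

// Bounded cache that lazily loads lookup values over JDBC and refreshes
// entries after a configurable interval (the "cache refresh" idea above).
public class CachedJdbcLookup {

    private final LoadingCache<String, String> cache;

    public CachedJdbcLookup(final String jdbcUrl, final long maxSize, final long refreshMinutes) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(maxSize)
                .refreshAfterWrite(refreshMinutes, TimeUnit.MINUTES)
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) throws Exception {
                        try (Connection conn = DriverManager.getConnection(jdbcUrl);
                             PreparedStatement ps = conn.prepareStatement(
                                     "SELECT enrich_value FROM lookup_table WHERE lookup_key = ?")) {
                            ps.setString(1, key);
                            try (ResultSet rs = ps.executeQuery()) {
                                return rs.next() ? rs.getString(1) : "";
                            }
                        }
                    }
                });
    }

    public String lookup(String key) throws Exception {
        return cache.get(key);   // loads and caches on first access
    }
}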

Cheers,
Matt


> On Sep 2, 2016, at 6:19 PM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
> 
> Thanks for the reply Joe. Just a thought – do you think it would be a good 
> idea for every Get processor (GetMongo, GetHBase etc.) to have 2 additional 
> properties like:
> 1.   Result in Content or Result in Attribute
> 2.   Result Attribute Name (only applicable when “Result in Attribute” is 
> selected).
> But then all such processors should be able to accept incoming flowfile 
> (which they don’t as of now – being a “Get”).
>  
> May be ExecuteSQL and FetchDistributeMapCache can be enhanced that way i.e. 
> have an option to specify the destination – content or attribute?
>  
> Regards,
> Manish
>  
> From: Joe Witt [mailto:joe.w...@gmail.com] 
> Sent: Friday, September 02, 2016 5:58 PM
> To: users@nifi.apache.org
> Subject: Re: Processor to enrich attribute from external service
>  
> You would need to make a custom process for now.  I think we should have a 
> nice controller service to generalize jdbc lookups which supports caching.  
> And then a processor which leverages it.
> 
> This comes up fairly often and is pretty straightforward from a design POV.  
> Anyone want to take a stab at this?
> 
>  
> On Sep 2, 2016 4:47 PM, "Manish Gupta 8" <mgupt...@sapient.com> wrote:
> Hello Everyone,
>  
> Is there a processor that we can use for updating/adding an attribute of an 
> incoming flow file from some external service (say MongoDB or Couchbase or 
> any RDBMS)? The processor will use the attribute of incoming flow file, query 
> the external service, and simply modify/add an additional attribute of 
> flow-file (without touching the flow file content).
>  
> If we have to achieve this kind of “lookup” operation (but only to update 
> attribute and not the content), what are the options in NiFi?
> Should we create a custom processor (may be by taking GetMongo processor and 
> modifying its code to update an attribute with query result)?
>  
> Thanks,
> Manish
>  


Re: Processor to enrich attribute from external service

2016-09-02 Thread Joe Witt
You would need to make a custom processor for now.  I think we should have a
nice controller service to generalize JDBC lookups which supports caching.
And then a processor which leverages it.

This comes up fairly often and is pretty straightforward from a design
POV.  Anyone want to take a stab at this?
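
Purely as a strawman for discussion, the contract of such a controller service
could be as small as the sketch below (names and shape are illustrative only;
nothing like this exists in NiFi today). A JDBC-backed implementation would
then hide the connection pool and caching properties behind it, and the
enrichment processor would only ever see the interface:

import org.apache.nifi.controller.ControllerService;

import java.util.Map;
import java.util.Optional;

// Hypothetical lookup contract; a concrete service would add properties for
// the DBCP pool, the lookup query and the cache settings.
public interface LookupService extends ControllerService {

    /**
     * Resolve a single key against the backing store. An empty Optional
     * means "no match", letting the calling processor route the flow file
     * to an "unmatched" relationship.
     */
    Optional<String> lookup(String key);

    /**
     * Variant for composite keys, e.g. several columns of a lookup table.
     */
    Optional<String> lookup(Map<String, String> coordinates);
}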

On Sep 2, 2016 4:47 PM, "Manish Gupta 8"  wrote:

> Hello Everyone,
>
>
>
> Is there a processor that we can use for updating/adding an attribute of
> an incoming flow file from some external service (say MongoDB or Couchbase
> or any RDBMS)? The processor will use the attribute of incoming flow file,
> query the external service, and simply modify/add an additional attribute
> of flow-file (without touching the flow file content).
>
>
>
> If we have to achieve this kind of “lookup” operation (but only to update
> attribute and not the content), what are the options in NiFi?
>
> Should we create a custom processor (may be by taking GetMongo processor
> and modifying its code to update an attribute with query result)?
>
>
>
> Thanks,
>
> Manish
>
>
>