Re: Record-oriented DetectDuplicate?

2019-02-16 Thread Mike Thomsen
Andrew, Mark, etc.

A new contributor alerted me on Jira that he did his own take on this
processor. I encouraged him to join the dev list so we can discuss the use
case in more depth and sort out the best way forward.

See https://issues.apache.org/jira/browse/NIFI-6047

I'll give him a little while to join and announce he's ready to go over it
before I move forward with a discussion on this.

On Sat, Feb 9, 2019 at 12:34 PM Mike Thomsen  wrote:

> PR if anyone is interested:
>
> https://github.com/apache/nifi/pull/3298
>
> On Fri, Feb 8, 2019 at 5:34 PM Mike Thomsen wrote:
>
>> With Redis and HBase you can set a TTL on the data itself in the lookup
>> table. Were you thinking something more than that?
>>
>> On Fri, Feb 8, 2019 at 4:42 PM Andrew Grande  wrote:
>>
>>> Can I suggest a time-based option for specifying the window? I think we
>>> only mentioned the number of records.
>>>
>>> Andrew
>>>
>>> On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen wrote:
>>>
>>>> Thanks. That answers it succinctly for me. I'll build out a
>>>> DetectDuplicateRecord processor to handle this.
>>>>
>>>> On Fri, Feb 8, 2019 at 11:17 AM Mark Payne wrote:
>>>>
>>>>> Matt,
>>>>>
>>>>> That would work if you want to select distinct records in a given
>>>>> FlowFile but not across FlowFiles.
>>>>> PartitionRecord -> UpdateAttribute (optionally to combine multiple
>>>>> attributes into one) -> DetectDuplicate
>>>>> would work, but given that you expect the records to be unique
>>>>> generally, this would have the effect of
>>>>> splitting each FlowFile into Record-per-FlowFile, which is certainly
>>>>> not ideal.
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>>
>>>>> > On Feb 8, 2019, at 11:14 AM, Matt Burgess wrote:
>>>>> >
>>>>> > Mike,
>>>>> >
>>>>> > I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
>>>>> > might be a bit of a pain if you want to select all columns and there
>>>>> > are lots of them.
>>>>> >
>>>>> > Alternatively you could try PartitionRecord -> QueryRecord (select *
>>>>> > limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
>>>>> > likely need to use distributed cache or UpdateAttribute.
>>>>> >
>>>>> > Regards,
>>>>> > Matt
>>>>> >
>>>>> > On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
>>>>> >>
>>>>> >> Do we have anything like DetectDuplicate for the Record API already?
>>>>> >> Didn't see anything, but wanted to ask before reinventing the wheel.
>>>>> >>
>>>>> >> Thanks,
>>>>> >>
>>>>> >> Mike
>>>>>
>>>>>


Re: Record-oriented DetectDuplicate?

2019-02-09 Thread Mike Thomsen
PR if anyone is interested:

https://github.com/apache/nifi/pull/3298

On Fri, Feb 8, 2019 at 5:34 PM Mike Thomsen  wrote:

> With Redis and HBase you can set a TTL on the data itself in the lookup
> table. Were you thinking something more than that?
>
> On Fri, Feb 8, 2019 at 4:42 PM Andrew Grande  wrote:
>
>> Can I suggest a time-based option for specifying the window? I think we
>> only mentioned the number of records.
>>
>> Andrew
>>
>> On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen  wrote:
>>
>>> Thanks. That answers it succinctly for me. I'll build out a
>>> DetectDuplicateRecord processor to handle this.
>>>
>>> On Fri, Feb 8, 2019 at 11:17 AM Mark Payne  wrote:
>>>
>>>> Matt,
>>>>
>>>> That would work if you want to select distinct records in a given
>>>> FlowFile but not across FlowFiles.
>>>> PartitionRecord -> UpdateAttribute (optionally to combine multiple
>>>> attributes into one) -> DetectDuplicate
>>>> would work, but given that you expect the records to be unique
>>>> generally, this would have the effect of
>>>> splitting each FlowFile into Record-per-FlowFile, which is certainly
>>>> not ideal.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>>
>>>> > On Feb 8, 2019, at 11:14 AM, Matt Burgess wrote:
>>>> >
>>>> > Mike,
>>>> >
>>>> > I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
>>>> > might be a bit of a pain if you want to select all columns and there
>>>> > are lots of them.
>>>> >
>>>> > Alternatively you could try PartitionRecord -> QueryRecord (select *
>>>> > limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
>>>> > likely need to use distributed cache or UpdateAttribute.
>>>> >
>>>> > Regards,
>>>> > Matt
>>>> >
>>>> > On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
>>>> >>
>>>> >> Do we have anything like DetectDuplicate for the Record API already?
>>>> >> Didn't see anything, but wanted to ask before reinventing the wheel.
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> Mike




Re: Record-oriented DetectDuplicate?

2019-02-08 Thread Mike Thomsen
With Redis and HBase you can set a TTL on the data itself in the lookup
table. Were you thinking something more than that?
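
For illustration, the TTL idea can be sketched with an in-memory stand-in
for the lookup table. This is not Redis, HBase, or NiFi code: the class
name TtlSeenCache and its method are hypothetical, and the clock is passed
in only so the expiry behaviour is easy to follow. A key counts as a
duplicate while its entry is still alive, and as new again once the entry
has expired.

```python
import time
from typing import Callable, Dict, Hashable


class TtlSeenCache:
    """Hypothetical in-memory stand-in for a lookup table whose entries expire."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps each key to the time at which its entry stops counting.
        self._expires_at: Dict[Hashable, float] = {}

    def seen_before(self, key: Hashable) -> bool:
        """Return True if key was recorded within the TTL window.

        If the key is new, or its earlier entry has expired, it is recorded
        with a fresh expiry time and False is returned.
        """
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return True
        self._expires_at[key] = now + self._ttl
        return False
```

In this sketch a repeated sighting does not extend the expiry, so the window
is measured from the first time a key is seen. Whether a real cache should
refresh the TTL on every hit is a separate design choice.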

On Fri, Feb 8, 2019 at 4:42 PM Andrew Grande  wrote:

> Can I suggest a time-based option for specifying the window? I think we
> only mentioned the number of records.
>
> Andrew
>
> On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen  wrote:
>
>> Thanks. That answers it succinctly for me. I'll build out a
>> DetectDuplicateRecord processor to handle this.
>>
>> On Fri, Feb 8, 2019 at 11:17 AM Mark Payne  wrote:
>>
>>> Matt,
>>>
>>> That would work if you want to select distinct records in a given
>>> FlowFile but not across FlowFiles.
>>> PartitionRecord -> UpdateAttribute (optionally to combine multiple
>>> attributes into one) -> DetectDuplicate
>>> would work, but given that you expect the records to be unique
>>> generally, this would have the effect of
>>> splitting each FlowFile into Record-per-FlowFile, which is certainly not
>>> ideal.
>>>
>>> Thanks
>>> -Mark
>>>
>>>
>>> > On Feb 8, 2019, at 11:14 AM, Matt Burgess wrote:
>>> >
>>> > Mike,
>>> >
>>> > I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
>>> > might be a bit of a pain if you want to select all columns and there
>>> > are lots of them.
>>> >
>>> > Alternatively you could try PartitionRecord -> QueryRecord (select *
>>> > limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
>>> > likely need to use distributed cache or UpdateAttribute.
>>> >
>>> > Regards,
>>> > Matt
>>> >
>>> > On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
>>> >>
>>> >> Do we have anything like DetectDuplicate for the Record API already?
>>> >> Didn't see anything, but wanted to ask before reinventing the wheel.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Mike
>>>
>>>


Re: Record-oriented DetectDuplicate?

2019-02-08 Thread Andrew Grande
Can I suggest a time-based option for specifying the window? I think we
only mentioned the number of records.

Andrew

On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen  wrote:

> Thanks. That answers it succinctly for me. I'll build out a
> DetectDuplicateRecord processor to handle this.
>
> On Fri, Feb 8, 2019 at 11:17 AM Mark Payne  wrote:
>
>> Matt,
>>
>> That would work if you want to select distinct records in a given
>> FlowFile but not across FlowFiles.
>> PartitionRecord -> UpdateAttribute (optionally to combine multiple
>> attributes into one) -> DetectDuplicate
>> would work, but given that you expect the records to be unique generally,
>> this would have the effect of
>> splitting each FlowFile into Record-per-FlowFile, which is certainly not
>> ideal.
>>
>> Thanks
>> -Mark
>>
>>
>> > On Feb 8, 2019, at 11:14 AM, Matt Burgess  wrote:
>> >
>> > Mike,
>> >
>> > I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
>> > might be a bit of a pain if you want to select all columns and there
>> > are lots of them.
>> >
>> > Alternatively you could try PartitionRecord -> QueryRecord (select *
>> > limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
>> > likely need to use distributed cache or UpdateAttribute.
>> >
>> > Regards,
>> > Matt
>> >
>> > On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
>> >>
>> >> Do we have anything like DetectDuplicate for the Record API already?
>> >> Didn't see anything, but wanted to ask before reinventing the wheel.
>> >>
>> >> Thanks,
>> >>
>> >> Mike
>>
>>


Re: Record-oriented DetectDuplicate?

2019-02-08 Thread Mike Thomsen
Thanks. That answers it succinctly for me. I'll build out a
DetectDuplicateRecord processor to handle this.

On Fri, Feb 8, 2019 at 11:17 AM Mark Payne  wrote:

> Matt,
>
> That would work if you want to select distinct records in a given FlowFile
> but not across FlowFiles.
> PartitionRecord -> UpdateAttribute (optionally to combine multiple
> attributes into one) -> DetectDuplicate
> would work, but given that you expect the records to be unique generally,
> this would have the effect of
> splitting each FlowFile into Record-per-FlowFile, which is certainly not
> ideal.
>
> Thanks
> -Mark
>
>
> > On Feb 8, 2019, at 11:14 AM, Matt Burgess  wrote:
> >
> > Mike,
> >
> > I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
> > might be a bit of a pain if you want to select all columns and there
> > are lots of them.
> >
> > Alternatively you could try PartitionRecord -> QueryRecord (select *
> > limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
> > likely need to use distributed cache or UpdateAttribute.
> >
> > Regards,
> > Matt
> >
> > On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
> >>
> >> Do we have anything like DetectDuplicate for the Record API already?
> >> Didn't see anything, but wanted to ask before reinventing the wheel.
> >>
> >> Thanks,
> >>
> >> Mike
>
>


Re: Record-oriented DetectDuplicate?

2019-02-08 Thread Mark Payne
Matt,

That would work if you want to select distinct records in a given FlowFile but
not across FlowFiles.
PartitionRecord -> UpdateAttribute (optionally to combine multiple attributes
into one) -> DetectDuplicate would work, but given that you expect the records
to be generally unique, this would have the effect of splitting each FlowFile
into one Record per FlowFile, which is certainly not ideal.
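
To make the within-versus-across distinction concrete, here is a small
sketch in plain Python rather than NiFi code. Each inner list stands in for
the records of one FlowFile, and distinct_within and distinct_across are
hypothetical helper names. The first forgets everything between batches; the
second shares one set across batches, which plays the role of external state
such as a cache.

```python
from typing import Iterable, List, Set


def distinct_within(batch: Iterable[str]) -> List[str]:
    """Drop repeats inside one batch only; nothing is remembered afterwards."""
    seen: Set[str] = set()
    kept: List[str] = []
    for record in batch:
        if record not in seen:
            seen.add(record)
            kept.append(record)
    return kept


def distinct_across(batches: Iterable[Iterable[str]]) -> List[List[str]]:
    """Drop repeats using one set that is shared by every batch."""
    seen: Set[str] = set()
    result: List[List[str]] = []
    for batch in batches:
        kept: List[str] = []
        for record in batch:
            if record not in seen:
                seen.add(record)
                kept.append(record)
        result.append(kept)
    return result


flowfiles = [["a", "b", "a"], ["b", "c"]]
per_file = [distinct_within(batch) for batch in flowfiles]
shared = distinct_across(flowfiles)
```

With the per-batch approach, "b" survives in both batches; with the shared
set it is kept only the first time it appears.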

Thanks
-Mark


> On Feb 8, 2019, at 11:14 AM, Matt Burgess  wrote:
> 
> Mike,
> 
> I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
> might be a bit of a pain if you want to select all columns and there
> are lots of them.
> 
> Alternatively you could try PartitionRecord -> QueryRecord (select *
> limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
> likely need to use distributed cache or UpdateAttribute.
> 
> Regards,
> Matt
> 
> On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen  wrote:
>> 
>> Do we have anything like DetectDuplicate for the Record API already? Didn't 
>> see anything, but wanted to ask before reinventing the wheel.
>> 
>> Thanks,
>> 
>> Mike



Re: Record-oriented DetectDuplicate?

2019-02-08 Thread Mark Payne
We do not. I've thought about it, but I have not had a chance to put any work
towards it. My vision of how it would work would be to allow the user to
specify any number of RecordPath values as user-defined properties. Those
values would then be extracted, and another Record would be considered a
'duplicate' if all of its RecordPaths evaluated to the same values. However, we
then have to be rather careful, because this can certainly be sensitive data
that is stored in a DistributedMapCache or something of the sort, so we'll have
to ensure that we support secure comms well and document this.
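
A rough sketch of that idea, in plain Python rather than against the NiFi
API, might look like the following. Everything here is an assumption made
for illustration: dotted paths such as "user.id" stand in for RecordPath
expressions (this is not RecordPath syntax), nested dicts stand in for
Records, and a plain set stands in for the DistributedMapCache. The function
names extract and split_duplicates are hypothetical.

```python
import json
from typing import Any, Dict, List, Sequence, Set, Tuple


def extract(record: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if it is absent."""
    current: Any = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def split_duplicates(
    records: Sequence[Dict[str, Any]],
    paths: Sequence[str],
    cache: Set[Tuple[str, ...]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into (unique, duplicate) based on the configured paths.

    A record is a duplicate when every path evaluates to the same value as
    in some record already noted in the cache. Other fields are ignored.
    """
    unique: List[Dict[str, Any]] = []
    duplicate: List[Dict[str, Any]] = []
    for record in records:
        # Serialise each extracted value so nested values can be compared too.
        key = tuple(
            json.dumps(extract(record, p), sort_keys=True, default=str)
            for p in paths
        )
        if key in cache:
            duplicate.append(record)
        else:
            cache.add(key)
            unique.append(record)
    return unique, duplicate
```

Because the cache is passed in, the same set can be reused for later batches,
which is what makes the check work across FlowFiles. Note that the keys hold
the extracted values themselves, which is exactly why the storage and
transport of the cache need care when the data is sensitive.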


> On Feb 8, 2019, at 10:57 AM, Mike Thomsen  wrote:
> 
> Do we have anything like DetectDuplicate for the Record API already? Didn't 
> see anything, but wanted to ask before reinventing the wheel.
> 
> Thanks,
> 
> Mike



Re: Record-oriented DetectDuplicate?

2019-02-08 Thread Matt Burgess
Mike,

I don't think so, but you could try a SELECT DISTINCT in QueryRecord. It
might be a bit of a pain if you want to select all columns and there
are lots of them.

Alternatively you could try PartitionRecord -> QueryRecord (select *
limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
likely need to use distributed cache or UpdateAttribute.
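
As a small illustration of what SELECT DISTINCT does to one batch of rows,
here is a sketch that uses Python's built-in sqlite3 module as a stand-in.
QueryRecord itself is not involved, and the table name "records" and its
columns are made up for the example.

```python
import sqlite3

# In-memory database holding one batch of rows, one of them repeated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO records (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (1, "alice")],
)

# Each column that takes part in the comparison is listed explicitly.
rows = conn.execute(
    "SELECT DISTINCT id, name FROM records ORDER BY id"
).fetchall()
conn.close()
```

The query collapses the repeated row within this one table, but it has no
memory of rows from any earlier batch.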

Regards,
Matt

On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen  wrote:
>
> Do we have anything like DetectDuplicate for the Record API already? Didn't 
> see anything, but wanted to ask before reinventing the wheel.
>
> Thanks,
>
> Mike


Record-oriented DetectDuplicate?

2019-02-08 Thread Mike Thomsen
Do we have anything like DetectDuplicate for the Record API already? Didn't
see anything, but wanted to ask before reinventing the wheel.

Thanks,

Mike