Re: Record-oriented DetectDuplicate?

Mike Thomsen Sat, 16 Feb 2019 17:28:47 -0800

Andrew, Mark, etc.

A new contributor alerted me on Jira that he did his own take on this
processor. I encouraged him to join the dev list so we can discuss the use
case in more depth and sort out what is the best way forward.


See https://issues.apache.org/jira/browse/NIFI-6047

I'll give him a little while to join and announce he's ready to go over it
before I move forward with a discussion on this.

On Sat, Feb 9, 2019 at 12:34 PM Mike Thomsen <mikerthom...@gmail.com> wrote:

> PR if anyone is interested:
>
> https://github.com/apache/nifi/pull/3298
>
> On Fri, Feb 8, 2019 at 5:34 PM Mike Thomsen <mikerthom...@gmail.com>
> wrote:
>
>> With Redis and HBase you can set a TTL on the data itself in the lookup
>> table. Were you thinking something more than that?
>>
>> On Fri, Feb 8, 2019 at 4:42 PM Andrew Grande <apere...@gmail.com> wrote:
>>
>>> Can I suggest a time-based option for specifying the window? I think we
>>> only mentioned the number of records.
>>>
>>> Andrew
>>>
>>> On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen <mikerthom...@gmail.com>
>>> wrote:
>>>
>>>> Thanks. That answers it succinctly for me. I'll build out a
>>>> DetectDuplicateRecord processor to handle this.
>>>>
>>>> On Fri, Feb 8, 2019 at 11:17 AM Mark Payne <marka...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Matt,
>>>>>
>>>>> That would work if you want to select distinct records in a given
>>>>> FlowFile but not across FlowFiles.
>>>>> PartitionRecord -> UpdateAttribute (optionally to combine multiple
>>>>> attributes into one) -> DetectDuplicate
>>>>> would work, but given that you expect the records to be unique
>>>>> generally, this would have the effect of
>>>>> splitting each FlowFile into Record-per-FlowFile, which is certainly
>>>>> not ideal.
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>>
>>>>> > On Feb 8, 2019, at 11:14 AM, Matt Burgess <mattyb...@apache.org> wrote:
>>>>> >
>>>>> > Mike,
>>>>> >
>>>>> > I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
>>>>> > might be a bit of a pain if you want to select all columns and there
>>>>> > are lots of them.
>>>>> >
>>>>> > Alternatively you could try PartitionRecord -> QueryRecord (select *
>>>>> > limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
>>>>> > likely need to use distributed cache or UpdateAttribute.
>>>>> >
>>>>> > Regards,
>>>>> > Matt
>>>>> >
>>>>> > On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen <mikerthom...@gmail.com> wrote:
>>>>> >>
>>>>> >> Do we have anything like DetectDuplicate for the Record API already?
>>>>> >> Didn't see anything, but wanted to ask before reinventing the wheel.
>>>>> >>
>>>>> >> Thanks,
>>>>> >>
>>>>> >> Mike
>>>>>
>>>>>
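For readers following the thread, here is a minimal sketch of the idea under discussion: flag duplicate records across batches (standing in for FlowFiles) without splitting each batch into one record per FlowFile, and expire remembered keys after a TTL so the window is time-based. It is written in plain Python rather than against the NiFi API, and it is not the implementation in the PR or in NIFI-6047. The names SeenKeyCache, record_key and split_duplicates are hypothetical, and the in-memory dictionary only stands in for an external lookup table such as Redis or HBase.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]


class SeenKeyCache:
    """Hypothetical in-memory stand-in for a lookup table with a per-entry TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry: Dict[str, float] = {}  # key -> time at which it expires

    def add_if_absent(self, key: str) -> bool:
        """Store the key and return True if it is new or its entry has expired."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # still inside the window: this is a duplicate
        self._expiry[key] = now + self._ttl
        return True


def record_key(record: Record, fields: Sequence[str]) -> str:
    """Build a stable key from the chosen fields of a record."""
    joined = "\x1f".join(f"{name}={record.get(name)!r}" for name in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def split_duplicates(records: Iterable[Record], fields: Sequence[str],
                     cache: SeenKeyCache) -> Tuple[List[Record], List[Record]]:
    """Route each record of one batch to 'unique' or 'duplicates'."""
    unique: List[Record] = []
    duplicates: List[Record] = []
    for record in records:
        if cache.add_if_absent(record_key(record, fields)):
            unique.append(record)
        else:
            duplicates.append(record)
    return unique, duplicates
```

Because the cache outlives a single call, a key seen in one batch is still flagged in a later batch until its TTL lapses. The batch itself is never broken into single-record pieces.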
