Re: Record-oriented DetectDuplicate?
Andrew, Mark, etc.

A new contributor alerted me on Jira that he did his own take on this processor. I encouraged him to join the dev list so we can discuss the use case in more depth and sort out what is the best way forward. See:

https://issues.apache.org/jira/browse/NIFI-6047

I'll give him a little while to join and announce he's ready to go over it before I move forward with a discussion on this.

On Sat, Feb 9, 2019 at 12:34 PM Mike Thomsen wrote:
> PR if anyone is interested:
>
> https://github.com/apache/nifi/pull/3298
Re: Record-oriented DetectDuplicate?
PR if anyone is interested:

https://github.com/apache/nifi/pull/3298

On Fri, Feb 8, 2019 at 5:34 PM Mike Thomsen wrote:
> With Redis and HBase you can set a TTL on the data itself in the lookup
> table. Were you thinking something more than that?
Re: Record-oriented DetectDuplicate?
With Redis and HBase you can set a TTL on the data itself in the lookup table. Were you thinking something more than that?

On Fri, Feb 8, 2019 at 4:42 PM Andrew Grande wrote:
> Can I suggest a time-based option for specifying the window? I think we
> only mentioned the number of records.
>
> Andrew
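
To make the TTL idea above concrete, here is a minimal in-memory sketch of a time-based duplicate window. It is not how Redis, HBase, or any NiFi cache service works internally; the class name `TtlDuplicateWindow`, the injectable clock, and the choice not to refresh the TTL on a duplicate hit are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch: a local map stands in for a lookup table that supports TTLs. */
class TtlDuplicateWindow {
    // Maps each seen key to the time (in milliseconds) at which it expires.
    private final Map<String, Long> expiryByKey = new HashMap<>();
    private final long ttlMillis;
    // The clock is injected so the window can be exercised without sleeping.
    private final LongSupplier clock;

    TtlDuplicateWindow(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns true if the key was seen within the last ttlMillis milliseconds. */
    boolean isDuplicate(String key) {
        long now = clock.getAsLong();
        // Evict expired entries; a TTL-capable store would do this on its own.
        expiryByKey.values().removeIf(expiry -> expiry <= now);
        if (expiryByKey.containsKey(key)) {
            return true; // Assumption: a duplicate hit does not extend the TTL.
        }
        expiryByKey.put(key, now + ttlMillis);
        return false;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        TtlDuplicateWindow window = new TtlDuplicateWindow(1000L, () -> now[0]);
        System.out.println(window.isDuplicate("order-42")); // first sighting
        now[0] = 500L;
        System.out.println(window.isDuplicate("order-42")); // still inside the window
        now[0] = 1000L;
        System.out.println(window.isDuplicate("order-42")); // entry has expired
    }
}
```

With a store-side TTL the eviction step disappears from the processor, which is presumably why a TTL on the lookup table already covers the time-based case.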
Re: Record-oriented DetectDuplicate?
Can I suggest a time-based option for specifying the window? I think we only mentioned the number of records.

Andrew

On Fri, Feb 8, 2019, 8:22 AM Mike Thomsen wrote:
> Thanks. That answers it succinctly for me. I'll build out a
> DetectDuplicateRecord processor to handle this.
Re: Record-oriented DetectDuplicate?
Thanks. That answers it succinctly for me. I'll build out a DetectDuplicateRecord processor to handle this.

On Fri, Feb 8, 2019 at 11:17 AM Mark Payne wrote:
> Matt,
>
> That would work if you want to select distinct records in a given FlowFile
> but not across FlowFiles.
> PartitionRecord -> UpdateAttribute (optionally to combine multiple
> attributes into one) -> DetectDuplicate
> would work, but given that you expect the records to be unique generally,
> this would have the effect of
> splitting each FlowFile into Record-per-FlowFile, which is certainly not
> ideal.
>
> Thanks
> -Mark
Re: Record-oriented DetectDuplicate?
Matt,

That would work if you want to select distinct records in a given FlowFile but not across FlowFiles.

PartitionRecord -> UpdateAttribute (optionally to combine multiple attributes into one) -> DetectDuplicate would work, but given that you expect the records to be unique generally, this would have the effect of splitting each FlowFile into Record-per-FlowFile, which is certainly not ideal.

Thanks
-Mark

On Feb 8, 2019, at 11:14 AM, Matt Burgess wrote:
> Mike,
>
> I don't think so, but you could try a SELECT DISTINCT in QueryRecord,
> might be a bit of a pain if you want to select all columns and there
> are lots of them.
>
> Alternatively you could try PartitionRecord -> QueryRecord (select *
> limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd
> likely need to use distributed cache or UpdateAttribute.
>
> Regards,
> Matt
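
The distinction drawn above, distinct within one FlowFile versus distinct across FlowFiles, can be illustrated with plain Java collections. This is only an analogy: lists of strings stand in for the records of a FlowFile, a shared `Set` stands in for external state such as a distributed cache, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: each List<String> plays the role of one FlowFile's records. */
class DedupScopeSketch {
    /** Per-batch dedup, comparable to SELECT DISTINCT over a single FlowFile. */
    static List<String> distinctWithin(List<String> batch) {
        // LinkedHashSet drops repeats while keeping first-seen order.
        return new ArrayList<>(new LinkedHashSet<>(batch));
    }

    /** Cross-batch dedup: the shared set remembers values from earlier batches. */
    static List<String> distinctAcross(List<String> batch, Set<String> shared) {
        List<String> kept = new ArrayList<>();
        for (String record : batch) {
            // add() returns false when the value was already seen in any batch.
            if (shared.add(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> first = List.of("a", "b", "a");
        List<String> second = List.of("b", "c");
        Set<String> shared = new HashSet<>();
        System.out.println(distinctWithin(second));          // "b" survives: no memory of the first batch
        System.out.println(distinctAcross(first, shared));
        System.out.println(distinctAcross(second, shared));  // "b" is dropped: seen in the first batch
    }
}
```

Note that the cross-batch version keeps every surviving record in the same output list, so it does not force the one-record-per-FlowFile split mentioned above.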
Re: Record-oriented DetectDuplicate?
We do not. I've thought about it, but I have not had a chance to put any work towards it.

My vision of how it would work would be to allow the user to specify N RecordPath values as user-defined properties. Those values would then be extracted, and another Record would be considered a 'duplicate' if all of its RecordPaths evaluated to the same values.

However, we then have to be rather careful, because this can certainly be sensitive data that is stored in a DistributedMapCache or something of the sort, so we'll have to ensure that we support secure comms well and document this.

On Feb 8, 2019, at 10:57 AM, Mike Thomsen wrote:
> Do we have anything like DetectDuplicate for the Record API already? Didn't
> see anything, but wanted to ask before reinventing the wheel.
>
> Thanks,
>
> Mike
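
The design described above can be sketched roughly as follows. This is not the code from the PR and does not use the NiFi Record or RecordPath APIs: a plain `Map` stands in for a Record, field names stand in for RecordPath expressions, and a local `Set` stands in for the cache. Hashing the extracted values with SHA-256 before storing them is one possible way to limit exposure of sensitive data; it is an assumption added here, not something proposed in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; a Map stands in for a Record and a Set for the cache. */
class RecordDuplicateSketch {
    // Stand-in for a cache shared across FlowFiles.
    private final Set<String> seenKeys = new HashSet<>();
    // Stand-in for the user-defined RecordPath properties: plain field names here.
    private final List<String> keyFields;

    RecordDuplicateSketch(List<String> keyFields) {
        this.keyFields = keyFields;
    }

    /** Builds a SHA-256 hex digest over the selected values so raw values are not stored. */
    static String cacheKey(Map<String, Object> record, List<String> keyFields) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String field : keyFields) {
                byte[] bytes = String.valueOf(record.get(field)).getBytes(StandardCharsets.UTF_8);
                // A length prefix keeps ("ab", "c") distinct from ("a", "bc").
                digest.update(Integer.toString(bytes.length).getBytes(StandardCharsets.UTF_8));
                digest.update((byte) ':');
                digest.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Returns true if a record with the same key field values was seen before. */
    boolean isDuplicate(Map<String, Object> record) {
        return !seenKeys.add(cacheKey(record, keyFields));
    }

    public static void main(String[] args) {
        RecordDuplicateSketch detector = new RecordDuplicateSketch(List.of("id", "email"));
        Map<String, Object> first = Map.of("id", 1, "email", "a@example.org", "note", "first");
        Map<String, Object> second = Map.of("id", 1, "email", "a@example.org", "note", "second");
        System.out.println(detector.isDuplicate(first));  // key not seen yet
        System.out.println(detector.isDuplicate(second)); // same id and email as the first record
    }
}
```

Only the fields named as keys take part in the comparison, so the two records in `main` count as duplicates even though their `note` values differ.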
Re: Record-oriented DetectDuplicate?
Mike,

I don't think so, but you could try a SELECT DISTINCT in QueryRecord, might be a bit of a pain if you want to select all columns and there are lots of them.

Alternatively you could try PartitionRecord -> QueryRecord (select * limit 1). Neither PartitionRecord nor QueryRecord keeps state so you'd likely need to use distributed cache or UpdateAttribute.

Regards,
Matt

On Fri, Feb 8, 2019 at 11:08 AM Mike Thomsen wrote:
> Do we have anything like DetectDuplicate for the Record API already? Didn't
> see anything, but wanted to ask before reinventing the wheel.
>
> Thanks,
>
> Mike
Record-oriented DetectDuplicate?
Do we have anything like DetectDuplicate for the Record API already? Didn't see anything, but wanted to ask before reinventing the wheel.

Thanks,

Mike