Sorry, I should have been clearer.  My need is to detect whether each record
has been seen in the past.  So I need a solution that can go record by
record against something like a Redis cache, which would tell me whether
this is the first time the record was seen and update the cache
accordingly.  I'm guessing nothing like that exists for records at this point?

We've used DetectDuplicate to do this for entire flow files, but have the
need to do this per record with a preference of not splitting the flow
files.
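
To make the ask concrete, here is a rough sketch of the per-record check I
have in mind. It is only an illustration, not an existing NiFi processor:
SeenCache, record_key and tag_records are hypothetical names, and the
in-memory set stands in for an external cache. With Redis, I'd assume an
atomic "set if not exists" (e.g. SET key value NX) could play the role of
add_if_absent.

```python
import hashlib
import json


class SeenCache:
    """In-memory stand-in for an external cache such as Redis (hypothetical)."""

    def __init__(self):
        self._keys = set()

    def add_if_absent(self, key):
        """Store the key and return True if it was new, else return False."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def record_key(record):
    """Hash the record's canonical JSON form so field order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tag_records(records, cache):
    """Yield (record, first_seen) pairs, updating the cache along the way."""
    for record in records:
        yield record, cache.add_if_absent(record_key(record))


if __name__ == "__main__":
    cache = SeenCache()
    # Two "flow files" processed record by record without being split.
    batch_one = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"v": "a", "id": 1}]
    batch_two = [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
    print([first for _, first in tag_records(batch_one, cache)])
    print([first for _, first in tag_records(batch_two, cache)])
```

The point is that the whole batch stays together and each record just gets
a first-seen flag, which could then drive routing or filtering downstream.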

Thanks all!

On Sat, Aug 15, 2020, 13:38 Jens M. Kofoed <jmkofoed....@gmail.com> wrote:

> Just some info about DISTINCT. In MySQL a union is much, much faster than a
> DISTINCT. The DISTINCT creates a new temp table with the result of the
> query, then sorts it and removes duplicates.
> If you make a union with a select id=-1, the result is exactly the same.
> All duplicates are removed. A DISTINCT that takes 2 min. and 45 sec. takes
> only about 15 sec. with a union.
> kind regards.
>
> I don't know which SQL engine NiFi uses.
> Jens M. Kofoed
>
> On Sat, Aug 15, 2020 at 18:08, Matt Burgess <mattyb...@apache.org>
> wrote:
>
>> In addition to the SO answer, if you know all the fields in the
>> record, you can use QueryRecord with SELECT DISTINCT field1,field2...
>> FROM FLOWFILE. The SO answer might be more performant but is more
>> complex, and QueryRecord will do the operations in-memory so it might
>> not handle very large flowfiles.
>>
>> The current pull request for the Jira has not been active and is not
>> in mergeable shape; perhaps I'll get some time to pick it up and get
>> it across the finish line :)
>>
>> Regards,
>> Matt
>>
>> On Sat, Aug 15, 2020 at 11:47 AM Josh Friberg-Wyckoff
>> <j...@thefribergs.com> wrote:
>> >
>> > Gosh, I should have searched the NiFi resources first.  They have a
>> current JIRA for what you want.
>> > https://issues.apache.org/jira/browse/NIFI-6047
>> >
>> > On Sat, Aug 15, 2020 at 10:35 AM Josh Friberg-Wyckoff <
>> j...@thefribergs.com> wrote:
>> >>
>> >> This looks interesting as well.
>> >> https://stackoverflow.com/questions/52674532/remove-duplicates-in-nifi
>> >>
>> >> On Sat, Aug 15, 2020 at 10:23 AM Josh Friberg-Wyckoff <
>> j...@thefribergs.com> wrote:
>> >>>
>> >>> In theory, I would think you could use ExecuteStreamCommand to run
>> the built-in operating system sort commands to grab unique records.  The
>> Windows sort command has an undocumented unique option.  The sort command
>> on Linux distros also has a unique option.
>> >>>
>> >>> On Sat, Aug 15, 2020 at 5:53 AM Robert R. Bruno <rbru...@gmail.com>
>> wrote:
>> >>>>
>> >>>> I wanted to see if anyone knew of a clever way to detect
>> duplicate records, much like you can with entire flow files using
>> DetectDuplicate.  I'd really rather not have to split my records into
>> individual flow files, since this flow is such high volume.
>> >>>>
>> >>>> Thanks so much in advance.
>>
>
