Just some info about DISTINCT. In MySQL, a UNION is much, much faster than a
DISTINCT. DISTINCT creates a new temp table with the result of the
query, sorts it, and removes duplicates.
If you make a UNION with a select with id=-1, the result is exactly the same:
all duplicates are removed. A DISTINCT that takes 2 min. 45 sec. takes only
about 15 sec. with a UNION.
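
A minimal sketch of the UNION trick, using Python's built-in sqlite3 module rather than MySQL, so it only illustrates that the two queries return the same rows and says nothing about the timing difference. The table name `records`, its columns, and the sample rows are hypothetical, and reading "select id=-1" as a second SELECT with `WHERE id = -1` that matches no rows is an assumption.

```python
import sqlite3


def distinct_ids(conn):
    """Return the unique ids using SELECT DISTINCT."""
    rows = conn.execute("SELECT DISTINCT id FROM records")
    return sorted(row[0] for row in rows)


def union_ids(conn):
    """Return the unique ids using UNION with a SELECT that matches nothing."""
    # UNION (without ALL) removes duplicate rows from the combined result.
    # The second SELECT contributes no rows, assuming no record has id = -1.
    sql = (
        "SELECT id FROM records "
        "UNION "
        "SELECT id FROM records WHERE id = -1"
    )
    return sorted(row[0] for row in conn.execute(sql))


# Hypothetical sample data containing duplicate rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")],
)

print(distinct_ids(conn))
print(union_ids(conn))
```

Both calls should print the same list of ids. Whether the UNION form is actually faster depends on the database engine and would need to be measured there.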
I don't know which engine NiFi uses.

Kind regards,
Jens M. Kofoed

On Sat, Aug 15, 2020 at 18:08, Matt Burgess <mattyb...@apache.org> wrote:

> In addition to the SO answer, if you know all the fields in the
> record, you can use QueryRecord with SELECT DISTINCT field1,field2...
> FROM FLOWFILE. The SO answer might be more performant but is more
> complex, and QueryRecord will do the operations in-memory so it might
> not handle very large flowfiles.
>
> The current pull request for the Jira has not been active and is not
> in mergeable shape. Perhaps I'll get some time to pick it up and get
> it across the finish line :)
>
> Regards,
> Matt
>
> On Sat, Aug 15, 2020 at 11:47 AM Josh Friberg-Wyckoff
> <j...@thefribergs.com> wrote:
> >
> > Gosh, I should search the NiFi resources first. They have a current JIRA for what you are wanting.
> > https://issues.apache.org/jira/browse/NIFI-6047
> >
> > On Sat, Aug 15, 2020 at 10:35 AM Josh Friberg-Wyckoff <j...@thefribergs.com> wrote:
> >>
> >> This looks interesting as well.
> >> https://stackoverflow.com/questions/52674532/remove-duplicates-in-nifi
> >>
> >> On Sat, Aug 15, 2020 at 10:23 AM Josh Friberg-Wyckoff <j...@thefribergs.com> wrote:
> >>>
> >>> In theory, I would think you could use ExecuteStreamCommand to run the operating system's built-in sort command to grab unique records. The Windows sort command has an undocumented unique option. The sort command on Linux distros also has a unique option.
> >>>
> >>> On Sat, Aug 15, 2020 at 5:53 AM Robert R. Bruno <rbru...@gmail.com> wrote:
> >>>>
> >>>> I wanted to see if anyone knows of a clever way to detect duplicate records, much like you can with entire flow files using DetectDuplicate. I'd really rather not have to split my records into individual flow files, since this flow is such high volume.
> >>>>
> >>>> Thanks so much in advance.
>
