Just some info about DISTINCT. In MySQL, a UNION can be much, much faster than a DISTINCT. DISTINCT creates a new temp table with the result of the query, sorts it, and removes the duplicates. If you instead make a UNION with a SELECT that returns no rows (e.g. WHERE id = -1), the result is exactly the same: all duplicates are removed, because UNION deduplicates by default. A DISTINCT that took 2 min. 45 sec. took only about 15 sec. as a UNION.

Kind regards.
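A minimal sketch of the semantics of that trick, using Python's stdlib sqlite3 as a stand-in (the speed difference above is about MySQL's engine, not shown here; the table `t` and the `id = -1` sentinel are hypothetical):

```python
import sqlite3

# sqlite3 stands in for MySQL here, only to show that UNION with an
# empty-result SELECT deduplicates exactly like DISTINCT does.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, val TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (3, "c")])

# Plain DISTINCT...
distinct_rows = sorted(con.execute("SELECT DISTINCT val FROM t"))
# ...versus UNION with a SELECT that matches no rows: UNION's implicit
# duplicate removal yields the same result set.
union_rows = sorted(con.execute(
    "SELECT val FROM t UNION SELECT val FROM t WHERE id = -1"))
print(distinct_rows == union_rows)  # True
```

Whether the UNION variant is actually faster depends on the engine's execution plan, so it is worth checking with EXPLAIN on your own data.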
I don't know which engine is in NiFi.

Jens M. Kofoed

On Sat, 15 Aug 2020 at 18:08, Matt Burgess <mattyb...@apache.org> wrote:
> In addition to the SO answer, if you know all the fields in the
> record, you can use QueryRecord with SELECT DISTINCT field1, field2...
> FROM FLOWFILE. The SO answer might be more performant but is more
> complex, and QueryRecord will do the operations in-memory, so it might
> not handle very large flowfiles.
>
> The current pull request for the Jira has not been active and is not
> in mergeable shape; perhaps I'll get some time to pick it up and get
> it across the finish line :)
>
> Regards,
> Matt
>
> On Sat, Aug 15, 2020 at 11:47 AM Josh Friberg-Wyckoff
> <j...@thefribergs.com> wrote:
> >
> > Gosh, I should search the NiFi resources first. There is a current
> > JIRA for what you are wanting:
> > https://issues.apache.org/jira/browse/NIFI-6047
> >
> > On Sat, Aug 15, 2020 at 10:35 AM Josh Friberg-Wyckoff
> > <j...@thefribergs.com> wrote:
> >>
> >> This looks interesting as well:
> >> https://stackoverflow.com/questions/52674532/remove-duplicates-in-nifi
> >>
> >> On Sat, Aug 15, 2020 at 10:23 AM Josh Friberg-Wyckoff
> >> <j...@thefribergs.com> wrote:
> >>>
> >>> In theory, I would think you could use ExecuteStreamCommand to run
> >>> the built-in operating-system sort commands and grab unique records.
> >>> The Windows sort command has an undocumented unique option, and the
> >>> sort command on Linux distros has a unique option as well.
> >>>
> >>> On Sat, Aug 15, 2020 at 5:53 AM Robert R. Bruno <rbru...@gmail.com>
> >>> wrote:
> >>>>
> >>>> I wanted to see if anyone knew if there is a clever way to detect
> >>>> duplicate records, much like you can with entire flow files with
> >>>> DetectDuplicate? I'd really rather not have to split my records
> >>>> into individual flow files, since this flow is such high volume.
> >>>>
> >>>> Thanks so much in advance.
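The `sort -u` idea in the thread boils down to emitting each record once. A minimal, order-preserving Python sketch of that per-record dedup (`sort -u` sorts first; a seen-set keeps input order — the record lines and the function name here are illustrative, not NiFi APIs):

```python
# Sketch of record-level dedup without splitting records into separate
# flowfiles: remember each record seen and emit only first occurrences.
def unique_records(lines):
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

records = ["a,1", "b,2", "a,1", "c,3", "b,2"]
deduped = list(unique_records(records))
print(deduped)  # ['a,1', 'b,2', 'c,3']
```

Note the seen-set grows with the number of distinct records, so for very high-volume flows `sort -u` (which streams to disk) may scale better.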