Re: Detect duplicate records

2020-08-16 Thread Joe Witt

I believe Robert's case is that he has records flowing through bundled in
flowfiles containing one or more of them at a time and he'd like to
understand on a per record level (regardless of the flowfile they're
contained in) whether that record has already been seen over some time
interval.

Wiring DetectDuplicate-style logic into an appropriate record processor
would be the best fit for this.  A scripted processor could be used today;
longer term we just need to add a DetectDuplicateRecord processor, or
possibly wire this into one of the existing processors.
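
As a rough illustration of the "already seen over some time interval" check, here is a minimal Python sketch. It is not NiFi code: the class name, the in-memory dictionary, and the injectable clock are all assumptions made for the example, and a real flow would more likely keep this state in a shared cache service.

```python
import time


class SeenWithinWindow:
    """Remembers record keys for a fixed time window (hypothetical helper)."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._first_seen = {}  # record key -> time the key was recorded

    def is_duplicate(self, key):
        """Return True if key was recorded within the window; otherwise record it."""
        now = self.clock()
        recorded_at = self._first_seen.get(key)
        if recorded_at is not None and now - recorded_at < self.window_seconds:
            return True
        # Either never seen, or seen so long ago that it counts as new again.
        self._first_seen[key] = now
        return False

    def evict_expired(self):
        """Drop keys older than the window so memory stays bounded."""
        now = self.clock()
        expired = [k for k, t in self._first_seen.items()
                   if now - t >= self.window_seconds]
        for key in expired:
            del self._first_seen[key]
```

The key passed in would typically be a record identifier or a hash of the record, and the window length is whatever interval the flow cares about.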

Thanks

Re: Detect duplicate records

2020-08-16 Thread Jens M. Kofoed

So Robert, to understand it correctly: you have a lot of records in one flow
file, and if a record has been seen before, that record should be removed?
If so, wouldn't that be a workflow that goes through all the records one by
one and joins the final result? You would first have to split out all the
records, check each one, and join the ones that remain, no matter whether
you do it inside or outside NiFi. Right?
Split records -> hash record -> detect duplicates -> merge records
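
The hash -> detect -> merge steps can be sketched in Python over a batch of records held in memory, without writing each record out as its own flow file. This is only an illustration: the function names are made up, records are assumed to be JSON-serializable dictionaries, and the seen-hash set stands in for whatever cache would be used in practice.

```python
import hashlib
import json


def record_hash(record):
    """Stable SHA-256 of a record; keys are sorted so field order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(records, seen_hashes):
    """Keep only records whose hash is not in seen_hashes (updated in place)."""
    unique = []
    for record in records:
        digest = record_hash(record)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(record)
    return unique
```

Passing the same `seen_hashes` set to successive calls carries the duplicate check across batches, which is the part a per-flow-file DISTINCT cannot do.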

Regards Jens. 

Re: Detect duplicate records

2020-08-15 Thread Robert R. Bruno

Yep, we were leaning towards offloading it to an external program and then
putting the data back into NiFi for final delivery.  Looks like that will be
best from the sounds of it.  Again, thanks all!

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

If that is the case and this is high volume like you say, I would think it
would be more efficient to offload the task to a separate program than to
have a NiFi processor do it.

Re: Detect duplicate records

2020-08-15 Thread Otto Fowler

I was working on something for this, but after discussion with some of the
SMEs on the project, I decided to shelve it.  I don't think I had gotten to
the point of filing a Jira.

https://apachenifi.slack.com/archives/C0L9S92JY/p1589911056303500

Re: Detect duplicate records

2020-08-15 Thread Robert R. Bruno

Sorry, I should have been clearer.  My need is to detect whether each record
has been seen in the past.  So I need a solution that can go record by
record against something like a Redis cache, which would tell me whether
this is the first time the record has been seen and update the cache
accordingly.  I'm guessing nothing like that exists for records at this point?

We've used DetectDuplicate to do this for entire flow files, but have the
need to do this per record with a preference of not splitting the flow
files.
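
To make the requirement concrete, here is a small Python sketch of the "check and update the cache in one step" behaviour, applied to all records of one flow file at once. The `InMemoryCache` class and its `add_if_absent` method are hypothetical stand-ins, not a Redis or NiFi API; with a real Redis cache the analogous atomic step would usually be a SET with the NX option, but that wiring is not shown here.

```python
class InMemoryCache:
    """Stand-in for an external cache; hypothetical interface for illustration."""

    def __init__(self):
        self._keys = set()

    def add_if_absent(self, key):
        """Store key and return True if it was new; return False if already present."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def partition_records(records, key_of, cache):
    """Split one batch into (first_seen, duplicates) in a single pass."""
    first_seen, duplicates = [], []
    for record in records:
        if cache.add_if_absent(key_of(record)):
            first_seen.append(record)
        else:
            duplicates.append(record)
    return first_seen, duplicates
```

Both output lists could then be written back as two flow files (for example, one routed onward and one routed to a duplicates relationship), so the records never need to be split individually.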

Thanks all!

Re: Detect duplicate records

2020-08-15 Thread Jens M. Kofoed

Just some info about DISTINCT. In MySQL a UNION is much, much faster than a
DISTINCT. The DISTINCT creates a new temp table with the result of the
query, sorting it and removing duplicates.
If you make a UNION with a select id=-1, the result is exactly the same:
all duplicates are removed. A DISTINCT which takes 2 min. and 45 sec. only
takes about 15 sec. with a UNION.
Kind regards.

I don't know which SQL engine NiFi uses.
Jens M. Kofoed
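
The equivalence described above (not the timing) can be checked with a short Python script. It uses SQLite from the standard library rather than MySQL, an invented `records` table, and it reads "a select id=-1" as a second SELECT whose `WHERE id = -1` matches no rows; all three are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO records (name, city) VALUES (?, ?)",
    [("ann", "oslo"), ("bob", "rome"), ("ann", "oslo"), ("bob", "rome"), ("cy", "lima")],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT name, city FROM records ORDER BY name"
).fetchall()

# UNION (without ALL) removes duplicates from the combined result; the second
# SELECT contributes nothing because no row has id = -1.
union_rows = conn.execute(
    "SELECT name, city FROM records "
    "UNION SELECT name, city FROM records WHERE id = -1 "
    "ORDER BY name"
).fetchall()

same_result = distinct_rows == union_rows
```

Whether the UNION form is actually faster depends on the engine and its query planner, so the speed-up reported for MySQL should be measured rather than assumed elsewhere.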

Re: Detect duplicate records

2020-08-15 Thread James McMahon

If you opt to try a few of these options, please tell us which appeared to
be the best from a performance perspective - with our understanding that
results may vary depending on the size of the incoming data. It would be
very interesting to learn what you found.

On Sat, Aug 15, 2020 at 6:53 AM Robert R. Bruno  wrote:

> I wanted to see if anyone knew is there a clever way to detect duplicate
> records much like you can with entire flow files with DetectDuplicate?  I'd
> really rather not have to split my records into individual flow files since
> this flow is such high volume.
>
> Thanks so much in advance.
>

Re: Detect duplicate records

2020-08-15 Thread Matt Burgess

In addition to the SO answer, if you know all the fields in the
record, you can use QueryRecord with SELECT DISTINCT field1,field2...
FROM FLOWFILE. The SO answer might be more performant but is more
complex, and QueryRecord will do the operations in-memory so it might
not handle very large flowfiles.
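
For readers unfamiliar with what a DISTINCT over named fields does to one flow file's records, here is a plain-Python sketch of the same idea. It is not how QueryRecord is implemented; the function name is made up, and the point is only to show why memory use grows with the number of distinct field combinations.

```python
def select_distinct(records, fields):
    """Project each record onto fields and keep the first of each combination."""
    seen = set()
    result = []
    for record in records:
        key = tuple(record.get(f) for f in fields)
        if key not in seen:
            seen.add(key)  # grows with the number of distinct combinations
            result.append(dict(zip(fields, key)))
    return result
```

Note that this only removes duplicates within a single batch; it does not remember records seen in earlier flow files.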

The current pull request for the Jira has not been active and is not
in mergeable shape; perhaps I'll get some time to pick it up and get
it across the finish line :)

Regards,
Matt

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

Gosh, I should have searched the NiFi resources first.  There is a current
Jira for what you are wanting.
https://issues.apache.org/jira/browse/NIFI-6047

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

This looks interesting as well.
https://stackoverflow.com/questions/52674532/remove-duplicates-in-nifi

Re: Detect duplicate records

2020-08-15 Thread Josh Friberg-Wyckoff

In theory, I would think you could use ExecuteStreamCommand with the
built-in operating system sort commands to grab unique records.  The Windows
sort command has an undocumented unique option, and the sort command on
Linux distros has a unique option as well.
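
As a sketch of what such a sort-and-unique step does to line-oriented records, here is a small Python equivalent that could be run as the external command instead. It assumes one record per line; the function names are made up, and the ordering is by Unicode code point, which may differ from the locale-dependent ordering of an OS sort.

```python
import sys


def sorted_unique_lines(lines):
    """Return the distinct lines in sorted order, ignoring trailing newlines."""
    return sorted({line.rstrip("\n") for line in lines})


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin, write each distinct record once to stdout."""
    for line in sorted_unique_lines(stdin):
        stdout.write(line + "\n")
```

Calling `main()` from a script reads the flow file content on standard input and writes the de-duplicated lines to standard output. Like the OS commands, this only removes duplicates within one flow file and changes the record order.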

On Sat, Aug 15, 2020 at 5:53 AM Robert R. Bruno  wrote:

> I wanted to see if anyone knew is there a clever way to detect duplicate
> records much like you can with entire flow files with DetectDuplicate?  I'd
> really rather not have to split my records into individual flow files since
> this flow is such high volume.
>
> Thanks so much in advance.
>
