I believe Robert's case is that his records arrive bundled in flowfiles,
each containing one or more records, and he'd like to know on a
per-record level (regardless of which flowfile contains the record)
whether that record has already been seen over some time
interval.
So Robert, to understand it correctly: you have a lot of records in one
flowfile, and if a record has been seen before, that record should be removed?
If so, wouldn't it be a workflow that goes through all records, record by
record, and joins the final result? So first you would have to split
Yep, we were leaning towards offloading it to an external program and then
putting the data back into NiFi for final delivery. Looks like that will be
best from the sounds of it. Again, thanks all!
On Sat, Aug 15, 2020, 16:24 Josh Friberg-Wyckoff
wrote:
If that is the case and this is high volume like you say, I would think it
would be more efficient to offload the task to a separate program than
to have a NiFi processor do it.
On Sat, Aug 15, 2020, 2:52 PM Otto Fowler wrote:
I was working on something for this, but in discussion with some of the SMEs
on the project, decided to shelve it. I don't think I had gotten to the
point of a Jira.
https://apachenifi.slack.com/archives/C0L9S92JY/p1589911056303500
On August 15, 2020 at 14:12:07, Robert R. Bruno
Sorry, I should have been more clear. My need is to detect whether each record
has been seen in the past. So I need a solution that can go record by record
against something like a Redis cache that would tell me whether this is the
first time the record was seen, and update the cache
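
The record-by-record check described above could be sketched roughly as follows. This is only an illustration under assumptions: `SeenCache`, `record_key` and `drop_seen` are hypothetical names, records are assumed to be JSON-like dicts, and a plain in-memory dict stands in for the external cache. With Redis, an atomic set-if-absent with expiry (the `SET` command with its `NX` and `EX` options) would play the role of `first_time`.

```python
import hashlib
import json


class SeenCache:
    """In-memory stand-in for an external cache such as Redis (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._expires_at = {}  # record key -> time at which the entry expires

    def first_time(self, key, now):
        """Return True and remember the key if it is unseen or expired;
        return False if it was seen within the TTL."""
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self.ttl_seconds
        return True


def record_key(record):
    """Stable fingerprint of a record: SHA-256 of its JSON with sorted keys."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_seen(records, cache, now):
    """Keep only the records the cache has not seen within its TTL."""
    return [r for r in records if cache.first_time(record_key(r), now)]


cache = SeenCache(ttl_seconds=3600)
flowfile_1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
flowfile_2 = [{"v": "b", "id": 2}, {"id": 3, "v": "c"}]
print(drop_seen(flowfile_1, cache, now=0))   # repeat inside one flowfile is dropped
print(drop_seen(flowfile_2, cache, now=10))  # repeat from an earlier flowfile is dropped
```

Hashing the canonical JSON keeps the cache keys small and makes the check independent of field order; whether the whole record or only selected fields should define "the same record" is a choice this sketch does not settle.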
Just some info about DISTINCT. In MySQL a union is much, much faster than a
DISTINCT. The DISTINCT creates a new temp table with the result of the
query, sorts it, and removes duplicates.
If you make a union with a select id=-1, the result is exactly the same.
All duplicates are removed. A
If you opt to try a few of these options, please tell us which appeared to
be the best from a performance perspective - with our understanding that
results may vary depending on the size of the incoming data. It would be
very interesting to learn what you found.
On Sat, Aug 15, 2020 at 6:53 AM
In addition to the SO answer, if you know all the fields in the
record, you can use QueryRecord with SELECT DISTINCT field1,field2...
FROM FLOWFILE. The SO answer might be more performant but is more
complex, and QueryRecord will do the operations in-memory so it might
not handle very large
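
To illustrate what the SELECT DISTINCT query above does to the records of a single flowfile, here is a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine (QueryRecord has its own engine inside NiFi); the helper `distinct_records` and the sample rows are hypothetical, and the ORDER BY is added only to make the printed output deterministic.

```python
import sqlite3


def distinct_records(rows):
    """Load (field1, field2) rows into an in-memory table named FLOWFILE
    and return the distinct rows, ordered for stable output."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE FLOWFILE (field1 TEXT, field2 INTEGER)")
        conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT DISTINCT field1, field2 FROM FLOWFILE "
            "ORDER BY field1, field2"
        )
        return cursor.fetchall()
    finally:
        conn.close()


sample = [("a", 1), ("b", 2), ("a", 1), ("a", 3)]
print(distinct_records(sample))  # [('a', 1), ('a', 3), ('b', 2)]
```

Note that this only removes duplicates within one flowfile; it does not remember records seen in earlier flowfiles.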
Gosh, I should have searched the NiFi resources first. There is a current
Jira for what you are wanting.
https://issues.apache.org/jira/browse/NIFI-6047
On Sat, Aug 15, 2020 at 10:35 AM Josh Friberg-Wyckoff
wrote:
This looks interesting as well.
https://stackoverflow.com/questions/52674532/remove-duplicates-in-nifi
On Sat, Aug 15, 2020 at 10:23 AM Josh Friberg-Wyckoff
wrote:
In theory I would think you could use ExecuteStreamCommand to call the
built-in operating system sort commands to grab unique records. The Windows
sort command has an undocumented unique option. The sort command on Linux
distros has a unique option as well.
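
As a rough sketch of what such an external step would do, the script below mimics a sort-with-unique pass over line-oriented records: it reads lines, drops duplicates, and writes the sorted result. ExecuteStreamCommand feeds the flowfile content to the command's standard input and captures its standard output, so a script like this could be wired in by calling `main(sys.stdin, sys.stdout)`. The function names are hypothetical, one record per line is assumed, and Python orders strings by code point, which may differ from the locale-aware ordering of an operating system sort.

```python
import io


def unique_sorted_lines(lines):
    """Return the distinct lines in sorted order, without line endings."""
    stripped = (line.rstrip("\r\n") for line in lines)
    return sorted(set(stripped))


def main(in_stream, out_stream):
    """Copy the unique, sorted lines of in_stream to out_stream."""
    for line in unique_sorted_lines(in_stream):
        out_stream.write(line + "\n")


# Demonstration with in-memory streams instead of stdin/stdout.
source = io.StringIO("beta\nalpha\nbeta\ngamma\nalpha\n")
result = io.StringIO()
main(source, result)
print(result.getvalue(), end="")
```

Like the operating system sort, this deduplicates only within the content it is given, so it does not cover records repeated across flowfiles.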
On Sat, Aug 15, 2020 at 5:53