The idea behind key-selectors is to extract a property on which you can
do equality comparisons.
Let's get one question out of the way first:
is your scoring algorithm transitive? As in, if A==B and B==C, is it a
given that A==C? Because if not, there's just no way to group
(= partition) the data, since B would belong to 2 distinct groups.
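To make the transitivity concern concrete, here is a minimal sketch. It does not use your actual scoring algorithm (which isn't shown in this thread); as a stand-in it assumes a normalized Levenshtein score, and the class and method names are made up for illustration. Only the 0.80 threshold is taken from your mail.

```java
class TransitivityCheck {

    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed stand-in score in [0, 1]: 1 - distance / length of the longer string.
    static double score(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / maxLen;
    }

    // "Same" means a score strictly higher than 0.80, as in the original question.
    static boolean similar(String a, String b) {
        return score(a, b) > 0.80;
    }

    public static void main(String[] args) {
        System.out.println(similar("Jon Lock", "John Lock"));  // A ~ B
        System.out.println(similar("John Lock", "John Look")); // B ~ C
        System.out.println(similar("Jon Lock", "John Look"));  // A ~ C ?
    }
}
```

With this stand-in score, "Jon Lock" matches "John Lock" and "John Lock" matches "John Look", but "Jon Lock" does not match "John Look". So "John Lock" would have to sit in two groups at once.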
Even if it did work, one thing you have to realize is that this wouldn't
scale at all. For every element that comes in, you would have to compare
it to every group you have created so far.
What I would propose is the following: create a key-selector that allows
a /rough/ grouping of your data, something like "John L" => "J L". On
that group (which is hopefully relatively small) you can then run your
algorithm on all possible pairs to do whatever you want to do.
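A rough sketch of that proposal in plain Java, so it runs without Flink on the classpath. The names RoughGrouping, roughKey and similarPairs are made up for illustration, and the score function is passed in as a parameter because your actual algorithm isn't known here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class RoughGrouping {

    // Hypothetical rough key: upper-cased first letter of each word,
    // so "John Locke" and "John L" both map to "J L".
    static String roughKey(String name) {
        StringBuilder key = new StringBuilder();
        for (String word : name.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens for blank input
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(Character.toUpperCase(word.charAt(0)));
        }
        return key.toString();
    }

    // Compares all pairs within one (hopefully small) group and returns
    // the pairs whose score is strictly above the threshold.
    static List<String[]> similarPairs(List<String> group,
                                       ToDoubleBiFunction<String, String> score,
                                       double threshold) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                String a = group.get(i);
                String b = group.get(j);
                if (score.applyAsDouble(a, b) > threshold) {
                    pairs.add(new String[] {a, b});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(roughKey("John Locke")); // J L
        System.out.println(roughKey("John L"));     // J L
    }
}
```

In a Flink job, the body of roughKey would presumably go into the getKey method of a KeySelector that you pass to keyBy, and the pairwise step would run in whatever downstream operator collects the elements of one key. That wiring is left out here since it depends on your job.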
On 07.06.2016 10:48, iñaki williams wrote:
Thanks for your answer Ufuk.
However, I have been reading about KeySelector and I don't understand
completely how it works with my idea.
I am using an algorithm that gives me a score between different
strings. My idea is: if the score is higher than 0.80, for example,
then those two strings will be considered the same, and when I apply
keyBy("name") those similar strings will be keyed as if they had the
exact same name.
On Monday, June 6, 2016, Ufuk Celebi <u...@apache.org> wrote:
Hey Iñaki,
you can use the KeySelector as described here:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-keys
But you only have a local view for the current element, e.g. the library
you use to determine the similarity has to know the similarities
upfront.
– Ufuk
On Mon, Jun 6, 2016 at 9:31 AM, iñaki williams
<juanramall...@gmail.com> wrote:
> Hi guys,
>
> I am using Flink in my project and I have a question. (I am using Java.)
>
> Is it possible to modify the keyBy method in order to key by similarity
> and not by the exact name?
>
> Example: I receive 2 DataStreams. In the first one, the value of the
> field that I want to key by is "John Locke", while in the second
> DataStream the field value is "John L". Can I use some Java library to
> look for similarities between strings and, if the similarity is high,
> key those elements together?