Hello,

I'm implementing MinHash for recommendation on Flink. I'm almost done, but I
need an efficient way to merge sets of similar keys together (and later
join these sets of keys with more data).

The actual data structure is of the form DataSet[(Int,Set[Int])], where the
first element of the tuple is an ID for the second element, which is a set of
keys. I want to merge these sets together only if they share at least one
element.

I'm fairly sure I have studied an efficient solution to this problem for a
local environment, but I don't really know how to approach it in a distributed
environment. Any suggestions?
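
To make the problem concrete, here is a minimal local sketch of the merge,
assuming the single-machine solution alluded to above is the usual union-find
(disjoint-set) technique. It works on a plain Scala Seq rather than a Flink
DataSet, and the names MergeSets, mergeOverlapping, find and union are
hypothetical, chosen only for illustration.

```scala
import scala.collection.mutable

object MergeSets {
  // Merges all key sets that share at least one key (transitively).
  // Returns only the merged key sets; the original IDs are dropped here.
  def mergeOverlapping(data: Seq[(Int, Set[Int])]): Set[Set[Int]] = {
    // parent(k) points towards the representative key of k's group
    val parent = mutable.Map.empty[Int, Int]

    // Finds the representative of k, registering k if it is new,
    // and compresses the path on the way back.
    def find(k: Int): Int = {
      var root = k
      while (parent.getOrElseUpdate(root, root) != root) root = parent(root)
      var cur = k
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // Puts the groups of a and b under one representative.
    def union(a: Int, b: Int): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(ra) = rb
    }

    // Link every key of a set to that set's first key.
    for ((_, keys) <- data if keys.nonEmpty) {
      val first = keys.head
      keys.foreach(k => union(first, k))
    }

    // Group all known keys by their representative.
    parent.keys.toList.groupBy(find).values.map(_.toSet).toSet
  }
}
```

For example, given the sets {1,2}, {2,3} and {7,8}, the first two share the
key 2 and are merged, so the result contains {1,2,3} and {7,8}. The sketch
keeps all state in one in-memory map, which is exactly the part that does not
carry over directly to a distributed DataSet.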

Thanks,

Simone
