Hi Fabian, hi Stephan,
thanks for answering my question. Good hint with the list replication, I
will benchmark this vs. cross + filter.
Best, Martin
On 17.07.2015 at 11:17, Stephan Ewen wrote:
I would rewrite this to replicate the list into tuples:
"foreach x in list: emit (x, list)"
Then join on fields 0.
This replicates the lists, but makes the join very efficient.
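A minimal sketch of this replicate-then-join idea, written with plain Java collections rather than Flink's DataSet API so that it runs on its own. The class and method names (ReplicatedListJoin, replicate, join) are hypothetical, the hash join only stands in for whatever join strategy Flink would pick, and the data is the example from the original question further down in this thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain collections, not the Flink API.
public class ReplicatedListJoin {

    // Step 1: "foreach x in list: emit (x, list)".
    static List<Map.Entry<Integer, List<Integer>>> replicate(List<List<Integer>> left) {
        List<Map.Entry<Integer, List<Integer>>> out = new ArrayList<>();
        for (List<Integer> list : left) {
            for (Integer x : list) {
                out.add(Map.entry(x, list));
            }
        }
        return out;
    }

    // Step 2: equi-join on field 0 of both sides, done here as a simple hash join.
    static List<Map.Entry<List<Integer>, List<Integer>>> join(
            List<List<Integer>> left, List<List<Integer>> right) {
        // Build side: index the right tuples by their field 0.
        Map<Integer, List<List<Integer>>> rightByKey = new HashMap<>();
        for (List<Integer> tuple : right) {
            rightByKey.computeIfAbsent(tuple.get(0), k -> new ArrayList<>()).add(tuple);
        }
        // Probe side: each replicated (x, list) pair looks up tuples with f0 == x.
        List<Map.Entry<List<Integer>, List<Integer>>> result = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : replicate(left)) {
            for (List<Integer> tuple : rightByKey.getOrDefault(e.getKey(), List.of())) {
                result.add(Map.entry(e.getValue(), tuple));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> left = List.of(List.of(0, 1), List.of(2, 3, 4), List.of(5));
        List<List<Integer>> right =
                List.of(List.of(0, 1), List.of(1, 0), List.of(2, 1), List.of(5, 2));
        for (Map.Entry<List<Integer>, List<Integer>> pair : join(left, right)) {
            System.out.println(pair.getKey() + " joined with " + pair.getValue());
        }
    }
}
```

Note that a list containing the same value twice would be emitted twice for that value and therefore produce duplicate result pairs; deduplicating each list before replication avoids that.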
On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <fhue...@gmail.com> wrote:
Hi Martin,
good to hear that you like Flink :-)
AFAIK, there are no plans to add a containment join. The Flink
community is currently working on adding support for outer joins.
Regarding a containment join, I am not sure about the number of
use cases. I would rather try to implement it on top of Flink's
batch API instead of adding it as an internal feature/operator to
the system because this would touch a lot of things (API,
optimizer, operator implementation).
There might be better ways to implement a containment join than
using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe
it can be implemented with Flink's API.
- I guess you are implementing a generic graph framework, but can
you make certain assumptions about the data, such as the relative sizes
of the inputs or the avg/max size of the lists?
Contributions to Gelly (and Flink in general) are highly welcome.
Best, Fabian
2015-07-16 9:39 GMT+02:00 Martin Junghanns <martin.jungha...@gmx.net>:
Hi everyone,
first of all, thanks for building this great framework! We are using Flink,
and especially Gelly, to build a graph analytics stack (http://gradoop.com).
I was wondering whether support for a containment join operator is planned.
Consider the following example:
DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}
What I want to compute is
left.join(right).where(list).contains(tuple.f0) :=
{
<[0, 1], <0, 1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>>
}
At the moment, I am solving this with a cross followed by a filter, which
can be expensive.
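For reference, a minimal sketch of what the cross-and-filter approach amounts to, written with plain Java collections instead of Flink's DataSet API so that it is self-contained. The class and method names (CrossFilterContainment, crossAndFilter) are hypothetical and only serve this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain collections, not the Flink API.
public class CrossFilterContainment {

    // Cross: enumerate every (list, tuple) pair.
    // Filter: keep the pair only if the list contains tuple.f0.
    static List<Map.Entry<List<Integer>, List<Integer>>> crossAndFilter(
            List<List<Integer>> left, List<List<Integer>> right) {
        List<Map.Entry<List<Integer>, List<Integer>>> result = new ArrayList<>();
        for (List<Integer> list : left) {
            for (List<Integer> tuple : right) {
                if (list.contains(tuple.get(0))) {
                    result.add(Map.entry(list, tuple));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> left = List.of(List.of(0, 1), List.of(2, 3, 4), List.of(5));
        List<List<Integer>> right =
                List.of(List.of(0, 1), List.of(1, 0), List.of(2, 1), List.of(5, 2));
        for (Map.Entry<List<Integer>, List<Integer>> pair : crossAndFilter(left, right)) {
            System.out.println(pair.getKey() + " contains f0 of " + pair.getValue());
        }
    }
}
```

The nested loop inspects |left| * |right| pairs regardless of how many survive the filter, which is where the cost comes from: the example above checks 12 pairs to produce 4 results.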
The generalization of that operator would be a "set containment join",
where you join if the right set is contained in the left set.
If there is a general need for that operator, I would also like to
contribute to its implementation.
But maybe there is already another nice solution that I haven't
discovered yet?
Any help would be appreciated, especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (the current WIP state can be found here: [1]).
Thanks,
Martin
[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java