Hi Fabian, hi Stephen,

thanks for answering my question. Good hint with the list replication, I will benchmark this vs. cross + filter.

Best, Martin

Am 17.07.2015 um 11:17 schrieb Stephan Ewen:
I would rewrite this to replicate the list into tuples:

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <fhue...@gmail.com <mailto:fhue...@gmail.com>> wrote:

    Hi Martin,

    good to hear that you like Flink :-)
    AFAIK, there are no plans to add a containment join. The Flink
    community is currently working on adding support for outer joins.
    Regarding a containment join, I am not sure about the number of
    use cases. I would rather try to implement it on top of Flink's
    batch API instead of adding it as an internal feature/operator to
    the system because this would touch a lot of things (API,
    optimizer, operator implementation).

    There might be better ways to implement a containment join than
    using a cross and a filter.
    - Do you know a distributed algorithm for containment joins? Maybe
    it can be implemented with Flink's API.
    - I guess, you are implementing a generic graph framework, but can
    you make certain assumptions about the data such as relative sizes
    of the inputs or avg/max size of the lists, etc.?

    Contributions to Gelly (and Flink in general) are highly welcome.

    Best, Fabian


    2015-07-16 9:39 GMT+02:00 Martin Junghanns
    <martin.jungha...@gmx.net <mailto:martin.jungha...@gmx.net>>:

        Hi everyone,

        at first, thanks for building this great framework! We are
        using Flink
        and especially Gelly for building a graph analytics stack
        (gradoop.com <http://gradoop.com>).

        I was wondering if there is a [planned] support for a
        containment join
        operator. Consider the following example:

        DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
        DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>,
        <5, 2>}

        What I want to compute is

        left.join(right).where(list).contains(tuple.f0) :=

        {
        <[0, 1], <0,1>>, <[0, 1], <1, 0>>,
        <[2, 3, 4], <2, 1>>,
        <[5], <5, 2>
        }

        At the moment, I am solving that using cross and filter, which
        can be
        expensive.

        The generalization of that operator would be "set containment
        join",
        where you join if the right set is contained in the left set.

        If there is a general need for that operator, I would also like to
        contribute to its implementation.

        But maybe, there is already another nice solution which I didn't
        discover yet?

        Any help would be appreciated. Especially since I would also
        like to
        contribute some of our graph operators (e.g., graph
        summarization) back
        to Flink/Gelly (current WIP state can be found here: [1]).

        Thanks,

        Martin


        [1]
        
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java





Reply via email to