[ https://issues.apache.org/jira/browse/FLINK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Hogan updated FLINK-3910:
------------------------------
    Comment: was deleted

(was: I see this case as part of fleshing out the remaining join operators. Flink could get by with `map`, `reduce`, and `join`, but we are kindly given additional operators for clarity and performance. Outer joins have been quite useful even though we could use `coGroup` instead. Anti- and semi-joins would be similarly useful but are for now just comments in code.

`selfJoin` can have a large impact on performance. A `reduce` is `O(n)` but a join is `O(n^2)`, so data skew has a much larger effect.

How would extension classes contrast with simply marking methods as `@PublicEvolving`? I do see that it may be desirable to defer major features to the next release when there is insufficient time to settle.)

> New self-join operator
> ----------------------
>
>                 Key: FLINK-3910
>                 URL: https://issues.apache.org/jira/browse/FLINK-3910
>             Project: Flink
>          Issue Type: New Feature
>          Components: DataSet API, Java API, Scala API
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>
> Flink currently provides inner- and outer-joins as well as cogroup and the
> non-keyed cross. {{JoinOperator}} hints at future support for semi- and
> anti-joins.
> Many Gelly algorithms perform a self-join [0]. Still pending review,
> FLINK-3768 performs a self-join on non-skewed data in TriangleListing.java
> and FLINK-3780 performs a self-join on skewed data in JaccardSimilarity.java.
> A {{SelfJoinHint}} will select between skewed and non-skewed implementations.
> The object-reuse-disabled case can be simply handled with a new {{Operator}}.
> The object-reuse-enabled case requires either {{CopyableValue}} types (as in
> the code above) or a custom driver which has access to the serializer (or
> making the serializer accessible to rich functions, and I think there be
> dragons).
> If the idea of a self-join is agreeable, I'd like to work out a rough
> implementation and go from there.
> [0] https://en.wikipedia.org/wiki/Join_%28SQL%29#Self-join
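
To make the skew argument concrete, here is a minimal sketch in plain Java, not Flink code. It pairs up records that share a key, which is the essence of a self-join, and counts the emitted pairs for a balanced and a skewed input of the same size. The class `SelfJoinSketch` and its methods are hypothetical names invented for this illustration; they are not part of the Flink API or of the proposed operator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Flink APIs.
final class SelfJoinSketch {

    // Each record is {key, value}. Group the values by key, keeping key order.
    static Map<Integer, List<Integer>> groupByKey(List<int[]> records) {
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] record : records) {
            groups.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        return groups;
    }

    // Emit {key, left, right} for every unordered pair of records sharing a key.
    // A group of n records yields n * (n - 1) / 2 pairs, hence the quadratic cost.
    static List<int[]> selfJoin(List<int[]> records) {
        List<int[]> pairs = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> group : groupByKey(records).entrySet()) {
            List<Integer> values = group.getValue();
            for (int i = 0; i < values.size(); i++) {
                for (int j = i + 1; j < values.size(); j++) {
                    pairs.add(new int[] {group.getKey(), values.get(i), values.get(j)});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<int[]> balanced = new ArrayList<>();
        List<int[]> skewed = new ArrayList<>();
        for (int v = 0; v < 8; v++) {
            balanced.add(new int[] {v / 2, v}); // four keys, two records each
            skewed.add(new int[] {0, v});       // one key holds all eight records
        }
        System.out.println("balanced pairs: " + selfJoin(balanced).size()); // 4
        System.out.println("skewed pairs: " + selfJoin(skewed).size());     // 28
    }
}
```

Both inputs hold eight records, yet the skewed one emits seven times as many pairs. That gap is why a hint selecting between skewed and non-skewed implementations could matter for a self-join in a way it would not for a linear `reduce`.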