[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963339#comment-14963339 ]

ASF GitHub Bot commented on FLINK-2107:
---

Github user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/1262#issuecomment-149220611

    I'd like to merge this later today, unless somebody speaks up.

> Implement Hash Outer Join algorithm
> ---
>
>                 Key: FLINK-2107
>                 URL: https://issues.apache.org/jira/browse/FLINK-2107
>             Project: Flink
>          Issue Type: New Feature
>          Components: Local Runtime
>            Reporter: Fabian Hueske
>            Assignee: Chiwan Park
>            Priority: Minor
>             Fix For: pre-apache
>
>
> Flink does not natively support outer joins at the moment.
> This issue proposes to implement a hash outer join algorithm that can cover
> left and right outer joins.
> The implementation can be based on the regular hash join iterators (for
> example `ReusingBuildFirstHashMatchIterator` and
> `NonReusingBuildFirstHashMatchIterator`, see also `MatchDriver` class).
> The Reusing and NonReusing variants differ in whether object instances are
> reused or new objects are created. I would start with the NonReusing variant
> which is safer from a user's point of view and should also be easier to
> implement.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963514#comment-14963514 ]

ASF GitHub Bot commented on FLINK-2107:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1262
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960424#comment-14960424 ]

ASF GitHub Bot commented on FLINK-2107:
---

GitHub user fhueske opened a pull request:

    https://github.com/apache/flink/pull/1262

    [FLINK-2107] Add hash-based strategies for left and right outer joins.

    This PR adds hash-based execution strategies for left and right outer joins that use the outer side as the probe side of the hash table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fhueske/flink outerJoinHash

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1262.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1262

commit e24f487112694e1cd757601f5ad59037c0312499
Author: Fabian Hueske
Date:   2015-10-15T08:58:58Z

    [FLINK-2107] Add hash-based strategies for left and right outer joins.
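
For readers of the archive, the strategy described in this pull request can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `ProbeSideOuterJoinSketch`, `Row`, and `leftOuterJoin` are hypothetical names, the hash table is a plain `HashMap` rather than Flink's managed-memory hash table, and the join result is simplified to a string. The inner side is the build side and the outer side is the probe side, so an outer record without a partner is detected at probe time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names are Flink classes.
final class ProbeSideOuterJoinSketch {

    /** Hypothetical (key, value) record standing in for a Flink record type. */
    static final class Row {
        final int key;
        final String value;

        Row(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Left outer join: the inner (right) side is the build side,
     * the outer (left) side is the probe side.
     * Results are rendered as "left|right", with "null" for a missing right record.
     */
    static List<String> leftOuterJoin(List<Row> left, List<Row> right) {
        // Build phase: hash the inner side by key.
        Map<Integer, List<Row>> table = new HashMap<>();
        for (Row r : right) {
            table.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }

        // Probe phase: every outer record is seen exactly once, so a missing
        // partner can be handled immediately by emitting a null-padded result.
        List<String> out = new ArrayList<>();
        for (Row l : left) {
            List<Row> matches = table.getOrDefault(l.key, Collections.emptyList());
            if (matches.isEmpty()) {
                out.add(l.value + "|null");
            } else {
                for (Row r : matches) {
                    out.add(l.value + "|" + r.value);
                }
            }
        }
        return out;
    }
}
```

A right outer join is the mirror image: build on the left side and probe with the right side.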
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959066#comment-14959066 ]

Chiwan Park commented on FLINK-2107:
---

[~fhueske] Yes, you can take this issue. :-) Sorry for delaying.
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959088#comment-14959088 ]

Fabian Hueske commented on FLINK-2107:
---

No problem. :-)
I will not solve all aspects, just the "easy" special cases. For the remaining cases, we need a special hash table implementation that can return all records in the hash table that were not accessed during the probe phase. If you are interested, you can continue to work on these cases.
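
The remaining cases mentioned here are those where the outer side is the build side. A rough sketch of the idea, assuming a per-key "probed" flag and an in-memory `LinkedHashMap` in place of a real spillable hash table (all names, including `BuildSideOuterJoinSketch`, `Bucket`, and `probed`, are hypothetical and not taken from Flink):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names are Flink classes.
final class BuildSideOuterJoinSketch {

    /** Hypothetical (key, value) record standing in for a Flink record type. */
    static final class Row {
        final int key;
        final String value;

        Row(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** All build-side rows for one key, plus a flag set when the key is probed. */
    private static final class Bucket {
        final List<Row> rows = new ArrayList<>();
        boolean probed;
    }

    /**
     * Left outer join with the OUTER (left) side as the build side.
     * Results are rendered as "left|right", with "null" for a missing right record.
     */
    static List<String> leftOuterJoin(List<Row> left, List<Row> right) {
        // Build phase on the outer side (insertion order kept for stable output).
        Map<Integer, Bucket> table = new LinkedHashMap<>();
        for (Row l : left) {
            table.computeIfAbsent(l.key, k -> new Bucket()).rows.add(l);
        }

        // Probe phase with the inner side; remember which keys were accessed.
        List<String> out = new ArrayList<>();
        for (Row r : right) {
            Bucket bucket = table.get(r.key);
            if (bucket == null) {
                continue; // inner record without partner: dropped in a left outer join
            }
            bucket.probed = true;
            for (Row l : bucket.rows) {
                out.add(l.value + "|" + r.value);
            }
        }

        // Extra pass: read back all build-side records that were never probed.
        for (Bucket bucket : table.values()) {
            if (!bucket.probed) {
                for (Row l : bucket.rows) {
                    out.add(l.value + "|null");
                }
            }
        }
        return out;
    }
}
```

The extra pass over unprobed entries is what a regular hash join table does not offer, which is why these cases need a dedicated hash table implementation.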
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959060#comment-14959060 ]

Fabian Hueske commented on FLINK-2107:
---

Hi @chiwan,
is it OK if I take over some parts of this issue? I would like to get the "simple" hash outer join case into the next release because it would allow me to remove one limitation from the Cascading on Flink adapter. The simple case is the one where the outer side (left or right) is the probe side of the hash table.

Thanks, Fabian
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14679107#comment-14679107 ]

Fabian Hueske commented on FLINK-2107:
---

[~Zentol] is right. This is an optimization to avoid copying the probe side record if there is only one build side record. 1-n joins where the build side contains only unique keys are quite common. That is why this optimization can make a difference.

The probe side records need to be copied because the user-defined join function can modify all incoming records. If we did not create a new copy for each join function call, the second call of the join function might receive a probe side record that was modified by the first call. That violates the assumption of independent function calls and produces wrong results.

Implement Hash Outer Join algorithm
---

                Key: FLINK-2107
                URL: https://issues.apache.org/jira/browse/FLINK-2107
            Project: Flink
         Issue Type: Sub-task
         Components: Local Runtime
           Reporter: Fabian Hueske
           Assignee: Chiwan Park
           Priority: Minor
            Fix For: pre-apache
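
The effect described in this comment can be reproduced with a small stand-alone example. It is only a sketch: `Rec`, `userJoin`, and `joinAll` are hypothetical names, and `Rec.copy()` stands in for the serializer-based copy in Flink. The user function deliberately mutates its probe-side input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names are Flink classes.
final class ProbeCopySketch {

    /** Hypothetical mutable record. */
    static final class Rec {
        int value;

        Rec(int value) {
            this.value = value;
        }

        Rec copy() {
            return new Rec(value);
        }
    }

    /** Stand-in for a user-defined join function that modifies its probe-side input. */
    static int userJoin(Rec build, Rec probe) {
        probe.value += build.value;
        return probe.value;
    }

    /** Joins one probe record with all matching build records, with or without copying. */
    static List<Integer> joinAll(List<Rec> buildMatches, Rec probe, boolean copyProbe) {
        List<Integer> out = new ArrayList<>();
        for (Rec build : buildMatches) {
            // Without a copy, every call sees the mutations made by earlier calls.
            Rec probeForCall = copyProbe ? probe.copy() : probe;
            out.add(userJoin(build, probeForCall));
        }
        return out;
    }
}
```

With build values 10 and 20 and a probe value of 1, copying yields 11 and 21. Without copying, the second call sees the probe record already changed to 11 and yields 31, so the two calls are no longer independent.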
[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662083#comment-14662083 ]

Chesnay Schepler commented on FLINK-2107:
---

Looks like an optimization thing to me. You could probably replace the whole block from L116 to L138 with

{code:java}
while (running && ((nextBuildSideRecord = buildSideIterator.next()) != null)) {
    probeCopy = this.probeSideSerializer.copy(probeRecord);
    matchFunction.join(nextBuildSideRecord, probeCopy, collector);
}
{code}

but this would mean that you would always create a copy, even if there is only a single match, which is what the following bit checks for.

{code:java}
if ((tmpRec = buildSideIterator.next()) != null) {
{code}

If this is true, we have accessed two build-side values without calling join, and as such have to deal with them outside the loop.
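
The look-ahead pattern discussed in this comment can be sketched in isolation. This is not the Flink code from L116 to L138, only an illustration of the same idea under assumptions: `nextOrNull` mimics an iterator that returns null when exhausted, a `StringBuilder` plays the mutable probe record, and the `copies` counter exists purely to make the saved copy observable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; none of these names are Flink classes.
final class LookaheadCopySketch {

    /** Number of probe-side copies made so far (for demonstration only). */
    static int copies = 0;

    /** Mimics an iterator whose next() returns null once it is exhausted. */
    private static String nextOrNull(Iterator<String> it) {
        return it.hasNext() ? it.next() : null;
    }

    private static StringBuilder copy(StringBuilder probe) {
        copies++;
        return new StringBuilder(probe);
    }

    /** Stand-in for the user-defined join function. */
    private static void join(String build, StringBuilder probe, List<String> out) {
        out.add(build + "|" + probe);
    }

    static List<String> joinMatches(Iterator<String> buildSide, StringBuilder probe) {
        List<String> out = new ArrayList<>();

        String first = nextOrNull(buildSide);
        if (first == null) {
            return out; // no build-side match at all
        }

        // Look ahead: is there a second match?
        String second = nextOrNull(buildSide);
        if (second == null) {
            // Exactly one match: pass the original probe record, no copy needed.
            join(first, probe, out);
            return out;
        }

        // Two build-side records were fetched before any join call,
        // so both are handled here, outside the loop.
        join(first, copy(probe), out);
        join(second, copy(probe), out);

        String next;
        while ((next = nextOrNull(buildSide)) != null) {
            join(next, copy(probe), out);
        }
        return out;
    }
}
```

For a single match the counter stays at zero, which is the saving for 1-n joins with unique build-side keys. For several matches this sketch copies once per call.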