[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963339#comment-14963339
 ] 

ASF GitHub Bot commented on FLINK-2107:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/1262#issuecomment-149220611
  
I'd like to merge this later today, unless somebody speaks up.


> Implement Hash Outer Join algorithm
> ---
>
> Key: FLINK-2107
> URL: https://issues.apache.org/jira/browse/FLINK-2107
> Project: Flink
>  Issue Type: New Feature
>  Components: Local Runtime
>Reporter: Fabian Hueske
>Assignee: Chiwan Park
>Priority: Minor
> Fix For: pre-apache
>
>
> Flink does not natively support outer joins at the moment.
> This issue proposes to implement a hash outer join algorithm that can cover 
> left and right outer joins.
> The implementation can be based on the regular hash join iterators (for 
> example `ReusingBuildFirstHashMatchIterator` and 
> `NonReusingBuildFirstHashMatchIterator`, see also `MatchDriver` class)
> The Reusing and NonReusing variants differ in whether object instances are 
> reused or new objects are created. I would start with the NonReusing variant 
> which is safer from a user's point of view and should also be easier to 
> implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963514#comment-14963514
 ] 

ASF GitHub Bot commented on FLINK-2107:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1262




[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960424#comment-14960424
 ] 

ASF GitHub Bot commented on FLINK-2107:
---

GitHub user fhueske opened a pull request:

https://github.com/apache/flink/pull/1262

[FLINK-2107] Add hash-based strategies for left and right outer joins.

This PR adds hash-based execution strategies for left and right outer 
joins that have the outer side as the probe side of a hash table.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fhueske/flink outerJoinHash

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1262.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1262


commit e24f487112694e1cd757601f5ad59037c0312499
Author: Fabian Hueske 
Date:   2015-10-15T08:58:58Z

[FLINK-2107] Add hash-based strategies for left and right outer joins.






[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-10-15 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959066#comment-14959066
 ] 

Chiwan Park commented on FLINK-2107:


[~fhueske] Yes, you can take this issue. :-) Sorry for the delay.



[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-10-15 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959088#comment-14959088
 ] 

Fabian Hueske commented on FLINK-2107:
--

No problem. :-)
I will not solve all aspects, just the "easy" special cases. 
For the remaining cases, we need a special hash table implementation that 
allows reading all records from the hash table that have not been accessed 
during the probe phase.

If you are interested, you can continue to work on these cases.
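
A minimal sketch of what such a hash table would have to support, under stated assumptions: the class `BuildSideOuterJoinSketch`, the `BuildRecord` wrapper with its `probed` flag, and the string-array records are hypothetical and are not taken from Flink's hash table. Each build-side record carries a flag that is set when a probe record reaches it; after the probe phase a final pass emits every record whose flag is still unset.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: outer join where the OUTER side is the build side. */
final class BuildSideOuterJoinSketch {

    /** A build-side value plus a flag that records whether any probe record matched it. */
    static final class BuildRecord {
        final String value;
        boolean probed;
        BuildRecord(String value) { this.value = value; }
    }

    /** Records are {key, value} pairs; results look like "key:buildValue,probeValue". */
    static List<String> join(List<String[]> build, List<String[]> probe) {
        // Build phase: hash the outer side by key (LinkedHashMap keeps output order stable).
        Map<String, List<BuildRecord>> table = new LinkedHashMap<>();
        for (String[] b : build) {
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(new BuildRecord(b[1]));
        }

        // Probe phase: emit matches and mark every build record that was accessed.
        List<String> out = new ArrayList<>();
        for (String[] p : probe) {
            List<BuildRecord> bucket = table.get(p[0]);
            if (bucket == null) {
                continue; // the probe side is the inner side, so unmatched probe records are dropped
            }
            for (BuildRecord r : bucket) {
                r.probed = true;
                out.add(p[0] + ":" + r.value + "," + p[1]);
            }
        }

        // Final pass: read all build records that were never accessed during the probe phase.
        for (Map.Entry<String, List<BuildRecord>> e : table.entrySet()) {
            for (BuildRecord r : e.getValue()) {
                if (!r.probed) {
                    out.add(e.getKey() + ":" + r.value + ",null");
                }
            }
        }
        return out;
    }
}
```

The sketch keeps everything in memory; a real implementation would also have to preserve the flags when partitions are spilled to disk, which is presumably part of what makes these cases harder.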



[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-10-15 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959060#comment-14959060
 ] 

Fabian Hueske commented on FLINK-2107:
--

Hi @chiwan, is it OK if I take over some parts of this issue?
I would like to get the "simple" hash outer join case into the next release 
because it would allow me to remove one limitation from the Cascading on Flink 
adapter.
This would be the case where the left or right outer side is the probe side 
of the hash table.

Thanks, Fabian
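
To illustrate why this case is the simple one, here is a rough sketch, not the code from the pull request. The class `ProbeSideOuterJoinSketch` and the string-array records are hypothetical. Because the outer side probes the table, a missing bucket is noticed immediately and the record can be emitted with a null partner, so the hash table itself needs no changes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: left outer join where the outer (left) side is the probe side. */
final class ProbeSideOuterJoinSketch {

    /** Records are {key, value} pairs; results look like "key:leftValue,rightValue". */
    static List<String> leftOuterJoin(List<String[]> left, List<String[]> right) {
        // Build phase: hash the inner (right) side by key.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] r : right) {
            table.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }

        // Probe phase: every outer record is seen exactly once, so an empty
        // lookup result can be turned into a null-padded output right away.
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            List<String> matches = table.get(l[0]);
            if (matches == null) {
                out.add(l[0] + ":" + l[1] + ",null");
            } else {
                for (String m : matches) {
                    out.add(l[0] + ":" + l[1] + "," + m);
                }
            }
        }
        return out;
    }
}
```

A right outer join with the right side as the probe side is the mirror image: swap the roles of the two inputs.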




[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-08-09 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679107#comment-14679107
 ] 

Fabian Hueske commented on FLINK-2107:
--

[~Zentol] is right. This is an optimization to avoid copying the probe side 
record if there is only one build side record. 1-n joins where the build-side 
contains only unique keys are quite common. That is why this optimization can 
make a difference.

The probe side records need to be copied because the user-defined join 
function can modify all incoming records. If we did not create a new copy for 
each join function call, the second call of the join function might receive 
a probe side record that was modified by the first call, which violates the 
assumption of independent function calls and produces wrong results.
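
The effect can be reproduced with a small stand-alone example. Everything in it is hypothetical (the `Rec` type, `mutatingJoin`, and the `copyProbe` switch are not Flink classes); it only shows how a join function that mutates its probe-side input leaks state into the next call when the same instance is handed over twice.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: why each join call needs its own probe-side copy. */
final class ProbeCopySketch {

    /** A mutable record, standing in for a user type. */
    static final class Rec {
        int value;
        Rec(int value) { this.value = value; }
        Rec copy() { return new Rec(value); }
    }

    /** A user function that modifies its probe-side input, which join functions may do. */
    static int mutatingJoin(Rec build, Rec probe) {
        probe.value += build.value;
        return probe.value;
    }

    /** Joins one probe record with all matching build records, with or without copying. */
    static List<Integer> joinAll(List<Rec> buildMatches, Rec probe, boolean copyProbe) {
        List<Integer> out = new ArrayList<>();
        for (Rec b : buildMatches) {
            // Without the copy, the second call sees the value changed by the first call.
            Rec p = copyProbe ? probe.copy() : probe;
            out.add(mutatingJoin(b, p));
        }
        return out;
    }
}
```

With build values 10 and 20 and a probe value of 1, the copying variant yields 11 and 21, while the non-copying variant yields 11 and 31, because the second call starts from the already modified probe record.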

 Implement Hash Outer Join algorithm
 ---

 Key: FLINK-2107
 URL: https://issues.apache.org/jira/browse/FLINK-2107
 Project: Flink
  Issue Type: Sub-task
  Components: Local Runtime
Reporter: Fabian Hueske
Assignee: Chiwan Park
Priority: Minor
 Fix For: pre-apache


 Flink does not natively support outer joins at the moment.
 This issue proposes to implement a hash outer join algorithm that can cover 
 left and right outer joins.
 The implementation can be based on the regular hash join iterators (for 
 example `ReusingBuildFirstHashMatchIterator` and 
 `NonReusingBuildFirstHashMatchIterator`, see also `MatchDriver` class)
 The Reusing and NonReusing variants differ in whether object instances are 
 reused or new objects are created. I would start with the NonReusing variant 
 which is safer from a user's point of view and should also be easier to 
 implement.





[jira] [Commented] (FLINK-2107) Implement Hash Outer Join algorithm

2015-08-07 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662083#comment-14662083
 ] 

Chesnay Schepler commented on FLINK-2107:
-

looks like an optimization thing to me. you could probably replace the whole 
block from L116 to L138 with
{code:java}
while (running && ((nextBuildSideRecord = buildSideIterator.next()) != null)) {
    probeCopy = this.probeSideSerializer.copy(probeRecord);
    matchFunction.join(nextBuildSideRecord, probeCopy, collector);
}
{code}

but this would mean that you would always create a copy, even if there is only 
a single match, which is what the following bit checks for.
{code:java}
if ((tmpRec = buildSideIterator.next()) != null) {
{code}

If this is true, we have accessed two build-side values without calling join, 
and as such have to deal with them outside the loop.
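
The control flow described above can be sketched as follows. This is a simplified stand-in, not the code from the iterator under discussion: `SingleMatchSketch`, `nextOrNull`, and the use of `StringBuilder` as a mutable probe record are assumptions made for illustration. The method returns the number of probe copies it made, so the saving for the single-match case is visible.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustration only: skip the probe-side copy when there is exactly one build-side match. */
final class SingleMatchSketch {

    /** Mimics an iterator whose next() returns null when it is exhausted. */
    static String nextOrNull(Iterator<String> it) {
        return it.hasNext() ? it.next() : null;
    }

    /** Stand-in for the user-defined join function. */
    static String join(String build, StringBuilder probe) {
        return build + "|" + probe;
    }

    /** Joins one probe record with all build matches; returns how many probe copies were made. */
    static int joinMatches(Iterator<String> buildSide, StringBuilder probe, List<String> out) {
        String first = nextOrNull(buildSide);
        if (first == null) {
            return 0; // no match at all
        }
        String tmp = nextOrNull(buildSide);
        if (tmp == null) {
            // Exactly one match: pass the original probe record, no copy needed.
            out.add(join(first, probe));
            return 0;
        }
        // Two build records have been fetched before any join call,
        // so both are handled here, outside the loop.
        int copies = 0;
        out.add(join(first, new StringBuilder(probe)));
        copies++;
        out.add(join(tmp, new StringBuilder(probe)));
        copies++;
        String next;
        while ((next = nextOrNull(buildSide)) != null) {
            out.add(join(next, new StringBuilder(probe)));
            copies++;
        }
        return copies;
    }
}
```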
