Re: Left outer join

Flavio Pompermaier Fri, 17 Apr 2015 04:20:13 -0700

Could you explain a little more in detail this caching mechanism with a
simple code snippet...?


Thanks,
Flavio
On Apr 17, 2015 1:12 PM, "Fabian Hueske" <fhue...@gmail.com> wrote:

> If you know that the group cardinality of one input is always 1 (or 0) you
> can make that input the one to cache in memory and stream the other input
> with potentially more group elements.
>
> 2015-04-17 4:09 GMT-05:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> That would be very helpful...
>>
>> Thanks for the support,
>> Flavio
>>
>> On Fri, Apr 17, 2015 at 10:04 AM, Till Rohrmann <till.rohrm...@gmail.com>
>> wrote:
>>
>>> No its not, but at the moment there is afaik no other way around it.
>>> There is an issue for proper outer join support [1]
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-687
>>>
>>> On Fri, Apr 17, 2015 at 10:01 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
>>>> Could resolve the problem but the fact to accumulate stuff in a local
>>>> variable is it safe if datasets are huge..?
>>>>
>>>> On Fri, Apr 17, 2015 at 9:54 AM, Till Rohrmann <till.rohrm...@gmail.com
>>>> > wrote:
>>>>
>>>>> If it's fine when you have null string values in the cases where
>>>>> D1.f1!="a1" or D1.f2!="a2" then a possible solution could look like (with
>>>>> Scala API):
>>>>>
>>>>> val ds1: DataSet[(String, String, String)] = getDS1
>>>>> val ds2: DataSet[(String, String, String)] = getDS2
>>>>>
>>>>> ds1.coGroup(ds2).where(2).equalTo(0) {
>>>>>   (left, right, collector: Collector[(String, String, String,
>>>>> String)]) => {
>>>>>     if(right.isEmpty) {
>>>>>       left foreach {
>>>>>       element => {
>>>>>       val value1 = if(element._2 == "a1") element._3 else null
>>>>>       val value2 = if(element._2 == "a2") element._3 else null
>>>>>       collector.collect((element._1, null, value1, value2))
>>>>>         }
>>>>>       }
>>>>>     } else {
>>>>>       val array = right.toArray
>>>>>       for(leftElement <- left) {
>>>>>       val value1 = if(leftElement._2 == "a1") leftElement._3 else null
>>>>>     val value2 = if(leftElement._2 == "a2") leftElement._3 else null
>>>>>
>>>>>     for(rightElement <- array) {
>>>>>       collector.collect(leftElement._1, rightElement._1, value1,
>>>>> value2))
>>>>>     }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> Does this solve your problem?
>>>>>
>>>>> On Fri, Apr 17, 2015 at 9:30 AM, Flavio Pompermaier <
>>>>> pomperma...@okkam.it> wrote:
>>>>>
>>>>>> Hi Till,
>>>>>> thanks for the reply.
>>>>>> What I'd like to do is to merge D1 and D2 if there's a ref from D1 to
>>>>>> D2 (D1.f2==D2.f0).
>>>>>> If this condition is true, I would like to produce a set of tuples
>>>>>> with the matching elements
>>>>>> at the first to places (D1.*f2*, D2.*f0*) and the other two values
>>>>>> (if present) of the matching tuple
>>>>>> in D1 when D1.f1==*"a1"* and D1.f2=*"a2"* (string values)
>>>>>> respectively.
>>>>>> (PS: For each value of D1.f0 you can have at most one value of a1 and
>>>>>> a2)
>>>>>>
>>>>>> Is it more clear?
>>>>>>
>>>>>> On Fri, Apr 17, 2015 at 9:03 AM, Till Rohrmann <
>>>>>> till.rohrm...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>> I don't really understand what you try to do. What does
>>>>>>> D1.f2(D1.f1==p1) mean? What does happen if the condition in D1.f2(if
>>>>>>> D1.f1==p2) is false?
>>>>>>>
>>>>>>> Where does the values a1 and a2 in (A, X, a1, a2) come from when you
>>>>>>> join [(A, p3, X), (X, s, V)] and [(A, p3, X), (X, r, 2)]? Maybe you can
>>>>>>> elaborate a bit more on your example.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Till
>>>>>>>
>>>>>>> On Thu, Apr 16, 2015 at 10:09 PM, Flavio Pompermaier <
>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>
>>>>>>>> I cannot find a solution to my use case :(
>>>>>>>> I have 2 datasets D1 and D2 like:
>>>>>>>>
>>>>>>>> D1:
>>>>>>>> A,p1,a1
>>>>>>>> A,p2,a2
>>>>>>>> A,p3,X
>>>>>>>> B,p3,Y
>>>>>>>> B,p1,b1
>>>>>>>>
>>>>>>>> D2:
>>>>>>>> X,s,V
>>>>>>>> X,r,2
>>>>>>>> Y,j,k
>>>>>>>>
>>>>>>>> I'd like to have a unique dataset D3(Tuple4) like
>>>>>>>>
>>>>>>>> A,X,a1,a2
>>>>>>>> B,Y,b1,null
>>>>>>>>
>>>>>>>> Basically filling with <D1.f0,D2.f0,D1.f2(D1.f1==p1),D1.f2(if
>>>>>>>> D1.f1==p2)> when D1.f2==D2.f0.
>>>>>>>> Is that possible and how?
>>>>>>>> Could you show me a simple snippet?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 16, 2015 at 9:48 PM, Till Rohrmann <
>>>>>>>> trohrm...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> You can materialize the input of the right input by creating an
>>>>>>>>> array out of it, for example. Then you can reiterate over it.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>> On Apr 16, 2015 7:37 PM, "Flavio Pompermaier" <
>>>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Maximilian,
>>>>>>>>>> I tried your solution but it doesn't work because the
>>>>>>>>>> rightElements iterator cannot be used more than once:
>>>>>>>>>>
>>>>>>>>>> Caused by: org.apache.flink.util.TraversableOnceException: The
>>>>>>>>>> Iterable can be iterated over only once. Only the first call to
>>>>>>>>>> 'iterator()' will succeed.
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 15, 2015 at 12:59 PM, Maximilian Michels <
>>>>>>>>>> m...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Flavio,
>>>>>>>>>>>
>>>>>>>>>>> Here's an simple example of a Left Outer Join:
>>>>>>>>>>> https://gist.github.com/mxm/c2e9c459a9d82c18d789
>>>>>>>>>>>
>>>>>>>>>>> As Stephan pointed out, this can be very easily modified to
>>>>>>>>>>> construct a Right Outer Join (just exchange leftElements and 
>>>>>>>>>>> rightElements
>>>>>>>>>>> in the two loops).
>>>>>>>>>>>
>>>>>>>>>>> Here's an excerpt with the most important part, the coGroup
>>>>>>>>>>> function:
>>>>>>>>>>>
>>>>>>>>>>> public static class LeftOuterJoin implements 
>>>>>>>>>>> CoGroupFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, 
>>>>>>>>>>> Tuple2<Integer, Integer>> {
>>>>>>>>>>>
>>>>>>>>>>>    @Override
>>>>>>>>>>>    public void coGroup(Iterable<Tuple2<Integer, String>> 
>>>>>>>>>>> leftElements,
>>>>>>>>>>>                        Iterable<Tuple2<Integer, String>> 
>>>>>>>>>>> rightElements,
>>>>>>>>>>>                        Collector<Tuple2<Integer, Integer>> out) 
>>>>>>>>>>> throws Exception {
>>>>>>>>>>>
>>>>>>>>>>>       final int NULL_ELEMENT = -1;
>>>>>>>>>>>
>>>>>>>>>>>       for (Tuple2<Integer, String> leftElem : leftElements) {
>>>>>>>>>>>          boolean hadElements = false;
>>>>>>>>>>>          for (Tuple2<Integer, String> rightElem : rightElements) {
>>>>>>>>>>>             out.collect(new Tuple2<Integer, Integer>(leftElem.f0, 
>>>>>>>>>>> rightElem.f0));
>>>>>>>>>>>             hadElements = true;
>>>>>>>>>>>          }
>>>>>>>>>>>          if (!hadElements) {
>>>>>>>>>>>             out.collect(new Tuple2<Integer, Integer>(leftElem.f0, 
>>>>>>>>>>> NULL_ELEMENT));
>>>>>>>>>>>          }
>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>>>    }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2015 at 11:01 AM, Stephan Ewen <se...@apache.org
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think this may be a great example to add as a utility
>>>>>>>>>>>> function.
>>>>>>>>>>>>
>>>>>>>>>>>> Or actually add as an function to the DataSet, internally
>>>>>>>>>>>> realized as a special case of coGroup.
>>>>>>>>>>>>
>>>>>>>>>>>> We do not have a ready example of that, but it should be
>>>>>>>>>>>> straightforward to realize. Similar as for the join, coGroup on 
>>>>>>>>>>>> the join
>>>>>>>>>>>> keys. Inside the coGroup function, emit the combination of all 
>>>>>>>>>>>> values from
>>>>>>>>>>>> the two iterators. If one of them is empty (the one that is not 
>>>>>>>>>>>> outer) then
>>>>>>>>>>>> emit all values from the outer side.
>>>>>>>>>>>>
>>>>>>>>>>>> Greetings,
>>>>>>>>>>>> Stephan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 15, 2015 at 10:36 AM, Flavio Pompermaier <
>>>>>>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have an already working example of it? :)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Apr 15, 2015 at 10:32 AM, Ufuk Celebi <u...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15 Apr 2015, at 10:30, Flavio Pompermaier <
>>>>>>>>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi to all,
>>>>>>>>>>>>>> > I have to join two datasets but I'd like to keep all data
>>>>>>>>>>>>>> in the left also if there' no right dataset.
>>>>>>>>>>>>>> > How can you achieve that in Flink? maybe I should use
>>>>>>>>>>>>>> coGroup?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, currently you have to implement this manually with a
>>>>>>>>>>>>>> coGroup
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Left outer join

Reply via email to