Yes, that is the way to do it. This makes me think that it would be nice to have a method that builds the union of a list of data sets:

  DataSet<T> union(DataSet<T>... sets)

It would be implemented like in your loop. Would that be helpful?

Stephan
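[For illustration, a minimal sketch of what such a helper could look like as a standalone static utility. The class name UnionUtil and the method name unionAll are hypothetical and not part of the Flink API; only DataSet.union() is.]

  import org.apache.flink.api.java.DataSet;

  public final class UnionUtil {

      /**
       * Builds the union of all given data sets.
       * Hypothetical helper; note that union() returns a new DataSet,
       * so the intermediate result must be reassigned on every iteration.
       */
      @SafeVarargs
      public static <T> DataSet<T> unionAll(DataSet<T> first, DataSet<T>... others) {
          DataSet<T> result = first;
          for (DataSet<T> other : others) {
              result = result.union(other);
          }
          return result;
      }
  }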
On Tue, Mar 17, 2015 at 6:03 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> As always, one minute after I sent the email I found the problem!
> It was that I should reassign the initial dataset:
>
>   ret = ret.union(sourceDs);
>
> Bye,
> Flavio
>
> On Tue, Mar 17, 2015 at 5:58 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
>> Hi Fabian,
>> I was trying to use the strategy you suggested with Flink 0.8.1, but it
>> seems that the union of the datasets cannot be created programmatically,
>> because the union operator gives a name to the generated dataset that is
>> the name of the calling function, so that only the first dataset is read.
>> My code looks like:
>>
>> private static DataSet<Tuple6<...>> getSourceDs(ExecutionEnvironment env,
>>     final String outputGraph, List<String> tableNames) {
>>   DataSet<Tuple6<...>> ret = null;
>>   for (String tableName : tableNames) {
>>     DataSet<Tuple6<...>> sourceDs = env.createInput(new MyTableInputFormat(tableName))
>>         ....
>>
>>     if (ret == null)
>>       ret = sourceDs;
>>     else
>>       ret.union(sourceDs);
>>   }
>>   return ret;
>> }
>>
>> Is this a bug or am I doing something wrong?
>> Thanks in advance,
>> Flavio
>>
>> On Mon, Dec 22, 2014 at 2:42 PM, <fhue...@gmail.com> wrote:
>>
>>> Union just combines data from multiple sources into a single dataset.
>>> That's it. No memory, no disk involved.
>>>
>>> In your case you have
>>>
>>>   input1.union(input2).groupBy(1).reduce(...)
>>>
>>> This will translate into:
>>>
>>>   input1 -> repartition ->
>>>                             read-both-inputs -> sort -> reduce
>>>   input2 -> repartition ->
>>>
>>> So, in your case not even additional network transfer is involved,
>>> because both data sets would need to be partitioned for the reduce anyway.
>>>
>>> Note that union in Flink has SQL UNION ALL semantics, i.e., there is
>>> no removal of duplicates.
>>>
>>> Cheers, Fabian
>>>
>>> *From:* Flavio Pompermaier <pomperma...@okkam.it>
>>> *Sent:* Monday, 22. December, 2014 14:32
>>> *To:* u...@flink.incubator.apache.org
>>>
>>> Ok, thanks Fabian. I'd just like to know the internals of the union of
>>> multiple datasets (partitioning, distribution among servers, memory/disk,
>>> etc.). Do you have any reference for this?
>>>
>>> Thanks in advance,
>>> Flavio
>>>
>>> On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <fhue...@apache.org> wrote:
>>>
>>>> Follow the first approach.
>>>> Joins are expensive, union comes for free.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> In my use case I have multiple DataSets with the same structure (e.g.
>>>>> Tuple3) and I want to produce an output DataSet containing all Tuple3
>>>>> records grouped by the first field (0).
>>>>> I can obtain the same result by performing a union of all datasets and
>>>>> then a group-by (simplest implementation), or by joining all of them
>>>>> pairwise (((A->B)->C)->D...), or perhaps there is another solution I am
>>>>> not aware of. When should I use the first or the second approach? Could
>>>>> you help me figure out the internals of the two approaches? I am always
>>>>> a bit wary of multiple joins when I don't know the inputs' sizes exactly.
>>>>>
>>>>> Best,
>>>>> Flavio
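[For illustration, a minimal, self-contained sketch of the recommended first approach: union the inputs, then group by the first field and reduce. The example inputs and the aggregation logic are made up; only the union/groupBy/reduce pipeline reflects the discussion above.]

  import org.apache.flink.api.common.functions.ReduceFunction;
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;
  import org.apache.flink.api.java.tuple.Tuple3;

  public class UnionGroupByExample {

      public static void main(String[] args) throws Exception {
          ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

          // Placeholder inputs; in practice these would come from env.createInput(...).
          DataSet<Tuple3<String, String, Integer>> input1 = env.fromElements(
                  new Tuple3<String, String, Integer>("a", "x", 1),
                  new Tuple3<String, String, Integer>("b", "y", 2));
          DataSet<Tuple3<String, String, Integer>> input2 = env.fromElements(
                  new Tuple3<String, String, Integer>("a", "z", 3));

          // Union is cheap: both inputs are repartitioned only once, for the groupBy.
          DataSet<Tuple3<String, String, Integer>> result = input1
                  .union(input2)
                  .groupBy(0) // group by the first tuple field
                  .reduce(new ReduceFunction<Tuple3<String, String, Integer>>() {
                      @Override
                      public Tuple3<String, String, Integer> reduce(
                              Tuple3<String, String, Integer> v1,
                              Tuple3<String, String, Integer> v2) {
                          // Example aggregation: keep the first two fields, sum the third.
                          return new Tuple3<String, String, Integer>(v1.f0, v1.f1, v1.f2 + v2.f2);
                      }
                  });

          // print() triggers execution in recent Flink versions;
          // older versions need an explicit env.execute().
          result.print();
      }
  }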