Re: Union of multiple datasets vs Join

Flavio Pompermaier Wed, 18 Mar 2015 01:25:51 -0700

I don't know if that could be useful, do you?

On Tue, Mar 17, 2015 at 10:29 PM, Stephan Ewen <se...@apache.org> wrote:


> Yes, that is the way to do it.
>
> This makes me think that it would be nice to have a method that builds the
> union of a list of data sets.
>
> DataSet<T> union(DataSet<T>... sets)
>
> It would be implemented like in your loop. Would that be helpful?
>
> Stephan
>
>
> On Tue, Mar 17, 2015 at 6:03 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> As always one minute after I sent the email I found the problem!
>> It was that I should reassign the initial dataset:
>>     ret = ret.union(sourceDs);
>>
>> Bye,
>> Flavio
>>
>> On Tue, Mar 17, 2015 at 5:58 PM, Flavio Pompermaier <pomperma...@okkam.it
>> > wrote:
>>
>>> Hi Fabian,
>>> I was trying to use the strategy you suggested with flink 0.8.1 but it
>>> seems that the union of the datasets cannot be created programmatically
>>> because the union operator gives a name to the generated dataset that is
>>> the name of the calling function so that  only the first dataset is read.
>>> My code looks like:
>>>
>>> private static DataSet<Tuple6<...> getSourceDs(ExecutionEnvironment env,
>>>  final String outputGraph, List<String> tableNames) {
>>> DataSet<Tuple6<...>> ret = null;
>>> for (String tableName : tableNames) {
>>> DataSet<Tuple6<...>> sourceDs = env.createInput(new
>>> MyTableInputFormat(tableName))
>>>                         ....
>>>
>>> if(ret==null)
>>> ret = sourceDs;
>>> else
>>> ret.union(sourceDs);
>>>                }
>>>               return ret;
>>>        }
>>>
>>> Is this a bug or am I'm doing something wrong?
>>> Thanks in advance,
>>> Flavio
>>>
>>> On Mon, Dec 22, 2014 at 2:42 PM, <fhue...@gmail.com> wrote:
>>>
>>>>  Union is just combining data from multiple sources into a single
>>>> dataset.
>>>> That’s it. No memory, no disk involved.
>>>>
>>>> In you case you have
>>>>
>>>> input1.union(input2).groupBy(1).reduce(…)
>>>>
>>>> This will translate into:
>>>>
>>>> input1 -> repartition ->
>>>>                                         read-both-inputs ->  sort ->
>>>> reduce
>>>> input2 -> repartition ->
>>>>
>>>> So, in your case not even additional network transfer is involved,
>>>> because both data sets would need to be partitioned for the reduce anyway.
>>>>
>>>> Note, union in Flink has SQL union-all semantics, i.e., there is
>>>> not removal of duplicates.
>>>>
>>>> Cheers, Fabian
>>>>
>>>> *From:* Flavio Pompermaier <pomperma...@okkam.it>
>>>> *Sent:* ‎Monday‎, ‎22‎. ‎December‎, ‎2014 ‎14‎:‎32
>>>> *To:* u...@flink.incubator.apache.org
>>>>
>>>> Ok thanks Fabian. I'd like just to know the internals of the union of
>>>> multiple datasets (partitioning, distribution among server, memory/disk,
>>>> etc..). Do you have any ref to this?
>>>>
>>>> Thanks in advance,
>>>> Flavio
>>>>
>>>> On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <fhue...@apache.org>
>>>> wrote:
>>>>
>>>>> Follow the first approach.
>>>>> Joins are expensive, union comes for free.
>>>>>
>>>>> Best, Fabian
>>>>>
>>>>> 2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> In my use case I have multiple Datasets with the same structure (e.g.
>>>>>> Tuple3) and I want to produce an output Dataset containing all Tuple3
>>>>>> grouped by the first field (0).
>>>>>> I can obtain the same results performing a union of all datasets and
>>>>>> then a group by (simplest implementation) or join all of them pairwise
>>>>>> (((A->B)->C)->D)..) or I don't know if there is any other solution. When
>>>>>> should I use the first or the second approach? Could you help me in
>>>>>> figuring out the internals of the two approaches? I always have some fear
>>>>>> when using multiple joins when I don't know exactly their size..
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>
>>>>>
>>>>
>>
>

Re: Union of multiple datasets vs Join

Reply via email to