Thank you for that answer. It helped me a lot.

2017-02-23 22:10 GMT+01:00 Fabian Hueske <fhue...@gmail.com>:

> Hi Patrick,
>
> as Robert said, partitionBy() shuffles the data such that all records with
> the same key end up in the same partition. That's all it does.
> groupBy() also prepares the data in each partition to be processed per
> key. For example, if you run a groupReduce after a groupBy(), the data is
> first shuffled (just like partitionBy()) and then in each partition sorted
> to organize it by key. So groupBy() does more than partitionBy() because it
> organizes the data in each partition to be processed by key.
>
> Moreover, groupBy() alone is not a complete operation but just "prepares"
> a following operation. It must be followed by a reduce or combine operator.
> In contrast, partitionBy() is by itself complete.
> So the difference between partitionBy() and groupBy() is more than just an
> API thing.
>
> Hope that helps,
> Fabian
>
> 2017-02-23 21:51 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>
>> Hi Patrick,
>>
>> I think (but I'm not 100% sure) it's not a difference in what the engine
>> does in the end; it's more of an API thing. When you are grouping, you can
>> perform operations such as reducing afterwards.
>> On a partitioned dataset, you can do things like processing each partition
>> in parallel, or sorting them.
>>
>> The parallelism is independent of the partitioning or grouping. Usually
>> there are more partitions than parallel instances, so each instance will
>> take care of multiple partitions.
>>
>>
>>
>> On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <j...@kpibench.com>
>> wrote:
>>
>>> What is the basic difference between partitioning datasets by key and
>>> grouping them by key?
>>>
>>> Does it make a difference in terms of parallelism?
>>>
>>> Thx
>>>
>>
>>
>
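For anyone finding this thread in the archive: the distinction Fabian describes can be sketched in a few lines of plain Python. This is only an illustration, not Flink code and not how the engine is implemented. The function names (partition_by_key, group_reduce), the modulo "hash", and the integer keys are all assumptions made up for the example.

```python
from itertools import groupby


def partition_by_key(records, num_partitions):
    """Shuffle only: route each (key, value) record to a partition chosen from its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Assumption: integer keys, with modulo standing in for a hash partitioner.
        partitions[key % num_partitions].append((key, value))
    return partitions


def group_reduce(records, num_partitions, reduce_fn):
    """Shuffle, then sort each partition by key and reduce each key's values."""
    results = []
    for partition in partition_by_key(records, num_partitions):
        # The extra step grouping adds: organize the partition by key.
        partition.sort(key=lambda rec: rec[0])
        for key, group in groupby(partition, key=lambda rec: rec[0]):
            results.append((key, reduce_fn([value for _, value in group])))
    return results


# Partitioning alone: same keys share a partition, nothing else is done.
partitions = partition_by_key(
    [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")], 2
)

# Grouping plus a reduce: values are combined per key.
sums = group_reduce([(1, 10), (2, 20), (1, 5), (3, 7), (2, 1)], 2, sum)
```

Note that partition_by_key returns complete data on its own, while the grouping step only makes sense together with the reduce function that consumes each group.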
