Thank you for that answer. Helped me a lot 2017-02-23 22:10 GMT+01:00 Fabian Hueske <fhue...@gmail.com>:
> Hi Patrick, > > as Robert said, partitionBy() shuffles the data such that all records with > the same key end up in the same partition. That's all it does. > groupBy() also prepares the data in each partition to be processed per > key. For example, if you run a groupReduce after a groupBy(), the data is > first shuffled (just like partitionBy()) and then in each partition sorted > to organize it by key. So groupBy() does more than partitionBy() because it > organizes the data in each partition to be processed by key. > > Moreover, groupBy() alone is not a complete operation but just "prepares" > a following operation. It must be called with a reduce or combine operator. > In contrast partitionBy() is by itself complete. > So the difference between partitionBy() and groupBy() is more than just an > API thing. > > Hope that helps, > Fabian > > 2017-02-23 21:51 GMT+01:00 Robert Metzger <rmetz...@apache.org>: > >> Hi Patrick, >> >> I think (but I'm not 100% sure) its not a difference in what the engine >> does in the end, its more of an API thing. When you are grouping, you can >> perform operations such as reducing afterwards. >> On a partitioned dataset, you can do stuff like processing each partition >> in parallel, or sort them. >> >> The parallelism is independent of the partitioning or grouping. Usually >> there are more partitions than parallel instances, so each instance will >> take care of multiple partitions. >> >> >> >> On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <j...@kpibench.com> >> wrote: >> >>> What is the basic difference between partitioning datasets by key or >>> grouping them by key ? >>> >>> Does it make a difference in terms of paralellism ? >>> >>> Thx >>> >> >> >