Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-22 Thread Takeshi Yamamuro
My bad, thanks.

On Fri, Jan 22, 2016 at 4:34 PM, Reynold Xin wrote:
> The original email was asking about data partitioning (Hive style) for
> files, not in memory caching.
>
> On Thursday, January 21, 2016, Takeshi Yamamuro wrote:
>> You mean RDD#partitions are possibly split into multiple Spark task
>> partitions?

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
The original email was asking about data partitioning (Hive style) for
files, not in memory caching.

On Thursday, January 21, 2016, Takeshi Yamamuro wrote:
> You mean RDD#partitions are possibly split into multiple Spark task
> partitions?
> If so, the optimization below is wrong?
>
> Without opt.:

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Takeshi Yamamuro
You mean RDD#partitions are possibly split into multiple Spark task
partitions? If so, the optimization below is wrong?

Without opt.:
== Physical Plan ==
TungstenAggregate(key=[col0#159],
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
outp
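
For anyone reproducing this comparison, a minimal spark-shell style sketch
for printing the physical plan of the aggregation above. It assumes a Spark
2.0-style SparkSession named `spark` and a hypothetical Parquet path; the
column names col0/col1/col2 mirror the plan fragment, everything else is an
assumption.

    import org.apache.spark.sql.functions.{avg, sum}

    // Hypothetical input; the path is an assumption, not from this thread.
    val df = spark.read.parquet("/tmp/events_by_col0")

    // explain() prints the physical plan, so you can check whether an
    // Exchange (shuffle) sits between the partial and final TungstenAggregate
    // for sum(col1) / avg(col2).
    df.groupBy("col0").agg(sum("col1"), avg("col2")).explain()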

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
It is not necessary if you are using bucketing available in Spark 2.0. For
partitioning, it is still necessary because we do not assume each partition
is small, and as a result there is no guarantee all the records for a
partition end up in a single Spark task partition.

On Thu, Jan 21, 2016 at 3
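
For context, a rough sketch of what bucketing looks like with the Spark 2.0
DataFrameWriter API; the bucket count, table name, and column names are
illustrative assumptions, not taken from this thread.

    import org.apache.spark.sql.functions.sum

    val df = spark.range(0, 100).selectExpr("id % 10 as id", "id as value")

    // bucketBy pre-hashes rows into a fixed number of buckets per output file.
    // Unlike partitionBy, the bucket spec is recorded with the table, so the
    // planner can reuse it and skip the Exchange for aggregations and joins
    // on the bucketed column. Note: bucketBy requires saveAsTable, not a
    // plain path-based save.
    df.write.bucketBy(8, "id").sortBy("id").saveAsTable("events_bucketed")

    spark.table("events_bucketed").groupBy("id").agg(sum("value")).explain()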

Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Justin Uang
Hi,

If I had a df and I wrote it out via partitionBy("id"), presumably, when I
load the df back in and do a groupBy("id"), a shuffle shouldn't be necessary,
right? Effectively, we can load in the dataframe with a hash partitioner
already set, since each task can simply read all the folders where id= whe
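
For illustration, a minimal sketch of the scenario described above, using
Spark 2.0-style APIs in spark-shell; the path and column names are
assumptions, not from the original message.

    import org.apache.spark.sql.functions.{avg, sum}

    val df = spark.range(0, 1000)
      .selectExpr("id % 10 as id", "id as col1", "id * 1.0 as col2")

    // Hive-style partitioned layout on disk: one folder per value, id=0/, id=1/, ...
    df.write.partitionBy("id").parquet("/tmp/events_by_id")

    // Read back and group on the partition column. Per the replies above, Spark
    // does not derive a hash partitioner from the folder layout, so the plan
    // still contains an Exchange (shuffle) before the final aggregate; bucketing
    // in Spark 2.0 is the mechanism that avoids it.
    val loaded = spark.read.parquet("/tmp/events_by_id")
    loaded.groupBy("id").agg(sum("col1"), avg("col2")).explain()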