Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-22 Thread Takeshi Yamamuro
My bad, thanks.

On Fri, Jan 22, 2016 at 4:34 PM, Reynold Xin wrote:
> The original email was asking about data partitioning (Hive style) for
> files, not in-memory caching.
>
> On Thursday, January 21, 2016, Takeshi Yamamuro wrote:
>> You mean RDD#partitions are possibly split into multiple Spark task
>> partitions?

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
The original email was asking about data partitioning (Hive style) for files, not in-memory caching.

On Thursday, January 21, 2016, Takeshi Yamamuro wrote:
> You mean RDD#partitions are possibly split into multiple Spark task
> partitions? If so, is the optimization below wrong?
>
> Without opt.: ...
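[To make the contrast concrete, a minimal Scala sketch of the two things being distinguished here; the input path, output path, and the column name "date" are hypothetical, not taken from the thread, and the Spark 2.0-style SparkSession entry point is assumed.]

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partition-vs-cache").getOrCreate()
  val events = spark.read.parquet("/data/events")   // hypothetical input path

  // Hive-style partitioning of files on disk: one directory per distinct
  // value of "date", e.g. /data/events_by_date/date=2016-01-21/part-*.parquet
  events.write.partitionBy("date").parquet("/data/events_by_date")

  // In-memory caching is a separate concern: it keeps the scanned rows in
  // executor memory but says nothing about how records are laid out on disk.
  events.cache()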

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Takeshi Yamamuro
You mean RDD#partitions are possibly split into multiple Spark task partitions? If so, is the optimization below wrong?

Without opt.:

== Physical Plan ==
TungstenAggregate(key=[col0#159],
  functions=[(sum(col1#160),mode=Final,isDistinct=false),
             (avg(col2#161),mode=Final,isDistinct=false)],
  outp...
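[The plan fragment above is the output of DataFrame.explain(). A rough sketch of how such a comparison can be reproduced in the 1.6-era spark-shell; the table name "t" is hypothetical, while the column names follow the fragment above.]

  // sqlContext is predefined in spark-shell; sum/avg need an explicit import.
  import org.apache.spark.sql.functions.{avg, sum}

  val df = sqlContext.table("t")
    .groupBy("col0")
    .agg(sum("col1"), avg("col2"))

  // Prints "== Physical Plan ==". Without the optimization, an Exchange
  // (shuffle) operator appears between the partial and final
  // TungstenAggregate steps; with it, the Exchange is gone.
  df.explain()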

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
It is not necessary if you are using bucketing, available in Spark 2.0. For partitioning, it is still necessary, because we do not assume each partition is small, and as a result there is no guarantee that all the records for a partition end up in a single Spark task partition.

On Thu, Jan 21, 2016 at 3...
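[A minimal sketch of the bucketing referred to here, using the Spark 2.0 DataFrameWriter API; the paths, table name, and number of buckets are illustrative, not from the thread.]

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()

  // Bucket the data by the grouping key when writing it out. The number of
  // buckets is fixed up front, which is what lets Spark line task partitions
  // up with buckets later instead of shuffling.
  spark.read.parquet("/data/events")        // hypothetical input path
    .write
    .bucketBy(8, "col0")
    .sortBy("col0")
    .saveAsTable("events_bucketed")         // bucketing requires saveAsTable

  // An aggregation keyed on the bucket column can then read the buckets
  // directly; explain() should show no Exchange before the final aggregate.
  spark.table("events_bucketed")
    .groupBy("col0")
    .agg(sum("col1"))
    .explain()

[With plain partitionBy, by contrast, a single directory partition can be large and end up split across many input tasks, which is why the shuffle is still needed in that case.]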