Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-22 Thread Takeshi Yamamuro
My bad, thanks.

On Fri, Jan 22, 2016 at 4:34 PM, Reynold Xin wrote:
> The original email was asking about data partitioning (Hive style) for
> files, not in-memory caching.
>
> On Thursday, January 21, 2016, Takeshi Yamamuro wrote:
>> You mean RDD#partitions are possibly split into multiple Spark task
>> partitions?

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
The original email was asking about data partitioning (Hive style) for files, not in-memory caching.

On Thursday, January 21, 2016, Takeshi Yamamuro wrote:
> You mean RDD#partitions are possibly split into multiple Spark task
> partitions? If so, is the optimization below wrong?
>
> Without opt.: ...
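[To make the contrast concrete, a minimal Scala sketch of the two things being distinguished here; the input path, output path, and the column name "date" are hypothetical, not taken from the thread, and the Spark 2.0-style SparkSession entry point is assumed.]

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partition-vs-cache").getOrCreate()
  val events = spark.read.parquet("/data/events")   // hypothetical input path

  // Hive-style partitioning of files on disk: one directory per distinct
  // value of "date", e.g. /data/events_by_date/date=2016-01-21/part-*.parquet
  events.write.partitionBy("date").parquet("/data/events_by_date")

  // In-memory caching is a separate concern: it keeps the scanned rows in
  // executor memory but says nothing about how records are laid out on disk.
  events.cache()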

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Takeshi Yamamuro
You mean RDD#partitions are possibly split into multiple Spark task partitions? If so, is the optimization below wrong?

Without opt.:

== Physical Plan ==
TungstenAggregate(key=[col0#159],
  functions=[(sum(col1#160),mode=Final,isDistinct=false),
             (avg(col2#161),mode=Final,isDistinct=false)],
  outp...
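[The plan fragment above is the output of DataFrame.explain(). A rough sketch of how such a comparison can be reproduced in the 1.6-era spark-shell; the table name "t" is hypothetical, while the column names follow the fragment above.]

  // sqlContext is predefined in spark-shell; sum/avg need an explicit import.
  import org.apache.spark.sql.functions.{avg, sum}

  val df = sqlContext.table("t")
    .groupBy("col0")
    .agg(sum("col1"), avg("col2"))

  // Prints "== Physical Plan ==". Without the optimization, an Exchange
  // (shuffle) operator appears between the partial and final
  // TungstenAggregate steps; with it, the Exchange is gone.
  df.explain()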

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
It is not necessary if you are using bucketing, available in Spark 2.0. For partitioning, it is still necessary, because we do not assume each partition is small, and as a result there is no guarantee that all the records for a partition end up in a single Spark task partition.

On Thu, Jan 21, 2016 at 3...
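[A minimal sketch of the bucketing referred to here, using the Spark 2.0 DataFrameWriter API; the paths, table name, and number of buckets are illustrative, not from the thread.]

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()

  // Bucket the data by the grouping key when writing it out. The number of
  // buckets is fixed up front, which is what lets Spark line task partitions
  // up with buckets later instead of shuffling.
  spark.read.parquet("/data/events")        // hypothetical input path
    .write
    .bucketBy(8, "col0")
    .sortBy("col0")
    .saveAsTable("events_bucketed")         // bucketing requires saveAsTable

  // An aggregation keyed on the bucket column can then read the buckets
  // directly; explain() should show no Exchange before the final aggregate.
  spark.table("events_bucketed")
    .groupBy("col0")
    .agg(sum("col1"))
    .explain()

[With plain partitionBy, by contrast, a single directory partition can be large and end up split across many input tasks, which is why the shuffle is still needed in that case.]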