Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
You still have the problem that, even within a single Job, not every Exchange really wants to use the same number of shuffle partitions. On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen wrote: > Once you get to needing this level of fine-grained control,
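
To make that point concrete, here is a minimal sketch, assuming a spark-shell session of that era where sqlContext is predefined: a single query whose plan contains two Exchanges, both of which inherit the one global setting even though their data volumes differ.

  import sqlContext.implicits._

  sqlContext.setConf("spark.sql.shuffle.partitions", "200")

  val big   = (1 to 1000000).map(i => (i, i % 100)).toDF("id", "key")
  val small = (1 to 100).map(i => (i, "group" + i)).toDF("key", "label")

  // The join plans one Exchange and the groupBy plans another; both are
  // given 200 shuffle partitions, even though the aggregated output is tiny.
  val q = big.join(small, "key").groupBy("label").count()
  q.explain()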

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
AFAIK, the adaptive shuffle partitioning still isn't completely ready to be made the default, and there are some corner cases that need to be addressed before this functionality is declared finished and ready. E.g., the current logic can make data-skew problems worse by turning One Big Partition

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Sean Owen
Once you get to needing this level of fine-grained control, should you not consider using the programmatic API in part, to let you control individual jobs? On Tue, Nov 15, 2016 at 1:19 AM leo9r wrote: > Hi Daniel, > > I completely agree with your request. As the amount of
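
A hedged sketch of what that per-job control could look like (the table names and output paths here are made up): because spark.sql.shuffle.partitions is read when a query is planned, bracketing each action with its own setConf call gives each job an appropriate shuffle width. Note that this mutates shared SQLContext state, so it is not safe for concurrently submitted jobs.

  sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
  sqlContext.sql("SELECT key, count(*) AS c FROM big_table GROUP BY key")
    .write.parquet("/tmp/wide")     // heavy shuffle: many partitions

  sqlContext.setConf("spark.sql.shuffle.partitions", "20")
  sqlContext.sql("SELECT c, count(*) FROM small_table GROUP BY c")
    .write.parquet("/tmp/narrow")   // light shuffle: few partitions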

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread leo9r
That's great insight, Mark; I'm looking forward to giving it a try! According to the JIRA "Adaptive execution in Spark", it seems that some functionality was added in Spark 1.6.0 and the rest is still in progress. Are there any improvements to the

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread Mark Hamstra
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator. A single, fixed sql.shuffle.partitions is not the only way to control the number of partitions in an Exchange -- if you are willing to deal with code that is still off by default. On Mon, Nov 14, 2016 at 4:19 PM,
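
For concreteness, a minimal sketch of switching that experimental machinery on; the configuration keys below are the ones used in the 1.6/2.0 codebase, and defaults may differ between versions:

  // Let the ExchangeCoordinator pick post-shuffle partition counts at
  // runtime from map-output statistics instead of using a fixed number.
  sqlContext.setConf("spark.sql.adaptive.enabled", "true")
  // Approximate target input size per post-shuffle partition, in bytes;
  // small partitions are coalesced up to roughly this size.
  sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
                     (64 * 1024 * 1024).toString)
  // Optional floor on the number of post-shuffle partitions.
  sqlContext.setConf("spark.sql.adaptive.minNumPostShufflePartitions", "10")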

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread leo9r
Hi Daniel, I completely agree with your request. As the amount of data being processed with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need to prevent OOM and performance degradation. The fact that sql.shuffle.partitions cannot be set several times in the same job/action,
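
A hedged illustration of that limitation (bigDF and otherDF are hypothetical DataFrames): the conf is consulted only when the action is planned, so the last value set before the action applies to every Exchange in that job -- which is exactly why storing the parameter in the lineage is being requested.

  sqlContext.setConf("spark.sql.shuffle.partitions", "1000")
  val joined = bigDF.join(otherDF, "key")   // lazy: nothing is planned yet

  sqlContext.setConf("spark.sql.shuffle.partitions", "50")
  joined.groupBy("key").count().collect()
  // One job, two Exchanges (join + aggregate): both are planned with 50
  // partitions; the earlier setting of 1000 is not stored in the lineage.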

Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2015-07-15 Thread daniel.mescheder
Hey everyone, consider the following use of spark.sql.shuffle.partitions:

  case class Data(A: String = f"${(math.random*1e8).toLong}%09d",
                  B: String = f"${(math.random*1e8).toLong}%09d")
  val dataFrame = (1 to 1000).map(_ => Data()).toDF
  dataFrame.registerTempTable("data")
  sqlContext.setConf(
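
The archived message is truncated at setConf(. A hypothetical continuation consistent with the thread's subject would set the shuffle parallelism and then query the registered table, e.g.:

  // Hypothetical continuation -- the original text is cut off above:
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")
  sqlContext.sql("SELECT A, count(B) FROM data GROUP BY A").collect()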