Re: Dataframe Partitioning

2016-03-01 Thread yash datta
+1 This is one of the most common problems we encounter in our flow. Mark, I am happy to help if you would like to share some of the workload.

Best
Yash

On Wednesday 2 March 2016, Mark Hamstra wrote:
> I don't entirely agree. You're best off picking the right size …

Re: Dataframe Partitioning

2016-03-01 Thread Mark Hamstra
I don't entirely agree. You're best off picking the right size :). That's almost impossible, though, since at the input end of the query processing you often want a large number of partitions to get sufficient parallelism, both for performance and to avoid spilling or OOM, while at the output end …
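A minimal sketch of the tension described above, assuming Spark 1.6-era APIs, an existing SQLContext named sqlContext, and purely illustrative paths and column names:

    // Upstream: leave spark.sql.shuffle.partitions high (the 200 default)
    // so the shuffle stage has enough parallelism to avoid spilling or OOM.
    val counts = sqlContext.read.parquet("/path/to/large-input") // illustrative path
      .groupBy("key") // "key" is an assumed column name
      .count()

    // Downstream: narrow the partition count just before writing, so the
    // output end does not emit hundreds of tiny files. coalesce() merges
    // existing partitions without triggering another full shuffle.
    counts.coalesce(8).write.parquet("/path/to/output") // illustrative path

coalesce is the usual tool at the output end because, unlike repartition, it merges existing partitions rather than reshuffling all of the data.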

Dataframe Partitioning

2016-03-01 Thread Teng Liao
Hi,

I was wondering what the rationale is behind defaulting all repartitioning to spark.sql.shuffle.partitions. I'm seeing a huge overhead when running a job whose input has only 2 partitions: with the default value of spark.sql.shuffle.partitions, every shuffle now produces 200 partitions.

Thanks,
-Teng Fei Liao
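For a job with such a small input, one workaround is to lower the setting to match. A minimal sketch, assuming a Spark 1.6-era SQLContext named sqlContext and an illustrative input path and column name:

    // Match the shuffle partition count to the small input instead of
    // accepting the default of 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")

    val df = sqlContext.read.json("/path/to/small-input.json") // illustrative path
    // This aggregation triggers a shuffle; it now yields 2 partitions, not 200.
    df.groupBy("key").count().show() // "key" is an assumed column name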