Understanding shuffle file name conflicts

2015-03-24 Thread Kannan Rajah
I am working on SPARK-1529. I ran into an issue with my change, where the same shuffle file was being reused across 2 jobs. Please note that this only happens when I use a hard-coded location for shuffle files, say /tmp. It does not happen with the normal code path that uses DiskBlockManager to pick
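
For context, here is a minimal sketch of one way such a name conflict could arise. It assumes shuffle file names are derived only from the shuffle, map, and reduce IDs (following the shuffle_<shuffleId>_<mapId>_<reduceId> pattern of Spark's ShuffleBlockId) and that two independent writers both start numbering at 0. The helper functions and the randomly named subdirectory scheme are hypothetical, written for illustration, and are not Spark's actual implementation or necessarily the cause of the issue reported in this thread.

```python
import os
import uuid


def shuffle_block_name(shuffle_id, map_id, reduce_id):
    # Follows the shuffle_<shuffleId>_<mapId>_<reduceId> naming pattern.
    return f"shuffle_{shuffle_id}_{map_id}_{reduce_id}"


def fixed_dir_path(shuffle_id, map_id, reduce_id, root="/tmp"):
    # Hypothetical hard-coded location: every writer shares one directory.
    return os.path.join(root, shuffle_block_name(shuffle_id, map_id, reduce_id))


def new_local_dir(root="/tmp"):
    # Hypothetical per-writer directory name. Only a path string is built;
    # nothing is created on disk.
    return os.path.join(root, f"spark-local-{uuid.uuid4()}")


def unique_dir_path(local_dir, shuffle_id, map_id, reduce_id):
    # Same file name, but placed under a directory unique to the writer.
    return os.path.join(local_dir, shuffle_block_name(shuffle_id, map_id, reduce_id))


# Two independent writers that both use shuffle 0, map 0, reduce 0.
first = fixed_dir_path(0, 0, 0)
second = fixed_dir_path(0, 0, 0)
collides = first == second  # same path, so one writer would reuse the other's file

first_unique = unique_dir_path(new_local_dir(), 0, 0, 0)
second_unique = unique_dir_path(new_local_dir(), 0, 0, 0)
still_collides = first_unique == second_unique
```

With a shared fixed directory the two paths are identical, while unique parent directories keep the same file name from clashing.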

Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-18 Thread Kannan Rajah
According to the Hive documentation, SORT BY is supposed to order the results within each reducer. So if we set a single reducer, then the results should be fully sorted, right? But this is not happening. Any idea why? It looks like the settings I am using to restrict the number of reducers are not having an
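
The per-reducer ordering described above can be illustrated with a small pure-Python simulation. It does not use Hive or Spark; the function name sort_by and the modulo partitioner are hypothetical stand-ins chosen for this sketch.

```python
def sort_by(rows, num_reducers):
    # Assign each row to a reducer. The modulo partitioner is an assumed
    # stand-in for whatever partitioner the engine actually uses.
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[row % num_reducers].append(row)
    # SORT BY semantics: each reducer sorts only the rows it received.
    return [sorted(p) for p in partitions]


rows = [5, 3, 8, 1, 9, 2]

two_reducers = sort_by(rows, 2)
one_reducer = sort_by(rows, 1)

# Concatenating reducer outputs gives the overall result order.
flat_two = [r for part in two_reducers for r in part]
flat_one = [r for part in one_reducer for r in part]
```

With two reducers each partition is sorted but the concatenated output is not; with a single reducer the output is totally ordered, which is the behavior the question expects.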

Performance test for sort shuffle

2015-02-02 Thread Kannan Rajah
Is there a recommended performance test for sort-based shuffle? Something similar to TeraSort on Hadoop. I couldn't find one in the spark-perf code base. https://github.com/databricks/spark-perf -- Kannan