I am working on SPARK-1529. I ran into an issue with my change, where the
same shuffle file was being reused across two jobs. Please note this only
happens when I use a hard-coded location for the shuffle files, say
/tmp. It does not happen with the normal code path, which uses
DiskBlockManager to pick the file location.
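
Here is a minimal sketch of what I think is happening (assuming shuffle
block files are named shuffle_<shuffleId>_<mapId>_<reduceId>, as in
ShuffleBlockId; the helper below is hypothetical, just to illustrate):

import java.io.File

// Hypothetical helper mimicking a hard-coded shuffle file location.
// All three IDs restart from 0 in a fresh application, so a fixed
// directory yields the same path across runs.
def shuffleFile(dir: File, shuffleId: Int, mapId: Int, reduceId: Int): File =
  new File(dir, s"shuffle_${shuffleId}_${mapId}_${reduceId}")

val fixedDir = new File("/tmp")
val fromJob1 = shuffleFile(fixedDir, 0, 0, 0) // first run's map output
val fromJob2 = shuffleFile(fixedDir, 0, 0, 0) // second run: identical path
assert(fromJob1 == fromJob2) // the second job silently reuses the first's file

DiskBlockManager avoids this by hashing each block file into per-application
local directories, so two runs never resolve to the same path.
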
According to the Hive documentation, SORT BY is supposed to order the
results within each reducer. So if we set a single reducer, the results
should be fully sorted, right? But this is not happening. Any idea why? It
looks like the setting I am using to restrict the number of reducers is not
taking effect.
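
For reference, this is roughly what I am running (a sketch against Spark
SQL's HiveContext; the table name "records", the column "key", and the
mapred.reduce.tasks setting are my assumptions, not a confirmed repro):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "sort-by-test")
val hiveContext = new HiveContext(sc)

// Force a single reducer; SORT BY only orders rows within each reducer,
// so with exactly one reducer the output should be totally ordered.
hiveContext.sql("SET mapred.reduce.tasks=1")

val sorted = hiveContext.sql("SELECT key FROM records SORT BY key")
sorted.collect().foreach(println)
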
Is there a recommended performance test for sort-based shuffle, something
similar to terasort on Hadoop? I couldn't find one in the spark-perf code
base:
https://github.com/databricks/spark-perf
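
In case it helps, this is the kind of terasort-style micro-benchmark I had
in mind, sketched against the plain RDD API (the record and partition
counts are made-up placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions (sortByKey) on 1.x
import scala.util.Random

object SortShuffleBench {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-shuffle-bench"))

    val numPartitions = 100          // placeholder
    val recordsPerPartition = 100000 // placeholder scale

    // Generate random (key, value) pairs, then sort by key: the
    // range-partitioned shuffle that sortByKey triggers is what a
    // terasort-style test mainly measures.
    val data = sc.parallelize(0 until numPartitions, numPartitions).flatMap { p =>
      val rand = new Random(p)
      Iterator.fill(recordsPerPartition)((rand.nextLong(), rand.nextLong()))
    }

    val start = System.nanoTime()
    val count = data.sortByKey().count()
    println(s"Sorted $count records in ${(System.nanoTime() - start) / 1e9} seconds")

    sc.stop()
  }
}
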
--
Kannan