Matei Zaharia created SPARK-2045:
------------------------------------

             Summary: Sort-based shuffle implementation
                 Key: SPARK-2045
                 URL: https://issues.apache.org/jira/browse/SPARK-2045
             Project: Spark
          Issue Type: New Feature
            Reporter: Matei Zaharia


Building on the pluggability in SPARK-2044, a sort-based shuffle implementation
that takes advantage of an Ordering for keys (or just sorts by hash code for
keys that don't have one) would likely improve performance and memory usage in
very large shuffles. Our current hash-based shuffle needs an open file for each
reduce task, which can use a lot of memory for compression buffers and
cause inefficient IO. A sort-based shuffle would avoid both of those issues.
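
As a rough illustration of the idea (not Spark's actual implementation or
API), the sketch below sorts one map task's records by reduce partition and
then by key, so that every partition becomes one contiguous run in a single
output plus a small offset index. The names sort_based_shuffle_write and
read_partition are hypothetical, hash-modulo partitioning is an assumption,
and in-memory lists stand in for the single output file and its index.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

Record = Tuple[Any, Any]


def sort_based_shuffle_write(
    records: Iterable[Record],
    num_partitions: int,
    key_ordering: Optional[Callable[[Any], Any]] = None,
) -> Tuple[List[Record], List[int]]:
    """Return (ordered_records, offsets) for one map task's output.

    offsets[p]:offsets[p + 1] is the slice that belongs to reduce partition p.
    """
    # Assumption: fall back to the key's hash when no ordering is supplied.
    sort_key = key_ordering if key_ordering is not None else hash

    def partition_of(key: Any) -> int:
        # Assumption: simple hash-modulo partitioning.
        return hash(key) % num_partitions

    # Sort by (partition, key) so each partition is one contiguous run.
    ordered = sorted(
        records, key=lambda kv: (partition_of(kv[0]), sort_key(kv[0]))
    )

    # Count records per partition, then prefix-sum the counts into offsets.
    offsets = [0] * (num_partitions + 1)
    for key, _ in ordered:
        offsets[partition_of(key) + 1] += 1
    for p in range(num_partitions):
        offsets[p + 1] += offsets[p]
    return ordered, offsets


def read_partition(
    ordered: List[Record], offsets: List[int], p: int
) -> List[Record]:
    """Fetch the contiguous run of records for reduce partition p."""
    return ordered[offsets[p]:offsets[p + 1]]
```

In this sketch a reducer needs only the two offsets for its partition, so the
map side writes one sequential output instead of holding one open file (and
one compression buffer) per reduce task.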



--
This message was sent by Atlassian JIRA
(v6.2#6252)
