Re: Possible space improvements to shuffle

2015-06-02 Thread John Carrino
Yes, I think that bug is what I want. Thank you. So I guess the current reason is that we don't want to buffer up numMapper incoming streams. So we just iterate through each and transfer it over in full because that is more network efficient? I'm not sure I understand why you wouldn't want to so

Re: Possible space improvements to shuffle

2015-06-02 Thread Josh Rosen
The relevant JIRA that springs to mind is https://issues.apache.org/jira/browse/SPARK-2926 If an aggregator and ordering are both defined, then the map side of sort-based shuffle will sort based on the key ordering so that map-side spills can be efficiently merged. We do not currently do a sort-b

Possible space improvements to shuffle

2015-06-02 Thread John Carrino
One thing I have noticed with ExternalSorter is that if an ordering is not defined, it does the sort using only the partition_id, instead of (parition_id, hash). This means that on the reduce side you need to pull the entire dataset into memory before you can begin iterating over the results. I f