Hi
Try using *reduceByKeyLocally*.
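A minimal sketch of that approach, assuming sc is an existing SparkContext and a whitespace-delimited text input (the input path is a placeholder):

// reduceByKeyLocally merges the values for each key and returns a
// scala.collection.Map to the driver instead of an RDD, so no shuffle
// files are written. Only use it when the set of distinct keys fits
// comfortably in driver memory.
val counts: scala.collection.Map[String, Int] =
  sc.textFile("hdfs:///data/input")     // placeholder path
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKeyLocally(_ + _)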
Regards
Lukas Nalezenec
On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Make sure you set up enough reduce partitions so that no single reduce task is overloaded.
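For example, the partition count can be passed directly to reduceByKey (pairs and the value 2000 below are placeholders):

// An explicit partition count spreads the reduce work across more tasks,
// so each task shuffles and aggregates a smaller slice of the data.
val reduced = pairs.reduceByKey(_ + _, 2000)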
Another thing that may help is checking whether you’ve run out of local
disk space on the machines, and turning on spark.shuffle.consolidateFiles
to produce fewer files. Finally, there’s been a recent fix in both branch
0.9 and master that reduces the amount of memory used when there are small
files (due to extra memory that was being taken by mmap()):
https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
either the 1.0 release candidates on the dev list, or branch-0.9 in git.
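As a sketch, that flag can be set when the context is built (the app name is a placeholder; the property defaults to false in these releases):

import org.apache.spark.{SparkConf, SparkContext}

// With consolidation on, shuffle output is grouped into one file per core
// per reduce partition rather than one per map task per reduce partition.
val conf = new SparkConf()
  .setAppName("reduceByKey-job")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)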
Matei
On May 17, 2014, at 5:45 PM, Madhu ma...@madhu.com wrote:
Daniel,
How many partitions do you have?
Are they more or less uniformly distributed?
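One quick way to check both, assuming pairs names the RDD being reduced:

// Count the partitions and the number of records in each one.
val numPartitions = pairs.partitions.length
val sizes = pairs.mapPartitions(it => Iterator(it.size)).collect()
println(s"$numPartitions partitions; min=${sizes.min}, max=${sizes.max} records")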
We have a similar data volume currently running well on Hadoop MapReduce with roughly 30 nodes.
I was planning to test it with Spark.
I'm very interested in your findings.
-
Madhu
https://www.linkedin.com/in/msiddalingaiah