Re: Configuring Spark for reduceByKey on massive data sets

2015-10-12 Thread hotdog
Hi Daniel,
Did you solve your problem?
I ran into the same problem when running reduceByKey on massive data sets on YARN.






Re: Configuring Spark for reduceByKey on massive data sets

2014-05-18 Thread lukas nalezenec
Hi
Try using *reduceByKeyLocally*.
Regards
Lukas Nalezenec
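
A minimal sketch of the difference, using a toy pair RDD (the names and values are illustrative): reduceByKeyLocally merges the per-partition maps and hands back an ordinary Map on the driver instead of a distributed RDD, so it only helps when the number of distinct keys is small enough to fit in driver memory.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("reduceByKeyLocally-sketch").setMaster("local[2]"))  // local mode only for the sketch

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Distributed aggregation: shuffles and returns an RDD[(String, Int)].
val distributed = pairs.reduceByKey(_ + _)

// Local aggregation: returns a scala.collection.Map[String, Int] on the
// driver -- only safe when the distinct keys fit in driver memory.
val onDriver = pairs.reduceByKeyLocally(_ + _)

println(onDriver)   // e.g. Map(a -> 4, b -> 2)
sc.stop()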


On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Make sure you set up enough reduce partitions so you don’t overload them.
 Another thing that may help is checking whether you’ve run out of local
 disk space on the machines, and turning on spark.shuffle.consolidateFiles
 to produce fewer files. Finally, there’s been a recent fix in both branch
 0.9 and master that reduces the amount of memory used when there are small
 files (due to extra memory that was being taken by mmap()):
 https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
 either the 1.0 release candidates on the dev list, or branch-0.9 in git.

 Matei
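
To make those two settings concrete, a minimal sketch (the input/output paths and the partition count of 2000 are placeholders, not recommendations; spark.shuffle.consolidateFiles is a 0.9/1.x-era setting):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduceByKey-tuning-sketch")
  // Produce fewer, larger shuffle files per core (0.9/1.x-era setting).
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///path/to/input")          // placeholder path
  .map(line => (line.split('\t')(0), 1L))

// Pass an explicit reduce-side partition count so each reduce task stays small.
val counts = pairs.reduceByKey(_ + _, 2000)
counts.saveAsTextFile("hdfs:///path/to/output")           // placeholder path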

 On May 17, 2014, at 5:45 PM, Madhu ma...@madhu.com wrote:

  Daniel,
 
  How many partitions do you have?
  Are they more or less uniformly distributed?
  We have similar data volume currently running well on Hadoop MapReduce
 with
  roughly 30 nodes.
  I was planning to test it with Spark.
  I'm very interested in your findings.
 
 
 
  -
  Madhu
  https://www.linkedin.com/in/msiddalingaiah




Re: Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Madhu
Daniel,

How many partitions do you have?
Are they more or less uniformly distributed?
We have similar data volume currently running well on Hadoop MapReduce with
roughly 30 nodes. 
I was planning to test it with Spark. 
I'm very interested in your findings. 



-
Madhu
https://www.linkedin.com/in/msiddalingaiah
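
To answer those two questions on a real data set, a small self-contained sketch (the tiny parallelized collection stands in for whatever pair RDD feeds the reduceByKey):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("partition-skew-check").setMaster("local[4]"))  // local mode only for the sketch

// Placeholder data; in practice this would be the real input RDD.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)

println(s"${pairs.partitions.length} partitions")

// Per-partition record counts make an uneven key/record distribution visible.
pairs
  .mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
  .collect()
  .sortBy(_._2)
  .foreach { case (i, n) => println(s"partition $i: $n records") }

sc.stop()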