I'm using a manual installation of Spark on YARN to run a 30-node r3.8xlarge EC2 cluster (each node has 244 GB RAM and 600 GB of SSD). All my code runs much faster on a cluster launched with the spark-ec2 script, but there's a mysterious problem with nodes becoming inaccessible, so I switched to Spark on YARN, figuring that YARN wouldn't let Spark eat up all the resources and render a machine unreachable. So far, that seems to be the case. Now my code runs to completion, but much more slowly, so I'm wondering how to tune my Spark-on-YARN installation to make it as fast as the standalone Spark install.
The code I'm currently interested in speeding up just loads a dense 1 TB matrix from Parquet format and then computes a low-rank approximation of it, essentially by doing a series of distributed matrix multiplies. Previously the job completed in half an hour, from loading the data to writing the output; now I expect it to take four or so hours. My spark-submit options are:

  --master yarn \
  --num-executors 29 \
  --driver-memory 180G \
  --executor-memory 180G \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=$LOGDIR \
  --conf spark.driver.maxResultSize=50G \
  --conf spark.task.maxFailures=4 \
  --conf spark.worker.timeout=1200000 \
  --conf spark.network.timeout=1200000

The huge timeouts were necessary on EC2 standalone to avoid losing executors; I'm not sure they've remained necessary after switching to YARN. My yarn-site.xml has the following settings:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>236000</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>59000</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>220000</value>
  </property>

Any suggestions?
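In case it helps to see the shape of the computation, the core of the job looks roughly like this. This is only a minimal sketch, not my actual code: the Parquet path, the array<double> column layout, and the randomized-projection step shown here are illustrative assumptions, but the distributed RowMatrix multiply is the kind of operation that dominates the run time.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object LowRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("low-rank-approx"))
    val sqlContext = new SQLContext(sc)

    // One Parquet record per matrix row, with the entries in an
    // array<double> column named "values" (layout and path are assumptions).
    val rows = sqlContext.read.parquet("hdfs:///data/matrix.parquet")
      .select("values")
      .rdd
      .map(r => Vectors.dense(r.getSeq[Double](0).toArray))
      .cache()

    val mat = new RowMatrix(rows)

    // One pass of a randomized range finder: multiply the m x n matrix by a
    // small n x k Gaussian test matrix; this distributed multiply is the
    // expensive step.
    val k = 20 // target rank, illustrative
    val n = mat.numCols().toInt
    val omega = DenseMatrix.randn(n, k, new java.util.Random(42))
    val y: RowMatrix = mat.multiply(omega) // distributed multiply, result is m x k

    // ... remaining steps (orthonormalize Y, project, small local SVD) omitted.
    sc.stop()
  }
}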