I have a few questions about yarn-standalone and yarn-client deployment modes that are described on the Launching Spark on YARN <http://spark.incubator.apache.org/docs/latest/running-on-yarn.html> page.

1) Can someone give me a basic conceptual overview? I am struggling to understand the difference between the yarn-standalone and yarn-client deployment modes. I understand that yarn-standalone runs on the name node and that yarn-client can be run from a remote machine, but otherwise I don't see how they differ. yarn-client seems like the obviously better approach because it can run from anywhere, yet presumably there is some advantage to yarn-standalone (otherwise, why not just run yarn-client on the name node or from a remote machine?). I'm also curious what "standalone" refers to here.
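
For concreteness, the only code-level difference I can find is the master string handed to SparkContext, along the lines of this sketch (this is just my reading of the 0.8.x examples; ModeDemo is a made-up name, and I may have the semantics wrong):

    import org.apache.spark.SparkContext

    object ModeDemo {
      def main(args: Array[String]) {
        // Pass "yarn-standalone" or "yarn-client" as the first argument.
        // - "yarn-standalone": the app is submitted through
        //   org.apache.spark.deploy.yarn.Client and, as I understand it,
        //   the driver runs inside the YARN ApplicationMaster on the cluster.
        // - "yarn-client": the driver runs in this local JVM, and only the
        //   executors run in YARN containers.
        val master = if (args.length > 0) args(0) else "yarn-client"
        val sc = new SparkContext(master, "ModeDemo")
        println("Default parallelism: " + sc.defaultParallelism)
        sc.stop()
      }
    }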

2) I was able to run SparkPi in yarn-client mode from a simple Scala main method by providing only the SPARK_JAR and SPARK_YARN_APP_JAR environment variables and by putting the various *-site.xml files on my classpath. That is, I didn't call run-example; I just called my Scala app directly (a rough sketch of what worked is included below). We've had trouble duplicating this success in our own app and are in the process of applying the patch detailed here:

https://github.com/apache/incubator-spark/pull/371
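
For reference, here is roughly the shape of the SparkPi-style main that did work for me (trimmed down; YarnClientPi is a placeholder name, and it assumes SPARK_JAR and SPARK_YARN_APP_JAR are exported and the Hadoop *-site.xml files are on the classpath so that "yarn-client" can find the ResourceManager):

    import org.apache.spark.SparkContext

    object YarnClientPi {
      def main(args: Array[String]) {
        // "yarn-client" as the master string: the driver runs here,
        // executors are requested from YARN.
        val sc = new SparkContext("yarn-client", "YarnClientPi")
        val n = 100000
        // Monte Carlo estimate of pi: count random points inside the
        // unit circle.
        val inside = sc.parallelize(1 to n).map { _ =>
          val x = math.random * 2 - 1
          val y = math.random * 2 - 1
          if (x * x + y * y < 1) 1 else 0
        }.reduce(_ + _)
        println("Pi is roughly " + 4.0 * inside / n)
        sc.stop()
      }
    }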

However, one thing I think I learned is that Spark doesn't have to be installed on the name node. Is that correct? Do I need to have Spark installed at all, either on my remote machine or on the name node? It would be great if all that was needed were the SPARK_JAR and the SPARK_YARN_APP_JAR.

3) Finally, is it possible to pre-stage the assembly jar files so they don't need to be copied over every time I start a new Spark job in yarn-client mode? Any advice here is appreciated.

Thanks!
Philip
