Hi, I have explained this in my LinkedIn article "The Operational Advantages of Spark as a Distributed Processing Framework" <https://www.linkedin.com/pulse/operational-advantages-spark-distributed-processing-mich/>
An extract:

2) YARN Deployment Modes

The term "deployment mode" in Spark simply means where the driver program will be run. There are two modes, namely Spark Client Mode <https://spark.apache.org/docs/latest/running-on-yarn.html> and Spark Cluster Mode <https://spark.apache.org/docs/latest/cluster-overview.html>. These are described below:

In client mode, the driver daemon runs on the node from which you submit the Spark job to your cluster, often the edge node. This mode is valuable when you want to use Spark interactively, as in our case where we would like to display high-value prices in the dashboard. In client mode you do not need to reserve any resources from your cluster for the driver daemon, since it runs outside the cluster.

In cluster mode, you submit the Spark job to your cluster and the driver daemon runs inside your cluster, in the application master. In this mode you do not get to use the Spark job interactively, as the client through which you submit the job is gone as soon as it has successfully submitted the job to the cluster. You will have to reserve some resources for the driver daemon process, as it will be running in your cluster.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Sat, 23 Mar 2019 at 21:13, Pat Ferrel <p...@occamsmachete.com> wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on
> "http://master-address:8080", there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*; I start the Job programmatically, and
> this is where many explanations fork from my question.
>
> In "my-app" I create a new SparkConf with the following code (slightly
> abbreviated):
>
> conf.setAppName("my-job")
> conf.setMaster("spark://master-address:7077")
> conf.set("deployMode", "cluster")
> // other settings like driver and executor memory requests
> // the driver and executor memory requests are for all mem on the slaves,
> // more than mem available on the launching machine with "my-app"
> val jars = listJars("/path/to/lib")
> conf.setJars(jars)
> …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don't see it in the GUI; I see 2 Executors
> taking all cluster resources. With a YARN cluster I would expect the
> Driver to run on/in the YARN Master, but I am using the Spark Standalone
> Master, so where is the Driver part of the Job running?
>
> If it is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers, sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The "Driver" creates and broadcasts some large data structures, so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for your help!
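To make the two modes concrete in code: with spark-submit, the mode is chosen by the --deploy-mode flag (client or cluster). Set programmatically, the documented configuration key is spark.submit.deployMode; a bare "deployMode" key, as in the quoted snippet, is not a property Spark reads. A minimal sketch (the master address and app name are placeholders, and spark-core is assumed to be on the classpath):

```scala
import org.apache.spark.SparkConf

// Placeholder master address and app name.
// "spark.submit.deployMode" is the documented configuration key;
// an unprefixed "deployMode" key is not a property Spark reads.
val conf = new SparkConf()
  .setAppName("my-job")
  .setMaster("spark://master-address:7077")
  .set("spark.submit.deployMode", "cluster")
```

Note that cluster deploy mode is normally arranged by the spark-submit launcher; when an application constructs its SparkContext directly, the driver is that application's own JVM process.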