Re: Where does the Driver run?

2019-03-29 Thread ayan guha
Have you tried Apache Livy?


Re: Where does the Driver run?

2019-03-29 Thread Jianneng Li
Hi Pat,

Now that I understand your terminology better, the method I described was 
actually closer to spark-submit than what you referred to as 
"programmatically". You want to have SparkContext running in the launcher 
program, and also the driver somehow running on the cluster, and unfortunately 
I don't think you can do that.

So yes, it does look like you need to refactor. If you need to actively use 
SparkContext to submit more jobs after the Spark application has started, you 
can write a custom Spark driver that, for example, runs an HTTP server that 
receives requests and calls the SparkContext accordingly.
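To make that suggestion concrete, here is a minimal sketch of such a long-running driver, using only the JDK's built-in HTTP server. Everything here is hypothetical (the `JobServer` object, the `runJob` callback, the port); in a real driver, `runJob` would close over a SparkContext created in the same JVM, e.g. `sc.parallelize(1 to 100).count().toString`.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

// Hypothetical long-running driver process: an embedded HTTP endpoint that
// triggers jobs via a callback. In a real driver, the callback would call
// into a SparkContext living in this same JVM.
object JobServer {
  def start(port: Int, runJob: String => String): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/job", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // The query string stands in for real job parameters.
        val result = runJob(exchange.getRequestURI.getQuery)
        val bytes  = result.getBytes("UTF-8")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.close()
      }
    })
    server.start()
    server
  }
}
```

Started from `main` alongside the SparkContext, this keeps the driver alive between jobs, which is the price of not having spark-submit's cluster mode available.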

Best,

Jianneng




Re: Where does the Driver run?

2019-03-28 Thread Pat Ferrel
Thanks for the pointers. We’ll investigate.

We have been told that the “Driver” is run in the launching JVM because
deployMode = cluster is ignored if spark-submit is not used to launch.

You are saying that there is a loophole and if you use one of these client
classes there is a way to run part of the app on the cluster, and you have
seen this for Yarn?

To explain more, we create a SparkConf, and then a SparkContext, which we
pass around implicitly to functions that I would define as the Spark
Driver. It seems that if you do not use spark-submit, the entire launching
app/JVM process is considered the Driver AND is always run in client mode.

I hope your loophole pays off or we will have to do a major refactoring.



Re: Where does the Driver run?

2019-03-28 Thread Mich Talebzadeh
Hi,

I have explained this in my LinkedIn article "The Operational
Advantages of Spark as a Distributed Processing Framework".

An extract:

*2) YARN Deployment Modes*

The term *deployment mode* of Spark simply means "where the driver program
will be run". There are two modes, namely *Spark Client Mode* and *Spark
Cluster Mode*. These are described below:

*In Client mode,* the driver daemon runs in the node through which you
submit the Spark job to your cluster. This is often done through the edge
node. This mode is valuable when you want to use Spark interactively, as
in our case where we would like to display high-value prices in the
dashboard. In Client mode you do not need to reserve any resources from
your cluster for the driver daemon.

*In Cluster mode,* you submit the Spark job to your cluster and the driver
daemon runs inside your cluster, in the application master. In this mode you
do not get to use the Spark job interactively, as the client through which
you submit the job is gone as soon as it successfully submits the job to
the cluster. You will have to reserve some resources for the driver daemon
process, as it will be running in your cluster.
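In spark-submit terms, the two modes above are selected with the `--deploy-mode` flag; a minimal illustration (the master, class name, and jar are placeholders):

```shell
# Client mode: the driver runs in the spark-submit JVM on the edge node
spark-submit --master yarn --deploy-mode client \
  --class com.example.TrainModel app.jar

# Cluster mode: the driver runs inside the cluster (in the YARN application master)
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.TrainModel app.jar
```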

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 23 Mar 2019 at 21:13, Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on
> "http://master-address:8080", there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*; I start the Job programmatically, and
> this is where many explanations diverge from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName("my-job")
>   conf.setMaster("spark://master-address:7077")
>   conf.set("deployMode", "cluster")
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with "my-app"
>   val jars = listJars("/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don't see it in the GUI; I see 2 Executors
> taking all cluster resources. With a YARN cluster I would expect the
> "Driver" to run on/in the YARN Master, but I am using the Spark Standalone
> Master, so where is the Driver part of the Job running?
>
> If it is running in the Master, we are in trouble, because I start the
> Master on one of my 2 Workers, sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The "Driver" creates and broadcasts some large data structures, so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for your help!
>
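One note on the quoted configuration: the key Spark actually reads for deploy mode is `spark.submit.deployMode`; a bare `"deployMode"` key is not a Spark property and is ignored. Even with the documented key, though, constructing the SparkContext in your own process keeps the driver in that JVM. A sketch of the corrected settings (master address and app name copied from the quoted code):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-job")
  .setMaster("spark://master-address:7077")
  // The documented key; plain "deployMode" is not a Spark property.
  .set("spark.submit.deployMode", "cluster")
```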


Re: Where does the Driver run?

2019-03-28 Thread Jianneng Li
Hi Pat,

The driver runs in the same JVM as SparkContext. You didn't go into detail 
about how you "launch" the job (i.e. how the SparkContext is created), so it's 
hard for me to guess where the driver is.

For reference, we've had success launching Spark programmatically to YARN in 
cluster mode by creating a SparkConf like you did and using it to call this 
class: 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

I haven't tried this myself, but for standalone mode you might be able to use 
this: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala
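For any master type, the public programmatic route is to have a small launcher process invoke spark-submit with `--deploy-mode cluster`; `org.apache.spark.launcher.SparkLauncher` (in the spark-launcher module) automates exactly that. A dependency-free sketch of the idea, where the Spark home, class name, and jar path are placeholders:

```scala
object SubmitCmd {
  // Hypothetical helper: build (but do not run) the spark-submit invocation
  // that launches an application in cluster mode. This is essentially what
  // org.apache.spark.launcher.SparkLauncher assembles for you.
  def clusterSubmitCommand(sparkHome: String, master: String,
                           mainClass: String, appJar: String): Seq[String] =
    Seq(
      s"$sparkHome/bin/spark-submit",
      "--master", master,
      "--deploy-mode", "cluster", // driver is started inside the cluster
      "--class", mainClass,
      appJar
    )

  // To actually launch, hand the command to a ProcessBuilder:
  //   new ProcessBuilder(clusterSubmitCommand(...): _*).inheritIO().start()
  // or use SparkLauncher directly for the same effect.
}
```

The point is that the launching JVM only spawns the submission; the driver JVM is then created by the cluster manager, which is what deployMode = cluster requires.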

Lastly, you can always check where Spark processes run by executing ps on the 
machine, e.g. `ps aux | grep java`.

Best,

Jianneng




Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
I’m beginning to agree with you, and find it rather surprising that this is
mentioned nowhere explicitly (maybe I missed it?). It is possible to serialize
code to be executed by executors on various nodes. It also seems possible
to serialize the “driver” bits of code, although I’m not sure how the
boundary would be defined. All code is in the jars we pass to Spark, so
until now I did not question the docs.

I see no mention of a distinction between running a driver in spark-submit
vs being programmatically launched for any of the Spark Master types:
Standalone, Yarn, Mesos, k8s.

We are building a Machine Learning Server in OSS. It has pluggable Engines
for different algorithms. Some of these use Spark, so it is highly desirable
to offload driver code to the cluster, since we don’t want the driver
embedded in the Server process. The Driver portion of our training workflow
could be very large indeed, and so could force worst-case scaling of the
server.

I hope someone knows how to run “Driver” code on the cluster when our
server is launching the code. So deployMode = cluster, deploy method =
programmatic launch.



Re: Where does the Driver run?

2019-03-25 Thread Andrew Melo
Hi Pat,

Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible with spark-shell.
But it makes sense to me -- if you want Spark to run your application's
driver, you need to package it up and send it to the cluster manager. You
can't start spark one place and then later migrate it to the cluster. It's
also why you can't use spark-shell in cluster mode either, I think.

Cheers
Andrew


Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
In the GUI, while the job is running, the app-id link brings up logs for both
executors. The “name” link goes to port 4040 of the machine that launched the
job, but that host is not resolvable right now so the page is not shown. I’ll
try netstat; the use of port 4040 was a good clue.

By what you say below, this indicates the Driver is running on the launching
machine, the client to the Spark cluster. That is what I would expect in
deployMode = client.

Can someone explain what is going on? The evidence seems to say that
deployMode = cluster *does not work* as described unless you use
spark-submit (and I’m only guessing at that).

Further: if we don’t use spark-submit, we can’t use deployMode = cluster???
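For completeness: the one programmatic route that does honor cluster deploy mode against a standalone master is the SparkLauncher API (org.apache.spark.launcher), which drives the same submission machinery as spark-submit from inside a JVM. A minimal sketch; the jar path and class name are hypothetical placeholders:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchMyJob {
  def main(args: Array[String]): Unit = {
    // Hypothetical jar path and main class; substitute your own.
    // Because deploy mode is set through the launcher, the driver is
    // started on a Worker in the cluster, not in this launching JVM.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")      // jar containing the driver code
      .setMainClass("com.example.MyJob")          // its main() becomes the driver
      .setMaster("spark://master-address:7077")
      .setDeployMode("cluster")
      .setConf(SparkLauncher.DRIVER_MEMORY, "4g") // illustrative value
      .startApplication()                          // non-blocking; returns a handle

    // Poll the handle until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}
```

The trade-off is the one discussed in this thread: the SparkContext then lives in the remote driver, so the launching process can only monitor the application through the handle, not submit further work through a local context.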


From: Akhil Das  
Reply: Akhil Das  
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?

There's also a driver UI (usually available on port 4040). After running
your code (I assume you are running it on your machine), visit
localhost:4040 and you will see the driver UI.

If you think the driver is running on your master/executor nodes, login to
those machines and do a

   netstat -napt | grep -I listen

You will see the driver listening on 404x there. That likely won't be the
case here, as you are not using spark-submit or deployMode=cluster.

On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations fork from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>


Re: Where does the Driver run?

2019-03-24 Thread Akhil Das
There's also a driver UI (usually available on port 4040). After running
your code (I assume you are running it on your machine), visit
localhost:4040 and you will see the driver UI.

If you think the driver is running on your master/executor nodes, login to
those machines and do a

   netstat -napt | grep -I listen

You will see the driver listening on 404x there. That likely won't be the
case here, as you are not using spark-submit or deployMode=cluster.

On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations fork from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>


Re: Where does the Driver run?

2019-03-24 Thread Arko Provo Mukherjee
Hello,

Is spark.driver.memory per Job or shared across jobs? You should do load
testing before setting this.

Thanks & regards
Arko
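To the question above: spark.driver.memory is a per-application setting. Each submitted application gets exactly one driver JVM, sized by its own value of this property, and the value must be in place before that JVM starts. A sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

// One driver JVM per application: two applications submitted with
// different values each get their own driver heap; nothing is shared.
// Setting spark.driver.memory on a SparkConf after the driver JVM is
// already running has no effect, which is why it is normally passed
// to spark-submit (or SparkLauncher) rather than set in code.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.driver.memory", "4g")   // illustrative; size from load testing
  .set("spark.executor.memory", "4g") // per executor, also per application
```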


On Sun, Mar 24, 2019 at 3:09 PM Pat Ferrel  wrote:

>
> 2 Slaves, one of which is also Master.
>
> Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.
>
> The machines both have 60g of free memory (leaving about 4g for the master
> process on Node 1). The only constraint to the Driver and Executors is
> spark.driver.memory = spark.executor.memory = 60g
>
> BTW I would expect this to create one Executor, one Driver, and the Master
> on 2 Workers.
>
>
>
>
> From: Andrew Melo  
> Reply: Andrew Melo  
> Date: March 24, 2019 at 12:46:35 PM
> To: Pat Ferrel  
> Cc: Akhil Das  , user
>  
> Subject:  Re: Where does the Driver run?
>
> Hi Pat,
>
> On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:
>
>> Thanks, I have seen this many times in my research. Paraphrasing docs:
>> “in deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>>
>> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
>> with addresses that match slaves). When I look at memory usage while the
>> job runs I see virtually identical usage on the 2 Workers. This would
>> support your claim and contradict Spark docs for deployMode = cluster.
>>
>> The evidence seems to contradict the docs. I am now beginning to wonder
>> if the Driver only runs in the cluster if we use spark-submit
>>
>
> Where/how are you starting "./sbin/start-master.sh"?
>
> Cheers
> Andrew
>
>
>>
>>
>>
>> From: Akhil Das  
>> Reply: Akhil Das  
>> Date: March 23, 2019 at 9:26:50 PM
>> To: Pat Ferrel  
>> Cc: user  
>> Subject:  Re: Where does the Driver run?
>>
>> If you are starting your "my-app" on your local machine, that's where the
>> driver is running.
>>
>> [image: image.png]
>>
>> Hope this helps.
>> <https://spark.apache.org/docs/latest/cluster-overview.html>
>>
>> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>>
>>> I have researched this for a significant amount of time and find answers
>>> that seem to be for a slightly different question than mine.
>>>
>>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>>> http://master-address:8080”, there are 2 idle workers, as configured.
>>>
>>> I have a Scala application that creates a context and starts execution
>>> of a Job. I *do not use spark-submit*, I start the Job programmatically and
>>> this is where many explanations fork from my question.
>>>
>>> In "my-app" I create a new SparkConf, with the following code (slightly
>>> abbreviated):
>>>
>>>   conf.setAppName(“my-job")
>>>   conf.setMaster(“spark://master-address:7077”)
>>>   conf.set(“deployMode”, “cluster”)
>>>   // other settings like driver and executor memory requests
>>>   // the driver and executor memory requests are for all mem on the
>>> slaves, more than
>>>   // mem available on the launching machine with “my-app"
>>>   val jars = listJars(“/path/to/lib")
>>>   conf.setJars(jars)
>>>   …
>>>
>>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>>> Everything seems to run fine and sometimes completes successfully. Frequent
>>> failures are the reason for this question.
>>>
>>> Where is the Driver running? I don’t see it in the GUI, I see 2
>>> Executors taking all cluster resources. With a Yarn cluster I would expect
>>> the “Driver" to run on/in the Yarn Master but I am using the Spark
>>> Standalone Master, where is the Driver part of the Job running?
>>>
>>> If it is running in the Master, we are in trouble because I start the
>>> Master on one of my 2 Workers sharing resources with one of the Executors.
>>> Executor mem + driver mem is > available mem on a Worker. I can change this
>>> but need to understand where the Driver part of the Spark Job runs. Is it
>>> in the Spark Master, or inside an Executor, or ???
>>>
>>> The “Driver” creates and broadcasts some large data structures so the
>>> need for an answer is more critical than with more typical tiny Drivers.
>>>
>>> Thanks for your help!
>>>
>>
>>
>> --
>> Cheers!
>>
>>




Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
2 Slaves, one of which is also Master.

Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.

The machines both have 60g of free memory (leaving about 4g for the master
process on Node 1). The only constraint to the Driver and Executors is
spark.driver.memory = spark.executor.memory = 60g

BTW I would expect this to create one Executor, one Driver, and the Master
on 2 Workers.




From: Andrew Melo  
Reply: Andrew Melo  
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations fork from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>




Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
2 Slaves, one of which is also Master.

Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.

The machines both have 60g of free memory (leaving about 4g for the master
process on Node 1). The only constraint to the Driver and Executors is
spark.driver.memory = spark.executor.memory = 60g


From: Andrew Melo  
Reply: Andrew Melo  
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations fork from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>




Re: Where does the Driver run?

2019-03-24 Thread Andrew Melo
Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations fork from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>


Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
Thanks, I have seen this many times in my research. Paraphrasing docs: “in
deployMode ‘cluster’ the Driver runs on a Worker in the cluster”

When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
with addresses that match slaves). When I look at memory usage while the
job runs I see virtually identical usage on the 2 Workers. This would
support your claim and contradict Spark docs for deployMode = cluster.

The evidence seems to contradict the docs. I am now beginning to wonder if
the Driver only runs in the cluster if we use spark-submit



From: Akhil Das  
Reply: Akhil Das  
Date: March 23, 2019 at 9:26:50 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?

If you are starting your "my-app" on your local machine, that's where the
driver is running.

[image: image.png]

Hope this helps.
<https://spark.apache.org/docs/latest/cluster-overview.html>

On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on “
> http://master-address:8080”, there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*, I start the Job programmatically and
> this is where many explanations fork from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName(“my-job")
>   conf.setMaster(“spark://master-address:7077”)
>   conf.set(“deployMode”, “cluster”)
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with “my-app"
>   val jars = listJars(“/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
> taking all cluster resources. With a Yarn cluster I would expect the
> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
> Master, where is the Driver part of the Job running?
>
> If it is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The “Driver” creates and broadcasts some large data structures so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for your help!
>


--
Cheers!




Re: Where does the Driver run?

2019-03-23 Thread Akhil Das
If you are starting your "my-app" on your local machine, that's where the
driver is running.

[image: image.png]

Hope this helps.


On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on “
> http://master-address:8080”, there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*, I start the Job programmatically and
> this is where many explanations fork from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName(“my-job")
>   conf.setMaster(“spark://master-address:7077”)
>   conf.set(“deployMode”, “cluster”)
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with “my-app"
>   val jars = listJars(“/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
> taking all cluster resources. With a Yarn cluster I would expect the
> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
> Master, where is the Driver part of the Job running?
>
> If it is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The “Driver” creates and broadcasts some large data structures so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for your help!
>


-- 
Cheers!


Where does the Driver run?

2019-03-23 Thread Pat Ferrel
I have researched this for a significant amount of time and find answers
that seem to be for a slightly different question than mine.

The Spark 2.3.3 cluster is running fine. I see the GUI on “
http://master-address:8080”, there are 2 idle workers, as configured.

I have a Scala application that creates a context and starts execution of a
Job. I *do not use spark-submit*, I start the Job programmatically and this
is where many explanations fork from my question.

In "my-app" I create a new SparkConf, with the following code (slightly
abbreviated):

  conf.setAppName(“my-job")
  conf.setMaster(“spark://master-address:7077”)
  conf.set(“deployMode”, “cluster”)
  // other settings like driver and executor memory requests
  // the driver and executor memory requests are for all mem on the
slaves, more than
  // mem available on the launching machine with “my-app"
  val jars = listJars(“/path/to/lib")
  conf.setJars(jars)
  …

When I launch the job I see 2 executors running on the 2 workers/slaves.
Everything seems to run fine and sometimes completes successfully. Frequent
failures are the reason for this question.

Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
taking all cluster resources. With a Yarn cluster I would expect the
“Driver" to run on/in the Yarn Master but I am using the Spark Standalone
Master, where is the Driver part of the Job running?

If it is running in the Master, we are in trouble because I start the
Master on one of my 2 Workers sharing resources with one of the Executors.
Executor mem + driver mem is > available mem on a Worker. I can change this
but need to understand where the Driver part of the Spark Job runs. Is it
in the Spark Master, or inside an Executor, or ???

The “Driver” creates and broadcasts some large data structures so the need
for an answer is more critical than with more typical tiny Drivers.

Thanks for your help!
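A footnote on the config snippet in this message, assuming Spark 2.3.x behavior: "deployMode" is not a recognized Spark property name (the real key is spark.submit.deployMode), and even the real key is only read by the spark-submit/SparkLauncher machinery. Constructing a SparkContext directly, as "my-app" does, always places the driver in the constructing JVM, which is client mode in everything but name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The recognized key is "spark.submit.deployMode", not "deployMode".
// An in-process SparkContext ignores it, so the driver below runs in
// this JVM regardless of the setting.
val conf = new SparkConf()
  .setAppName("my-job")
  .setMaster("spark://master-address:7077")
  .set("spark.submit.deployMode", "cluster") // no effect without spark-submit
val sc = new SparkContext(conf) // driver (scheduler, broadcasts) lives HERE
```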