Re: Running Spark shell on YARN

2014-08-16 Thread Soumya Simanta
I followed this thread

http://apache-spark-user-list.1001560.n3.nabble.com/YARN-issues-with-resourcemanager-scheduler-address-td5201.html#a5258

to set SPARK_YARN_USER_ENV so that HADOOP_CONF_DIR is on the classpath:
export SPARK_YARN_USER_ENV="CLASSPATH=$HADOOP_CONF_DIR"

and used the following command to distribute the conf directory files to all machines:

export SPARK_YARN_DIST_FILES=$(ls $HADOOP_CONF_DIR* | sed 's#^#file://#g' | tr '\n' ',')
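
Note that ls output is not a clean file list: when the glob expands to
several operands that include a directory, ls can print "dirname:" header
lines and blank group separators, and tr '\n' ',' leaves a trailing comma,
so malformed entries can sneak into the list. (The parser below in fact
rejects the bare token "conf:", a scheme with an empty scheme-specific
part.) A variant built on find, as a sketch assuming $HADOOP_CONF_DIR is a
flat directory of config files, not a command from the referenced thread:

# find prints one plain file path per line, so no "dir:" headers, blank
# lines, or trailing comma ends up in the comma-separated list.
export SPARK_YARN_DIST_FILES=$(find "$HADOOP_CONF_DIR" -maxdepth 1 -type f \
  | sed 's#^#file://#' | paste -sd, -)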

and then I used the following command to start spark-shell

./spark-shell --master yarn-client --executor-memory 32g

This time I didn't get the repeated "14/08/15 15:44:51 INFO
cluster.YarnClientSchedulerBackend: Application report from ASM:" messages,
but a new exception came up instead (java.net.URISyntaxException, see
below). Any idea why this is happening?
Also, although I see the REPL prompt, sc is not available in the REPL.

14/08/16 02:27:52 INFO yarn.Client: Uploading
file:/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/lib/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
to
hdfs://n001-10ge1:8020/user/ssimanta/.sparkStaging/application_1408130563059_0011/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar

*java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected
scheme-specific part at index 5: conf:*

at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:161)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:238)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:233)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:233)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:231)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:231)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:318)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
at $iwC$$iwC.<init>(<console>:8)
at $iwC.<init>(<console>:14)
at <init>(<console>:16)
at .<init>(<console>:20)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.Spa

Re: Running Spark shell on YARN

2014-08-16 Thread Eric Friedman
+1 for such a document. 


Eric Friedman

> On Aug 15, 2014, at 1:10 PM, Kevin Markey  wrote:
> 
> Sandy and others:
> 
> Is there a single source of Yarn/Hadoop properties that should be set or 
> reset for running Spark on Yarn?
> We've sort of stumbled through one property after another, and (unless 
> there's an update I've not yet seen) CDH5 Spark-related properties are for 
> running the Spark Master instead of Yarn.
> 
> Thanks
> Kevin
> 
>> On 08/15/2014 12:47 PM, Sandy Ryza wrote:
>> We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
>> maximum node capacity.
>> 
>> -Sandy
>> 
>> 
>>> On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta  
>>> wrote:
>>> I just checked the YARN config and it looks like I need to change this value. 
>>> Should it be upgraded to 48G (the max memory allocated to YARN) per node? 
>>> 
>>> 
>>> <property>
>>>   <name>yarn.scheduler.maximum-allocation-mb</name>
>>>   <value>6144</value>
>>>   <source>java.io.BufferedInputStream@2e7e1ee</source>
>>> </property>
>>> 
>>> 
>>> 
 On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta  
 wrote:
 Andrew, 
 
 Thanks for your response. 
 
 When I try to do the following. 
  ./spark-shell --executor-memory 46g --master yarn
 
 I get the following error. 
 
 Exception in thread "main" java.lang.Exception: When running with master 
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the 
 environment.
 
 at 
 org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
 
 at 
 org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
 
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
 
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 After this I set the following env variable. 
 
 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/
 
 The program launches but then halts with the following error. 
 
 
 14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), 
 is above the max threshold (6144 MB) of this cluster.
 
 I guess this is some YARN setting that is not set correctly. 
 
 Thanks
 
 -Soumya
 
 
 
> On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or  wrote:
> Hi Soumya,
> 
> The driver's console output prints out how much memory is actually 
> granted to each executor, so from there you can verify how much memory 
> the executors are actually getting. You should use the 
> '--executor-memory' argument in spark-shell. For instance, assuming each 
> node has 48G of memory,
> 
> bin/spark-shell --executor-memory 46g --master yarn
> 
> We leave a small cushion for the OS so we don't take up all of the entire 
> system's memory. This option also applies to the standalone mode you've 
> been using, but if you have been using the ec2 scripts, we set 
> "spark.executor.memory" in conf/spark-defaults.conf for you automatically 
> so you don't have to specify it each time on the command line. Of course, 
> you can also do the same in YARN.
> 
> -Andrew
> 
> 
> 
> 2014-08-15 10:45 GMT-07:00 Soumya Simanta :
> 
>> I've been using the standalone cluster all this time and it worked fine. 
>> Recently I'm using another Spark cluster that is based on YARN, and I have 
>> no experience with YARN. 
>> 
>> The YARN cluster has 10 nodes and a total memory of 480G. 
>> 
>> I'm having trouble starting the spark-shell with enough memory. 
>> I'm doing a very simple operation - reading a 100GB file from HDFS and 
>> running a count on it. This fails due to out of memory on the executors. 
>> 
>> Can someone point to the command line parameters that I should use for 
>> spark-shell so that it works?
>> 
>> 
>> Thanks
>> -Soumya
> 


Re: Running Spark shell on YARN

2014-08-15 Thread Kevin Markey
Sandy and others:

Is there a single source of Yarn/Hadoop properties that should be set or
reset for running Spark on Yarn?
We've sort of stumbled through one property after another, and (unless
there's an update I've not yet seen) CDH5 Spark-related properties are for
running the Spark Master instead of Yarn.

Thanks
Kevin

On 08/15/2014 12:47 PM, Sandy Ryza wrote:

> We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
> maximum node capacity.
>
> -Sandy
>
>
> On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta wrote:
>
>> I just checked the YARN config and it looks like I need to change this
>> value. Should it be upgraded to 48G (the max memory allocated to YARN) per
>> node?
>>
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-mb</name>
>>   <value>6144</value>
>>   <source>java.io.BufferedInputStream@2e7e1ee</source>
>> </property>
>>
>>
>> On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta wrote:
>>
>>> Andrew,
>>>
>>> Thanks for your response.
>>>
>>> When I try to do the following.
>>>
>>>  ./spark-shell --executor-memory 46g --master yarn
>>>
>>> I get the following error.
>>>
>>> Exception in thread "main" java.lang.Exception: When running with master
>>> 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
>>> environment.
>>>
>>> at
>>> org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
>>>
>>> at
>>> org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
>>>
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> After this I set the following env variable.
>>>
>>> export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/
>>>
>>> The program launches but then halts with the following error.
>>>
>>> 14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB),
>>> is above the max threshold (6144 MB) of this cluster.
>>>
>>> I guess this is some YARN setting that is not set correctly.
>>>
>>> Thanks
>>>
>>> -Soumya
>>>
>>>
>>> On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or wrote:
>>>
>>>> Hi Soumya,
>>>>
>>>> The driver's console output prints out how much memory is actually
>>>> granted to each executor, so from there you can verify how much memory
>>>> the executors are actually getting. You should use the '--executor-memory'
>>>> argument in spark-shell. For instance, assuming each node has 48G of
>>>> memory,
>>>>
>>>> bin/spark-shell --executor-memory 46g --master yarn
>>>>
>>>> We leave a small cushion for the OS so we don't take up all of the

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
After changing the allocation I'm getting the following in my logs. No idea
what this means.

14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED

[... the same report repeated once per second from 15:44:34 through
15:44:50 ...]

14/08/15 15:44:51 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372
 yarnAppState: ACCEPTED
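
An application that stays in ACCEPTED like this is normally still waiting
for the ResourceManager to grant its ApplicationMaster container, which is
consistent with the resource limits discussed below. One way to look from
the YARN side (a sketch using the stock YARN CLI, not a command from this
thread; the application ID is printed in the driver log):

# List applications parked in ACCEPTED, then inspect one of them.
yarn application -list -appStates ACCEPTED
yarn application -status <applicationId>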


On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza  wrote:

> We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
> maximum node capacity.
>
> -Sandy
>
>
> On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta wrote:
>
>> I just checked the YARN config and it looks like I need to change this
>> value. Should it be upgraded to 48G (the max memory allocated to YARN) per
>> node?
>>
>> 
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-mb</name>
>>   <value>6144</value>
>>   <source>java.io.BufferedInputStream@2e7e1ee</source>
>> </property>
>> 
>>
>>
>> On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta wrote:
>>
>>> Andrew,
>>>
>>> Thanks for your response.
>>>
>>> When I try to do the following.
>>>
>>>  ./spark-shell --executor-memory 46g --master yarn
>>>
>>> I get the following error.
>>>
>>> Exception in thread "main" java.lang.Exception: When running with master
>>> 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
>>> environment.
>>>
>>> at
>>> org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
>>>
>>> at
>>> org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
>>>
>>>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> After this I set the following env variable.
>>>
>>> export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/
>>>
>>> The program launches but then halts with the following error.
>>>
>>>
>>> *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104
>>> MB), is above the max threshold (6144 MB) of this cluster.*
>>>
>>> I guess this is some YARN setting that is not set correctly.
>>>
>>>
>>> Thanks
>>>
>>> -Soumya
>>>
>>>
>>> On Fri, A

Re: Running Spark shell on YARN

2014-08-15 Thread Sandy Ryza
We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
maximum node capacity.

-Sandy
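
On a 48G node, following this recommendation would look roughly like the
yarn-site.xml snippet below (a sketch, not a snippet from this thread:
48G = 48 * 1024 = 49152 MB, which would admit the 47104 MB that the rejected
46g request works out to; the value should stay within what
yarn.nodemanager.resource.memory-mb offers):

<property>
  <!-- Largest single container the scheduler will grant, in MB. -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>49152</value>
</property>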


On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta 
wrote:

> I just checked the YARN config and it looks like I need to change this value.
> Should it be upgraded to 48G (the max memory allocated to YARN) per node?
>
> 
> <property>
>   <name>yarn.scheduler.maximum-allocation-mb</name>
>   <value>6144</value>
>   <source>java.io.BufferedInputStream@2e7e1ee</source>
> </property>
> 
>
>
> On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta 
> wrote:
>
>> Andrew,
>>
>> Thanks for your response.
>>
>> When I try to do the following.
>>
>>  ./spark-shell --executor-memory 46g --master yarn
>>
>> I get the following error.
>>
>> Exception in thread "main" java.lang.Exception: When running with master
>> 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
>> environment.
>>
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
>>
>> at
>> org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
>>
>>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> After this I set the following env variable.
>>
>> export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/
>>
>> The program launches but then halts with the following error.
>>
>>
>> *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104
>> MB), is above the max threshold (6144 MB) of this cluster.*
>>
>> I guess this is some YARN setting that is not set correctly.
>>
>>
>> Thanks
>>
>> -Soumya
>>
>>
>> On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or  wrote:
>>
>>> Hi Soumya,
>>>
>>> The driver's console output prints out how much memory is actually
>>> granted to each executor, so from there you can verify how much memory the
>>> executors are actually getting. You should use the '--executor-memory'
>>> argument in spark-shell. For instance, assuming each node has 48G of memory,
>>>
>>> bin/spark-shell --executor-memory 46g --master yarn
>>>
>>> We leave a small cushion for the OS so we don't take up all of the
>>> entire system's memory. This option also applies to the standalone mode
>>> you've been using, but if you have been using the ec2 scripts, we set
>>> "spark.executor.memory" in conf/spark-defaults.conf for you automatically
>>> so you don't have to specify it each time on the command line. Of course,
>>> you can also do the same in YARN.
>>>
>>> -Andrew
>>>
>>>
>>>
>>> 2014-08-15 10:45 GMT-07:00 Soumya Simanta :
>>>
>>> I've been using the standalone cluster all this time and it worked fine.
>>> Recently I'm using another Spark cluster that is based on YARN, and I have
>>> no experience with YARN.
>>>
>>> The YARN cluster has 10 nodes and a total memory of 480G.
>>>
>>> I'm having trouble starting the spark-shell with enough memory.
>>> I'm doing a very simple operation - reading a 100GB file from HDFS and
>>> running a count on it. This fails due to out of memory on the executors.
>>>
>>> Can someone point to the command line parameters that I should use for
>>> spark-shell so that it works?
>>>
>>>
>>> Thanks
>>> -Soumya


>>>
>>
>


Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I just checked the YARN config and it looks like I need to change this value.
Should it be upgraded to 48G (the max memory allocated to YARN) per node?


<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
  <source>java.io.BufferedInputStream@2e7e1ee</source>
</property>



On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta 
wrote:

> Andrew,
>
> Thanks for your response.
>
> When I try to do the following.
>
>  ./spark-shell --executor-memory 46g --master yarn
>
> I get the following error.
>
> Exception in thread "main" java.lang.Exception: When running with master
> 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
> environment.
>
> at
> org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
>
> at
> org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
>
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> After this I set the following env variable.
>
> export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/
>
> The program launches but then halts with the following error.
>
>
> *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB),
> is above the max threshold (6144 MB) of this cluster.*
>
> I guess this is some YARN setting that is not set correctly.
>
>
> Thanks
>
> -Soumya
>
>
> On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or  wrote:
>
>> Hi Soumya,
>>
>> The driver's console output prints out how much memory is actually
>> granted to each executor, so from there you can verify how much memory the
>> executors are actually getting. You should use the '--executor-memory'
>> argument in spark-shell. For instance, assuming each node has 48G of memory,
>>
>> bin/spark-shell --executor-memory 46g --master yarn
>>
>> We leave a small cushion for the OS so we don't take up all of the entire
>> system's memory. This option also applies to the standalone mode you've
>> been using, but if you have been using the ec2 scripts, we set
>> "spark.executor.memory" in conf/spark-defaults.conf for you automatically
>> so you don't have to specify it each time on the command line. Of course,
>> you can also do the same in YARN.
>>
>> -Andrew
>>
>>
>>
>> 2014-08-15 10:45 GMT-07:00 Soumya Simanta :
>>
>>> I've been using the standalone cluster all this time and it worked fine.
>>> Recently I'm using another Spark cluster that is based on YARN, and I have
>>> no experience with YARN.
>>>
>>> The YARN cluster has 10 nodes and a total memory of 480G.
>>>
>>> I'm having trouble starting the spark-shell with enough memory.
>>> I'm doing a very simple operation - reading a 100GB file from HDFS and
>>> running a count on it. This fails due to out of memory on the executors.
>>>
>>> Can someone point to the command line parameters that I should use for
>>> spark-shell so that it works?
>>>
>>>
>>> Thanks
>>> -Soumya
>>>
>>>
>>
>


Re: Running Spark shell on YARN

2014-08-15 Thread Andrew Or
Hi Soumya,

The driver's console output prints out how much memory is actually granted
to each executor, so from there you can verify how much memory the
executors are actually getting. You should use the '--executor-memory'
argument in spark-shell. For instance, assuming each node has 48G of memory,

bin/spark-shell --executor-memory 46g --master yarn

We leave a small cushion for the OS so we don't take up all of the entire
system's memory. This option also applies to the standalone mode you've
been using, but if you have been using the ec2 scripts, we set
"spark.executor.memory" in conf/spark-defaults.conf for you automatically
so you don't have to specify it each time on the command line. Of course,
you can also do the same in YARN.
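
As a sketch of that last point, the same default can be set by hand for the
YARN setup in conf/spark-defaults.conf (the value assumes the 48G nodes
discussed above):

# conf/spark-defaults.conf -- read by spark-shell and spark-submit, so
# --executor-memory need not be repeated on the command line.
spark.executor.memory   46g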

-Andrew



2014-08-15 10:45 GMT-07:00 Soumya Simanta :

> I've been using the standalone cluster all this time and it worked fine.
> Recently I'm using another Spark cluster that is based on YARN, and I have
> no experience with YARN.
>
> The YARN cluster has 10 nodes and a total memory of 480G.
>
> I'm having trouble starting the spark-shell with enough memory.
> I'm doing a very simple operation - reading a 100GB file from HDFS and
> running a count on it. This fails due to out of memory on the executors.
>
> Can someone point to the command line parameters that I should use for
> spark-shell so that it works?
>
>
> Thanks
> -Soumya
>
>


Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I've been using the standalone cluster all this time and it worked fine.
Recently I'm using another Spark cluster that is based on YARN, and I have
no experience with YARN.

The YARN cluster has 10 nodes and a total memory of 480G.

I'm having trouble starting the spark-shell with enough memory.
I'm doing a very simple operation - reading a 100GB file from HDFS and
running a count on it. This fails due to out of memory on the executors.

Can someone point to the command line parameters that I should use for
spark-shell so that it works?


Thanks
-Soumya