Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.

Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will 
try that instead.


Is it possible to also represent '--driver-memory' and '--executor-memory' 
(and basically all properties) using the '--conf' directive?

The Reason: I actually discovered the below issue while writing a custom 
PYTHONSTARTUP script that I use to launch *bpython* or *python* or my 
*WING python IDE*. That script reads a python *dict* (from a file) 
containing key/value pairs, from which it constructs the 
'--driver-java-options ...' string; I will now switch to generating 
'--conf key1=val1 --conf key2=val2 --conf key3=val3' (and so on) instead.


If all of the properties can be represented this way, it makes the code 
cleaner (everything lives in the dict file, with no one-offs).
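
To make the idea concrete, here is a minimal sketch of the conversion I have 
in mind (the dict contents and helper are illustrative only, not our actual 
startup script): it emits repeated '--conf key=value' flags and pulls out the 
two memory settings that have dedicated options.

    # Illustrative sketch only: a hypothetical dict and helper, not our real
    # PYTHONSTARTUP script. Builds pyspark/spark-submit arguments from a dict
    # of Spark properties.
    props = {
        "spark.ui.port": "8468",
        "spark.yarn.executor.memoryOverhead": "512M",
        "spark.executor.instances": "3",
        "spark.driver.memory": "512M",    # has a dedicated flag
        "spark.executor.memory": "1G",    # has a dedicated flag
    }

    # Properties that have dedicated command-line options (per Sandy's note).
    DEDICATED = {
        "spark.driver.memory": "--driver-memory",
        "spark.executor.memory": "--executor-memory",
    }

    def to_cli_args(properties):
        args = []
        for key, value in sorted(properties.items()):
            if key in DEDICATED:
                args += [DEDICATED[key], value]
            else:
                args += ["--conf", "%s=%s" % (key, value)]
        return args

    print(" ".join(to_cli_args(props)))
    # -> --driver-memory 512M --conf spark.executor.instances=3
    #    --executor-memory 1G --conf spark.ui.port=8468 ...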

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:

Hi team didata,

This doesn't directly answer your question, but with Spark 1.1, 
instead of using the driver options, it's better to pass your Spark 
properties using the '--conf' option.


E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M


Additionally, driver and executor memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf 
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G 
--executor-memory 5G


-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. 
<subscripti...@didata.us> wrote:




[ ... SNIP: the quoted original message appears in full below ... ]

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.



Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. 
Everything went fine, and everything seems to work, but for the following.

Following are two invocations of the 'pyspark' script: one with enclosing 
quotes around the options passed to '--driver-java-options', and one without 
them. I added the following one-liner to the 'pyspark' script to show my 
problem...

ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that 
exports this variable.


=

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options 
-Dspark.executor.memory=1Gxxx <--- echo statement shows the option truncation.


While this succeeds in getting to a pyspark shell prompt (sc), the 
context isn't set up properly because, as seen above and below, only the 
first option took effect. (Note: spark.executor.memory is correct, but 
that's only because my spark defaults happen to coincide with it.)

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' 
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' 
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' 
'-Dspark.submit.pyFiles=' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' 
'-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:4040' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' 
'-Dspark.fileserver.uri=http://192.168.0.16:60305' 
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:44616' --executor-memory 1024 --executor-cores 1 
--num-executors  2 1> /stdout 2> /stderr


(Note: I happen to notice that 'spark.driver.memory' is missing as well).
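
As an illustration of what appears to be happening (a hypothetical sketch, 
not Spark code): without the enclosing quotes, the shell splits each '-D...' 
into its own word, so whatever consumes a single value after 
'--driver-java-options' picks up only the first one, which is consistent 
with the truncated PYSPARK_SUBMIT_ARGS echoed above. Python's shlex mimics 
the shell's word-splitting:

    # Hypothetical illustration (standard library only): how the shell
    # tokenizes the unquoted vs. quoted invocations.
    import shlex

    unquoted = ("pyspark --master yarn-client --driver-java-options "
                "-Dspark.executor.memory=1G -Dspark.ui.port=8468")
    quoted = ("pyspark --master yarn-client --driver-java-options "
              "'-Dspark.executor.memory=1G -Dspark.ui.port=8468'")

    print(shlex.split(unquoted))
    # ['pyspark', '--master', 'yarn-client', '--driver-java-options',
    #  '-Dspark.executor.memory=1G', '-Dspark.ui.port=8468']
    #  -> only '-Dspark.executor.memory=1G' is bound to the option.

    print(shlex.split(quoted))
    # ['pyspark', '--master', 'yarn-client', '--driver-java-options',
    #  '-Dspark.executor.memory=1G -Dspark.ui.port=8468']
    #  -> the whole quoted string stays one word, as intended.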

===

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options 
'-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options 
"-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx


While this does include all the options (shown in the echo output above 
and in the command executed below), the pyspark invocation fails, 
indicating that the application ended before I got to a shell prompt. 
See the snippet below.

14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' 
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' 
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.serializer.objectStreamReset=100' 
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' 
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' 
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:8468' 
'-Dspark.yarn.executor.memoryOverhead=512M' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G 
-Dspark.ui.port=8468 -Dspark.driver.memory=512M 
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.fileserver.uri=http://192.168.0.16:54171' 
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:58542' --executor-memory 1024 --executor-cores 1 
--num-executors  3 1> /stdout 2> /stderr



[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 1410903852044
 yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:

 appMasterRpcPort: -1
 appStartTime: 1410903852044
 yarnA

If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Dimension Data, LLC.

Hello friends:

It was mentioned in another (YARN-centric) email thread that 'SPARK_JAR' 
was deprecated, and that the 'spark.yarn.jar' property should be used 
instead for YARN submission. For example:

   user$ pyspark [some-options] --driver-java-options -Dspark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar
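
(Aside: if a programmatic route is acceptable, the same property can, I 
believe, also be set through SparkConf before the context is created. A 
minimal sketch, where the assembly path is just an example:)

    # Sketch only: setting spark.yarn.jar via SparkConf rather than on the CLI
    # (the assembly path below is an example; adjust it to your HDFS location).
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("yarn-jar-example")
            .set("spark.yarn.jar",
                 "hdfs://namenode:8020/path/to/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"))

    sc = SparkContext(conf=conf)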


What is the equivalent property to use for the LOCAL MODE case? 
spark.jar? spark.local.jar? I searched for this, but can't find where the 
definitions for these exist (a pointer to that would help, too). :)

For completeness/explicitness, I like to specify things like this on the 
CLI, even if there are default settings for them.

Thank you!
didata





Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Dimension Data, LLC.

Hi:

Curious... is there any reason not to use one of the pyspark options below?
Assuming each file is, say, 10k in size, are 50 files too many?
Does that touch on some practical limitation?


Usage: ./bin/pyspark [options]
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Where to run the driver program: either "client" to run
                              on the local machine, or "cluster" to run inside cluster.
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
[ ... snip ... ]
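
For what it's worth, here is a rough sketch of the "package it into an 
archive" route (paths and names are hypothetical): zip up the project's .py 
files and ship the archive with '--py-files', or add it at runtime with 
sc.addPyFile().

    # Hypothetical sketch: bundle a project directory of ~50 .py files into a
    # zip archive that can be shipped to the executors with --py-files.
    import os
    import zipfile

    def zip_project(src_dir="myproject", archive="myproject.zip"):
        parent = os.path.dirname(os.path.abspath(src_dir))
        with zipfile.ZipFile(archive, "w") as zf:
            for root, _, files in os.walk(src_dir):
                for name in files:
                    if name.endswith(".py"):
                        path = os.path.abspath(os.path.join(root, name))
                        # Store entries relative to the project's parent so that
                        # "import myproject.whatever" resolves inside the zip.
                        zf.write(path, os.path.relpath(path, parent))
        return archive

    # zip_project()  # e.g., run from the project's parent directory

    # Then, for example:
    #   ./bin/spark-submit --master yarn --py-files myproject.zip \
    #       examples/src/main/python/wordcount.py 1000
    # or, from a running job/shell:
    #   sc.addPyFile("myproject.zip")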




On 09/05/2014 12:00 PM, Davies Liu wrote:

Hi Oleg,

In order to simplify packaging and distributing your code, you could 
deploy a shared storage (such as NFS), put your project on it, and mount 
it on all the slaves as "/projects".

In your Spark job scripts, you can then access your project by putting 
that path onto sys.path, such as:

    import sys
    sys.path.append("/projects")
    import myproject

Davies

On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets wrote:
> Hi, we are evaluating PySpark and have successfully executed the PySpark
> examples on YARN.
>
> The next step we want to take: we have a Python project (a bunch of
> Python scripts using Anaconda packages). Question: what is the way to
> execute PySpark on YARN with a lot of Python files (~50)? Should they
> be packaged into an archive? What would the command to execute PySpark
> on YARN with a lot of files look like? Currently the command looks like:
>
> ./bin/spark-submit --master yarn --num-executors 3
> --driver-memory 4g --executor-memory 2g --executor-cores 1
> examples/src/main/python/wordcount.py 1000
>
> Thanks, Oleg.




Re: Spark on YARN question

2014-09-02 Thread Dimension Data, LLC.

Hello friends:

I have a follow-up to Andrew's well-articulated answer below (thank you 
for that).


(1) I've seen both of these invocations in various places:

  (a) '--master yarn'
  (b) '--master yarn-client'

the latter of which doesn't appear in the 
'pyspark|spark-submit|spark-shell --help' output.

Is case (a) meant for cluster-mode apps (where the driver runs out on 
a YARN ApplicationMaster), and case (b) for client-mode apps needing 
local client interaction?

Also (related), is case (b) simply shorthand for the following 
invocation syntax?

   '--master yarn --deploy-mode client'

(2) Seeking clarification on the first sentence below...

Note: To avoid a copy of the Assembly JAR every time I launch a job, 
I place it (the latest version) at a specific (but otherwise arbitrary) 
location on HDFS, and then set SPARK_JAR, like so (where you can 
thankfully use wild-cards):

   export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar


But my question here is: when specifying additional JARs like 
'--jars /path/to/jar1,/path/to/jar2,...' to the 
pyspark|spark-submit|spark-shell commands, are those JARs expected to 
*already* be at those path locations on both the _submitter_ server and 
the YARN _worker_ servers?

In other words, the '--jars' option won't cause the command to look for 
them locally at those paths and then ship and place them at the same 
path-locations remotely? They need to be there already, both locally and 
remotely. Correct?

Thank you. :)
didata


On 09/02/2014 12:05 PM, Andrew Or wrote:

Hi Greg,

You should not even need to manually install Spark on each of the 
worker nodes or put it into HDFS yourself. Spark on YARN will ship all 
necessary jars (i.e. the assembly + additional jars) to each of the 
containers for you. You can specify additional jars that your 
application depends on through the --jars argument if you are using 
spark-submit / spark-shell / pyspark.

As for environment variables, you can specify SPARK_YARN_USER_ENV on 
the driver node (where your application is submitted) to specify 
environment variables to be observed by your executors.

If you are using the spark-submit / spark-shell / pyspark scripts, then 
you can set Spark properties in the conf/spark-defaults.conf properties 
file, and these will be propagated to the executors. In other words, 
configurations on the slave nodes don't do anything.


For example,
$ vim conf/spark-defaults.conf   # set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars 
/local/path/to/my/jar1,/another/jar2


Best,
-Andrew