Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...
Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will try that instead.

Is it possible to also represent '--driver-memory' and '--executor-memory' (and basically all properties) using the '--conf' directive?

The reason I ask: I actually discovered the issue below while writing a custom PYTHONSTARTUP script that I use to launch *bpython*, *python*, or my *WING Python IDE*. That script reads a Python *dict* (from a file) containing key/value pairs, from which it constructs the "--driver-java-options ..." string; I will now switch it to generating '--conf key1=val1 --conf key2=val2 --conf key3=val3' (and so on) instead. If all of the properties can be represented that way, the code stays cleaner (everything lives in the dict file, with no one-offs).

Either way, thank you. =:)

Noel,
team didata

On 09/16/2014 08:03 PM, Sandy Ryza wrote:

Hi team didata,

This doesn't directly answer your question, but with Spark 1.1, instead of using the driver options, it's better to pass your Spark properties using the "--conf" option. E.g.:

    pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M

Additionally, driver memory and executor memory have dedicated options:

    pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M --driver-memory 3G --executor-memory 5G

-Sandy

On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. <subscripti...@didata.us> wrote:

Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. Everything went fine, and everything seems to work, except for the following.

Below are two invocations of the 'pyspark' script: one with enclosing quotes around the options passed to '--driver-java-options', and one without them. I added the following one-liner to the 'pyspark' script to show my problem:

    ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx"   # Added after the line that exports this variable.
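The dict-to-flags conversion Noel describes can be sketched in a few lines. This is a hypothetical helper (the function name and structure are invented for illustration; the property names are real Spark settings from this thread), assuming the dict has already been loaded from the file:

```python
# Hypothetical sketch: flatten a dict of Spark properties (as read from a
# file) into "--conf key=value" argument pairs for pyspark/spark-submit.
# Keys are sorted only to make the output deterministic.

def dict_to_conf_args(props):
    """Turn {key: value} into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in sorted(props.items()):
        args.extend(["--conf", "%s=%s" % (key, value)])
    return args

spark_props = {
    "spark.ui.port": "8468",
    "spark.executor.instances": "3",
    "spark.yarn.executor.memoryOverhead": "512M",
}

print(" ".join(dict_to_conf_args(spark_props)))
```

Because each property becomes its own `--conf key=value` word pair, no shell quoting of the whole option string is needed, which sidesteps the truncation problem discussed below in the thread.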
= FIRST: [ without enclosing quotes ]

user@linux$ pyspark --master yarn-client --driver-java-options -Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar

xxx --master yarn-client --driver-java-options -Dspark.executor.memory=1Gxxx   <--- echo statement shows option truncation.

While this succeeds in getting to a pyspark shell prompt (sc), the context isn't set up properly because, as the echo output above and the launch command below show, only the first option took effect; the rest were dropped. (Note: spark.executor.memory looks correct, but only because my Spark defaults happen to coincide with it.)

14/09/16 17:35:32 INFO yarn.Client: command: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp '-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' '-Dspark.driver.appUIAddress=dstorm:4040' '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' '-Dspark.fileserver.uri=http://192.168.0.16:60305' '-Dspark.driver.port=44616' '-Dspark.master=yarn-client' org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar null --arg 'dstorm:44616' --executor-memory 1024 --executor-cores 1 --num-executors 2 1> /stdout 2> /stderr

(Note: I happen to notice that 'spark.driver.memory' is missing as well.)
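The truncation in the first case follows from ordinary shell word-splitting: `--driver-java-options` consumes exactly one following word, so without quotes only the first `-D...` token is bound to it and the rest are parsed as separate arguments. A small demonstration using Python's `shlex`, which mimics POSIX shell tokenization (the option strings are shortened for illustration):

```python
import shlex

# Unquoted: the -D tokens are separate shell words, so --driver-java-options
# receives only the first one.
unquoted = "--driver-java-options -Dspark.executor.memory=1G -Dspark.ui.port=8468"

# Quoted: the whole -D... string is a single shell word.
quoted = "--driver-java-options '-Dspark.executor.memory=1G -Dspark.ui.port=8468'"

print(shlex.split(unquoted))  # three words: the flag binds only the first -D token
print(shlex.split(quoted))    # two words: the flag gets the entire quoted string
```

This is why the quoted form carries all the options through to PYSPARK_SUBMIT_ARGS, as the second invocation below shows.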
=== NEXT: [ So let's try with enclosing quotes ]

user@linux$ pyspark --master yarn-client --driver-java-options '-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'

xxx --master yarn-client --driver-java-options "-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.execut
Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...
Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. Everything went fine, and everything seems to work, except for the following.

Below are two invocations of the 'pyspark' script: one with enclosing quotes around the options passed to '--driver-java-options', and one without them. I added the following one-liner to the 'pyspark' script to show my problem:

    ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx"   # Added after the line that exports this variable.

= FIRST: [ without enclosing quotes ]

user@linux$ pyspark --master yarn-client --driver-java-options -Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar

xxx --master yarn-client --driver-java-options -Dspark.executor.memory=1Gxxx   <--- echo statement shows option truncation.

While this succeeds in getting to a pyspark shell prompt (sc), the context isn't set up properly because, as the echo output above and the launch command below show, only the first option took effect; the rest were dropped. (Note: spark.executor.memory looks correct, but only because my Spark defaults happen to coincide with it.)
14/09/16 17:35:32 INFO yarn.Client: command: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp '-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' '-Dspark.driver.appUIAddress=dstorm:4040' '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' '-Dspark.fileserver.uri=http://192.168.0.16:60305' '-Dspark.driver.port=44616' '-Dspark.master=yarn-client' org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar null --arg 'dstorm:44616' --executor-memory 1024 --executor-cores 1 --num-executors 2 1> /stdout 2> /stderr

(Note: I happen to notice that 'spark.driver.memory' is missing as well.)

=== NEXT: [ So let's try with enclosing quotes ]

user@linux$ pyspark --master yarn-client --driver-java-options '-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'

xxx --master yarn-client --driver-java-options "-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx

While this does carry all the options (shown in the echo output above and the command executed below), the pyspark invocation fails, indicating that the application ended before I got to a shell prompt. See the snippet below.
14/09/16 17:44:12 INFO yarn.Client: command: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp '-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' '-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' '-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' '-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' '-Dspark.driver.appUIAddress=dstorm:8468' '-Dspark.yarn.executor.memoryOverhead=512M' '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' '-Dspark.fileserver.uri=http://192.168.0.16:54171' '-Dspark.master=yarn-client' '-Dspark.driver.port=58542' org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar null --arg 'dstorm:58542' --executor-memory 1024 --executor-cores 1 --num-executors 3 1> /stdout 2> /stderr

[ ... SNIP ... ]

14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
    appMasterRpcPort: -1
    appStartTime: 1410903852044
    yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
    appMasterRpcPort: -1
    appStartTime: 1410903852044
    yarnA
If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...
Hello friends:

It was mentioned in another (YARN-centric) email thread that 'SPARK_JAR' is deprecated, and that the 'spark.yarn.jar' property should be used instead for YARN submission. For example:

    user$ pyspark [some-options] --driver-java-options -Dspark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar

What is the equivalent property to use for the LOCAL MODE case? spark.jar? spark.local.jar? I searched for this, but can't find where the definitions for these properties live (perhaps a pointer to that, too). :)

For completeness/explicitness, I like to specify things like this on the CLI, even if there are default settings for them.

Thank you!
didata
Re: PySpark on Yarn a lot of python scripts project
Hi:

Curious... is there any reason not to use one of the pyspark options below? Assuming each file is, say, 10KB in size, are 50 files too many? Does that touch on some practical limitation?

Usage: ./bin/pyspark [options]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Where to run the driver program: either "client" to run on the local machine, or "cluster" to run inside the cluster.
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name for your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working directory of each executor.
  [ ... snip ... ]

On 09/05/2014 12:00 PM, Davies Liu wrote:
> Hi Oleg,
>
> In order to simplify the process of packaging and distributing your code, you could deploy shared storage (such as NFS), put your project on it, and mount it on all the slaves as "/projects".
>
> In the Spark job scripts, you can access your project by putting that path into sys.path, for example:
>
>     import sys
>     sys.path.append("/projects")
>     import myproject
>
> Davies
>
> On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets wrote:
>> Hi,
>>
>> We are evaluating PySpark and have successfully executed PySpark examples on YARN.
>>
>> Next step: we have a Python project (a bunch of Python scripts using Anaconda packages). Question: what is the way to execute PySpark on YARN with a lot of Python files (~50)? Should they be packaged into an archive? What would the command to execute PySpark on YARN with a lot of files look like?
>> Currently the command looks like:
>>
>>     ./bin/spark-submit --master yarn --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/src/main/python/wordcount.py 1000
>>
>> Thanks, Oleg.
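One common answer to Oleg's packaging question is to bundle the project's .py files into a single zip and pass it via --py-files. A minimal sketch, assuming a flat-or-nested project directory; the function name and paths are invented for illustration:

```python
# Hypothetical sketch: archive every .py file under a project directory into
# one zip that can be handed to pyspark/spark-submit via --py-files.
import os
import zipfile

def zip_project(src_dir, zip_path):
    """Write every .py file under src_dir into zip_path, preserving
    each file's path relative to src_dir (so package imports still work)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    zf.write(full, os.path.relpath(full, src_dir))
    return zip_path
```

The resulting archive would then be supplied as, e.g., `--py-files myproject.zip`, so the ~50 files travel as a single artifact instead of a 50-entry comma-separated list.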
Re: Spark on YARN question
Hello friends:

I have a follow-up to Andrew's well-articulated answer below (thank you for that).

(1) I've seen both of these invocations in various places:

    (a) '--master yarn'
    (b) '--master yarn-client'

the latter of which doesn't appear in 'pyspark|spark-submit|spark-shell --help' output. Is case (a) meant for cluster-mode apps (where the driver runs out on a YARN ApplicationMaster), and case (b) for client-mode apps needing client interaction locally? Also (related), is case (b) simply shorthand for the following invocation syntax?

    '--master yarn --deploy-mode client'

(2) Seeking clarification on the first sentence below...

Note: To avoid a copy of the assembly JAR every time I launch a job, I place it (the latest version) at a specific (but otherwise arbitrary) location on HDFS, and then set SPARK_JAR, like so (where you can thankfully use wild-cards):

    export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar

But my question here is: when specifying additional JARs, like '--jars /path/to/jar1,/path/to/jar2,...', to pyspark|spark-submit|spark-shell commands, are those JARs expected to *already* exist at those path locations on both the _submitter_ server and the YARN _worker_ servers? In other words, the '--jars' option won't cause the command to look for them locally at those path locations and then ship & place them at the same path locations remotely? They need to be there already, both locally and remotely. Correct?

Thank you. :)
didata

On 09/02/2014 12:05 PM, Andrew Or wrote:

Hi Greg,

You should not need to manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on YARN will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends on through the --jars argument if you are using spark-submit / spark-shell / pyspark.
As for environment variables, you can specify SPARK_YARN_USER_ENV on the driver node (where your application is submitted) to set environment variables to be observed by your executors. If you are using the spark-submit / spark-shell / pyspark scripts, then you can set Spark properties in the conf/spark-defaults.conf properties file, and these will be propagated to the executors. In other words, configurations on the slave nodes don't do anything. For example:

    $ vim conf/spark-defaults.conf    # set a few properties
    $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
    $ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew
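For readers who want to inspect what a spark-defaults.conf actually sets, the file format is simple: one whitespace-separated key/value pair per line, with blank lines and '#' comments ignored. A minimal parser sketch, assuming that standard format (the helper function itself is invented for illustration, not part of any Spark API):

```python
# Hypothetical sketch: parse spark-defaults.conf-style text into a dict.
# Format assumed: "key   value" per line; '#' lines and blanks are skipped.

def parse_spark_defaults(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props

sample = """
# defaults observed by spark-submit / spark-shell / pyspark
spark.master            yarn
spark.executor.memory   2g
"""
print(parse_spark_defaults(sample))
```

A quick way to sanity-check that the properties you put in the file are the ones the scripts will pick up.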