Hello friends:
Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine and everything seems to work,
except for the following.
Below are two invocations of the 'pyspark' script, one with enclosing
quotes around the options passed to '--driver-java-options' and one
without them. To show the problem, I added the following one-liner to
the 'pyspark' script...
ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that
exports this variable.
=========================================================
FIRST:
[ without enclosing quotes ]:
user@linux$ pyspark --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx   <--- the echo output shows the option truncation.
While this gets me to a pyspark shell prompt (with sc defined), the
context isn't set up properly: as the echo output above and the launch
command below show, every option after the first was dropped. (Note that
spark.executor.memory looks correct, but only because my spark defaults
happen to coincide with it.)
14/09/16 17:35:32 INFO yarn.Client: command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G'
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars='
'-Dspark.submit.pyFiles='
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
'-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:4040'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
'-Dspark.fileserver.uri=http://192.168.0.16:60305'
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar
null --arg 'dstorm:44616' --executor-memory 1024 --executor-cores 1
--num-executors 2 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
(Note: I also happened to notice that 'spark.driver.memory' is missing.)
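My working theory for the truncation is ordinary shell word-splitting:
without enclosing quotes, only the first -D token is taken as the value
of '--driver-java-options', and the remaining tokens are passed along as
separate arguments. A minimal sketch that reproduces the effect outside
of Spark (args.sh is just a throwaway script for illustration):

#!/bin/bash
# args.sh - print each command-line argument on its own line
i=0
for arg in "$@"; do
  i=$((i + 1))
  echo "arg $i: $arg"
done

user@linux$ ./args.sh --driver-java-options -Dfoo=1 -Dbar=2
arg 1: --driver-java-options
arg 2: -Dfoo=1
arg 3: -Dbar=2

user@linux$ ./args.sh --driver-java-options '-Dfoo=1 -Dbar=2'
arg 1: --driver-java-options
arg 2: -Dfoo=1 -Dbar=2

That at least explains why the unquoted form can't work as written.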
===========================================
NEXT:
[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options
'-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options
"-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx
While this time all of the options do survive (as shown in the echo
output above and in the command executed below), the pyspark invocation
fails: the application ends before I ever get to a shell prompt.
See the snippet below.
14/09/16 17:44:12 INFO yarn.Client: command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada'
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M'
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm'
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:8468'
'-Dspark.yarn.executor.memoryOverhead=512M'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G
-Dspark.ui.port=8468 -Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
'-Dspark.fileserver.uri=http://192.168.0.16:54171'
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar
null --arg 'dstorm:58542' --executor-memory 1024 --executor-cores 1
--num-executors 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: 0
appStartTime: 1410903852044
yarnAppState: RUNNING
14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn
application already ended: FAILED
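For what it's worth, to dig into why the application failed on the YARN
side, the application master / container logs can be fetched with
(substituting the application id that the ResourceManager reports for
this run):

user@linux$ yarn logs -applicationId <application_id>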
Am I doing something wrong?
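In the meantime, as a workaround, I'm planning to bypass
'--driver-java-options' entirely and pass the same settings through the
dedicated spark-submit flags, plus '--conf' (or conf/spark-defaults.conf)
for the YARN-specific properties. Roughly (a sketch with the same values
as above; I haven't fully verified it on this cluster):

user@linux$ pyspark --master yarn-client \
  --driver-memory 512M \
  --executor-memory 1G \
  --num-executors 3 \
  --conf spark.ui.port=8468 \
  --conf spark.yarn.executor.memoryOverhead=512M \
  --conf spark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar

But I'd still like to understand the '--driver-java-options' behaviour.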
Thank you in advance!
Team didata