Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. The build went fine and everything seems
to work, except for the following.

Below are two invocations of the 'pyspark' script: one without enclosing quotes around the value passed to '--driver-java-options', and one with them. To show my problem, I added the following one-line echo to the 'pyspark' script...

ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that exports this variable.

=========================================================

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options -Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options -Dspark.executor.memory=1Gxxx   <--- The echo output shows the option truncation.

While this succeeds in getting to a pyspark shell prompt (sc), the context isn't set up properly: as the echo output above and the launch command further below show, none of the options after the first one took effect. (Note: spark.executor.memory looks correct, but only because
my Spark defaults happen to coincide with it.)
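(I believe this is ordinary shell word-splitting: without quotes, every space-separated token becomes its own argument, and the submit parser consumes exactly one value for '--driver-java-options'. A minimal sketch, independent of Spark; 'show_args.sh' is just a throwaway helper I made up, not anything from the Spark tree:)

#!/usr/bin/env bash
# show_args.sh: throwaway demo helper (not part of Spark). Prints each
# argument it receives on its own line so the shell's word-splitting is visible.
i=0
for arg in "$@"; do
  printf 'argv[%d] = %s\n' "$i" "$arg"
  i=$((i + 1))
done

user@linux$ ./show_args.sh --driver-java-options -Dfoo=1 -Dbar=2
argv[0] = --driver-java-options
argv[1] = -Dfoo=1
argv[2] = -Dbar=2          <--- a separate argument, not part of the option's value

user@linux$ ./show_args.sh --driver-java-options '-Dfoo=1 -Dbar=2'
argv[0] = --driver-java-options
argv[1] = -Dfoo=1 -Dbar=2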

14/09/16 17:35:32 INFO yarn.Client: command: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp '-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' '-Dspark.driver.appUIAddress=dstorm:4040' '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' '-Dspark.fileserver.uri=http://192.168.0.16:60305' '-Dspark.driver.port=44616' '-Dspark.master=yarn-client' org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar null --arg 'dstorm:44616' --executor-memory 1024 --executor-cores 1 --num-executors 2 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

(Note: I also noticed that 'spark.driver.memory' is missing, which makes sense if everything after the first -D option was dropped.)
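(For comparison, here is the same setup expressed through spark-submit's dedicated flags and '--conf' instead of '--driver-java-options'. This is an untested sketch: the 1.1.0 docs list '--conf', and they describe 'spark.yarn.executor.memoryOverhead' as a plain number of megabytes, so I wrote 512 rather than 512M. In yarn-client mode the driver JVM is already running by the time system properties are read, which is why '--driver-memory', which the launcher honors before starting the JVM, seems like the more reliable route for that setting anyway:)

pyspark --master yarn-client \
  --driver-memory 512M \
  --executor-memory 1G \
  --num-executors 3 \
  --conf spark.ui.port=8468 \
  --conf spark.yarn.executor.memoryOverhead=512 \
  --conf spark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar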

===========================================

NEXT:

[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options '-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options "-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx

This time all the options do survive (as shown in the echo output above and the launch command below), but the pyspark invocation fails: the YARN application ends before I ever reach a shell prompt.
See the snippet below.

14/09/16 17:44:12 INFO yarn.Client: command: $JAVA_HOME/bin/java -server -Xmx512m -Djava.io.tmpdir=$PWD/tmp '-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' '-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' '-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' '-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' '-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' '-Dspark.driver.appUIAddress=dstorm:8468' '-Dspark.yarn.executor.memoryOverhead=512M' '-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G -Dspark.ui.port=8468 -Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 -Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' '-Dspark.fileserver.uri=http://192.168.0.16:54171' '-Dspark.master=yarn-client' '-Dspark.driver.port=58542' org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar null --arg 'dstorm:58542' --executor-memory 1024 --executor-cores 1 --num-executors 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr


[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: -1
     appStartTime: 1410903852044
     yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: -1
     appStartTime: 1410903852044
     yarnAppState: ACCEPTED

14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: -1
     appStartTime: 1410903852044
     yarnAppState: ACCEPTED

14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: 0
     appStartTime: 1410903852044
     yarnAppState: RUNNING

14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn application already ended: FAILED
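(The client log only reports FAILED; I assume the actual reason is in the ApplicationMaster's stderr under <LOG_DIR>. With YARN log aggregation enabled, something like the following should retrieve it; the application ID here is a placeholder, substitute the one YARN printed:)

user@linux$ yarn logs -applicationId application_<clusterTimestamp>_<id>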


Am I doing something wrong?

Thank you in advance!
Team didata



