Hello friends:
Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN
distribution. Everything went fine and everything seems to work,
except for the following.
Below are two invocations of the 'pyspark' script, one with enclosing
quotes around the options passed to '--driver-java-options' and one
without them. To show the problem, I added the following one-liner to
the 'pyspark' script...
ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that
exports this variable.
=========================================================
FIRST:
[ without enclosing quotes ]:
user@linux$ pyspark --master yarn-client --driver-java-options
-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options
-Dspark.executor.memory=1Gxxx   <--- the echo output shows the option truncation.
While this gets me to a pyspark shell prompt (with sc defined), the
context isn't set up properly: as the echo output above and the launch
command below show, every option after the first was dropped. (Note that
spark.executor.memory looks correct, but only because my spark defaults
happen to coincide with it.)
14/09/16 17:35:32 INFO yarn.Client: command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89'
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G'
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars='
'-Dspark.submit.pyFiles='
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress='
'-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:4040'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G'
'-Dspark.fileserver.uri=http://192.168.0.16:60305'
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar
null --arg 'dstorm:44616' --executor-memory 1024 --executor-cores 1
--num-executors 2 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
(Note: I also happened to notice that 'spark.driver.memory' is missing.)
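My working theory for the truncation is ordinary shell word-splitting:
without enclosing quotes, only the first -D token is taken as the value
of '--driver-java-options', and the remaining tokens are passed along as
separate arguments. A minimal sketch that reproduces the effect outside
of Spark (args.sh is just a throwaway script for illustration):

#!/bin/bash
# args.sh - print each command-line argument on its own line
i=0
for arg in "$@"; do
  i=$((i + 1))
  echo "arg $i: $arg"
done

user@linux$ ./args.sh --driver-java-options -Dfoo=1 -Dbar=2
arg 1: --driver-java-options
arg 2: -Dfoo=1
arg 3: -Dbar=2

user@linux$ ./args.sh --driver-java-options '-Dfoo=1 -Dbar=2'
arg 1: --driver-java-options
arg 2: -Dfoo=1 -Dbar=2

That at least explains why the unquoted form can't work as written.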
===========================================
NEXT:
[ So let's try with enclosing quotes ]
user@linux$ pyspark --master yarn-client --driver-java-options
'-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options
"-Dspark.executor.memory=1G -Dspark.ui.port=8468
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M
-Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx
While this time all of the options do survive (as shown in the echo
output above and in the command executed below), the pyspark invocation
fails: the application ends before I ever get to a shell prompt.
See the snippet below.
14/09/16 17:44:12 INFO yarn.Client: command: $JAVA_HOME/bin/java
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada'
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M'
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
'-Dspark.serializer.objectStreamReset=100'
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True'
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles='
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm'
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer'
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell'
'-Dspark.driver.appUIAddress=dstorm:8468'
'-Dspark.yarn.executor.memoryOverhead=512M'
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G
-Dspark.ui.port=8468 -Dspark.driver.memory=512M
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
'-Dspark.fileserver.uri=http://192.168.0.16:54171'
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542'
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar
null --arg 'dstorm:58542' --executor-memory 1024 --executor-cores 1
--num-executors 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: -1
appStartTime: 1410903852044
yarnAppState: ACCEPTED
14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: 0
appStartTime: 1410903852044
yarnAppState: RUNNING
14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn
application already ended: FAILED
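For what it's worth, to dig into why the application failed on the YARN
side, the application master / container logs can be fetched with
(substituting the application id that the ResourceManager reports for
this run):

user@linux$ yarn logs -applicationId <application_id>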
Am I doing something wrong?
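In the meantime, as a workaround, I'm planning to bypass
'--driver-java-options' entirely and pass the same settings through the
dedicated spark-submit flags, plus '--conf' (or conf/spark-defaults.conf)
for the YARN-specific properties. Roughly (a sketch with the same values
as above; I haven't fully verified it on this cluster):

user@linux$ pyspark --master yarn-client \
  --driver-memory 512M \
  --executor-memory 1G \
  --num-executors 3 \
  --conf spark.ui.port=8468 \
  --conf spark.yarn.executor.memoryOverhead=512M \
  --conf spark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar

But I'd still like to understand the '--driver-java-options' behaviour.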
Thank you in advance!
Team didata