Hi,

I'm trying to connect to a YARN cluster by running these commands:

export HADOOP_CONF_DIR=/hadoop/var/hadoop/conf/
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_YARN_MODE=true
export SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
export SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar
export MASTER=yarn-client
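# optional sanity check: confirm the variables above are actually exported before launching
env | grep -E 'HADOOP_CONF_DIR|YARN_CONF_DIR|SPARK_YARN_MODE|SPARK_JAR|SPARK_YARN_APP_JAR|MASTER'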
./bin/spark-shell

This is what I have in my yarn-site.xml. Note that I have not set yarn.resourcemanager.scheduler.address, relying instead on the defaults (https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml):

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>my-machine</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:51176</value>
  </property>
  <property>
    <name>yarn.nodemanager.webapp.address</name>
    <value>${yarn.nodemanager.hostname}:1183</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>/apollo/env/ForecastPipelineHadoopCluster/lib/*</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>500</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5.1</value>
    <description>We use a lot of jars, which consumes a ton of vmem.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>24500</value>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>
    <value>/apollo/env/ForecastPipelineHadoopCluster/var/hadoop/conf/exclude/resourcemanager.exclude</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>11000</value>
    <description>This is the maximum amount of RAM that any job can ask for; any more and the job will be denied. 11000 is currently the largest amount of RAM any job uses. If a new job needs more RAM, the team adding the job needs to ask the Forecasting Platform team for permission to change this number.</description>
  </property>
  <property>
    <name>yarn.nodemanager.user-home-dir</name>
    <value>/apollo/env/ForecastPipelineHadoopCluster/var/hadoop/tmp/</value>
    <description>I'm not particularly fond of this, but MATLAB writes to the user's home directory. Without this variable MATLAB will always segfault.</description>
  </property>
</configuration>

When I go to my-machine:8088/conf I get the expected output:

<property><name>yarn.resourcemanager.scheduler.address</name><value>my-machine:8030</value><source>programatically</source></property>

However, when I run spark-shell, my application is stuck at this phase:

14/05/02 00:41:35 INFO yarn.Client: Submitting application to ASM
14/05/02 00:41:35 INFO impl.YarnClientImpl: Submitted application application_1397083384516_6571 to ResourceManager at my-machine/my-ip:51176
14/05/02 00:41:35 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: 0
     appStartTime: 1398991295872
     yarnAppState: ACCEPTED
14/05/02 00:41:36 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: 0
     appStartTime: 1398991295872
     yarnAppState: ACCEPTED

and it keeps going.
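In case it's useful: polling the application from another terminal with the YARN CLI (assuming the yarn script from the same Hadoop 2.2 install is on the PATH) shows the same thing, with the state stuck at ACCEPTED:

yarn application -status application_1397083384516_6571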
When I look at the log on the resource manager UI, I get this:

2014-05-02 02:57:31,862 INFO [sparkYarnAM-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2014-05-02 02:57:31,917 INFO [sparkYarnAM-akka.actor.default-dispatcher-5] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2014-05-02 02:57:32,104 INFO [sparkYarnAM-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@another-machine:37400]
2014-05-02 02:57:32,105 INFO [sparkYarnAM-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting now listens on addresses: [akka.tcp://sparkYarnAM@another-machine:37400]
2014-05-02 02:57:33,217 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(56)) - *Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8030*
2014-05-02 02:57:33,293 INFO [main] yarn.WorkerLauncher (Logging.scala:logInfo(50)) - ApplicationAttemptId: appattempt_1397083384516_6859_000001
2014-05-02 02:57:33,294 INFO [main] yarn.WorkerLauncher (Logging.scala:logInfo(50)) - Registering the ApplicationMaster
2014-05-02 02:57:34,330 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:35,334 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:36,338 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:37,342 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:38,346 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:39,350 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:40,354 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:41,358 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:42,362 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:43,366 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

So it seems like the yarn.resourcemanager.scheduler.address configuration is not being picked up for some reason. I've tried hardcoding the address in the yarn-site.xml that Spark was looking at (see the P.S. below), and it did not make a difference, so I think this might be a YARN issue.

thanks,
du
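P.S. For completeness, this is (roughly) the property I hardcoded when testing, with the value taken from the /conf output above:

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>my-machine:8030</value>
</property>

Even with this set explicitly in the yarn-site.xml under HADOOP_CONF_DIR, the application master still tried to connect to 0.0.0.0:8030.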