Hi,

I'm trying to connect to a YARN cluster by running these commands:

export HADOOP_CONF_DIR=/hadoop/var/hadoop/conf/
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_YARN_MODE=true
export SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
export SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar
export MASTER=yarn-client
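# optional sanity check: confirm the variables above are actually exported before launching
env | grep -E 'HADOOP_CONF_DIR|YARN_CONF_DIR|SPARK_YARN_MODE|SPARK_JAR|SPARK_YARN_APP_JAR|MASTER'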
./bin/spark-shell

This is what I have in my yarn-site.xml. Note that I have not set yarn.resourcemanager.scheduler.address, relying instead on the defaults (https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml):

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>my-machine</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:51176</value>
  </property>
  <property>
    <name>yarn.nodemanager.webapp.address</name>
    <value>${yarn.nodemanager.hostname}:1183</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value>/apollo/env/ForecastPipelineHadoopCluster/lib/*</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>500</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5.1</value>
    <description>We use a lot of jars, which consumes a ton of vmem.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>24500</value>
  </property>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>
    <value>/apollo/env/ForecastPipelineHadoopCluster/var/hadoop/conf/exclude/resourcemanager.exclude</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>11000</value>
    <description>This is the maximum amount of RAM that any job can ask for; any more and the job will be denied. 11000 is currently the largest amount of RAM any job uses. If a new job needs more RAM, the team adding the job needs to ask the Forecasting Platform team for permission to change this number.</description>
  </property>
  <property>
    <name>yarn.nodemanager.user-home-dir</name>
    <value>/apollo/env/ForecastPipelineHadoopCluster/var/hadoop/tmp/</value>
    <description>I'm not particularly fond of this, but MATLAB writes to the user's home directory. Without this variable MATLAB will always segfault.</description>
  </property>
</configuration>

When I go to my-machine:8088/conf I get the expected output:

<property><name>yarn.resourcemanager.scheduler.address</name><value>my-machine:8030</value><source>programatically</source></property>

However, when I run spark-shell, my application is stuck at this phase:

14/05/02 00:41:35 INFO yarn.Client: Submitting application to ASM
14/05/02 00:41:35 INFO impl.YarnClientImpl: Submitted application application_1397083384516_6571 to ResourceManager at my-machine/my-ip:51176
14/05/02 00:41:35 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: 0
     appStartTime: 1398991295872
     yarnAppState: ACCEPTED
14/05/02 00:41:36 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
     appMasterRpcPort: 0
     appStartTime: 1398991295872
     yarnAppState: ACCEPTED

and it keeps going.
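In case it's useful: polling the application from another terminal with the YARN CLI (assuming the yarn script from the same Hadoop 2.2 install is on the PATH) shows the same thing, with the state stuck at ACCEPTED:

yarn application -status application_1397083384516_6571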
When I look at the log on the resource manager UI, I get this:

2014-05-02 02:57:31,862 INFO [sparkYarnAM-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2014-05-02 02:57:31,917 INFO [sparkYarnAM-akka.actor.default-dispatcher-5] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2014-05-02 02:57:32,104 INFO [sparkYarnAM-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@another-machine:37400]
2014-05-02 02:57:32,105 INFO [sparkYarnAM-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting now listens on addresses: [akka.tcp://sparkYarnAM@another-machine:37400]
2014-05-02 02:57:33,217 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(56)) - *Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8030*
2014-05-02 02:57:33,293 INFO [main] yarn.WorkerLauncher (Logging.scala:logInfo(50)) - ApplicationAttemptId: appattempt_1397083384516_6859_000001
2014-05-02 02:57:33,294 INFO [main] yarn.WorkerLauncher (Logging.scala:logInfo(50)) - Registering the ApplicationMaster
2014-05-02 02:57:34,330 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:35,334 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:36,338 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:37,342 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:38,346 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:39,350 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:40,354 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:41,358 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:42,362 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-05-02 02:57:43,366 INFO [main] ipc.Client (Client.java:handleConnectionFailure(783)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

So it seems like the yarn.resourcemanager.scheduler.address configuration is not being picked up for some reason. I've tried hardcoding the address in the yarn-site.xml that Spark was looking at (see the P.S. below), and it did not make a difference, so I think this might be a YARN issue.

thanks,
du
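P.S. For completeness, this is (roughly) the property I hardcoded when testing, with the value taken from the /conf output above:

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>my-machine:8030</value>
</property>

Even with this set explicitly in the yarn-site.xml under HADOOP_CONF_DIR, the application master still tried to connect to 0.0.0.0:8030.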