Hi, it is a multiple-node cluster with two master nodes (rm1 and rm2); my yarn-site.xml is below.
At the moment, the ResourceManager HA works as follows:

1) At rm1, run ./sbin/start-yarn.sh, then:

yarn rmadmin -getServiceState rm1
active

yarn rmadmin -getServiceState rm2
14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/192.168.1.1:23142. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

2) At rm2, run ./sbin/start-yarn.sh, then:

yarn rmadmin -getServiceState rm1
standby

Some questions:

Q1) I need to start YARN on EACH master separately; is this normal? Is there a way to run ./sbin/start-yarn.sh on rm1 only and have the STANDBY ResourceManager on rm2 started as well?

Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager goes down in an auto-failover environment? Or how do you monitor the ACTIVE/STANDBY status of the ResourceManagers?
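For Q2, the status checks above can be turned into a simple polling script run from cron. A minimal sketch, assuming `yarn` is on the PATH, the rm-ids rm1/rm2 from the yarn-site.xml below, and a mailx-style `mail` command (the recipient address is a placeholder):

```shell
#!/bin/sh
# Sketch: poll both ResourceManagers and alert when neither reports
# "active". Assumptions: `yarn` on PATH, rm-ids rm1,rm2 as configured;
# the mail command and admin@example.com address are placeholders.

rm_state() {
    # Prints "active" or "standby"; prints nothing if the RM is down.
    yarn rmadmin -getServiceState "$1" 2>/dev/null
}

find_active() {
    # Echoes the id of the first active RM; returns non-zero if none.
    for id in rm1 rm2; do
        if [ "$(rm_state "$id")" = "active" ]; then
            echo "$id"
            return 0
        fi
    done
    return 1
}

alert_if_no_active() {
    if ! find_active >/dev/null; then
        echo "No active ResourceManager (rm1/rm2)" \
            | mail -s "YARN RM alert" admin@example.com
    fi
}

# Example cron entry, checking every minute:
# * * * * * /path/to/rm_check.sh
# alert_if_no_active   # uncomment when deploying the script
```

The same `find_active` check can also feed Nagios/Zabbix-style monitoring instead of mail, if that is already in place.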
Regards
Arthur

<?xml version="1.0"?>
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.1.1:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>192.168.1.1:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>192.168.1.1:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>192.168.1.1:8030</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/edh/hadoop_data/mapred/nodemanager</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>192.168.1.1:8888</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>18432</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>9216</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>18432</value>
  </property>
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster_rm</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>192.168.1.1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>192.168.1.2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>rm1:2181,m135:2181,m137:2181</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
  </property>
  <!-- RM1 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>192.168.1.1:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>192.168.1.1:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>192.168.1.1:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>192.168.1.1:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>192.168.1.1:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>192.168.1.1:23142</value>
  </property>
  <!-- RM2 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>192.168.1.2:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>192.168.1.2:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>192.168.1.2:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>192.168.1.2:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>192.168.1.2:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>192.168.1.2:23142</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/edh/hadoop_logs/hadoop/</value>
  </property>
</configuration>

On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:

> Hey, Arthur:
>
> Did you use a single-node cluster or a multiple-node cluster? Could you
> share your configuration file (yarn-site.xml)? This looks like a
> configuration issue.
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 9:45 AM, arthur.hk.c...@gmail.com
> <arthur.hk.c...@gmail.com> wrote:
> Hi,
>
> If I have TWO nodes for ResourceManager HA, what should be the correct
> steps and commands to start and stop the ResourceManagers in a
> ResourceManager HA cluster?
> Unlike ./sbin/start-dfs.sh (which can start all NNs from one NN), it seems
> that ./sbin/start-yarn.sh can only start YARN on one node at a time.
>
> Regards
> Arthur
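On the start/stop question in the quoted message (and Q1 above): in this Hadoop release, ./sbin/start-yarn.sh appears to start a ResourceManager only on the node it is run from, so the standby RM has to be started per-daemon. A sketch of the two operations, assuming the standard Hadoop 2.x sbin layout, HADOOP_HOME set on both nodes, and passwordless ssh from rm1 to rm2:

```shell
#!/bin/sh
# Sketch of start/stop for a two-RM HA cluster. Assumptions: standard
# Hadoop 2.x sbin layout, HADOOP_HOME set on both nodes, passwordless
# ssh from rm1 to rm2; hostnames as in the yarn-site.xml above.

start_yarn_ha() {
    # On rm1: starts the local ResourceManager plus the NodeManagers
    "$HADOOP_HOME/sbin/start-yarn.sh"
    # start-yarn.sh does not start the remote standby RM, so start it
    # individually on rm2 (\$HADOOP_HOME expands on the remote side):
    ssh rm2 "\$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager"
}

stop_yarn_ha() {
    # Mirror image: stop the standby first, then the local daemons.
    ssh rm2 "\$HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager"
    "$HADOOP_HOME/sbin/stop-yarn.sh"
}
```

The `yarn-daemon.sh start resourcemanager` command can equally be run directly on rm2 instead of over ssh; the ssh wrapper just lets everything be driven from rm1.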