Hi,

Thank you very much!

At the moment, if I run ./sbin/start-yarn.sh on rm1, the STANDBY ResourceManager on rm2 is not started as well. Please advise what could be wrong?

Thanks
Regards
Arthur

On 12 Aug, 2014, at 1:13 pm, Xuan Gong <[email protected]> wrote:

> Some questions:
> Q1) Do I need to start YARN on EACH master separately? Is this normal? Is there
> a way that I can just run ./sbin/start-yarn.sh on rm1 and get the STANDBY
> ResourceManager on rm2 started as well?
>
> No, you need to start the RMs separately.
>
> Q2) How do I get alerts (e.g. by email) if the ACTIVE ResourceManager is down
> in an auto-failover environment? Or how do you monitor the status of the
> ACTIVE/STANDBY ResourceManagers?
>
> Interesting question. But one of the design goals of auto-failover is that RM
> downtime is invisible to end users. End users can submit applications normally
> even if a failover happens.
>
> We can monitor the status of the RMs from the command line (as you did
> previously) or from the web UI / web service
> (rm_address:portnumber/cluster/cluster). We can get the current status from
> there.
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 5:12 PM, [email protected]
> <[email protected]> wrote:
> Hi,
>
> It is a multiple-node cluster with two master nodes (rm1 and rm2); my
> yarn-site.xml is below.
>
> At the moment, ResourceManager HA only works if I do both of the following:
>
> 1) On rm1, run ./sbin/start-yarn.sh
>
> yarn rmadmin -getServiceState rm1
> active
>
> yarn rmadmin -getServiceState rm2
> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server:
> rm1/192.168.1.1:23142. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on connection
> exception: java.net.ConnectException: Connection refused; For more details
> see: http://wiki.apache.org/hadoop/ConnectionRefused
>
>
> 2) On rm2, run ./sbin/start-yarn.sh
>
> yarn rmadmin -getServiceState rm1
> standby
>
>
> Some questions:
> Q1) Do I need to start YARN on EACH master separately? Is this normal? Is there
> a way that I can just run ./sbin/start-yarn.sh on rm1 and get the STANDBY
> ResourceManager on rm2 started as well?
>
> Q2) How do I get alerts (e.g. by email) if the ACTIVE ResourceManager is down
> in an auto-failover environment? Or how do you monitor the status of the
> ACTIVE/STANDBY ResourceManagers?
>
> Regards
> Arthur
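For reference, the per-node start sequence that Q1 asks about would look roughly like the sketch below. This is a minimal sketch assuming the stock Hadoop 2.x sbin scripts (yarn-daemon.sh is the per-daemon companion of start-yarn.sh in the same sbin directory); exact paths may differ in your installation.

# On rm1: start-yarn.sh starts the ResourceManager on the local node and the
# NodeManagers listed in the slaves file, but not the ResourceManager on rm2.
./sbin/start-yarn.sh

# On rm2: start only the local ResourceManager daemon.
./sbin/yarn-daemon.sh start resourcemanager

# From either node: check which ResourceManager is ACTIVE and which is STANDBY.
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

The matching stop commands would be ./sbin/stop-yarn.sh on rm1 and ./sbin/yarn-daemon.sh stop resourcemanager on rm2.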
>
> <?xml version="1.0"?>
> <configuration>
>
> <!-- Site specific YARN configuration properties -->
>
> <property>
>   <name>yarn.nodemanager.aux-services</name>
>   <value>mapreduce_shuffle</value>
> </property>
>
> <property>
>   <name>yarn.resourcemanager.address</name>
>   <value>192.168.1.1:8032</value>
> </property>
>
> <property>
>   <name>yarn.resourcemanager.resource-tracker.address</name>
>   <value>192.168.1.1:8031</value>
> </property>
>
> <property>
>   <name>yarn.resourcemanager.admin.address</name>
>   <value>192.168.1.1:8033</value>
> </property>
>
> <property>
>   <name>yarn.resourcemanager.scheduler.address</name>
>   <value>192.168.1.1:8030</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>/edh/hadoop_data/mapred/nodemanager</value>
>   <final>true</final>
> </property>
>
> <property>
>   <name>yarn.web-proxy.address</name>
>   <value>192.168.1.1:8888</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>18432</value>
> </property>
>
> <property>
>   <name>yarn.scheduler.minimum-allocation-mb</name>
>   <value>9216</value>
> </property>
>
> <property>
>   <name>yarn.scheduler.maximum-allocation-mb</name>
>   <value>18432</value>
> </property>
>
> <property>
>   <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>   <value>2000</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.ha.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.cluster-id</name>
>   <value>cluster_rm</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.ha.rm-ids</name>
>   <value>rm1,rm2</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.hostname.rm1</name>
>   <value>192.168.1.1</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.hostname.rm2</name>
>   <value>192.168.1.2</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.scheduler.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.store.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.zk-address</name>
>   <value>rm1:2181,m135:2181,m137:2181</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>   <value>5000</value>
> </property>
>
> <!-- RM1 configs -->
> <property>
>   <name>yarn.resourcemanager.address.rm1</name>
>   <value>192.168.1.1:23140</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.scheduler.address.rm1</name>
>   <value>192.168.1.1:23130</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>   <value>192.168.1.1:23189</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.webapp.address.rm1</name>
>   <value>192.168.1.1:23188</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>   <value>192.168.1.1:23125</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.admin.address.rm1</name>
>   <value>192.168.1.1:23142</value>
> </property>
>
> <!-- RM2 configs -->
> <property>
>   <name>yarn.resourcemanager.address.rm2</name>
>   <value>192.168.1.2:23140</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.scheduler.address.rm2</name>
>   <value>192.168.1.2:23130</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>   <value>192.168.1.2:23189</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.webapp.address.rm2</name>
>   <value>192.168.1.2:23188</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>   <value>192.168.1.2:23125</value>
> </property>
> <property>
>   <name>yarn.resourcemanager.admin.address.rm2</name>
>   <value>192.168.1.2:23142</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.remote-app-log-dir</name>
>   <value>/edh/hadoop_logs/hadoop/</value>
> </property>
>
> </configuration>
>
>
> On 12 Aug, 2014, at 1:49 am, Xuan Gong <[email protected]> wrote:
>
>> Hey, Arthur:
>>
>> Did you use a single-node cluster or a multi-node cluster? Could you
>> share your configuration file (yarn-site.xml)? This looks like a
>> configuration issue.
>>
>> Thanks
>>
>> Xuan Gong
>>
>>
>> On Mon, Aug 11, 2014 at 9:45 AM, [email protected]
>> <[email protected]> wrote:
>> Hi,
>>
>> If I have TWO nodes for ResourceManager HA, what are the correct steps
>> and commands to start and stop the ResourceManagers in a ResourceManager HA
>> cluster?
>> Unlike ./sbin/start-dfs.sh (which can start all NNs from one NN), it seems
>> that ./sbin/start-yarn.sh can only start YARN on one node at a time.
>>
>> Regards
>> Arthur
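On Q2 (getting an alert when the ACTIVE ResourceManager goes down): the thread's answer is to poll the state from the command line or from the web UI / web service (the /cluster/cluster page of whichever webapp address is configured, e.g. 192.168.1.1:23188 above). A minimal cron-able sketch along those lines, built only on the yarn rmadmin -getServiceState command shown earlier; the mail command and the recipient address are placeholders for whatever alerting mechanism your environment actually uses:

#!/bin/sh
# Hypothetical watchdog: send a mail if neither rm1 nor rm2 reports "active".
# Assumes the yarn CLI is on PATH; "mail" and the recipient are placeholders.
active=""
for id in rm1 rm2; do
  if yarn rmadmin -getServiceState "$id" 2>/dev/null | grep -qx active; then
    active="$id"
  fi
done

if [ -z "$active" ]; then
  echo "No ACTIVE ResourceManager found (checked rm1, rm2) at $(date)" \
    | mail -s "YARN ResourceManager HA alert" [email protected]
fi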
