Hi, I am trying to set up NameNode HA using blueprints. During cluster creation through scripts, the following steps run and complete:
1) The JournalNodes start and are initialized (the JournalNodes are formatted).
2) The HA state is initialized in ZooKeeper (ZKFC), on both the active and the standby NameNode.

After 96% it fails. I logged into the cluster through the UI and restarted the standby NameNode, but it threw an exception saying that the NameNode is not formatted. I have to manually copy the fsimage logs over by running this command on the standby NN server:

hdfs namenode -bootstrapStandby -force

After that, restarting the NameNode works fine and it goes into standby mode. Is there something I am missing in the configuration?

My NameNode HA blueprint looks like this.

hadoop-env{
  "dfs_ha_initial_namenode_active": "%HOSTGROUP::host_group_master_1%"
  "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2"
}

hadoop-ev{
  "dfs_ha_initial_namenode_active": "%HOSTGROUP::host_group_master_1%"
  "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2"
}

hdfs-site{
  "dfs.client.failover.proxy.provider.dfs-nameservices": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
  "dfs.ha.automatic-failover.enabled": "true",
  "dfs.ha.fencing.methods": "shell(/bin/true)",
  "dfs.ha.namenodes.dfs-nameservices": "nn1,nn2",
  "dfs.namenode.http-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50070",
  "dfs.namenode.http-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50070",
  "dfs.namenode.https-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50470",
  "dfs.namenode.https-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50470",
  "dfs.namenode.rpc-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:8020",
  "dfs.namenode.rpc-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:8020",
  "dfs.namenode.shared.edits.dir": "qjournal://%HOSTGROUP::host_group_master_1%:8485;%HOSTGROUP::host_group_master_2%:8485;%HOSTGROUP::host_group_master_3%:8485/dfs-nameservices",
  "dfs.nameservices": "dfs-nameservices"
}

core-site{
  "fs.defaultFS": "hdfs://dfs-nameservices",
  "ha.zookeeper.quorum": "%HOSTGROUP::host_group_master_1%:2181,%HOSTGROUP::host_group_master_2%:2181,%HOSTGROUP::host_group_master_3%:2181"
}

This is the log output from the standby NameNode server.

2015-08-25 08:26:26,373 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.2.6.0-2800/hadoop
2015-08-25 08:26:26,380 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=usw2ha2dpma01.local:2181,usw2ha2dpma02.local:2181,usw2ha2dpma03.local:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5b7a5baa
2015-08-25 08:26:26,399 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server usw2ha2dpma02.local/172.17.213.51:2181. Will not attempt to authenticate using SASL (unknown error)
2015-08-25 08:26:26,405 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to usw2ha2dpma02.local/172.17.213.51:2181, initiating session
2015-08-25 08:26:26,413 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server usw2ha2dpma02.local/172.17.213.51:2181, sessionid = 0x24f63f6f3050001, negotiated timeout = 5000
2015-08-25 08:26:26,416 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
2015-08-25 08:26:26,441 INFO ipc.CallQueueManager (CallQueueManager.java:<init>(53)) - Using callQueue class java.util.concurrent.LinkedBlockingQueue
2015-08-25 08:26:26,472 INFO ipc.Server (Server.java:run(605)) - Starting Socket Reader #1 for port 8019
2015-08-25 08:26:26,520 INFO ipc.Server (Server.java:run(827)) - IPC Server Responder: starting
2015-08-25 08:26:26,526 INFO ipc.Server (Server.java:run(674)) - IPC Server listener on 8019: starting
2015-08-25 08:26:27,596 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:27,615 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-08-25 08:26:27,616 INFO ha.HealthMonitor (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
2015-08-25 08:26:27,616 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode at usw2ha2dpma02.local/172.17.213.51:8020 entered state: SERVICE_NOT_RESPONDING
2015-08-25 08:26:27,616 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(766)) - Quitting master election for NameNode at usw2ha2dpma02.local/172.17.213.51:8020 and marking that fencing is necessary
2015-08-25 08:26:27,617 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
2015-08-25 08:26:27,621 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2015-08-25 08:26:27,621 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x24f63f6f3050001 closed
2015-08-25 08:26:29,623 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:29,624 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-08-25 08:26:31,626 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:31,627 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2015-08-25 08:26:33,629 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-08-25 08:26:33,630 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to