It looks like that's tied to the ipc.client.connect.* properties. You can adjust retries & timeout values to something shorter and see if that works for you.
Offhand, I'm not certain whether that will affect other services besides HDFS.

-Ray

On Wed, Dec 3, 2014 at 2:51 AM, mail list <louis.hust...@gmail.com> wrote:
> hadoop-2.3.0-cdh5.1.0
>
> Hi, I moved the QJM from l-hbase1.dba.dev.cn0 to another machine, and the
> downtime dropped to about 5 minutes. The log on l-hbase2.dba.dev.cn0 looks
> like this:
>
> {log}
> 2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 197 edits starting from txid 6599
> 2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 6797
> 2014-12-03 15:55:51,313 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 6797
> 2014-12-03 15:55:51,373 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 9
> 2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
> 2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning because of pending operations
> 2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
> 2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The configured checkpoint interval is 0 minutes. Using an interval of 1440 minutes that is used for deletion instead
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 179
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 4
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 386 msec
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
> 2014-12-03 15:56:21,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
> 2014-12-03 15:56:21,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
> 2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
> 2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
> 2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
> 2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
> 2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
> 2014-12-03 15:58:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
> 2014-12-03 15:58:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
> 2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:59:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
> 2014-12-03 15:59:51,388 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: caught retry for allocation of a new block in /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483. Returning previously allocated block blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[]}
> {log}
>
> It seems that from 15:55:51 to 16:00:14 the only activity is from
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor.
> What is Hadoop doing during that time, and how can I reduce it? 5 minutes
> is too long!
>
> On Dec 3, 2014, at 16:31, Harsh J <ha...@cloudera.com> wrote:
>
> > What is your Hadoop version?
> >
> > On Wed, Dec 3, 2014 at 12:55 PM, mail list <louis.hust...@gmail.com> wrote:
> >> Hi all,
> >>
> >> Attaching the log again!
> >>
> >> The failover happened at about 2014-12-03 12:01:
> >>
> >> On Dec 3, 2014, at 14:55, mail list <louis.hust...@gmail.com> wrote:
> >>
> >>> Sorry, I forgot to attach the log. The failover happened at about 2014-12-03 12:01:
> >>>
> >>> <hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log.tar.gz>
> >>>
> >>> On Dec 3, 2014, at 14:48, mail list <louis.hust...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I deployed Hadoop on 3 machines:
> >>>>
> >>>> l-hbase1.dba.dev.cn0 (namenode active and QJM)
> >>>> l-hbase2.dba.dev.cn0 (namenode standby, datanode and QJM)
> >>>> l-hbase3.dba.dev.cn0 (datanode and QJM)
> >>>>
> >>>> On top of Hadoop, I deployed HBase:
> >>>>
> >>>> l-hbase1.dba.dev.cn0 (HMaster active)
> >>>> l-hbase2.dba.dev.cn0 (HMaster standby)
> >>>> l-hbase3.dba.dev.cn0 (RegionServer)
> >>>>
> >>>> I wrote a program that puts one row into HBase every second in a loop.
> >>>> Then I used iptables to simulate l-hbase1.dba.dev.cn0 going offline;
> >>>> after that, the program hung and could not write to HBase. After about
> >>>> 15 minutes, the program could write again.
> >>>>
> >>>> 15 minutes for the HA failover is too long for me, and I have no idea
> >>>> about the reason.
> >>>>
> >>>> Then I checked the l-hbase2.dba.dev.cn0 namenode logs and found many
> >>>> retries like the one below:
> >>>>
> >>>> {code}
> >>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> >>>> {code}
> >>>>
> >>>> I have the QJM on l-hbase1.dba.dev.cn0, does it matter?
> >>>>
> >>>> I am a newbie; any idea will be appreciated!
> >
> > --
> > Harsh J
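As a back-of-the-envelope check on the retry policy shown in that {code} block, here is a rough model of its worst-case wait. This is only a sketch: the 20 s per-attempt connect timeout is an assumed ipc.client.connect.timeout default, and the real client behaves differently depending on whether the peer actively refuses connections or (as with iptables DROP) silently discards packets.

```python
# Rough worst-case wait for the policy seen in the namenode log:
# RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS).
def worst_case_wait_secs(max_retries=10, sleep_secs=1.0, connect_timeout_secs=20.0):
    """Each failed attempt burns up to one connect timeout, then sleeps before retrying."""
    return max_retries * (connect_timeout_secs + sleep_secs)

print(worst_case_wait_secs())  # 210.0 seconds, i.e. ~3.5 minutes per unreachable peer
```

Shrinking either the retry count or the per-attempt timeout brings this down quickly, e.g. `worst_case_wait_secs(max_retries=3, connect_timeout_secs=10.0)` gives 33 seconds.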