Here is the snippet:
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recover RBW replica
BP-696828882-10.168.7.226-1364886167971:blk_40107897639761277_174072
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recovering ReplicaBeingWritten, blk_40107897639761277_174072, RBW
2013-04-19 00:28:11,471 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at
ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
targets=[*10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010*],
newGenerationStamp=174413)
2013-04-19 00:41:20,716 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174413,
replica=ReplicaBeingWritten, blk_-5723958680970112840_174056, RBW
2013-04-19 00:41:20,716 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: changing replica state for
blk_-5723958680970112840_174056 from RBW to RUR
2013-04-19 00:41:20,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
updateReplica:
BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
recoveryId=174413, length=119148648, replica=ReplicaUnderRecovery,
blk_-5723958680970112840_174056, RUR
2013-04-19 00:41:20,745 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED:
RecoveringBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
getBlockSize()=0; corrupt=false; offset=-1; locs=[10.156.194.94:50010,
10.156.192.106:50010, 10.156.195.38:50010]}
2013-04-19 00:41:23,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174418,
replica=FinalizedReplica, blk_-5723958680970112840_174413, FINALIZED
2013-04-19 00:41:23,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: changing replica state for
blk_-5723958680970112840_174056 from FINALIZED to RUR
2013-04-19 00:41:23,736 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
updateReplica:
BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174413,
recoveryId=174418, length=119148648, replica=ReplicaUnderRecovery,
blk_-5723958680970112840_174413, RUR

Block recovery takes a long time and eventually seems to fail during the recoverBlock() call; all three datanodes are listed as recovery targets, including the failed/stale one.

On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <[email protected]> wrote:

> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>
> Here is the criteria for stale node checking to kick in (from
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
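(For reference, the stale-node check Ted quotes maps to a handful of NameNode-side keys from HDFS-3703/HDFS-3912. A minimal sketch follows, using the Hadoop Configuration API purely to show the key names; the 20 second interval mirrors our setting, and the 30 second default is what I believe the stock value to be, so treat the numbers as assumptions rather than verified CDH 4.2 defaults.)

    import org.apache.hadoop.conf.Configuration;

    public class StaleNodeSettingsSketch {
      public static void main(String[] args) {
        // These normally live in hdfs-site.xml on the NameNode; the
        // Configuration API is used here only to illustrate the key names.
        Configuration conf = new Configuration();

        // HDFS-3703: put stale datanodes last in the replica list handed to readers.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);

        // HDFS-3912: also avoid stale datanodes when choosing write targets.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);

        // How long without a heartbeat before a datanode is treated as stale
        // (20s matches the setting described in this thread).
        conf.setLong("dfs.namenode.stale.datanode.interval", 20000L);

        System.out.println("stale interval = "
            + conf.getLong("dfs.namenode.stale.datanode.interval", 30000L) + " ms");
      }
    }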
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <[email protected]> wrote:
>
> > Is there a place to upload these logs ?
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <[email protected]> wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode, dn logs (of one of the healthy replicas of the WAL block) and the rs logs which got stuck doing the log split. Action begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. It's very interesting to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ? Also does the stale node stuff kick in for that state ?
> > >
> > > Thanks
> > > Varun
> > >
> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <[email protected]> wrote:
> > >
> > > > Thanks for the detailed scenario and analysis. I'm going to have a look. I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com timeouts), could you please send them directly to me?
> > > >
> > > > Thanks,
> > > >
> > > > Nicolas
> > > >
> > > > On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <[email protected]> wrote:
> > > >
> > > > > Hi Nicholas,
> > > > >
> > > > > Here is the failure scenario, I have dug up the logs.
> > > > >
> > > > > A machine fails and stops accepting/transmitting traffic. The HMaster starts the distributed split for 13 tasks. There are 12 region servers. 12 tasks succeed but the 13th one takes a looong time.
> > > > >
> > > > > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds. Both patches are there.
> > > > >
> > > > > a) Machine fails around 27:30
> > > > >
> > > > > b) Master starts the split around 27:40 and submits the tasks. The one task which fails seems to be the one which contains the WAL being currently written to:
> > > > >
> > > > > 2013-04-19 00:27:44,325 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141, length=0
> > > > >
> > > > > Basically this region server picks up the task but finds the length of this file to be 0 and drops it. This happens again.
> > > > >
> > > > > c) Finally another region server picks up the task but it ends up going to the bad datanode (which should not happen because of the stale node timeout). Unfortunately it hits the 45 retries and a connect timeout of 20 seconds every time. This delays recovery significantly. Now I guess reducing # of retries to 1 is one possibility. But then the namenode logs are very interesting.
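(The "45 retries and a connect timeout of 20 seconds" in (c) looks like the stock Hadoop IPC client behaviour for the recovery call against the dead node; if that reading is right, these are the knobs involved. A sketch only: key names are from the common IPC configuration, the lowered values are illustrative, and nothing here has been verified against CDH 4.2.)

    import org.apache.hadoop.conf.Configuration;

    public class IpcRetrySketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Believed stock defaults: 45 retries x 20s connect timeout, which
        // would account for the long stall described above when the target
        // node is unreachable.
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 1);
        conf.setInt("ipc.client.connect.timeout", 5000); // milliseconds

        // Retries on plain connection failures are a separate knob.
        conf.setInt("ipc.client.connect.max.retries", 1);

        System.out.println("connect retries on timeout = "
            + conf.getInt("ipc.client.connect.max.retries.on.timeouts", 45));
      }
    }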
" + > > >> > "RecoveryId = " + blockRecoveryId + " for block " + > > >> > lastBlock); > > >> > break; > > >> > > > >> > Eventually for this block, we get > > >> > > > >> > 2013-04-19 00:41:20,736 INFO > > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > > >> > > > >> > > > >> > > > commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_- > > >> > *5723958680970112840_174056*, newgenerationstamp=174413, > > >> > newlength=119148648, newtargets=[10.156.192.106:50010, > > >> 10.156.195.38:50010 > > >> > ], > > >> > closeFile=true, deleteBlock=false) > > >> > 2013-04-19 00:41:20,736 ERROR > > >> > org.apache.hadoop.security.UserGroupInformation: > > >> PriviledgedActionException > > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block > > >> > > > >> > > > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056) > > >> > not found > > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC > Server > > >> > handler 35 on 8020, call > > >> > > > >> > > > >> > > > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization > > >> > from 10.156.192.106:53271: error: java.io.IOException: Block > > >> > > > >> > > > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056) > > >> > not found > > >> > > > >> > On the datanode side, i see a call for recover blocks - I see that a > > >> write > > >> > pipeline is there, which gets terminated with some socket > timeouts... > > >> > > > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > > >> NameNode > > >> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls > > >> > > > >> > > > >> > > > recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056, > > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010, > > 10.156.195.38:50010 > > >> ], > > >> > newGenerationStamp=174413) > > >> > > > >> > Not sure but this looks like a case where data could be lost ? > > >> > > > >> > Varun > > >> > > > >> > > > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon < > [email protected]> > > >> > wrote: > > >> > > > >> > > Hey Varun, > > >> > > > > >> > > Could you please share the logs and the configuration (hdfs / > hbase > > >> > > settings + cluster description). What's the failure scenario? > > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status. > > But > > >> > these > > >> > > node will be given the lowest priority when reading. > > >> > > > > >> > > > > >> > > Cheers, > > >> > > > > >> > > Nicolas > > >> > > > > >> > > > > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma < > [email protected]> > > >> > wrote: > > >> > > > > >> > > > Hi, > > >> > > > > > >> > > > We are facing problems with really slow HBase region server > > >> recoveries > > >> > ~ > > >> > > 20 > > >> > > > minuted. Version is hbase 0.94.3 compiled with > hadoop.profile=2.0. > > >> > > > > > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched > and > > >> > stale > > >> > > > node timeouts configured correctly. Time for dead node detection > > is > > >> > still > > >> > > > 10 minutes. > > >> > > > > > >> > > > We see that our region server is trying to read an HLog is stuck > > >> there > > >> > > for > > >> > > > a long time. 
> > > > > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <[email protected]> wrote:
> > > > >
> > > > > > Hey Varun,
> > > > > >
> > > > > > Could you please share the logs and the configuration (hdfs / hbase settings + cluster description). What's the failure scenario?
> > > > > > From an HDFS pov, HDFS 3703 does not change the dead node status. But these nodes will be given the lowest priority when reading.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Nicolas
> > > > > >
> > > > > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We are facing problems with really slow HBase region server recoveries, ~ 20 minutes. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
> > > > > > >
> > > > > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and stale node timeouts configured correctly. Time for dead node detection is still 10 minutes.
> > > > > > >
> > > > > > > We see that our region server trying to read an HLog is stuck there for a long time. Logs here:
> > > > > > >
> > > > > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.156.194.251:50010 for file /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599 for block BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818 remote=/10.156.194.251:50010]
> > > > > > >
> > > > > > > I would think that HDFS 3703 would make the server fail fast and go to the third datanode. Currently, the recovery seems way too slow for production usage...
> > > > > > >
> > > > > > > Varun
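(One note on that DFSClient line: the 15000 millis read timeout is, as far as I know, the client-side dfs.client.socket-timeout, which HDFS-3703 does not shorten; the stale-node patch only changes the order in which the NameNode returns replicas, as Nicolas says above. Sketch below under that assumption; key name as I understand it from the Hadoop 2 DFSConfigKeys, values illustrative.)

    import org.apache.hadoop.conf.Configuration;

    public class DfsClientReadTimeoutSketch {
      public static void main(String[] args) {
        // Client-side (region server) configuration; in practice this goes in
        // the hdfs-site.xml seen by the HBase process.
        Configuration conf = new Configuration();

        // Per-datanode read timeout; lowering it makes each attempt against a
        // dead replica cheaper before the client moves to the next location.
        conf.setInt("dfs.client.socket-timeout", 10000);

        System.out.println("read timeout = "
            + conf.getInt("dfs.client.socket-timeout", 60000) + " ms");
      }
    }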
