[ https://issues.apache.org/jira/browse/HDFS-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved HDFS-1238.
------------------------------------
    Resolution: Incomplete

Stale.

> A block is stuck in ongoingRecovery due to exception not propagated
> --------------------------------------------------------------------
>
>                 Key: HDFS-1238
>                 URL: https://issues.apache.org/jira/browse/HDFS-1238
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> - Setup:
> + # datanodes = 2
> + replication factor = 2
> + failure type = transient (i.e., a Java I/O call that throws an IOException
>   or returns false)
> + # failures = 2
> + When/where the failures happen: (this is a subtle bug) the first failure is
>   a transient failure at a datanode during the second phase. Because of the
>   first failure, the DFSClient calls recoverBlock. The second failure is
>   injected during this recoverBlock process (i.e., another failure during the
>   recovery process).
>
> - Details:
>
> The expectation here is that since the DFSClient performs many retries, two
> transient failures should be masked properly by those retries. We found one
> case where the failures are not transparent to the user.
>
> Here are the stack traces of when/where the two failures happen (please
> ignore the line numbers).
>
> 1. The first failure:
> Exception is thrown at
> call(void java.io.DataOutputStream.flush())
> SourceLoc: org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java(252)
> Stack Trace:
> [0] datanode.BlockReceiver (flush:252)
> [1] datanode.BlockReceiver (receivePacket:660)
> [2] datanode.BlockReceiver (receiveBlock:743)
> [3] datanode.DataXceiver (writeBlock:468)
> [4] datanode.DataXceiver (run:119)
>
> 2. The second failure:
> False is returned at
> call(boolean java.io.File.renameTo(File))
> SourceLoc: org/apache/hadoop/hdfs/server/datanode/FSDataset.java(105)
> Stack Trace:
> [0] datanode.FSDataset (tryUpdateBlock:1008)
> [1] datanode.FSDataset (updateBlock:859)
> [2] datanode.DataNode (updateBlock:1780)
> [3] datanode.DataNode (syncBlock:2032)
> [4] datanode.DataNode (recoverBlock:1962)
> [5] datanode.DataNode (recoverBlock:2101)
>
> This is what we found out:
> The first failure causes the DFSClient to call recoverBlock, which exposes
> the second failure. The second failure makes renameTo return false, which
> then causes an IOException to be thrown from the function that calls
> renameTo. But this IOException is not propagated properly! It is dropped
> inside DN.syncBlock(). Specifically, DN.syncBlock() calls DN.updateBlock(),
> which gets the exception, but syncBlock only catches it and prints a warning
> without propagating the exception. Thus syncBlock returns without any
> exception, and recoverBlock returns without executing the finally{} block
> (see below).
>
> Now, the client retries recoverBlock 3-5 more times, but these retries
> always see exceptions! The reason is that the first time we call
> recoverBlock(blk), this blk is put into an ongoingRecovery list inside
> DN.recoverBlock(). Normally, blk is only removed
> (ongoingRecovery.remove(blk)) inside the finally{} block. But since the
> exception is not propagated properly, this finally{} block is never
> executed, so the blk is stuck forever inside the ongoingRecovery list.
> Hence, the next time the client performs a retry, it gets the error message
> "Block ... is already being recovered" and recoverBlock() throws an
> IOException.
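To make the stuck-entry mechanism concrete, here is a minimal, self-contained
Java sketch of the pattern the report describes. It is NOT the actual Hadoop
source: the Block placeholder, the simplified method signatures, and the
console logging are assumptions for illustration only; the method names follow
the report. The report's claim that the finally{} cleanup never runs is
modeled here simply as cleanup code that is never reached.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the failure pattern described above; not the actual
// Hadoop source. Block, the signatures, and the logging are assumptions.
public class RecoveryStuckSketch {

    static class Block {
        final long id;
        Block(long id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Block && ((Block) o).id == id;
        }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public String toString() { return "blk_" + id; }
    }

    private final Map<Block, Block> ongoingRecovery = new HashMap<>();

    // Stands in for DN.updateBlock(): the injected renameTo failure
    // surfaces here as an IOException.
    private void updateBlock(Block b) throws IOException {
        throw new IOException("Cannot update " + b + ": renameTo returned false");
    }

    // Stands in for DN.syncBlock(): the bug is that the IOException from
    // updateBlock is caught and logged but never rethrown, so the caller
    // sees a normal return.
    private void syncBlock(Block b) {
        try {
            updateBlock(b);
        } catch (IOException e) {
            // BUG: the exception is swallowed; only a warning is printed.
            System.err.println("WARN: failed to update " + b + ": " + e);
        }
    }

    // Stands in for DN.recoverBlock(): the block is registered in
    // ongoingRecovery on entry. Per the report, the cleanup
    // (ongoingRecovery.remove(blk) in a finally{} block) never runs on this
    // path; the sketch models that by omitting it, so the entry lingers and
    // every retry hits the "already being recovered" guard.
    public void recoverBlock(Block b) throws IOException {
        synchronized (ongoingRecovery) {
            if (ongoingRecovery.containsKey(b)) {
                throw new IOException("Block " + b + " is already being recovered");
            }
            ongoingRecovery.put(b, b);
        }
        syncBlock(b); // returns "successfully" even though recovery failed
    }

    public static void main(String[] args) {
        RecoveryStuckSketch dn = new RecoveryStuckSketch();
        Block b = new Block(42);
        try {
            dn.recoverBlock(b);    // first attempt: failure silently swallowed
        } catch (IOException e) {
            System.err.println(e);
        }
        try {
            dn.recoverBlock(b);    // retry: trips over the stale entry
        } catch (IOException e) {
            System.err.println(e); // Block blk_42 is already being recovered
        }
    }
}

Running main prints the swallowed WARN on the first attempt and then "Block
blk_42 is already being recovered" on the retry, mirroring the retry loop the
report describes.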
> As a result, the client, which invokes this whole process in the context of
> processDatanodeError, returns from that function with closed = true; hence
> it never retries the whole operation from the beginning and instead just
> returns an error.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
> Haryadi Gunawi (hary...@eecs.berkeley.edu)



--
This message was sent by Atlassian JIRA
(v6.2#6252)