Yes, this is a known issue. The HDFS part of it was addressed in https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not available in the 1.x releases. I think HBase does not use this API yet.
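For context, the API in question gives the client an explicit durability barrier, roughly what fsync provides for a local file: data must reach the disk, not just the OS page cache, before the call returns. A minimal plain-Java sketch of that distinction on a local filesystem (this is only an analogy for the semantics, not HDFS client code; the file name is arbitrary):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncSketch {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("durability", ".dat");
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("block data".getBytes(StandardCharsets.UTF_8)));
            // Up to this point the bytes may live only in the OS page cache;
            // a power failure here can roll the file back to an older state.
            // force(true) is the fsync barrier: it returns only once data
            // and metadata have been pushed to the storage device.
            ch.force(true);
        }
        System.out.println(Files.size(p)); // prints 10
        Files.delete(p);
    }
}
```

Without such a barrier between close/rename and the crash, whether the blocks survive a power failure is up to the page cache and the local filesystem, which is consistent with the ext3 behavior described below.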
On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <[email protected]> wrote:
> We're running HBase over HDFS 1.0.2 on about 1000 nodes. On Saturday the data center we were in had a total power failure and the cluster went down hard. When we brought it back up, HDFS reported 4 files as CORRUPT. We recovered the data in question from our secondary datacenter, but I'm trying to understand what happened and whether this is a bug in HDFS that should be fixed.
>
> From what I can tell, the file was created and closed by the DFS client (HBase). Then HBase renamed it into a new directory and deleted some other files containing the same data. Then the cluster lost power. After the cluster was restarted, the datanodes reported in to the namenode, but the blocks for this file appeared as "blocks being written" - the namenode rejected them and the datanodes deleted the blocks. At this point there were no replicas for the blocks and the files were marked CORRUPT. The underlying file systems are ext3. Some questions that I would love to get answers to, if anyone with a deeper understanding of HDFS can chime in:
>
> - Is this a known scenario where data loss is expected? (I found HDFS-1539, but that seems different.)
> - When are blocks moved from blocksBeingWritten to current? Does that happen before a file close operation is acknowledged to an HDFS client?
> - Could it be that the DataNodes actually moved the blocks to current, but after the restart ext3 rewound state somehow (forgive my ignorance of underlying file system behavior)?
> - Is there any other explanation for how this can happen?
>
> Here is a sequence of selected relevant log lines from the RS (HBase Region Server), NN (NameNode), and DN (DataNode - 1 example of the 3 in question). It includes everything that mentions the block in question in the NameNode and one DataNode log. Please let me know if there is more information that would be helpful.
>
> RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
> NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. blk_1395839728632046111_357084589
> DN 2013-06-29 11:16:06,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /10.0.5.237:50010
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to blk_1395839728632046111_357084589 size 25418340
> DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
> DN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_1395839728632046111_357084589 terminating
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c from client DFSClient_hb_rs_hs745,60020,1372470111932
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c to hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
> RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7 file(s) in n of users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
>
> ------- CRASH, RESTART ---------
>
> NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.1:50010
> NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.5.237:50010
> NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet of 10.0.6.24:50010
>
> (Note: the clock on the server running the DN is wrong after the restart. I believe the timestamps are off by 6 hours:)
>
> DN 2013-06-29 06:07:22,877 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_1395839728632046111_357084589 file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for deletion
> DN 2013-06-29 06:07:24,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_1395839728632046111_357084589 at file /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>
> Thanks,
> Dave

--
http://hortonworks.com/download/
