Dear all, I have already sent a couple of questions about an issue we have been hitting these past few days; it is killing my team.
We are now running HBase 0.20.6 on top of Hadoop branch-0.20-append. The region server produces tons of warnings like the following:

2011-05-12 12:11:47,349 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.0.2.14:50010 for file /hbase/URLTag/1220923653/Tag/2268498370035875192 for block 2440422069461309270:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/URLTag/1220923653/Tag/2268498370035875192 for block 2440422069461309270

If we then go to the datanode that the log points at, we find the following:

2011-05-12 01:23:55,317 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_2440422069461309270_3925117 file /hadoop/dfs/data/current/subdir19/subdir36/blk_2440422069461309270

So the datanode had just deleted that block, and yet afterwards it tried to serve the same block again:

2011-05-12 01:32:37,444 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.14:50010, storageID=DS-1110633472-127.0.0.1-50010-1274814749004, infoPort=50075, ipcPort=50020):Got exception while serving blk_2440422069461309270_3925117 to /10.0.2.12:
java.io.IOException: Block blk_2440422069461309270_3925117 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
2011-05-12 01:32:37,444 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.14:50010, storageID=DS-1110633472-127.0.0.1-50010-1274814749004, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Block blk_2440422069461309270_3925117 is not valid.
        [same stack trace as above]

And it was still trying to serve the block 12 hours later:

2011-05-12 12:14:40,257 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.14:50010, storageID=DS-1110633472-127.0.0.1-50010-1274814749004, infoPort=50075, ipcPort=50020):Got exception while serving blk_2440422069461309270_3925117 to /10.0.2.12:
java.io.IOException: Block blk_2440422069461309270_3925117 is not valid.
        [same stack trace as above]
2011-05-12 12:14:40,257 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.14:50010, storageID=DS-1110633472-127.0.0.1-50010-1274814749004, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Block blk_2440422069461309270_3925117 is not valid.
        [same stack trace as above]
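In case it helps with diagnosis, the namenode's current view of that file and of where its blocks are supposed to live can be dumped with the standard fsck tool; the path below is the real file from the region server warning above, everything else is the stock command:

    # run from any node with the hadoop client configured
    bin/hadoop fsck /hbase/URLTag/1220923653/Tag/2268498370035875192 -files -blocks -locations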
On the namenode, we found log lines like the following (our block is the one marked with asterisks):

2011-05-12 01:23:54,023 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.0.2.13:50010 to delete blk_-7967494102380684849_3924263
2011-05-12 01:23:54,023 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.0.2.14:50010 to delete *blk_2440422069461309270_3925117* blk_-946722429317009208_3986596
2011-05-12 01:24:06,032 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.0.2.12:50010 to delete blk_232241225919439230_3925136 blk_3644274217045813332_3954711 blk_-5566167900779836043_3946995 blk_7698571737096469832_3922371 blk_6567693665925295269_3925143 blk_-7967494102380684849_3924263 blk_761830835124422304_3954691 blk_22871562856813954_3924967 blk_-108718719454734123_3924500 blk_-5428140231738575121_3924529 *blk_2440422069461309270_3925117* blk_7672999872894289875_3946328 blk_563378129567617626_3924835 blk_9167573765889664081_3924480 blk_-6121243797820261553_3948685
2011-05-12 01:24:06,032 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.0.2.34:50010 to delete blk_8917364772641481342_3924301 *blk_2440422069461309270_3925117* blk_7231518316925385949_3924812 blk_5636355579130421598_3953434 blk_-8159096223500224670_3923072 blk_5958792757745258261_3948355

So the deletion instruction came from the namenode. Today, after we upgraded our Hadoop to branch-0.20-append, everything worked fine for the first couple of hours, but the same problem came back at 3:00, which is the time we have scheduled to run a hadoop balance.
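For context, by "hadoop balance" I mean the stock HDFS balancer kicked off on a schedule; roughly equivalent to a cron entry like this (the install path and threshold here are illustrative, not necessarily exactly what we run):

    # nightly at 3:00 on the namenode; path and threshold illustrative
    0 3 * * * /usr/local/hadoop/bin/start-balancer.sh -threshold 10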
While these errors are being logged, if I run a get on a specific row whose data sits in the affected block, I get an exception like the following:

get 'URLTag', 'http://sports.sina.com.cn/go/2011-01-01/10365389876.shtml'

NativeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.0.2.41:60020 for region URLTag,http://sports.sina.com.cn/go/2010-12-15/09435366726.shtml,1304941869544, row 'http://sports.sina.com.cn/go/2011-01-01/10365389876.shtml', but failed after 7 attempts.
Exceptions:
java.io.IOException: java.io.IOException: java.lang.RuntimeException: java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/URLTag/593921622/Tag/7506733521361817664 for block 3375448240935398820
        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:870)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:860)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1735)
        at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
Caused by: java.lang.RuntimeException: java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/URLTag/593921622/Tag/7506733521361817664 for block 3375448240935398820
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:61)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:79)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:236)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:106)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.nextInternal(HRegion.java:1915)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1879)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2500)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2486)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1733)
        ... 5 more

If I restart the region server that is producing these errors, the errors go away and I can query the row from the HBase shell and get the proper result (the exact commands we use are in the P.S. below).

My questions are:

1. If this is a data-loss issue, shouldn't it already be addressed by the hadoop branch-0.20-append branch?
2. If the problem is in .META. or somewhere else, why does restarting the region server fix it?
3. Should we just upgrade to HBase 0.90.2? Does HBase 0.20.6 have any known data-loss issues with hadoop branch-0.20-append?

I really need your help; I haven't slept for the last two days. Thanks in advance.

Best wishes,
Stanley Xu
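P.S. For completeness on question 2: by "restart the region server" I mean bouncing only that one daemon with the stock HBase scripts, roughly as follows (the $HBASE_HOME layout is an assumption about our install):

    # on the affected region server host
    $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
    $HBASE_HOME/bin/hbase-daemon.sh start regionserver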
