We have a 6-node 0.90.3-cdh3u1 cluster with 8092 regions. I realize we have too many regions and too few nodes; we're addressing that. We currently have an issue where we seem to have lost region data. When data is requested for a couple of our regions, we get errors like the following on the client:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: IOException: 1 time, servers with issues: node13host:60020
…
java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://namenodehost:54310/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568, compression=none, inMemory=false, firstKey=95ac7c7894f86d4455885294582370e30a68fdf1/data:acquireDate/1321151006961/Put, lastKey=95b47d337ff72da0670d0f3803443dd3634681ec/data:text/1323129675986/Put, avgKeyLen=65, avgValueLen=24, entries=6753283, length=667536405, cur=null]
…
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568

On node13host, we see similar exceptions in the region server log:

2011-12-22 02:25:27,509 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /node13host:50010 for file /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568 for block -7065741853936038270:java.io.IOException: Got error in response to OP_READ_BLOCK self=/node13host:37847, remote=/node13host:50010 for file /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568 for block -7065741853936038270_15820239
2011-12-22 02:25:27,511 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /node08host:50010 for file /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568 for block -7065741853936038270:java.io.IOException: Got error in response to OP_READ_BLOCK self=/node13host:44290, remote=/node08host:50010 for file /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568 for block -7065741853936038270_15820239
2011-12-22 02:25:27,512 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /node10host:50010 for file /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568 for block -7065741853936038270:java.io.IOException: Got error in response to OP_READ_BLOCK self=/node13host:52113, remote=/node10host:50010 for file /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568 for block -7065741853936038270_15820239
2011-12-22 02:25:27,513 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-7065741853936038270_15820239 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
2011-12-22 02:25:30,515 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://namenodehost:54310/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568, compression=none, inMemory=false, firstKey=95ac7c7894f86d4455885294582370e30a68fdf1/data:acquireDate/1321151006961/Put, lastKey=95b47d337ff72da0670d0f3803443dd3634681ec/data:text/1323129675986/Put, avgKeyLen=65, avgValueLen=24, entries=6753283, length=667536405, cur=null]
…
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568

The file referenced is indeed not in HDFS. Grepping further back in the logs reveals that the problem has been occurring for over a week (likely longer, but the logs have rolled off).
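For what it's worth, this is roughly how one could double-check that the file really is gone as far as HDFS is concerned; a minimal sketch against the plain Hadoop FileSystem API (the class name is just illustrative, the namenode URI and path are the ones from the stack trace above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative check: does the store file the scanner is asking for still exist in HDFS?
    public class CheckStoreFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Namenode URI taken from the stack trace above.
            conf.set("fs.default.name", "hdfs://namenodehost:54310");
            FileSystem fs = FileSystem.get(conf);

            // The store file both the scanner and the compaction trip over.
            Path storeFile = new Path(
                "/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568");

            System.out.println(storeFile + " exists: " + fs.exists(storeFile));
            fs.close();
        }
    }

Running something like this would just confirm what the FileNotFoundException already says: the file isn't there, even though the region's metadata still references it.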
There are a bunch of files in /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/ (270 of them), and we're unsure why they aren't compacting. Looking further in the logs, I found similar exceptions when a major compaction was attempted, ultimately failing because of:

Caused by: java.io.FileNotFoundException: File does not exist: /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568

Any help on how to recover? hbck did identify some inconsistencies, and we went ahead with a -fix, but the issue remains.
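For reference, a similar sketch (same plain FileSystem API, path from the logs above, hypothetical class name) of how one could inventory what is actually left in that family directory, to see the 270 store files and their sizes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative listing: what store files does the problem region's family directory actually hold?
    public class ListStoreFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenodehost:54310");
            FileSystem fs = FileSystem.get(conf);

            // Column family directory of the problem region, from the logs above.
            Path familyDir = new Path("/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data");

            FileStatus[] files = fs.listStatus(familyDir);
            long totalBytes = 0;
            if (files != null) {
                for (FileStatus f : files) {
                    System.out.println(f.getPath().getName() + "\t" + f.getLen() + " bytes");
                    totalBytes += f.getLen();
                }
                System.out.println(files.length + " store files, " + totalBytes + " bytes total");
            }
            fs.close();
        }
    }

That at least gives a picture of what the region actually has on disk versus the file the scanner and the compaction keep asking for.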
