Dear all,

We were tracing an issue we have with our HBase cluster. We are almost sure it is a network issue, since the problem seems to have disappeared after we disabled ip_forward on all the machines and configured the routes identically. But we don't really know how these settings might affect the cluster.
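For reference, the change we made on each machine was roughly the following (standard Linux sysctl commands; the exact file used to persist the setting may differ by distribution):

```shell
# Check the current setting (1 = forwarding enabled)
sysctl net.ipv4.ip_forward

# Disable IP forwarding for the running kernel
sysctl -w net.ipv4.ip_forward=0

# Persist the setting across reboots (file location may vary by distribution)
echo "net.ipv4.ip_forward = 0" >> /etc/sysctl.conf
sysctl -p
```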
The problem we met is described at the following link: http://search-hadoop.com/m/ZpgJ623GoyU1/.META.+inconsistency&subj=The+META+data+inconsistency+issue (the title does not really match the issue, in fact).

By tracing the logs from the region server, data node, and name node, I also found something doubtful, both before the issue appeared and after we thought it was fixed. In a region server, I could still find logs showing that the RegionServer tried to read a block from a data node that no longer serves that block:

- The region server log for block 5056551999889621449: http://pastebin.com/epEt37JK
- The log in the data node the region server tried to read the block from: http://pastebin.com/pnif75rX
- The log in the name node, which told the data node to delete the block: http://pastebin.com/rQ4QjUcS
- If I use fsck to check the file on HDFS, it shows 4 replicas, including one on the data node that should have deleted the block: http://pastebin.com/2DecD9GD

But if I check that data node's local file system, the block no longer exists there. And after 6-7 hours, when I re-run fsck, the data node that deleted the block no longer appears in the replica list: http://pastebin.com/014h3qNE

I am wondering whether this is correct behavior for Hadoop and HBase. I am using Hadoop branch-0.20-append and HBase 0.20.6.

I am also wondering whether, short of reading all the code, there is a document or tutorial that describes how Hadoop and HBase keep data synchronized in more detail than the HBase book or the official documentation.

Best wishes,
Stanley Xu
