We had some serious issues after the HMaster ran out of space on its root partition. Clients first got "region server not found" errors, which then turned into more general "servers have issues" errors.
I ran the hbck command and found 14 inconsistencies: files in HDFS that were not used by any region, regions with the same start key, a hole in the region chain, and a missing first region with an empty start key. I tried to follow examples from other posts to move files and edit .META., but gave up because it was over my head. I am now trying to truncate the affected tables, but I cannot even do that, because the disable does not seem to work with these issues present.

I assume we will now have to blow away the entire cluster and start from scratch. We are not in production, so we have the luxury of starting again, but the damage to our confidence is severe.

Is there work going on to improve hbck -fix so that it can actually resolve these types of issues? Should we expect that running a production HBase cluster requires being able to move files around and rebuild the region definitions and the .META. table by hand? Can running out of disk space on the master's root partition really bring down the entire cluster? Things got a lot scarier for us very quickly, especially since we were hoping to go into production next month.
