We had some serious issues after the HMaster ran out of space on its root
partition. Clients first got "region server not found" errors, which then
turned into other client errors such as "servers have issues".

I ran the hbck command and found 14 inconsistencies: files in HDFS not used
by any region, regions with the same start key, a hole in the region chain,
and a missing start region with an empty key. I tried to follow examples
from posts to move files and edit .META. but gave up, as it was over my
head. I am now trying to truncate the affected tables, but I cannot even do
that: the disable does not seem to work because of these issues. I assume we
will now have to blow away the entire cluster and start from scratch.

We are not in production, so we have the luxury of starting again, but the
damage to our confidence is severe. Is there work going on to improve hbck
-fix so that it can actually resolve these types of issues? Should we expect
that running a production HBase cluster requires being able to move regions
around and rebuild the region definitions and the .META. table by hand?
Things just got a lot scarier for us, fast, especially since we were hoping
to go into production next month. Can running out of disk space on the
master's root partition really bring down the entire cluster? This is
scary...
