Do you have a snapshot of the state of .META. from the time you noticed it was messed up? And the master log from around the time of the startup post-fs-fillup?

St.Ack
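For reference, one quick way to capture such a snapshot of .META., while the table is still scannable, is a plain client-side scan dumped to a file. The sketch below is only a rough illustration against the 0.90-era Java client API; the class name and output file name are placeholders, not anything taken from this thread.

import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/** Dumps every cell of .META. to a text file so there is a record to diff against later. */
public class MetaSnapshot {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");        // catalog table name in 0.90.x
    PrintWriter out = new PrintWriter("meta-snapshot.txt", "UTF-8");
    try {
      ResultScanner scanner = meta.getScanner(new Scan());
      for (Result row : scanner) {
        for (KeyValue kv : row.raw()) {
          out.println(Bytes.toStringBinary(kv.getRow()) + " "
              + Bytes.toString(kv.getFamily()) + ":"
              + Bytes.toString(kv.getQualifier()) + " ts=" + kv.getTimestamp()
              + " value=" + Bytes.toStringBinary(kv.getValue()));
        }
      }
      scanner.close();
    } finally {
      out.close();
      meta.close();
    }
  }
}

Diffing such a dump against a later one makes it much easier to see exactly which region rows went missing or got mangled.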
On Sat, Jul 2, 2011 at 7:27 PM, Wayne <[email protected]> wrote:
> Like most problems we brought it on ourselves. To me the bigger issue is
> how to get out. Since region definitions are the core of what HBase does,
> it would be great to have a bullet-proof recovery process that we can
> invoke to get us out. Bugs and human error will bring on problems and
> nothing will ever change that, but not having tools to help recover out of
> the hole is where I think it is lacking. HDFS is very stable. The HBase
> .META. table (and -ROOT-?) are the core of how HBase manages things. If
> this gets out of whack, all is lost. I think it would be great to have an
> automatic backup of the meta table and the ability to recover everything
> based on the HDFS data out there and the backup. Something like a recovery
> mode that goes through, sees what is out there, and rebuilds the meta
> based on it. With corrupted data, lost regions, etc., like any relational
> database there should be one or more recovery modes that go through
> everything and rebuild it consistently. Data may be lost, but at least the
> cluster will be left in a 100% consistent/clean state. Manual editing of
> .META. is not something anyone should do (especially me). It is prone to
> human error... it should be easy to have well-tested recovery tools that
> can do the hard work for us.
>
> Below is an attempt at the play-by-play in case it helps. It all started
> with the root partition of the namenode/hmaster filling up due to a table
> export.
>
> When I restarted Hadoop this error was in the namenode log:
> "java.io.IOException: Incorrect data format. logVersion is -18 but
> writables.length is 0"
>
> So I found this thread
> <https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/e35ee876da1a3bbc>,
> which mentioned editing the namenode log files. After verifying that our
> namenode log files seemed to have the same symptom, I copied each namenode
> "name" file to root's home directory and followed their advice.
> That allowed the namenode to start, but then HDFS wouldn't come up. It
> kept hanging in safe mode with the repeated error:
> "The ratio of reported blocks 0.9925 has not reached the threshold 0.9990.
> Safe mode will be turned off automatically."
> So I turned safe mode off with "hadoop dfsadmin -safemode leave" and tried
> to run "hadoop fsck" a few times. It still showed HDFS as "corrupt", so I
> did "hadoop fsck -move", and this is the last part of the output:
>
> ....................................................................................Status: CORRUPT
>  Total size:    1423140871890 B (Total open files size: 668770828 B)
>  Total dirs:    3172
>  Total files:   2584 (Files currently being written: 11)
>  Total blocks (validated):      23095 (avg. block size 61621167 B) (Total open file blocks (not validated): 10)
>   ********************************
>   CORRUPT FILES:        65
>   MISSING BLOCKS:       173
>   MISSING SIZE:         8560948988 B
>   CORRUPT BLOCKS:       173
>   ********************************
>  Minimally replicated blocks:   22922 (99.25092 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     2.9775276
>  Corrupt blocks:                173
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          10
>  Number of racks:               1
>
> I ran it again and got this:
>
> .Status: HEALTHY
>  Total size:    1414579922902 B (Total open files size: 668770828 B)
>  Total dirs:    3272
>  Total files:   2519 (Files currently being written: 11)
>  Total blocks (validated):      22922 (avg. block size 61712761 B) (Total open file blocks (not validated): 10)
>  Minimally replicated blocks:   22922 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          10
>  Number of racks:               1
>
> The filesystem under path '/' is HEALTHY
>
> So I started everything and it seemed to be superficially functional.
>
> I then shut down Hadoop and restarted. Hadoop came up in a matter of a few
> minutes; then HBase took about ten minutes of seeming to copy files
> around, based on the HBase master logs.
>
> After this we saw region-not-found client errors on some tables. I ran
> hbase hbck to look for problems and saw the errors I reported in the
> original post. Add in the Ganglia problems and a botched attempt to edit
> the .META. table, which brought us even further down the rabbit hole. I
> then decided to drop the affected tables, but lo and behold one cannot
> disable a table that has messed-up regions... so I manually deleted the
> data, but some of the .META. table entries were still there. Finally this
> afternoon we reformatted the entire cluster.
>
> Thanks.
>
>
> On Sat, Jul 2, 2011 at 5:25 PM, Stack <[email protected]> wrote:
>
>> On Sat, Jul 2, 2011 at 9:55 AM, Wayne <[email protected]> wrote:
>> > It just returns a ton of errors (import: command not found). Our
>> > cluster is hosed anyway. I am waiting to get it completely re-installed
>> > from scratch. Hope has long since flown out the window. I just changed
>> > my opinion of what it takes to manage HBase. A Java engineer is
>> > required on staff. I also realized now that a backup strategy is more
>> > important than for an RDBMS. Having RF=3 in HDFS offers no insurance
>> > against HBase losing its shirt and .META. getting corrupted. I think I
>> > just found the Achilles heel.
>> >
>>
>> Yeah, stability is primary, but I do not know how you got into the
>> circumstance you find yourself in. All I can offer is to try and do
>> diagnostics, since avoiding hitting this situation again is of utmost
>> importance.
>>
>> St.Ack
>>
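Wayne's suggestion above, a recovery mode that looks at what is actually on HDFS and rebuilds the meta from it, can at least be sketched by hand: in 0.90-era layouts, each region directory under hbase.rootdir carries a serialized HRegionInfo in a .regioninfo file, and those can be read back and re-inserted into .META.. The outline below is a rough, untested sketch against that API, not an existing tool; the class name and the crude "skip anything starting with . or -" filter are assumptions of the sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Writables;

/**
 * Hypothetical .META. rebuild: walk every region directory under
 * hbase.rootdir, read its .regioninfo file, and re-insert the row into .META.
 */
public class RebuildMeta {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path rootDir = new Path(conf.get(HConstants.HBASE_DIR));  // hbase.rootdir
    FileSystem fs = rootDir.getFileSystem(conf);
    HTable meta = new HTable(conf, ".META.");

    for (FileStatus tableDir : fs.listStatus(rootDir)) {
      String name = tableDir.getPath().getName();
      // Skip catalog tables and non-table entries such as .logs / .oldlogs / -ROOT-.
      if (!tableDir.isDir() || name.startsWith(".") || name.startsWith("-")) continue;
      for (FileStatus regionDir : fs.listStatus(tableDir.getPath())) {
        Path regioninfo = new Path(regionDir.getPath(), ".regioninfo");
        if (!fs.exists(regioninfo)) continue;
        FSDataInputStream in = fs.open(regioninfo);
        HRegionInfo hri = new HRegionInfo();
        hri.readFields(in);                                   // HRegionInfo is a Writable in 0.90
        in.close();
        Put p = new Put(hri.getRegionName());
        p.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER,
            Writables.getBytes(hri));
        meta.put(p);
        System.out.println("Re-added " + hri.getRegionNameAsString());
      }
    }
    meta.close();
  }
}

A real recovery tool would also have to clear stale rows, repopulate the server/startcode columns, and cope with split parents and offline regions; this sketch only re-adds the regioninfo column, which is roughly the minimum the master needs before it can reassign the regions.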
