I shorthanded this a bit:

> Certainly a seasoned operations engineer would be a good investment for
> anyone.
Let's try instead:

  Certainly a seasoned operations engineer [with Java experience] would
  be a good investment for anyone [running Hadoop based systems].

I'm not sure what I wrote earlier adequately conveyed the thought.

   - Andy

----- Original Message -----
> From: Andrew Purtell <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Sent: Sunday, July 3, 2011 12:39 AM
> Subject: Re: hbck -fix
>
> Wayne,
>
> Did you by chance have your NameNode configured to write the edit log
> to only one disk, and in this case only the root volume of the NameNode
> host? As I'm sure you are now aware, the NameNode's edit log was
> corrupted, at least the tail of it anyway, when the volume upon which
> it was being written was filled by an errant process. The HDFS NameNode
> has a special critical role and it really must be treated with the
> utmost care. It can and should be configured to write the fsimage and
> edit log to multiple local dedicated disks. And, user processes should
> never run on it.
>
>> Hope has long since flown out the window. I just changed my opinion
>> of what it takes to manage hbase. A Java engineer is required on
>> staff.
>
> Perhaps.
>
> Certainly a seasoned operations engineer would be a good investment for
> anyone.
>
>> Having RF=3 in HDFS offers no insurance against hbase losing its
>> shirt and having .META. get corrupted.
>
> This is a valid point. If HDFS loses track of blocks containing META
> table data due to fsimage corruption on the NameNode, having those
> blocks on 3 DataNodes is of no use.
>
> I've done exercises in the past like deleting META on disk and
> recreating it with the earlier set of utilities (add_table.rb). This
> always "worked for me" when I've tried it.
>
> Results from torture tests that HBase was subjected to in the timeframe
> leading up to 0.90 also resulted in better handling of .META. table
> related errors. They are fortunately now demonstrably rare.
>
> Clearly, however, there is room for further improvement here. I will
> work on https://issues.apache.org/jira/browse/HBASE-4058 and hopefully
> produce a unit test that fully exercises the ability of HBCK to
> reconstitute META and gives reliable results that can be incorporated
> into the test suite. My concern here is that getting repeatable results
> demonstrating HBCK weaknesses will be challenging.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
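(A minimal sketch of the redundant NameNode metadata layout described
above, assuming the Hadoop 0.20.x-era property name; all paths are
illustrative only:)

    <!-- hdfs-site.xml: the NameNode writes the fsimage and edit log to
         every directory listed, so each entry should live on a separate
         dedicated disk; an NFS mount can be added for an off-host
         copy. -->
    <property>
      <name>dfs.name.dir</name>
      <value>/disk1/dfs/name,/disk2/dfs/name,/mnt/nfs/dfs/name</value>
    </property>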
>----- Original Message -----
>> From: Wayne <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Saturday, July 2, 2011 9:55 AM
>> Subject: Re: hbck -fix
>>
>> It just returns a ton of errors (import: command not found). Our
>> cluster is hosed anyway. I am waiting to get it completely
>> re-installed from scratch. Hope has long since flown out the window.
>> I just changed my opinion of what it takes to manage hbase. A Java
>> engineer is required on staff. I also realized now a backup strategy
>> is more important than for an RDBMS. Having RF=3 in HDFS offers no
>> insurance against hbase losing its shirt and having .META. get
>> corrupted. I think I just found the Achilles heel.
>>
>> On Sat, Jul 2, 2011 at 12:40 PM, Ted Yu <[email protected]> wrote:
>>
>>> Have you tried running check_meta.rb with --fix ?
>>>
>>> On Sat, Jul 2, 2011 at 9:19 AM, Wayne <[email protected]> wrote:
>>>
>>>> We are running 0.90.3. We were testing the table export, not
>>>> realizing the data goes to the root drive and not HDFS. The export
>>>> filled the master's root partition. The logger had issues and HDFS
>>>> got corrupted ("java.io.IOException: Incorrect data format.
>>>> logVersion is -18 but writables.length is 0"). We had to run hadoop
>>>> fsck -move to fix the corrupted hdfs files. We were able to get
>>>> hdfs running without issues but hbase ended up with the region
>>>> issues.
>>>>
>>>> We also had another issue making it worse with Ganglia. We had
>>>> moved the Ganglia host to the master server and Ganglia took up so
>>>> many resources that it actually caused timeouts talking to the
>>>> master, and most nodes ended up shutting down. I guess Ganglia is a
>>>> pig in terms of resources...
>>>>
>>>> I just tried to manually edit the .META. table, removing the
>>>> remnants of the old table, but the shell went haywire on me and
>>>> turned to control characters..??... I ended up corrupting the whole
>>>> thing and had to delete all tables... we have just not had a good
>>>> week.
>>>>
>>>> I will add comments to HBASE-3695 in terms of suggestions.
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, Jul 1, 2011 at 4:55 PM, Stack <[email protected]> wrote:
>>>>
>>>>> What version of hbase are you on Wayne?
>>>>>
>>>>> On Fri, Jul 1, 2011 at 8:32 AM, Wayne <[email protected]> wrote:
>>>>>> I ran the hbck command and found 14 inconsistencies. There were
>>>>>> files in hdfs not used for regions
>>>>>
>>>>> These are usually harmless. Bad accounting on our part. Need to
>>>>> plug the hole.
>>>>>
>>>>>> , regions with the same start key, a hole in the region chain,
>>>>>> and a missing start region with an empty key.
>>>>>
>>>>> These are pretty serious.
>>>>>
>>>>> How'd the master running out of root partition do this? I'd be
>>>>> interested to know.
>>>>>
>>>>>> We are not in production so we have the luxury to start again,
>>>>>> but the damage to our confidence is severe. Is there work going
>>>>>> on to improve hbck -fix to actually be able to resolve these
>>>>>> types of issues? Do we need to expect, in order to run a
>>>>>> production hbase cluster, to be able to move around and rebuild
>>>>>> the region definitions and the .META. table by hand? Things just
>>>>>> got a lot scarier fast for us, especially since we were hoping to
>>>>>> go into production next month. Running out of disk space on the
>>>>>> master's root partition can bring down the entire cluster? This
>>>>>> is scary...
>>>>>
>>>>> Understood.
>>>>>
>>>>> St.Ack
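(For reference, the checks and repairs mentioned in this thread as
invoked on an HBase 0.90.x / Hadoop 0.20.x install; without the extra
flag each command is a read-only report, and the repair variants mutate
state, so treat them as a last resort:)

    $ hadoop fsck /      # report HDFS block health, read-only
    $ hbase hbck         # report HBase region consistency, read-only

    # fsck -move quarantines files with missing blocks under
    # /lost+found; hbck -fix attempts to repair the inconsistencies it
    # knows how to, which in 0.90.x is a limited set, as this thread
    # demonstrates.
    $ hadoop fsck / -move
    $ hbase hbck -fix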

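(The "import: command not found" errors Wayne reports are what you
would see if check_meta.rb were handed directly to the shell, which
then trips over the script's Ruby "import" lines. The HBase ruby
utilities are run through the bundled JRuby interpreter instead; a
sketch, assuming the stock script locations under $HBASE_HOME and an
example table directory:)

    $ cd ${HBASE_HOME}
    $ ./bin/hbase org.jruby.Main bin/check_meta.rb --fix

    # add_table.rb, the older META-rebuild utility mentioned above, is
    # invoked the same way, given a table's directory under
    # hbase.rootdir ("/hbase/mytable" is only an example):
    $ ./bin/hbase org.jruby.Main bin/add_table.rb /hbase/mytable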