On Sun, Jul 3, 2011 at 10:12 AM, Wayne <[email protected]> wrote:
> HBase needs to evolve a little more before organizations
> like ours can just "use it" without having to become experts.
I'd agree with this. In its current state, at least a part-time, seasoned
operations engineer (per Andrew's description) is necessary for a substantial
production deploy. I don't think that's an onerous expectation for a critical
piece of infrastructure. It'd certainly broaden our appeal though if we could
get into the mysql calibre of ease-of-use.... That said, the issue you ran
into, where an 'incident' made it so a 'smart' fellow was unable to
reconstitute his store, needs addressing. We'll work on this.

St.Ack

> I have to say the community behind HBase is fantastic and goes above and
> beyond to help greenies like ourselves be successful. With just a little
> more polish around the edges I think it can and will really
> become successful for a much wider audience. Thanks for everyone's help.
>
>
> On Sun, Jul 3, 2011 at 4:08 AM, Andrew Purtell <[email protected]> wrote:
>
>> I shorthanded this a bit:
>>
>> > Certainly a seasoned operations engineer would be a good investment
>> > for anyone.
>>
>> Let's try instead:
>>
>> Certainly a seasoned operations engineer [with Java experience] would be
>> a good investment for anyone [running Hadoop based systems].
>>
>> I'm not sure what I wrote earlier adequately conveyed the thought.
>>
>> - Andy
>>
>> > From: Andrew Purtell <[email protected]>
>> > To: "[email protected]" <[email protected]>
>> > Sent: Sunday, July 3, 2011 12:39 AM
>> > Subject: Re: hbck -fix
>> >
>> > Wayne,
>> >
>> > Did you by chance have your NameNode configured to write the edit log
>> > to only one disk, and in this case only the root volume of the NameNode
>> > host? As I'm sure you are now aware, the NameNode's edit log was
>> > corrupted, at least the tail of it anyway, when the volume upon which
>> > it was being written was filled by an errant process. The HDFS NameNode
>> > has a special critical role and it really must be treated with the
>> > utmost care.
>> > It can and should be configured to write the fsimage and edit log to
>> > multiple local dedicated disks. And, user processes should never run
>> > on it.
>> >
>> >> Hope has long since flown out the window. I just changed my opinion
>> >> of what it takes to manage hbase. A Java engineer is required on
>> >> staff.
>> >
>> > Perhaps.
>> >
>> > Certainly a seasoned operations engineer would be a good investment
>> > for anyone.
>> >
>> >> Having RF=3 in HDFS offers no insurance against hbase losing its
>> >> shirt and having .META. getting corrupted.
>> >
>> > This is a valid point. If HDFS loses track of blocks containing META
>> > table data due to fsimage corruption on the NameNode, having those
>> > blocks on 3 DataNodes is of no use.
>> >
>> > I've done exercises in the past like delete META on disk and recreate
>> > it with the earlier set of utilities (add_table.rb). This always
>> > "worked for me" when I've tried it.
>> >
>> > Results from torture tests that HBase was subjected to in the
>> > timeframe leading up to 0.90 also resulted in better handling of
>> > .META. table related errors. They are fortunately now demonstrably
>> > rare.
>> >
>> > Clearly however there is room for further improvement here.
>> > I will work on https://issues.apache.org/jira/browse/HBASE-4058 and
>> > hopefully produce a unit test that fully exercises the ability of HBCK
>> > to reconstitute META and gives reliable results that can be
>> > incorporated into the test suite. My concern here is that getting
>> > repeatable results demonstrating HBCK weaknesses will be challenging.
>> >
>> > Best regards,
>> >
>> > - Andy
>> >
>> > Problems worthy of attack prove their worth by hitting back.
>> >   - Piet Hein (via Tom White)
>> >
>> > ----- Original Message -----
>> >> From: Wayne <[email protected]>
>> >> To: [email protected]
>> >> Sent: Saturday, July 2, 2011 9:55 AM
>> >> Subject: Re: hbck -fix
>> >>
>> >> It just returns a ton of errors (import: command not found). Our
>> >> cluster is hosed anyway. I am waiting to get it completely
>> >> re-installed from scratch. Hope has long since flown out the window.
>> >> I just changed my opinion of what it takes to manage hbase. A Java
>> >> engineer is required on staff. I also realized now a backup strategy
>> >> is more important than for a RDBMS. Having RF=3 in HDFS offers no
>> >> insurance against hbase losing its shirt and having .META. getting
>> >> corrupted. I think I just found the achilles heel.
>> >>
>> >> On Sat, Jul 2, 2011 at 12:40 PM, Ted Yu <[email protected]> wrote:
>> >>
>> >>> Have you tried running check_meta.rb with --fix ?
>> >>>
>> >>> On Sat, Jul 2, 2011 at 9:19 AM, Wayne <[email protected]> wrote:
>> >>>
>> >>> > We are running 0.90.3. We were testing the table export not
>> >>> > realizing the data goes to the root drive and not HDFS. The export
>> >>> > filled the master's root partition. The logger had issues and HDFS
>> >>> > got corrupted ("java.io.IOException: Incorrect data format.
>> >>> > logVersion is -18 but writables.length is 0"). We had to run
>> >>> > hadoop fsck -move to fix the corrupted hdfs files. We were able to
>> >>> > get hdfs running without issues but hbase ended up with the region
>> >>> > issues.
>> >>> >
>> >>> > We also had another issue making it worse with Ganglia. We had
>> >>> > moved the Ganglia host to the master server and Ganglia took up so
>> >>> > many resources that it actually caused timeouts talking to the
>> >>> > master and most nodes ended up shutting down. I guess Ganglia is a
>> >>> > pig in terms of resources...
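[Editor's note: the root-partition incident above suggests a simple preflight
guard before any job that writes large files to a local disk. This is an
illustrative sketch only — the function name, threshold, and path are not from
the thread:]

```python
import shutil


def has_free_space(path, required_bytes):
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Illustrative guard: refuse to start a large local-disk export (or let an
# edit log grow) when the volume is close to full. 10 GiB is an arbitrary
# threshold chosen for the example.
required = 10 * 1024 ** 3
if has_free_space("/", required):
    print("enough space, safe to run export")
else:
    print("refusing export: would risk filling the root partition")
```

Running a check like this from cron, and alerting well before the volume
fills, would have caught both the export and the edit-log corruption earlier.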
>> >>> >
>> >>> > I just tried to manually edit the .META. table removing the
>> >>> > remnants of the old table but the shell went haywire on me and
>> >>> > turned to control characters..??...I ended up corrupting the whole
>> >>> > thing and had to delete all tables...we have just not had a good
>> >>> > week.
>> >>> >
>> >>> > I will add comments to HBASE-3695 in terms of suggestions.
>> >>> >
>> >>> > Thanks.
>> >>> >
>> >>> > On Fri, Jul 1, 2011 at 4:55 PM, Stack <[email protected]> wrote:
>> >>> >
>> >>> > > What version of hbase are you on Wayne?
>> >>> > >
>> >>> > > On Fri, Jul 1, 2011 at 8:32 AM, Wayne <[email protected]>
>> >>> > > wrote:
>> >>> > > > I ran the hbck command and found 14 inconsistencies. There
>> >>> > > > were files in hdfs not used for region
>> >>> > >
>> >>> > > These are usually harmless. Bad accounting on our part. Need to
>> >>> > > plug the hole.
>> >>> > >
>> >>> > > > , regions with the same start key, a hole in the region chain,
>> >>> > > > and a missing start region with an empty key.
>> >>> > >
>> >>> > > These are pretty serious.
>> >>> > >
>> >>> > > How'd the master running out of root partition do this? I'd be
>> >>> > > interested to know.
>> >>> > >
>> >>> > > > We are not in production so we have the luxury to start again,
>> >>> > > > but the damage to our confidence is severe. Is there work
>> >>> > > > going on to improve hbck -fix to actually be able to resolve
>> >>> > > > these types of issues? To run a production hbase cluster, do
>> >>> > > > we need to expect to be able to move around and rebuild the
>> >>> > > > region definitions and the .META. table by hand? Things just
>> >>> > > > got a lot scarier fast for us, especially since we were hoping
>> >>> > > > to go into production next month.
>> >>> > > > Running out of disk space on the master's root partition can
>> >>> > > > bring down the entire cluster? This is scary...
>> >>> > >
>> >>> > > Understood.
>> >>> > >
>> >>> > > St.Ack
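[Editor's note: for reference, the redundant NameNode metadata layout Andy
recommends earlier in the thread is controlled by one property in
hdfs-site.xml. The mount points below are placeholders; in the Hadoop 0.20.x
line this thread concerns the property is `dfs.name.dir` (renamed
`dfs.namenode.name.dir` in later releases). The NameNode writes the fsimage
and edit log to every directory in the comma-separated list, so losing one
volume does not lose the filesystem metadata:]

```xml
<!-- hdfs-site.xml: write fsimage and edit log to two dedicated disks.
     /mnt/disk1 and /mnt/disk2 are placeholder mount points. -->
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/disk1/dfs/name,/mnt/disk2/dfs/name</value>
</property>
```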
