Did you see any exception when you ran add_table? Did it even terminate correctly?
After a restart, the regions aren't readily available. If something
blocked the master from assigning -ROOT-, it should be pretty evident by
looking at the master log.

J-D

On Thu, Jul 1, 2010 at 5:23 PM, Jinsong Hu <[email protected]> wrote:
> After I ran add_table.rb, I refreshed the master's UI page and then
> clicked on the table to show the regions, expecting all the regions to
> be there. But I found significantly fewer regions; lots of regions that
> were there before are gone.
>
> I then restarted the whole hbase master and region servers, and now it
> is even worse: the master UI page doesn't even load, saying that -ROOT-
> and .META. are not served by any regionserver. The whole cluster is not
> in a usable state.
>
> That forced me to rename /hbase to /hbase-0.20.4, restart all hbase
> masters and regionservers, recreate all tables, etc., essentially
> starting from scratch.
>
> Jimmy
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <[email protected]>
> Sent: Thursday, July 01, 2010 5:10 PM
> To: <[email protected]>
> Subject: Re: dilemma of memory and CPU for hbase.
>
>> add_table.rb doesn't actually write much to the file system; all your
>> data is still there. It just wipes all the .META. entries and replaces
>> them with the .regioninfo files found in every region directory.
>>
>> Can you define what you mean by "corrupted"? It's really an overloaded
>> term.
>>
>> J-D
>>
>> On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <[email protected]> wrote:
>>>
>>> Hi, Jean:
>>> Thanks! I will run the add_table.rb and see if it fixes the problem.
>>> Our namenode is backed up with HA and DRBD, and the hbase master
>>> machine colocates with the namenode and job tracker, so we are not
>>> wasting resources.
>>>
>>> The region hole probably comes from the previous 0.20.4 hbase
>>> operation. The 0.20.4 hbase was very unstable during its operation;
>>> lots of times the master said a region was not there while the region
>>> server said it was actually serving that region.
>>>
>>> I followed the instructions and ran commands like
>>>
>>> bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
>>>
>>> After the execution, I found all my tables were corrupted and I can't
>>> use them any more. Restarting hbase doesn't help either. I have to
>>> wipe out the whole /hbase directory and start from scratch.
>>>
>>> It looks like add_table.rb can corrupt the whole hbase. Anyway, I am
>>> regenerating the data from scratch and we'll see if it works out.
>>>
>>> Jimmy.
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <[email protected]>
>>> Sent: Thursday, July 01, 2010 2:17 PM
>>> To: <[email protected]>
>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>
>>>> (taking the conversation back to the list after receiving logs and
>>>> heap dump)
>>>>
>>>> The issue here is actually much nastier than it seems. But before I
>>>> describe the problem, you said:
>>>>
>>>>> I have 3 machines as hbase master (only 1 is active), 3 zookeepers,
>>>>> 8 regionservers.
>>>>
>>>> If those are all distinct machines, you are wasting a lot of
>>>> hardware. Unless you have an HA Namenode (which I highly doubt), you
>>>> already have a SPOF there, so you might as well put every service on
>>>> that single node (1 master, 1 zookeeper).
>>>> You might be afraid of using only 1 ZK node, but unless you share
>>>> the zookeeper ensemble between clusters, losing the Namenode is as
>>>> bad as losing ZK, so you might as well put them together. At
>>>> StumbleUpon we have 2-3 clusters using the same ensembles, so it
>>>> makes more sense for us to put them in an HA setup.
>>>>
>>>> That said, in your log I see:
>>>>
>>>> 2010-06-29 00:00:00,064 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>> ...
>>>> 2010-06-29 12:26:13,352 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>
>>>> So for 12 hours (and probably more), the same row was requested
>>>> almost every 100ms, but it was always failing on a
>>>> WrongRegionException (that's the name of what we see here). You
>>>> probably use the write buffer since you want to import as fast as
>>>> possible, so all these buffers are left unused after the clients
>>>> terminate their RPC. That rate of failed insertion must have kept
>>>> your garbage collector _very_ busy, and at some point the JVM OOMEd.
>>>> This is the stack from your OOME:
>>>>
>>>> java.lang.OutOfMemoryError: Java heap space
>>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
>>>>
>>>> This is where we deserialize client data, so it correlates with what
>>>> I just described.
>>>>
>>>> Now, this means that you probably have a hole (or more) in your
>>>> .META. table. It usually happens after a region server carrying
>>>> .META. fails (since data loss is possible with that version of HDFS)
>>>> or if a bug in the master messes up the .META. region. Now, 2 things:
>>>>
>>>> - It would be nice to know why you have a hole. Look at your .META.
>>>> table around the row in your region server log; you should see that
>>>> the start/end keys don't match. Then you can look in the master log
>>>> from yesterday to search for what went wrong, maybe see some
>>>> exceptions, or maybe a region server that was hosting .META. failed
>>>> for some reason.
>>>>
>>>> - You probably want to fix your table. Use the bin/add_table.rb
>>>> script (other people on this list have used it in the past; search
>>>> the archive for more info).
>>>>
>>>> Finally (whew!), if you are still developing your solution around
>>>> HBase, you might want to try out one of our dev releases that works
>>>> with a durable Hadoop release. See
>>>> http://hbase.apache.org/docs/r0.89.20100621/ for more info.
>>>> Cloudera's CDH3b2 also has everything you need.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans
>>>> <[email protected]> wrote:
>>>>>
>>>>> 653 regions is very low; even if you had a total of 3 region
>>>>> servers I wouldn't expect any problem.
>>>>>
>>>>> So to me it seems to point towards either a configuration issue or
>>>>> a usage issue.
>>>>> Can you:
>>>>>
>>>>> - Put the log of one region server that OOMEd on a public server.
>>>>> - Tell us more about your setup: # of nodes, hardware, configuration
>>>>>   file.
>>>>> - Tell us more about how you insert data into HBase.
>>>>>
>>>>> And BTW, are you trying to do an initial import of your data set? If
>>>>> so, have you considered using HFileOutputFormat?
>>>>>
>>>>> Thx,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi, Sir:
>>>>>> I am using hbase 0.20.5 and this morning I found that 3 of my
>>>>>> region servers ran out of memory. Each regionserver is given 6G of
>>>>>> memory, and I have 653 regions in total; the max store size is
>>>>>> 256M. I analyzed the dump and it shows that there are too many
>>>>>> HRegions in memory.
>>>>>>
>>>>>> I previously set the max store size to 2G, but then I found the
>>>>>> region server constantly does minor compactions and the CPU usage
>>>>>> is very high. It also blocks the heavy client record insertion.
>>>>>>
>>>>>> So now I am limited on one side by memory and on the other side by
>>>>>> CPU. Is there any way to get out of this dilemma?
>>>>>>
>>>>>> Jimmy.
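
J-D's suggestion to look at .META. around the failing row for mismatched
start/end keys can also be done programmatically. Below is a minimal sketch
against the 0.20-era Java client, scanning .META. and printing regions whose
start key does not equal the previous region's end key; the exact class and
constant names are assumptions based on that release, so verify them against
your jars before relying on the output.

  // Sketch only: walks .META. and flags apparent holes between regions of the
  // same table. Split parents are skipped because they are offline rows.
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.hbase.util.Writables;

  public class MetaHoleCheck {
    public static void main(String[] args) throws Exception {
      HTable meta = new HTable(new HBaseConfiguration(), HConstants.META_TABLE_NAME);
      Scan scan = new Scan();
      scan.addColumn(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
      ResultScanner scanner = meta.getScanner(scan);
      String prevTable = null;    // table name of the previous region row
      byte[] prevEndKey = null;   // end key of the previous region row
      for (Result r : scanner) {
        byte[] serialized = r.getValue(HConstants.CATALOG_FAMILY,
                                       HConstants.REGIONINFO_QUALIFIER);
        if (serialized == null) continue;
        HRegionInfo info = Writables.getHRegionInfo(serialized);
        if (info.isOffline()) continue;   // skip split parents
        String table = info.getTableDesc().getNameAsString();
        if (table.equals(prevTable) && prevEndKey != null
            && !Bytes.equals(prevEndKey, info.getStartKey())) {
          System.out.println("Possible hole before " + info.getRegionNameAsString());
        }
        prevTable = table;
        prevEndKey = info.getEndKey();
      }
      scanner.close();
    }
  }

A hole found this way is exactly what makes a client's puts fail with
WrongRegionException, since no row in .META. claims the missing key range.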

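The write buffer J-D refers to ("you probably use the write buffer since you
want to import as fast as possible") is configured per HTable on the client.
A minimal sketch of a buffered import loop, again assuming the 0.20.x client
API; the table and column names below are placeholders for illustration, not
taken from the thread's schema.

  // Sketch only: buffer puts client-side and flush in batches instead of one
  // RPC per put. Placeholder table/family/qualifier names.
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BufferedImport {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "my_table");
      table.setAutoFlush(false);                  // queue puts client-side
      table.setWriteBufferSize(4 * 1024 * 1024);  // flush roughly every 4 MB of puts
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes(String.format("row-%08d", i)));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        table.put(put);                           // buffered until the buffer fills
      }
      table.flushCommits();                       // push whatever is left in the buffer
    }
  }

Bigger buffers mean fewer RPCs, but each buffered batch still has to be
deserialized on the region server, which is the HBaseServer code path visible
in the OOME stack above.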