Did you see any exception when you ran add_table? Did it even
terminate correctly?

After a restart, the regions aren't readily available. If something
blocked the master from assigning -ROOT-, it should be pretty evident
by looking at the master log.

J-D

On Thu, Jul 1, 2010 at 5:23 PM, Jinsong Hu <[email protected]> wrote:
> After I ran add_table.rb, I refreshed the master's UI page and then
> clicked on the table to show its regions. I expected that all regions
> would be there, but I found significantly fewer; lots of regions that
> were there before were gone.
>
> I then restarted the whole HBase master and region server setup, and now
> it is even worse: the master UI page doesn't even load, saying that -ROOT-
> and .META. are not served by any regionserver. The whole cluster is not in
> a usable state.
>
> That forced me to rename /hbase to /hbase-0.20.4, restart all HBase
> masters and regionservers, recreate all tables, etc., essentially starting
> from scratch.
>
> Jimmy
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <[email protected]>
> Sent: Thursday, July 01, 2010 5:10 PM
> To: <[email protected]>
> Subject: Re: dilemma of memory and CPU for hbase.
>
>> add_table.rb doesn't actually write much to the filesystem; all your
>> data is still there. It just wipes all the .META. entries and replaces
>> them with the .regioninfo files found in every region directory.
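>>
>> (If you want to eyeball what add_table.rb left in .META., a quick scan
>> like the one below works. This is just an untested sketch against the
>> 0.20 client API; the class name is a placeholder:)
>>
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.client.HTable;
>>   import org.apache.hadoop.hbase.client.Result;
>>   import org.apache.hadoop.hbase.client.ResultScanner;
>>   import org.apache.hadoop.hbase.client.Scan;
>>   import org.apache.hadoop.hbase.util.Bytes;
>>
>>   public class MetaDump {
>>     public static void main(String[] args) throws Exception {
>>       HTable meta = new HTable(new HBaseConfiguration(), ".META.");
>>       Scan scan = new Scan();
>>       scan.addFamily(Bytes.toBytes("info"));   // region info lives here
>>       ResultScanner scanner = meta.getScanner(scan);
>>       for (Result r : scanner) {
>>         // each .META. row key is "tablename,startkey,regionid"
>>         System.out.println(Bytes.toString(r.getRow()));
>>       }
>>       scanner.close();
>>     }
>>   }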
>>
>> Can you define what you mean by "corrupted"? It's really an
>> overloaded term.
>>
>> J-D
>>
>> On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <[email protected]> wrote:
>>>
>>> Hi, Jean:
>>>  Thanks! I will run add_table.rb and see if it fixes the problem.
>>>  Our namenode is backed up with HA and DRBD, and the HBase master
>>> machine is colocated with the namenode and job tracker, so we are not
>>> wasting resources.
>>>
>>>  The region hole probably comes from earlier operation on HBase 0.20.4.
>>> That version was very unstable while it ran: lots of times the master
>>> said a region was not there, while the region server said it was actually
>>> serving that region.
>>>
>>>
>>> I followed the instructions and ran commands like
>>>
>>> bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
>>>
>>> After the execution, I found all my tables were corrupted and I can't
>>> use them any more. Restarting HBase doesn't help either. I have to wipe
>>> out the whole /hbase directory and start from scratch.
>>>
>>>
>>> It looks like add_table.rb can corrupt the whole HBase installation.
>>> Anyway, I am regenerating the data from scratch; let's see if it works
>>> out.
>>>
>>> Jimmy.
>>>
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <[email protected]>
>>> Sent: Thursday, July 01, 2010 2:17 PM
>>> To: <[email protected]>
>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>
>>>> (taking the conversation back to the list after receiving logs and heap
>>>> dump)
>>>>
>>>> The issue here is actually much nastier than it seems. But before I
>>>> describe the problem, you said:
>>>>
>>>>>  I have 3 machines as hbase master (only 1 is active), 3 zookeepers. 8
>>>>> regionservers.
>>>>
>>>> If those are all distinct machines, you are wasting a lot of hardware.
>>>> Unless you have an HA Namenode (which I highly doubt), you already have
>>>> a SPOF there, so you might as well put every service on that single
>>>> node (1 master, 1 zookeeper). You might be afraid of using only 1 ZK
>>>> node, but unless you share the zookeeper ensemble between clusters,
>>>> losing the Namenode is as bad as losing ZK, so you might as well put
>>>> them together. At StumbleUpon we have 2-3 clusters using the same
>>>> ensembles, so there it makes more sense to put them in an HA setup.
>>>>
>>>> That said, in your log I see:
>>>>
>>>> 2010-06-29 00:00:00,064 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>> ...
>>>> 2010-06-29 12:26:13,352 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>
>>>> So for 12 hours (and probably more), the same row was requested almost
>>>> every 100ms, but it always failed with a WrongRegionException (that's
>>>> the name of what we see here). You probably use the write buffer since
>>>> you want to import as fast as possible, so all those buffers are left
>>>> unused after the clients terminate their RPCs. That rate of failed
>>>> insertions must have kept your garbage collector _very_ busy, and at
>>>> some point the JVM OOMEd. This is the stack from your OOME:
>>>>
>>>> java.lang.OutOfMemoryError: Java heap space
>>>> at
>>>>
>>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
>>>> at
>>>>
>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
>>>> at
>>>>
>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
>>>> at
>>>>
>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>>>> at
>>>>
>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
>>>>
>>>> This is where we deserialize client data, so it correlates with what I
>>>> just described.
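>>>>
>>>> (For reference, by "write buffer" I mean the client-side pattern below.
>>>> This is a rough, untested sketch against the 0.20 API, with made-up
>>>> table, family and qualifier names:)
>>>>
>>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>   import org.apache.hadoop.hbase.client.HTable;
>>>>   import org.apache.hadoop.hbase.client.Put;
>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>
>>>>   public class BufferedImport {
>>>>     public static void main(String[] args) throws Exception {
>>>>       HTable table = new HTable(new HBaseConfiguration(), "my_table");
>>>>       table.setAutoFlush(false);             // don't ship every Put alone
>>>>       table.setWriteBufferSize(2 * 1024 * 1024); // hold ~2MB of Puts
>>>>       for (int i = 0; i < 100000; i++) {
>>>>         Put put = new Put(Bytes.toBytes("row-" + i));
>>>>         put.add(Bytes.toBytes("fam"), Bytes.toBytes("qual"),
>>>>                 Bytes.toBytes("value-" + i));
>>>>         table.put(put);                      // buffered client-side
>>>>       }
>>>>       table.flushCommits();                  // push what's left
>>>>     }
>>>>   }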
>>>>
>>>> Now, this means that you probably have a hole (or more) in your .META.
>>>> table. That usually happens after a region server carrying .META. fails
>>>> (since data loss is possible with that version of HDFS), or when a bug
>>>> in the master messes up the .META. region. Now, two things:
>>>>
>>>> - It would be nice to know why you have a hole. Look at your .META.
>>>> table around the row shown in your region server log; you should see
>>>> that the start/end keys of consecutive regions don't match (see the
>>>> sketch after these two points). Then look in the master log from
>>>> yesterday to search for what went wrong: maybe some exceptions, or
>>>> maybe a region server that was hosting .META. failed for some reason.
>>>>
>>>> - You probably want to fix your table. Use the bin/add_table.rb
>>>> script (other people on this list used it in the past, search the
>>>> archive for more info).
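>>>>
>>>> (Here's the kind of check I mean for spotting a hole: ask the client for
>>>> the region boundaries .META. knows about and flag any place where they
>>>> don't line up. An untested sketch; the class name and the table-name
>>>> argument are placeholders:)
>>>>
>>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>   import org.apache.hadoop.hbase.client.HTable;
>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>   import org.apache.hadoop.hbase.util.Pair;
>>>>
>>>>   public class FindHoles {
>>>>     public static void main(String[] args) throws Exception {
>>>>       HTable table = new HTable(new HBaseConfiguration(), args[0]);
>>>>       Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
>>>>       byte[][] starts = keys.getFirst();
>>>>       byte[][] ends = keys.getSecond();
>>>>       for (int i = 1; i < starts.length; i++) {
>>>>         // in a healthy table, region i starts where region i-1 ends
>>>>         if (Bytes.compareTo(ends[i - 1], starts[i]) != 0) {
>>>>           System.out.println("Hole between '" + Bytes.toString(ends[i - 1])
>>>>               + "' and '" + Bytes.toString(starts[i]) + "'");
>>>>         }
>>>>       }
>>>>     }
>>>>   }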
>>>>
>>>> Finally (whew!), if you are still developing your solution around
>>>> HBase, you might want to try out one of our dev releases, which work
>>>> with a durable Hadoop release. See
>>>> http://hbase.apache.org/docs/r0.89.20100621/ for more info. Cloudera's
>>>> CDH3b2 also has everything you need.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans
>>>> <[email protected]>
>>>> wrote:
>>>>>
>>>>> 653 regions is very low; even if you had a total of only 3 region
>>>>> servers, I wouldn't expect any problem.
>>>>>
>>>>> So to me it seems to point towards either a configuration issue or a
>>>>> usage issue. Can you:
>>>>>
>>>>>  - Put the log of one region server that OOMEd on a public server.
>>>>>  - Tell us more about your setup: # of nodes, hardware, configuration
>>>>> file
>>>>>  - Tell us more about how you insert data into HBase
>>>>>
>>>>> And BTW are you trying to do an initial import of your data set? If
>>>>> so, have you considered using HFileOutputFormat?
>>>>>
>>>>> Thx,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi, Sir:
>>>>>>  I am using HBase 0.20.5, and this morning I found that 3 of my
>>>>>> region servers ran out of memory. Each regionserver is given 6G of
>>>>>> memory, and I have 653 regions in total; the max store size is 256M.
>>>>>> I analyzed the heap dump, and it shows that there are too many HRegion
>>>>>> objects in memory.
>>>>>>
>>>>>>  Previously I set the max store size to 2G, but then I found that the
>>>>>> region server constantly did minor compactions and the CPU usage was
>>>>>> very high. It also blocked the heavy client record insertion.
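>>>>>>
>>>>>> (By "max store size" I mean the hbase.hregion.max.filesize setting in
>>>>>> hbase-site.xml, if I have the property name right; these are just the
>>>>>> two values I tried, not recommendations:)
>>>>>>
>>>>>>   <property>
>>>>>>     <name>hbase.hregion.max.filesize</name>
>>>>>>     <!-- 268435456 = 256MB now; it was 2147483648 (2GB) before -->
>>>>>>     <value>268435456</value>
>>>>>>   </property>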
>>>>>>
>>>>>>  So now I am limited on one side by memory and on the other side by
>>>>>> CPU. Is there any way to get out of this dilemma?
>>>>>>
>>>>>> Jimmy.
>>>>>>
>>>>>
>>>>
>>>
>>
>
