When I start HBase I usually just tail the master log; -ROOT- is
normally assigned within a few seconds, .META. takes another few
seconds after that, and then the master starts assigning all the
other regions.
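
If you want to watch the assignments go by, something like this does
it (the log file name pattern is just my assumption, it depends on
your install):

  tail -f $HBASE_HOME/logs/hbase-*-master-*.log | grep -i assign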

Did you make sure your master log was clean of errors?

J-D

On Thu, Jul 1, 2010 at 5:40 PM, Jinsong Hu <[email protected]> wrote:
> Yes, it terminated correctly; there was no exception while running
> add_table.rb.
>
> Are you saying that after a restart I need to wait some time for
> -ROOT- to be assigned? How long do I usually need to wait?
>
> Jimmy
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <[email protected]>
> Sent: Thursday, July 01, 2010 5:27 PM
> To: <[email protected]>
> Subject: Re: dilemma of memory and CPU for hbase.
>
>> Did you see any exception when you ran add_table.rb? Did it even
>> terminate correctly?
>>
>> After a restart, the regions aren't readily available. If something
>> blocked the master from assigning -ROOT-, it should be pretty evident
>> by looking at the master log.
>>
>> J-D
>>
>> On Thu, Jul 1, 2010 at 5:23 PM, Jinsong Hu <[email protected]> wrote:
>>>
>>> After I ran add_table.rb, I refreshed the master's UI page and then
>>> clicked on the table to show the regions. I expected all the regions
>>> to be there. But I found significantly fewer regions; lots of regions
>>> that were there before were gone.
>>>
>>> I then restarted the whole HBase master and region servers, and now
>>> it is even worse: the master UI page doesn't even load, saying the
>>> -ROOT- and .META. regions are not served by any region server. The
>>> whole cluster is not in a usable state.
>>>
>>> That forced me to rename /hbase to /hbase-0.20.4, restart all the
>>> HBase masters and region servers, recreate all tables, etc.,
>>> essentially starting from scratch.
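>>>
>>> (For the record, the rename was just an HDFS move with HBase fully
>>> stopped, something like:
>>>
>>>   hadoop fs -mv /hbase /hbase-0.20.4
>>>
>>> in case anyone else needs to do the same.)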
>>>
>>> Jimmy
>>>
>>> --------------------------------------------------
>>> From: "Jean-Daniel Cryans" <[email protected]>
>>> Sent: Thursday, July 01, 2010 5:10 PM
>>> To: <[email protected]>
>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>
>>>> add_table.rb doesn't actually write much to the file system; all your
>>>> data is still there. It just wipes all the .META. entries and replaces
>>>> them with the .regioninfo files found in every region directory.
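>>>>
>>>> If you want to sanity-check it, you can compare the regions on disk
>>>> against what ends up in .META. after the run. A rough sketch (your
>>>> /hbase root and table name may differ):
>>>>
>>>>   # one .regioninfo file per region directory on disk
>>>>   hadoop fs -lsr /hbase/table_name | grep regioninfo
>>>>
>>>>   # what .META. currently knows about
>>>>   echo "scan '.META.'" | bin/hbase shell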
>>>>
>>>> Can you define what you mean by "corrupted"? It's really an
>>>> overloaded term.
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi, Jean:
>>>>>  Thanks! I will run add_table.rb and see if it fixes the problem.
>>>>>  Our namenode is backed up with HA and DRBD, and the HBase master
>>>>> machine is colocated with the namenode and job tracker, so we are
>>>>> not wasting resources.
>>>>>
>>>>>  The region hole probably comes from operating the previous 0.20.4
>>>>> HBase, which was very unstable. Lots of times the master said a
>>>>> region was not there while the region server said it was actually
>>>>> serving it.
>>>>>
>>>>>
>>>>> I followed the instructions and ran a command like
>>>>>
>>>>> bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
>>>>>
>>>>> After the execution, I found all my tables were corrupted and I
>>>>> couldn't use them any more. Restarting HBase didn't help either. I
>>>>> had to wipe out the whole /hbase directory and start from scratch.
>>>>>
>>>>>
>>>>> It looks like add_table.rb can corrupt the whole HBase setup.
>>>>> Anyway, I am regenerating the data from scratch; let's see if it
>>>>> works out.
>>>>>
>>>>> Jimmy.
>>>>>
>>>>>
>>>>> --------------------------------------------------
>>>>> From: "Jean-Daniel Cryans" <[email protected]>
>>>>> Sent: Thursday, July 01, 2010 2:17 PM
>>>>> To: <[email protected]>
>>>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>>>
>>>>>> (taking the conversation back to the list after receiving the logs
>>>>>> and heap dump)
>>>>>>
>>>>>> The issue here is actually much nastier than it seems. But before I
>>>>>> describe the problem, you said:
>>>>>>
>>>>>>>  I have 3 machines as hbase master (only 1 is active), 3 zookeepers.
>>>>>>> 8
>>>>>>> regionservers.
>>>>>>
>>>>>> If those are all distinct machines, you are wasting a lot of
>>>>>> hardware. Unless you have an HA Namenode (which I highly doubt),
>>>>>> you already have a SPOF there, so you might as well put every
>>>>>> service on that single node (1 master, 1 zookeeper). You might be
>>>>>> afraid of using only 1 ZK node, but unless you share the zookeeper
>>>>>> ensemble between clusters, losing the Namenode is as bad as losing
>>>>>> ZK, so you might as well put them together. At StumbleUpon we have
>>>>>> 2-3 clusters using the same ensembles, so it makes more sense for
>>>>>> us to put them in an HA setup.
>>>>>>
>>>>>> That said, in your log I see:
>>>>>>
>>>>>> 2010-06-29 00:00:00,064 DEBUG
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>>> ...
>>>>>> 2010-06-29 12:26:13,352 DEBUG
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>>>
>>>>>> So for 12 hours (and probably more), the same row was requested
>>>>>> almost every 100ms and it always failed with a WrongRegionException
>>>>>> (that's the name of what we see here). You probably use the write
>>>>>> buffer since you want to import as fast as possible, so all these
>>>>>> deserialized buffers are left sitting in memory after the clients'
>>>>>> RPCs terminate. That rate of failed insertion must have kept your
>>>>>> garbage collector _very_ busy, and at some point the JVM OOMEd.
>>>>>> This is the stack from your OOME:
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>>>>>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
>>>>>>
>>>>>> This is where we deserialize client data, so it correlates with what I
>>>>>> just described.
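>>>>>>
>>>>>> A quick way to gauge how much of this was going on is to count that
>>>>>> DEBUG message in the region server log (adjust the log path to your
>>>>>> install):
>>>>>>
>>>>>>   grep -c 'Requested row out of range' logs/hbase-*-regionserver-*.log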
>>>>>>
>>>>>> Now, this means that you probably have a hole (or more) in your
>>>>>> .META. table. It usually happens after a region server that was
>>>>>> carrying .META. fails (since data loss is possible with that
>>>>>> version of HDFS), or if a bug in the master messes up the .META.
>>>>>> region. Now, 2 things:
>>>>>>
>>>>>> - It would be nice to know why you have a hole. Look at your .META.
>>>>>> table around the row from your region server log (see the scan
>>>>>> sketch after this list); you should see that the start/end keys of
>>>>>> adjacent regions don't match. Then you can look in the master log
>>>>>> from yesterday to search for what went wrong: maybe some exceptions,
>>>>>> or maybe a region server that was hosting .META. failed for some
>>>>>> reason.
>>>>>>
>>>>>> - You probably want to fix your table. Use the bin/add_table.rb
>>>>>> script (other people on this list used it in the past, search the
>>>>>> archive for more info).
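>>>>>>
>>>>>> To eyeball the hole, a scan of .META. starting near the bad row
>>>>>> should show it. Something like this (the start row is abbreviated
>>>>>> and the LIMIT is just illustrative):
>>>>>>
>>>>>>   echo "scan '.META.', {STARTROW => 'Spam_MsgEventTable,2010-06-28', LIMIT => 5}" | bin/hbase shell
>>>>>>
>>>>>> The hole shows up as consecutive entries where one region's end key
>>>>>> doesn't match the next region's start key.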
>>>>>>
>>>>>> Finally (whew!), if you are still developing your solution around
>>>>>> HBase, you might want to try out one of our dev releases, which
>>>>>> work with a durable Hadoop release. See
>>>>>> http://hbase.apache.org/docs/r0.89.20100621/ for more info.
>>>>>> Cloudera's CDH3b2 also has everything you need.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans
>>>>>> <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> 653 regions is very low; even if you had a total of 3 region
>>>>>>> servers I wouldn't expect any problems.
>>>>>>>
>>>>>>> So to me it seems to point towards either a configuration issue or a
>>>>>>> usage issue. Can you:
>>>>>>>
>>>>>>>  - Put the log of one region server that OOMEd on a public server.
>>>>>>>  - Tell us more about your setup: # of nodes, hardware, configuration
>>>>>>> file
>>>>>>>  - Tell us more about how you insert data into HBase
>>>>>>>
>>>>>>> And BTW are you trying to do an initial import of your data set? If
>>>>>>> so, have you considered using HFileOutputFormat?
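>>>>>>>
>>>>>>> If you go that route, the rough flow in 0.20 is: have your MR job
>>>>>>> write HFiles with HFileOutputFormat, then point the loadtable
>>>>>>> script at the output directory. Something like this (table name
>>>>>>> and path are placeholders, and I'm going from memory on the
>>>>>>> script name):
>>>>>>>
>>>>>>>   bin/hbase org.jruby.Main bin/loadtable.rb MyTable /tmp/hfile-output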
>>>>>>>
>>>>>>> Thx,
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi, Sir:
>>>>>>>>  I am using HBase 0.20.5, and this morning I found that 3 of my
>>>>>>>> region servers had run out of memory.
>>>>>>>> Each region server is given 6G of memory, and I have 653 regions
>>>>>>>> in total. The max store size is 256M. I analyzed the heap dump
>>>>>>>> and it shows that there are too many HRegion objects in memory.
>>>>>>>>
>>>>>>>>  Previously I set the max store size to 2G, but then I found the
>>>>>>>> region servers constantly doing minor compactions and the CPU
>>>>>>>> usage was very high. It also blocked the heavy client record
>>>>>>>> insertion.
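>>>>>>>>
>>>>>>>> (By max store size I mean hbase.hregion.max.filesize in
>>>>>>>> hbase-site.xml, assuming I have the property name right:
>>>>>>>>
>>>>>>>>   <property>
>>>>>>>>     <name>hbase.hregion.max.filesize</name>
>>>>>>>>     <value>268435456</value> <!-- 256M -->
>>>>>>>>   </property>
>>>>>>>> )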
>>>>>>>>
>>>>>>>>  So now I am limited on one side by memory and on the other side
>>>>>>>> by CPU. Is there any way to get out of this dilemma?
>>>>>>>>
>>>>>>>> Jimmy.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
