Yes, it terminated correctly; there was no exception while running
add_table.
Are you saying that after a restart I need to wait for some time for
-ROOT- to
be assigned? Usually how long do I need to wait?
Jimmy
--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Thursday, July 01, 2010 5:27 PM
To: <[email protected]>
Subject: Re: dilemma of memory and CPU for hbase.
Did you see any exception when you ran add_table? Did it even
terminate correctly?
After a restart, the regions aren't readily available. If something
blocked the master from assigning -ROOT-, it should be pretty evident
by looking at the master log.
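If you want to check programmatically instead of refreshing the UI,
something like this works as a crude poll (an untested sketch against
the 0.20 client API; opening .META. forces a -ROOT- lookup first, so it
only succeeds once both catalog regions are assigned):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;

  public class WaitForCatalog {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      while (true) {
        try {
          // Opening .META. triggers a -ROOT- lookup, so this only works
          // once both catalog regions are assigned and being served.
          HTable meta = new HTable(conf, ".META.");
          meta.getStartKeys();   // touches the catalog
          System.out.println("-ROOT- and .META. are being served");
          break;
        } catch (Exception e) {
          System.out.println("catalog not ready yet: " + e.getMessage());
          Thread.sleep(5000);
        }
      }
    }
  }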
J-D
On Thu, Jul 1, 2010 at 5:23 PM, Jinsong Hu <[email protected]> wrote:
After I ran add_table.rb, I refreshed the master's UI page and then
clicked on the table to show the regions. I expected all the regions to
be there,
but I found significantly fewer regions; lots of regions that were
there before were gone.
I then restarted the whole HBase master and all region servers, and now it is
even worse: the master UI page doesn't even load, saying the -ROOT- region
and .META. are not served by any region server. The whole cluster is not
in
a usable state.
That forced me to rename /hbase to /hbase-0.20.4, restart the HBase
master and all region servers, recreate all tables, etc., essentially starting
from scratch.
Jimmy
--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Thursday, July 01, 2010 5:10 PM
To: <[email protected]>
Subject: Re: dilemma of memory and CPU for hbase.
add_table.rb doesn't actually write much to the file system; all your
data is still there. It just wipes the table's .META. entries and replaces
them with the .regioninfo files found in every region directory.
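In rough Java terms the idea is something like this (a simplified
illustration, not the actual script; it assumes .regioninfo is the plain
serialized HRegionInfo and skips the cleanup the real script does first):

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Writables;

  public class RebuildMetaSketch {
    // args[0] = table directory, e.g. /hbase/MyTable
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      FileSystem fs = FileSystem.get(conf);
      HTable meta = new HTable(conf, ".META.");
      // Walk every region directory under the table and re-insert its row.
      for (FileStatus regionDir : fs.listStatus(new Path(args[0]))) {
        Path regioninfo = new Path(regionDir.getPath(), ".regioninfo");
        if (!fs.exists(regioninfo)) continue;
        FSDataInputStream in = fs.open(regioninfo);
        HRegionInfo info = new HRegionInfo();
        info.readFields(in);   // .regioninfo is a serialized HRegionInfo
        in.close();
        Put p = new Put(info.getRegionName());
        p.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER,
              Writables.getBytes(info));
        meta.put(p);
      }
    }
  }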
Can you define what you mean by "corrupted"? It's really an
overloaded term.
J-D
On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <[email protected]>
wrote:
Hi, Jean:
Thanks! I will run add_table.rb and see if it fixes the problem.
Our namenode is backed up with HA and DRBD, and the HBase master
machine
is colocated with the namenode and job tracker, so we are not wasting resources.
The region hole probably comes from the previous 0.20.4 HBase operation;
the
0.20.4 HBase was
very unstable while it ran. Lots of times the master said a
region
was not there while
the region server said it was serving that region.
I followed the instructions and ran commands like
bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
After the execution, I found all my tables were corrupted and I couldn't
use
them
any more. Restarting HBase
didn't help either. I had to wipe out the whole /hbase directory and
start
from scratch.
It looks like add_table.rb can corrupt the whole HBase. Anyway, I
am
regenerating the data from
scratch, and let's see if it works out.
Jimmy.
--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Thursday, July 01, 2010 2:17 PM
To: <[email protected]>
Subject: Re: dilemma of memory and CPU for hbase.
(taking the conversation back to the list after receiving logs and
heap
dump)
The issue here is actually much nastier than it seems. But before I
describe the problem, you said:
I have 3 machines as hbase master (only 1 is active), 3 zookeepers.
8
regionservers.
If those are all distinct machines, you are wasting a lot of hardware.
Unless you have an HA Namenode (which I highly doubt), you already have
a SPOF there, so you might as well put every service on that single
node (1 master, 1 ZooKeeper). You might be afraid of using only 1 ZK
node, but unless you share the ZooKeeper ensemble between clusters,
losing the Namenode is as bad as losing ZK, so you might as well put
them together. At StumbleUpon we have 2-3 clusters using the same
ensembles, so it makes more sense to put them in an HA setup.
That said, in your log I see:
2010-06-29 00:00:00,064 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts interrupted at index=0 because:Requested row out of range for HRegion Spam_MsgEventTable,2010-06-28 11:34:02blah
...
2010-06-29 12:26:13,352 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts interrupted at index=0 because:Requested row out of range for HRegion Spam_MsgEventTable,2010-06-28 11:34:02blah
So for 12 hours (and probably more), the same row was requested almost
every 100 ms, and it always failed with a WrongRegionException
(that's the name of what we see here). You probably use the client write
buffer since you want to import as fast as possible, so every failed
call leaves a whole buffer's worth of deserialized puts behind on the
server. That rate of failed insertion must have kept your garbage
collector _very_ busy, and at some point the JVM OOMEd. This is the
stack from your OOME:
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
This is where we deserialize client data, so it correlates with what I
just described.
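For reference, this is the kind of client-side buffering I mean (a
minimal sketch with a made-up family/qualifier and row pattern, not your
code):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BufferedImport {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "Spam_MsgEventTable");
      table.setAutoFlush(false);                  // buffer puts client-side
      table.setWriteBufferSize(12 * 1024 * 1024); // flush roughly every 12 MB
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes("2010-06-28 11:34:02|" + i));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);       // queued in the client buffer, not sent yet
      }
      table.flushCommits();   // one big batched RPC per buffer-full
    }
  }

Every flushCommits (or automatic flush when the buffer fills) arrives at
the region server as one of those Invocation.readFields calls in the
stack above, so when every row in the buffer targets a missing region,
all of that work is allocated and thrown away again and again.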
Now, this means that you probably have a hole (or more than one) in your
.META. table. That usually happens after a region server carrying .META.
fails (since data loss is possible with that version of HDFS), or when a
bug in the master messes up the .META. region. Now, 2 things:
- It would be nice to know why you have a hole. Look at your .META.
table around the row from your region server log; you should see that
the start/end keys don't match (see the sketch after this list). Then
look in yesterday's master log to find what went wrong: maybe some
exceptions, or maybe a region server that was hosting .META. failed for
some reason.
- You probably want to fix your table. Use the bin/add_table.rb
script (other people on this list used it in the past, search the
archive for more info).
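Here's one way to eyeball .META. for a hole around that table (a rough
sketch against the 0.20 client API; adjust the table name, and ignore
false alarms from offlined split parents):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.hbase.util.Writables;

  public class FindMetaHoles {
    public static void main(String[] args) throws Exception {
      HTable meta = new HTable(new HBaseConfiguration(), ".META.");
      // .META. rows for a table start with "<tablename>,"
      Scan scan = new Scan(Bytes.toBytes("Spam_MsgEventTable,"));
      scan.addColumn(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
      ResultScanner scanner = meta.getScanner(scan);
      byte[] lastEndKey = HConstants.EMPTY_BYTE_ARRAY;
      for (Result row : scanner) {
        HRegionInfo info = Writables.getHRegionInfo(
            row.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER));
        if (!info.getTableDesc().getNameAsString().equals("Spam_MsgEventTable")) break;
        // In a healthy table, each region's start key equals the previous end key.
        if (!Bytes.equals(lastEndKey, info.getStartKey())) {
          System.out.println("HOLE before region " + info.getRegionNameAsString());
        }
        lastEndKey = info.getEndKey();
      }
      scanner.close();
    }
  }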
Finally (whew!), if you are still developing your solution around
HBase, you might want to try out one of our dev releases, which work
with a durable Hadoop release. See
http://hbase.apache.org/docs/r0.89.20100621/ for more info. Cloudera's
CDH3b2 also has everything you need.
J-D
On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans
<[email protected]>
wrote:
653 regions is very low; even if you had a total of 3 region servers
I
wouldn't expect any problem.
So to me it seems to point towards either a configuration issue or a
usage issue. Can you:
- Put the log of one region server that OOMEd on a public server.
- Tell us more about your setup: # of nodes, hardware, configuration
file
- Tell us more about how you insert data into HBase
And BTW, are you trying to do an initial import of your data set? If
so, have you considered using HFileOutputFormat? (Rough sketch below.)
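A job of that kind looks roughly like this (a heavily simplified sketch:
the tab-separated input and family/qualifier are made up, it relies on
the framework's sort with the default single reducer, and a real
multi-reducer job needs a total-order partitioner):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class BulkImport {
    // Turns each tab-separated "row<TAB>value" line into a KeyValue.
    static class ImportMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
      protected void map(LongWritable key, Text line, Context ctx)
          throws java.io.IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        byte[] row = Bytes.toBytes(parts[0]);
        KeyValue kv = new KeyValue(row, Bytes.toBytes("f"),
            Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
        ctx.write(new ImmutableBytesWritable(row), kv);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new HBaseConfiguration(), "bulk import");
      job.setJarByClass(BulkImport.class);
      job.setMapperClass(ImportMapper.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputKeyClass(ImmutableBytesWritable.class);
      job.setOutputValueClass(KeyValue.class);
      job.setOutputFormatClass(HFileOutputFormat.class);  // writes HFiles, not Puts
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The HFiles it writes still have to be loaded into the table afterwards
(bin/loadtable.rb in the 0.20 line, if I remember correctly; later
releases ship an incremental bulk-load tool).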
Thx,
J-D
On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <[email protected]>
wrote:
Hi, Sir:
I am using HBase 0.20.5, and this morning I found that 3 of my
region
servers ran out of memory.
Each region server is given 6 GB of memory, and on average I have 653
regions
in total. The max store size
is 256 MB. I analyzed the heap dump and it shows that there are too many
HRegions in
memory.
I previously set the max store size to 2 GB, but then I found the region
servers
constantly do minor compactions and the CPU usage is very high, which
also
blocks the heavy client record insertion.
So now I am limited on one side by memory and on the other side
by
CPU.
Is there any way to get out of this dilemma?
Jimmy.