On Wed, Jul 20, 2011 at 8:22 AM, Xu-Feng Mao <[email protected]> wrote:
> Hi,
> We're running a 25-node regionserver hbase cluster, using cdh3u0.
> 1. We run into several jvm crashes on master today. It seems like jvm
> issues, as I attached the hs_error_pid files
> with this message. Just want to confirm that if this is really a jvm issue,
> or maybe some master issue trigger the
> low level one.
Tell us your JVM version and your OS. Attachments usually don't come
through. The u18 JVM had a tendency to crash. Google your OS + JVM
version to see if you can turn up others experiencing what you see.
> 2. We also have two regionservers down today, after the regionserver
> restarted, the regions assigned to them
> is much less than the others.
...
> 2011-07-20 23:14:43,228 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
> running balancer because 2 region(s) in transition:
> {1efae368e8d64cc59aeadbb3289fddac=S3Table,EStore_everbox_HlYLqw4bMBDuLsfFZXg0kesvVjg=,1305255180483.1efae368e8d64cc59aeadbb3289fddac.
> state=PENDING_CLOSE, ts=1311174872837,
> 41491bc74321aeb578f00aae2725eefc=S3Table,ku6_ku6upload_1307149487260,13111025078...
So, the balancer is not running because two regions are still being
'balanced'. If you look in the master logs, these regions are probably
being timed out and reassigned, then failing for some reason. Can you
get a clue why? Are these two regions up in zk? (Use ./bin/hbase
zkcli to poke around in the /hbase/unassigned directory.) You could
try killing your master and bringing it up again; the stuck state may
exist only in the master's memory.
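
If it helps with digging through the master logs, here is a rough
Python sketch that pulls the encoded region name, state, and timestamp
out of a "Not running balancer" line. The pattern is inferred only from
the one excerpt quoted above, so treat it as an assumption rather than
a documented log format; parse_regions_in_transition is a made-up
helper name, and the sample contains only the first region because the
second entry is cut off in the excerpt.

```python
import re
from datetime import datetime, timezone

# Assumed shape, taken from the quoted excerpt only:
#   <encoded>=<region name>.<encoded>. state=<STATE>, ts=<epoch millis>
RIT_PATTERN = re.compile(
    r"(?P<enc>[0-9a-f]{32})=(?P<name>.+?)\.(?P=enc)\.\s*"
    r"state=(?P<state>[A-Z_]+),\s*ts=(?P<ts>\d+)"
)


def parse_regions_in_transition(log_text):
    """Return one dict per region-in-transition entry found in log_text."""
    regions = []
    for match in RIT_PATTERN.finditer(log_text):
        ts_ms = int(match.group("ts"))
        # ts is milliseconds since the epoch; report it in UTC.
        since = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
        regions.append({
            "encoded_name": match.group("enc"),
            "region_name": match.group("name"),
            "state": match.group("state"),
            "since_utc": since.isoformat(),
        })
    return regions


# The first entry from the excerpt, unwrapped onto one logical line.
sample = (
    "2011-07-20 23:14:43,228 DEBUG org.apache.hadoop.hbase.master.HMaster: "
    "Not running balancer because 2 region(s) in transition: "
    "{1efae368e8d64cc59aeadbb3289fddac=S3Table,"
    "EStore_everbox_HlYLqw4bMBDuLsfFZXg0kesvVjg=,"
    "1305255180483.1efae368e8d64cc59aeadbb3289fddac. "
    "state=PENDING_CLOSE, ts=1311174872837, "
)

regions = parse_regions_in_transition(sample)
for region in regions:
    print(region["encoded_name"], region["state"], region["since_utc"])
```

Searching the master log for each encoded name it prints should show
the timeout and reassign history for that region.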
CDH is a bit behind on the HBase bug fixes. You should consider
coming up to 0.90.3 (I'm working on putting out 0.90.4). We've fixed
some bugs in and around here (see CHANGES.txt).
> And operation can I make to bypass this issue?
Do you see these regions actually open out on the cluster? If so,
killing the master might fix it; otherwise, deleting these regions'
entries in zk might do it. Failing that, we'll need to look deeper.
St.Ack