Hi
this is the stacktrace of the master process till restart:

2011-11-16 09:28:20,007 INFO org.apache.hadoop.hbase.master.LoadBalancer:
Skipping load balancing.  servers=51 regions=76793 average=1505.7451
mostloaded=1506 leastloaded=1506
2011-11-16 09:29:51,276 INFO org.apache.hadoop.hbase.master.ServerManager:
Registering server=hdp1156,60020,1321432190747, regionCount=0,
userLoad=false
2011-11-16 09:34:35,646 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321432287936
2011-11-16 09:34:35,646 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:34:35,647 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:37:35,648 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321432475647
2011-11-16 09:43:35,651 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:43:35,652 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:46:35,653 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433015652
2011-11-16 09:49:35,656 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:49:35,657 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:49:36,305 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration [hdp1105,60020,1321217721376]
2011-11-16 09:49:36,305 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
logs for hdp1105,60020,1321217721376
2011-11-16 09:49:36,309 ERROR
org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
processing event M_SERVER_SHUTDOWN
java.lang.NullPointerException
        at
org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1781)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-11-16 09:52:35,657 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433375656
2011-11-16 09:52:35,657 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:52:35,658 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:54:58,681 INFO org.apache.hadoop.hbase.master.HMaster:
Balance=true
2011-11-16 09:55:35,659 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433555658
2011-11-16 09:58:35,661 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:58:35,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:01:35,663 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433915662
2011-11-16 10:01:35,663 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:01:35,668 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:02:35,647 INFO org.apache.hadoop.hbase.master.HMaster:
Balance=true
2011-11-16 10:04:35,665 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434095664
2011-11-16 10:04:35,665 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:04:35,671 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:04:38,420 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=hdz1072:2281,hdz1071:2281,hdz1070:2281
sessionTimeout=60000 watcher=hconnection
2011-11-16 10:04:38,421 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hdz1071/172.17.0.71:2281
2011-11-16 10:04:38,423 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hdz1071/172.17.0.71:2281, initiating session
2011-11-16 10:04:38,427 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server hdz1071/172.17.0.71:2281, sessionid =
0x132deb2ddbb9560, negotiated timeout = 40000
2011-11-16 10:04:45,243 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x132deb2ddbb9560
2011-11-16 10:04:45,244 INFO org.apache.zookeeper.ZooKeeper: Session:
0x132deb2ddbb9560 closed
2011-11-16 10:04:45,244 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-11-16 10:05:44,032 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration [hdp1156,60020,1321432190747]
2011-11-16 10:05:44,033 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
logs for hdp1156,60020,1321432190747
2011-11-16 10:05:44,034 ERROR
org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
processing event M_SERVER_SHUTDOWN
java.lang.NullPointerException
        at
org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1781)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-11-16 10:05:49,678 INFO org.apache.hadoop.hbase.master.ServerManager:
Registering server=hdp1156,60020,1321434349153, regionCount=0,
userLoad=false
2011-11-16 10:07:35,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434275666
2011-11-16 10:07:35,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:07:35,674 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:09:30,671 INFO org.apache.hadoop.hbase.master.HMaster:
Balance=true
2011-11-16 10:10:35,669 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434455668
2011-11-16 10:10:35,669 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:10:35,693 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:13:35,671 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434635670
2011-11-16 10:16:35,673 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:16:35,679 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
Wed Nov 16 10:19:33 CET 2011 Killing master
Wed Nov 16 10:19:37 CET 2011 Starting master on hdm1004
ulimit -n 65536
2011-11-16 10:19:38,823 INFO org.apache.hadoop.hbase.ipc.HBaseRpcMetrics:
Initializing RPC Metrics with hostName=HMaster, port=60000
2011-11-16 10:19:39,086 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder: starting
2011-11-16 10:19:39,086 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60000: starting
2011-11-16 10:19:39,088 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60000: starting
...................

On Wed, Nov 16, 2011 at 12:52 PM, <[email protected]> wrote:

> Can you post the stack trace and relevant log snippets ?
>
> Thanks
>
>
>
> On Nov 16, 2011, at 2:40 AM, Matthias Hofschen <[email protected]> wrote:
>
> > Hi,
> > we had a case today where the loadbalancer stopped working.
> > (cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved
> to
> > another node. Shortly thereafter the regionserver of that region was
> > stopped. In the master logs we see that master is trying to contact this
> > regionserver to move the region. At the same time we had about 1000
> regions
> > missing because of the stopped regionserver. These where not assigned to
> > another regionserver. I suspect that the the load balancer on the master
> > machine was blocked by trying to move the one region which was not
> > possible. We then restarted the master which solved the problem.
> >
> > Is this a known problem, should I log an issue? I do have a stacktrace
> from
> > the master machine.
> >
> > Cheers Matthias
>

Reply via email to