Hi all!

I have setup one master and 5 regionservers to collect log data. But every ~24 
hours, at random times, the regionservers generating a fatal error and all 
stopping one by one. Eventually the master will stop. I also see some weird 
characters before the server names in the logs. Seems like some encoding issue.

I have read in the documentation, that if the garbage collection is taking to 
long, you will also get the session expired message. But I have logged the GC 
on the master, and it seems oke. Could someone help me figure out why this is 
happening?

Furthermore, I am currently monitoring the memory usage of the master with JMX. 
I notice that the heap size is slowly growing. Could there be a memory leakage?
xmx is set to 1gb.

Setup:
hbase 0.94.20
hadoop 1.2.1
debian wheezy

Thanks in advice,

Ron

Logs of master:
===============

2014-08-23 07:00:20,104 WARN 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient 
ZooKeeper exception: 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase/unassigned/70236052
2014-08-23 07:00:20,406 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
server vps2060.directvps.nl,60020,1408691165501 reported a fatal error:
ABORTING region server vps2060.directvps.nl,60020,1408691165501: 
regionserver:60020-0x347fc15265a00eb-0x347fc15265a00eb-0x347fc15265a00eb 
regionserver:60020-0x347fc15265a00eb-0x347fc15265a00eb-0x347fc15265a00eb 
received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2014-08-23 07:00:20,911 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
server vps2057.directvps.nl,60020,1408691165499 reported a fatal error:
ABORTING region server vps2057.directvps.nl,60020,1408691165499: 
regionserver:60020-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea
 
regionserver:60020-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea
 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2014-08-23 07:00:21,001 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
server vps2059.directvps.nl,60020,1408691165851 reported a fatal error:
ABORTING region server vps2059.directvps.nl,60020,1408691165851: 
regionserver:60020-0x147fc1616d200bb-0x147fc1616d200bb-0x147fc1616d200bb 
regionserver:60020-0x147fc1616d200bb-0x147fc1616d200bb-0x147fc1616d200bb 
received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2014-08-23 07:00:21,056 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
server vps2058.directvps.nl,60020,1408691165675 reported a fatal error:
ABORTING region server vps2058.directvps.nl,60020,1408691165675: 
regionserver:60020-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec
 
regionserver:60020-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec
 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2014-08-23 07:00:22,140 WARN 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient 
ZooKeeper exception: 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/unassigned/70236052
2014-08-23 07:00:26,141 WARN 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient 
ZooKeeper exception: 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/unassigned/70236052
2014-08-23 07:00:34,114 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
server vps2056.directvps.nl,60020,1408691165439 reported a fatal error:
ABORTING region server vps2056.directvps.nl,60020,1408691165439: Unexpected 
exception handling nodeDeleted event
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:172)
    at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:420)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDeleted(ZooKeeperNodeTracker.java:182)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2014-08-23 07:00:34,118 ERROR org.apache.hadoop.hbase.master.HMaster: Region 
server vps2056.directvps.nl,60020,1408691165439 reported a fatal error:
ABORTING region server vps2056.directvps.nl,60020,1408691165439: 
regionserver:60020-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2
 
regionserver:60020-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2
 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

2014-08-23 07:00:34,141 WARN 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient 
ZooKeeper exception: 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/unassigned/70236052
2014-08-23 07:00:34,142 ERROR 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper getData 
failed after 3 retries
2014-08-23 07:00:34,152 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001
 Unable to get data of znode /hbase/unassigned/70236052
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/unassigned/70236052
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852)
    at 
org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274)
    at 
org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255)
    at 
org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382)
    at 
org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:34,152 ERROR 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001
 Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/unassigned/70236052
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852)
    at 
org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274)
    at 
org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255)
    at 
org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382)
    at 
org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:34,163 FATAL org.apache.hadoop.hbase.master.HMaster: Master 
server abort: loaded coprocessors are: []
2014-08-23 07:00:34,215 WARN 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node 
/hbase/backup-masters/vps2008.directvps.nl,60000,1408691163492 already deleted, 
and this is not a retry
2014-08-23 07:05:34,165 WARN org.apache.hadoop.hbase.master.SplitLogManager: 
Interrupted while waiting for log splits to be completed
2014-08-23 07:05:34,179 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected ZK exception reading unassigned node for region=70236052
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/unassigned/70236052
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852)
    at 
org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274)
    at 
org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255)
    at 
org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382)
    at 
org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:05:34,179 WARN org.apache.hadoop.hbase.master.SplitLogManager: 
error while splitting logs in 
[hdfs://namenode.openindex.io:8020/hbase/.logs/vps2058.directvps.nl,60020,1408691165675-splitting]
 installed = 2 but only 0 done
2014-08-23 07:05:34,184 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server 
on 60000
2014-08-23 07:05:34,185 WARN org.apache.hadoop.hbase.master.CatalogJanitor: 
Failed scan of catalog table
java.io.IOException: Giving up after tries=1
    at 
org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:210)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:188)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:82)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:67)
    at 
org.apache.hadoop.hbase.master.CatalogJanitor.getSplitParents(CatalogJanitor.java:126)
    at 
org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:137)
    at 
org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:93)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:67)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at 
org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:207)
    ... 8 more
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 5 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 6 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server 
handler 0 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 2 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 3 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 4 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 7 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 8 on 60000: exiting
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 9 on 60000: exiting
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server 
handler 2 on 60000: exiting
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server listener on 60000
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server Responder
2014-08-23 07:05:34,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC 
Server Responder
2014-08-23 07:05:34,212 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server 
handler 1 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 0 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
handler 1 on 60000: exiting
2014-08-23 07:05:34,256 INFO org.mortbay.log: Stopped 
[email protected]:60010
2014-08-23 07:05:34,259 FATAL org.apache.hadoop.hbase.master.HMaster: Master 
server abort: loaded coprocessors are: []
2014-08-23 07:05:34,260 FATAL org.apache.hadoop.hbase.master.HMaster: 
master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001-0x347fcb5a0130001
 
master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001-0x347fcb5a0130001
 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:05:34,414 ERROR 
org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: HMaster Aborted
    at 
org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
    at 
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2129)

GC log:
=======

0.185: Application time: 0.1304320 seconds
0.185: [GC0.185: [ParNew: 4288K->511K(4800K), 0.0120520 secs] 
4288K->1008K(15424K), 0.0121600 secs] [Times: user=0.01 sys=0.01, real=0.01 
secs] 
0.197: Total time for which application threads were stopped: 0.0126240 seconds
Heap
 par new generation   total 4800K, used 3580K [0x00000000b7200000, 
0x00000000b7730000, 0x00000000c1860000)
  eden space 4288K,  71% used [0x00000000b7200000, 0x00000000b74ff328, 
0x00000000b7630000)
  from space 512K,  99% used [0x00000000b76b0000, 0x00000000b772fff8, 
0x00000000b7730000)
  to   space 512K,   0% used [0x00000000b7630000, 0x00000000b7630000, 
0x00000000b76b0000)
 concurrent mark-sweep generation total 10624K, used 496K [0x00000000c1860000, 
0x00000000c22c0000, 0x00000000f5a00000)
 concurrent-mark-sweep perm gen total 21248K, used 6688K [0x00000000f5a00000, 
0x00000000f6ec0000, 0x0000000100000000)
0.370: Application time: 0.1728650 seconds

Reply via email to