Hi all! I have set up one master and 5 regionservers to collect log data. But roughly every 24 hours, at random times, the regionservers report a fatal error and stop one by one. Eventually the master stops as well. I also see some weird characters before the server names in the logs, which looks like an encoding issue.
I have read in the documentation that if garbage collection takes too long, you will also get the session expired message. But I have logged the GC on the master, and it seems OK. Could someone help me figure out why this is happening?

Furthermore, I am currently monitoring the memory usage of the master with JMX. I notice that the heap size is slowly growing. Could there be a memory leak? Xmx is set to 1 GB.

Setup:
hbase 0.94.20
hadoop 1.2.1
debian wheezy

Thanks in advance,
Ron

Logs of master:
===============
2014-08-23 07:00:20,104 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/70236052
2014-08-23 07:00:20,406 ERROR org.apache.hadoop.hbase.master.HMaster: Region server vps2060.directvps.nl,60020,1408691165501 reported a fatal error:
ABORTING region server vps2060.directvps.nl,60020,1408691165501: regionserver:60020-0x347fc15265a00eb-0x347fc15265a00eb-0x347fc15265a00eb regionserver:60020-0x347fc15265a00eb-0x347fc15265a00eb-0x347fc15265a00eb received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:20,911 ERROR org.apache.hadoop.hbase.master.HMaster: Region server vps2057.directvps.nl,60020,1408691165499 reported a fatal error:
ABORTING region server vps2057.directvps.nl,60020,1408691165499: regionserver:60020-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea
regionserver:60020-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:21,001 ERROR org.apache.hadoop.hbase.master.HMaster: Region server vps2059.directvps.nl,60020,1408691165851 reported a fatal error:
ABORTING region server vps2059.directvps.nl,60020,1408691165851: regionserver:60020-0x147fc1616d200bb-0x147fc1616d200bb-0x147fc1616d200bb regionserver:60020-0x147fc1616d200bb-0x147fc1616d200bb-0x147fc1616d200bb received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:21,056 ERROR org.apache.hadoop.hbase.master.HMaster: Region server vps2058.directvps.nl,60020,1408691165675 reported a fatal error:
ABORTING region server vps2058.directvps.nl,60020,1408691165675: regionserver:60020-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec regionserver:60020-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:22,140 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/70236052
2014-08-23 07:00:26,141 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/70236052
2014-08-23 07:00:34,114 ERROR org.apache.hadoop.hbase.master.HMaster: Region server vps2056.directvps.nl,60020,1408691165439 reported a fatal error:
ABORTING region server vps2056.directvps.nl,60020,1408691165439: Unexpected exception handling nodeDeleted event
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/master
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:172)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:420)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDeleted(ZooKeeperNodeTracker.java:182)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:34,118 ERROR org.apache.hadoop.hbase.master.HMaster: Region server vps2056.directvps.nl,60020,1408691165439 reported a fatal error:
ABORTING region server vps2056.directvps.nl,60020,1408691165439: regionserver:60020-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2 regionserver:60020-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2 received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:34,141 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/70236052
2014-08-23 07:00:34,142 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 3 retries
2014-08-23 07:00:34,152 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001 Unable to get data of znode /hbase/unassigned/70236052
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/70236052
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852)
    at org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274)
    at org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255)
    at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382)
    at org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:34,152 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001 Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/70236052
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852)
    at org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274)
    at org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255)
    at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382)
    at org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:00:34,163 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2014-08-23 07:00:34,215 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/backup-masters/vps2008.directvps.nl,60000,1408691163492 already deleted, and this is not a retry
2014-08-23 07:05:34,165 WARN org.apache.hadoop.hbase.master.SplitLogManager: Interrupted while waiting for log splits to be completed
2014-08-23 07:05:34,179 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected ZK exception reading unassigned node for region=70236052
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/70236052
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852)
    at org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274)
    at org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255)
    at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382)
    at org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:05:34,179 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://namenode.openindex.io:8020/hbase/.logs/vps2058.directvps.nl,60020,1408691165675-splitting] installed = 2 but only 0 done
2014-08-23 07:05:34,184 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
2014-08-23 07:05:34,185 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
java.io.IOException: Giving up after tries=1
    at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:210)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:188)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:82)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:67)
    at org.apache.hadoop.hbase.master.CatalogJanitor.getSplitParents(CatalogJanitor.java:126)
    at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:137)
    at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:93)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:67)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:207)
    ... 8 more
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 0 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60000: exiting
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000: exiting
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60000: exiting
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60000
2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2014-08-23 07:05:34,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2014-08-23 07:05:34,212 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60000: exiting
2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000: exiting
2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000: exiting
2014-08-23 07:05:34,256 INFO org.mortbay.log: Stopped [email protected]:60010
2014-08-23 07:05:34,259 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2014-08-23 07:05:34,260 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001-0x347fcb5a0130001 master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001-0x347fcb5a0130001 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2014-08-23 07:05:34,414 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: HMaster Aborted
    at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2129)

GC log:
=======
0.185: Application time: 0.1304320 seconds
0.185: [GC0.185: [ParNew: 4288K->511K(4800K), 0.0120520 secs] 4288K->1008K(15424K), 0.0121600 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
0.197: Total time for which application threads were stopped: 0.0126240 seconds
Heap
 par new generation total 4800K, used 3580K [0x00000000b7200000, 0x00000000b7730000, 0x00000000c1860000)
  eden space 4288K, 71% used [0x00000000b7200000, 0x00000000b74ff328, 0x00000000b7630000)
  from space 512K, 99% used [0x00000000b76b0000, 0x00000000b772fff8, 0x00000000b7730000)
  to space 512K, 0% used [0x00000000b7630000, 0x00000000b7630000, 0x00000000b76b0000)
 concurrent mark-sweep generation total 10624K, used 496K [0x00000000c1860000, 0x00000000c22c0000, 0x00000000f5a00000)
 concurrent-mark-sweep perm gen total 21248K, used 6688K [0x00000000f5a00000, 0x00000000f6ec0000, 0x0000000100000000)
0.370: Application time: 0.1728650 seconds
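
To make "the GC seems OK" easier to verify, the GC log can be scanned for long stop-the-world pauses. The following is only a minimal sketch: `find_long_pauses` is a hypothetical helper name, and the threshold is an assumption that should be set to whatever ZooKeeper session timeout is actually in effect on the cluster. It relies solely on the `Total time for which application threads were stopped` lines that appear in the GC log above.

```python
import re

# Matches JVM safepoint lines such as:
#   0.197: Total time for which application threads were stopped: 0.0126240 seconds
# No line anchors are used, so it also works on a log pasted as one long line.
PAUSE_RE = re.compile(
    r"(\d+\.\d+): Total time for which application threads were stopped: "
    r"(\d+\.\d+) seconds"
)


def find_long_pauses(gc_log_text, threshold_s):
    """Return (jvm_uptime_s, pause_s) pairs for pauses of at least threshold_s.

    threshold_s is an assumed value chosen by the caller; it is not read
    from any HBase or ZooKeeper configuration.
    """
    pauses = []
    for match in PAUSE_RE.finditer(gc_log_text):
        uptime_s = float(match.group(1))
        pause_s = float(match.group(2))
        if pause_s >= threshold_s:
            pauses.append((uptime_s, pause_s))
    return pauses


# Sample taken from the GC log in this post.
sample = (
    "0.185: Application time: 0.1304320 seconds "
    "0.197: Total time for which application threads were stopped: "
    "0.0126240 seconds"
)
for uptime_s, pause_s in find_long_pauses(sample, threshold_s=0.01):
    print("pause of %.4f s at JVM uptime %.3f s" % (pause_s, uptime_s))
```

Running the same function over the full GC log of each regionserver, not only the master, with a threshold near the session timeout would show whether any pause is long enough to explain the expirations.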
