Have you checked zookeeper logs ? What zookeeper release are you using ? bq. Could there be a memory leakage?
You can use jmap to capture heap memory details: http://docs.oracle.com/javase/6/docs/technotes/tools/share/jmap.html Cheers On Mon, Aug 25, 2014 at 4:51 AM, Ron van der Vegt < [email protected]> wrote: > Hi all! > > I have setup one master and 5 regionservers to collect log data. But every > ~24 hours, at random times, the regionservers generating a fatal error and > all stopping one by one. Eventually the master will stop. I also see some > weird characters before the server names in the logs. Seems like some > encoding issue. > > I have read in the documentation, that if the garbage collection is taking > to long, you will also get the session expired message. But I have logged > the GC on the master, and it seems oke. Could someone help me figure out > why this is happening? > > Furthermore, I am currently monitoring the memory usage of the master with > JMX. I notice that the heap size is slowly growing. Could there be a memory > leakage? > xmx is set to 1gb. > > Setup: > hbase 0.94.20 > hadoop 1.2.1 > debian wheezy > > Thanks in advice, > > Ron > > Logs of master: > =============== > > 2014-08-23 07:00:20,104 WARN > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient > ZooKeeper exception: > org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /hbase/unassigned/70236052 > 2014-08-23 07:00:20,406 ERROR org.apache.hadoop.hbase.master.HMaster: > Region server vps2060.directvps.nl,60020,1408691165501 reported a fatal > error: > ABORTING region server vps2060.directvps.nl,60020,1408691165501: > regionserver:60020-0x347fc15265a00eb-0x347fc15265a00eb-0x347fc15265a00eb > regionserver:60020-0x347fc15265a00eb-0x347fc15265a00eb-0x347fc15265a00eb > received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > > 2014-08-23 07:00:20,911 ERROR org.apache.hadoop.hbase.master.HMaster: > Region server vps2057.directvps.nl,60020,1408691165499 reported a fatal > error: > ABORTING region server vps2057.directvps.nl,60020,1408691165499: > regionserver:60020-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea > regionserver:60020-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea-0x347fc15265a00ea > received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > > 2014-08-23 07:00:21,001 ERROR org.apache.hadoop.hbase.master.HMaster: > Region server vps2059.directvps.nl,60020,1408691165851 reported a fatal > error: > ABORTING region server vps2059.directvps.nl,60020,1408691165851: > regionserver:60020-0x147fc1616d200bb-0x147fc1616d200bb-0x147fc1616d200bb > regionserver:60020-0x147fc1616d200bb-0x147fc1616d200bb-0x147fc1616d200bb > received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > > 2014-08-23 07:00:21,056 ERROR org.apache.hadoop.hbase.master.HMaster: > Region server vps2058.directvps.nl,60020,1408691165675 reported a fatal > error: > ABORTING region server vps2058.directvps.nl,60020,1408691165675: > regionserver:60020-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec > regionserver:60020-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec-0x347fc15265a00ec > received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > > 2014-08-23 07:00:22,140 WARN > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient > ZooKeeper exception: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/unassigned/70236052 > 2014-08-23 07:00:26,141 WARN > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient > ZooKeeper exception: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/unassigned/70236052 > 2014-08-23 07:00:34,114 ERROR org.apache.hadoop.hbase.master.HMaster: > Region server vps2056.directvps.nl,60020,1408691165439 reported a fatal > error: > ABORTING region server vps2056.directvps.nl,60020,1408691165439: > Unexpected exception handling nodeDeleted event > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/master > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:172) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:420) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDeleted(ZooKeeperNodeTracker.java:182) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > > 2014-08-23 07:00:34,118 ERROR org.apache.hadoop.hbase.master.HMaster: > Region server vps2056.directvps.nl,60020,1408691165439 reported a fatal > error: > ABORTING region server vps2056.directvps.nl,60020,1408691165439: > regionserver:60020-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2 > regionserver:60020-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2-0x247fc16c80500d2 > received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > > 2014-08-23 07:00:34,141 WARN > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient > ZooKeeper exception: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/unassigned/70236052 > 2014-08-23 07:00:34,142 ERROR > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper getData > failed after 3 retries > 2014-08-23 07:00:34,152 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: > master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001 > Unable to get data of znode /hbase/unassigned/70236052 > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/unassigned/70236052 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685) > at > org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852) > at > org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274) > at > org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255) > at > org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382) > at > org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > 2014-08-23 07:00:34,152 ERROR > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: > master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001 > Received unexpected KeeperException, re-throwing exception > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/unassigned/70236052 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685) > at > org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852) > at > org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274) > at > org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255) > at > org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382) > at > org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > 2014-08-23 07:00:34,163 FATAL org.apache.hadoop.hbase.master.HMaster: > Master server abort: loaded coprocessors are: [] > 2014-08-23 07:00:34,215 WARN > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node > /hbase/backup-masters/vps2008.directvps.nl,60000,1408691163492 already > deleted, and this is not a retry > 2014-08-23 07:05:34,165 WARN > org.apache.hadoop.hbase.master.SplitLogManager: Interrupted while waiting > for log splits to be completed > 2014-08-23 07:05:34,179 FATAL org.apache.hadoop.hbase.master.HMaster: > Unexpected ZK exception reading unassigned node for region=70236052 > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /hbase/unassigned/70236052 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:709) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:685) > at > org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:852) > at > org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRegion(AssignmentManager.java:3274) > at > org.apache.hadoop.hbase.master.AssignmentManager.isCarryingRoot(AssignmentManager.java:3255) > at > org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:382) > at > org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeDeleted(RegionServerTracker.java:122) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:318) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > 2014-08-23 07:05:34,179 WARN > org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs > in [hdfs:// > namenode.openindex.io:8020/hbase/.logs/vps2058.directvps.nl,60020,1408691165675-splitting] > installed = 2 but only 0 done > 2014-08-23 07:05:34,184 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > server on 60000 > 2014-08-23 07:05:34,185 WARN > org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table > java.io.IOException: Giving up after tries=1 > at > org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:210) > at > org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:188) > at > org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:82) > at > org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:67) > at > org.apache.hadoop.hbase.master.CatalogJanitor.getSplitParents(CatalogJanitor.java:126) > at > org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:137) > at > org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:93) > at org.apache.hadoop.hbase.Chore.run(Chore.java:67) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:207) > ... 8 more > 2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 5 on 60000: exiting > 2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 6 on 60000: exiting > 2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC > Server handler 0 on 60000: exiting > 2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 2 on 60000: exiting > 2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 3 on 60000: exiting > 2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 4 on 60000: exiting > 2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 7 on 60000: exiting > 2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 8 on 60000: exiting > 2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 9 on 60000: exiting > 2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC > Server handler 2 on 60000: exiting > 2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > IPC Server listener on 60000 > 2014-08-23 07:05:34,213 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > IPC Server Responder > 2014-08-23 07:05:34,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > IPC Server Responder > 2014-08-23 07:05:34,212 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC > Server handler 1 on 60000: exiting > 2014-08-23 07:05:34,186 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 0 on 60000: exiting > 2014-08-23 07:05:34,185 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 1 on 60000: exiting > 2014-08-23 07:05:34,256 INFO org.mortbay.log: Stopped > [email protected]:60010 > 2014-08-23 07:05:34,259 FATAL org.apache.hadoop.hbase.master.HMaster: > Master server abort: loaded coprocessors are: [] > 2014-08-23 07:05:34,260 FATAL org.apache.hadoop.hbase.master.HMaster: > master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001-0x347fcb5a0130001 > master:60000-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x147fc1616d200ba-0x347fcb5a0130000-0x247ffe833880001-0x247ffe833880001-0x247ffe833880001-0x347fcb5a0130001 > received expired from ZooKeeper, aborting > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:384) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:303) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > 2014-08-23 07:05:34,414 ERROR > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master > java.lang.RuntimeException: HMaster Aborted > at > org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160) > at > org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at > org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76) > at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2129) > > GC log: > ======= > > 0.185: Application time: 0.1304320 seconds > 0.185: [GC0.185: [ParNew: 4288K->511K(4800K), 0.0120520 secs] > 4288K->1008K(15424K), 0.0121600 secs] [Times: user=0.01 sys=0.01, real=0.01 > secs] > 0.197: Total time for which application threads were stopped: 0.0126240 > seconds > Heap > par new generation total 4800K, used 3580K [0x00000000b7200000, > 0x00000000b7730000, 0x00000000c1860000) > eden space 4288K, 71% used [0x00000000b7200000, 0x00000000b74ff328, > 0x00000000b7630000) > from space 512K, 99% used [0x00000000b76b0000, 0x00000000b772fff8, > 0x00000000b7730000) > to space 512K, 0% used [0x00000000b7630000, 0x00000000b7630000, > 0x00000000b76b0000) > concurrent mark-sweep generation total 10624K, used 496K > [0x00000000c1860000, 0x00000000c22c0000, 0x00000000f5a00000) > concurrent-mark-sweep perm gen total 21248K, used 6688K > [0x00000000f5a00000, 0x00000000f6ec0000, 0x0000000100000000) > 0.370: Application time: 0.1728650 seconds > >
