I'm using Hortonworks HDP. My HBase cluster has 14 nodes, and one of them crashed today (no job was running at the time). I looked at the crashed region server's log and found that its logging paused for about 6 minutes (the log level is INFO); then that RS crashed.
The RS's log is as follows:

2014-04-03 08:36:52,132 INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] regionserver.HRegionServer: Post open deploy tasks for region=QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
2014-04-03 08:36:52,136 INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] catalog.MetaEditor: Updated row QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed. with server=b05.jsepc.com,60020,1396421585047
*2014-04-03 08:36:52,136* INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] regionserver.HRegionServer: Finished post open deploy task for QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
*2014-04-03 08:43:13,535* INFO [RpcServer.handler=96,port=60020] hdfs.DFSClient: Will fetch a new access token and retry, access token was invalid when connecting to /10.134.101.119:50010 : org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error for OP_READ_BLOCK, self=/10.134.101.115:34507, remote=/10.134.101.119:50010, for file /apps/hbase/data/data/default/E_MP_DAY_READ_20140322/b396e822aac1d59c2e52a9bb072c9d2f/info/030a5559a655442e9e86700fe53facfe, for pool BP-898918553-10.134.101.112-1393904898674 block 1075306210_1569446
2014-04-03 08:43:27,624 INFO [RpcServer.handler=61,port=60020] regionserver.HRegion: writing data to region QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267. with WAL disabled. Data may be lost in the event of a crash.
2014-04-03 08:43:35,486 INFO [RpcServer.handler=40,port=60020] regionserver.HRegion: writing data to region QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267. with WAL disabled. Data may be lost in the event of a crash.
2014-04-03 08:43:35,640 WARN [Thread-16] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [5c8a662eacc9e712378ea48819a74267]
2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(a03.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41884ms for sessionid 0x245211f9a740011, closing socket connection and attempting reconnect
2014-04-03 08:44:14,390 WARN [regionserver60020] util.Sleeper: We slept 38975ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-04-03 08:44:14,390 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 46085ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-04-03 08:44:14,391 WARN [ResponseProcessor for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(b05.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41884ms for sessionid 0x545211f4de9001e, closing socket connection and attempting reconnect
2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(a02.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 48443ms for sessionid 0x145211fd930001f, closing socket connection and attempting reconnect
2014-04-03 08:44:14,391 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 37907ms
GC pool 'ParNew' had collection(s): count=1 time=31371ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=1577ms
2014-04-03 08:44:14,391 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 46094ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-04-03 08:44:14,416 WARN [DataStreamer for file /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737 block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: Error Recovery for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647 in pipeline 10.134.101.115:50010, 10.134.101.108:50010, 10.134.101.104:50010: bad datanode 10.134.101.115:50010
2014-04-03 08:44:14,460 WARN [DataStreamer for file /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737 block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737: File does not exist.
Holder DFSClient_hb_rs_b05.jsepc.com,60020,1396421585047_-1660168802_29 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2764)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2754)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:2682)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:585)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59584)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at $Proxy13.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:352)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at $Proxy14.getAdditionalDatanode(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:266)
    at $Proxy15.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:919)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
2014-04-03 08:44:14,494 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server b05.jsepc.com,60020,1396421585047: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
    ... 2 more
2014-04-03 08:44:14,495 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2014-04-03 08:44:14,525 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)

When logging resumed at 08:43:13, the RS reported the exceptions above and then crashed. The RS logged nothing at all from 08:36:52 to 08:43:13; is this gap what caused the crash? I suspected a full GC during those 6 minutes, so I checked the GC log on this RS, but I did not find any full GC, nor any single collection lasting more than a few tens of seconds. (At the end of this post I sketch how I understand the pause detection and session expiry that show up in these logs.)

*I looked up HMaster's log and found that its logging also paused for a long time (about 7 minutes).* HMaster's log is as follows:

2014-04-03 08:37:04,751 INFO [AM.ZK.Worker-pool2-t750] master.RegionStates: Transitioned {6d768061a22ccd020f0b43879dd9dd3b state=OPENING, ts=1396485424686, server=b07.jsepc.com,60020,1396437592290} to {6d768061a22ccd020f0b43879dd9dd3b state=OPEN, ts=1396485424751, server=b07.jsepc.com,60020,1396437592290}
*2014-04-03 08:37:04*,753 INFO [AM.ZK.Worker-pool2-t750] master.RegionStates: Onlined 6d768061a22ccd020f0b43879dd9dd3b on b07.jsepc.com,60020,1396437592290
*2014-04-03 08:44:26*,599 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [b05.jsepc.com,60020,1396421585047]
2014-04-03 08:44:26,900 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] handler.ServerShutdownHandler: Splitting logs for b05.jsepc.com,60020,1396421585047 before assignment.
2014-04-03 08:44:26,933 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] master.SplitLogManager: dead splitlog workers [b05.jsepc.com,60020,1396421585047]
2014-04-03 08:44:26,935 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] master.SplitLogManager: started splitting 4 logs in [hdfs://jsepc01HA/apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047-splitting]
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396479046831 acquired by a07.jsepc.com,60020,1396421598233
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396484294737 acquired by b08.jsepc.com,60020,1396421591104
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396480850188 acquired by a08.jsepc.com,60020,1396421590699
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396482573001 acquired by b09.jsepc.com,60020,1396421585070
2014-04-03 08:44:27,120 ERROR [RpcServer.handler=32,port=60000] master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a fatal error:
ABORTING region server b05.jsepc.com,60020,1396421585047: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
Cause:
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
    ... 2 more
2014-04-03 08:44:27,702 INFO [b03.jsepc.com,60000,1396421381897.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 4 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396482573001=last_update = 1396485866944 last_version = 2 cur_worker_name = b09.jsepc.com,60020,1396421585070 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396484294737=last_update = 1396485866943 last_version = 2 cur_worker_name = b08.jsepc.com,60020,1396421591104 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396479046831=last_update = 1396485866943 last_version = 2 cur_worker_name = a07.jsepc.com,60020,1396421598233 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396480850188=last_update = 1396485866943 last_version = 2 cur_worker_name = a08.jsepc.com,60020,1396421590699 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0}
2014-04-03 08:44:27,715 ERROR [RpcServer.handler=51,port=60000] master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a fatal error:
ABORTING region server b05.jsepc.com,60020,1396421585047: regionserver:60020-0x545211f4de9001e, quorum=b03.jsepc.com:2181,
a04.jsepc.com:2181,a03.jsepc.com:2181,a02.jsepc.com:2181,b05.jsepc.com:2181, baseZNode=/hbase-unsecure regionserver:60020-0x545211f4de9001e received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:401)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:319)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

I checked HMaster's GC log as well and did not find any GC taking longer than 10 seconds.

*Actually, I looked at the logs of other region servers that did not crash and found that they, too, paused for a long time at almost the same point.*

So the phenomenon is: the region servers and the HMaster stopped logging for several minutes (about 5 to 7 minutes), and some of the region servers crashed. Why would this happen? I changed the log level from the default DEBUG to INFO, but I don't think that alone could cause such a long gap in the logging. I'd appreciate any help.
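For reference, my understanding of the `util.JvmPauseMonitor` warning in the RS log: the monitor does not watch GC directly. It sleeps for a short, fixed interval and compares the requested sleep time with the wall-clock time that actually elapsed, so it catches any stop-the-world freeze of the JVM or the host (GC, swapping, a frozen VM, etc.). Below is a minimal sketch of that technique that I wrote for illustration; it is my simplification, not HBase's actual class:

```java
// Sketch of the pause-detection idea behind Hadoop/HBase's JvmPauseMonitor.
// Assumption: this is my own simplified illustration, not the real implementation.
public class PauseDetector implements Runnable {
    private static final long SLEEP_MS = 500;            // polling interval
    private static final long WARN_THRESHOLD_MS = 10000; // report pauses above this

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                return;
            }
            // If the whole process was frozen, the extra elapsed time shows up here.
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            long pauseMs = elapsedMs - SLEEP_MS;
            if (pauseMs > WARN_THRESHOLD_MS) {
                // Corresponds to the "Detected pause in JVM or host machine (eg GC):
                // pause of approximately 37907ms" line in my RS log.
                System.err.println("Detected pause of approximately " + pauseMs + "ms");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new PauseDetector(), "PauseDetector").start();
    }
}
```

The `util.Sleeper` warnings ("We slept 38975ms instead of 3000ms") are produced the same way, which is why I believe the whole process (or host) was frozen even though the GC log shows no full GC.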

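And my understanding of why the session expiry is fatal: a RegionServer's liveness is represented by an ephemeral znode tied to its ZooKeeper session. If the process is frozen for longer than the session timeout, the server expires the session, the ephemeral node disappears, and the master starts processing the RS as dead (hence the `YouAreDeadException` on the next report). When the frozen RS wakes up, its client receives an `Expired` event, and the only safe reaction is to abort. A minimal sketch of that event handling, using the plain ZooKeeper client API (my illustration, not HBase's `ZooKeeperWatcher`; the timeout value and ensemble address are placeholders):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch: why Disconnected is recoverable but Expired is not.
public class SessionWatcher implements Watcher {
    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // The server has already discarded the session and its ephemeral
            // nodes; this is where HBase logs "received expired from ZooKeeper,
            // aborting" and shuts the RS down.
            System.err.println("Session expired, aborting");
            System.exit(1);
        } else if (event.getState() == Event.KeeperState.Disconnected) {
            // Recoverable: the client library keeps reconnecting, and the
            // session may still be alive on another quorum member.
            System.err.println("Disconnected, client will retry");
        }
    }

    public static void main(String[] args) throws Exception {
        // 40000 ms is the timeout the client *requests*; the server negotiates
        // the effective value within its own min/max bounds (an assumption I
        // have not verified on my cluster, though the ~40s figures in the log
        // would be consistent with a negotiated timeout in that range).
        ZooKeeper zk = new ZooKeeper("a02.jsepc.com:2181", 40000, new SessionWatcher());
        Thread.sleep(Long.MAX_VALUE);
    }
}
```

If this reading is right, then whatever froze the processes for 5 to 7 minutes (it does not have to be GC) would explain both the logging gap and why only some region servers crashed: only the ones whose sessions expired got aborted.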