I'm using Hortonworks HDP. My HBase cluster has 14 nodes, and one of them crashed today (no job was running at the time). I looked at the crashed region server's log and found that its logging paused for about 6 minutes (the log level is INFO); then that RS crashed.
The RS's log is as follows:

2014-04-03 08:36:52,132 INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] regionserver.HRegionServer: Post open deploy tasks for region=QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
2014-04-03 08:36:52,136 INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] catalog.MetaEditor: Updated row QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed. with server=b05.jsepc.com,60020,1396421585047
*2014-04-03 08:36:52,136* INFO [PostOpenDeployTasks:4264e3258e009f99263dad3c8f2888ed] regionserver.HRegionServer: Finished post open deploy task for QSH_CONS_DATA_201403,e6666658,1396485424003.4264e3258e009f99263dad3c8f2888ed.
*2014-04-03 08:43:13,535* INFO [RpcServer.handler=96,port=60020] hdfs.DFSClient: Will fetch a new access token and retry, access token was invalid when connecting to /10.134.101.119:50010 : org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException: Got access token error for OP_READ_BLOCK, self=/10.134.101.115:34507, remote=/10.134.101.119:50010, for file /apps/hbase/data/data/default/E_MP_DAY_READ_20140322/b396e822aac1d59c2e52a9bb072c9d2f/info/030a5559a655442e9e86700fe53facfe, for pool BP-898918553-10.134.101.112-1393904898674 block 1075306210_1569446
2014-04-03 08:43:27,624 INFO [RpcServer.handler=61,port=60020] regionserver.HRegion: writing data to region QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267. with WAL disabled. Data may be lost in the event of a crash.
2014-04-03 08:43:35,486 INFO [RpcServer.handler=40,port=60020] regionserver.HRegion: writing data to region QSH_CONS_DATA_201403,0ccccccc,1396485424002.5c8a662eacc9e712378ea48819a74267. with WAL disabled. Data may be lost in the event of a crash.
2014-04-03 08:43:35,640 WARN [Thread-16] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [5c8a662eacc9e712378ea48819a74267]
2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(a03.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41884ms for sessionid 0x245211f9a740011, closing socket connection and attempting reconnect
2014-04-03 08:44:14,390 WARN [regionserver60020] util.Sleeper: We slept 38975ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-04-03 08:44:14,390 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 46085ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-04-03 08:44:14,391 WARN [ResponseProcessor for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(b05.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41884ms for sessionid 0x545211f4de9001e, closing socket connection and attempting reconnect
2014-04-03 08:44:14,390 INFO [regionserver60020-SendThread(a02.jsepc.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 48443ms for sessionid 0x145211fd930001f, closing socket connection and attempting reconnect
2014-04-03 08:44:14,391 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 37907ms
GC pool 'ParNew' had collection(s): count=1 time=31371ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=1577ms
2014-04-03 08:44:14,391 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 46094ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2014-04-03 08:44:14,416 WARN [DataStreamer for file /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737 block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: Error Recovery for block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647 in pipeline 10.134.101.115:50010, 10.134.101.108:50010, 10.134.101.104:50010: bad datanode 10.134.101.115:50010
2014-04-03 08:44:14,460 WARN [DataStreamer for file /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737 block BP-898918553-10.134.101.112-1393904898674:blk_1075358390_1621647] hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047/b05.jsepc.com%2C60020%2C1396421585047.1396484294737: File does not exist.
Holder DFSClient_hb_rs_b05.jsepc.com,60020,1396421585047_-1660168802_29 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2764)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2754)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:2682)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:585)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59584)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at $Proxy13.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:352)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at $Proxy14.getAdditionalDatanode(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:266)
    at $Proxy15.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:919)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
2014-04-03 08:44:14,494 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server b05.jsepc.com,60020,1396421585047: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
    ... 2 more
2014-04-03 08:44:14,495 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2014-04-03 08:44:14,525 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)

When logging resumed at 08:43:13, the RS reported the exceptions above and then crashed. The RS logged nothing at all from 08:36:52 to 08:43:13; is this gap what caused the crash? I suspected a full GC during those 6 minutes, so I checked the GC log on this RS, but I did not find any full GC, nor any single collection lasting more than a few tens of seconds. (At the end of this post I sketch how I understand the pause detection and session expiry that show up in these logs.)

*I looked up HMaster's log and found that its logging also paused for a long time (about 7 minutes).* HMaster's log is as follows:

2014-04-03 08:37:04,751 INFO [AM.ZK.Worker-pool2-t750] master.RegionStates: Transitioned {6d768061a22ccd020f0b43879dd9dd3b state=OPENING, ts=1396485424686, server=b07.jsepc.com,60020,1396437592290} to {6d768061a22ccd020f0b43879dd9dd3b state=OPEN, ts=1396485424751, server=b07.jsepc.com,60020,1396437592290}
*2014-04-03 08:37:04*,753 INFO [AM.ZK.Worker-pool2-t750] master.RegionStates: Onlined 6d768061a22ccd020f0b43879dd9dd3b on b07.jsepc.com,60020,1396437592290
*2014-04-03 08:44:26*,599 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [b05.jsepc.com,60020,1396421585047]
2014-04-03 08:44:26,900 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] handler.ServerShutdownHandler: Splitting logs for b05.jsepc.com,60020,1396421585047 before assignment.
2014-04-03 08:44:26,933 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] master.SplitLogManager: dead splitlog workers [b05.jsepc.com,60020,1396421585047]
2014-04-03 08:44:26,935 INFO [MASTER_SERVER_OPERATIONS-b03:60000-0] master.SplitLogManager: started splitting 4 logs in [hdfs://jsepc01HA/apps/hbase/data/WALs/b05.jsepc.com,60020,1396421585047-splitting]
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396479046831 acquired by a07.jsepc.com,60020,1396421598233
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396484294737 acquired by b08.jsepc.com,60020,1396421591104
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396480850188 acquired by a08.jsepc.com,60020,1396421590699
2014-04-03 08:44:26,941 INFO [main-EventThread] master.SplitLogManager: task /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396482573001 acquired by b09.jsepc.com,60020,1396421585070
2014-04-03 08:44:27,120 ERROR [RpcServer.handler=32,port=60000] master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a fatal error:
ABORTING region server b05.jsepc.com,60020,1396421585047: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
Cause:
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:985)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:832)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing b05.jsepc.com,60020,1396421585047 as dead server
    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1342)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
    at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1449)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:983)
    ... 2 more
2014-04-03 08:44:27,702 INFO [b03.jsepc.com,60000,1396421381897.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 4 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396482573001=last_update = 1396485866944 last_version = 2 cur_worker_name = b09.jsepc.com,60020,1396421585070 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396484294737=last_update = 1396485866943 last_version = 2 cur_worker_name = b08.jsepc.com,60020,1396421591104 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396479046831=last_update = 1396485866943 last_version = 2 cur_worker_name = a07.jsepc.com,60020,1396421598233 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0, /hbase-unsecure/splitWAL/WALs%2Fb05.jsepc.com%2C60020%2C1396421585047-splitting%2Fb05.jsepc.com%252C60020%252C1396421585047.1396480850188=last_update = 1396485866943 last_version = 2 cur_worker_name = a08.jsepc.com,60020,1396421590699 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 4 done = 0 error = 0}
2014-04-03 08:44:27,715 ERROR [RpcServer.handler=51,port=60000] master.HMaster: Region server b05.jsepc.com,60020,1396421585047 reported a fatal error:
ABORTING region server b05.jsepc.com,60020,1396421585047: regionserver:60020-0x545211f4de9001e, quorum=b03.jsepc.com:2181,
a04.jsepc.com:2181,a03.jsepc.com:2181,a02.jsepc.com:2181,b05.jsepc.com:2181, baseZNode=/hbase-unsecure regionserver:60020-0x545211f4de9001e received expired from ZooKeeper, aborting
Cause:
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:401)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:319)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

I checked HMaster's GC log as well and did not find any GC taking longer than 10 seconds.

*Actually, I looked at the logs of other region servers that did not crash and found that they, too, paused for a long time at almost the same point.*

So the phenomenon is: the region servers and the HMaster stopped logging for several minutes (about 5 to 7 minutes), and some of the region servers crashed. Why would this happen? I changed the log level from the default DEBUG to INFO, but I don't think that alone could cause such a long gap in the logging. I'd appreciate any help.
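For reference, my understanding of the `util.JvmPauseMonitor` warning in the RS log: the monitor does not watch GC directly. It sleeps for a short, fixed interval and compares the requested sleep time with the wall-clock time that actually elapsed, so it catches any stop-the-world freeze of the JVM or the host (GC, swapping, a frozen VM, etc.). Below is a minimal sketch of that technique that I wrote for illustration; it is my simplification, not HBase's actual class:

```java
// Sketch of the pause-detection idea behind Hadoop/HBase's JvmPauseMonitor.
// Assumption: this is my own simplified illustration, not the real implementation.
public class PauseDetector implements Runnable {
    private static final long SLEEP_MS = 500;            // polling interval
    private static final long WARN_THRESHOLD_MS = 10000; // report pauses above this

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                return;
            }
            // If the whole process was frozen, the extra elapsed time shows up here.
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            long pauseMs = elapsedMs - SLEEP_MS;
            if (pauseMs > WARN_THRESHOLD_MS) {
                // Corresponds to the "Detected pause in JVM or host machine (eg GC):
                // pause of approximately 37907ms" line in my RS log.
                System.err.println("Detected pause of approximately " + pauseMs + "ms");
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new PauseDetector(), "PauseDetector").start();
    }
}
```

The `util.Sleeper` warnings ("We slept 38975ms instead of 3000ms") are produced the same way, which is why I believe the whole process (or host) was frozen even though the GC log shows no full GC.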

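And my understanding of why the session expiry is fatal: a RegionServer's liveness is represented by an ephemeral znode tied to its ZooKeeper session. If the process is frozen for longer than the session timeout, the server expires the session, the ephemeral node disappears, and the master starts processing the RS as dead (hence the `YouAreDeadException` on the next report). When the frozen RS wakes up, its client receives an `Expired` event, and the only safe reaction is to abort. A minimal sketch of that event handling, using the plain ZooKeeper client API (my illustration, not HBase's `ZooKeeperWatcher`; the timeout value and ensemble address are placeholders):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch: why Disconnected is recoverable but Expired is not.
public class SessionWatcher implements Watcher {
    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // The server has already discarded the session and its ephemeral
            // nodes; this is where HBase logs "received expired from ZooKeeper,
            // aborting" and shuts the RS down.
            System.err.println("Session expired, aborting");
            System.exit(1);
        } else if (event.getState() == Event.KeeperState.Disconnected) {
            // Recoverable: the client library keeps reconnecting, and the
            // session may still be alive on another quorum member.
            System.err.println("Disconnected, client will retry");
        }
    }

    public static void main(String[] args) throws Exception {
        // 40000 ms is the timeout the client *requests*; the server negotiates
        // the effective value within its own min/max bounds (an assumption I
        // have not verified on my cluster, though the ~40s figures in the log
        // would be consistent with a negotiated timeout in that range).
        ZooKeeper zk = new ZooKeeper("a02.jsepc.com:2181", 40000, new SessionWatcher());
        Thread.sleep(Long.MAX_VALUE);
    }
}
```

If this reading is right, then whatever froze the processes for 5 to 7 minutes (it does not have to be GC) would explain both the logging gap and why only some region servers crashed: only the ones whose sessions expired got aborted.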