This week we are facing the same problem again. We have not changed any HBase configuration in at least the last four to five months.
All of a sudden we started seeing a pattern where regionservers get killed by a long GC pause followed by ZooKeeper session timeouts. We are using 5 GB for the HBase heap and 6 GB for the bucket cache. Following is the log from one of the regionservers:

2016-12-12 17:38:59,142 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 30938ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:40:25,076 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM5/192.168.190.179:4181, initiating session
2016-12-12 17:38:59,142 WARN [regionserver60020] util.Sleeper: We slept 19044ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:38:54,384 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 23805ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:38:16,281 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM3/192.168.167.7:4181, initiating session
2016-12-12 17:40:25,091 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM1/192.168.178.226:4181, initiating session
2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 128812ms for sessionid 0x558f30318e204de, closing socket connection and attempting reconnect
2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM2:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x558f30318e204df has expired, closing socket connection
2016-12-12 17:40:25,093 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051d has expired, closing socket connection
2016-12-12 17:40:25,089 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42156 lease expired on region PostsAnalysis-2016-11-5,exam,1480313370104.4e37b0f96946a104474a8edbba4f87fd.
2016-12-12 17:40:25,193 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42155 lease expired on region PostsAnalysis-2016-11-4,exam,1480313365296.4c80cf384fcdc7bfb7c83f625f936cfe.
2016-12-12 17:40:25,194 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server HOST30,60020,1481556258263: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:284) at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1087) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:912) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457) at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661) at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289) at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1085) ... 2 more
2016-12-12 17:40:25,087 DEBUG [BucketCache Statistics #0] bucket.BucketCache: BucketCache Stats: failedBlockAdditions=23240, total=4.00 GB, free=1.05 GB, usedSize=2.96 GB, cacheSize=2.54 GB, accesses=7557955, hits=6538991, IOhitsPerSecond=1432, IOTimePerHit=0.88, hitRatio=86.52%, cachingAccesses=7525976, cachingHits=6530392, cachingHitsRatio=86.77%, evictions=779, evicted=3334211, evictedPerRun=4280.11669921875
2016-12-12 17:40:25,304 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-12-12 17:40:25,116 WARN [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,105 WARN [ResponseProcessor for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521 java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.139.245:50366 remote=/192.168.139.245:45710] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1998) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2016-12-12 17:40:25,095 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051c
2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-12 17:40:25,095 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051c has expired, closing socket connection
2016-12-12 17:40:25,094 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,306 WARN [DataStreamer for file /Veooz-hbase/WALs/HOST30,60020,1481556258263/HOST30%2C60020%2C1481556258263.1481564167609 block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: Error Recovery for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521 in pipeline 192.168.139.245:45710, 192.168.212.113:45710, 192.168.195.93:45710: bad datanode 192.168.139.245:45710
2016-12-12 17:40:25,306 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051d
2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x558f30318e204df
2016-12-12 17:40:25,314 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-12 17:40:25,518 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
2016-12-12 17:40:25,518 INFO [regionserver60020] ipc.RpcServer: Stopping server on 60020
2016-12-12 17:40:25,518 INFO [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: stopping
2016-12-12 17:40:25,519 DEBUG [B.DefaultRpcServer.handler=0,queue=0,port=60020] wal.FSHLog: interrupted while waiting for notification from AsyncNotifier
2016-12-12 17:40:25,519 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped

Thanks,
Sandeep.
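P.S. Before we change anything else, we plan to turn on GC logging so we can see exactly how long each stop-the-world pause is and what triggered it. A minimal sketch of what we would add to hbase-env.sh is below; these are the standard HotSpot (JDK 7/8) GC-logging flags, they only add diagnostics and do not change collector behaviour, and the log path is just an example rather than something already in our configuration:

    # Diagnostics only: record every GC, its cause, and total stopped time.
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    export HBASE_OPTS="$HBASE_OPTS -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution"
    # Example log location; adjust to wherever the regionserver logs live.
    export HBASE_OPTS="$HBASE_OPTS -Xloggc:/var/log/hbase/gc-regionserver.log"

With -XX:+PrintGCApplicationStoppedTime in place, the 19-30 second sleeps reported by util.Sleeper above should line up with concrete entries in the GC log (for example a promotion failure or concurrent mode failure in CMS), which would confirm whether these really are full collections or something else entirely.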
________________________________
From: [email protected] <[email protected]> on behalf of Stack <[email protected]>
Sent: Tuesday, December 13, 2016 12:38:53 PM
To: Hbase-User
Subject: Re: RegionSevevers repeatedly getting killed with GC Pause and Zookeeper timeout

On Mon, Dec 12, 2016 at 10:22 PM, Sandeep Reddy <[email protected]> wrote:

> Hi,
>
> From past one week multiple regionservers are getting killed due to GC
> pause and zookeeper timeouts exceptions.
>

And previous to this week, all ran fine? If so, do you know what changed?

> We are trying to tune GC values with different HBASE_OPTS, but no luck.
>
> Following is our regionserver configuration:
>
>   * Hardware: 24GB RAM with 8 Cores
>   * hbase-env.sh configuration:
>       export HBASE_STOP_TIMEOUT="180"
>       export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -XX:+UseCMSInitiatingOccupancyOnly"
>       export HBASE_OPTS="$HBASE_OPTS -XX:MaxNewSize=128m -XX:NewSize=128m"
>       export HBASE_REGIONSERVER_OPTS="-Xmx5g -Xms5g"
>       export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=5g"
>       export HBASE_MANAGES_ZK=false
>       export HBASE_HEAPSIZE=4096
>   * zookeeper tickTime=2000
>

There's a bunch that has been written on tuning GC and GC in HBase in particular if you poke around on the net.

Can you give the server more heap? Above it says 5G. Can you go up from there or are you up against some ceiling.

What does it say caused the Full GC? Is it always the same thing?

Do you get a heap given you run with -XX:+HeapDumpOnOutOfMemoryError ? If so, open it in a profiler (JMC?). Usually the object retainer can be plainly seen.

St.Ack

> Thanks,
> Sandeep.
>
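Following up on the Full GC and heap-dump questions above: the regionserver is aborted by the master rather than hitting an OutOfMemoryError, so the -XX:+HeapDumpOnOutOfMemoryError dump may never be written. As a rough sketch of what we could run against a live regionserver with the stock JDK tools before the next pause (the <rs-pid> and the output paths are placeholders, nothing here is HBase-specific):

    # Cause of the current and most recent GC, sampled every 5 seconds
    jstat -gccause -t <rs-pid> 5000

    # Quick histogram of live objects (note: this forces a full GC)
    jmap -histo:live <rs-pid> > /tmp/rs-histo.txt

    # Full heap dump to open later in JMC, Eclipse MAT, or another profiler
    jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <rs-pid>

Whatever dominates the histogram or the dump (memstore, block cache, open scanners, ...) should answer whether it is always the same thing retaining the heap when the long pauses hit.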
