This week we are facing the same problem again. We have not changed any HBase configuration in at least the last four to five months.
All of a sudden we started seeing a pattern where regionservers get killed by a long GC pause followed by ZooKeeper session timeouts. We are using 5 GB for the HBase heap and 6 GB for the bucket cache. Following is the log from one of the regionservers:

2016-12-12 17:38:59,142 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 30938ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:40:25,076 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM5/192.168.190.179:4181, initiating session
2016-12-12 17:38:59,142 WARN [regionserver60020] util.Sleeper: We slept 19044ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:38:54,384 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 23805ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-12-12 17:38:16,281 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM3/192.168.167.7:4181, initiating session
2016-12-12 17:40:25,091 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM1/192.168.178.226:4181, initiating session
2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 128812ms for sessionid 0x558f30318e204de, closing socket connection and attempting reconnect
2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM2:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x558f30318e204df has expired, closing socket connection
2016-12-12 17:40:25,093 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051d has expired, closing socket connection
2016-12-12 17:40:25,089 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42156 lease expired on region PostsAnalysis-2016-11-5,exam,1480313370104.4e37b0f96946a104474a8edbba4f87fd.
2016-12-12 17:40:25,193 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42155 lease expired on region PostsAnalysis-2016-11-4,exam,1480313365296.4c80cf384fcdc7bfb7c83f625f936cfe.
2016-12-12 17:40:25,194 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server HOST30,60020,1481556258263: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:284) at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1087) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:912) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457) at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661) at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289) at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1085) ... 2 more
2016-12-12 17:40:25,087 DEBUG [BucketCache Statistics #0] bucket.BucketCache: BucketCache Stats: failedBlockAdditions=23240, total=4.00 GB, free=1.05 GB, usedSize=2.96 GB, cacheSize=2.54 GB, accesses=7557955, hits=6538991, IOhitsPerSecond=1432, IOTimePerHit=0.88, hitRatio=86.52%, cachingAccesses=7525976, cachingHits=6530392, cachingHitsRatio=86.77%, evictions=779, evicted=3334211, evictedPerRun=4280.11669921875
2016-12-12 17:40:25,304 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-12-12 17:40:25,116 WARN [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,105 WARN [ResponseProcessor for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521 java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.139.245:50366 remote=/192.168.139.245:45710] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1998) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
2016-12-12 17:40:25,095 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051c
2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-12 17:40:25,095 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051c has expired, closing socket connection
2016-12-12 17:40:25,094 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-12-12 17:40:25,306 WARN [DataStreamer for file /Veooz-hbase/WALs/HOST30,60020,1481556258263/HOST30%2C60020%2C1481556258263.1481564167609 block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: Error Recovery for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521 in pipeline 192.168.139.245:45710, 192.168.212.113:45710, 192.168.195.93:45710: bad datanode 192.168.139.245:45710
2016-12-12 17:40:25,306 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051d
2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x558f30318e204df
2016-12-12 17:40:25,314 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-12 17:40:25,518 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370) at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275) at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
2016-12-12 17:40:25,518 INFO [regionserver60020] ipc.RpcServer: Stopping server on 60020
2016-12-12 17:40:25,518 INFO [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: stopping
2016-12-12 17:40:25,519 DEBUG [B.DefaultRpcServer.handler=0,queue=0,port=60020] wal.FSHLog: interrupted while waiting for notification from AsyncNotifier
2016-12-12 17:40:25,519 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped

Thanks,
Sandeep.
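P.S. Before we change anything else, we plan to turn on GC logging so we can see exactly how long each stop-the-world pause is and what triggered it. A minimal sketch of what we would add to hbase-env.sh is below; these are the standard HotSpot (JDK 7/8) GC-logging flags, they only add diagnostics and do not change collector behaviour, and the log path is just an example rather than something already in our configuration:

    # Diagnostics only: record every GC, its cause, and total stopped time.
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    export HBASE_OPTS="$HBASE_OPTS -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution"
    # Example log location; adjust to wherever the regionserver logs live.
    export HBASE_OPTS="$HBASE_OPTS -Xloggc:/var/log/hbase/gc-regionserver.log"

With -XX:+PrintGCApplicationStoppedTime in place, the 19-30 second sleeps reported by util.Sleeper above should line up with concrete entries in the GC log (for example a promotion failure or concurrent mode failure in CMS), which would confirm whether these really are full collections or something else entirely.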
________________________________
From: [email protected] <[email protected]> on behalf of Stack <[email protected]>
Sent: Tuesday, December 13, 2016 12:38:53 PM
To: Hbase-User
Subject: Re: RegionSevevers repeatedly getting killed with GC Pause and Zookeeper timeout

On Mon, Dec 12, 2016 at 10:22 PM, Sandeep Reddy <[email protected]> wrote:

> Hi,
>
> From past one week multiple regionservers are getting killed due to GC
> pause and zookeeper timeouts exceptions.
>

And previous to this week, all ran fine? If so, do you know what changed?

> We are trying to tune GC values with different HBASE_OPTS, but no luck.
>
> Following is our regionserver configuration:
>
>   * Hardware: 24GB RAM with 8 Cores
>   * hbase-env.sh configuration:
>       export HBASE_STOP_TIMEOUT="180"
>       export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -XX:+UseCMSInitiatingOccupancyOnly"
>       export HBASE_OPTS="$HBASE_OPTS -XX:MaxNewSize=128m -XX:NewSize=128m"
>       export HBASE_REGIONSERVER_OPTS="-Xmx5g -Xms5g"
>       export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=5g"
>       export HBASE_MANAGES_ZK=false
>       export HBASE_HEAPSIZE=4096
>   * zookeeper tickTime=2000
>

There's a bunch that has been written on tuning GC and GC in HBase in particular if you poke around on the net.

Can you give the server more heap? Above it says 5G. Can you go up from there or are you up against some ceiling.

What does it say caused the Full GC? Is it always the same thing?

Do you get a heap given you run with -XX:+HeapDumpOnOutOfMemoryError ? If so, open it in a profiler (JMC?). Usually the object retainer can be plainly seen.

St.Ack

> Thanks,
> Sandeep.
>
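Following up on the Full GC and heap-dump questions above: the regionserver is aborted by the master rather than hitting an OutOfMemoryError, so the -XX:+HeapDumpOnOutOfMemoryError dump may never be written. As a rough sketch of what we could run against a live regionserver with the stock JDK tools before the next pause (the <rs-pid> and the output paths are placeholders, nothing here is HBase-specific):

    # Cause of the current and most recent GC, sampled every 5 seconds
    jstat -gccause -t <rs-pid> 5000

    # Quick histogram of live objects (note: this forces a full GC)
    jmap -histo:live <rs-pid> > /tmp/rs-histo.txt

    # Full heap dump to open later in JMC, Eclipse MAT, or another profiler
    jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <rs-pid>

Whatever dominates the histogram or the dump (memstore, block cache, open scanners, ...) should answer whether it is always the same thing retaining the heap when the long pauses hit.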
