Can you post the GC logs from around 2016-12-12 17:38? It looks like your RS runs stop-the-world Full GCs from time to time.
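If GC logging isn't already enabled on the regionservers, something along these lines in hbase-env.sh will capture it (a sketch only, assuming a HotSpot JDK 7/8 JVM; the log path is a placeholder):

    # Sketch: turn on GC logging for the regionserver JVM (HotSpot JDK 7/8 flags).
    # /var/log/hbase/gc-regionserver.log is a placeholder path; adjust for your install.
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/gc-regionserver.log"

With -XX:+PrintGCApplicationStoppedTime the pauses show up as "Total time for which application threads were stopped" lines, which makes the 17:38 window easy to correlate with the Sleeper warnings below.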
As Stack suggested above, it would be best to get a heap dump and find out which objects are occupying the heap (there is a jmap sketch for taking one by hand after the quoted thread below). Also, what are your on-heap block cache and memstore configs (the grep sketch at the very end of this mail lists the properties I mean)? Thanks.

2016-12-13 16:13 GMT+08:00 Sandeep Reddy <[email protected]>:
> This week we are facing the same problem again.
>
> We haven't changed any HBase configuration for at least 4 to 5 months.
>
> All of a sudden we started seeing this pattern where regionservers are
> getting killed due to GC pauses and, later, ZooKeeper timeouts.
>
> We are using 5 GB for the HBase heap and 6 GB for the bucket cache.
>
> Following is the log from one of the regionservers:
>
> 2016-12-12 17:38:59,142 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 30938ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:40:25,076 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM5/192.168.190.179:4181, initiating session
> 2016-12-12 17:38:59,142 WARN [regionserver60020] util.Sleeper: We slept 19044ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:38:54,384 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 23805ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:38:16,281 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM3/192.168.167.7:4181, initiating session
> 2016-12-12 17:40:25,091 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM1/192.168.178.226:4181, initiating session
> 2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 128812ms for sessionid 0x558f30318e204de, closing socket connection and attempting reconnect
> 2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM2:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x558f30318e204df has expired, closing socket connection
> 2016-12-12 17:40:25,093 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051d has expired, closing socket connection
> 2016-12-12 17:40:25,089 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42156 lease expired on region PostsAnalysis-2016-11-5,exam,1480313370104.4e37b0f96946a104474a8edbba4f87fd.
> 2016-12-12 17:40:25,193 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42155 lease expired on region PostsAnalysis-2016-11-4,exam,1480313365296.4c80cf384fcdc7bfb7c83f625f936cfe.
> 2016-12-12 17:40:25,194 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server HOST30,60020,1481556258263: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:284)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1087)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:912)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
>         at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1085)
>         ... 2 more
> 2016-12-12 17:40:25,087 DEBUG [BucketCache Statistics #0] bucket.BucketCache: BucketCache Stats: failedBlockAdditions=23240, total=4.00 GB, free=1.05 GB, usedSize=2.96 GB, cacheSize=2.54 GB, accesses=7557955, hits=6538991, IOhitsPerSecond=1432, IOTimePerHit=0.88, hitRatio=86.52%, cachingAccesses=7525976, cachingHits=6530392, cachingHitsRatio=86.77%, evictions=779, evicted=3334211, evictedPerRun=4280.11669921875
> 2016-12-12 17:40:25,304 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2016-12-12 17:40:25,116 WARN [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-12-12 17:40:25,105 WARN [ResponseProcessor for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521
> java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.139.245:50366 remote=/192.168.139.245:45710]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1998)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
> 2016-12-12 17:40:25,095 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051c
> 2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,095 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051c has expired, closing socket connection
> 2016-12-12 17:40:25,094 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-12-12 17:40:25,306 WARN [DataStreamer for file /Veooz-hbase/WALs/HOST30,60020,1481556258263/HOST30%2C60020%2C1481556258263.1481564167609 block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: Error Recovery for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521 in pipeline 192.168.139.245:45710, 192.168.212.113:45710, 192.168.195.93:45710: bad datanode 192.168.139.245:45710
> 2016-12-12 17:40:25,306 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051d
> 2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x558f30318e204df
> 2016-12-12 17:40:25,314 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,518 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> 2016-12-12 17:40:25,518 INFO [regionserver60020] ipc.RpcServer: Stopping server on 60020
> 2016-12-12 17:40:25,518 INFO [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: stopping
> 2016-12-12 17:40:25,519 DEBUG [B.DefaultRpcServer.handler=0,queue=0,port=60020] wal.FSHLog: interrupted while waiting for notification from AsyncNotifier
> 2016-12-12 17:40:25,519 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
>
> Thanks,
> Sandeep.
> ________________________________
> From: [email protected] <[email protected]> on behalf of Stack <[email protected]>
> Sent: Tuesday, December 13, 2016 12:38:53 PM
> To: Hbase-User
> Subject: Re: RegionServers repeatedly getting killed with GC pause and ZooKeeper timeout
>
> On Mon, Dec 12, 2016 at 10:22 PM, Sandeep Reddy <[email protected]> wrote:
>
> > Hi,
> >
> > For the past week, multiple regionservers have been getting killed due to
> > GC pause and ZooKeeper timeout exceptions.
> >
>
> And previous to this week, all ran fine?
> If so, do you know what changed?
>
> > We are trying to tune the GC with different HBASE_OPTS values, but no luck.
> >
> > Following is our regionserver configuration:
> >
> > * Hardware: 24 GB RAM with 8 cores
> > * hbase-env.sh configuration:
> > export HBASE_STOP_TIMEOUT="180"
> > export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -XX:+UseCMSInitiatingOccupancyOnly"
> > export HBASE_OPTS="$HBASE_OPTS -XX:MaxNewSize=128m -XX:NewSize=128m"
> > export HBASE_REGIONSERVER_OPTS="-Xmx5g -Xms5g"
> > export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=5g"
> > export HBASE_MANAGES_ZK=false
> > export HBASE_HEAPSIZE=4096
> >
> > * zookeeper tickTime=2000
>
> There's a bunch that has been written on tuning GC, and GC in HBase in particular, if you poke around on the net.
>
> Can you give the server more heap? Above it says 5G. Can you go up from there, or are you up against some ceiling?
>
> What does it say caused the Full GC? Is it always the same thing?
>
> Do you get a heap dump, given you run with -XX:+HeapDumpOnOutOfMemoryError? If so, open it in a profiler (JMC?). Usually the object retainer can be plainly seen.
>
> St.Ack
>
> > Thanks,
> > Sandeep.
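One caveat on the heap dump Stack asked about: -XX:+HeapDumpOnOutOfMemoryError only fires on an actual OutOfMemoryError, and an abort on ZooKeeper session expiry will not trigger it, so you may have to take a dump by hand while the heap is under pressure. A minimal sketch using stock JDK 7/8 tools (the pgrep pattern assumes the standard bin/hbase launcher, which puts -Dproc_regionserver on the command line; the output path is a placeholder):

    # Sketch: capture a heap dump from the live regionserver and inspect it offline.
    RS_PID=$(pgrep -f proc_regionserver)          # assumes the stock HBase launcher naming
    jmap -dump:live,format=b,file=/tmp/rs-heap.hprof "$RS_PID"   # "live" forces a full GC first
    jhat /tmp/rs-heap.hprof                       # or open the .hprof in JMC or Eclipse MAT

Note that jmap pauses the target JVM while the dump is written, so expect the regionserver to stall for the duration.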

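And for the block cache / memstore question at the top of this mail, the quickest way to report the effective settings is to pull them straight out of the config (a sketch; the property names are the common 0.98-era ones and the conf path is a placeholder, so adjust both to your version and install):

    # Sketch: list the cache- and memstore-related settings currently configured.
    grep -E -A1 "hfile\.block\.cache\.size|hbase\.regionserver\.global\.memstore|hbase\.bucketcache" \
      /etc/hbase/conf/hbase-site.xml

If a property doesn't appear there, the shipped default applies; note that too when posting the values.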