Can you post the GC logs from around 2016-12-12 17:38? It looks like your RS runs stop-the-world Full GCs from time to time.
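If GC logging isn't already enabled on the regionservers, something along these lines in hbase-env.sh will capture it (a sketch only, assuming a HotSpot JDK 7/8 JVM; the log path is a placeholder):

    # Sketch: turn on GC logging for the regionserver JVM (HotSpot JDK 7/8 flags).
    # /var/log/hbase/gc-regionserver.log is a placeholder path; adjust for your install.
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/gc-regionserver.log"

With -XX:+PrintGCApplicationStoppedTime the pauses show up as "Total time for which application threads were stopped" lines, which makes the 17:38 window easy to correlate with the Sleeper warnings below.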
As Stack suggested above, it would be best to get a heap dump and find out which objects are occupying the heap (there is a jmap sketch for taking one by hand after the quoted thread below). Also, what are your on-heap block cache and memstore configs (the grep sketch at the very end of this mail lists the properties I mean)? Thanks.

2016-12-13 16:13 GMT+08:00 Sandeep Reddy <[email protected]>:
> This week we are facing the same problem again.
>
> We haven't changed any HBase configuration for at least 4 to 5 months.
>
> All of a sudden we started seeing this pattern where regionservers are
> getting killed due to GC pauses and, later, ZooKeeper timeouts.
>
> We are using 5 GB for the HBase heap and 6 GB for the bucket cache.
>
> Following is the log from one of the regionservers:
>
> 2016-12-12 17:38:59,142 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 30938ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:40:25,076 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM5/192.168.190.179:4181, initiating session
> 2016-12-12 17:38:59,142 WARN [regionserver60020] util.Sleeper: We slept 19044ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:38:54,384 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 23805ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:38:16,281 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM3/192.168.167.7:4181, initiating session
> 2016-12-12 17:40:25,091 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Socket connection established to HOSTM1/192.168.178.226:4181, initiating session
> 2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM3:4181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 128812ms for sessionid 0x558f30318e204de, closing socket connection and attempting reconnect
> 2016-12-12 17:40:25,093 INFO [regionserver60020-SendThread(HOSTM2:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x558f30318e204df has expired, closing socket connection
> 2016-12-12 17:40:25,093 INFO [SplitLogWorker-HOST30,60020,1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051d has expired, closing socket connection
> 2016-12-12 17:40:25,089 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42156 lease expired on region PostsAnalysis-2016-11-5,exam,1480313370104.4e37b0f96946a104474a8edbba4f87fd.
> 2016-12-12 17:40:25,193 INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 42155 lease expired on region PostsAnalysis-2016-11-4,exam,1480313365296.4c80cf384fcdc7bfb7c83f625f936cfe.
> 2016-12-12 17:40:25,194 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server HOST30,60020,1481556258263: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:284)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1087)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:912)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
>         at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1085)
>         ... 2 more
> 2016-12-12 17:40:25,087 DEBUG [BucketCache Statistics #0] bucket.BucketCache: BucketCache Stats: failedBlockAdditions=23240, total=4.00 GB, free=1.05 GB, usedSize=2.96 GB, cacheSize=2.54 GB, accesses=7557955, hits=6538991, IOhitsPerSecond=1432, IOTimePerHit=0.88, hitRatio=86.52%, cachingAccesses=7525976, cachingHits=6530392, cachingHitsRatio=86.77%, evictions=779, evicted=3334211, evictedPerRun=4280.11669921875
> 2016-12-12 17:40:25,304 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2016-12-12 17:40:25,116 WARN [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-12-12 17:40:25,105 WARN [ResponseProcessor for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521
> java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.139.245:50366 remote=/192.168.139.245:45710]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1998)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
> 2016-12-12 17:40:25,095 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051c
> 2016-12-12 17:40:25,307 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,095 INFO [regionserver60020-SendThread(HOSTM1:4181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x458f30318de051c has expired, closing socket connection
> 2016-12-12 17:40:25,094 WARN [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2016-12-12 17:40:25,306 WARN [DataStreamer for file /Veooz-hbase/WALs/HOST30,60020,1481556258263/HOST30%2C60020%2C1481556258263.1481564167609 block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521] hdfs.DFSClient: Error Recovery for block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521 in pipeline 192.168.139.245:45710, 192.168.212.113:45710, 192.168.195.93:45710: bad datanode 192.168.139.245:45710
> 2016-12-12 17:40:25,306 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x458f30318de051d
> 2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x558f30318e204df
> 2016-12-12 17:40:25,314 INFO [SplitLogWorker-HOST30,60020,1481556258263-EventThread] zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,314 INFO [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,518 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing HOST30,60020,1481556258263 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> 2016-12-12 17:40:25,518 INFO [regionserver60020] ipc.RpcServer: Stopping server on 60020
> 2016-12-12 17:40:25,518 INFO [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: stopping
> 2016-12-12 17:40:25,519 DEBUG [B.DefaultRpcServer.handler=0,queue=0,port=60020] wal.FSHLog: interrupted while waiting for notification from AsyncNotifier
> 2016-12-12 17:40:25,519 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
>
> Thanks,
> Sandeep.
> ________________________________
> From: [email protected] <[email protected]> on behalf of Stack <[email protected]>
> Sent: Tuesday, December 13, 2016 12:38:53 PM
> To: Hbase-User
> Subject: Re: RegionServers repeatedly getting killed with GC pause and ZooKeeper timeout
>
> On Mon, Dec 12, 2016 at 10:22 PM, Sandeep Reddy <[email protected]> wrote:
>
> > Hi,
> >
> > For the past week, multiple regionservers have been getting killed due to
> > GC pause and ZooKeeper timeout exceptions.
> >
>
> And previous to this week, all ran fine?
> If so, do you know what changed?
>
> > We are trying to tune the GC with different HBASE_OPTS values, but no luck.
> >
> > Following is our regionserver configuration:
> >
> > * Hardware: 24 GB RAM with 8 cores
> > * hbase-env.sh configuration:
> > export HBASE_STOP_TIMEOUT="180"
> > export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -XX:+UseCMSInitiatingOccupancyOnly"
> > export HBASE_OPTS="$HBASE_OPTS -XX:MaxNewSize=128m -XX:NewSize=128m"
> > export HBASE_REGIONSERVER_OPTS="-Xmx5g -Xms5g"
> > export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=5g"
> > export HBASE_MANAGES_ZK=false
> > export HBASE_HEAPSIZE=4096
> >
> > * zookeeper tickTime=2000
>
> There's a bunch that has been written on tuning GC, and GC in HBase in particular, if you poke around on the net.
>
> Can you give the server more heap? Above it says 5G. Can you go up from there, or are you up against some ceiling?
>
> What does it say caused the Full GC? Is it always the same thing?
>
> Do you get a heap dump, given you run with -XX:+HeapDumpOnOutOfMemoryError? If so, open it in a profiler (JMC?). Usually the object retainer can be plainly seen.
>
> St.Ack
>
> > Thanks,
> > Sandeep.
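One caveat on the heap dump Stack asked about: -XX:+HeapDumpOnOutOfMemoryError only fires on an actual OutOfMemoryError, and an abort on ZooKeeper session expiry will not trigger it, so you may have to take a dump by hand while the heap is under pressure. A minimal sketch using stock JDK 7/8 tools (the pgrep pattern assumes the standard bin/hbase launcher, which puts -Dproc_regionserver on the command line; the output path is a placeholder):

    # Sketch: capture a heap dump from the live regionserver and inspect it offline.
    RS_PID=$(pgrep -f proc_regionserver)          # assumes the stock HBase launcher naming
    jmap -dump:live,format=b,file=/tmp/rs-heap.hprof "$RS_PID"   # "live" forces a full GC first
    jhat /tmp/rs-heap.hprof                       # or open the .hprof in JMC or Eclipse MAT

Note that jmap pauses the target JVM while the dump is written, so expect the regionserver to stall for the duration.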

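And for the block cache / memstore question at the top of this mail, the quickest way to report the effective settings is to pull them straight out of the config (a sketch; the property names are the common 0.98-era ones and the conf path is a placeholder, so adjust both to your version and install):

    # Sketch: list the cache- and memstore-related settings currently configured.
    grep -E -A1 "hfile\.block\.cache\.size|hbase\.regionserver\.global\.memstore|hbase\.bucketcache" \
      /etc/hbase/conf/hbase-site.xml

If a property doesn't appear there, the shipped default applies; note that too when posting the values.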