On Tue, Dec 13, 2016 at 12:13 AM, Sandeep Reddy <[email protected]>
wrote:

> This week also we are facing the same problem.
>
> For at least the last 4 to 5 months we haven't changed any HBase configuration.
>
> All of a sudden we started seeing this pattern where regionservers are
> getting killed due to GC pauses and, later, ZooKeeper timeouts.
>
>
What changed? More reading/writing? Fewer machines in the cluster?




> We are using 5 GB for HBase heap & 6 GB for bucket cache.
>
>
Why this particular setup?

Have you tried giving 11G to HBase instead? On-heap is more performant if
most of your workload fits in RAM (does it?); otherwise the bucket cache is
better. Can you give HBase more heap?
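
For example, if the 6G is the usual off-heap bucket cache, shifting that
memory back on-heap would look something like the below (only a sketch,
adjust it to how your cache is actually configured):

  # hbase-env.sh: one big on-heap allocation instead of 5G heap + 6G bucket cache
  export HBASE_REGIONSERVER_OPTS="-Xmx11g -Xms11g"
  # and drop -XX:MaxDirectMemorySize=5g if nothing else needs the direct memory

  # hbase-site.xml: remove the hbase.bucketcache.* properties so only the
  # on-heap LRU block cache is used, with the usual fraction of heap, e.g.:
  #   hfile.block.cache.size = 0.4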

What versions are you running?

The posted log is reporting on a server dying. As Duo Zhang suggested, have
you tried correlating the GC logs with events in the RegionServer log?
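
If GC logging is not already on, something along these lines makes that
correlation easy (a sketch only; the log paths are made up, the flags are
the usual JDK 7/8 ones):

  # hbase-env.sh: write a timestamped GC log for the RegionServer
  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/var/log/hbase/gc-regionserver.log"

  # then line up the pauses with the abort window above (17:38-17:40)
  grep '2016-12-12T17:3' /var/log/hbase/gc-regionserver.log
  grep '2016-12-12 17:3' /var/log/hbase/hbase-regionserver-HOST30.log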

Have you tried any tuning/diagnosis at all?
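
Even before restarting with new flags, a couple of quick checks are worth
doing (standard JDK/Linux tools; the pid lookup is just illustrative):

  # watch GC live; an old gen (O) stuck near 100% across CMS cycles points at the retainer
  jstat -gcutil $(pgrep -f HRegionServer | head -1) 5s

  # very long "GC" pauses are often the OS swapping the JVM out, not the collector
  free -m
  swapon -s
  cat /proc/sys/vm/swappiness   # most HBase tuning guides suggest 0 or 1 here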

St.Ack



>
> Following is the log from one of the regionserver:
>
> 2016-12-12 17:38:59,142 WARN  [regionserver60020.periodicFlusher]
> util.Sleeper: We slept 30938ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:40:25,076 INFO  [SplitLogWorker-HOST30,60020,
> 1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Socket
> connection established to HOSTM5/192.168.190.179:4181, initiating session
> 2016-12-12 17:38:59,142 WARN  [regionserver60020] util.Sleeper: We slept
> 19044ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see http://hbase.apache.org/book.
> html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:38:54,384 WARN  [regionserver60020.compactionChecker]
> util.Sleeper: We slept 23805ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-12-12 17:38:16,281 INFO  [regionserver60020-SendThread(HOSTM3:4181)]
> zookeeper.ClientCnxn: Socket connection established to HOSTM3/
> 192.168.167.7:4181, initiating session
> 2016-12-12 17:40:25,091 INFO  [regionserver60020-SendThread(HOSTM1:4181)]
> zookeeper.ClientCnxn: Socket connection established to HOSTM1/
> 192.168.178.226:4181, initiating session
> 2016-12-12 17:40:25,093 INFO  [regionserver60020-SendThread(HOSTM3:4181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 128812ms for sessionid 0x558f30318e204de, closing socket connection and
> attempting reconnect
> 2016-12-12 17:40:25,093 INFO  [regionserver60020-SendThread(HOSTM2:4181)]
> zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session
> 0x558f30318e204df has expired, closing socket connection
> 2016-12-12 17:40:25,093 INFO  [SplitLogWorker-HOST30,60020,
> 1481556258263-SendThread(HOSTM5:4181)] zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x458f30318de051d has expired,
> closing socket connection
> 2016-12-12 17:40:25,089 INFO  [regionserver60020.leaseChecker]
> regionserver.HRegionServer: Scanner 42156 lease expired on region
> PostsAnalysis-2016-11-5,exam,1480313370104.4e37b0f96946a104474a8edbba4f87
> fd.
> 2016-12-12 17:40:25,193 INFO  [regionserver60020.leaseChecker]
> regionserver.HRegionServer: Scanner 42155 lease expired on region
> PostsAnalysis-2016-11-4,exam,1480313365296.4c80cf384fcdc7bfb7c83f625f936c
> fe.
> 2016-12-12 17:40:25,194 FATAL [regionserver60020]
> regionserver.HRegionServer: ABORTING region server
> HOST30,60020,1481556258263: org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing HOST30,60020,1481556258263 as
> dead server
>         at org.apache.hadoop.hbase.master.ServerManager.
> checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.
> regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.
> regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.
> RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(
> FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> org.apache.hadoop.hbase.YouAreDeadException: 
> org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing HOST30,60020,1481556258263 as
> dead server
>         at org.apache.hadoop.hbase.master.ServerManager.
> checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.
> regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.
> regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.
> RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(
> FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:95)
>         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.
> getRemoteException(ProtobufUtil.java:284)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.
> tryRegionServerReport(HRegionServer.java:1087)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.
> run(HRegionServer.java:912)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.
> apache.hadoop.hbase.YouAreDeadException): 
> org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing HOST30,60020,1481556258263 as
> dead server
>         at org.apache.hadoop.hbase.master.ServerManager.
> checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.
> regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.
> regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.
> RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(
> FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(
> RpcClient.java:1661)
>         at org.apache.hadoop.hbase.ipc.RpcClient$
> BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
>         at org.apache.hadoop.hbase.protobuf.generated.
> RegionServerStatusProtos$RegionServerStatusService$BlockingStub.
> regionServerReport(RegionServerStatusProtos.java:8289)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.
> tryRegionServerReport(HRegionServer.java:1085)
>         ... 2 more
> 2016-12-12 17:40:25,087 DEBUG [BucketCache Statistics #0]
> bucket.BucketCache: BucketCache Stats: failedBlockAdditions=23240,
> total=4.00 GB, free=1.05 GB, usedSize=2.96 GB, cacheSize=2.54 GB,
> accesses=7557955, hits=6538991, IOhitsPerSecond=1432, IOTimePerHit=0.88,
> hitRatio=86.52%, cachingAccesses=7525976, cachingHits=6530392,
> cachingHitsRatio=86.77%, evictions=779, evicted=3334211,
> evictedPerRun=4280.11669921875
> 2016-12-12 17:40:25,304 FATAL [regionserver60020]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2016-12-12 17:40:25,116 WARN  
> [SplitLogWorker-HOST30,60020,1481556258263-EventThread]
> client.HConnectionManager$HConnectionImplementation: This client just
> lost it's session with ZooKeeper, closing it. It will be recreated next
> time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.
> processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(
> ClientCnxn.java:498)
> 2016-12-12 17:40:25,105 WARN  [ResponseProcessor for block
> BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521
> java.net.SocketTimeoutException: 75000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/192.168.139.245:50366 remote=/192.168.139.245:45710]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(
> SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(
> SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(
> SocketInputStream.java:131)
>         at org.apache.hadoop.net.SocketInputStream.read(
> SocketInputStream.java:118)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.
> vintPrefixed(PBHelper.java:1998)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.
> PipelineAck.readFields(PipelineAck.java:176)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$
> ResponseProcessor.run(DFSOutputStream.java:798)
> 2016-12-12 17:40:25,095 WARN  [regionserver60020-EventThread]
> client.HConnectionManager$HConnectionImplementation: This client just
> lost it's session with ZooKeeper, closing it. It will be recreated next
> time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.
> processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(
> ClientCnxn.java:498)
> 2016-12-12 17:40:25,307 INFO  [regionserver60020-EventThread]
> client.HConnectionManager$HConnectionImplementation: Closing zookeeper
> sessionid=0x458f30318de051c
> 2016-12-12 17:40:25,307 INFO  [regionserver60020-EventThread]
> zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,095 INFO  [regionserver60020-SendThread(HOSTM1:4181)]
> zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session
> 0x458f30318de051c has expired, closing socket connection
> 2016-12-12 17:40:25,094 WARN  [regionserver60020-EventThread]
> client.HConnectionManager$HConnectionImplementation: This client just
> lost it's session with ZooKeeper, closing it. It will be recreated next
> time someone needs it
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> connectionEvent(ZooKeeperWatcher.java:403)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> process(ZooKeeperWatcher.java:321)
>         at org.apache.zookeeper.ClientCnxn$EventThread.
> processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(
> ClientCnxn.java:498)
> 2016-12-12 17:40:25,306 WARN  [DataStreamer for file
> /Veooz-hbase/WALs/HOST30,60020,1481556258263/HOST30%
> 2C60020%2C1481556258263.1481564167609 block BP-993765379-192.168.178.226-
> 1413267646677:blk_1090362673_16624521] hdfs.DFSClient: Error Recovery for
> block BP-993765379-192.168.178.226-1413267646677:blk_1090362673_16624521
> in pipeline 192.168.139.245:45710, 192.168.212.113:45710,
> 192.168.195.93:45710: bad datanode 192.168.139.245:45710
> 2016-12-12 17:40:25,306 INFO  
> [SplitLogWorker-HOST30,60020,1481556258263-EventThread]
> client.HConnectionManager$HConnectionImplementation: Closing zookeeper
> sessionid=0x458f30318de051d
> 2016-12-12 17:40:25,314 INFO  [regionserver60020-EventThread]
> client.HConnectionManager$HConnectionImplementation: Closing zookeeper
> sessionid=0x558f30318e204df
> 2016-12-12 17:40:25,314 INFO  
> [SplitLogWorker-HOST30,60020,1481556258263-EventThread]
> zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,314 INFO  [regionserver60020-EventThread]
> zookeeper.ClientCnxn: EventThread shut down
> 2016-12-12 17:40:25,518 INFO  [regionserver60020]
> regionserver.HRegionServer: STOPPED: 
> org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing HOST30,60020,1481556258263 as
> dead server
>         at org.apache.hadoop.hbase.master.ServerManager.
> checkIsDead(ServerManager.java:370)
>         at org.apache.hadoop.hbase.master.ServerManager.
> regionServerReport(ServerManager.java:275)
>         at org.apache.hadoop.hbase.master.HMaster.
> regionServerReport(HMaster.java:1339)
>         at org.apache.hadoop.hbase.protobuf.generated.
> RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(
> FifoRpcScheduler.java:74)
>         at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> 2016-12-12 17:40:25,518 INFO  [regionserver60020] ipc.RpcServer: Stopping
> server on 60020
> 2016-12-12 17:40:25,518 INFO  [RpcServer.listener,port=60020]
> ipc.RpcServer: RpcServer.listener,port=60020: stopping
> 2016-12-12 17:40:25,519 DEBUG 
> [B.DefaultRpcServer.handler=0,queue=0,port=60020]
> wal.FSHLog: interrupted while waiting for notification from AsyncNotifier
> 2016-12-12 17:40:25,519 INFO  [RpcServer.responder] ipc.RpcServer:
> RpcServer.responder: stopped
>
>
>
> Thanks,
> Sandeep.
> ________________________________
> From: [email protected] <[email protected]> on behalf of Stack <
> [email protected]>
> Sent: Tuesday, December 13, 2016 12:38:53 PM
> To: Hbase-User
> Subject: Re: RegionServers repeatedly getting killed with GC Pause and
> Zookeeper timeout
>
> On Mon, Dec 12, 2016 at 10:22 PM, Sandeep Reddy <[email protected]
> >
> wrote:
>
> > Hi,
> >
> >
> > For the past week, multiple regionservers have been getting killed due to
> > GC pause and ZooKeeper timeout exceptions.
> >
> >
> And previous to this week, all ran fine?
>
> If so, do you know what changed?
>
>
>
> > We have been trying to tune GC with different HBASE_OPTS values, but no luck.
> >
> >
> > Following is our regionserver configuration:
> >
> >   *   Hardware: 24GB RAM with 8 Cores
> >   *   hbase-env.sh configuration:
> > export HBASE_STOP_TIMEOUT="180"
> > export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -Xmn256m -XX:+UseCMSInitiatingOccupancyOnly"
> > export HBASE_OPTS="$HBASE_OPTS -XX:MaxNewSize=128m -XX:NewSize=128m"
> > export HBASE_REGIONSERVER_OPTS="-Xmx5g -Xms5g"
> > export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=5g"
> > export HBASE_MANAGES_ZK=false
> > export HBASE_HEAPSIZE=4096
> >
> >   *   zookeeper tickTime=2000
> >
> >
> There's a bunch that has been written on tuning GC, and on GC in HBase in
> particular, if you poke around on the net.
>
> Can you give the server more heap? Above it says 5G. Can you go up from
> there, or are you up against some ceiling?
>
> What does it say caused the Full GC? Is it always the same thing?
>
> Do you get a heap dump, given you run with -XX:+HeapDumpOnOutOfMemoryError?
> If so, open it in a profiler (JMC?). Usually the object retainer can be
> plainly seen.
>
> St.Ack
>
>
>
>
>
>
> > Thanks,
> > Sandeep.
> >
>
