I met a similar problem before; I suggest you dig into the GC log and the CPU usage log.
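For the CPU side, something like this gives a quick per-thread view (just a sketch; it assumes a Linux box with procps-style `ps`):

```shell
# List the busiest threads on the host, highest CPU first.
# If region server threads dominate here while the GC log looks clean,
# the pause is host/CPU pressure rather than GC itself.
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -n 15
```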
My experience:
#1 Saw GC taking too long, which caused a ZooKeeper session timeout; the region server then judged itself a dead node and shut itself down.
#2 Optimized the GC and kept monitoring the GC log. It still happened again.
#3 Finally found the real reason: the region server was too busy. CPU usage was too high, which made all the processes slow or blocked.
Hope this is useful.

On Monday, November 30, 2015, Stack <[email protected]> wrote:

> On Sat, Nov 28, 2015 at 4:44 PM, 聪聪 <[email protected]> wrote:
>
> > hi, all:
> > This morning a regionserver shut down. By analyzing the log, I guess it
> > is GC-related. How can I optimize the GC in HBase?
> > I am using HBase version hbase-0.98.6-cdh5.2.0. I look forward to your
> > help on this.
> >
> > The JVM configuration is as follows:
> > export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m
> > -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC
> > -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0
> > -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40
> > -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
> > -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails
> > -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> > -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
>
> What the others have said.
>
> How did you arrive at the above configurations?
>
> Looks like you are keeping the GC log in the logs dir at
> gc-HOSTNAME-hbase.log. Your GC output should be rich with lots of helpful
> detail.
>
> What is your loading like? All reads? All writes? Big objects? A mix?
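A quick way to pull the long stop-the-world pauses out of a GC log like the one configured above (a rough sketch, not a parser: it assumes the `-XX:+PrintGCDetails` style lines that end in "..., N.NNNNNNN secs]", and the file name here is just a placeholder):

```shell
# Point GC_LOG at your real file, e.g. logs/gc-<hostname>-hbase.log.
GC_LOG=${GC_LOG:-gc-hbase.log}

# Demo only: create a tiny sample log if none exists, so the pipeline runs anywhere.
[ -f "$GC_LOG" ] || printf 'pause, 0.0234567 secs]\npause, 71.8140000 secs]\n' > "$GC_LOG"

# Extract every pause duration and keep those longer than 10 seconds,
# which is the range that expires a default ZooKeeper session.
long_pauses=$(grep -oE '[0-9]+\.[0-9]+ secs\]' "$GC_LOG" |
  awk '$1 + 0 > 10 { print $1 " secs" }')
echo "$long_pauses"
```

If this turns up nothing but the JvmPauseMonitor still reports 70-second pauses ("No GCs detected"), that points away from GC and toward the host itself, which matches my case #3.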
> St.Ack
>
> > The regionserver log is as follows:
> >
> > 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433, keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE, seqNum=22878
> > 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0, keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
> > 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6, keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
> > 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8, keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
> > 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 74953ms for sessionid 0x34f16f641cc2e77, closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 70587ms
> > No GCs detected
> > 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept 72658ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 81433ms for sessionid 0x54f16f4aa3363fa, closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 79798ms for sessionid 0x54f16f4aa3363fe, closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549] hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010, 10.90.18.63:50010]
> > 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
> > java.io.EOFException: Premature EOF: no length prefix available
> >         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
> >         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> >         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
> > 2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache: Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152, accesses=564477317, hits=166122280, hitRatio=29.43%, cachingAccesses=166131865, cachingHits=166015128, cachingHitsRatio=99.93%, evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
> > 2015-11-29 00:49:46,929 WARN [DataStreamer for file /hbase/tc_qtrace/WALs/l-hbase27.dba.cn2.qunar.com,60020,1447234715888/l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823] hdfs.DFSClient: Error Recovery for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode 10.90.18.47:50010

--
Sincerely,
伍 涛 | Tony Wu
