Hi Team,
We are seeing regionservers go down whenever a major compaction is triggered on a table (8.5 TB in size). Can anybody help with a resolution or give pointers to resolve this?
Below are the current observations:-
1. The above behaviour is seen even when compaction is run on already-compacted tables.
2. Load average seems normal, staying under 4 (on a 32-core machine).
3. Except for bad datanode and JVM pause errors, no other errors are seen in the logs.
Cluster configuration:-
79 Nodes
32-core machines, 64 GB RAM, 1.2 TB SSDs
JVM OPTs:-
export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+PerfDisableSharedMem
-XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log"
$HBASE_REGIONSERVER_OPTS -XX:+PerfDisableSharedMem -XX:PermSize=128m
-XX:MaxPermSize=256m -XX:+UseCMSInitiatingOccupancyOnly -Xmn1024m
-XX:CMSInitiatingOccupancyFraction=70 -Xms31744m -Xmx31744m
HBase-site.xml:-
Please find attached.
GC logs:-
2015-07-12T23:15:29.485-0700: 9260.407: [GC2015-07-12T23:15:29.485-0700:
9260.407: [ParNew: 839872K->947K(943744K), 0.0324180 secs]
1431555K->592630K(32401024K), 0.0325930 secs] [Times: user=0.72 sys=0.00,
real=0.03 secs]
2015-07-12T23:15:30.532-0700: 9261.454: [GC2015-07-12T23:15:30.532-0700:
9261.454: [ParNew: 839859K->1017K(943744K), 31.0324970 secs]
1431542K->592702K(32401024K), 31.0326950 secs] [Times: user=0.89 sys=0.02,
real=31.03 secs]
2015-07-12T23:16:02.490-0700: 9293.412: [GC2015-07-12T23:16:02.490-0700:
9293.412: [ParNew: 839929K->1100K(943744K), 0.0319400 secs]
1431614K->592785K(32401024K), 0.0321580 secs] [Times: user=0.71 sys=0.00,
real=0.03 secs]
2015-07-12T23:16:03.747-0700: 9294.669: [GC2015-07-12T23:16:03.747-0700:
9294.669: [ParNew: 840012K->894K(943744K), 0.0304370 secs]
1431697K->592579K(32401024K), 0.0305330 secs] [Times: user=0.67 sys=0.01,
real=0.03 secs]
Heap
par new generation total 943744K, used 76608K [0x00007f54d4000000,
0x00007f5514000000, 0x00007f5514000000)
eden space 838912K, 9% used [0x00007f54d4000000, 0x00007f54d89f0728,
0x00007f5507340000)
from space 104832K, 0% used [0x00007f5507340000, 0x00007f550741fab0,
0x00007f550d9a0000)
to space 104832K, 0% used [0x00007f550d9a0000, 0x00007f550d9a0000,
0x00007f5514000000)
concurrent mark-sweep generation total 31457280K, used 591685K
[0x00007f5514000000, 0x00007f5c94000000, 0x00007f5c94000000)
concurrent-mark-sweep perm gen total 131072K, used 44189K [0x00007f5c94000000,
0x00007f5c9c000000, 0x00007f5ca4000000)
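To pick the long pause out of a larger GC log, something like the following could be used. It is only a sketch: the name `long_pauses` and the 1-second threshold are my own choices, and the regular expression assumes the ParNew line format shown above. It joins the wrapped lines, then reports every collection whose wall-clock (`real`) time reaches the threshold, together with the `user` and `sys` CPU times. In the entry at 23:15:30 the real time is about 31 s while user plus sys is under 1 s, so most of that pause was not spent doing GC work on the CPU.

```python
import re

# Two entries copied from the GC log above, still wrapped as in the mail.
SAMPLE = """
2015-07-12T23:15:29.485-0700: 9260.407: [GC2015-07-12T23:15:29.485-0700:
9260.407: [ParNew: 839872K->947K(943744K), 0.0324180 secs]
1431555K->592630K(32401024K), 0.0325930 secs] [Times: user=0.72 sys=0.00,
real=0.03 secs]
2015-07-12T23:15:30.532-0700: 9261.454: [GC2015-07-12T23:15:30.532-0700:
9261.454: [ParNew: 839859K->1017K(943744K), 31.0324970 secs]
1431542K->592702K(32401024K), 31.0326950 secs] [Times: user=0.89 sys=0.02,
real=31.03 secs]
"""

# One GC event: leading timestamp, then everything up to its [Times: ...] block.
GC_EVENT = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+[+-]\d{4}): [\d.]+: \[GC.*?"
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), "
    r"real=(?P<real>[\d.]+) secs\]"
)


def long_pauses(log_text, threshold_secs=1.0):
    """Return (timestamp, real, user, sys) for pauses at or above the threshold."""
    # Undo the mail client's line wrapping so each event sits on one line.
    flat = " ".join(log_text.split())
    hits = []
    for m in GC_EVENT.finditer(flat):
        real = float(m.group("real"))
        if real >= threshold_secs:
            hits.append(
                (m.group("ts"), real, float(m.group("user")), float(m.group("sys")))
            )
    return hits


if __name__ == "__main__":
    for ts, real, user, sys_cpu in long_pauses(SAMPLE):
        print(f"{ts}: real={real:.2f}s user={user:.2f}s sys={sys_cpu:.2f}s")
```

The threshold is arbitrary; anything well above the usual 0.03 s ParNew time would serve.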
Regionserver logs:-
2015-07-12 23:16:01,565 WARN [regionserver60020.periodicFlusher] util.Sleeper:
We slept 38712ms instead of 10000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-07-12 23:16:01,565 WARN [ResponseProcessor for block
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075292594_1595455]
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075292594_1595455
java.io.EOFException: Premature EOF: no length prefix available
at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:868)
2015-07-12 23:16:01,565 INFO [regionserver60020-SendThread(<hostname>:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
41080ms for sessionid 0x24e76a76ee6dc50, closing socket connection and
attempting reconnect
2015-07-12 23:16:01,565 INFO [regionserver60020-SendThread(<hostname>:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
39748ms for sessionid 0x34e76a76f05e006, closing socket connection and
attempting reconnect
2015-07-12 23:16:01,565 INFO
[regionserver60020-smallCompactions-1436759027218-SendThread(<hostname>:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
42697ms for sessionid 0x14e76a7707202a2, closing socket connection and
attempting reconnect
2015-07-12 23:16:01,565 INFO [regionserver60020-SendThread(<hostname>:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
33764ms for sessionid 0x14e76a77071dd59, closing socket connection and
attempting reconnect
2015-07-12 23:16:01,565 WARN [ResponseProcessor for block
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075293683_1596593]
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075293683_1596593
java.io.EOFException: Premature EOF: no length prefix available
at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:868)
2015-07-12 23:16:01,565 WARN [regionserver60020] util.Sleeper: We slept
33688ms instead of 3000ms, this is likely due to a long garbage collecting
pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
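To line these entries up with the GC log, the stall durations can be extracted with a short script. Again this is only a sketch under my own assumptions: `stall_durations` is a made-up name, and the two patterns match only the `util.Sleeper` and `zookeeper.ClientCnxn` messages as they appear above. The ZooKeeper session timeout itself is in the attached hbase-site.xml and is not read here.

```python
import re

# A few entries copied from the regionserver log above, still wrapped.
SAMPLE = """
2015-07-12 23:16:01,565 WARN [regionserver60020.periodicFlusher] util.Sleeper:
We slept 38712ms instead of 10000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-07-12 23:16:01,565 INFO [regionserver60020-SendThread(<hostname>:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from server in
41080ms for sessionid 0x24e76a76ee6dc50, closing socket connection and
attempting reconnect
2015-07-12 23:16:01,565 WARN [regionserver60020] util.Sleeper: We slept
33688ms instead of 3000ms, this is likely due to a long garbage collecting
pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
"""

SLEEPER = re.compile(r"We slept (\d+)ms instead of (\d+)ms")
ZK_STALL = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-fA-F]+)"
)


def stall_durations(log_text):
    """Collect Sleeper overshoots and ZooKeeper client stalls, in milliseconds."""
    # Join wrapped lines so a message split across lines still matches.
    flat = " ".join(log_text.split())
    return {
        # (actual sleep, intended sleep)
        "sleeper": [(int(a), int(b)) for a, b in SLEEPER.findall(flat)],
        # (time without a server response, session id)
        "zk": [(int(ms), sid) for ms, sid in ZK_STALL.findall(flat)],
    }


if __name__ == "__main__":
    result = stall_durations(SAMPLE)
    for slept, expected in result["sleeper"]:
        print(f"Sleeper overshoot: {slept - expected} ms")
    for ms, sid in result["zk"]:
        print(f"ZooKeeper session {sid}: no response for {ms} ms")
```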
Regards,
Ankit Singhal
Attachment: hbase-site.xml
