Hi Team,

We are seeing regionservers go down whenever a major compaction is triggered 
on a table (8.5TB in size).
Can anybody help us resolve this, or give pointers on where to look?

Below are the current observations:-

  1.  The behaviour is seen even when the compaction is run on already-compacted 
tables.
  2.  Load average looks normal and stays under 4 (on a 32-core machine).
  3.  Apart from bad datanode and JVM pause errors, no other errors are seen in 
the logs.

Cluster configuration:-
79 Nodes
32-core machines, 64GB RAM, 1.2TB SSDs

JVM OPTs:-

export HBASE_OPTS="$HBASE_OPTS  -XX:+UseParNewGC  -XX:+PerfDisableSharedMem 
-XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log"
$HBASE_REGIONSERVER_OPTS -XX:+PerfDisableSharedMem -XX:PermSize=128m 
-XX:MaxPermSize=256m -XX:+UseCMSInitiatingOccupancyOnly  -Xmn1024m 
-XX:CMSInitiatingOccupancyFraction=70  -Xms31744m -Xmx31744m

hbase-site.xml:-
Please find attached.

GC logs:-


2015-07-12T23:15:29.485-0700: 9260.407: [GC2015-07-12T23:15:29.485-0700: 
9260.407: [ParNew: 839872K->947K(943744K), 0.0324180 secs] 
1431555K->592630K(32401024K), 0.0325930 secs] [Times: user=0.72 sys=0.00, 
real=0.03 secs]

2015-07-12T23:15:30.532-0700: 9261.454: [GC2015-07-12T23:15:30.532-0700: 
9261.454: [ParNew: 839859K->1017K(943744K), 31.0324970 secs] 
1431542K->592702K(32401024K), 31.0326950 secs] [Times: user=0.89 sys=0.02, 
real=31.03 secs]

2015-07-12T23:16:02.490-0700: 9293.412: [GC2015-07-12T23:16:02.490-0700: 
9293.412: [ParNew: 839929K->1100K(943744K), 0.0319400 secs] 
1431614K->592785K(32401024K), 0.0321580 secs] [Times: user=0.71 sys=0.00, 
real=0.03 secs]

2015-07-12T23:16:03.747-0700: 9294.669: [GC2015-07-12T23:16:03.747-0700: 
9294.669: [ParNew: 840012K->894K(943744K), 0.0304370 secs] 
1431697K->592579K(32401024K), 0.0305330 secs] [Times: user=0.67 sys=0.01, 
real=0.03 secs]
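
In case it helps anyone check a longer GC log for the same pattern: the second 
entry above reports real=31.03 secs against user=0.89 and sys=0.02, so the 
collector did almost no CPU work during that pause. Below is a minimal Python 
sketch for flagging such entries. It assumes the "[Times: user=... sys=..., 
real=... secs]" format shown above; the function name and the 1-second 
threshold are illustrative choices of mine, not anything provided by HBase or 
the JVM.

```python
import re

# Matches the "[Times: user=... sys=..., real=... secs]" summary ending each GC entry.
TIMES_RE = re.compile(
    r"\[Times:\s*user=(?P<user>[\d.]+)\s+sys=(?P<sys>[\d.]+),"
    r"\s*real=(?P<real>[\d.]+)\s+secs\]"
)
# Matches the JVM date stamp, e.g. 2015-07-12T23:15:30.532-0700
STAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}")


def find_stalled_pauses(log_text, min_real_secs=1.0):
    """Return (timestamp, user, sys, real) tuples for GC entries whose
    wall-clock time is at least min_real_secs (hypothetical threshold)
    and also exceeds the CPU time spent (user + sys)."""
    # Mail clients hard-wrap log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    results = []
    pos = 0
    for m in TIMES_RE.finditer(flat):
        entry = flat[pos:m.end()]  # text belonging to this GC entry
        pos = m.end()
        stamp = STAMP_RE.search(entry)
        user = float(m.group("user"))
        sys_ = float(m.group("sys"))
        real = float(m.group("real"))
        if real >= min_real_secs and real > user + sys_:
            results.append((stamp.group(0) if stamp else None, user, sys_, real))
    return results


if __name__ == "__main__":
    # Two of the entries from this mail, wrapped the same way.
    sample = """
    2015-07-12T23:15:29.485-0700: 9260.407: [GC2015-07-12T23:15:29.485-0700:
    9260.407: [ParNew: 839872K->947K(943744K), 0.0324180 secs]
    1431555K->592630K(32401024K), 0.0325930 secs] [Times: user=0.72 sys=0.00,
    real=0.03 secs]

    2015-07-12T23:15:30.532-0700: 9261.454: [GC2015-07-12T23:15:30.532-0700:
    9261.454: [ParNew: 839859K->1017K(943744K), 31.0324970 secs]
    1431542K->592702K(32401024K), 31.0326950 secs] [Times: user=0.89 sys=0.02,
    real=31.03 secs]
    """
    for stamp, user, sys_, real in find_stalled_pauses(sample):
        print(stamp, "real=%.2fs" % real, "cpu=%.2fs" % (user + sys_))
```

The comparison against user + sys is there because a parallel collector 
normally reports more CPU time than wall-clock time (as in the 0.03 secs 
entries), so an entry where real is far larger stands out.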

Heap

 par new generation   total 943744K, used 76608K [0x00007f54d4000000, 
0x00007f5514000000, 0x00007f5514000000)

  eden space 838912K,   9% used [0x00007f54d4000000, 0x00007f54d89f0728, 
0x00007f5507340000)

  from space 104832K,   0% used [0x00007f5507340000, 0x00007f550741fab0, 
0x00007f550d9a0000)

  to   space 104832K,   0% used [0x00007f550d9a0000, 0x00007f550d9a0000, 
0x00007f5514000000)

 concurrent mark-sweep generation total 31457280K, used 591685K 
[0x00007f5514000000, 0x00007f5c94000000, 0x00007f5c94000000)

 concurrent-mark-sweep perm gen total 131072K, used 44189K [0x00007f5c94000000, 
0x00007f5c9c000000, 0x00007f5ca4000000)




Regionserver logs:-


2015-07-12 23:16:01,565 WARN  [regionserver60020.periodicFlusher] util.Sleeper: 
We slept 38712ms instead of 10000ms, this is likely due to a long garbage 
collecting pause and it's usually bad, see 
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2015-07-12 23:16:01,565 WARN  [ResponseProcessor for block 
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075292594_1595455] 
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block 
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075292594_1595455

java.io.EOFException: Premature EOF: no length prefix available

        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)

        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)

        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:868)

2015-07-12 23:16:01,565 INFO  [regionserver60020-SendThread(<hostname>:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
41080ms for sessionid 0x24e76a76ee6dc50, closing socket connection and 
attempting reconnect

2015-07-12 23:16:01,565 INFO  [regionserver60020-SendThread(<hostname>:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
39748ms for sessionid 0x34e76a76f05e006, closing socket connection and 
attempting reconnect

2015-07-12 23:16:01,565 INFO  
[regionserver60020-smallCompactions-1436759027218-SendThread(<hostname>:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
42697ms for sessionid 0x14e76a7707202a2, closing socket connection and 
attempting reconnect

2015-07-12 23:16:01,565 INFO  [regionserver60020-SendThread(<hostname>:2181)] 
zookeeper.ClientCnxn: Client session timed out, have not heard from server in 
33764ms for sessionid 0x14e76a77071dd59, closing socket connection and 
attempting reconnect

2015-07-12 23:16:01,565 WARN  [ResponseProcessor for block 
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075293683_1596593] 
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block 
BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075293683_1596593

java.io.EOFException: Premature EOF: no length prefix available

        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)

        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)

        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:868)

2015-07-12 23:16:01,565 WARN  [regionserver60020] util.Sleeper: We slept 
33688ms instead of 3000ms, this is likely due to a long garbage collecting 
pause and it's usually bad, see 
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired


Regards,
Ankit Singhal

Attachment: hbase-site.xml
Description: hbase-site.xml
