What JDK are you using?  I've seen such behavior when a machine was
swapping.  Can you tell if there was any swap in use?
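
A quick way to check on the regionserver host (a sketch, assuming a Linux
box; it reads `/proc` directly so it works even without procps tools like
`free` or `vmstat` installed):

```shell
# Check whether the host has swap configured and in use:
# the gap between SwapTotal and SwapFree is the swap currently occupied.
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# High swappiness values make the kernel page out application memory
# (like a big JVM heap) more eagerly.
cat /proc/sys/vm/swappiness

# If procps is installed, `vmstat 1 5` shows live swap-in (si) / swap-out
# (so) rates; sustained non-zero values during a pause point at swapping.
```

A snapshot after the fact can miss a transient episode, so if `sar` or
similar history is collected, it is worth checking the paging stats around
the pause timestamps too.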

On Mon, Jul 13, 2015 at 3:24 AM, Ankit Singhal
<[email protected]> wrote:
> Hi Team,
>
> We are seeing regionservers going down whenever major compaction is
> triggered on a table (8.5TB in size).
> Can anybody help with a resolution or give pointers to resolve this?
>
> Below are the current observations:
>
> The above behaviour is seen even when compaction is run on already-compacted
> tables. Load average seems normal, staying under 4 (on a 32-core machine).
> Apart from bad-datanode and JVM-pause errors, no other errors appear in the
> logs.
>
>
> Cluster configuration:-
> 79 Nodes
> 32-core machines, 64GB RAM, 1.2TB SSDs
>
> JVM OPTs:-
>
> export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+PerfDisableSharedMem
> -XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log"
> export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS
> -XX:+PerfDisableSharedMem -XX:PermSize=128m -XX:MaxPermSize=256m
> -XX:+UseCMSInitiatingOccupancyOnly -Xmn1024m
> -XX:CMSInitiatingOccupancyFraction=70 -Xms31744m -Xmx31744m"
>
> HBase-site.xml:-
> PFA
>
> GC logs:-
>
> 2015-07-12T23:15:29.485-0700: 9260.407: [GC2015-07-12T23:15:29.485-0700:
> 9260.407: [ParNew: 839872K->947K(943744K), 0.0324180 secs]
> 1431555K->592630K(32401024K), 0.0325930 secs] [Times: user=0.72 sys=0.00,
> real=0.03 secs]
>
> 2015-07-12T23:15:30.532-0700: 9261.454: [GC2015-07-12T23:15:30.532-0700:
> 9261.454: [ParNew: 839859K->1017K(943744K), 31.0324970 secs]
> 1431542K->592702K(32401024K), 31.0326950 secs] [Times: user=0.89 sys=0.02,
> real=31.03 secs]
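
[The entry above is the telltale one: real=31.03s while user+sys is under a
second, i.e. the threads were descheduled by the OS (swapping, disk stalls)
rather than doing 31 seconds of GC work. A rough filter for such entries
over a whole GC log, sketched here with the two entries above as input --
replace the here-doc with e.g. `< gc.log` to scan a real file:]

```shell
# Flag GC entries whose wall-clock (real) time dwarfs reported CPU time;
# such pauses usually indicate swapping or I/O stalls, not GC work itself.
# (Comparing real against user time alone is a rough but adequate proxy.)
awk -F'real=' '/Times:/ {
    split($2, r, " "); real = r[1] + 0          # "31.03 secs]" -> 31.03
    u = substr($0, index($0, "user=") + 5) + 0  # "user=0.89 sys=..." -> 0.89
    if (real > 1 && real > 5 * u)
        print "suspect OS stall: " $0
}' <<'EOF'
9260.407: [ParNew: 839872K->947K(943744K), 0.0324180 secs] [Times: user=0.72 sys=0.00, real=0.03 secs]
9261.454: [ParNew: 839859K->1017K(943744K), 31.0324970 secs] [Times: user=0.89 sys=0.02, real=31.03 secs]
EOF
```

[On this input only the 31.03s entry is flagged; the healthy 0.03s
collection passes the filter.]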
>
> 2015-07-12T23:16:02.490-0700: 9293.412: [GC2015-07-12T23:16:02.490-0700:
> 9293.412: [ParNew: 839929K->1100K(943744K), 0.0319400 secs]
> 1431614K->592785K(32401024K), 0.0321580 secs] [Times: user=0.71 sys=0.00,
> real=0.03 secs]
>
> 2015-07-12T23:16:03.747-0700: 9294.669: [GC2015-07-12T23:16:03.747-0700:
> 9294.669: [ParNew: 840012K->894K(943744K), 0.0304370 secs]
> 1431697K->592579K(32401024K), 0.0305330 secs] [Times: user=0.67 sys=0.01,
> real=0.03 secs]
>
> Heap
>
>  par new generation   total 943744K, used 76608K [0x00007f54d4000000,
> 0x00007f5514000000, 0x00007f5514000000)
>
>   eden space 838912K,   9% used [0x00007f54d4000000, 0x00007f54d89f0728,
> 0x00007f5507340000)
>
>   from space 104832K,   0% used [0x00007f5507340000, 0x00007f550741fab0,
> 0x00007f550d9a0000)
>
>   to   space 104832K,   0% used [0x00007f550d9a0000, 0x00007f550d9a0000,
> 0x00007f5514000000)
>
>  concurrent mark-sweep generation total 31457280K, used 591685K
> [0x00007f5514000000, 0x00007f5c94000000, 0x00007f5c94000000)
>
>  concurrent-mark-sweep perm gen total 131072K, used 44189K
> [0x00007f5c94000000, 0x00007f5c9c000000, 0x00007f5ca4000000)
>
>
>
>
> Regionserver logs:-
>
>
> 2015-07-12 23:16:01,565 WARN  [regionserver60020.periodicFlusher]
> util.Sleeper: We slept 38712ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
> 2015-07-12 23:16:01,565 WARN  [ResponseProcessor for block
> BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075292594_1595455]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075292594_1595455
>
> java.io.EOFException: Premature EOF: no length prefix available
>
>         at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)
>
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>
>         at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:868)
>
> 2015-07-12 23:16:01,565 INFO
> [regionserver60020-SendThread(<hostname>:2181)] zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 41080ms for sessionid
> 0x24e76a76ee6dc50, closing socket connection and attempting reconnect
>
> 2015-07-12 23:16:01,565 INFO
> [regionserver60020-SendThread(<hostname>:2181)] zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 39748ms for sessionid
> 0x34e76a76f05e006, closing socket connection and attempting reconnect
>
> 2015-07-12 23:16:01,565 INFO
> [regionserver60020-smallCompactions-1436759027218-SendThread(<hostname>:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 42697ms for sessionid 0x14e76a7707202a2, closing socket connection and
> attempting reconnect
>
> 2015-07-12 23:16:01,565 INFO
> [regionserver60020-SendThread(<hostname>:2181)] zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 33764ms for sessionid
> 0x14e76a77071dd59, closing socket connection and attempting reconnect
>
> 2015-07-12 23:16:01,565 WARN  [ResponseProcessor for block
> BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075293683_1596593]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-552832523-xxx.xxx.xxx.xxx-1433419204036:blk_1075293683_1596593
>
> java.io.EOFException: Premature EOF: no length prefix available
>
>         at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2208)
>
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>
>         at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:868)
>
> 2015-07-12 23:16:01,565 WARN  [regionserver60020] util.Sleeper: We slept
> 33688ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
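
[The ~40s gaps in these messages suggest the ZooKeeper session timeout is
around 40s. The stall itself is the real problem, but while it is being
chased down, the regionservers can be given more headroom before their
session expires by raising `zookeeper.session.timeout` in hbase-site.xml.
An illustrative fragment -- the value actually granted is also capped by
the ZK ensemble's own maxSessionTimeout:]

```xml
<!-- Illustrative only: tolerate up to ~90s of unresponsiveness before the
     regionserver's ZooKeeper session expires. The ZK servers'
     maxSessionTimeout must permit this value or it will be clamped. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```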
>
>
>
> Regards,
> Ankit Singhal
