0.94 currently doesn't support hadoop 2.0
Can you deploy hadoop 1.1.1 instead?
I am using cdh4.2.0, which uses this version as its default installation.
I think it will be a problem for me to deploy 1.1.1, because I would need to
"upgrade" the whole cluster with 70 TB of data (back up everything, go offline,
etc.).
Is there a problem with using cdh4.2.0?
Should I send my email to the CDH list?
Are you using 0.94.5?
I am using 0.94.2.
I think it is with your GC config. What is your heap size? What is the
data that you pump in and how much is the block cache size?
# JVM config
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m -XX:+UseConcMarkSweepGC
-XX:MaxDirectMemorySize=2G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xloggc:/var/logs/hbase/gc-hbase.log"
# heap size
export HBASE_HEAPSIZE=8192
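The options above set no CMS occupancy tuning. For comparison, a commonly
suggested addition for CMS on heaps of this size would look like the following
sketch; the 70% threshold is illustrative, not a setting from this cluster:
# Illustrative only: common CMS tuning for HBase region servers;
# the occupancy threshold is an assumption, not this cluster's setting.
export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"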
# hbase metrics
requestsPerSecond=8, numberOfOnlineRegions=1252, numberOfStores=1272,
numberOfStorefiles=1651, storefileIndexSizeMB=66, rootIndexSizeKB=68176,
totalStaticIndexSizeKB=55028, totalStaticBloomSizeKB=0, memstoreSizeMB=3,
mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0, readRequestsCount=1176287,
writeRequestsCount=2165, compactionQueueSize=0, flushQueueSize=0,
usedHeapMB=328, maxHeapMB=8185, blockCacheSizeMB=117.94,
blockCacheFreeMB=1928.47, blockCacheCount=2083, blockCacheHitCount=34815,
blockCacheMissCount=10259, blockCacheEvictedCount=17, blockCacheHitRatio=77%,
blockCacheHitCachingRatio=94%, hdfsBlocksLocalityIndex=65,
slowHLogAppendCount=0, fsReadLatencyHistogramMean=0,
fsReadLatencyHistogramCount=0, fsReadLatencyHistogramMedian=0,
fsReadLatencyHistogram75th=0, fsReadLatencyHistogram95th=0,
fsReadLatencyHistogram99th=0, fsReadLatencyHistogram999th=0,
fsPreadLatencyHistogramMean=0, fsPreadLatencyHistogramCount=0,
fsPreadLatencyHistogramMedian=0, fsPreadLatencyHistogram75th=0,
fsPreadLatencyHistogram95th=0, fsPreadLatencyHistogram99th=0,
fsPreadLatencyHistogram999th=0, fsWriteLatencyHistogramMean=0,
fsWriteLatencyHistogramCount=0, fsWriteLatencyHistogramMedian=0,
fsWriteLatencyHistogram75th=0, fsWriteLatencyHistogram95th=0,
fsWriteLatencyHistogram99th=0, fsWriteLatencyHistogram999th=0
# hbase-site.xml
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>20</value>
</property>
All other parameters, for both HBase and Hadoop, are at their defaults.
I have four tables, all with this same configuration:
{NAME => 'T1', FAMILIES => [{NAME => 'details', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>
'false', BLOCKCACHE => 'true'}]}
Rows in one table vary from 4 KB to 50 KB, while rows in the other three
usually range from 60 bytes to 300 bytes.
You Full GC'ing around this time?
The GC log shows it took a long time. However, it does not make sense for GC
itself to be the culprit, since the same amount of data was cleaned both
before and AFTER in just 0.01 secs!
[Times: user=0.08 sys=137.62, real=137.62 secs]
Besides, the whole time was spent in system (sys), not user time. That is what
is bugging me.
...
1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs]
275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]
1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs]
269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01,
real=0.00 secs]
1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620
secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08
sys=137.62, real=137.62 secs]
1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs]
287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00,
real=0.01 secs]
1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs]
283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00,
real=0.01 secs]
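One possible explanation for a ParNew pause dominated by sys time is the heap
being swapped out, or transparent huge page defragmentation stalling the
collector. A quick check, using standard Linux commands rather than anything
from this thread (the THP path varies by distro):
# Hedged diagnostic sketch for sys-heavy GC pauses.
vmstat 1 5                                        # si/so columns should stay at 0
cat /proc/sys/vm/swappiness                       # HBase hosts often lower this
cat /sys/kernel/mm/transparent_hugepage/enabled   # path differs on older distros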
I really appreciate you guys helping me figure out what is wrong.
Thanks,
Pablo
On 03/08/2013 02:11 PM, Stack wrote:
What Ram says.
2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 159348ms for sessionid
0x13d3c4bcba600a7, closing socket connection and attempting reconnect
You Full GC'ing around this time?
Put up your configs in a place where we can take a look?
St.Ack
On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan
<[email protected]> wrote:
I think it is with your GC config. What is your heap size? What is the
data that you pump in and how much is the block cache size?
Regards
Ram
On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <[email protected]> wrote:
0.94 currently doesn't support hadoop 2.0
Can you deploy hadoop 1.1.1 instead?
Are you using 0.94.5?
Thanks
On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <[email protected]> wrote:
Hey guys,
as I sent in an email a long time ago, the RSs in my cluster did not get along
and crashed 3 times a day. I tried a lot of the options we discussed in those
emails, but none of them solved the problem. Since I was using an old version
of Hadoop, I thought that was the cause.
So I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
Unfortunately the RSs did not stop crashing, and worse! Now they crash every
hour, and sometimes when the RS that holds the .ROOT. crashes, the whole
cluster gets stuck in transition and everything stops working.
In this case I need to clean the zookeeper znodes and restart the master and
the RSs.
To avoid this case I am running in production with only ONE RS and a
monitoring script that checks every minute whether the RS is OK and restarts
it if not (a sketch follows below).
* This case does not get the cluster stuck.
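A minimal sketch of that kind of watchdog; the info port, status URL, and
daemon path here are assumptions, not taken from the actual script:
#!/bin/bash
# Hypothetical watchdog sketch: poll the RegionServer info port once a
# minute and restart the daemon if it stops responding.
while true; do
  if ! curl -sf http://localhost:60030/rs-status > /dev/null; then
    /usr/lib/hbase/bin/hbase-daemon.sh restart regionserver
  fi
  sleep 60
done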
This is driving me crazy, but I really can't find a solution for the cluster.
I tracked all logs from the start time 16:49 on all the interesting nodes
(zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I think is
useful.
There are some strange errors in DATANODE2, such as an error copying a block
to itself.
The gc log points to a GC timeout. However, it is very weird that the RS
spends so much time in GC in this one case while in the others it takes
0.001 secs. Besides, the time is spent in sys, which makes me think the
problem might be somewhere else.
I know that this is a bunch of logs, and that it is very difficult to find the
problem without much context. But I REALLY need some help. If not the
solution, then at least what I should read, where I should look, or which
cases I should monitor.
Thank you very much,
Pablo Musa