Hi Alex,

On the same idea, note that hbase is launched with
-XX:OnOutOfMemoryError="kill -9 %p".

N.

On Tue, May 1, 2012 at 10:41 AM, Igal Shilman <ig...@wix.com> wrote:

> Hi Alex, just to rule out, oom killer,
> Try this:
>
> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>
>
> On Mon, Apr 30, 2012 at 10:48 PM, Alex Baranau <alex.barano...@gmail.com
> >wrote:
>
> > Hello,
> >
> > During recent weeks I constantly see some RSs *silently* dying on our
> HBase
> > cluster. By "silently" I mean that process stops, but no errors in logs
> > [1].
> >
> > The only thing I can relate to it is long CMS-concurrent-mark: almost 80
> > seconds. But this should not cause issues as it is not a "stop-the-world"
> > process.
> >
> > Any advice?
> >
> > HBase: hbase-0.90.4-cdh3u3
> > Hadoop: 0.20.2-cdh3u3
> >
> > Thank you,
> > Alex Baranau
> >
> > [1]
> >
> > last lines from RS log (no errors before too, and nothing written in
> *.out
> > file):
> >
> > 2012-04-30 18:52:11,806 DEBUG
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> > requested for agg-sa-1.3,0011|
> >
> >
> te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00\x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65.
> > because regionserver60020.cacheFlusher; priority=-1, compaction queue
> > size=0
> > 2012-04-30 18:54:58,779 DEBUG
> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using new
> > createWriter -- HADOOP-6840
> > 2012-04-30 18:54:58,779 DEBUG
> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
> >
> >
> Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651,
> > syncFs=true, hflush=false
> > 2012-04-30 18:54:58,874 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > Roll
> >
> >
> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335811856672,
> > entries=73789, filesize=63773934. New hlog
> >
> >
> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651
> > 2012-04-30 18:56:31,867 INFO
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke
> up
> > with memory above low water.
> > 2012-04-30 18:56:31,867 INFO
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
> > agg-sa-1.3,s_00I4|
> >
> >
> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
> > due to global heap pressure
> > 2012-04-30 18:56:31,867 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Started memstore flush for agg-sa-1.3,s_00I4|
> >
> >
> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.,
> > current region memstore size 138.1m
> > 2012-04-30 18:56:31,867 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Finished snapshotting, commencing flushing stores
> > 2012-04-30 18:56:56,303 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=322.84
> MB,
> > free=476.34 MB, max=799.17 MB, blocks=5024, accesses=12189396,
> hits=127592,
> > hitRatio=1.04%%, cachingAccesses=132480, cachingHits=126949,
> > cachingHitsRatio=95.82%%, evictions=0, evicted=0, evictedPerRun=NaN
> > 2012-04-30 18:56:59,026 INFO org.apache.hadoop.hbase.regionserver.Store:
> > Renaming flushed file at
> >
> >
> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/.tmp/391890051647401997
> > to
> >
> >
> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168
> > 2012-04-30 18:56:59,034 INFO org.apache.hadoop.hbase.regionserver.Store:
> > Added
> >
> >
> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168,
> > entries=476418, sequenceid=880198761, memsize=138.1m, filesize=5.7m
> > 2012-04-30 18:56:59,097 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Finished memstore flush of ~138.1m for region agg-sa-1.3,s_00I4|
> >
> >
> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
> > in 27230ms, sequenceid=880198761, compaction requested=false
> > ~
> >
> > [2]
> >
> > last lines from GC log:
> >
> > 2012-04-30T18:58:46.683+0000: 105717.791: [GC 105717.791: [ParNew:
> > 35638K->1118K(38336K), 0.0548970 secs] 3145651K->3111412K(4091776K)
> > icms_dc=6 , 0.0550360 secs] [Times: user=0.08 sys=0.00, real=0.09 secs]
> > 2012-04-30T18:58:46.961+0000: 105718.069: [GC 105718.069: [ParNew:
> > 35230K->2224K(38336K), 0.0802440 secs] 3145524K->3112533K(4091776K)
> > icms_dc=6 , 0.0803810 secs] [Times: user=0.06 sys=0.00, real=0.13 secs]
> > 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-mark:
> > 8.770/80.230 secs] [Times: user=61.34 sys=5.69, real=80.23 secs]
> > 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-preclean-start]
> > 2012-04-30T18:58:47.183+0000: 105718.291: [GC 105718.291: [ParNew:
> > 36336K->3895K(38336K), 0.0272610 secs] 3146645K->3114281K(4091776K)
> > icms_dc=6 , 0.0274040 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
> > 2012-04-30T18:58:47.452+0000: 105718.560: [GC 105718.560: [ParNew:
> > 37990K->3082K(38336K), 0.0504660 secs] 3148376K->3114329K(4091776K)
> > icms_dc=6 , 0.0506020 secs] [Times: user=0.06 sys=0.00, real=0.05 secs]
> > 2012-04-30T18:58:47.823+0000: 105718.930: [GC 105718.931: [ParNew:
> > 37194K->3340K(38336K), 0.0419660 secs] 3148441K->3115727K(4091776K)
> > icms_dc=6 , 0.0421110 secs] [Times: user=0.05 sys=0.00, real=0.04 secs]
> > 2012-04-30T18:58:48.099+0000: 105719.207: [GC 105719.207: [ParNew:
> > 37351K->3249K(38336K), 0.0695950 secs] 3149738K->3115636K(4091776K)
> > icms_dc=6 , 0.0697320 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
> > 2012-04-30T18:58:48.447+0000: 105719.555: [GC 105719.555: [ParNew:
> > 37361K->2512K(38336K), 0.0510560 secs] 3149748K->3116166K(4091776K)
> > icms_dc=6 , 0.0512050 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
> > 2012-04-30T18:58:48.794+0000: 105719.902: [GC 105719.902: [ParNew:
> > 36624K->2616K(38336K), 0.1986770 secs] 3150278K->3116270K(4091776K)
> > icms_dc=6 , 0.1988160 secs] [Times: user=0.05 sys=0.00, real=0.20 secs]
> > 2012-04-30T18:58:49.614+0000: 105720.722: [GC 105720.722: [ParNew:
> > 36728K->3450K(38336K), 0.1099190 secs] 3150382K->3117103K(4091776K)
> > icms_dc=6 , 0.1100650 secs] [Times: user=0.05 sys=0.00, real=0.11 secs]
> > 2012-04-30T18:58:50.631+0000: 105721.739: [GC 105721.739: [ParNew:
> > 37546K->3176K(38336K), 0.2096280 secs] 3151199K->3117154K(4091776K)
> > icms_dc=6 , 0.2097770 secs] [Times: user=0.10 sys=0.00, real=0.21 secs]
> > 2012-04-30T18:58:51.300+0000: 105722.408: [GC 105722.408: [ParNew:
> > 37288K->3928K(38336K), 0.0873820 secs] 3151266K->3117907K(4091776K)
> > icms_dc=6 , 0.0875310 secs] [Times: user=0.05 sys=0.00, real=0.08 secs]
> > 2012-04-30T18:58:52.109+0000: 105723.217: [GC 105723.217: [ParNew:
> > 38040K->2095K(38336K), 0.0791560 secs] 3152019K->3116839K(4091776K)
> > icms_dc=6 , 0.0792970 secs] [Times: user=0.05 sys=0.00, real=0.08 secs]
> > 2012-04-30T18:58:53.132+0000: 105724.240: [GC 105724.240: [ParNew:
> > 36207K->2581K(38336K), 0.1651210 secs] 3150951K->3117700K(4091776K)
> > icms_dc=6 , 0.1652650 secs] [Times: user=0.06 sys=0.00, real=0.16 secs]
> > 2012-04-30T18:58:55.402+0000: 105726.510: [GC 105726.510: [ParNew:
> > 36693K->2297K(38336K), 0.1262480 secs] 3151812K->3117742K(4091776K)
> > icms_dc=6 , 0.1264420 secs] [Times: user=0.06 sys=0.00, real=0.13 secs]
> > 2012-04-30T18:58:56.103+0000: 105727.211: [GC 105727.211: [ParNew:
> > 36409K->4224K(38336K), 0.3658760 secs] 3151854K->3120205K(4091776K)
> > icms_dc=6 , 0.3660220 secs]
> >
>

Reply via email to