Yep, that is what I thought (OOM). Our monitoring tool (sematext.com/spm)
says that everything was OK with regard to JVM memory consumption at the
time the process stopped. But I found that the OOM killer is the one that
killed it (on this cluster we have HBASE_HEAPSIZE=4000):

/var/log/kern.log.1:Apr 30 18:59:04 ip-10-88-231-73 kernel:
[32076201.925847] Out of memory: Kill process 15227 (java) score 573 or
sacrifice child
/var/log/kern.log.1-Apr 30 18:59:04 ip-10-88-231-73 kernel:
[32076201.925881] Killed process 15227 (java) total-vm:4716576kB,
anon-rss:4364780kB, file-rss:12020kB
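
That anon-rss (4364780 kB, roughly 4260 MB) is already a couple of hundred
MB above the 4000 MB heap: heap plus the usual JVM overhead (permgen, thread
stacks, native/NIO buffers) was evidently more than this box could spare, so
the kernel stepped in.

In case it helps anyone else, here is a rough sketch (Python, not part of
HBase) of scanning the kernel logs for OOM-killer victims and comparing the
reported RSS with the configured heap. The log paths and the 4000 MB value
are just what we have on this cluster, adjust as needed:

import glob
import re

HBASE_HEAPSIZE_MB = 4000  # heap configured in hbase-env.sh on this cluster

# The kernel logs OOM kills as lines like:
#   Killed process 15227 (java) total-vm:4716576kB, anon-rss:4364780kB, ...
KILLED = re.compile(r"Killed process (\d+) \((\S+)\).*anon-rss:(\d+)kB")

# only the uncompressed logs; rotated .gz archives are skipped in this sketch
# (reading /var/log may require root)
for path in sorted(glob.glob("/var/log/kern.log") +
                   glob.glob("/var/log/kern.log.[0-9]")):
    with open(path) as f:
        for line in f:
            m = KILLED.search(line)
            if m:
                pid, name, rss_kb = m.group(1), m.group(2), int(m.group(3))
                print("%s: pid %s (%s) killed, anon-rss %d MB "
                      "(configured heap: %d MB)"
                      % (path, pid, name, rss_kb // 1024, HBASE_HEAPSIZE_MB))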

All clear, thanx!
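
Also, as N points out below, the RS JVM is started with
-XX:OnOutOfMemoryError="kill -9 %p", so a Java-level OOME would kill the
process just as silently. A quick way to confirm the flag is there on a
running RS (a sketch, assuming Linux /proc and that the option is passed on
the command line):

import sys

def jvm_args(pid):
    # /proc/<pid>/cmdline holds the full command line, NUL-separated
    with open("/proc/%d/cmdline" % pid, "rb") as f:
        return f.read().decode("utf-8", "replace").split("\0")

pid = int(sys.argv[1])  # e.g. the RS pid reported by `jps` or in kern.log
flags = [a for a in jvm_args(pid) if a.startswith("-XX:OnOutOfMemoryError")]
print(flags if flags else "OnOutOfMemoryError not set on the command line")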

Alex Baranau
------
Sematext :: http://blog.sematext.com/

On Tue, May 1, 2012 at 9:02 AM, Dhaval Shah <[email protected]> wrote:

>
>
> Not sure if it's related (or even helpful), but we were using cdh3b4 (which
> is 0.90.1) and we saw similar issues with region servers going down.. we
> didn't look at GC logs, but we had very high zookeeper leases so it's
> unlikely that GC could have caused the issue.. this problem went away
> when we upgraded to cdh3u3, which is rock steady in terms of region
> servers.. (haven't had a single region server crash in a month, whereas on
> the older version I used to have 1 crash every couple of days).. the only
> other difference between the two is that we use snappy on the newer one
> and gz on the old
>
> We also noticed that having replication enabled contributed to the
> issues..
>
>
> ------------------------------
> On Tue 1 May, 2012 3:15 PM IST N Keywal wrote:
>
> >Hi Alex,
> >
> >On the same idea, note that hbase is launched with
> >-XX:OnOutOfMemoryError="kill -9 %p".
> >
> >N.
> >
> >On Tue, May 1, 2012 at 10:41 AM, Igal Shilman <[email protected]> wrote:
> >
> >> Hi Alex, just to rule out the oom killer,
> >> try this:
> >>
> >>
> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
> >>
> >>
> >> On Mon, Apr 30, 2012 at 10:48 PM, Alex Baranau <
> [email protected]
> >> >wrote:
> >>
> >> > Hello,
> >> >
> >> > During recent weeks I constantly see some RSs *silently* dying on our
> >> > HBase cluster. By "silently" I mean that the process stops, but there
> >> > are no errors in the logs [1].
> >> >
> >> > The only thing I can relate it to is a long CMS-concurrent-mark: almost
> >> > 80 seconds. But this should not cause issues, as it is not a
> >> > "stop-the-world" phase.
> >> >
> >> > Any advice?
> >> >
> >> > HBase: hbase-0.90.4-cdh3u3
> >> > Hadoop: 0.20.2-cdh3u3
> >> >
> >> > Thank you,
> >> > Alex Baranau
> >> >
> >> > [1]
> >> >
> >> > last lines from the RS log (no errors before either, and nothing
> >> > written in the *.out file):
> >> >
> >> > 2012-04-30 18:52:11,806 DEBUG
> >> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> >> > requested for agg-sa-1.3,0011|
> >> >
> >> >
> >>
> te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00\x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65.
> >> > because regionserver60020.cacheFlusher; priority=-1, compaction queue
> >> > size=0
> >> > 2012-04-30 18:54:58,779 DEBUG
> >> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using
> new
> >> > createWriter -- HADOOP-6840
> >> > 2012-04-30 18:54:58,779 DEBUG
> >> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
> >> >
> >> >
> >>
> Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651,
> >> > syncFs=true, hflush=false
> >> > 2012-04-30 18:54:58,874 INFO
> >> org.apache.hadoop.hbase.regionserver.wal.HLog:
> >> > Roll
> >> >
> >> >
> >>
> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335811856672,
> >> > entries=73789, filesize=63773934. New hlog
> >> >
> >> >
> >>
> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651
> >> > 2012-04-30 18:56:31,867 INFO
> >> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread
> woke
> >> up
> >> > with memory above low water.
> >> > 2012-04-30 18:56:31,867 INFO
> >> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
> >> > agg-sa-1.3,s_00I4|
> >> >
> >> >
> >>
> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
> >> > due to global heap pressure
> >> > 2012-04-30 18:56:31,867 DEBUG
> >> org.apache.hadoop.hbase.regionserver.HRegion:
> >> > Started memstore flush for agg-sa-1.3,s_00I4|
> >> >
> >> >
> >>
> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.,
> >> > current region memstore size 138.1m
> >> > 2012-04-30 18:56:31,867 DEBUG
> >> org.apache.hadoop.hbase.regionserver.HRegion:
> >> > Finished snapshotting, commencing flushing stores
> >> > 2012-04-30 18:56:56,303 DEBUG
> >> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats:
> total=322.84
> >> MB,
> >> > free=476.34 MB, max=799.17 MB, blocks=5024, accesses=12189396,
> >> hits=127592,
> >> > hitRatio=1.04%%, cachingAccesses=132480, cachingHits=126949,
> >> > cachingHitsRatio=95.82%%, evictions=0, evicted=0, evictedPerRun=NaN
> >> > 2012-04-30 18:56:59,026 INFO
> org.apache.hadoop.hbase.regionserver.Store:
> >> > Renaming flushed file at
> >> >
> >> >
> >>
> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/.tmp/391890051647401997
> >> > to
> >> >
> >> >
> >>
> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168
> >> > 2012-04-30 18:56:59,034 INFO
> org.apache.hadoop.hbase.regionserver.Store:
> >> > Added
> >> >
> >> >
> >>
> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168,
> >> > entries=476418, sequenceid=880198761, memsize=138.1m, filesize=5.7m
> >> > 2012-04-30 18:56:59,097 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegion:
> >> > Finished memstore flush of ~138.1m for region agg-sa-1.3,s_00I4|
> >> >
> >> >
> >>
> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
> >> > in 27230ms, sequenceid=880198761, compaction requested=false
> >> > ~
> >> >
> >> > [2]
> >> >
> >> > last lines from GC log:
> >> >
> >> > 2012-04-30T18:58:46.683+0000: 105717.791: [GC 105717.791: [ParNew:
> >> > 35638K->1118K(38336K), 0.0548970 secs] 3145651K->3111412K(4091776K)
> >> > icms_dc=6 , 0.0550360 secs] [Times: user=0.08 sys=0.00, real=0.09
> secs]
> >> > 2012-04-30T18:58:46.961+0000: 105718.069: [GC 105718.069: [ParNew:
> >> > 35230K->2224K(38336K), 0.0802440 secs] 3145524K->3112533K(4091776K)
> >> > icms_dc=6 , 0.0803810 secs] [Times: user=0.06 sys=0.00, real=0.13
> secs]
> >> > 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-mark:
> >> > 8.770/80.230 secs] [Times: user=61.34 sys=5.69, real=80.23 secs]
> >> > 2012-04-30T18:58:47.114+0000: 105718.222:
> [CMS-concurrent-preclean-start]
> >> > 2012-04-30T18:58:47.183+0000: 105718.291: [GC 105718.291: [ParNew:
> >> > 36336K->3895K(38336K), 0.0272610 secs] 3146645K->3114281K(4091776K)
> >> > icms_dc=6 , 0.0274040 secs] [Times: user=0.05 sys=0.00, real=0.03
> secs]
> >> > 2012-04-30T18:58:47.452+0000: 105718.560: [GC 105718.560: [ParNew:
> >> > 37990K->3082K(38336K), 0.0504660 secs] 3148376K->3114329K(4091776K)
> >> > icms_dc=6 , 0.0506020 secs] [Times: user=0.06 sys=0.00, real=0.05
> secs]
> >> > 2012-04-30T18:58:47.823+0000: 105718.930: [GC 105718.931: [ParNew:
> >> > 37194K->3340K(38336K), 0.0419660 secs] 3148441K->3115727K(4091776K)
> >> > icms_dc=6 , 0.0421110 secs] [Times: user=0.05 sys=0.00, real=0.04
> secs]
> >> > 2012-04-30T18:58:48.099+0000: 105719.207: [GC 105719.207: [ParNew:
> >> > 37351K->3249K(38336K), 0.0695950 secs] 3149738K->3115636K(4091776K)
> >> > icms_dc=6 , 0.0697320 secs] [Times: user=0.07 sys=0.00, real=0.07
> secs]
> >> > 2012-04-30T18:58:48.447+0000: 105719.555: [GC 105719.555: [ParNew:
> >> > 37361K->2512K(38336K), 0.0510560 secs] 3149748K->3116166K(4091776K)
> >> > icms_dc=6 , 0.0512050 secs] [Times: user=0.05 sys=0.00, real=0.00
> secs]
> >> > 2012-04-30T18:58:48.794+0000: 105719.902: [GC 105719.902: [ParNew:
> >> > 36624K->2616K(38336K), 0.1986770 secs] 3150278K->3116270K(4091776K)
> >> > icms_dc=6 , 0.1988160 secs] [Times: user=0.05 sys=0.00, real=0.20
> secs]
> >> > 2012-04-30T18:58:49.614+0000: 105720.722: [GC 105720.722: [ParNew:
> >> > 36728K->3450K(38336K), 0.1099190 secs] 3150382K->3117103K(4091776K)
> >> > icms_dc=6 , 0.1100650 secs] [Times: user=0.05 sys=0.00, real=0.11
> secs]
> >> > 2012-04-30T18:58:50.631+0000: 105721.739: [GC 105721.739: [ParNew:
> >> > 37546K->3176K(38336K), 0.2096280 secs] 3151199K->3117154K(4091776K)
> >> > icms_dc=6 , 0.2097770 secs] [Times: user=0.10 sys=0.00, real=0.21
> secs]
> >> > 2012-04-30T18:58:51.300+0000: 105722.408: [GC 105722.408: [ParNew:
> >> > 37288K->3928K(38336K), 0.0873820 secs] 3151266K->3117907K(4091776K)
> >> > icms_dc=6 , 0.0875310 secs] [Times: user=0.05 sys=0.00, real=0.08
> secs]
> >> > 2012-04-30T18:58:52.109+0000: 105723.217: [GC 105723.217: [ParNew:
> >> > 38040K->2095K(38336K), 0.0791560 secs] 3152019K->3116839K(4091776K)
> >> > icms_dc=6 , 0.0792970 secs] [Times: user=0.05 sys=0.00, real=0.08
> secs]
> >> > 2012-04-30T18:58:53.132+0000: 105724.240: [GC 105724.240: [ParNew:
> >> > 36207K->2581K(38336K), 0.1651210 secs] 3150951K->3117700K(4091776K)
> >> > icms_dc=6 , 0.1652650 secs] [Times: user=0.06 sys=0.00, real=0.16
> secs]
> >> > 2012-04-30T18:58:55.402+0000: 105726.510: [GC 105726.510: [ParNew:
> >> > 36693K->2297K(38336K), 0.1262480 secs] 3151812K->3117742K(4091776K)
> >> > icms_dc=6 , 0.1264420 secs] [Times: user=0.06 sys=0.00, real=0.13
> secs]
> >> > 2012-04-30T18:58:56.103+0000: 105727.211: [GC 105727.211: [ParNew:
> >> > 36409K->4224K(38336K), 0.3658760 secs] 3151854K->3120205K(4091776K)
> >> > icms_dc=6 , 0.3660220 secs]
> >> >
> >>
>
>
