Re: RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec)

Andrew Purtell Tue, 01 May 2012 09:38:07 -0700

> This will work, but I don't think that this is a good way to go, since you 
> will be treating the 
symptom and not the actual problem, which is an overloaded node.


I should have been a bit more specific. One would do this for the HDFS and 
HBase daemons, and not for other subsystems like MR Child processes or the 
TaskTracker if the oom_adj is inherited by its children. It doesn't cure the 
problem as you say but will help people avoid shooting their toes off with 
common misconfiguration issues -- too many mappers, or some other non-Hadoop 
process on the box chewing up VM. The Ruby interpreter (running Puppet) comes 
to mind as an example, though recent releases have fixed that particular 
problem.
 

Best regards,


    - Andy


Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)



>________________________________
> From: Igal Shilman <[email protected]>
>To: [email protected]; Andrew Purtell <[email protected]> 
>Sent: Tuesday, May 1, 2012 9:28 AM
>Subject: Re: RegionServer silently stops (only "issue": CMS-concurrent-mark 
>~80sec)
> 
>
>This will work, but I don't think that this is a good way to go, since you 
>will be treating the 
symptom and not the actual problem, which is an overloaded node.
>
>Igal.
>
>
>On Tue, May 1, 2012 at 6:57 PM, Andrew Purtell <[email protected]> wrote:
>
>Does it make sense to include the below in ./bin/hbase and ./bin/hadoop:
>>
>>    echo -17 > /proc/$PID/oom_adj
>>
>>-17 is OOM_DISABLE: http://linux-mm.org/OOM_Killer
>>
>>Best regards,
>>
>>    - Andy
>>
>>Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
>>Tom White)
>>
>>
>>
>>
>>----- Original Message -----
>>> From: Alex Baranau <[email protected]>
>>> To: [email protected]
>>> Cc:
>>> Sent: Tuesday, May 1, 2012 8:27 AM
>>> Subject: Re: RegionServer silently stops (only "issue": CMS-concurrent-mark 
>>> ~80sec)
>>>
>>> Yep, that is what I thought (OOM). Our monitoring tool (sematext.com/spm)
>>> says that everything was OK with theregard to JVM memory consumption at the
>>> time process stopped. But found that oom killer is one who killed it (on
>>> this cluster we have HBASE_HEAPSIZE=4000):
>>>
>>> /var/log/kern.log.1:Apr 30 18:59:04 ip-10-88-231-73 kernel:
>>> [32076201.925847] Out of memory: Kill process 15227 (java) score 573 or
>>> sacrifice child
>>> /var/log/kern.log.1-Apr 30 18:59:04 ip-10-88-231-73 kernel:
>>> [32076201.925881] Killed process 15227 (java) total-vm:4716576kB,
>>> anon-rss:4364780kB, file-rss:12020kB
>>>
>>> All clear, thanx!
>>>
>>> Alex Baranau
>>> ------
>>> Sematext :: http://blog.sematext.com/
>>>
>>> On Tue, May 1, 2012 at 9:02 AM, Dhaval Shah
>>> <[email protected]>wrote:
>>>
>>>>
>>>>
>>>>  Not sure if its related (or even helpful) but we were using cdh3b4 (which
>>>>  is 0.90.1) and we saw similar issues with region servers going down.. we
>>>>  didn't look at GC logs but we had very high zookeeper leases so its
>>>>  unlikely that the GC could have caused the issue.. this problem went away
>>>>  when we upgraded to cdh3u3 which is rock steady in terms of region
>>>>  servers.. (havent had a single region server crash in a month where on the
>>>>  older version I used to have 1 crash every couple of days).. the only 
>>>>other
>>>>  difference between the two is that we use snappy on the newer one and gz 
>>>>on
>>>>  the old
>>>>
>>>>  We also noticed that having replication enabled also contributed to the
>>>>  issues..
>>>>
>>>>
>>>>  ------------------------------
>>>>  On Tue 1 May, 2012 3:15 PM IST N Keywal wrote:
>>>>
>>>>  >Hi Alex,
>>>>  >
>>>>  >On the same idea, note that hbase is launched with
>>>>  >-XX:OnOutOfMemoryError="kill -9 %p".
>>>>  >
>>>>  >N.
>>>>  >
>>>>  >On Tue, May 1, 2012 at 10:41 AM, Igal Shilman <[email protected]>
>>> wrote:
>>>>  >
>>>>  >> Hi Alex, just to rule out, oom killer,
>>>>  >> Try this:
>>>>  >>
>>>>  >>
>>>>
>>> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>>>>  >>
>>>>  >>
>>>>  >> On Mon, Apr 30, 2012 at 10:48 PM, Alex Baranau <
>>>>  [email protected]
>>>>  >> >wrote:
>>>>  >>
>>>>  >> > Hello,
>>>>  >> >
>>>>  >> > During recent weeks I constantly see some RSs *silently*
>>> dying on our
>>>>  >> HBase
>>>>  >> > cluster. By "silently" I mean that process stops,
>>> but no errors in
>>>>  logs
>>>>  >> > [1].
>>>>  >> >
>>>>  >> > The only thing I can relate to it is long
>>> CMS-concurrent-mark: almost
>>>>  80
>>>>  >> > seconds. But this should not cause issues as it is not a
>>>>  "stop-the-world"
>>>>  >> > process.
>>>>  >> >
>>>>  >> > Any advice?
>>>>  >> >
>>>>  >> > HBase: hbase-0.90.4-cdh3u3
>>>>  >> > Hadoop: 0.20.2-cdh3u3
>>>>  >> >
>>>>  >> > Thank you,
>>>>  >> > Alex Baranau
>>>>  >> >
>>>>  >> > [1]
>>>>  >> >
>>>>  >> > last lines from RS log (no errors before too, and nothing
>>> written in
>>>>  >> *.out
>>>>  >> > file):
>>>>  >> >
>>>>  >> > 2012-04-30 18:52:11,806 DEBUG
>>>>  >> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>> Compaction
>>>>  >> > requested for agg-sa-1.3,0011|
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00\x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65.
>>>>  >> > because regionserver60020.cacheFlusher; priority=-1,
>>> compaction queue
>>>>  >> > size=0
>>>>  >> > 2012-04-30 18:54:58,779 DEBUG
>>>>  >> >
>>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using
>>>>  new
>>>>  >> > createWriter -- HADOOP-6840
>>>>  >> > 2012-04-30 18:54:58,779 DEBUG
>>>>  >> >
>>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651,
>>>>  >> > syncFs=true, hflush=false
>>>>  >> > 2012-04-30 18:54:58,874 INFO
>>>>  >> org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>>  >> > Roll
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335811856672,
>>>>  >> > entries=73789, filesize=63773934. New hlog
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651
>>>>  >> > 2012-04-30 18:56:31,867 INFO
>>>>  >> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush
>>> thread
>>>>  woke
>>>>  >> up
>>>>  >> > with memory above low water.
>>>>  >> > 2012-04-30 18:56:31,867 INFO
>>>>  >> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush
>>> of region
>>>>  >> > agg-sa-1.3,s_00I4|
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
>>>>  >> > due to global heap pressure
>>>>  >> > 2012-04-30 18:56:31,867 DEBUG
>>>>  >> org.apache.hadoop.hbase.regionserver.HRegion:
>>>>  >> > Started memstore flush for agg-sa-1.3,s_00I4|
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.,
>>>>  >> > current region memstore size 138.1m
>>>>  >> > 2012-04-30 18:56:31,867 DEBUG
>>>>  >> org.apache.hadoop.hbase.regionserver.HRegion:
>>>>  >> > Finished snapshotting, commencing flushing stores
>>>>  >> > 2012-04-30 18:56:56,303 DEBUG
>>>>  >> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats:
>>>>  total=322.84
>>>>  >> MB,
>>>>  >> > free=476.34 MB, max=799.17 MB, blocks=5024,
>>> accesses=12189396,
>>>>  >> hits=127592,
>>>>  >> > hitRatio=1.04%%, cachingAccesses=132480, cachingHits=126949,
>>>>  >> > cachingHitsRatio=95.82%%, evictions=0, evicted=0,
>>> evictedPerRun=NaN
>>>>  >> > 2012-04-30 18:56:59,026 INFO
>>>>  org.apache.hadoop.hbase.regionserver.Store:
>>>>  >> > Renaming flushed file at
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/.tmp/391890051647401997
>>>>  >> > to
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168
>>>>  >> > 2012-04-30 18:56:59,034 INFO
>>>>  org.apache.hadoop.hbase.regionserver.Store:
>>>>  >> > Added
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168,
>>>>  >> > entries=476418, sequenceid=880198761, memsize=138.1m,
>>> filesize=5.7m
>>>>  >> > 2012-04-30 18:56:59,097 INFO
>>>>  >> org.apache.hadoop.hbase.regionserver.HRegion:
>>>>  >> > Finished memstore flush of ~138.1m for region
>>> agg-sa-1.3,s_00I4|
>>>>  >> >
>>>>  >> >
>>>>  >>
>>>>
>>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
>>>>  >> > in 27230ms, sequenceid=880198761, compaction requested=false
>>>>  >> > ~
>>>>  >> >
>>>>  >> > [2]
>>>>  >> >
>>>>  >> > last lines from GC log:
>>>>  >> >
>>>>  >> > 2012-04-30T18:58:46.683+0000: 105717.791: [GC 105717.791:
>>> [ParNew:
>>>>  >> > 35638K->1118K(38336K), 0.0548970 secs]
>>> 3145651K->3111412K(4091776K)
>>>>  >> > icms_dc=6 , 0.0550360 secs] [Times: user=0.08 sys=0.00,
>>> real=0.09
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:46.961+0000: 105718.069: [GC 105718.069:
>>> [ParNew:
>>>>  >> > 35230K->2224K(38336K), 0.0802440 secs]
>>> 3145524K->3112533K(4091776K)
>>>>  >> > icms_dc=6 , 0.0803810 secs] [Times: user=0.06 sys=0.00,
>>> real=0.13
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:47.114+0000: 105718.222:
>>> [CMS-concurrent-mark:
>>>>  >> > 8.770/80.230 secs] [Times: user=61.34 sys=5.69, real=80.23
>>> secs]
>>>>  >> > 2012-04-30T18:58:47.114+0000: 105718.222:
>>>>  [CMS-concurrent-preclean-start]
>>>>  >> > 2012-04-30T18:58:47.183+0000: 105718.291: [GC 105718.291:
>>> [ParNew:
>>>>  >> > 36336K->3895K(38336K), 0.0272610 secs]
>>> 3146645K->3114281K(4091776K)
>>>>  >> > icms_dc=6 , 0.0274040 secs] [Times: user=0.05 sys=0.00,
>>> real=0.03
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:47.452+0000: 105718.560: [GC 105718.560:
>>> [ParNew:
>>>>  >> > 37990K->3082K(38336K), 0.0504660 secs]
>>> 3148376K->3114329K(4091776K)
>>>>  >> > icms_dc=6 , 0.0506020 secs] [Times: user=0.06 sys=0.00,
>>> real=0.05
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:47.823+0000: 105718.930: [GC 105718.931:
>>> [ParNew:
>>>>  >> > 37194K->3340K(38336K), 0.0419660 secs]
>>> 3148441K->3115727K(4091776K)
>>>>  >> > icms_dc=6 , 0.0421110 secs] [Times: user=0.05 sys=0.00,
>>> real=0.04
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:48.099+0000: 105719.207: [GC 105719.207:
>>> [ParNew:
>>>>  >> > 37351K->3249K(38336K), 0.0695950 secs]
>>> 3149738K->3115636K(4091776K)
>>>>  >> > icms_dc=6 , 0.0697320 secs] [Times: user=0.07 sys=0.00,
>>> real=0.07
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:48.447+0000: 105719.555: [GC 105719.555:
>>> [ParNew:
>>>>  >> > 37361K->2512K(38336K), 0.0510560 secs]
>>> 3149748K->3116166K(4091776K)
>>>>  >> > icms_dc=6 , 0.0512050 secs] [Times: user=0.05 sys=0.00,
>>> real=0.00
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:48.794+0000: 105719.902: [GC 105719.902:
>>> [ParNew:
>>>>  >> > 36624K->2616K(38336K), 0.1986770 secs]
>>> 3150278K->3116270K(4091776K)
>>>>  >> > icms_dc=6 , 0.1988160 secs] [Times: user=0.05 sys=0.00,
>>> real=0.20
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:49.614+0000: 105720.722: [GC 105720.722:
>>> [ParNew:
>>>>  >> > 36728K->3450K(38336K), 0.1099190 secs]
>>> 3150382K->3117103K(4091776K)
>>>>  >> > icms_dc=6 , 0.1100650 secs] [Times: user=0.05 sys=0.00,
>>> real=0.11
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:50.631+0000: 105721.739: [GC 105721.739:
>>> [ParNew:
>>>>  >> > 37546K->3176K(38336K), 0.2096280 secs]
>>> 3151199K->3117154K(4091776K)
>>>>  >> > icms_dc=6 , 0.2097770 secs] [Times: user=0.10 sys=0.00,
>>> real=0.21
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:51.300+0000: 105722.408: [GC 105722.408:
>>> [ParNew:
>>>>  >> > 37288K->3928K(38336K), 0.0873820 secs]
>>> 3151266K->3117907K(4091776K)
>>>>  >> > icms_dc=6 , 0.0875310 secs] [Times: user=0.05 sys=0.00,
>>> real=0.08
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:52.109+0000: 105723.217: [GC 105723.217:
>>> [ParNew:
>>>>  >> > 38040K->2095K(38336K), 0.0791560 secs]
>>> 3152019K->3116839K(4091776K)
>>>>  >> > icms_dc=6 , 0.0792970 secs] [Times: user=0.05 sys=0.00,
>>> real=0.08
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:53.132+0000: 105724.240: [GC 105724.240:
>>> [ParNew:
>>>>  >> > 36207K->2581K(38336K), 0.1651210 secs]
>>> 3150951K->3117700K(4091776K)
>>>>  >> > icms_dc=6 , 0.1652650 secs] [Times: user=0.06 sys=0.00,
>>> real=0.16
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:55.402+0000: 105726.510: [GC 105726.510:
>>> [ParNew:
>>>>  >> > 36693K->2297K(38336K), 0.1262480 secs]
>>> 3151812K->3117742K(4091776K)
>>>>  >> > icms_dc=6 , 0.1264420 secs] [Times: user=0.06 sys=0.00,
>>> real=0.13
>>>>  secs]
>>>>  >> > 2012-04-30T18:58:56.103+0000: 105727.211: [GC 105727.211:
>>> [ParNew:
>>>>  >> > 36409K->4224K(38336K), 0.3658760 secs]
>>> 3151854K->3120205K(4091776K)
>>>>  >> > icms_dc=6 , 0.3660220 secs]
>>>>  >> >
>>>>  >>
>>>>
>>>>
>>>
>>
>
>
>

Re: RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec)

Reply via email to