Pablo,

One more question: what's your Java version?


On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu <azury...@gmail.com> wrote:

> Hi Pablo,
> Such a long minor GC is terrible. I don't see any swapping in your
> vmstat log, but I suggest the following:
> 1) add these JVM options (see the sketch after this list):
> -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods
>
> 2) your -Xmn is too small; your total Mem is 74GB, so make it -Xmn2g
> 3) what were you doing when the long GC happened? Reading or writing?
> If reading, what is the block cache size?
>
>
>
>
> On Mon, Mar 11, 2013 at 6:41 AM, Stack <st...@duboce.net> wrote:
>
>> You could increase your zookeeper session timeout to 5 minutes while you
>> figure out why these long pauses are happening.
>> http://hbase.apache.org/book.html#zookeeper.session.timeout
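>>
>> A sketch of the change, assuming the zookeeper.session.timeout property
>> described in the book section above, set in hbase-site.xml (note the
>> effective timeout is also bounded by the ZooKeeper server's
>> maxSessionTimeout, which may need raising as well):
>>
>>   <property>
>>     <name>zookeeper.session.timeout</name>
>>     <value>300000</value>  <!-- 5 minutes, in milliseconds -->
>>   </property>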
>>
>> Above, there is an outage of nearly four minutes:
>>
>> >> We slept 225100ms instead of 3000ms, this is likely due to a long
>>
>> You have ganglia or tsdb running?  When you see the big pause above, can
>> you see anything going on on the machine?  (swap, iowait, concurrent fat
>> mapreduce job?)
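>>
>> If you don't have ganglia/tsdb, a rough sketch of stock commands to run
>> on the box when a pause hits (all standard procps/sysstat tools):
>>
>>   vmstat 1 10     # watch si/so (swap in/out) and wa (iowait) live
>>   sar -W          # swapping statistics collected over the day
>>   sar -u          # CPU breakdown; look for %system and %iowait spikes
>>   ps -eo pid,comm,maj_flt | grep java   # major faults suggest paging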
>>
>> St.Ack
>>
>>
>>
>> On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa <pa...@psafe.com> wrote:
>>
>> > Hi Sreepathi,
>> > they say in the book (or on the site) that we could try it to see
>> > whether it is really a timeout error or something more, but it is not
>> > recommended for production environments.
>> >
>> > I could give it a try if five minutes will assure us that the problem
>> > is the GC or something else!! Anyway, I find it hard to believe a GC
>> > is taking 2:30 minutes.
>> >
>> > Abs,
>> > Pablo
>> >
>> >
>> > On 03/10/2013 04:06 PM, Sreepathi wrote:
>> >
>> >> Hi Stack/Ted/Pablo,
>> >>
>> >> Should we increase the hbase.rpc.timeout property to 5 minutes
>> >> (300000 ms)?
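>> >>
>> >> A sketch of what that would look like in hbase-site.xml (assuming the
>> >> stock hbase.rpc.timeout property; note this only hides the pause from
>> >> clients rather than fixing its cause):
>> >>
>> >>   <property>
>> >>     <name>hbase.rpc.timeout</name>
>> >>     <value>300000</value>
>> >>   </property>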
>> >>
>> >> Regards,
>> >> - Sreepathi
>> >>
>> >> On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <pa...@psafe.com> wrote:
>> >>
>> >>>> That combo should be fine.
>> >>>>
>> >>> Great!!
>> >>>
>> >>>
>> >>>> If JVM is full GC'ing, the application is stopped.
>> >>>> The below does not look like a full GC but that is a long pause in
>> >>>> system time, enough to kill your zk session.
>> >>>>
>> >>> Exactly. This pause is really making zk expire the RS, which then
>> >>> shuts down (logs at the end of the email).
>> >>> But the question is: what is causing this pause??!!
>> >>>
>> >>>> You swapping?
>> >>>>
>> >>> I don't think so (stats below).
>> >>>
>> >>>> Hardware is good?
>> >>>>
>> >>> Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk
>> >>> space. Below are some metrics I have heard about. Hope it helps.
>> >>>
>> >>>
>> >>> ** I am having some problems with the datanodes[1], which are having
>> >>> trouble writing. I really think the issues are related, but I cannot
>> >>> solve either of them :(
>> >>>
>> >>> Thanks again,
>> >>> Pablo
>> >>>
>> >>> [1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3ccajzooyfs-f1ks+jgoput15pwfjccszige0apem9fxacrqfv...@mail.gmail.com%3E
>> >>>
>> >>> top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34, 2.55, 1.28
>> >>> Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
>> >>> Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> >>> Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k buffers
>> >>> Swap: 51609592k total,   128312k used, 51481280k free,  1353400k cached
>> >>>
>> >>> ]$ vmstat -w
>> >>> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>> >>>  r  b       swpd       free       buff      cache   si   so    bi    bo    in    cs  us sy  id wa st
>> >>>  2  0     128312   32416928    5838288    5043560    0    0   202    53     0     0   2  1  96  1  0
>> >>>
>> >>> ]$ sar
>> >>>                 CPU     %user     %nice   %system   %iowait    %steal     %idle
>> >>> 02:20:01 PM     all     26.18      0.00      2.90      0.63      0.00     70.29
>> >>> 02:30:01 PM     all      1.66      0.00      1.25      1.05      0.00     96.04
>> >>> 02:40:01 PM     all     10.01      0.00      2.14      0.75      0.00     87.11
>> >>> 02:50:01 PM     all      0.76      0.00      0.80      1.03      0.00     97.40
>> >>> 03:00:01 PM     all      0.23      0.00      0.30      0.71      0.00     98.76
>> >>> 03:10:01 PM     all      0.22      0.00      0.30      0.66      0.00     98.82
>> >>> 03:20:01 PM     all      0.22      0.00      0.31      0.76      0.00     98.71
>> >>> 03:30:01 PM     all      0.24      0.00      0.31      0.64      0.00     98.81
>> >>> 03:40:01 PM     all      1.13      0.00      2.97      1.18      0.00     94.73
>> >>> Average:        all      3.86      0.00      1.38      0.88      0.00     93.87
>> >>>
>> >>> ]$ iostat
>> >>> Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002)     03/10/2013     _x86_64_     (16 CPU)
>> >>>
>> >>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >>>            1.86    0.00    0.96    0.78    0.00   96.41
>> >>>
>> >>> Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
>> >>> sda               1.23        20.26        23.53    521533196   605566924
>> >>> sdb               6.51       921.55       241.90  23717850730  6225863488
>> >>> sdc               6.22       921.83       236.41  23725181162  6084471192
>> >>> sdd               6.25       925.13       237.26  23810004970  6106357880
>> >>> sde               6.19       913.90       235.60  23521108818  6063722504
>> >>> sdh               6.26       933.08       237.77  24014594546  6119511376
>> >>> sdg               6.18       914.36       235.31  23532747378  6056257016
>> >>> sdf               6.24       923.66       235.33  23772251810  6056604008
>> >>>
>> >>> Here is some more logging that reinforces that the RS crash is caused
>> >>> by a timeout. However, this time the GC log does not show a long pause.
>> >>>
>> >>> #####RS LOG#####
>> >>> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 257739ms for sessionid 0x13d3c4bcba6014a, closing socket connection and attempting reconnect
>> >>> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 226785ms for sessionid 0x13d3c4bcba60149, closing socket connection and attempting reconnect
>> >>> 2013-03-10 15:37:46,712 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=61.91 MB, free=1.94 GB, max=2 GB, blocks=1254, accesses=60087, hits=58811, hitRatio=97.87%, , cachingAccesses=60069, cachingHits=58811, cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN
>> >>> 2013-03-10 15:37:46,712 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 225100ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> >>> 2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-43236042-172.17.2.10-1362490844340:blk_-6834190810033122569_25150229
>> >>> java.io.EOFException: Premature EOF: no length prefix available
>> >>>         at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
>> >>>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
>> >>>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
>> >>> 2013-03-10 15:37:46,716 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call get([B@7caf69ed, {"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"details":["ALL"]},"maxVersions":1,"row":"\\x00\\x00\\x00\\x00\\x00\\x12\\x93@"}), rpc version=1, client version=29, methodsFingerPrint=1891768260 from 172.17.1.71:51294 after 0 ms, since caller disconnected
>> >>>
>> >>>
>> >>> #####GC LOG#####
>> >>> 2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 secs] 354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
>> >>> 2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370 secs] 353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
>> >>> 2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 secs] 353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>> >>> 2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440 secs] 353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>> >>> 2984.239: [GC 2984.239: [ParNew: 58680K->5282K(59008K), 0.0048070 secs] 362921K->310977K(1152704K), 0.0049180 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>> >>> 3050.620: [GC 3050.620: [ParNew: 57524K->5292K(59008K), 0.0055510 secs] 363219K->313384K(1152704K), 0.0056520 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
>> >>> 3111.331: [GC 3111.331: [ParNew: 57542K->5637K(59008K), 0.0056900 secs] 365634K->318079K(1152704K), 0.0057950 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>> >>> 3163.510: [GC 3163.510: [ParNew: 58117K->1138K(59008K), 0.0046540 secs] 370559K->316086K(1152704K), 0.0047470 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>> >>> 3217.642: [GC 3217.643: [ParNew: 53618K->1001K(59008K), 0.0038530 secs] 368566K->315949K(1152704K), 0.0039450 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
>> >>> 3464.067: [GC 3464.067: [ParNew: 53481K->5348K(59008K), 0.0068370 secs] 368429K->322343K(1152704K), 0.0069170 secs] [Times: user=0.07 sys=0.01, real=0.00 secs]
>> >>> 3465.648: [GC 3465.648: [ParNew: 57828K->1609K(59008K), 0.0111520 secs] 374823K->321496K(1152704K), 0.0112170 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
>> >>> 3466.291: [GC 3466.291: [ParNew: 54089K->2186K(59008K), 0.0093190 secs] 373976K->322073K(1152704K), 0.0093800 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>> >>> Heap
>> >>>  par new generation   total 59008K, used 15497K [0x00000005fae00000, 0x00000005fee00000, 0x00000005fee00000)
>> >>>   eden space 52480K,  25% used [0x00000005fae00000, 0x00000005fbaff8c8, 0x00000005fe140000)
>> >>>   from space 6528K,  33% used [0x00000005fe140000, 0x00000005fe362b70, 0x00000005fe7a0000)
>> >>>   to   space 6528K,   0% used [0x00000005fe7a0000, 0x00000005fe7a0000, 0x00000005fee00000)
>> >>>  concurrent mark-sweep generation total 1093696K, used 319887K [0x00000005fee00000, 0x0000000641a10000, 0x00000007fae00000)
>> >>>  concurrent-mark-sweep perm gen total 30464K, used 30427K [0x00000007fae00000, 0x00000007fcbc0000, 0x0000000800000000)
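>> >>>
>> >>> For anyone reading along, a quick note on the standard ParNew line
>> >>> format above: in an entry like
>> >>>
>> >>>   2716.635: [GC [ParNew: 57570K->746K(59008K), 0.0046530 secs]
>> >>>   354857K->300891K(1152704K), 0.0047370 secs]
>> >>>
>> >>> the first triple is young-gen occupancy before -> after (capacity),
>> >>> the second is the whole heap, at 2716.635s of JVM uptime. So the young
>> >>> gen here is only ~58MB, which matches the earlier point that -Xmn is
>> >>> too small.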
>> >>>
>> >>>
>> >>>
>> >>> On 03/08/2013 07:02 PM, Stack wrote:
>> >>>
>> >>>  On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <pa...@psafe.com> wrote:
>> >>>>
>> >>>>>>   0.94 currently doesn't support hadoop 2.0
>> >>>>>>   Can you deploy hadoop 1.1.1 instead ?
>> >>>>>
>> >>>>> I am using cdh4.2.0, which uses this version as the default
>> >>>>> installation. I think it will be a problem for me to deploy 1.1.1,
>> >>>>> because I would need to "upgrade" the whole cluster with 70TB of
>> >>>>> data (back up everything, go offline, etc.).
>> >>>>>
>> >>>>> Is there a problem with using cdh4.2.0?
>> >>>>> Should I send my email to the cdh list?
>> >>>>>
>> >>>>>
>> >>>>>   That combo should be fine.
>> >>>>>
>> >>>>
>> >>>>     You Full GC'ing around this time?
>> >>>>
>> >>>>> The GC log shows it took a long time. However, it does not make any
>> >>>>> sense for GC to be the cause, since the same amount of data was
>> >>>>> cleaned before and AFTER in just 0.01 secs!
>> >>>>>
>> >>>>>
>> >>>>>   If JVM is full GC'ing, the application is stopped.
>> >>>>>
>> >>>>
>> >>>>>   [Times: user=0.08 sys=137.62, real=137.62 secs]
>> >>>>>
>> >>>>> Besides, the whole time was spent in system. That is what is
>> >>>>> bugging me.
>> >>>>>
>> >>>>>
>> >>>>> The below does not look like a full GC but that is a long pause in
>> >>>>> system time, enough to kill your zk session.
>> >>>>
>> >>>> You swapping?
>> >>>>
>> >>>> Hardware is good?
>> >>>>
>> >>>> St.Ack
>> >>>>
>> >>>>
>> >>>>
>> >>>>     ...
>> >>>>
>> >>>>>
>> >>>>> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>> >>>>>
>> >>>>> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]
>> >>>>>
>> >>>>> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]
>> >>>>>
>> >>>>> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>> >>>>>
>> >>>>> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>> >>>>>
>> >>>>>
>> >>>>> I really appreciate you guys helping me to find out what is wrong.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Pablo
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On 03/08/2013 02:11 PM, Stack wrote:
>> >>>>>
>> >>>>>   What RAM says.
>> >>>>>
>> >>>>>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 159348ms for sessionid 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>> >>>>>>
>> >>>>>> You Full GC'ing around this time?
>> >>>>>>
>> >>>>>> Put up your configs in a place where we can take a look?
>> >>>>>>
>> >>>>>> St.Ack
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <
>> >>>>>> ramkrishna.s.vasudevan@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> I think it is with your GC config. What is your heap size? What
>> >>>>>>> is the data that you pump in, and how much is the block cache
>> >>>>>>> size?
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>> Ram
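>> >>>>>>>
>> >>>>>>> (For the last question: the block cache fraction is set by
>> >>>>>>> hfile.block.cache.size in hbase-site.xml; 0.25 of the heap is the
>> >>>>>>> stock default, so check what your config overrides it to, e.g.:
>> >>>>>>>
>> >>>>>>>   <property>
>> >>>>>>>     <name>hfile.block.cache.size</name>
>> >>>>>>>     <value>0.25</value>
>> >>>>>>>   </property>  )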
>> >>>>>>>
>> >>>>>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhih...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>>    0.94 currently doesn't support hadoop 2.0
>> >>>>>>>
>> >>>>>>>  Can you deploy hadoop 1.1.1 instead ?
>> >>>>>>>>
>> >>>>>>>> Are you using 0.94.5 ?
>> >>>>>>>>
>> >>>>>>>> Thanks
>> >>>>>>>>
>> >>>>>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pa...@psafe.com>
>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hey guys,
>> >>>>>>>>>
>> >>>>>>>>> as I sent in an email a long time ago, the RSs in my cluster
>> >>>>>>>>> did not get along and crashed 3 times a day. I tried a lot of
>> >>>>>>>>> the options we discussed in the emails, but none of them solved
>> >>>>>>>>> the problem. As I was using an old version of hadoop, I thought
>> >>>>>>>>> that was the problem. So, I upgraded from hadoop 0.20 - hbase
>> >>>>>>>>> 0.90 - zookeeper 3.3.5 to hadoop 2.0.0 - hbase 0.94 - zookeeper
>> >>>>>>>>> 3.4.5.
>> >>>>>>>>>
>> >>>>>>>>> Unfortunately the RSs did not stop crashing, and worse! Now
>> >>>>>>>>> they crash every hour, and sometimes, when the RS that holds
>> >>>>>>>>> the .ROOT. crashes, the whole cluster gets stuck in transition
>> >>>>>>>>> and everything stops working. In this case I need to clean the
>> >>>>>>>>> zookeeper znodes and restart the master and the RSs.
>> >>>>>>>>>
>> >>>>>>>>> To avoid this case I am running in production with only ONE RS
>> >>>>>>>>> and a monitoring script that checks every minute whether the RS
>> >>>>>>>>> is ok and, if not, restarts it (a sketch is below).
>> >>>>>>>>> * This case does not get the cluster stuck.
>> >>>>>>>>>
>> >>>>>>>>> This is driving me crazy, but I really can't find a solution
>> >>>>>>>>> for the cluster. I tracked all logs from the start time 16:49
>> >>>>>>>>> on all the interesting nodes (zoo, namenode, master, rs, dn2,
>> >>>>>>>>> dn9, dn10) and copied here what I think is useful.
>> >>>>>>>>>
>> >>>>>>>>> There are some strange errors in DATANODE2, such as an error
>> >>>>>>>>> copying a block to itself.
>> >>>>>>>>>
>> >>>>>>>>> The gc log points to a GC timeout. However, it is very weird
>> >>>>>>>>> that the RS spends so much time in GC while in the other cases
>> >>>>>>>>> it takes 0.001 sec. Besides, the time spent is in sys, which
>> >>>>>>>>> makes me think the problem might be somewhere else.
>> >>>>>>>>>
>> >>>>>>>>> I know that this is a bunch of logs, and that it is very
>> >>>>>>>>> difficult to find the problem without much context. But I
>> >>>>>>>>> REALLY need some help. If not the solution, then at least what
>> >>>>>>>>> I should read, where I should look, or which cases I should
>> >>>>>>>>> monitor.
>> >>>>>>>>>
>> >>>>>>>>> Thank you very much,
>> >>>>>>>>> Pablo Musa
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >> --
>> >> Regards,
>> >> - Sreepathi
>> >>
>> >
>> >
>>
>
>
