Re: What to do/check/debug/root cause analysis when jobtracker hang

Patai Sangbutsarakum Wed, 06 Feb 2013 20:24:01 -0800

I wish that were the case. I have another production cluster also running cdh3u4,
but it doesn't happen there.


On Wed, Feb 6, 2013 at 6:12 PM, java8964 java8964 <[email protected]> wrote:
> Our cluster on cdh3u4 has the same problem. I think it is caused by bugs
> in the JobTracker. I believe Cloudera knows about this issue.
>
> After upgrading to cdh3u5, we haven't faced this issue yet, but I am not sure
> whether it is confirmed to be fixed in CDH3U5.
>
> Yong
>
>> Date: Mon, 4 Feb 2013 15:21:18 -0800
>> Subject: What to do/check/debug/root cause analysis when jobtracker hang
>> From: [email protected]
>> To: [email protected]
>
>>
>> Lately, the JobTracker in one of our production clusters has been falling into a hung state.
>> The 5/10/15-minute load averages are around 1;
>> with the top command, the JobTracker shows 100% CPU all the time.
>>
>> So I went ahead and tried top -H -p jobtracker_pid, and I always see one
>> thread that is at 100% CPU all the time.
>>
>> Unless we restart the JobTracker, the hang state never goes away.
>>
>> I found an OOM (OutOfMemoryError) in the JobTracker log file during the hang state.
>>
>> How can I find out what is really going on in the one and only
>> thread that is at 100% CPU?
>>
>> How can I prove whether we run out of memory because of the number of jobs
>> _OR_
>> because there is a memory leak on the application side?
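One way to approach the load-vs-leak question is to compare class histograms of the JobTracker heap taken some time apart. For example, you could save the output of jmap -histo:live jobtracker_pid to two files a while apart. Note that :live forces a full GC first, which will pause an already struggling JobTracker.

How to read the comparison:

* If the number of running and retained jobs is about the same in both dumps but some class keeps growing, that points toward a leak.
* If the heap simply grows with the number of jobs, that points toward load.

Below is a minimal sketch that assumes the usual "num: #instances #bytes class-name" row layout of jmap -histo. The names parse_histo and growth are hypothetical helpers, not part of Hadoop or the JDK.

```python
import re

# Matches data rows of `jmap -histo <pid>` output, e.g.
#    1:       1234567       98765432  [C
# Header, separator and "Total" lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text):
    """Map class name -> (instances, bytes) for one jmap -histo dump."""
    histo = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            histo[m.group(3)] = (int(m.group(1)), int(m.group(2)))
    return histo


def growth(before_text, after_text, top=10):
    """Return (byte_delta, instance_delta, class) for the classes that grew most."""
    before, after = parse_histo(before_text), parse_histo(after_text)
    deltas = []
    for cls, (inst, nbytes) in after.items():
        # Classes absent from the first dump are treated as starting at zero.
        old_inst, old_bytes = before.get(cls, (0, 0))
        deltas.append((nbytes - old_bytes, inst - old_inst, cls))
    deltas.sort(reverse=True)  # largest byte growth first
    return deltas[:top]
```

For example, growth(open("histo1.txt").read(), open("histo2.txt").read()) lists the ten classes whose retained bytes grew the most. JobTracker classes such as org.apache.hadoop.mapred.JobInProgress near the top, while the job count stays flat, would be worth a closer look.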
>>
>>
>> I tried jstack to get a thread dump, and also http://jobtracker:50030/stacks,
>>
>> but I just don't know what I should really look for in the output of those commands.
>>
>> The cluster is cdh3u4 on CentOS 6.2, with transparent_hugepage disabled.
>>
>>
>>
>> Hopefully this makes sense,
>> -P
