I wish that were the case. I have another production cluster also running cdh3u4, but this doesn't happen there.
On Wed, Feb 6, 2013 at 6:12 PM, java8964 java8964 <[email protected]> wrote:
> Our cluster on cdh3u4 has the same problem. I think it is caused by some
> bugs in the JobTracker. I believe Cloudera knows about this issue.
>
> After upgrading to cdh3u5, we haven't hit this issue yet, but I am not sure
> whether it is confirmed to be fixed in CDH3U5.
>
> Yong
>
>> Date: Mon, 4 Feb 2013 15:21:18 -0800
>> Subject: What to do/check/debug/root cause analysis when jobtracker hang
>> From: [email protected]
>> To: [email protected]
>>
>> Lately, the jobtracker in one of our production clusters falls into a hang state.
>> The 5, 10 and 15 minute load averages are around 1;
>> in top, the jobtracker is at 100% CPU all the time.
>>
>> So I went ahead and tried top -H -p jobtracker_pid, and I always see one
>> thread at 100% CPU all the time.
>>
>> Unless we restart the jobtracker, the hang state never goes away.
>>
>> I found OOM errors in the jobtracker log file during the hang state.
>>
>> How can I tell what is really going on in the one and only
>> thread that has 100% CPU?
>>
>> How can I prove whether we run out of memory because of the number of jobs
>> _OR_
>> because of a memory leak on the application side?
>>
>>
>> I tried dumping threads with jstack and with http://jobtracker:50030/stacks,
>>
>> but I don't know what I should really look for in the output of those commands.
>>
>> The cluster is cdh3u4 on CentOS 6.2, with transparent_hugepage disabled.
>>
>>
>>
>> Hopefully this makes sense,
>> -P
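One common way to connect the two tools mentioned above (this is a general JVM technique, not something confirmed in the thread): top -H -p <jobtracker_pid> prints the busy thread's ID in decimal, while a jstack dump labels each thread with the same ID in hex as nid=0x.... Converting the ID and finding the matching entry shows the stack of the thread that is spinning. Below is a minimal Python sketch, assuming the jstack output was saved to a file and uses HotSpot's usual layout with a blank line between thread entries; the script name find_hot_thread.py and its functions are hypothetical.

```python
import re
import sys
from typing import Optional


def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread ID (as shown by `top -H -p <pid>`) to the
    lowercase hex form jstack prints in its `nid=0x...` field."""
    return hex(tid)


def find_thread(dump: str, tid: int) -> Optional[str]:
    """Return the jstack entry whose nid matches tid, or None if absent.

    Assumes entries are separated by blank lines, as in HotSpot's default
    jstack output."""
    pattern = re.compile(r"\bnid=" + re.escape(tid_to_nid(tid)) + r"\b")
    for block in dump.split("\n\n"):
        if pattern.search(block):
            return block.strip()
    return None


def main(argv):
    # Hypothetical usage: find_hot_thread.py <jstack-dump-file> <decimal-tid>
    if len(argv) != 3:
        print("usage: find_hot_thread.py <jstack-dump-file> <decimal-tid>")
        return
    path, tid = argv[1], int(argv[2])
    with open(path) as f:
        block = find_thread(f.read(), tid)
    print(block if block is not None else "no thread with nid=" + tid_to_nid(tid))


if __name__ == "__main__":
    main(sys.argv)
```

For example, save jstack <jobtracker_pid> > dump.txt, note the hot TID from top, and run python find_hot_thread.py dump.txt <tid>. Taking a few dumps several seconds apart helps show whether the thread is stuck in one place. If the hot thread turns out to be a JVM GC thread rather than a JobTracker thread, the process may be spending its time in repeated full collections, which would be consistent with the OOM errors in the log; jstat -gcutil <jobtracker_pid> 1000 can help confirm that.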
