Thanks harsh, Appreciate your input, as always. On Aug 12, 2013, at 20:01, Harsh J <[email protected]> wrote:
> If you're not already doing it, run a local name caching daemon (such > as ncsd, etc.) on each cluster node. Hadoop does a lot of lookups and > a local cache would do good in reducing the load on your DNS. > > On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum > <[email protected]> wrote: >> Update, after adjust the network routing, dns query speed is in micro sec as >> suppose to be. the issue is completely solve. >> Jobtracker page doesn't hang anymore when launch 100k mappers job.. >> >> Cheers, >> >> >> >> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum >> <[email protected]> wrote: >>> >>> Ok, after some sweat, i think I found the root cause but still need >>> another team to help me fix it. >>> It lines on the DNS. Somehow each of the tip:task line, through the >>> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me. >>> >>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress: >>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1 >>> >>> 127 ms >>> >>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress: >>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2 >>> >>> 126 ms >>> >>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress: >>> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3 >>> >>> >>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37) >>> >>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37) >>> >>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89) >>> >>> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189) >>> >>> >>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37) >>> >>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37) >>> >>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy >>> (189) >>> >>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89) >>> >>> >>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37) >>> >>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37) >>> >>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz >>> (189) >>> >>> I looked at the jobtracker log in other datacenter when submitted with the >>> same query. Timestamp in each tip:task line is much much faster. >>> >>> The question that raise here is the job initialization is really request >>> the DNS, if so is there any way to suppress that. topology file is already >>> in place where name and ip are already there. >>> >>> >>> Hope this make sense >>> >>> Patai >>> >>> >>> >>> >>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum >>> <[email protected]> wrote: >>>> >>>> Appreciate your input Bryant, i will try to reproduce and see the >>>> namenode log before, while, and after it pause. >>>> Wish me luck >>>> >>>> >>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault >>>> <[email protected]> wrote: >>>>> >>>>> When I've had problems with a slow jobtracker, i've found the issue to >>>>> be one of the following two (so far) possibilities: >>>>> >>>>> - long GC pause (I'm guessing this is not it based on your email) >>>>> - hdfs is slow >>>>> >>>>> I haven't dived into the code yet, but circumstantially I've found that >>>>> when you submit a job the jobtracker needs to put a bunch of files in >>>>> hdfs, >>>>> such as the job.xml, the job jar, etc. I'm not sure how this scales with >>>>> larger and larger jobs, aside form the size of the splits serialization in >>>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses >>>>> in >>>>> your jobtracker. This affects other jobs being able to submit, as well as >>>>> the 50030 web ui. >>>>> >>>>> I'd take a look at your namenode logs. When the jobtracker logs pause, >>>>> do you see a corresponding pause in the namenode logs? What gets spewed >>>>> before and after that pause? >>>>> >>>>> >>>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum >>>>> <[email protected]> wrote: >>>>>> >>>>>> A while back, i was fighting with the jobtracker page hangs when i >>>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as usual >>>>>> which ends up because of allowing too much job history kept in >>>>>> jobtracker. >>>>>> >>>>>> Currently, i am setting up a new cluster 40g heap on the namenode and >>>>>> jobtracker in dedicated machines. Not fun part starts here; a developer >>>>>> tried to test out the cluster by launching a 76k map job (the cluster has >>>>>> around 6k-ish mappers) >>>>>> Job initialization was success, and finished the job. >>>>>> >>>>>> However, before the job is actually running, i can't access to the >>>>>> jobtracker page anymore same symptom as above. >>>>>> >>>>>> i see bunch of this in jobtracker log >>>>>> >>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: >>>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node >>>>>> .. >>>>>> .. >>>>>> .. >>>>>> >>>>>> Until i see this >>>>>> >>>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619 >>>>>> LOCALITY_WAIT_FACTOR=1.0 >>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: >>>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks >>>>>> and >>>>>> 10 reduce tasks. >>>>>> >>>>>> that's when i can access to the jobtracker page again. >>>>>> >>>>>> >>>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like >>>>>> 1ish gig from 40 >>>>>> network bandwidth is far from filled up. >>>>>> >>>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime >>>>>> Environment (build 1.6.0_32-b05) >>>>>> >>>>>> >>>>>> What would be the root cause i should looking at or at least where to >>>>>> start? >>>>>> >>>>>> Thanks you in advanced > > > > -- > Harsh J
