Re: Jobtracker page hangs ..again.

Harsh J Mon, 12 Aug 2013 20:03:44 -0700

If you're not already doing it, run a local name caching daemon (such
as ncsd, etc.) on each cluster node. Hadoop does a lot of lookups and
a local cache would do good in reducing the load on your DNS.


On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum
<[email protected]> wrote:
> Update, after adjust the network routing, dns query speed is in micro sec as
> suppose to be. the issue is completely solve.
> Jobtracker page doesn't hang anymore when launch 100k mappers job..
>
> Cheers,
>
>
>
> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum
> <[email protected]> wrote:
>>
>> Ok, after some sweat, i think I found the root cause but still need
>> another team to help me fix it.
>> It lines on the DNS.  Somehow each of the tip:task line, through the
>> tcpdump, i saw the dns query to dns server. Timestamp seems matches to me.
>>
>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>>
>> 127 ms
>>
>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>>
>> 126 ms
>>
>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
>> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>>
>>
>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>>
>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>>
>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>>
>> 20:39:23.493505 IP dns1.domain > : jtk.41717* 1/4/2 A xx.xx.xx.xx (189)
>>
>>
>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>>
>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>>
>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy
>> (189)
>>
>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>>
>>
>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>>
>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>>
>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz
>> (189)
>>
>> I looked at the jobtracker log in other datacenter when submitted with the
>> same query. Timestamp in each tip:task line is much much faster.
>>
>> The question that raise here is the job initialization is really request
>> the DNS, if so is there any way to suppress that. topology file is already
>> in place where name and ip are already there.
>>
>>
>> Hope this make sense
>>
>> Patai
>>
>>
>>
>>
>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum
>> <[email protected]> wrote:
>>>
>>> Appreciate your input Bryant, i will try to reproduce and see the
>>> namenode log before, while, and after it pause.
>>> Wish me luck
>>>
>>>
>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
>>> <[email protected]> wrote:
>>>>
>>>> When I've had problems with a slow jobtracker, i've found the issue to
>>>> be one of the following two (so far) possibilities:
>>>>
>>>> - long GC pause (I'm guessing this is not it based on your email)
>>>> - hdfs is slow
>>>>
>>>> I haven't dived into the code yet, but circumstantially I've found that
>>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>>> such as the job.xml, the job jar, etc.  I'm not sure how this scales with
>>>> larger and larger jobs, aside form the size of the splits serialization in
>>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>>> your jobtracker.  This affects other jobs being able to submit, as well as
>>>> the 50030 web ui.
>>>>
>>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>>> before and after that pause?
>>>>
>>>>
>>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum
>>>> <[email protected]> wrote:
>>>>>
>>>>> A while back, i was fighting with the jobtracker page hangs when i
>>>>> browse to http://jobtracker:50030 browser doesn't show jobs info as usual
>>>>> which ends up because of allowing too much job history kept in jobtracker.
>>>>>
>>>>> Currently, i am setting up a new cluster 40g heap on the namenode and
>>>>> jobtracker in dedicated machines. Not fun part starts here; a developer
>>>>> tried to test out the cluster by launching a 76k map job (the cluster has
>>>>> around 6k-ish mappers)
>>>>> Job initialization was success, and finished the job.
>>>>>
>>>>> However, before the job is actually running, i can't access to the
>>>>> jobtracker page anymore same symptom as above.
>>>>>
>>>>> i see bunch of this in jobtracker log
>>>>>
>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>>> ..
>>>>> ..
>>>>> ..
>>>>>
>>>>> Until i see this
>>>>>
>>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619
>>>>> LOCALITY_WAIT_FACTOR=1.0
>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>> Job job_201307291733_0619 initialized successfully with 76797 map tasks 
>>>>> and
>>>>> 10 reduce tasks.
>>>>>
>>>>> that's when i can access to the jobtracker page again.
>>>>>
>>>>>
>>>>> CPU on jobtracker is very little load, JTK's Heap is far from full like
>>>>> 1ish gig from 40
>>>>> network bandwidth is far from filled up.
>>>>>
>>>>> I'm running on 0.20.2 branch on CentOS6.4 with Java(TM) SE Runtime
>>>>> Environment (build 1.6.0_32-b05)
>>>>>
>>>>>
>>>>> What would be the root cause i should looking at or at least where to
>>>>> start?
>>>>>
>>>>> Thanks you in advanced
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Harsh J

Re: Jobtracker page hangs ..again.

Reply via email to