Do you have the Resource Manager on a dedicated node? (Without a container or Node Manager.)
2017-02-13 17:38 GMT+01:00 不清 <[email protected]>:

> I checked the configuration in CM:
>
> Java Heap Size of ResourceManager in Bytes = 1536 MiB
> Container Memory Minimum = 1 GiB
> Container Memory Increment = 512 MiB
> Container Memory Maximum = 8 GiB
>
> ------------------ Original Message ------------------
> *From:* "Alberto Ramón" <[email protected]>
> *Sent:* Tuesday, 14 February 2017, 00:34
> *To:* "user" <[email protected]>
> *Subject:* Re: kylin job stop accidentally and can resume success!
>
> Check this
> <https://www.mapr.com/blog/best-practices-yarn-resource-management>:
> "Basically, it means RM can only allocate memory to containers in
> increments of . . ."
>
> TIP: Is your RM on a worker node? If so, that can be the problem.
> (It is a good idea to put the YARN master, the RM, on a dedicated node.)
>
> 2017-02-13 17:19 GMT+01:00 不清 <[email protected]>:
>
>> How can I get this heap size?
>>
>> ------------------ Original Message ------------------
>> *From:* "Alberto Ramón" <[email protected]>
>> *Sent:* Tuesday, 14 February 2017, 00:17
>> *To:* "user" <[email protected]>
>> *Subject:* Re: kylin job stop accidentally and can resume success!
>>
>> This sounds like a problem with the YARN Resource Manager (RM): Kylin
>> loses connectivity with the RM. Check the heap size for the RM.
>>
>> 2017-02-13 17:00 GMT+01:00 不清 <[email protected]>:
>>
>>> Hello, Kylin community!
>>>
>>> Sometimes my jobs stop unexpectedly. They can stop at any step.
>>>
>>> The Kylin log looks like this:
>>>
>>> 2017-02-13 23:27:01,549 DEBUG [pool-8-thread-8]
>>> hbase.HBaseResourceStore:262 : Update row
>>> /execute_output/48dee96e-10fd-472b-b466-39505b6e57c0-02
>>> from oldTs: 1486999611524, to newTs: 1486999621545, operation result: true
>>> 2017-02-13 23:27:13,384 INFO [pool-8-thread-8] ipc.Client:842 :
>>> Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504.
>>> Already tried 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
>>> sleepTime=1000 MILLISECONDS)
>>> 2017-02-13 23:27:14,387 INFO [pool-8-thread-8] ipc.Client:842 :
>>> Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504.
>>> Already tried 1 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
>>> sleepTime=1000 MILLISECONDS)
>>> 2017-02-13 23:27:15,388 INFO [pool-8-thread-8] ipc.Client:842 :
>>> Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504.
>>> Already tried 2 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
>>> sleepTime=1000 MILLISECONDS)
>>> 2017-02-13 23:27:15,495 INFO [pool-8-thread-8]
>>> mapred.ClientServiceDelegate:273 : Application state is completed.
>>> FinalApplicationStatus=KILLED. Redirecting to job history server
>>> 2017-02-13 23:27:15,539 DEBUG [pool-8-thread-8] dao.ExecutableDao:210 :
>>> updating job output, id: 48dee96e-10fd-472b-b466-39505b6e57c0-02
>>>
>>> The CM log looks like this:
>>>
>>> Job Name: Kylin_Cube_Builder_user_all_cube_2_only_msisdn
>>> User Name: tmn
>>> Queue: root.tmn
>>> State: KILLED
>>> Uberized: false
>>> Submitted: Sun Feb 12 19:19:24 CST 2017
>>> Started: Sun Feb 12 19:19:38 CST 2017
>>> Finished: Sun Feb 12 20:30:13 CST 2017
>>> Elapsed: 1hrs, 10mins, 35sec
>>> Diagnostics:
>>> Kill job job_1486825738076_4205 received from tmn (auth:SIMPLE) at
>>> 10.180.212.38
>>> Job received Kill while in RUNNING state.
>>> Average Map Time 24mins, 48sec
>>>
>>> The MapReduce job log:
>>>
>>> Task KILL is received. Killing attempt!
>>>
>>> And when this happens, the job can be resumed successfully! I mean, it
>>> did not stop because of an error!
>>>
>>> What is the problem?
>>>
>>> My Hadoop cluster is very busy, so this situation happens very often.
>>>
>>> Can I set the retry count and the retry interval?
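[Editor's note] On the final question: yes, the Hadoop IPC client's connection retries are configurable. A sketch of the standard knobs follows; which of them governs the particular `RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)` loop in the log depends on the Hadoop version and distribution, so treat this as a starting point to verify, not a definitive fix:

```xml
<!-- core-site.xml: low-level IPC connection retries -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value> <!-- number of connection attempts per server -->
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value> <!-- milliseconds between connection attempts -->
</property>

<!-- yarn-site.xml: how long YARN clients keep trying to reach the RM -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>900000</value>
</property>
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>30000</value>
</property>
```

Raising these makes clients more tolerant of a busy or briefly unreachable RM, at the cost of jobs taking longer to notice a genuinely dead server.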
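[Editor's note] The Container Memory Minimum / Increment / Maximum values quoted in the thread correspond to the Fair Scheduler's minimum-allocation, increment-allocation, and maximum-allocation settings that the MapR article discusses. A minimal sketch of the rounding behaviour, assuming the usual rule (round the request up to the minimum, then up to the next increment multiple, capped at the maximum) and using the CM values from the thread as defaults; the exact rule should be verified against your scheduler version:

```python
import math

def allocated_mb(request_mb, minimum_mb=1024, increment_mb=512, maximum_mb=8192):
    """Approximate how YARN rounds a container memory request (in MiB).

    Assumed rule, per the MapR article quoted in the thread; the defaults
    mirror the CM settings quoted above (min 1 GiB, increment 512 MiB,
    max 8 GiB).
    """
    # Requests below the minimum are raised to the minimum.
    size = max(request_mb, minimum_mb)
    # The RM only allocates in whole increments, so round up.
    size = math.ceil(size / increment_mb) * increment_mb
    # Never exceed the configured maximum.
    return min(size, maximum_mb)

print(allocated_mb(1200))  # a 1200 MiB request is granted 1536 MiB
```

Under these settings, a busy cluster can hand out noticeably more memory per container than jobs actually request, which adds to the memory pressure described in the thread.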
