Check this: <https://www.mapr.com/blog/best-practices-yarn-resource-management> — "Basically, it means RM can only allocate memory to containers in increments of . . . "
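The increment the quoted MapR post refers to is controlled by the scheduler's minimum allocation: the RM rounds every container request up to a multiple of it. A sketch of the relevant `yarn-site.xml` settings (the values here are illustrative, not recommendations):

```xml
<!-- yarn-site.xml: illustrative values only -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>  <!-- RM allocates container memory in multiples of this -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>  <!-- upper bound for a single container request -->
</property>
```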
TIP: Is your RM on a worker node? If so, that can be the problem. (It is a good idea to put the YARN master, the RM, on a dedicated node.)

2017-02-13 17:19 GMT+01:00 不清 <452652...@qq.com>:
> How can I get this heap size?
>
> ------------------ Original Message ------------------
> *From:* "Alberto Ramón" <a.ramonporto...@gmail.com>
> *Sent:* Tuesday, February 14, 2017, 00:17
> *To:* "user" <user@kylin.apache.org>
> *Subject:* Re: kylin job stop accidentally and can resume success!
>
> Sounds like a problem with YARN's Resource Manager (RM); check the heap
> size of the RM. Kylin loses connectivity with the RM.
>
> 2017-02-13 17:00 GMT+01:00 不清 <452652...@qq.com>:
>
>> Hello, Kylin community!
>>
>> Sometimes my jobs stop accidentally. They can stop at any step.
>>
>> The Kylin log looks like this:
>>
>> 2017-02-13 23:27:01,549 DEBUG [pool-8-thread-8] hbase.HBaseResourceStore:262 : Update row /execute_output/48dee96e-10fd-472b-b466-39505b6e57c0-02 from oldTs: 1486999611524, to newTs: 1486999621545, operation result: true
>> 2017-02-13 23:27:13,384 INFO [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
>> 2017-02-13 23:27:14,387 INFO [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
>> 2017-02-13 23:27:15,388 INFO [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
>> 2017-02-13 23:27:15,495 INFO [pool-8-thread-8] mapred.ClientServiceDelegate:273 : Application state is completed. FinalApplicationStatus=KILLED. Redirecting to job history server
>> 2017-02-13 23:27:15,539 DEBUG [pool-8-thread-8] dao.ExecutableDao:210 : updating job output, id: 48dee96e-10fd-472b-b466-39505b6e57c0-02
>>
>> The CM log looks like this:
>>
>> Job Name: Kylin_Cube_Builder_user_all_cube_2_only_msisdn
>> User Name: tmn
>> Queue: root.tmn
>> State: KILLED
>> Uberized: false
>> Submitted: Sun Feb 12 19:19:24 CST 2017
>> Started: Sun Feb 12 19:19:38 CST 2017
>> Finished: Sun Feb 12 20:30:13 CST 2017
>> Elapsed: 1hrs, 10mins, 35sec
>> Diagnostics: Kill job job_1486825738076_4205 received from tmn (auth:SIMPLE) at 10.180.212.38
>> Job received Kill while in RUNNING state.
>> Average Map Time: 24mins, 48sec
>>
>> The MapReduce job log:
>>
>> Task KILL is received. Killing attempt!
>>
>> And when this happens, resuming the job succeeds! I mean, it does not stop because of an error!
>>
>> What is the problem?
>>
>> My Hadoop cluster is very busy, and this situation happens very often.
>>
>> Can I set the retry count and retry interval?
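On the question "how can I get this heap size?": in a plain Apache Hadoop install you can inspect the running RM's `-Xmx` flag, and raise the heap in `yarn-env.sh` (on a CM/CDH cluster the same setting lives in the Cloudera Manager UI instead). A sketch, with illustrative values; the `jmap`/`jps` commands assume a JDK is installed on the RM host:

```shell
# Inspect the current RM heap: find the process and its -Xmx flag
ps -ef | grep -i resourcemanager | grep -o '\-Xmx[^ ]*'
# Or look at live heap usage (requires a JDK on the host)
jmap -heap $(jps | awk '/ResourceManager/ {print $1}')

# Raise the heap in yarn-env.sh — 4096 MB is illustrative, not a recommendation
export YARN_RESOURCEMANAGER_HEAPSIZE=4096
```

After changing `yarn-env.sh`, the ResourceManager has to be restarted for the new heap to take effect.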
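On the closing question about retry count and interval: the three-attempt policy visible in the log (`RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)`) is a Hadoop client-side policy, and the knobs below are the usual ones to loosen on a busy cluster. This is a sketch with illustrative values; which property backs that exact log line depends on your Hadoop version, so check your version's defaults before changing anything:

```xml
<!-- core-site.xml (ipc.*) and yarn-site.xml (yarn.*) — illustrative values -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value>  <!-- RPC connect attempts before giving up -->
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value>  <!-- milliseconds between connect attempts -->
</property>
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>900000</value>  <!-- how long clients keep trying to reach the RM -->
</property>
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>30000</value>
</property>
```

Separately, if your Kylin version supports it, a `kylin.job.retry` setting in `kylin.properties` lets Kylin itself retry a failed step automatically instead of waiting for a manual resume.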