Do you have the Resource Manager on a dedicated node? (Without a container or Node Manager.)
2017-02-13 17:38 GMT+01:00 不清 <[email protected]>:

> I checked the configuration in CM:
>
> Java Heap Size of ResourceManager in Bytes = 1536 MiB
> Container Memory Minimum = 1 GiB
> Container Memory Increment = 512 MiB
> Container Memory Maximum = 8 GiB
>
> ------------------ Original Message ------------------
> *From:* "Alberto Ramón" <[email protected]>
> *Sent:* Tuesday, 14 February 2017, 00:34
> *To:* "user" <[email protected]>
> *Subject:* Re: kylin job stop accidentally and can resume success!
>
> Check this
> <https://www.mapr.com/blog/best-practices-yarn-resource-management>:
> "Basically, it means RM can only allocate memory to containers in
> increments of . . ."
>
> TIP: Is your RM on a worker node? If so, that can be the problem.
> (It is a good idea to put the YARN master, the RM, on a dedicated node.)
>
> 2017-02-13 17:19 GMT+01:00 不清 <[email protected]>:
>
>> How can I get this heap size?
>>
>> ------------------ Original Message ------------------
>> *From:* "Alberto Ramón" <[email protected]>
>> *Sent:* Tuesday, 14 February 2017, 00:17
>> *To:* "user" <[email protected]>
>> *Subject:* Re: kylin job stop accidentally and can resume success!
>>
>> This sounds like a problem with the YARN Resource Manager (RM): Kylin
>> loses connectivity with the RM. Check the heap size for the RM.
>>
>> 2017-02-13 17:00 GMT+01:00 不清 <[email protected]>:
>>
>>> Hello, Kylin community!
>>>
>>> Sometimes my jobs stop unexpectedly. They can stop at any step.
>>>
>>> The Kylin log looks like this:
>>>
>>> 2017-02-13 23:27:01,549 DEBUG [pool-8-thread-8]
>>> hbase.HBaseResourceStore:262 : Update row
>>> /execute_output/48dee96e-10fd-472b-b466-39505b6e57c0-02
>>> from oldTs: 1486999611524, to newTs: 1486999621545, operation result: true
>>> 2017-02-13 23:27:13,384 INFO [pool-8-thread-8] ipc.Client:842 :
>>> Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504.
>>> Already tried 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
>>> sleepTime=1000 MILLISECONDS)
>>> 2017-02-13 23:27:14,387 INFO [pool-8-thread-8] ipc.Client:842 :
>>> Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504.
>>> Already tried 1 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
>>> sleepTime=1000 MILLISECONDS)
>>> 2017-02-13 23:27:15,388 INFO [pool-8-thread-8] ipc.Client:842 :
>>> Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504.
>>> Already tried 2 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
>>> sleepTime=1000 MILLISECONDS)
>>> 2017-02-13 23:27:15,495 INFO [pool-8-thread-8]
>>> mapred.ClientServiceDelegate:273 : Application state is completed.
>>> FinalApplicationStatus=KILLED. Redirecting to job history server
>>> 2017-02-13 23:27:15,539 DEBUG [pool-8-thread-8] dao.ExecutableDao:210 :
>>> updating job output, id: 48dee96e-10fd-472b-b466-39505b6e57c0-02
>>>
>>> The CM log looks like this:
>>>
>>> Job Name: Kylin_Cube_Builder_user_all_cube_2_only_msisdn
>>> User Name: tmn
>>> Queue: root.tmn
>>> State: KILLED
>>> Uberized: false
>>> Submitted: Sun Feb 12 19:19:24 CST 2017
>>> Started: Sun Feb 12 19:19:38 CST 2017
>>> Finished: Sun Feb 12 20:30:13 CST 2017
>>> Elapsed: 1hrs, 10mins, 35sec
>>> Diagnostics:
>>> Kill job job_1486825738076_4205 received from tmn (auth:SIMPLE) at
>>> 10.180.212.38
>>> Job received Kill while in RUNNING state.
>>> Average Map Time 24mins, 48sec
>>>
>>> The MapReduce job log:
>>>
>>> Task KILL is received. Killing attempt!
>>>
>>> And when this happens, the job can be resumed successfully! I mean, it
>>> did not stop because of an error!
>>>
>>> What is the problem?
>>>
>>> My Hadoop cluster is very busy, so this situation happens very often.
>>>
>>> Can I set the retry count and the retry interval?
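[Editor's note] On the final question: yes, the Hadoop IPC client's connection retries are configurable. A sketch of the standard knobs follows; which of them governs the particular `RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)` loop in the log depends on the Hadoop version and distribution, so treat this as a starting point to verify, not a definitive fix:

```xml
<!-- core-site.xml: low-level IPC connection retries -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>10</value> <!-- number of connection attempts per server -->
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value> <!-- milliseconds between connection attempts -->
</property>

<!-- yarn-site.xml: how long YARN clients keep trying to reach the RM -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>900000</value>
</property>
<property>
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>30000</value>
</property>
```

Raising these makes clients more tolerant of a busy or briefly unreachable RM, at the cost of jobs taking longer to notice a genuinely dead server.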
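[Editor's note] The Container Memory Minimum / Increment / Maximum values quoted in the thread correspond to the Fair Scheduler's minimum-allocation, increment-allocation, and maximum-allocation settings that the MapR article discusses. A minimal sketch of the rounding behaviour, assuming the usual rule (round the request up to the minimum, then up to the next increment multiple, capped at the maximum) and using the CM values from the thread as defaults; the exact rule should be verified against your scheduler version:

```python
import math

def allocated_mb(request_mb, minimum_mb=1024, increment_mb=512, maximum_mb=8192):
    """Approximate how YARN rounds a container memory request (in MiB).

    Assumed rule, per the MapR article quoted in the thread; the defaults
    mirror the CM settings quoted above (min 1 GiB, increment 512 MiB,
    max 8 GiB).
    """
    # Requests below the minimum are raised to the minimum.
    size = max(request_mb, minimum_mb)
    # The RM only allocates in whole increments, so round up.
    size = math.ceil(size / increment_mb) * increment_mb
    # Never exceed the configured maximum.
    return min(size, maximum_mb)

print(allocated_mb(1200))  # a 1200 MiB request is granted 1536 MiB
```

Under these settings, a busy cluster can hand out noticeably more memory per container than jobs actually request, which adds to the memory pressure described in the thread.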
