This sounds like a problem with the YARN ResourceManager (RM). Check the heap size of the RM; Kylin may be losing connectivity with it.
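To see how often this happens and which hosts are involved, the Kylin log can be scanned for the two message types that appear in the quoted excerpt. The following is only a minimal sketch: it assumes each log entry sits on a single line in the actual log file (the excerpt in the email is line-wrapped), and `summarize` is a hypothetical helper name, not part of Kylin or Hadoop.

```python
import re

# Patterns derived from the log excerpt quoted in this thread (assumed format).
RETRY_RE = re.compile(
    r"Retrying connect to server: (\S+?)\. Already tried (\d+) time"
)
STATUS_RE = re.compile(r"FinalApplicationStatus=(\w+)")


def summarize(lines):
    """Count connect retries per server and collect final application statuses."""
    retries = {}   # server -> number of "Retrying connect" messages seen
    statuses = []  # FinalApplicationStatus values in order of appearance
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            server = match.group(1)
            retries[server] = retries.get(server, 0) + 1
            continue
        match = STATUS_RE.search(line)
        if match:
            statuses.append(match.group(1))
    return retries, statuses
```

Calling `summarize(open(path))` on the Kylin log file (the path depends on the installation) returns a dictionary of retry counts per server and a list of final statuses, which can then be compared against the RM's GC or heap activity around the same timestamps.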

2017-02-13 17:00 GMT+01:00 不清 <452652...@qq.com>:

> Hello, Kylin community!
>
> Sometimes my jobs stop unexpectedly. This can happen at any step.
>
> The Kylin log looks like this:
> 2017-02-13 23:27:01,549 DEBUG [pool-8-thread-8] hbase.HBaseResourceStore:262 : Update row /execute_output/48dee96e-10fd-472b-b466-39505b6e57c0-02 from oldTs: 1486999611524, to newTs: 1486999621545, operation result: true
> 2017-02-13 23:27:13,384 INFO [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2017-02-13 23:27:14,387 INFO [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2017-02-13 23:27:15,388 INFO [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
> 2017-02-13 23:27:15,495 INFO [pool-8-thread-8] mapred.ClientServiceDelegate:273 : Application state is completed. FinalApplicationStatus=KILLED. Redirecting to job history server
> 2017-02-13 23:27:15,539 DEBUG [pool-8-thread-8] dao.ExecutableDao:210 : updating job output, id: 48dee96e-10fd-472b-b466-39505b6e57c0-02
>
> The CM log looks like this:
> Job Name: Kylin_Cube_Builder_user_all_cube_2_only_msisdn
> User Name: tmn
> Queue: root.tmn
> State: KILLED
> Uberized: false
> Submitted: Sun Feb 12 19:19:24 CST 2017
> Started: Sun Feb 12 19:19:38 CST 2017
> Finished: Sun Feb 12 20:30:13 CST 2017
> Elapsed: 1hrs, 10mins, 35sec
> Diagnostics:
> Kill job job_1486825738076_4205 received from tmn (auth:SIMPLE) at 10.180.212.38
> Job received Kill while in RUNNING state.
> Average Map Time 24mins, 48sec
>
> The MapReduce job log shows:
> Task KILL is received. Killing attempt!
>
> When this happens, resuming the job succeeds. I mean, the job is not stopped by an error!
>
> What is the problem?
>
> My Hadoop cluster is very busy, and this situation happens very often.
>
> Can I set the retry count and the retry interval?