Dmitry Sivachenko created YARN-3788:
---------------------------------------

             Summary: Application Master and Task Tracker timeouts are applied 
incorrectly
                 Key: YARN-3788
                 URL: https://issues.apache.org/jira/browse/YARN-3788
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.4.1
            Reporter: Dmitry Sivachenko


I am running a streaming job which requires a big (~50 GB) data file to run 
(the file is attached via hadoop jar <...> -file BigFile.dat).

Most likely this command will fail as follows (note that the error message is 
rather meaningless):
2015-05-27 15:55:00,754 WARN  [main] streaming.StreamJob 
(StreamJob.java:parseArgv(291)) - -file option is deprecated, please use 
generic option -files instead.
packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, 
/tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] 
/var/tmp/streamjob633547925483233845.jar tmpDir=null
2015-05-27 19:46:22,942 INFO  [main] client.RMProxy 
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
nezabudka1-00.yandex.ru/5.255.231.129:8032
2015-05-27 19:46:23,733 INFO  [main] client.RMProxy 
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
nezabudka1-00.yandex.ru/5.255.231.129:8032
2015-05-27 20:13:37,231 INFO  [main] mapred.FileInputFormat 
(FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2015-05-27 20:13:38,110 INFO  [main] mapreduce.JobSubmitter 
(JobSubmitter.java:submitJobInternal(396)) - number of splits:1
2015-05-27 20:13:38,136 INFO  [main] Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
2015-05-27 20:13:38,390 INFO  [main] mapreduce.JobSubmitter 
(JobSubmitter.java:printTokens(479)) - Submitting tokens for job: 
job_1431704916575_2531
2015-05-27 20:13:38,689 INFO  [main] impl.YarnClientImpl 
(YarnClientImpl.java:submitApplication(204)) - Submitted application 
application_1431704916575_2531
2015-05-27 20:13:38,743 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - 
The url to track the job: 
http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
2015-05-27 20:13:38,746 INFO  [main] mapreduce.Job 
(Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
2015-05-27 21:04:12,353 INFO  [main] mapreduce.Job 
(Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in 
uber mode : false
2015-05-27 21:04:12,356 INFO  [main] mapreduce.Job 
(Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
2015-05-27 21:04:12,374 INFO  [main] mapreduce.Job 
(Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with 
state FAILED due to: Application application_1431704916575_2531 failed 2 times 
due to ApplicationMaster for attempt appattempt_1431704916575_2531_000002 timed 
out. Failing the application.
2015-05-27 21:04:12,473 INFO  [main] mapreduce.Job 
(Job.java:monitorAndPrintJob(1380)) - Counters: 0
2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob 
(StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
Streaming Command Failed!


This happens because the yarn.am.liveness-monitor.expiry-interval-ms timeout 
(600 seconds by default) expires before the large data file has been 
transferred.
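For reference, this is roughly the workaround I applied in yarn-site.xml; the 
value shown is just an example, not a recommendation:

```xml
<!-- yarn-site.xml: raise the ApplicationMaster liveness timeout so the AM
     is not declared dead while large -file payloads are still being copied.
     1800000 ms (30 minutes) is an example value chosen for illustration. -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>1800000</value>
</property>
```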

As a next step I increased yarn.am.liveness-monitor.expiry-interval-ms. After 
that the application initializes successfully and tasks are spawned.

But then I hit another error: the default 600-second mapreduce.task.timeout 
expires before the tasks are initialized, and the tasks fail.

The error message "Task attempt_XXX failed to report status for 600 seconds" 
is also misleading: this timeout is supposed to kill non-responsive (stuck) 
tasks, but here it fires simply because the auxiliary data files are copied 
slowly.

So I had to increase mapreduce.task.timeout as well, and only then did my job 
succeed.
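For completeness, the second timeout can be raised the same way; a sketch for 
mapred-site.xml (again, the value is only an example, and the property can 
equally be passed per job with the generic -D option on the streaming command 
line):

```xml
<!-- mapred-site.xml: raise the task progress-report timeout so tasks are
     not killed while auxiliary data files are still being localized.
     1800000 ms (30 minutes) is an example value chosen for illustration. -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>1800000</value>
</property>
```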

At the very least, the error messages should be tweaked to indicate that the 
Application (or Task) failed because auxiliary files could not be copied 
within the allotted time, rather than a generic "timeout expired".

A better solution would be to not count time spent distributing data files 
against these timeouts at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
