[ https://issues.apache.org/jira/browse/YARN-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579188#comment-14579188 ]

Rohith commented on YARN-3788:
------------------------------

This is a MapReduce project issue/query; moving it to MR for further discussion.

> Application Master and Task Tracker timeouts are applied incorrectly
> --------------------------------------------------------------------
>
>                 Key: YARN-3788
>                 URL: https://issues.apache.org/jira/browse/YARN-3788
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Sivachenko
>
> I am running a streaming job which requires a big (~50GB) data file to run 
> (the file is attached via hadoop jar <...> -file BigFile.dat).
> Most likely this command will fail as follows (note that the error message 
> is rather meaningless):
> 2015-05-27 15:55:00,754 WARN  [main] streaming.StreamJob 
> (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use 
> generic option -files instead.
> packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, 
> /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] 
> /var/tmp/streamjob633547925483233845.jar tmpDir=null
> 2015-05-27 19:46:22,942 INFO  [main] client.RMProxy 
> (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
> nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 19:46:23,733 INFO  [main] client.RMProxy 
> (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
> nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 20:13:37,231 INFO  [main] mapred.FileInputFormat 
> (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
> 2015-05-27 20:13:38,110 INFO  [main] mapreduce.JobSubmitter 
> (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
> 2015-05-27 20:13:38,136 INFO  [main] Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is 
> deprecated. Instead, use mapreduce.job.reduces
> 2015-05-27 20:13:38,390 INFO  [main] mapreduce.JobSubmitter 
> (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: 
> job_1431704916575_2531
> 2015-05-27 20:13:38,689 INFO  [main] impl.YarnClientImpl 
> (YarnClientImpl.java:submitApplication(204)) - Submitted application 
> application_1431704916575_2531
> 2015-05-27 20:13:38,743 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - 
> The url to track the job: 
> http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
> 2015-05-27 20:13:38,746 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
> 2015-05-27 21:04:12,353 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in 
> uber mode : false
> 2015-05-27 21:04:12,356 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
> 2015-05-27 21:04:12,374 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with 
> state FAILED due to: Application application_1431704916575_2531 failed 2 
> times due to ApplicationMaster for attempt 
> appattempt_1431704916575_2531_000002 timed out. Failing the application.
> 2015-05-27 21:04:12,473 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1380)) - Counters: 0
> 2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob 
> (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
> Streaming Command Failed!
> This is because the yarn.am.liveness-monitor.expiry-interval-ms timeout 
> (defaults to 600 seconds) expires before the large data file is transferred.
> As a next step I increase yarn.am.liveness-monitor.expiry-interval-ms. After 
> that the application is successfully initialized and tasks are spawned.
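> For illustration, this property is a ResourceManager-side setting that 
> lives in yarn-site.xml; the one-hour value below is only an example, not a 
> recommendation:
>
>     <property>
>       <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
>       <!-- 1 hour; the default is 600000 ms (10 minutes) -->
>       <value>3600000</value>
>     </property>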
> But then I encounter another error: the default 600-second 
> mapreduce.task.timeout expires before the tasks are initialized, and the 
> tasks fail.
> The error message "Task attempt_XXX failed to report status for 600 
> seconds" is also misleading: this timeout is supposed to kill 
> non-responsive (stuck) tasks, but here it strikes because auxiliary data 
> files are copied slowly.
> So I need to increase mapreduce.task.timeout too, and only after that does 
> my job succeed.
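> For illustration, mapreduce.task.timeout can be raised per job through the 
> generic -D option on the streaming command line; the value and the file 
> names below are only examples:
>
>     hadoop jar hadoop-streaming.jar \
>         -D mapreduce.task.timeout=3600000 \
>         -files BigFile.dat \
>         -input in -output out \
>         -mapper mapper.py -reducer NONE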
> At the very least, the error messages need to be tweaked to indicate that 
> the Application (or Task) is failing because auxiliary files were not 
> copied within that time, not just a generic "timeout expired".
> A better solution would be to not count the time spent distributing data 
> files against these timeouts.



