[
https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903297#comment-13903297
]
Jason Lowe commented on YARN-221:
---------------------------------
Personally I think the AM racing to kill tasks that have indicated they are
done is a bug. It causes all sorts of problems:
- Occasional "Container killed by ApplicationMaster" messages on otherwise
normal tasks confuse users into thinking something went wrong for some of
their tasks
- Trying to take a Java profile for a task can fail if the profile dump takes
too long or the kill arrives too quickly (see MAPREDUCE-5465)
- Killing a task that should otherwise be exiting on its own creates a constant
race-condition scenario that has caused problems in other similar setups (see
MAPREDUCE-4157 for a similar situation where the RM was killing AMs too early
and causing problems).
I think we should fix these races by implementing a reasonable delay between a
task reporting a terminal state and a kill being issued by the AM. That allows
the task to complete on its own with an appropriate exit code, eliminating the
need to specify log states on stop as a workaround.
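To make the proposed delay concrete, here is a minimal Java sketch of the idea.
It is only an illustration under stated assumptions: the class
GracefulKillScheduler, its method names, and the string container ids are all
hypothetical and do not correspond to existing MapReduce AM or YARN code. The
kill is armed when the task reports a terminal state and cancelled if the
container exits on its own first.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: delay the AM's kill so a finished task can exit by itself. */
class GracefulKillScheduler {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    // Container id -> kill that is armed but has not fired yet.
    private final Map<String, ScheduledFuture<?>> pendingKills = new HashMap<>();
    private final long graceMillis;
    private final Consumer<String> killAction; // stand-in for the real kill request

    GracefulKillScheduler(long graceMillis, Consumer<String> killAction) {
        this.graceMillis = graceMillis;
        this.killAction = killAction;
    }

    /** The task reported a terminal state: arm a kill instead of killing now. */
    synchronized void onTaskReportedDone(String containerId) {
        if (pendingKills.containsKey(containerId)) {
            return; // a kill is already armed for this container
        }
        ScheduledFuture<?> kill = timer.schedule(
            () -> fireKill(containerId), graceMillis, TimeUnit.MILLISECONDS);
        pendingKills.put(containerId, kill);
    }

    /** The container exited on its own: the armed kill is no longer needed. */
    synchronized void onContainerExited(String containerId) {
        ScheduledFuture<?> kill = pendingKills.remove(containerId);
        if (kill != null) {
            kill.cancel(false);
        }
    }

    private void fireKill(String containerId) {
        synchronized (this) {
            if (pendingKills.remove(containerId) == null) {
                return; // the container exited just before the timer fired
            }
        }
        killAction.accept(containerId);
    }

    /** Stop accepting work and wait for already-armed kills to fire or be dropped. */
    boolean drain(long timeoutMillis) {
        timer.shutdown(); // armed delayed kills still run by default
        try {
            return timer.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a scheme like this, only a task that fails to exit within the grace period
would be killed, so a normal task would finish with its own exit code. The
right grace period is a tuning question that this sketch does not answer.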
> NM should provide a way for AM to tell it not to aggregate logs.
> ----------------------------------------------------------------
>
> Key: YARN-221
> URL: https://issues.apache.org/jira/browse/YARN-221
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Robert Joseph Evans
> Assignee: Chris Trezzo
> Attachments: YARN-221-trunk-v1.patch
>
>
> The NodeManager should provide a way for an AM to tell it that either the
> logs should not be aggregated, that they should be aggregated with a high
> priority, or that they should be aggregated but with a lower priority. The
> AM should be able to do this in the ContainerLaunch context to provide a
> default value, but should also be able to update the value when the container
> is released.
> This would allow the NM to skip log aggregation in some cases and avoid
> connecting to the NN at all.