[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910411#comment-13910411 ]
Jason Lowe commented on YARN-221:
---------------------------------
bq. We can have the MR AM wait for notification, as in container exit -> NM
notifies RM -> RM notifies AM. That will create some delay before the AM can
declare the job done. With the NM -> RM heartbeat interval used in big
clusters, it could add a couple of seconds of delay to the job. That might not
be a big deal for regular MR jobs.
The NM does out-of-band heartbeats when containers exit, so the turnaround time
can be shorter than a full NM heartbeat interval.
If we're really concerned about any additional time added for graceful task
exit, we can also have the AM unregister when the job succeeds/fails but before
all tasks exit. The RM will then kill any remaining containers of the
application when the AM eventually exits (or times out waiting). In that sense
it would not add any time from the job client's perspective, as the job could
report completion at the same time it did before. However, it would add some
time from the YARN perspective, as the application would linger on the cluster
in the FINISHING state a few seconds longer than it did before.
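As a rough illustration of that trade-off, the sketch below models the two
options with made-up timestamps. The class, enum, and method names are
hypothetical and are not YARN or MapReduce APIs. It only shows that
unregistering early moves the client-visible completion time earlier, without
changing when the application actually leaves the cluster.

```java
// Hypothetical model of the two options; none of these names are YARN APIs.
final class CompletionTimeline {
    enum Strategy { WAIT_FOR_CONTAINER_EXIT, UNREGISTER_EARLY }

    // Time at which the job client can see the job as complete.
    static long clientVisibleDoneMs(Strategy strategy, long jobDoneMs, long amExitMs) {
        // Waiting for every container-exit notification delays the report until
        // the AM has nothing left to wait for; unregistering early does not.
        return strategy == Strategy.WAIT_FOR_CONTAINER_EXIT
                ? Math.max(jobDoneMs, amExitMs)
                : jobDoneMs;
    }

    // Time the application stays on the cluster after the client already
    // considers the job complete (spent in the FINISHING state).
    static long lingerMs(Strategy strategy, long jobDoneMs, long amExitMs) {
        return Math.max(jobDoneMs, amExitMs)
                - clientVisibleDoneMs(strategy, jobDoneMs, amExitMs);
    }

    public static void main(String[] args) {
        long jobDoneMs = 60_000L; // assumed: job outcome known at t = 60 s
        long amExitMs = 63_000L;  // assumed: tasks exit and the AM leaves at t = 63 s
        for (Strategy s : Strategy.values()) {
            System.out.println(s + ": client sees completion at "
                    + clientVisibleDoneMs(s, jobDoneMs, amExitMs) + " ms, lingers "
                    + lingerMs(s, jobDoneMs, amExitMs) + " ms");
        }
    }
}
```

With these assumed numbers, the early-unregister option keeps the
client-visible completion at the moment the job outcome is known and pushes the
extra seconds into the lingering time instead.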
bq. One thing to add: we need a definition and policy for how to handle tasks
that are in the finishing state and that the MR AM ends up stopping because
they don't exit by themselves.
I don't think we need to get too tricky here. The NM will see the container
return a non-zero exit code and assume that's a failure. If tasks are
succeeding but returning non-zero exit codes then that's probably a bug, and
arguably it's a good thing we're grabbing the logs to show what went wrong when
the task tried to tear down. IMHO we should fix what's causing the non-zero
exit code rather than try to add a mechanism to prevent logs from being
aggregated in what should be a rare and abnormal case.
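A minimal sketch of that rule, assuming a policy where only the logs of failed
containers are kept. The names below are hypothetical and are not NodeManager
code; the point is only that the decision can key off the exit code alone, with
no extra signal from the AM.

```java
// Hypothetical illustration; not actual NodeManager classes or methods.
final class ExitCodeRule {
    enum Outcome { SUCCESS, FAILURE }

    // Any non-zero exit code is treated as a failure, even if the task's
    // work had already succeeded before it tried to tear down.
    static Outcome classify(int exitCode) {
        return exitCode == 0 ? Outcome.SUCCESS : Outcome.FAILURE;
    }

    // Assumed policy: keep logs only for containers classified as failed.
    static boolean shouldAggregateLogs(int exitCode) {
        return classify(exitCode) == Outcome.FAILURE;
    }

    public static void main(String[] args) {
        int[] exitCodes = {0, 1, 143}; // sample exit codes, chosen arbitrarily
        for (int code : exitCodes) {
            System.out.println("exit code " + code + " -> " + classify(code)
                    + ", aggregate logs: " + shouldAggregateLogs(code));
        }
    }
}
```

Under this rule, a task that finished its work but exits non-zero while being
stopped still has its logs collected, which is the behavior argued for above.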
> NM should provide a way for AM to tell it not to aggregate logs.
> ----------------------------------------------------------------
>
> Key: YARN-221
> URL: https://issues.apache.org/jira/browse/YARN-221
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Robert Joseph Evans
> Assignee: Chris Trezzo
> Attachments: YARN-221-trunk-v1.patch
>
>
> The NodeManager should provide a way for an AM to tell it that either the
> logs should not be aggregated, that they should be aggregated with a high
> priority, or that they should be aggregated but with a lower priority. The
> AM should be able to do this in the ContainerLaunch context to provide a
> default value, but should also be able to update the value when the container
> is released.
> This would allow the NM to skip log aggregation in some cases and avoid
> connecting to the NN at all.
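
One possible shape for the request in the description is sketched below: a
default policy given at container launch, optionally overridden when the
container is released. The enum, its values, and the method names are
assumptions made for illustration. They do not come from the attached patch or
from any existing YARN API.

```java
// Hypothetical sketch of the requested feature; not taken from the patch.
final class LogPolicySketch {
    enum LogAggregationPolicy {
        DO_NOT_AGGREGATE,
        AGGREGATE_HIGH_PRIORITY,
        AGGREGATE_LOW_PRIORITY
    }

    // The value supplied at release time, if any, wins over the launch default.
    static LogAggregationPolicy effectivePolicy(LogAggregationPolicy launchDefault,
                                                LogAggregationPolicy releaseOverride) {
        return releaseOverride != null ? releaseOverride : launchDefault;
    }

    // The NM only needs to contact the NN when something will be uploaded.
    static boolean needsNameNodeConnection(LogAggregationPolicy policy) {
        return policy != LogAggregationPolicy.DO_NOT_AGGREGATE;
    }

    public static void main(String[] args) {
        LogAggregationPolicy launchDefault = LogAggregationPolicy.AGGREGATE_LOW_PRIORITY;
        // Assumed scenario: the AM decides at release time that the logs are not needed.
        LogAggregationPolicy effective =
                effectivePolicy(launchDefault, LogAggregationPolicy.DO_NOT_AGGREGATE);
        System.out.println("effective policy: " + effective
                + ", needs NN: " + needsNameNodeConnection(effective));
    }
}
```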