[
https://issues.apache.org/jira/browse/YARN-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725526#comment-13725526
]
Zhijie Shen commented on YARN-966:
----------------------------------
bq. Potentially I don't see when we will in fact start ContainerLaunch#call
without its all resources getting downloaded.
YARN-906 is such a corner case.
bq. This I still see should not be done via NULL check. Proper way is to set
boolean flag of ContainerLaunch in the event of KILL synchronously.
The original code asserts state == LOCALIZED and throws an AssertionError when
getting the localized resources. I only changed how the error is reported, so
that callers can handle it more easily. If you think calling
getLocalizedResources() when the container is not at LOCALIZED is not
wrong, I'm afraid we're having two different conversations.
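To illustrate the point about making the error easier for callers to handle, here is a minimal sketch. It is not the YARN-966 patch: the classes below are hypothetical stand-ins for ContainerImpl and ContainerLaunch, and the null return is an assumption based only on the reviewer's mention of a NULL check.

```java
import java.util.Collections;
import java.util.Map;

class LocalizedResourcesSketch {
    // Simplified stand-in for ContainerState; only three states are modeled.
    enum State { LOCALIZING, LOCALIZED, KILLING }

    // Hypothetical, stripped-down container; not the real ContainerImpl.
    static class Container {
        private volatile State state = State.LOCALIZING;
        private final Map<String, String> resources =
            Collections.singletonMap("job.jar", "/local/cache/job.jar");

        void setState(State newState) { state = newState; }

        // Returns null instead of tripping an assert when the state is wrong,
        // so the caller can decide how to react.
        Map<String, String> getLocalizedResources() {
            return state == State.LOCALIZED ? resources : null;
        }
    }

    // Stand-in for ContainerLaunch#call: the error is handled explicitly.
    static String launch(Container container) {
        Map<String, String> localized = container.getLocalizedResources();
        if (localized == null) {
            // Assumption: the real code would report the failure to the NM here.
            return "ABORTED";
        }
        return "LAUNCHED";
    }

    public static void main(String[] args) {
        Container c = new Container();
        c.setState(State.KILLING);
        System.out.println(launch(c));   // prints "ABORTED"
        c.setState(State.LOCALIZED);
        System.out.println(launch(c));   // prints "LAUNCHED"
    }
}
```

With an explicit return value the launching thread stays alive and can still emit whatever events the container needs to reach a terminal state.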
bq. which is completely misleading.. Indeed this occurred because user killed
container not because it failed to localize resources.
I don't think the message is misleading. Again, getLocalizedResources() is not
allowed to be called when the container is not at LOCALIZED (at least that is
what the original code intends). So the message states the problem clearly.
Please note that the kill signal is not the root cause of the thread failure
here. If getLocalizedResources() were not called, the thread would still
complete without an exception.
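The silent thread failure can be reproduced in isolation. The sketch below assumes the launch task is a Callable submitted to an executor; the class and method names are hypothetical and do not come from the YARN code base.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SilentFailureSketch {
    // Submits a task that fails the way a tripped assert would, then
    // reports what the Future knows about the outcome.
    static String runAndReport() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for ContainerLaunch#call hitting the failed assertion.
            Callable<Integer> launch = () -> {
                throw new AssertionError("container is not at LOCALIZED");
            };
            Future<Integer> result = pool.submit(launch);
            try {
                // The error only surfaces if someone calls get(); otherwise
                // it stays captured inside the Future and nothing is logged.
                result.get();
                return "completed";
            } catch (ExecutionException e) {
                return "failed: " + e.getCause().getClass().getSimpleName();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndReport());   // prints "failed: AssertionError"
    }
}
```

If nothing ever inspects the Future, the task simply stops and no follow-up event is sent, which matches the behavior described in this issue.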
> The thread of ContainerLaunch#call will fail without any signal if
> getLocalizedResources() is called when the container is not at LOCALIZED
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-966
> URL: https://issues.apache.org/jira/browse/YARN-966
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zhijie Shen
> Assignee: Zhijie Shen
> Fix For: 2.1.1-beta
>
> Attachments: YARN-966.1.patch
>
>
> In ContainerImpl.getLocalizedResources(), there's:
> {code}
> assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!!
> {code}
> ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(),
> which is scheduled on a separate thread. If the container is not at LOCALIZED
> (e.g. it is at KILLING, see YARN-906), an AssertionError is thrown and
> kills the thread without notifying the NM. As a result, the container never
> receives the further events that ContainerLaunch.call() is supposed to send,
> and cannot move towards completion.