Tao Yang created YARN-10059:
-------------------------------
Summary: Final states of failed-to-localize containers are not
recorded in NM state store
Key: YARN-10059
URL: https://issues.apache.org/jira/browse/YARN-10059
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Tao Yang
Assignee: Tao Yang
We found that, after an NM restart, many localizers were launched for
containers that had already completed, exhausting the memory and CPU of the
machine. These containers had all failed and completed while localizing on a
non-existent local directory (caused by a separate problem), but their final
states were not recorded in the NM state store.
The process flow of a failed-to-localize container is as follows:
{noformat}
ResourceLocalizationService$LocalizerRunner#run
-> ContainerImpl$ResourceFailedTransition#transition
     handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
     dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
-> ResourceLocalizationService#handleCleanupContainerResources
     handle CLEANUP_CONTAINER_RESOURCES
     dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
-> ContainerImpl$LocalizationFailedToDoneTransition#transition
     handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
{noformat}
This flow currently never updates the state store. Such an update is required
to avoid unnecessary localizations after the NM restarts.
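
To illustrate the missing step, here is a minimal, self-contained sketch of
the idea, not the actual NodeManager code or a proposed patch. The classes
InMemoryStateStore and Container, their methods, and the exit code value are
all hypothetical stand-ins invented for this example. It only shows that
recording the final state during the LOCALIZATION_FAILED -> DONE transition
lets a restart skip localization for that container.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalizationFailedSketch {
    // Simplified subset of container states taken from the flow above.
    enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }

    // Hypothetical stand-in for the NM state store (assumption, not the Hadoop API).
    static class InMemoryStateStore {
        final Map<String, Integer> completed = new HashMap<>();

        void storeContainerCompleted(String containerId, int exitCode) {
            completed.put(containerId, exitCode);
        }

        boolean isCompleted(String containerId) {
            return completed.containsKey(containerId);
        }
    }

    // Hypothetical container with only the two transitions of interest.
    static class Container {
        final String id;
        State state = State.LOCALIZING;
        int exitCode;

        Container(String id) {
            this.id = id;
        }

        // LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED.
        void onResourceFailed(int exitCode) {
            this.exitCode = exitCode;
            this.state = State.LOCALIZATION_FAILED;
        }

        // LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP.
        // Recording the final state here is the step the issue says is missing.
        void onResourcesCleanedUp(InMemoryStateStore store) {
            store.storeContainerCompleted(id, exitCode);
            this.state = State.DONE;
        }
    }

    // After a restart, only containers without a recorded final state
    // should be considered for localization again.
    static boolean needsLocalizationAfterRestart(InMemoryStateStore store, String id) {
        return !store.isCompleted(id);
    }

    public static void main(String[] args) {
        InMemoryStateStore store = new InMemoryStateStore();
        Container c = new Container("container_example_1");
        c.onResourceFailed(-1); // illustrative exit code only
        c.onResourcesCleanedUp(store);
        System.out.println(needsLocalizationAfterRestart(store, c.id));
    }
}
```

In this sketch a container whose final state was stored is skipped on
restart, while one that never reached the store would still be localized
again, which matches the behavior described above.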