Tao Yang created YARN-10059:
-------------------------------

Summary: Final states of failed-to-localize containers are not recorded in NM state store
Key: YARN-10059
URL: https://issues.apache.org/jira/browse/YARN-10059
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Tao Yang
Assignee: Tao Yang

We found that, after an NM restart, many localizers were launched for containers that had already completed, and they exhausted the memory and CPU of the machine. These containers had all failed during localization because of a nonexistent local directory (caused by a separate problem) and had already completed, but their final states were not recorded in the NM state store.

The process flow of a failed-to-localize container is as follows:
{noformat}
ResourceLocalizationService$LocalizerRunner#run
-> ContainerImpl$ResourceFailedTransition#transition
     handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
     dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
-> ResourceLocalizationService#handleCleanupContainerResources
     handle CLEANUP_CONTAINER_RESOURCES
     dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
-> ContainerImpl$LocalizationFailedToDoneTransition#transition
     handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
{noformat}
Nothing in this flow currently updates the state store. Such an update is required to avoid unnecessary localizations after the NM restarts.
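
The effect of the missing update can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not NodeManager code: the classes `StateStore` and `Container`, the methods `recordCompleted`, `isCompleted`, `relaunchesLocalizerAfterRestart` and `simulate`, the container id, and the exit code are all hypothetical. Only the state names and event names are taken from the flow above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the flow described in this issue. All class and method names
// here are hypothetical; this is not the real NodeManager code or its API.
class LocalizationFailureSketch {

    // State names taken from the flow in the issue description.
    enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }

    // Hypothetical stand-in for the NM state store: it survives a "restart".
    static class StateStore {
        private final Map<String, Integer> completed = new HashMap<>();

        void recordCompleted(String containerId, int exitCode) {
            completed.put(containerId, exitCode);
        }

        boolean isCompleted(String containerId) {
            return completed.containsKey(containerId);
        }
    }

    // Hypothetical container with only the two transitions from the flow.
    static class Container {
        final String id;
        final StateStore store;
        final boolean persistFinalState;
        State state = State.LOCALIZING;

        Container(String id, StateStore store, boolean persistFinalState) {
            this.id = id;
            this.store = store;
            this.persistFinalState = persistFinalState;
        }

        // LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
        void onResourceFailed() {
            if (state == State.LOCALIZING) {
                state = State.LOCALIZATION_FAILED;
            }
        }

        // LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
        void onResourcesCleanedUp(int exitCode) {
            if (state == State.LOCALIZATION_FAILED) {
                state = State.DONE;
                if (persistFinalState) {
                    // The step the issue reports as missing.
                    store.recordCompleted(id, exitCode);
                }
            }
        }
    }

    // Simulated recovery: a container with no recorded final state is
    // treated as still needing localization after the restart.
    static boolean relaunchesLocalizerAfterRestart(StateStore store, String id) {
        return !store.isCompleted(id);
    }

    // Runs the failure flow, then reports whether a restart would relaunch
    // a localizer for the already-finished container.
    static boolean simulate(boolean persistFinalState) {
        StateStore store = new StateStore();
        Container c = new Container("container_1", store, persistFinalState);
        c.onResourceFailed();
        c.onResourcesCleanedUp(-1); // hypothetical exit code
        return relaunchesLocalizerAfterRestart(store, c.id);
    }

    public static void main(String[] args) {
        System.out.println("without store update, relaunch: " + simulate(false));
        System.out.println("with store update, relaunch: " + simulate(true));
    }
}
```

In this model, the container reaches DONE in both runs. Only the run that records the final state lets the simulated recovery skip the localizer, which is the behavior this issue asks for.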