Tao Yang created YARN-10059:
-------------------------------

             Summary: Final states of failed-to-localize containers are not recorded in NM state store
                 Key: YARN-10059
                 URL: https://issues.apache.org/jira/browse/YARN-10059
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
            Reporter: Tao Yang
            Assignee: Tao Yang


We found an issue where, after an NM restart, many localizers were launched for 
containers that had already completed, exhausting the memory and CPU of the 
machine. All of these containers had failed and completed while localizing on a 
non-existent local directory (which was caused by a separate problem), but 
their final states were not recorded in the NM state store.
The process flow of a fail-to-localize container is as follows:
{noformat}
ResourceLocalizationService$LocalizerRunner#run
-> ContainerImpl$ResourceFailedTransition#transition  handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
      dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
      -> ResourceLocalizationService#handleCleanupContainerResources  handle CLEANUP_CONTAINER_RESOURCES
          dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
          -> ContainerImpl$LocalizationFailedToDoneTransition#transition  handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
{noformat}
This flow currently never updates the state store. Such an update is required 
so that the NM does not launch unnecessary localizations for these containers 
after it restarts.
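
To illustrate why the missing update matters, here is a minimal toy model of 
the flow above. It is only a sketch: StateStore, failLocalization and 
localizersLaunchedAfterRestart are hypothetical names invented for this 
example and are not the Hadoop classes or the actual fix. It assumes that, on 
restart, a localizer is launched for every container whose recorded state is 
not DONE.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every type and method here is hypothetical, not Hadoop code.
class LocalizationRecoverySketch {

    enum State { LOCALIZING, LOCALIZATION_FAILED, DONE }

    // Stand-in for a persistent NM state store that survives a restart.
    static class StateStore {
        private final Map<String, State> recorded = new HashMap<>();

        void storeStarted(String containerId) {
            recorded.put(containerId, State.LOCALIZING);
        }

        void storeCompleted(String containerId) {
            recorded.put(containerId, State.DONE);
        }

        Map<String, State> load() {
            return new HashMap<>(recorded);
        }
    }

    // Walks one container through LOCALIZING -> LOCALIZATION_FAILED -> DONE.
    // The final state is persisted only when recordFinalState is true.
    static State failLocalization(StateStore store, String containerId,
                                  boolean recordFinalState) {
        State state = State.LOCALIZING;
        if (state == State.LOCALIZING) {
            state = State.LOCALIZATION_FAILED;   // upon RESOURCE_FAILED
        }
        if (state == State.LOCALIZATION_FAILED) {
            state = State.DONE;                  // upon CONTAINER_RESOURCES_CLEANEDUP
        }
        if (recordFinalState && state == State.DONE) {
            store.storeCompleted(containerId);
        }
        return state;
    }

    // Simulates a restart: in-memory states are lost, only the store remains.
    // Returns how many containers would be localized again after recovery.
    static int localizersLaunchedAfterRestart(boolean recordFinalState,
                                              int failedContainers) {
        StateStore store = new StateStore();
        for (int i = 0; i < failedContainers; i++) {
            String containerId = "container_" + i;
            store.storeStarted(containerId);
            failLocalization(store, containerId, recordFinalState);
        }
        int launched = 0;
        for (State recovered : store.load().values()) {
            if (recovered != State.DONE) {
                launched++;
            }
        }
        return launched;
    }

    public static void main(String[] args) {
        System.out.println("without final-state record: "
            + localizersLaunchedAfterRestart(false, 3));
        System.out.println("with final-state record: "
            + localizersLaunchedAfterRestart(true, 3));
    }
}
```

In this model, three failed containers lead to three localizers after the 
simulated restart when the final state is not persisted, and to none when it 
is.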



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

