[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2024-01-04 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated YARN-10059:
--
Target Version/s: 3.5.0  (was: 3.4.0)

> Final states of failed-to-localize containers are not recorded in NM state 
> store
> 
>
> Key: YARN-10059
> URL: https://issues.apache.org/jira/browse/YARN-10059
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: YARN-10059.001.patch
>
>
> We found that after an NM restart, many localizers were launched for
> already-completed containers and exhausted the memory and CPU of that
> machine. These containers had all failed and completed while localizing to
> a non-existent local directory (itself caused by a separate problem), but
> their final states were not recorded in the NM state store.
> The process flow of a failed-to-localize container is as follows:
> {noformat}
> ResourceLocalizationService$LocalizerRunner#run
> -> ContainerImpl$ResourceFailedTransition#transition  handle LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
>   dispatch LocalizationEventType.CLEANUP_CONTAINER_RESOURCES
>   -> ResourceLocalizationService#handleCleanupContainerResources  handle CLEANUP_CONTAINER_RESOURCES
>   dispatch ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP
>   -> ContainerImpl$LocalizationFailedToDoneTransition#transition  handle LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
> {noformat}
> This flow currently never updates the state store; such an update is
> required to avoid unnecessary localizations after the NM restarts.
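
The flow quoted above can be illustrated with a small, self-contained Java sketch. It is not the attached patch and does not use the real NodeManager classes: `LocalizationFailureSketch`, `InMemoryStateStore`, `Container`, `storeContainerCompleted` and `needsLocalizationAfterRestart` are all hypothetical names chosen for illustration. It only shows the idea that recording the final state on the LOCALIZATION_FAILED -> DONE transition lets a restart skip re-localization.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; none of these types are the real Hadoop classes. */
public class LocalizationFailureSketch {

  enum ContainerState { LOCALIZING, LOCALIZATION_FAILED, DONE }

  enum ContainerEvent { RESOURCE_FAILED, CONTAINER_RESOURCES_CLEANEDUP }

  /** Hypothetical stand-in for the NM state store: remembers completed containers. */
  static class InMemoryStateStore {
    private final Map<String, Boolean> completed = new HashMap<>();

    void storeContainerCompleted(String containerId) {
      completed.put(containerId, true);
    }

    boolean isCompleted(String containerId) {
      return completed.getOrDefault(containerId, false);
    }
  }

  /** Hypothetical container with only the two transitions from the quoted flow. */
  static class Container {
    final String id;
    ContainerState state = ContainerState.LOCALIZING;
    private final InMemoryStateStore store;

    Container(String id, InMemoryStateStore store) {
      this.id = id;
      this.store = store;
    }

    void handle(ContainerEvent event) {
      if (state == ContainerState.LOCALIZING
          && event == ContainerEvent.RESOURCE_FAILED) {
        // LOCALIZING -> LOCALIZATION_FAILED upon RESOURCE_FAILED
        state = ContainerState.LOCALIZATION_FAILED;
      } else if (state == ContainerState.LOCALIZATION_FAILED
          && event == ContainerEvent.CONTAINER_RESOURCES_CLEANEDUP) {
        // LOCALIZATION_FAILED -> DONE upon CONTAINER_RESOURCES_CLEANEDUP
        state = ContainerState.DONE;
        // Assumed fix: persist the final state, the step the issue reports as missing.
        store.storeContainerCompleted(id);
      } else {
        throw new IllegalStateException(
            "Unexpected event " + event + " in state " + state);
      }
    }
  }

  /** After a restart, only containers without a recorded final state need localizing. */
  static boolean needsLocalizationAfterRestart(InMemoryStateStore store, String containerId) {
    return !store.isCompleted(containerId);
  }

  public static void main(String[] args) {
    InMemoryStateStore store = new InMemoryStateStore();
    Container container = new Container("container_1", store);

    container.handle(ContainerEvent.RESOURCE_FAILED);
    container.handle(ContainerEvent.CONTAINER_RESOURCES_CLEANEDUP);

    System.out.println("final state: " + container.state);
    System.out.println("localize after restart: "
        + needsLocalizationAfterRestart(store, container.id));
  }
}
```

In this sketch, a container whose final state was never stored would still be reported as needing localization after a restart, which mirrors the behaviour described in the issue.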



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2020-04-09 Thread Brahma Reddy Battula (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-10059:

Target Version/s: 3.4.0  (was: 3.3.0, 3.2.2, 3.1.4, 2.10.1)

Bulk update: moved all 3.3.0 non-blocker issues, please move back if it is a 
blocker.




[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-24 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10059:

Attachment: YARN-10059.001.patch




[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-24 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10059:

Attachment: (was: YARN-10059.001.patch)




[jira] [Updated] (YARN-10059) Final states of failed-to-localize containers are not recorded in NM state store

2019-12-23 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10059:

Attachment: YARN-10059.001.patch
