[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825465#comment-16825465
 ] 

Jim Brennan commented on YARN-9486:
-----------------------------------

{quote}Patch 003 added the safe guard for missing pid file, and reverted the 
isLaunchCompleted logic. If IOException is thrown by disk health check, it will 
leave containers behind. Is that ok? I feel safer to check isLaunchCompleted 
flag to catch the corner cases, but I understand it may not be helpful in code 
readability.
{quote}
Yeah - really, anything that throws before you actually call relaunchContainer() 
will put you in that state; the new call to getLocalPathForWrite() can throw 
IOException as well.
I don't think it's ok to leave containers behind.
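To make that concrete, here is a minimal, self-contained sketch of the failure mode 
(simplified stand-ins, not the actual ContainerRelaunch/ContainerCleanup code): any 
exception thrown before the relaunch call leaves the "launched" flag unset, so the 
cleanup pass decides nothing needs to be done and the old container is left behind.
{code:java}
import java.io.IOException;

// Hedged sketch with hypothetical helpers (findPidFile, checkDisks); the real
// classes are ContainerRelaunch and ContainerCleanup in the NodeManager.
public class RelaunchSketch {

  private volatile boolean launchCompleted = false;

  // Stand-in for ContainerRelaunch.call(): the earlier steps can throw
  // before the relaunch ever happens, so the flag is never set.
  public void relaunch() throws IOException {
    findPidFile();          // may throw IOException (pid file missing)
    checkDisks();           // may throw IOException (disk health check)
    // relaunchContainer() would run here; only then is the flag set.
    launchCompleted = true;
  }

  // Stand-in for ContainerCleanup: skips removal when it believes the
  // container never launched.
  public void cleanup() {
    if (!launchCompleted) {
      System.out.println("Container not launched. No cleanup needed to be done");
      return;               // original container is left behind
    }
    System.out.println("Cleaning up container");
  }

  private void findPidFile() throws IOException {
    throw new IOException("Could not find pid file in any of the directories");
  }

  private void checkDisks() throws IOException { /* assume healthy */ }

  public static void main(String[] args) {
    RelaunchSketch sketch = new RelaunchSketch();
    try {
      sketch.relaunch();
    } catch (IOException e) {
      sketch.cleanup();     // logs "not launched" even though a container exists
    }
  }
}
{code}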

The only option I can think of other than adding the isLaunchCompleted check in 
ContainerCleanup would be to call markLaunched() when you catch an exception in 
ContainerRelaunch.call(). That's a little unexpected, so you'd need to add a 
comment to say we need to mark isLaunched in this case to ensure the original 
container is cleaned up.
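Roughly, that alternative would look like the sketch below (hypothetical stand-ins 
markLaunched()/relaunchContainer(), not the real ContainerRelaunch internals), with 
the explanatory comment living right in the catch block:
{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;

// Hedged sketch of calling markLaunched() when the relaunch attempt fails,
// so the cleanup path still tears down the original container.
public class ContainerRelaunchSketch implements Callable<Integer> {

  @Override
  public Integer call() {
    try {
      // ... locate the pid file, run the disk health check, build paths ...
      return relaunchContainer();  // only reached if the earlier steps succeed
    } catch (IOException e) {
      // Mark isLaunched even though the relaunch failed: the container from
      // the previous launch still exists, and ContainerCleanup would otherwise
      // log "not launched, no cleanup needed" and leave it behind.
      markLaunched();
      return -1;
    }
  }

  private int relaunchContainer() throws IOException {
    // stand-in for the real docker/exec relaunch path
    return 0;
  }

  private void markLaunched() {
    // stand-in: in the real code this flips the flag ContainerCleanup checks
  }
}
{code}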

My concern about the isLaunchCompleted check is that we always set that flag in the 
finally clause of ContainerLaunch.call(), so any failure before the 
launchContainer() call will now cause a cleanup where it didn't before (for 
example, if we fail the areDisksHealthy() check, as you mentioned for the relaunch 
case).
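A sketch of that concern (again simplified stand-ins, not the real ContainerLaunch 
code): because the flag is set in the finally clause, it is true even when the call 
fails before launchContainer() ever runs, so a cleanup keyed on it fires for early 
failures that previously skipped cleanup.
{code:java}
import java.io.IOException;

// Hedged sketch: launchCompleted is set in finally, so early failures
// (e.g. the disk health check) now also look "launched" to cleanup.
public class LaunchCompletedSketch {

  private volatile boolean launchCompleted = false;

  public int launch() {
    try {
      if (!areDisksHealthy()) {        // failure before launchContainer()
        throw new IOException("disks failed");
      }
      launchContainer();
    } catch (IOException e) {
      return -1;
    } finally {
      launchCompleted = true;          // set regardless of where we failed
    }
    return 0;
  }

  public void cleanup() {
    if (launchCompleted) {
      // With the proposed check, this branch now also runs for failures that
      // happened before launchContainer(), like the disk check above.
      System.out.println("Cleaning up container");
    }
  }

  private boolean areDisksHealthy() { return false; }

  private void launchContainer() { /* stand-in for the real launch */ }

  public static void main(String[] args) {
    LaunchCompletedSketch s = new LaunchCompletedSketch();
    s.launch();    // fails on the disk check, but launchCompleted is still set
    s.cleanup();   // cleans up where it previously would not have
  }
}
{code}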

> Docker container exited with failure does not get clean up correctly
> --------------------------------------------------------------------
>
>                 Key: YARN-9486
>                 URL: https://issues.apache.org/jira/browse/YARN-9486
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.2.0
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>         Attachments: YARN-9486.001.patch, YARN-9486.002.patch, 
> YARN-9486.003.patch
>
>
> When a docker container encounters an error and exits prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_000007//container_1555111445937_0008_01_000007.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_000007 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_000007 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase      
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl    
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008    
> CONTAINERID=container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_000007 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_000007 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_000007 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_000007
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_000007, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: <memory:1024, vCores:1>, Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_000007]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_000007
> {code}
> No docker rm command is performed.


