[ 
https://issues.apache.org/jira/browse/YARN-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093909#comment-16093909
 ] 

Jason Lowe commented on YARN-6846:
----------------------------------

Here is a sample log from a 2.8-based release.  In this case I believe nftw is 
passing FTW_NS as the type flag to the callback, since the file was in the 
directory listing but can no longer be stat-ed because the other 
container-executor has already removed it.  The switch statement in nftw_cb 
does not handle FTW_NS, which results in the "Internal error" message.
{noformat}
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1496686551678_5664018 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping container container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping application application_1496686551678_5664018
2017-06-30 13:43:47,397 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1496686551678_5664018 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2017-06-30 13:43:47,734 [DeletionService #3] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,746 [DeletionService #0] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /.../appcache/application_1496686551678_5664018
2017-06-30 13:43:48,990 [DeletionService #0] WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output: 
main : command provided 3
main : run as user is ...
main : requested yarn user is ...
Internal error deleting /.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
Error in nftw while deleting /.../appcache/application_1496686551678_5664018
Couldn't delete directory /.../appcache/application_1496686551678_5664018 - Directory not empty
{noformat}

The deletion code has changed in 2.9, but I believe it too will fail if files 
are deleted out from underneath it.  At a minimum we need to make the deletion 
more robust to errors: it should try to delete as much of the directory tree 
as possible rather than giving up on the first error and leaking the rest of 
the tree.

> Nodemanager can fail to fully delete application local directories when 
> applications are killed
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-6846
>                 URL: https://issues.apache.org/jira/browse/YARN-6846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.1
>            Reporter: Jason Lowe
>            Priority: Critical
>
> When an application is killed all of the running containers are killed and 
> the app waits for the containers to complete before cleaning up.  As each 
> container completes the container directory is deleted via the 
> DeletionService.  After all containers have completed the app completes and 
> the app directory is deleted.  If the app completes quickly enough then the 
> deletion of the container and app directories can race against each other.  
> If the container deletion executor deletes a file just before the application 
> deletion executor then it can cause the application deletion executor to 
> fail, leaving the remaining entries in the application directory lingering.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
