[ https://issues.apache.org/jira/browse/YARN-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093909#comment-16093909 ]
Jason Lowe commented on YARN-6846:
----------------------------------
Sample log from a 2.8-based release. In this case I believe nftw is reporting
FTW_NS as the file type: the file was in the directory listing but could no
longer be stat'ed because the other container-executor had already removed it.
The switch statement in nftw_cb does not handle FTW_NS, which results in the
"Internal error" message.
{noformat}
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO
application.ApplicationImpl: Application application_1496686551678_5664018
transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO
monitor.ContainersMonitorImpl: Stopping resource-monitoring for
container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO
containermanager.AuxServices: Got event CONTAINER_STOP for appId
application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO
yarn.YarnShuffleService: Stopping container
container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO
containermanager.AuxServices: Got event APPLICATION_STOP for appId
application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO
yarn.YarnShuffleService: Stopping application application_1496686551678_5664018
2017-06-30 13:43:47,397 [AsyncDispatcher event handler] INFO
application.ApplicationImpl: Application application_1496686551678_5664018
transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2017-06-30 13:43:47,734 [DeletionService #3] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,746 [DeletionService #0] INFO
nodemanager.LinuxContainerExecutor: Deleting absolute path :
/.../appcache/application_1496686551678_5664018
2017-06-30 13:43:48,990 [DeletionService #0] WARN
privileged.PrivilegedOperationExecutor: Shell execution returned exit code:
255. Privileged Execution Operation Output:
main : command provided 3
main : run as user is ...
main : requested yarn user is ...
Internal error deleting
/.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
Error in nftw while deleting /.../appcache/application_1496686551678_5664018
Couldn't delete directory /.../appcache/application_1496686551678_5664018 -
Directory not empty
{noformat}
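To make the FTW_NS case described above concrete, here is a minimal sketch of
an nftw callback that tolerates entries which vanish during the walk. It is not
the actual nftw_cb from container-executor; the names tolerant_delete_cb and
tolerant_delete_tree are hypothetical, and the policy of treating ENOENT as
"already deleted" is an assumption about what a fix might look like. Because
errno is not guaranteed to survive until the callback runs, the sketch probes
the path again with lstat before deciding.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* exposes nftw(), FTW_DP, FTW_PHYS */
#endif
#include <errno.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical callback for illustration only. */
static int tolerant_delete_cb(const char *path, const struct stat *sb,
                              int type, struct FTW *ftw) {
  struct stat probe;
  (void) sb;
  (void) ftw;
  switch (type) {
  case FTW_DP:   /* directory whose children were already visited */
    if (rmdir(path) != 0 && errno != ENOENT) {
      fprintf(stderr, "Couldn't delete directory %s\n", path);
      return -1;
    }
    break;
  case FTW_F:    /* regular file */
  case FTW_SL:   /* symbolic link (reported as such under FTW_PHYS) */
    if (unlink(path) != 0 && errno != ENOENT) {
      fprintf(stderr, "Couldn't delete file %s\n", path);
      return -1;
    }
    break;
  case FTW_NS:
    /* stat failed during the walk. Probe again: ENOENT means another
     * process removed the entry first, which is fine for a delete. */
    if (lstat(path, &probe) != 0 && errno == ENOENT) {
      break;
    }
    fprintf(stderr, "Couldn't stat %s\n", path);
    return -1;
  default:       /* FTW_DNR and anything unexpected */
    fprintf(stderr, "Internal error deleting %s\n", path);
    return -1;
  }
  return 0;
}

/* Hypothetical wrapper: depth-first walk that does not follow symlinks. */
static int tolerant_delete_tree(const char *root) {
  return nftw(root, tolerant_delete_cb, 20, FTW_DEPTH | FTW_PHYS);
}
```

Calling tolerant_delete_tree on a directory removes it depth-first. A nonzero
return from the callback still stops the walk, so this sketch only addresses
the vanished-file case, not other errors.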
The deletion code has changed in 2.9, but I believe it too will fail if files
are deleted out from underneath it. At a minimum we need to make the deletion
more robust to errors: it should try to delete as much of the directory
tree as possible rather than giving up on the first error and leaking the rest
of the tree.
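One way to read "delete as much as possible" is sketched below. This is not the
2.8 or 2.9 container-executor code; the function names (note_error,
remove_children, remove_tree_best_effort) are hypothetical, and the design is
an assumption: walk the tree with openat/fstatat/unlinkat, treat ENOENT as
success since another deleter got there first, remember the first real error,
and keep going instead of aborting.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* openat, fstatat, unlinkat, fdopendir */
#endif
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Remember only the first real failure. ENOENT is ignored because it
 * means the entry was already removed by someone else. */
static void note_error(int *first_err, int err) {
  if (err != 0 && err != ENOENT && *first_err == 0) {
    *first_err = err;
  }
}

/* Hypothetical helper: removes everything inside the directory open on
 * dir_fd (ownership of dir_fd passes to this function). Returns 0 or the
 * first errno encountered, but always visits every entry. */
static int remove_children(int dir_fd) {
  int first_err = 0;
  DIR *dir = fdopendir(dir_fd);
  if (dir == NULL) {
    first_err = errno;
    close(dir_fd);
    return first_err;
  }
  for (;;) {
    struct dirent *ent;
    struct stat sb;
    errno = 0;
    ent = readdir(dir);
    if (ent == NULL) {
      note_error(&first_err, errno);  /* errno is 0 at end of directory */
      break;
    }
    if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0) {
      continue;
    }
    if (fstatat(dir_fd, ent->d_name, &sb, AT_SYMLINK_NOFOLLOW) != 0) {
      note_error(&first_err, errno);  /* vanished entries are skipped */
      continue;
    }
    if (S_ISDIR(sb.st_mode)) {
      int child = openat(dir_fd, ent->d_name,
                         O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
      if (child < 0) {
        note_error(&first_err, errno);
        continue;
      }
      note_error(&first_err, remove_children(child));
      if (unlinkat(dir_fd, ent->d_name, AT_REMOVEDIR) != 0) {
        note_error(&first_err, errno);
      }
    } else if (unlinkat(dir_fd, ent->d_name, 0) != 0) {
      note_error(&first_err, errno);
    }
  }
  closedir(dir);  /* also closes dir_fd */
  return first_err;
}

/* Hypothetical entry point: best-effort removal of root and its contents. */
static int remove_tree_best_effort(const char *root) {
  int first_err = 0;
  int fd = open(root, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
  if (fd < 0) {
    return errno == ENOENT ? 0 : errno;
  }
  note_error(&first_err, remove_children(fd));
  if (rmdir(root) != 0) {
    note_error(&first_err, errno);
  }
  return first_err;
}
```

Because only the first real error is kept, a failure deep in the tree is
reported as itself rather than as the "Directory not empty" that follows from
it, and a concurrent deleter removing entries mid-walk does not cause a
failure at all.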
> Nodemanager can fail to fully delete application local directories when
> applications are killed
> -----------------------------------------------------------------------------------------------
>
> Key: YARN-6846
> URL: https://issues.apache.org/jira/browse/YARN-6846
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.8.1
> Reporter: Jason Lowe
> Priority: Critical
>
> When an application is killed all of the running containers are killed and
> the app waits for the containers to complete before cleaning up. As each
> container completes the container directory is deleted via the
> DeletionService. After all containers have completed the app completes and
> the app directory is deleted. If the app completes quickly enough then the
> deletion of the container and app directories can race against each other.
> If the container deletion executor deletes a file just before the application
> deletion executor then it can cause the application deletion executor to
> fail, leaving the remaining entries in the application directory lingering.