[ 
https://issues.apache.org/jira/browse/YARN-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929009#comment-15929009
 ] 

Haibo Chen commented on YARN-6319:
----------------------------------

Thanks [~zhiguohong] for the further explanation of option 2. While I agree 
with you that a post-callback would completely avoid the race condition, 
serializing container cleanup and app cleanup would unnecessarily slow down 
the application state transitions that other tasks, such as log aggregation, 
depend on. This matters most when an application has many containers: 
previously the app dir cleanup task could run concurrently with all of the 
container cleanup tasks, whereas now it would have to wait for all of them to 
finish. The point I want to make is that the race condition is safe to have 
as long as we ignore the file-not-found error during deletion. I noticed that 
YARN-2902 added code to ignore the file-not-found error code for LCE. Is it 
included in the version where you ran into this issue?
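
To make the point about tolerating the error concrete, here is a minimal 
sketch in C. It assumes the deletion ultimately goes through unlink(2) and 
that the error in question is ENOENT ("No such file or directory"). The 
helper name unlink_ignore_missing is hypothetical; this is not the actual 
container-executor.c code, only an illustration of the idea.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not taken from container-executor.c):
 * remove one file, treating "already gone" as success so that a
 * concurrent cleanup task cannot abort the rest of the traversal. */
int unlink_ignore_missing(const char *path) {
    assert(path != NULL);

    if (unlink(path) == 0) {
        return 0;               /* we removed it ourselves */
    }
    if (errno == ENOENT) {
        return 0;               /* someone else removed it first */
    }

    /* Any other error is still reported to the caller. */
    fprintf(stderr, "Couldn't delete file %s - %s\n", path, strerror(errno));
    return -1;
}
```

With a check like this, whichever of the two cleanup tasks loses the race 
simply moves on to the next entry instead of stopping and leaving the app 
dir behind.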

> race condition between deleting app dir and deleting container dir
> ------------------------------------------------------------------
>
>                 Key: YARN-6319
>                 URL: https://issues.apache.org/jira/browse/YARN-6319
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
>
> Last container (on one node) of one app completes
>     |    --> triggers async deletion of container dir (container cleanup)
>     |    --> triggers async deletion of app dir (app cleanup)
> For LCE, deletion is done by container-executor. The "app cleanup" lists 
> sub-dirs (step 1), and then unlinks items one by one (step 2). If a file is 
> deleted by "container cleanup" between step 1 and step 2, it reports the 
> error below and aborts the deletion.
> {code}
> ContainerExecutor: Couldn't delete file 
> $LOCAL/usercache/$USER/appcache/application_1481785469354_353539/container_1481785469354_353539_01_000028/$FILE
>  - No such file or directory
> {code}
> This app dir then escapes the cleanup. And that's why we always have many app 
> dirs left there.
> solution 1: just ignore the error without breaking in 
> container-executor.c::delete_path()
> solution 2: use a lock to serialize the cleanup of same app dir.
> solution 3: back off and retry on error
> Comments are welcome.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

