[ https://issues.apache.org/jira/browse/YARN-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930366#comment-15930366 ]
Haibo Chen commented on YARN-6319:
----------------------------------
By linearizing container cleanup and app cleanup, I mean that application
cleanup has to wait for all container cleanups to finish before it can start,
i.e., application cleanup can only happen after the last container cleanup
finishes. I do not mean that container cleanups need to be done one after
another. In cases where the deletion threads are occupied or delayed, it can
take some time to finish the last container cleanup task. Again, I don't think
this is a dependency that we need to have. Even though we may potentially need
to change two containerExecutors for option 1, the change should be fairly
self-contained and does not change the rest of the flow. BTW, can you please
set the Affects Version so that we are talking about the same version?
> race condition between deleting app dir and deleting container dir
> ------------------------------------------------------------------
>
> Key: YARN-6319
> URL: https://issues.apache.org/jira/browse/YARN-6319
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hong Zhiguo
> Assignee: Hong Zhiguo
>
> The last container (on one node) of an app completes
> | --> triggers async deletion of container dir (container cleanup)
> | --> triggers async deletion of app dir (app cleanup)
> For LCE, deletion is done by container-executor. The "app cleanup" lists
> sub-dirs (step 1), and then unlinks the items one by one (step 2). If a file
> is deleted by "container cleanup" between step 1 and step 2, it reports the
> error below and aborts the deletion.
> {code}
> ContainerExecutor: Couldn't delete file
> $LOCAL/usercache/$USER/appcache/application_1481785469354_353539/container_1481785469354_353539_01_000028/$FILE
> - No such file or directory
> {code}
> The app dir then escapes the cleanup. That is why we always have many app
> dirs left behind.
> solution 1: just ignore the error without breaking in
> container-executor.c::delete_path()
> solution 2: use a lock to serialize the cleanup of same app dir.
> solution 3: back off and retry on error
> Comments are welcome.
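
For reference, solution 1 above could look roughly like the sketch below. This
is not the actual code of container-executor.c::delete_path(); the helper name
unlink_tolerating_missing() is hypothetical and only illustrates the idea of
treating ENOENT as "someone else already deleted it" instead of as a failure
that stops the walk.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT the real container-executor.c code.
 * Returns 0 if the file is gone after the call (either we removed it or
 * a concurrent cleanup already did), and -1 on any other error.
 */
static int unlink_tolerating_missing(const char *path) {
  assert(path != NULL);
  if (unlink(path) == 0) {
    return 0;
  }
  if (errno == ENOENT) {
    /* Another cleanup task won the race; the goal (file gone) is met. */
    return 0;
  }
  /* Any other error is still reported so real failures stay visible. */
  fprintf(stderr, "Couldn't delete file %s - %s\n", path, strerror(errno));
  return -1;
}
```

With a helper like this, the app cleanup could keep walking the listing from
step 1 even when the container cleanup has removed an entry in the meantime,
while errors other than ENOENT would still be reported to the caller.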