[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248343#comment-17248343
 ] 

Peter Bacsko commented on YARN-9833:
------------------------------------

[~ebadger] [~Jim_Brennan] thanks for sharing some thoughts on this.

1. We were not thinking about {{errorDirs}} because, as we were tracking down 
the issue, only {{localDirs}} seemed to be problematic, although I agree that 
it is inconsistent this way. Shall we follow up on this?

2. What Jim said is interesting. Does it mean that we potentially introduced a 
new bug by fixing this? That would be really bad. If this is really an issue, 
perhaps we can follow up on this as well by creating a new JIRA to examine the 
call hierarchies.

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> --------------------------------------------------------------------------------
>
>                 Key: YARN-9833
>                 URL: https://issues.apache.org/jira/browse/YARN-9833
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
>     this.writeLock.lock();
>     try {
>       localDirs.clear();
>       errorDirs.clear();
>       fullDirs.clear();
>       ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>     this.readLock.lock();
>     try {
>       return Collections.unmodifiableList(localDirs);
>     } finally {
>       this.readLock.unlock();
>     }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
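
To illustrate the view-versus-copy difference described above, here is a minimal sketch. It is not the actual Hadoop code: {{DirsSketch}}, its method names and the directory paths are hypothetical, the interleaving is simulated sequentially in one thread instead of with a real timer task, and the JDK's {{List.copyOf()}} (Java 10+) stands in for Guava's {{ImmutableList.copyOf()}} so that the example has no external dependencies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DirectoryCollection; not the actual Hadoop class.
class DirsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> localDirs = new ArrayList<>();

  DirsSketch(List<String> dirs) {
    localDirs.addAll(dirs);
  }

  // Mirrors the getGoodDirs() quoted above: returns a live view of localDirs.
  List<String> getGoodDirsView() {
    lock.readLock().lock();
    try {
      return Collections.unmodifiableList(localDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Snapshot variant: the copy is taken while the read lock is held.
  // List.copyOf is a JDK substitute for Guava's ImmutableList.copyOf.
  List<String> getGoodDirsCopy() {
    lock.readLock().lock();
    try {
      return List.copyOf(localDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Simulates the first step of checkDirs(): clear under the write lock.
  void clearLikeCheckDirs() {
    lock.writeLock().lock();
    try {
      localDirs.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    // The paths are made up for illustration only.
    DirsSketch dirs = new DirsSketch(List.of("/data1/nm-local", "/data2/nm-local"));
    List<String> view = dirs.getGoodDirsView();
    List<String> copy = dirs.getGoodDirsCopy();

    // Both read locks were released before this point, so the "monitor" may clear the list.
    dirs.clearLikeCheckDirs();

    System.out.println("view size: " + view.size()); // prints "view size: 0"
    System.out.println("copy size: " + copy.size()); // prints "copy size: 2"
  }
}
```

The read lock only protects the moment the reference is handed out. The view keeps reading the backing list after the lock is released, whereas the copy is decoupled from later {{clear()}} calls.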



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
