Jim Brennan created YARN-10562:
----------------------------------

             Summary: Alternate fix for DirectoryCollection.checkDirs() race
                 Key: YARN-10562
                 URL: https://issues.apache.org/jira/browse/YARN-10562
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
    Affects Versions: 3.4.0
            Reporter: Jim Brennan
            Assignee: Jim Brennan


In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and 
related methods were returning an unmodifiable view of the lists. These 
accesses were protected by read/write locks, but because the lists are 
CopyOnWriteArrayLists, subsequent changes to the list, even when done under the 
writelock, were exposed when a caller started iterating the list view. 
CopyOnWriteArrayLists cache the current underlying list in the iterator, so it 
is safe to iterate them even while they are being changed - at least the view 
will be consistent.

The problem was that checkDirs() was clearing the lists and rebuilding them 
from scratch every time, so if a caller called getGoodDirs() just before 
checkDirs cleared it, and then started iterating right after the clear, they 
could get an empty list.

The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
return a copy of the list, which definitely fixes the race condition. The 
disadvantage is that now we create a new copy of these lists every time we 
launch a container. The advantage using CopyOnWriteArrayList was that the lists 
should rarely ever change, and we can avoid all the copying. Unfortunately, the 
way checkDirs() was written, it guaranteed that it would modify those lists 
multiple times every time.

So this Jira proposes an alternate solution for YARN-9833, which mainly just 
rewrites checkDirs() to minimize the changes to the underlying lists. There are 
still some small windows where a disk will have been added to one list, but not 
yet removed from another if you hit it just right, but I think these should be 
pretty rare and relatively harmless, and in the vast majority of cases I 
suspect only one disk will be moving from one list to another at any time.   
The question is whether this type of inconsistency (which was always there 
before -YARN-9833- is worth reducing all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to