[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262976#comment-17262976
 ] 

Eric Badger commented on YARN-10562:
------------------------------------

Discussed this a little bit with [~Jim_Brennan] offline and here's the summary 
of my thoughts. There are a few problems with the current code.
1) There is an inherent race in the code
2) There is unnecessary and overlapping locking between the read/write lock and 
the CopyOnWriteArrayList

The only way we can reliably address 1) is to return a copy of all lists at 
once. Otherwise DirectoryCollection.checkDirs() can come along and change the 
overall status of the 3 dirs lists. If we put checkDirs in a critical section 
(like it is now with the write lock), then we can return all dirs at once while 
in the read lock and assure that all dirs are consistent with each other. If we 
get the dirs in separate calls that grab the lock, we could be inconsistent 
because checkDirs could be called in between the getDirs calls. I suppose the 
other way to fix the locking is to do fine-grained locking within the caller 
code itself, but I think that is pretty bad practice by exposing internal 
locking to the caller.

For 2) we should either change CopyOnWriteArrayList to a regular ArrayList (as 
they had original planned to do in 
[YARN-5214|https://issues.apache.org/jira/browse/YARN-5214?focusedCommentId=15342587&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15342587]
 or remove the read/write lock. These locks serve more or less the same purpose 
and having both of them is unncecessary.

Since I think that locking is usually difficult, complex, and misunderstood by 
those who change it later, I think we should get rid of the 
CopyOnWriteArrayList and change it to a regular ArrayList and then make the 
changes that [~Jim_Brennan] has made here so that we aren't reconstructing each 
list from scratch each time we run checkDirs. The downside of this change is 
that every container launch will create a new copy each list and that is a 
performance regression. But I don't think it will be much of an issue. Would be 
happy to hear other opinions on this

> Alternate fix for DirectoryCollection.checkDirs() race
> ------------------------------------------------------
>
>                 Key: YARN-10562
>                 URL: https://issues.apache.org/jira/browse/YARN-10562
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time.   The question is whether this type of inconsistency (which was always 
> there before -YARN-9833- is worth reducing all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to