[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263493#comment-17263493
 ] 

Peter Bacsko commented on YARN-10562:
-------------------------------------

I was looking at the patch. I *think* I probably understand [~ebadger]'s 
reasoning about the code being racy.

So, we have a view, but even if we update the lists inside a lock, the client 
which checks the view might miss using a lock, plus elements are added 
one-by-one. A possible solution could be collecting the new elements in a new 
local list then call {{CopyOnWriteArrayList.addAll()}} but that also involves a 
full copy by using {{System.arrayCopy()}} and seems like we go back to square 
one (not only that, COWAL also uses locks interally).

Is my understanding correct? Anyway, usually if it's hard to reason about the 
correctness of a code in a multi-threaded environment (which, unfortunately, 
happens more often than not), then let's play safe, copy stuff in a critical 
section and forget about mutability. I doubt that it affects the performance 
noticeably, especially since a container launch means starting a new process, 
which is a far more heavyweight operation from the perspective of the OS.

> Alternate fix for DirectoryCollection.checkDirs() race
> ------------------------------------------------------
>
>                 Key: YARN-10562
>                 URL: https://issues.apache.org/jira/browse/YARN-10562
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time.   The question is whether this type of inconsistency (which was always 
> there before -YARN-9833- is worth reducing all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to