[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263601#comment-17263601
 ] 

Eric Badger commented on YARN-10562:
------------------------------------

Yea the original problem (before YARN-9833) was that we were getting a view of 
the list instead of a copy. And those views could iterate the list at any time. 
The issue there was that checkDirs was going out and clearing those lists in a 
separate thread. So when the client iterated through the lists, it would 
periodically see an empty list if it iterated at just the right time. 

After YARN-9833 there is still a race, but it is a smaller and less nefarious 
one. The issue there is that we have 3 lists (localDirs, fullDirs, errorDirs). 
Those can really be thought of as a single list with different attributes for 
each dir. Because the sum of all of those lists should give you all disks on 
the node. YARN-9833 added code to return a copy of the list instead of a view. 
So we'll never have a list that is incomplete. But the race becomes the fact 
that you could potentially call getGoodDirs(), then have checkDirs run, then 
call getErrorDirs. If a dir transitioned from good -> error just after 
getGoodDirs was called, it would show up in both lists. 

But like you said, [~pbacsko], I think it makes sense to remove complexity of 
the code if it requires this type of discussion to understand exactly why the 
code works (or doesn't work). It makes the code harder to maintain and even 
harder to modify.

> Alternate fix for DirectoryCollection.checkDirs() race
> ------------------------------------------------------
>
>                 Key: YARN-10562
>                 URL: https://issues.apache.org/jira/browse/YARN-10562
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage using CopyOnWriteArrayList was that the 
> lists should rarely ever change, and we can avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it guaranteed that it would 
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time.   The question is whether this type of inconsistency (which was always 
> there before -YARN-9833- is worth reducing all the copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to