[
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263462#comment-17263462
]
Jim Brennan commented on YARN-10562:
------------------------------------
Thanks for the discussion and comment [~ebadger]! I agree that probably the
best approach is to remove the use of CopyOnWriteArrayList and stick with
simple ArrayLists. We can preserve the changes made in [YARN-9833] to return
copies of the lists and fix that issue with GetErrorDirs.
I don't think the cost of the copies, even for every launch, is really much of
a concern in the grand scope of things. My inclination is to make the
minimum changes for this - that is, not rewrite checkDirs() as I've done here -
it is not nearly as inefficient with ArrayLists as it was with
CopyOnWriteArrayLists.
I will put up another patch with these changes. We'll probably want to change
the Summary as well to indicate this is just a follow-on to [YARN-9833], not an
alternate solution. If you'd prefer I file a new Jira instead, let me know.
> Alternate fix for DirectoryCollection.checkDirs() race
> ------------------------------------------------------
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch,
> YARN-10562.003.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and
> related methods were returning an unmodifiable view of the lists. These
> accesses were protected by read/write locks, but because the lists are
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under
> the writelock, were exposed when a caller started iterating the list view.
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so
> it is safe to iterate them even while they are being changed - at least the
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them
> from scratch every time, so if a caller called getGoodDirs() just before
> checkDirs cleared it, and then started iterating right after the clear, they
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to
> return a copy of the list, which definitely fixes the race condition. The
> disadvantage is that now we create a new copy of these lists every time we
> launch a container. The advantage using CopyOnWriteArrayList was that the
> lists should rarely ever change, and we can avoid all the copying.
> Unfortunately, the way checkDirs() was written, it guaranteed that it would
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just
> rewrites checkDirs() to minimize the changes to the underlying lists. There
> are still some small windows where a disk will have been added to one list,
> but not yet removed from another if you hit it just right, but I think these
> should be pretty rare and relatively harmless, and in the vast majority of
> cases I suspect only one disk will be moving from one list to another at any
> time. The question is whether this type of inconsistency (which was always
> there before -YARN-9833- is worth reducing all the copying.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]