[
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jim Brennan updated YARN-10562:
-------------------------------
Attachment: (was: YARN-9833.001.patch)
> Alternate fix for DirectoryCollection.checkDirs() race
> ------------------------------------------------------
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
>
> In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and
> related methods were returning an unmodifiable view of the lists. These
> accesses were protected by read/write locks, but because the lists are
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under
> the writelock, were exposed when a caller started iterating the list view.
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so
> it is safe to iterate them even while they are being changed - at least the
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them
> from scratch every time, so if a caller called getGoodDirs() just before
> checkDirs cleared it, and then started iterating right after the clear, they
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to
> return a copy of the list, which definitely fixes the race condition. The
> disadvantage is that now we create a new copy of these lists every time we
> launch a container. The advantage using CopyOnWriteArrayList was that the
> lists should rarely ever change, and we can avoid all the copying.
> Unfortunately, the way checkDirs() was written, it guaranteed that it would
> modify those lists multiple times every time.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just
> rewrites checkDirs() to minimize the changes to the underlying lists. There
> are still some small windows where a disk will have been added to one list,
> but not yet removed from another if you hit it just right, but I think these
> should be pretty rare and relatively harmless, and in the vast majority of
> cases I suspect only one disk will be moving from one list to another at any
> time. The question is whether this type of inconsistency (which was always
> there before -YARN-9833- is worth reducing all the copying.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]