[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263696#comment-17263696 ] Jim Brennan commented on YARN-10562: Patch 004 replaces the {{CopyOnWriteArrayLists}} with {{ArrayLists}}. It also fixes {{getErroredDirs()}} to use {{ImmutableList.copyOf()}} instead of {{Collections.unmodifiableList()}}. One additional change I made was to change {{getFailedDirs()}} to use {{Collections.unmodifiableList()}} instead of {{ImmutableList.copyOf()}}. There is no need to make another copy in this case, because {{DirectoryCollection.concat()}} was already constructing a new list. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch, YARN-10562.004.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263601#comment-17263601 ] Eric Badger commented on YARN-10562: Yea the original problem (before YARN-9833) was that we were getting a view of the list instead of a copy. And those views could iterate the list at any time. The issue there was that checkDirs was going out and clearing those lists in a separate thread. So when the client iterated through the lists, it would periodically see an empty list if it iterated at just the right time. After YARN-9833 there is still a race, but it is a smaller and less nefarious one. The issue there is that we have 3 lists (localDirs, fullDirs, errorDirs). Those can really be thought of as a single list with different attributes for each dir. Because the sum of all of those lists should give you all disks on the node. YARN-9833 added code to return a copy of the list instead of a view. So we'll never have a list that is incomplete. But the race becomes the fact that you could potentially call getGoodDirs(), then have checkDirs run, then call getErrorDirs. If a dir transitioned from good -> error just after getGoodDirs was called, it would show up in both lists. But like you said, [~pbacsko], I think it makes sense to remove complexity of the code if it requires this type of discussion to understand exactly why the code works (or doesn't work). It makes the code harder to maintain and even harder to modify. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263493#comment-17263493 ] Peter Bacsko commented on YARN-10562: - I was looking at the patch. I *think* I probably understand [~ebadger]'s reasoning about the code being racy. So, we have a view, but even if we update the lists inside a lock, the client which checks the view might miss using a lock, plus elements are added one-by-one. A possible solution could be collecting the new elements in a new local list then call {{CopyOnWriteArrayList.addAll()}} but that also involves a full copy by using {{System.arrayCopy()}} and seems like we go back to square one (not only that, COWAL also uses locks interally). Is my understanding correct? Anyway, usually if it's hard to reason about the correctness of a code in a multi-threaded environment (which, unfortunately, happens more often than not), then let's play safe, copy stuff in a critical section and forget about mutability. I doubt that it affects the performance noticeably, especially since a container launch means starting a new process, which is a far more heavyweight operation from the perspective of the OS. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263462#comment-17263462 ] Jim Brennan commented on YARN-10562: Thanks for the discussion and comment [~ebadger]! I agree that probably the best approach is to remove the use of CopyOnWriteArrayList and stick with simple ArrayLists. We can preserve the changes made in [YARN-9833] to return copies of the lists and fix that issue with GetErrorDirs. I don't think the cost of the copies, even for every launch, is really much of a concern in the grand scope of things.My inclination is to make the minimum changes for this - that is, not rewrite checkDirs() as I've done here - it is not nearly as inefficient with ArrayLists as it was with CopyOnWriteArrayLists. I will put up another patch with these changes. We'll probably want to change the Summary as well to indicate this is just a follow-on to [YARN-9833], not an alternate solution. If you'd prefer I file a new Jira instead, let me know. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262976#comment-17262976 ] Eric Badger commented on YARN-10562: Discussed this a little bit with [~Jim_Brennan] offline and here's the summary of my thoughts. There are a few problems with the current code. 1) There is an inherent race in the code 2) There is unnecessary and overlapping locking between the read/write lock and the CopyOnWriteArrayList The only way we can reliably address 1) is to return a copy of all lists at once. Otherwise DirectoryCollection.checkDirs() can come along and change the overall status of the 3 dirs lists. If we put checkDirs in a critical section (like it is now with the write lock), then we can return all dirs at once while in the read lock and assure that all dirs are consistent with each other. If we get the dirs in separate calls that grab the lock, we could be inconsistent because checkDirs could be called in between the getDirs calls. I suppose the other way to fix the locking is to do fine-grained locking within the caller code itself, but I think that is pretty bad practice by exposing internal locking to the caller. For 2) we should either change CopyOnWriteArrayList to a regular ArrayList (as they had original planned to do in [YARN-5214|https://issues.apache.org/jira/browse/YARN-5214?focusedCommentId=15342587=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15342587] or remove the read/write lock. These locks serve more or less the same purpose and having both of them is unncecessary. Since I think that locking is usually difficult, complex, and misunderstood by those who change it later, I think we should get rid of the CopyOnWriteArrayList and change it to a regular ArrayList and then make the changes that [~Jim_Brennan] has made here so that we aren't reconstructing each list from scratch each time we run checkDirs. The downside of this change is that every container launch will create a new copy each list and that is a performance regression. But I don't think it will be much of an issue. Would be happy to hear other opinions on this > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261543#comment-17261543 ] Peter Bacsko commented on YARN-10562: - Thanks for the patch [~Jim_Brennan]. cc [~leftnoteasy] [~snemeth] this might be interesting / important for us, you guys might want to look at it. I'll try to find some free cycles to go through the changes. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261490#comment-17261490 ] Hadoop QA commented on YARN-10562: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 42s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 56s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 14s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 9s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 29s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 19s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 17s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 37s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 7s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 2s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 35s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green}{color} | {color:green} the
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261403#comment-17261403 ] Jim Brennan commented on YARN-10562: patch 003 fixes the new checkstyle issues. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260697#comment-17260697 ] Hadoop QA commented on YARN-10562: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 26s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 12s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 16s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 12s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 25s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 23s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 40s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 11s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 4s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 4s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 21s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/437/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 4 new + 31 unchanged - 0 fixed = 35 total (was 31) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 36s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | |
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260614#comment-17260614 ] Jim Brennan commented on YARN-10562: Submitted patch 002 to fix the checkstyle issues and add unit tests that [~ebadger] wrote when he was trying to verify this race condition. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260124#comment-17260124 ] Hadoop QA commented on YARN-10562: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 19s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 19s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 17s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 6s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green}{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 56s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 21s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 19s{color} | {color:green}{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 37s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 7s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 2s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 21s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/433/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 20 new + 13 unchanged - 0 fixed = 33 total (was 13) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 33s{color} |