[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-4589:
--
Attachment: YARN-4589-branch-3.2.001.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589-branch-3.2.001.patch, YARN-4589.004.patch, 
> YARN-4589.005.patch, YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264306#comment-17264306
 ] 

Jim Brennan commented on YARN-4589:
---

Thanks [~epayne]!  I will put up a patch for branch-3.2.

 

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-4589:
--
Attachment: YARN-4589.005.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263700#comment-17263700
 ] 

Jim Brennan commented on YARN-4589:
---

patch 005 removes the extra file.

 

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.005.patch, 
> YARN-4589.2.patch, YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Updated] (YARN-10562) Follow up changes for YARN-9833

2021-01-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Summary: Follow up changes for YARN-9833  (was: Alternate fix for 
DirectoryCollection.checkDirs() race)

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.
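
As a rough illustration of the "minimize the changes" idea described above (an editor's sketch under assumed names, not the actual YARN-10562 patch or the real DirectoryCollection code), the update step computes the desired contents first and applies only the differences under the write lock, while the getter hands out a copy rather than a view:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: apply only the differences instead of clear()-and-rebuild,
// so the shared list is briefly adjusted rather than emptied on every check.
class DirListSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> localDirs = new ArrayList<>();

  void updateGoodDirs(Set<String> desired) {
    lock.writeLock().lock();
    try {
      localDirs.removeIf(dir -> !desired.contains(dir));  // drop dirs that went bad
      for (String dir : desired) {                        // add newly good dirs
        if (!localDirs.contains(dir)) {
          localDirs.add(dir);
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(localDirs);                  // a copy, never a live view
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    DirListSketch dirs = new DirListSketch();
    dirs.updateGoodDirs(new HashSet<>(Arrays.asList("/disk1", "/disk2")));
    dirs.updateGoodDirs(new HashSet<>(Arrays.asList("/disk2", "/disk3")));
    System.out.println(dirs.getGoodDirs());               // [/disk2, /disk3]
  }
}
{code}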






[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263696#comment-17263696
 ] 

Jim Brennan commented on YARN-10562:


Patch 004 replaces the {{CopyOnWriteArrayLists}} with {{ArrayLists}}.  It also 
fixes {{getErroredDirs()}} to use {{ImmutableList.copyOf()}} instead of 
{{Collections.unmodifiableList()}}.

One additional change I made was to change {{getFailedDirs()}} to use 
{{Collections.unmodifiableList()}} instead of {{ImmutableList.copyOf()}}.  
There is no need to make another copy in this case, because 
{{DirectoryCollection.concat()}} was already constructing a new list.
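
To make the three choices above concrete, here is an editor's simplified sketch (not the actual DirectoryCollection code; locking is omitted, and plain Guava {{ImmutableList}} is shown where Hadoop may use a shaded copy): plain {{ArrayList}} fields, a defensive copy for {{getErroredDirs()}}, and an unmodifiable wrapper over the fresh list that {{concat()}} builds for {{getFailedDirs()}}.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import com.google.common.collect.ImmutableList;

// Illustrative only: the accessor patterns discussed in the comment above.
class DirAccessorsSketch {
  // plain ArrayLists instead of CopyOnWriteArrayLists; in the real class these
  // accesses are guarded by read/write locks (omitted here for brevity)
  private final List<String> errorDirs = new ArrayList<>();
  private final List<String> fullDirs = new ArrayList<>();

  // return an independent copy so later changes to errorDirs stay invisible
  List<String> getErroredDirs() {
    return ImmutableList.copyOf(errorDirs);
  }

  // concat() already builds a brand-new list, so an unmodifiable wrapper is
  // enough and a second copy would be wasted work
  List<String> getFailedDirs() {
    return Collections.unmodifiableList(concat(errorDirs, fullDirs));
  }

  private static List<String> concat(List<String> a, List<String> b) {
    List<String> out = new ArrayList<>(a);
    out.addAll(b);
    return out;
  }
}
{code}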


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: YARN-10562.004.patch

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263691#comment-17263691
 ] 

Jim Brennan commented on YARN-4589:
---

I don't think I need to add a unit test for this, as it is only adding a log 
message.
I believe the other unit tests are unrelated, but I will double-check.
It looks like I included a stray file - I will remove it and put up another 
patch.
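
For readers unfamiliar with the change, the sketch below shows the general shape of such a diagnostic; every name in it ({{ContainerPhase}}, {{onTimeout}}) is invented for illustration, and the real patch touches the NodeManager container code rather than a standalone class.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: when a timeout fires while the container is still
// localizing, say so in the diagnostics instead of reporting a bare timeout.
class LocalizationTimeoutDiagnostics {
  private static final Logger LOG =
      LoggerFactory.getLogger(LocalizationTimeoutDiagnostics.class);

  enum ContainerPhase { LOCALIZING, RUNNING, DONE }

  static String onTimeout(String containerId, ContainerPhase phase, long elapsedMs) {
    String msg = "Container " + containerId + " timed out after " + elapsedMs + " ms";
    if (phase == ContainerPhase.LOCALIZING) {
      msg += " while still localizing resources";
    }
    LOG.warn(msg);
    return msg;  // appended to the container diagnostics so it reaches the AM/UI
  }
}
{code}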


> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263462#comment-17263462
 ] 

Jim Brennan commented on YARN-10562:


Thanks for the discussion and comment [~ebadger]!  I agree that probably the 
best approach is to remove the use of CopyOnWriteArrayList and stick with 
simple ArrayLists.  We can preserve the changes made in [YARN-9833] to return 
copies of the lists and fix that issue with getErroredDirs().

I don't think the cost of the copies, even for every launch, is really much of 
a concern in the grand scope of things.  My inclination is to make the 
minimum changes for this - that is, not rewrite checkDirs() as I've done here - 
it is not nearly as inefficient with ArrayLists as it was with 
CopyOnWriteArrayLists.

I will put up another patch with these changes.  We'll probably want to change 
the Summary as well to indicate this is just a follow-on to [YARN-9833], not an 
alternate solution.  If you'd prefer I file a new Jira instead, let me know.


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Commented] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262933#comment-17262933
 ] 

Jim Brennan commented on YARN-4589:
---

Forgot to attach the patch.  Doh!


> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Updated] (YARN-4589) Diagnostics for localization timeouts is lacking

2021-01-11 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-4589:
--
Attachment: YARN-4589.004.patch

> Diagnostics for localization timeouts is lacking
> 
>
> Key: YARN-4589
> URL: https://issues.apache.org/jira/browse/YARN-4589
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
>Priority: Major
> Attachments: YARN-4589.004.patch, YARN-4589.2.patch, 
> YARN-4589.3.patch, YARN-4589.patch
>
>
> When a container takes too long to localize, it manifests as a timeout, and 
> there's no indication that localization was the issue. We need diagnostics 
> for timeouts to indicate the container was still localizing when the timeout 
> occurred.






[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: YARN-10562.003.patch

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261403#comment-17261403
 ] 

Jim Brennan commented on YARN-10562:


patch 003 fixes the new checkstyle issues.


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260614#comment-17260614
 ] 

Jim Brennan commented on YARN-10562:


Submitted patch 002 to fix the checkstyle issues and add unit tests that 
[~ebadger] wrote when he was trying to verify this race condition.


> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-07 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: YARN-10562.002.patch

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch
>
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2021-01-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260089#comment-17260089
 ] 

Jim Brennan commented on YARN-9833:
---

[~ebadger], [~pbacsko], I filed a new Jira so I could put up the alternate 
solution [YARN-10562].

If we decide not to go with that approach, then I think we should file a 
follow-up Jira to fix the errorDirs issue and, while we are at it, remove the 
use of CopyOnWriteArrayList for these, as it is pretty wasteful to use it now.


> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>   this.readLock.lock();
>   try {
>     return Collections.unmodifiableList(localDirs);
>   } finally {
>     this.readLock.unlock();
>   }
> }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clear {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view; we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
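
To see the view-versus-copy distinction in isolation (a standalone editor's demo, not YARN code), note that the unmodifiable wrapper reflects the later {{clear()}}, while a copy does not:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Standalone demonstration of why returning a view is racy:
// the view tracks later mutations of the backing list, a copy does not.
public class ViewVsCopyDemo {
  public static void main(String[] args) {
    List<String> localDirs = new ArrayList<>(List.of("/disk1", "/disk2"));

    List<String> view = Collections.unmodifiableList(localDirs); // just a wrapper
    List<String> copy = new ArrayList<>(localDirs);              // independent snapshot

    localDirs.clear();  // what checkDirs() used to do before rebuilding the list

    System.out.println(view.isEmpty()); // true  - the "snapshot" is suddenly empty
    System.out.println(copy.isEmpty()); // false - the copy still holds both dirs
  }
}
{code}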






[jira] [Updated] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-06 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10562:
---
Attachment: (was: YARN-9833.001.patch)

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
> related methods were returning an unmodifiable view of the lists. These 
> accesses were protected by read/write locks, but because the lists are 
> CopyOnWriteArrayLists, subsequent changes to the list, even when done under 
> the writelock, were exposed when a caller started iterating the list view. 
> CopyOnWriteArrayLists cache the current underlying list in the iterator, so 
> it is safe to iterate them even while they are being changed - at least the 
> view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written guaranteed that it would 
> modify those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list, 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is worth it to reduce all the copying.






[jira] [Created] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-06 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10562:
--

 Summary: Alternate fix for DirectoryCollection.checkDirs() race
 Key: YARN-10562
 URL: https://issues.apache.org/jira/browse/YARN-10562
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
related methods were returning an unmodifiable view of the lists. These 
accesses were protected by read/write locks, but because the lists are 
CopyOnWriteArrayLists, subsequent changes to the list, even when done under the 
writelock, were exposed when a caller started iterating the list view. 
CopyOnWriteArrayLists cache the current underlying list in the iterator, so it 
is safe to iterate them even while they are being changed - at least the view 
will be consistent.
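
That snapshot behaviour can be seen with a few lines of standalone Java (an editor's demo, not YARN code): the iterator keeps the backing array it was created with, so clearing the list mid-iteration neither throws nor changes what the iterator returns.

{code:java}
import java.util.Iterator;
import java.util.concurrent.CopyOnWriteArrayList;

// Demonstrates the snapshot property of CopyOnWriteArrayList iterators.
public class CowSnapshotDemo {
  public static void main(String[] args) {
    CopyOnWriteArrayList<String> dirs =
        new CopyOnWriteArrayList<>(new String[] {"/disk1", "/disk2"});

    Iterator<String> it = dirs.iterator(); // captures the current backing array
    dirs.clear();                          // concurrent modification is allowed

    while (it.hasNext()) {
      System.out.println(it.next());       // still prints /disk1 and /disk2
    }
    System.out.println(dirs.size());       // 0
  }
}
{code}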

The problem was that checkDirs() was clearing the lists and rebuilding them 
from scratch every time, so if a caller called getGoodDirs() just before 
checkDirs cleared it, and then started iterating right after the clear, they 
could get an empty list.

The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
return a copy of the list, which definitely fixes the race condition. The 
disadvantage is that now we create a new copy of these lists every time we 
launch a container. The advantage of using CopyOnWriteArrayList was that the lists 
should rarely ever change, so we could avoid all the copying. Unfortunately, the 
way checkDirs() was written guaranteed that it would modify those lists 
multiple times on every run.

So this Jira proposes an alternate solution for YARN-9833, which mainly just 
rewrites checkDirs() to minimize the changes to the underlying lists. There are 
still some small windows where a disk will have been added to one list, but not 
yet removed from another if you hit it just right, but I think these should be 
pretty rare and relatively harmless, and in the vast majority of cases I 
suspect only one disk will be moving from one list to another at any time. 
The question is whether this type of inconsistency (which was always there 
before YARN-9833) is worth it to reduce all the copying.






[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-23 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254133#comment-17254133
 ] 

Jim Brennan commented on YARN-10540:


Thanks [~sunilg]!

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Screenshot 2020-12-23 at 
> 8.24.42 PM.png, YARN-10540.001.patch, Yarn-UI-Ubuntu.png, osx-yarn-ui2.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are shown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}






[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-22 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253560#comment-17253560
 ] 

Jim Brennan commented on YARN-10540:


Thanks [~ebadger]!  And thanks [~hexiaoqiao],  [~ayushtkn] and [~sunilg] for 
finding and investigating this bug.  Much appreciated.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are shown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}






[jira] [Created] (YARN-10542) Node Utilization on UI is misleading if nodes don't report utilization

2020-12-21 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10542:
--

 Summary: Node Utilization on UI is misleading if nodes don't 
report utilization
 Key: YARN-10542
 URL: https://issues.apache.org/jira/browse/YARN-10542
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan
Assignee: Jim Brennan


As reported in YARN-10540, if the ResourceCalculatorPlugin fails to initialize, 
the nodes will report no utilization.  This makes the RM UI misleading, because 
it presents cluster-wide and per-node utilization as 0 instead of indicating 
that it is not being tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253117#comment-17253117
 ] 

Jim Brennan commented on YARN-10540:


I filed YARN-10542 as a follow-up.

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are shown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}






[jira] [Assigned] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-10540:
--

Assignee: Jim Brennan

I have attached a patch that initializes nodeUtilization to a 
ResourceUtilization of all zeros instead of null.  This fixes the NPE.   In 
cases where ResourceCalculatorPlugin fails to initialize (like on Mac), the UI 
will be misleading, always showing 0 utilization.  I will file a 
follow-up Jira to make this better.

 

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are shown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}






[jira] [Updated] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10540:
---
Attachment: YARN-10540.001.patch

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes to the NodeInfo class.
> Various exceptions are shown while accessing the UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}






[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253090#comment-17253090
 ] 

Jim Brennan commented on YARN-10540:


I have manually reproduced this in trunk on a VM by setting the 
ResourceCalculatorPlugin to null.

I am testing out this fix and will put up a patch shortly.  This would fix the 
NPE, but the UI would report zeros in this case.  My proposal would be to 
follow up with another Jira to improve the UI presentation for that case.

 

 

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253022#comment-17253022
 ] 

Jim Brennan commented on YARN-10540:


A simpler fix might be to initialize {{nodeUtilization}} in 
NodeResourceMonitorImpl to {{ResourceUtilization.newInstance(0, 0, 0f)}}.
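A minimal sketch of that idea (class and field names below are illustrative, 
not the actual NodeResourceMonitorImpl code):
{code:java}
import org.apache.hadoop.yarn.api.records.ResourceUtilization;

// Hypothetical holder illustrating the suggested default: start with an
// all-zero utilization instead of null, so readers never observe a null
// value before the first monitoring sample arrives.
public class NodeUtilizationHolder {
  private volatile ResourceUtilization nodeUtilization =
      ResourceUtilization.newInstance(0, 0, 0f);

  public ResourceUtilization getUtilization() {
    return nodeUtilization;
  }

  public void setUtilization(ResourceUtilization sample) {
    nodeUtilization = sample;
  }
}
{code}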

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253018#comment-17253018
 ] 

Jim Brennan commented on YARN-10540:


Just to clarify: are we seeing this only on Mac in branch-3.2.2?
Or are we seeing it on Mac on all branches?
Or on branch-3.2.2 on all platforms?

The ResourceCalculatorPlugin issue on Mac would explain it failing on Mac for 
all branches.  If it's failing on Mac only in branch-3.2.2, that is 
interesting.




> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253003#comment-17253003
 ] 

Jim Brennan commented on YARN-10540:


I'm trying to figure out why we might get an NPE in this case.  One option to 
avoid the NPE might be to change SchedulerNode.setNodeUtilization() to check 
for null and either set it to an empty utilization (which is what we initialize 
it to) or just ignore the update.
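A hedged sketch of that null-check option (not the actual SchedulerNode code; 
names are illustrative):
{code:java}
import org.apache.hadoop.yarn.api.records.ResourceUtilization;

// Illustrative guard: a null update falls back to an empty utilization,
// matching the initial value, so later readers never see null.
public class NullSafeUtilization {
  private ResourceUtilization nodeUtilization =
      ResourceUtilization.newInstance(0, 0, 0f);

  public synchronized void setNodeUtilization(ResourceUtilization utilization) {
    this.nodeUtilization = (utilization != null)
        ? utilization
        : ResourceUtilization.newInstance(0, 0, 0f);
  }

  public synchronized ResourceUtilization getNodeUtilization() {
    return nodeUtilization;
  }
}
{code}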


> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Priority: Critical
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, Yarn-UI-Ubuntu.png, 
> yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.<init>(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249316#comment-17249316
 ] 

Jim Brennan commented on YARN-9833:
---

{quote}
My worry with this is that code changes in the future will incorrectly use 
getGoodDirs or the other methods that expose the private lists from within 
DirectoryCollection.
{quote}
Isn't this how it has been for years?  It was returning an unmodifiableList 
view of the underlying List, so that limits what the caller can do.  
getGoodDirs() and the others just return a read-only List.  They don't have to 
know about the internals.
If we are going to change these to return a copy of the list, we may want to 
reconsider the use of CopyOnWriteArrayList - I'm not sure it is buying us 
anything.


> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249043#comment-17249043
 ] 

Jim Brennan commented on YARN-9833:
---

Thinking about it more over the weekend, I suspect the reason 
CopyOnWriteArrayList was used was more for performance than to allow someone to 
hang onto the reference for a long time.  Ideally this is a list that doesn't 
change very often, so handing out a view of a copy-on-write array is cheaper 
than making a copy every time we launch a container.  Unfortunately, 
{{checkDirs()}} as written seems to ruin any advantage we've gained by mutating 
the lists every time it runs (and multiple times at that, by first clearing and 
then adding each entry individually).  This is also where the race comes in.

My suggestion would be to fix the {{checkDirs()}} implementation to operate on 
local copies of these lists, and then update the shared lists with a single 
assignment only if they have changed.
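A hedged sketch of that "compute locally, publish once" approach (the real 
DirectoryCollection is more involved; this only shows the shape of the idea):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: the good-dirs list is recomputed into a local collection
// and the shared reference is replaced with a single assignment, so readers
// see either the old complete list or the new one, never a cleared list.
public class DirCheckSketch {
  private volatile List<String> localDirs = Collections.emptyList();

  void checkDirs(List<String> allDirs) {
    List<String> newGood = new ArrayList<>();
    for (String dir : allDirs) {
      if (isHealthy(dir)) {        // stand-in for the real disk checks
        newGood.add(dir);
      }
    }
    if (!newGood.equals(localDirs)) {
      // Single reference assignment: no intermediate empty state.
      localDirs = Collections.unmodifiableList(newGood);
    }
  }

  List<String> getGoodDirs() {
    return localDirs;              // already unmodifiable
  }

  private boolean isHealthy(String dir) {
    return dir != null && !dir.isEmpty();  // stand-in health test
  }
}
{code}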

 

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)

2020-12-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248173#comment-17248173
 ] 

Jim Brennan commented on YARN-10494:


[~ccondit], [~ebadger] I am OK with including this for now. 

> CLI tool for docker-to-squashfs conversion (pure Java)
> --
>
> Key: YARN-10494
> URL: https://issues.apache.org/jira/browse/YARN-10494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10494.001.patch, 
> docker-to-squashfs-conversion-tool-design.pdf
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on 
> python2, multiple libraries, squashfs-tools and root access in order to 
> convert Docker images to squashfs images for use with the runc container 
> runtime in YARN.
> *YARN-9943* was created to investigate alternatives, as the response to 
> merging YARN-9564 has not been very positive. This proposal outlines the 
> design for a CLI conversion tool in 100% pure Java that will work out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247304#comment-17247304
 ] 

Jim Brennan commented on YARN-9833:
---

Did you consider that changing from a view to a copy changes the behavior for 
clients?  Previously, one could get a view of the local dirs once and then use 
it over a long period of time, and it would continue to be updated.  That seems 
to be what the original intent of the implementation was, but I can't find any 
code in Hadoop that relies on it, although I have not done a thorough search.  
Everything I could find just uses one of the get* functions to get the list and 
then iterates over it immediately.
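For anyone following along, a small self-contained illustration of the 
view-vs-copy distinction (plain Java, not YARN code):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A view created by Collections.unmodifiableList() tracks later mutations of
// the backing list, while a copy (List.copyOf here, Java 10+; Guava's
// ImmutableList.copyOf behaves the same way) is a stable snapshot.
public class ViewVsCopy {
  public static void main(String[] args) {
    List<String> dirs = new ArrayList<>(List.of("/grid/0", "/grid/1"));

    List<String> view = Collections.unmodifiableList(dirs);
    List<String> copy = List.copyOf(dirs);

    dirs.clear();  // simulates checkDirs() clearing the backing list

    System.out.println(view.size());  // 0 -- the view went empty as well
    System.out.println(copy.size());  // 2 -- the snapshot is unaffected
  }
}
{code}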

 

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty 
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)

2020-12-03 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243242#comment-17243242
 ] 

Jim Brennan commented on YARN-10494:


I'm not sure that a PR is better than a patch for something this big.  One 
initial comment, though: there is a lot of code here that is essentially 
squashfs-tools for Java.  I wonder if that portion should be a separate 
open-source project?

 

 

> CLI tool for docker-to-squashfs conversion (pure Java)
> --
>
> Key: YARN-10494
> URL: https://issues.apache.org/jira/browse/YARN-10494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10494.001.patch, 
> docker-to-squashfs-conversion-tool-design.pdf
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on 
> python2, multiple libraries, squashfs-tools and root access in order to 
> convert Docker images to squashfs images for use with the runc container 
> runtime in YARN.
> *YARN-9943* was created to investigate alternatives, as the response to 
> merging YARN-9564 has not been very positive. This proposal outlines the 
> design for a CLI conversion tool in 100% pure Java that will work out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2020-11-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233767#comment-17233767
 ] 

Jim Brennan commented on YARN-8558:
---

I have committed this to branch-2.10.

 

> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Critical
> Fix For: 3.2.0, 3.1.1, 3.0.4
>
> Attachments: YARN-8558-branch-2.10.001.patch, 
> YARN-8558-branch-3.0.002.patch, YARN-8558-branch-3.0.003.patch, 
> YARN-8558.001.patch, YARN-8558.002.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_75 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> 

[jira] [Reopened] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2020-11-17 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reopened YARN-8558:
---

Re-opening so I can put up a patch for branch-2.10.

> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Critical
> Fix For: 3.2.0, 3.1.1, 3.0.4
>
> Attachments: YARN-8558-branch-3.0.002.patch, 
> YARN-8558-branch-3.0.003.patch, YARN-8558.001.patch, YARN-8558.002.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_75 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container 

[jira] [Resolved] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10485.

Fix Version/s: 3.2.3
   3.4.1
   3.1.5
   3.3.1
   Resolution: Fixed

Thanks for the contribution [~ahussein] and [~daryn]!
I have committed this to trunk, branch-3.3, branch-3.2, and branch-3.1.

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232803#comment-17232803
 ] 

Jim Brennan commented on YARN-10485:


Apologies, I marked this resolved by accident.  I got my tabs mixed up and was 
intending to resolve HADOOP-17362.

I am hoping to resolve this one later today.

 

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10485:
---
Fix Version/s: (was: 3.2.3)
   (was: 3.4.1)
   (was: 3.1.5)
   (was: 3.3.1)

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reopened YARN-10485:


> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-13 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10485:
---
Fix Version/s: 3.2.3
   3.4.1
   3.1.5
   3.3.1

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-13 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10485.

Resolution: Fixed

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2020-11-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231778#comment-17231778
 ] 

Jim Brennan commented on YARN-8558:
---

Any objection to pulling this back to branch-2.10?  It looks like 
{{remove_container()}} is missing the following key suffixes (see the sketch 
below):
{noformat}
CONTAINER_START_TIME_KEY_SUFFIX
CONTAINER_VERSION_KEY_SUFFIX
CONTAINER_REMAIN_RETRIES_KEY_SUFFIX
CONTAINER_WORK_DIR_KEY_SUFFIX
CONTAINER_LOG_DIR_KEY_SUFFIX
{noformat}
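A hedged sketch of what cleaning those up could look like; the key layout, 
suffix string values, and method shape below are assumptions for illustration, 
not the actual NMLeveldbStateStoreService code:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.WriteBatch;

// Illustrative cleanup: delete the extra per-container suffixes (stand-in
// string values for the constants listed above) in one leveldb write batch.
public class ContainerKeyCleanupSketch {
  private static final String[] EXTRA_SUFFIXES = {
      "/starttime", "/version", "/remainingRetryAttempts",
      "/workdir", "/logdir"};

  static void removeContainerKeys(DB db, String containerKeyPrefix)
      throws IOException {
    try (WriteBatch batch = db.createWriteBatch()) {
      for (String suffix : EXTRA_SUFFIXES) {
        batch.delete(
            (containerKeyPrefix + suffix).getBytes(StandardCharsets.UTF_8));
      }
      db.write(batch);
    }
  }
}
{code}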


> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Critical
> Fix For: 3.2.0, 3.1.1, 3.0.4
>
> Attachments: YARN-8558-branch-3.0.002.patch, 
> YARN-8558-branch-3.0.003.patch, YARN-8558.001.patch, YARN-8558.002.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> 

[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227069#comment-17227069
 ] 

Jim Brennan commented on YARN-10479:


Thanks [~epayne]!

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-05 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226770#comment-17226770
 ] 

Jim Brennan commented on YARN-10479:


All of the failed unit tests also fail in trunk, due to the change made in 
[HADOOP-17306].
[~epayne] can you please review this?


> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226303#comment-17226303
 ] 

Jim Brennan commented on YARN-10479:


I believe most of the YARN failures are unrelated to this change.  They fail 
for me with or without this change.
It looks to me like most of them were caused by [HADOOP-17306].
When I reverted [HADOOP-17306], most of these failures went away.


> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226271#comment-17226271
 ] 

Jim Brennan commented on YARN-10479:


patch 003 fixes the checkstyle issues.


> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-04 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10479:
---
Attachment: YARN-10479.003.patch

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch, 
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-03 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225682#comment-17225682
 ] 

Jim Brennan commented on YARN-10479:


I added a test case in patch 002.

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-03 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10479:
---
Attachment: YARN-10479.002.patch

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-03 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10479:
---
Attachment: YARN-10479.001.patch

> RMProxy should retry on SocketTimeout Exceptions
> 
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10479.001.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers 
> failed to come back into service because they hit a socket timeout when 
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the 
> RMProxy will retry.  Based on this incident, it seems like it should be.  We 
> made this change internally about a year ago and it has been running in 
> production since.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-11-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224921#comment-17224921
 ] 

Jim Brennan commented on YARN-10475:


Thanks [~epayne]!

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10479:
--

 Summary: RMProxy should retry on SocketTimeout Exceptions
 Key: YARN-10479
 URL: https://issues.apache.org/jira/browse/YARN-10479
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan


During an incident involving a DNS outage, a large number of nodemanagers 
failed to come back into service because they hit a socket timeout when trying 
to re-register with the RM.

SocketTimeoutException is not currently one of the exceptions that the RMProxy 
will retry.  Based on this incident, it seems like it should be.  We made this 
change internally about a year ago and it has been running in production since.
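
For illustration, the change amounts to treating {{java.net.SocketTimeoutException}} the same way other transient 
connection errors are treated when building the retry policy used for the NM-to-RM proxy.  A minimal sketch, assuming 
a free-standing helper (the method name and the choice of base policy below are assumptions, not the actual RMProxy code):
{code:java}
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public final class RetryPolicySketch {
  /** Build a policy that retries connection-type failures, now including read timeouts. */
  public static RetryPolicy create(long maxWaitMs, long retryIntervalMs) {
    RetryPolicy basePolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicy = new HashMap<>();
    exceptionToPolicy.put(ConnectException.class, basePolicy);
    exceptionToPolicy.put(NoRouteToHostException.class, basePolicy);
    // The addition discussed here: treat a socket read timeout as retryable too.
    exceptionToPolicy.put(SocketTimeoutException.class, basePolicy);
    // Anything not listed above fails immediately.
    return RetryPolicies.retryByException(RetryPolicies.TRY_ONCE_THEN_FAIL,
        exceptionToPolicy);
  }
}
{code}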




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-11-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224717#comment-17224717
 ] 

Jim Brennan commented on YARN-10475:


I have filed [YARN-10478] for making this pluggable.

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10478) Make RM-NM heartbeat scaling calculator pluggable

2020-11-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10478:
--

 Summary: Make RM-NM heartbeat scaling calculator pluggable
 Key: YARN-10478
 URL: https://issues.apache.org/jira/browse/YARN-10478
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan


[YARN-10475] adds a feature to enable scaling the interval for heartbeats 
between the RM and NM based on CPU utilization.  [~bibinchundatt] suggested 
that we make this pluggable so that other calculations can be used if desired.

The configuration properties added in [YARN-10475] should be applicable to any 
heartbeat calculator.
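
As a rough sketch of what "pluggable" could look like (the interface, method, and configuration key below are 
assumptions for illustration, not a committed API):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

/** Hypothetical plugin point for the heartbeat interval calculation. */
interface HeartbeatIntervalCalculator {
  long getNextIntervalMs(float nodeCpuUtilization, float clusterAvgCpuUtilization);
}

final class HeartbeatCalculatorLoader {
  // Hypothetical key; the real one would be defined by this Jira.
  static final String CALCULATOR_CLASS_KEY =
      "yarn.resourcemanager.nodemanagers.heartbeat-interval-calculator.class";

  static HeartbeatIntervalCalculator load(Configuration conf,
      Class<? extends HeartbeatIntervalCalculator> defaultImpl) {
    Class<? extends HeartbeatIntervalCalculator> clazz = conf.getClass(
        CALCULATOR_CLASS_KEY, defaultImpl, HeartbeatIntervalCalculator.class);
    // Instantiate via reflection so alternate calculations can be dropped in.
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}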



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10475:
---
Attachment: YARN-10475-branch-3.2.003.patch

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223733#comment-17223733
 ] 

Jim Brennan commented on YARN-10475:


[~epayne], I have put up patches for branch-3.3 and branch-3.2 as well.

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10475:
---
Attachment: YARN-10475-branch-3.3.003.patch

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475-branch-3.2.003.patch, 
> YARN-10475-branch-3.3.003.patch, YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-30 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223668#comment-17223668
 ] 

Jim Brennan commented on YARN-10475:


Thanks for the suggestion [~bibinchundatt]!  I think a plugin for calculating 
the heartbeat interval is definitely possible.  I think the configs as specified 
could remain for enabling scaling and setting up the parameters - there is 
nothing specific about cpu utilization in those properties.  Would you be ok 
with a follow-up Jira to move the calculation into a plugin?  Do you have any 
suggestions for alternate calculations?


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-30 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223656#comment-17223656
 ] 

Jim Brennan commented on YARN-10471:


[~epayne] I agree we don't need to go to branch-3.1 or branch-2.10.
Thanks for the contribution!


> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Fix For: 3.2.2, 3.3.1, 3.4.1, 3.2.3
>
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch, YARN.10471.branch-3.2.005.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-29 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223217#comment-17223217
 ] 

Jim Brennan commented on YARN-10475:


Thanks [~epayne]!  I put up patch 003, which adds documentation to 
Nodemanager.md and also fixes a minor typo in yarn-default.xml.


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-29 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10475:
---
Attachment: YARN-10475.003.patch

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch, 
> YARN-10475.003.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-29 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223195#comment-17223195
 ] 

Jim Brennan commented on YARN-10471:


Thanks [~epayne]!  I have committed this to trunk, branch-3.3, and branch-3.2.
The branch-3.2 patch does not apply to branch-3.1 or branch-2.10.
Please provide patches for those branches if you want this committed further 
back.



> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Fix For: 3.3.1, 3.4.1, 3.2.3
>
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch, YARN.10471.branch-3.2.005.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-29 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10471:
---
Fix Version/s: 3.2.3
   3.4.1
   3.3.1

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Fix For: 3.3.1, 3.4.1, 3.2.3
>
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.005.patch, 
> YARN.10471.branch-3.2.003.patch, YARN.10471.branch-3.2.005.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy

2020-10-28 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10477.

Resolution: Invalid

Closing this as invalid.  The problem was only there in our internal version of 
container-executor.  I should have checked the code in trunk before filing.


> runc launch failure should not cause nodemanager to go unhealthy
> 
>
> Key: YARN-10477
> URL: https://issues.apache.org/jira/browse/YARN-10477
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> We have observed some failures when launching containers with runc.  We have 
> not yet identified the root cause of those failures, but a side-effect of 
> these failures was that the Nodemanager marked itself unhealthy.  Since these are 
> rare failures that only affect a single launch, they should not cause the 
> Nodemanager to be marked unhealthy.
> Here is an example RM log:
> {noformat}
> resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event 
> dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with 
> details: Linux Container Executor reached unrecoverable exception
> {noformat}
> And here is an example of the NM log:
> {noformat}
> 2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO 
> runtime.RuncContainerRuntime: Launch container failed for 
> container_e25_1601602719874_10691_01_001723
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=24: OCI command has bad/missing local directories
> {noformat}
> The problem is that the runc code in container-executor is re-using exit code 
> 24 (INVALID_CONFIG_FILE) which is intended for problems with the 
> container-executor.cfg file, and those failures are fatal for the NM.  We 
> should use a different exit code for these.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222409#comment-17222409
 ] 

Jim Brennan commented on YARN-10471:


Thanks [~epayne]! It looks like there is a problem with the last line of 
Nodemanager.md.  The line appears to be split in two.


> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.004.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy

2020-10-28 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10477:
--

 Summary: runc launch failure should not cause nodemanager to go 
unhealthy
 Key: YARN-10477
 URL: https://issues.apache.org/jira/browse/YARN-10477
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan


We have observed some failures when launching containers with runc.  We have 
not yet identified the root cause of those failures, but a side-effect of these 
failures was that the Nodemanager marked itself unhealthy.  Since these are rare 
failures that only affect a single launch, they should not cause the 
Nodemanager to be marked unhealthy.

Here is an example RM log:
{noformat}
resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event 
dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with 
details: Linux Container Executor reached unrecoverable exception
{noformat}
And here is an example of the NM log:
{noformat}
2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO 
runtime.RuncContainerRuntime: Launch container failed for 
container_e25_1601602719874_10691_01_001723
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
 ExitCodeException exitCode=24: OCI command has bad/missing local directories
{noformat}

The problem is that the runc code in container-executor is re-using exit code 
24 (INVALID_CONFIG_FILE) which is intended for problems with the 
container-executor.cfg file, and those failures are fatal for the NM.  We 
should use a different exit code for these.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1794#comment-1794
 ] 

Jim Brennan commented on YARN-10475:


I put up patch 002 to address checkstyle/javac issues.

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-28 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10475:
---
Attachment: YARN-10475.002.patch

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch, YARN-10475.002.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1735#comment-1735
 ] 

Jim Brennan commented on YARN-10467:


Thanks for reporting this and for the solution [~haibochen]!    Everything 
looks good to me.

I hesitate to mention one minor nit, a typo in this comment:
{quote}// there might be some completed containers that *are have* not been 
pulled
{quote}
It's up to you whether you want to fix this.

[~jhung] were you planning to commit this?

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -
>
> Key: YARN-10467
> URL: https://issues.apache.org/jira/browse/YARN-10467
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>Reporter: Haibo Chen
>Assignee: Haibo Chen
>Priority: Major
> Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analysis, we found that the majority of the heap is 
> occupied by {{RMNodeImpl.completedContainers}}, which 
> accounts for 19GB, out of 24.3 GB.  There are over 86 million 
> ContainerIdPBImpl objects, in contrast, only 161,601 RMContainerImpl objects 
> which represent the # of active containers that RM is still tracking.  
> Inspecting some ContainerIdPBImpl objects, they belong to applications that 
> have long finished. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by a NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps.
>  * Step 1:  the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: During the AM-RM heartbeat, the container is removed 
> from RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to  
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won’t be any future AM-RM heartbeat to perform aforementioned step 2. Hence, 
> these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all its 
> tasks without waiting for notification of the completion of every container, 
> or the AM may just die suddenly (e.g. OOM).  Spark and other frameworks may 
> be similar.
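
As a toy model of the cleanup protocol described above (the class and method names below are simplifications for 
illustration, not RM source):
{code:java}
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Toy model only: per-node set of completed containers awaiting AM acknowledgement. */
class CompletedContainerTracker {
  private final Set<String> completedContainers = new HashSet<>();

  /** NM reports a completed container; it is held until the AM has been told about it. */
  void onContainerCompleted(String containerId) {
    completedContainers.add(containerId);
  }

  /** Step 2 above: containers already sent to the AM can be dropped from the set. */
  void onSentToAM(Collection<String> finishedContainersSentToAM) {
    completedContainers.removeAll(finishedContainersSentToAM);
  }

  // If the AM exits before step 2 runs for some containers, nothing ever removes
  // them -- which is the leak described in this issue.
}
{code}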



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10471) Prevent logs for any container from becoming larger than a configurable size.

2020-10-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1723#comment-1723
 ] 

Jim Brennan commented on YARN-10471:


Thanks for putting this up [~epayne]!   I am +1 on patches for trunk and 
branch-3.2.

I will wait until tomorrow to commit this, to give others a chance to chime in 
if desired.

 

> Prevent logs for any container from becoming larger than a configurable size.
> -
>
> Key: YARN-10471
> URL: https://issues.apache.org/jira/browse/YARN-10471
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.4
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: YARN.10471.001.patch, YARN.10471.002.patch, 
> YARN.10471.003.patch, YARN.10471.branch-3.2.003.patch
>
>
> Configure a cluster such that a task attempt will be killed if any container 
> log exceeds a configured size. This would help prevent logs from filling 
> disks and also prevent the need to aggregate enormous logs.
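
For context, the mechanism boils down to periodically summing the size of a container's log directory and flagging 
the container once a byte limit is exceeded.  A minimal sketch under assumed names (not the implementation or its 
configuration keys):
{code:java}
import java.io.File;

/** Illustrative only: the size check a periodic NM-side monitor could perform. */
final class ContainerLogSizeCheck {
  /** Recursively sum the bytes under a container's log directory. */
  static long totalBytes(File dir) {
    long sum = 0;
    File[] entries = dir.listFiles();
    if (entries != null) {
      for (File entry : entries) {
        sum += entry.isDirectory() ? totalBytes(entry) : entry.length();
      }
    }
    return sum;
  }

  /** The caller would kill the task attempt's container when this returns true. */
  static boolean exceedsLimit(File containerLogDir, long limitBytes) {
    return totalBytes(containerLogDir) > limitBytes;
  }
}
{code}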



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-27 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10475:
---
Attachment: YARN-10475.001.patch

> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10475.001.patch
>
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-27 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221774#comment-17221774
 ] 

Jim Brennan commented on YARN-10475:


This adds the following {{yarn.resourcemanager.nodemanagers}} configuration 
properties:

{{heartbeat-interval-scaling-enable}}
 * enables heartbeat interval scaling, defaults to false

{{heartbeat-interval-min-ms}}
 * If heart-beat interval scaling is enabled, this is the minimum heart-beat 
interval in milliseconds.

{{heartbeat-interval-max-ms}}
 * If heart-beat interval scaling is enabled, this is the maximum heart-beat 
interval in milliseconds.

{{heartbeat-interval-speedup-factor}}
 * This controls the degree of adjustment when speeding up heartbeat intervals. 
At 1.0, 20% less than average CPU utilization will result in a 20% decrease 
in heartbeat interval.

{{heartbeat-interval-slowdown-factor}}
 * This controls the degree of adjustment when slowing down heartbeat intervals. 
At 1.0, 20% greater than average CPU utilization will result in a 20% increase 
in heartbeat interval.
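
To make the factors concrete, here is a rough sketch of the kind of calculation these properties drive (illustrative 
only; the method and parameter names are not the patch code):
{code:java}
/** Illustrative only: scale the heartbeat interval by the node's deviation from the cluster average. */
final class HeartbeatIntervalSketch {
  static long scaledIntervalMs(long defaultMs, long minMs, long maxMs,
      float speedupFactor, float slowdownFactor,
      float nodeCpuUtil, float clusterAvgCpuUtil) {
    if (clusterAvgCpuUtil <= 0) {
      return defaultMs;
    }
    // Relative deviation of this node from the cluster-wide average CPU utilization.
    float deviation = (nodeCpuUtil - clusterAvgCpuUtil) / clusterAvgCpuUtil;
    // Over-utilized nodes slow down; under-utilized nodes speed up.
    float factor = deviation > 0
        ? 1.0f + deviation * slowdownFactor
        : 1.0f + deviation * speedupFactor;
    long interval = (long) (defaultMs * factor);
    return Math.max(minMs, Math.min(maxMs, interval));
  }
}
{code}
With both factors at 1.0, a node 20% below the cluster average gets a 20% shorter interval, clamped to the 
configured min/max.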


> Scale RM-NM heartbeat interval based on node utilization
> 
>
> Key: YARN-10475
> URL: https://issues.apache.org/jira/browse/YARN-10475
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.10.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
>
> Add the ability to scale the RM-NM heartbeat interval based on node cpu 
> utilization compared to overall cluster cpu utilization.  If a node is 
> over-utilized compared to the rest of the cluster, its heartbeat interval 
> slows down.  If it is under-utilized compared to the rest of the cluster, 
> its heartbeat interval speeds up.
> This is a feature we have been running with internally in production for 
> several years.  It was developed by [~nroberts], based on the observation 
> that larger faster nodes on our cluster were under-utilized compared to 
> smaller slower nodes. 
> This feature is dependent on [YARN-10450], which added cluster-wide 
> utilization metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-27 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10475:
--

 Summary: Scale RM-NM heartbeat interval based on node utilization
 Key: YARN-10475
 URL: https://issues.apache.org/jira/browse/YARN-10475
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan


Add the ability to scale the RM-NM heartbeat interval based on node cpu 
utilization compared to overall cluster cpu utilization.  If a node is 
over-utilized compared to the rest of the cluster, its heartbeat interval 
slows down.  If it is under-utilized compared to the rest of the cluster, its 
heartbeat interval speeds up.

This is a feature we have been running with internally in production for 
several years.  It was developed by [~nroberts], based on the observation that 
larger faster nodes on our cluster were under-utilized compared to smaller 
slower nodes. 

This feature is dependent on [YARN-10450], which added cluster-wide utilization 
metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216747#comment-17216747
 ] 

Jim Brennan commented on YARN-10450:


Thanks [~ebadger]!

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214879#comment-17214879
 ] 

Jim Brennan commented on YARN-10450:


Thanks [~ebadger]!  I've attached patches for branches 3.2, 3.1, and 2.10 as 
well.

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10450:
---
Attachment: YARN-10450-branch-3.2.003.patch
YARN-10450-branch-3.1.003.patch
YARN-10450-branch-2.10.003.patch

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214760#comment-17214760
 ] 

Jim Brennan commented on YARN-10450:


Patch 003 changes the Cluster page to use *Physical Mem Used %* / *Physical 
VCores Used %*, and the Nodes page columns to *Phys Mem Used %* / *Phys VCores 
Used %*.

 

[~ebadger], [~jhung] is this wording ok?
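
For reference, a rough sketch of how such percentages can be derived from the reported utilization (illustrative 
only, not the patch code; the CPU field is assumed here to report vcores in use):
{code:java}
import org.apache.hadoop.yarn.api.records.ResourceUtilization;

/** Illustrative only: derive the percentages shown on the Cluster/Nodes pages. */
final class UtilizationPercent {
  static int physMemUsedPct(ResourceUtilization util, long totalMemMB) {
    // getPhysicalMemory() is reported in MB.
    return totalMemMB <= 0 ? 0 : (int) (100L * util.getPhysicalMemory() / totalMemMB);
  }

  static int physVcoresUsedPct(ResourceUtilization util, int totalVcores) {
    // Assumption: getCPU() reports the number of vcores currently in use.
    return totalVcores <= 0 ? 0 : (int) (100.0f * util.getCPU() / totalVcores);
  }
}
{code}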

 

 

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, 
> YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-13 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10450:
---
Attachment: YARN-10450.003.patch

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, 
> YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-13 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213305#comment-17213305
 ] 

Jim Brennan commented on YARN-10450:


Thanks [~ebadger] and [~jhung]!  I will upload a new patch with the suggested 
change.

 

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212662#comment-17212662
 ] 

Jim Brennan commented on YARN-10450:


Thanks for the review and comments [~ebadger]!  I agree the names could be 
clearer.  I'm not sure we should change *Mem Used*; even though I agree a new 
name could be more accurate, it has been called that for a long time.

I'm definitely open to changing the name for *Mem Utilization %*, which in the 
Cluster Metrics is the actual memory utilization percentage for all nodes in 
the cluster, and in the Node Metrics it's the actual memory utilization 
percentage for the node. Maybe it should be something like *Physical Mem Used 
%* / *Physical VCores Used %*?


 [~epayne], [~jhung]  what do you think?

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-12 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212547#comment-17212547
 ] 

Jim Brennan commented on YARN-10450:


Anyone else available to review? [~jhung], [~ebadger] ?

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-12 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9667:
--
Fix Version/s: 2.10.2

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.5, 2.10.2
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, 
> YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, though in the Stdout log we can clearly see it has been 
> written there twice. After consulting with [~pbacsko] it seems like there's a 
> missing flush in container-executor.c before the fork and that causes the 
> duplication.
> I suggest adding a flush there so that it won't be duplicated: it's a bit 
> misleading that the child process writes out "Getting exit code file" and 
> "Creating script paths" even though it is clearly not doing that.
> A more appealing solution could be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls would not be 
> forgotten accidentally. (It can cause problems in every place where it's 
> used).
> Note: this issue probably affects every occasion of fork(), not just the one 
> from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210465#comment-17210465
 ] 

Jim Brennan commented on YARN-9667:
---

I've committed this to branch-3.2 and branch-3.1, but the patch does not apply 
to branch-2.10.

[~ebadger] can you provide a patch for branch-2.10?

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.5
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, though in the Stdout log we can clearly see it has been 
> written there twice. After consulting with [~pbacsko] it seems like there's a 
> missing flush in container-executor.c before the fork and that causes the 
> duplication.
> I suggest adding a flush there so that it won't be duplicated: it's a bit 
> misleading that the child process writes out "Getting exit code file" and 
> "Creating script paths" even though it is clearly not doing that.
> A more appealing solution could be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls would not be 
> forgotten accidentally. (It can cause problems in every place where it's 
> used).
> Note: this issue probably affects every occasion of fork(), not just the one 
> from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9667:
--
Fix Version/s: 3.1.5

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.5
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message like this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the Stdout log we can clearly see those lines have 
> been written twice. After consulting with [~pbacsko], it seems there is a 
> missing flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and combine them into a single call, so that the fflush calls cannot be 
> forgotten accidentally. (A forgotten flush can cause problems at every call 
> site.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9667:
--
Fix Version/s: (was: 3.2.3)
   3.2.2

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message like this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the Stdout log we can clearly see those lines have 
> been written twice. After consulting with [~pbacsko], it seems there is a 
> missing flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and combine them into a single call, so that the fflush calls cannot be 
> forgotten accidentally. (A forgotten flush can cause problems at every call 
> site.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9667:
--
Fix Version/s: 3.2.3

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.3
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message like this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the Stdout log we can clearly see those lines have 
> been written twice. After consulting with [~pbacsko], it seems there is a 
> missing flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and combine them into a single call, so that the fflush calls cannot be 
> forgotten accidentally. (A forgotten flush can cause problems at every call 
> site.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10455:
---
Fix Version/s: (was: 3.2.1)
   3.2.2

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.1.2, 3.2.2, 3.4.0, 3.3.1
>
> Attachments: YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a 
> connection timeout exception instead. In such a scenario the JUnit test 
> times out, and the main reason behind the failure is swallowed by the 
> shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to 
> set the host address to {{127.0.0.1:1}}. The latter avoids the possibility 
> of collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching {{IOException}} and checking whether or not it is a 
> {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 
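
The socket behavior the suggested fix relies on is independent of the test code itself. A minimal C sketch (assuming a POSIX system with nothing listening on port 1 of the loopback interface, which is the normal case since port 1 is privileged) shows why a connection attempt to {{127.0.0.1:1}} fails fast instead of hanging until a connect timeout, which is what makes the JUnit failure mode deterministic:

{code:c}
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    perror("socket");
    return 1;
  }

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(1);                       /* privileged, unbound port */
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

  if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
    /* Expected result: ECONNREFUSED, returned immediately by the loopback
       stack, rather than a long wait for a connect timeout. */
    printf("connect failed fast: %s\n", strerror(errno));
  }
  close(fd);
  return 0;
}
{code}

That immediate "connection refused" is what lets the test assert on a {{SocketException}} consistently instead of sometimes running into the JUnit timeout.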

[jira] [Commented] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210457#comment-17210457
 ] 

Jim Brennan commented on YARN-9667:
---

+1 on the patch for branch-3.2

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message like this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the Stdout log we can clearly see those lines have 
> been written twice. After consulting with [~pbacsko], it seems there is a 
> missing flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and combine them into a single call, so that the fflush calls cannot be 
> forgotten accidentally. (A forgotten flush can cause problems at every call 
> site.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210423#comment-17210423
 ] 

Jim Brennan commented on YARN-10455:


I have committed this to the following branches: trunk, 3.3, 3.2, 3.1

[~ahussein] can you please provide a patch for branch-2.10?

 

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.1.2, 3.2.1, 3.4.0, 3.3.1
>
> Attachments: YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a 
> connection timeout exception instead. In such a scenario the JUnit test 
> times out, and the main reason behind the failure is swallowed by the 
> shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to 
> set the host address to {{127.0.0.1:1}}. The latter avoids the possibility 
> of collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching {{IOException}} and checking whether or not it is a 
> {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> 

[jira] [Updated] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10455:
---
Fix Version/s: 3.1.2

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.1.2, 3.2.1, 3.4.0, 3.3.1
>
> Attachments: YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a 
> connection timeout exception instead. In such a scenario the JUnit test 
> times out, and the main reason behind the failure is swallowed by the 
> shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to 
> set the host address to {{127.0.0.1:1}}. The latter avoids the possibility 
> of collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching {{IOException}} and checking whether or not it is a 
> {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Updated] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10455:
---
Fix Version/s: 3.2.1

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.2.1, 3.4.0, 3.3.1
>
> Attachments: YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a 
> connection timeout exception instead. In such a scenario the JUnit test 
> times out, and the main reason behind the failure is swallowed by the 
> shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to 
> set the host address to {{127.0.0.1:1}}. The latter avoids the possibility 
> of collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching {{IOException}} and checking whether or not it is a 
> {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Updated] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-08 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10455:
---
Fix Version/s: 3.3.1
   3.4.0

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a 
> connection timeout exception instead. In such a scenario the JUnit test 
> times out, and the main reason behind the failure is swallowed by the 
> shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to 
> set the host address to {{127.0.0.1:1}}. The latter avoids the possibility 
> of collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching {{IOException}} and checking whether or not it is a 
> {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Commented] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210312#comment-17210312
 ] 

Jim Brennan commented on YARN-10455:


Thanks for the patch [~ahussein]!  I verified that this test fails for me when 
I run it on my Mac, but passes with your patch.  On my Linux VM, it passes in 
both cases.

+1 on this patch. Can you provide a patch for branch-2.10 as well?

 

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a 
> connection timeout exception instead. In such a scenario the JUnit test 
> times out, and the main reason behind the failure is swallowed by the 
> shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to 
> set the host address to {{127.0.0.1:1}}. The latter avoids the possibility 
> of collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching {{IOException}} and checking whether or not it is a 
> {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-07 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209597#comment-17209597
 ] 

Jim Brennan commented on YARN-10393:


Thanks [~adam.antal]!

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5, 2.10.2
>
> Attachments: YARN-10393-branch-2.10.001.patch, 
> YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job got stuck in 
> a live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The 
> reason the application master didn't preempt the reducer was that there was 
> a leaked container in its list of assigned mappers: the node manager failed 
> to report the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let's assume the heartBeatResponseId=$hid in the node manager. According to 
> our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at 
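
For the heartbeat bookkeeping that step 3 refers to, here is an illustrative model (the names and types are invented for this sketch and are not YARN's actual classes). It shows one common way a server handles a lost heartbeat response: the node never receives the reply, so its next heartbeat echoes the old responseId, and the server replays the cached reply instead of processing the heartbeat as new.

{code:c}
#include <stdint.h>
#include <stdio.h>

typedef struct {
  uint32_t last_response_id;    /* id of the last reply sent to this node */
  char     last_response[128];  /* cached reply, kept for retransmission  */
} node_state_t;

/* Handle one heartbeat: echoed_id is the responseId the node last received. */
static const char *handle_heartbeat(node_state_t *node, uint32_t echoed_id,
                                    const char *fresh_reply) {
  if (echoed_id + 1 == node->last_response_id) {
    /* The node never saw the previous reply: replay it verbatim and do not
       process the heartbeat payload a second time. */
    return node->last_response;
  }
  /* Normal case: accept the payload, advance the id, cache the new reply. */
  node->last_response_id += 1;
  snprintf(node->last_response, sizeof(node->last_response), "%s", fresh_reply);
  return node->last_response;
}

int main(void) {
  node_state_t node = { .last_response_id = 0, .last_response = "" };

  puts(handle_heartbeat(&node, 0, "reply#1"));  /* normal heartbeat          */
  puts(handle_heartbeat(&node, 1, "reply#2"));  /* node saw reply#1          */
  puts(handle_heartbeat(&node, 1, "reply#3"));  /* reply#2 was lost; the     */
                                                /* cached reply#2 is resent  */
  return 0;
}
{code}

The detailed steps above walk through what can go wrong around that retransmission when completed-container statuses are in flight.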

[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209068#comment-17209068
 ] 

Jim Brennan commented on YARN-10451:


I have committed this to trunk, branch-3.3, branch-3.2, branch-3.1, and 
branch-2.10.

Thanks for the contribution [~epayne] and thanks for the review [~sunilg]!

 

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 2.10.2
>
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch, YARN-10451.branch-3.2.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10451:
---
Fix Version/s: 2.10.2
   3.1.5
   3.3.1
   3.4.0
   3.2.2

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1, 3.1.5, 2.10.2
>
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch, YARN-10451.branch-3.2.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209014#comment-17209014
 ] 

Jim Brennan commented on YARN-10451:


Thanks [~epayne]!  I will commit this shortly.

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch, YARN-10451.branch-3.2.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208867#comment-17208867
 ] 

Jim Brennan commented on YARN-10393:


[~adam.antal] I have uploaded a patch for branch-2.10.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, 
> YARN-10393.002.patch, YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job got stuck in 
> a live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The 
> reason the application master didn't preempt the reducer was that there was 
> a leaked container in its list of assigned mappers: the node manager failed 
> to report the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let's assume the heartBeatResponseId=$hid in the node manager. According to 
> our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-06 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10393:
---
Attachment: YARN-10393-branch-2.10.001.patch

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.0.4, 3.2.2, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10393-branch-2.10.001.patch, YARN-10393.001.patch, 
> YARN-10393.002.patch, YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job got stuck in 
> a live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, causing the job to be stuck. The 
> reason the application master didn't preempt the reducer was that there was 
> a leaked container in its list of assigned mappers: the node manager failed 
> to report the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let's assume the heartBeatResponseId=$hid in the node manager. According to 
> our current configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at 

[jira] [Commented] (YARN-10451) RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.

2020-10-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208793#comment-17208793
 ] 

Jim Brennan commented on YARN-10451:


I am +1 on patch 003.  The unit test that failed did not fail for me locally 
with this patch.

> RM (v1) UI NodesPage can NPE when yarn.io/gpu resource type is defined.
> ---
>
> Key: YARN-10451
> URL: https://issues.apache.org/jira/browse/YARN-10451
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10451.001.patch, YARN-10451.002.patch, 
> YARN-10451.003.patch
>
>
> The NodesPage in the RM (v1) UI will NPE when the {{yarn.resource-types}} 
> property defines {{yarn.io}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


