[jira] [Updated] (YARN-10562) Follow up changes for YARN-9833

2021-01-15 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10562:
---
Fix Version/s: 2.10.2

I committed this to branch-2.10 after backporting YARN-9833. This JIRA has now 
been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and 
branch-2.10.

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: resourcemanager
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection was fixed. 
> {{getGoodDirs()}} and related methods were returning an unmodifiable view of 
> the lists. These accesses were protected by read/write locks, but because the 
> lists are CopyOnWriteArrayLists, subsequent changes to the lists, even when 
> done under the write lock, were exposed when a caller started iterating the 
> list view. CopyOnWriteArrayLists cache the current underlying list in the 
> iterator, so it is safe to iterate them even while they are being changed - 
> at least the view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it was guaranteed to modify 
> those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable tradeoff for reducing all the 
> copying.
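
For illustration, a minimal sketch of the checkDirs() rework described above: 
classify the disks into local lists first, then apply only the differences to 
the shared lists under the write lock. The class and helper names are 
hypothetical and the disk checks are stubbed out; this is not the actual 
YARN-10562 patch.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DirCheckSketch {
  private final List<String> localDirs = new ArrayList<>();
  private final List<String> errorDirs = new ArrayList<>();
  private final List<String> fullDirs = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void checkDirs(List<String> allDirs) {
    // Classify every disk into local lists first, without touching shared state.
    List<String> newGood = new ArrayList<>();
    List<String> newError = new ArrayList<>();
    List<String> newFull = new ArrayList<>();
    for (String dir : allDirs) {
      if (isFull(dir)) {
        newFull.add(dir);
      } else if (hasErrors(dir)) {
        newError.add(dir);
      } else {
        newGood.add(dir);
      }
    }
    // Apply only the differences under the write lock, so the shared lists are
    // never cleared and rebuilt wholesale.
    lock.writeLock().lock();
    try {
      applyDiff(localDirs, newGood);
      applyDiff(errorDirs, newError);
      applyDiff(fullDirs, newFull);
    } finally {
      lock.writeLock().unlock();
    }
  }

  private static void applyDiff(List<String> current, List<String> updated) {
    current.retainAll(updated);           // drop disks that left this state
    for (String dir : updated) {
      if (!current.contains(dir)) {
        current.add(dir);                 // add disks that entered this state
      }
    }
  }

  // Stubs standing in for the real disk health checks.
  private static boolean isFull(String dir) { return false; }
  private static boolean hasErrors(String dir) { return false; }
}
{code}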



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2021-01-15 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9833:
--
Fix Version/s: 2.10.2

I backported this to branch-2.10

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4, 2.10.2
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that caused an empty 
> {{localDirs}} being passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
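
For reference, a minimal sketch of the fix described above, returning a copy 
(Guava's {{ImmutableList.copyOf()}}) instead of an unmodifiable view. The 
surrounding class is a simplified stand-in for DirectoryCollection, not the 
actual code.

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import com.google.common.collect.ImmutableList;

class GoodDirsSketch {
  private final List<String> localDirs = new CopyOnWriteArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      // An unmodifiable view keeps tracking later mutations of localDirs;
      // a copy is immune to a concurrent clear() from checkDirs().
      return ImmutableList.copyOf(localDirs);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}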



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10562) Follow up changes for YARN-9833

2021-01-13 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10562:
---
Fix Version/s: 3.2.3
   3.1.5
   3.3.1
   3.4.0

I committed this to trunk (3.4), branch-3.3, branch-3.2, and branch-3.1. To put 
this back into branch-2.10, we'll need to also backport YARN-9833. 
[~Jim_Brennan], let me know if you'd like me to do this

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: resourcemanager
> Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3
>
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection was fixed. 
> {{getGoodDirs()}} and related methods were returning an unmodifiable view of 
> the lists. These accesses were protected by read/write locks, but because the 
> lists are CopyOnWriteArrayLists, subsequent changes to the lists, even when 
> done under the write lock, were exposed when a caller started iterating the 
> list view. CopyOnWriteArrayLists cache the current underlying list in the 
> iterator, so it is safe to iterate them even while they are being changed - 
> at least the view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it was guaranteed to modify 
> those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable tradeoff for reducing all the 
> copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833

2021-01-12 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263705#comment-17263705
 ] 

Eric Badger commented on YARN-10562:


+1 on the patch. As mentioned above, there is still a race in the code, based 
on the fact that the caller doesn't have to get all dirs at the same time. But 
the only issue that this will cause is the dirs being out of date for that 
iteration. The next time they get a copy, it will be updated. And the list will 
always be well-constructed. It just has the possibility of being out of sync 
when compared with the other lists. 

I will wait for the precommit to come back and commit if there are no errors 
and no objections.

> Follow up changes for YARN-9833
> ---
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch, YARN-10562.004.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection was fixed. 
> {{getGoodDirs()}} and related methods were returning an unmodifiable view of 
> the lists. These accesses were protected by read/write locks, but because the 
> lists are CopyOnWriteArrayLists, subsequent changes to the lists, even when 
> done under the write lock, were exposed when a caller started iterating the 
> list view. CopyOnWriteArrayLists cache the current underlying list in the 
> iterator, so it is safe to iterate them even while they are being changed - 
> at least the view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it was guaranteed to modify 
> those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable tradeoff for reducing all the 
> copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-12 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263601#comment-17263601
 ] 

Eric Badger commented on YARN-10562:


Yea, the original problem (before YARN-9833) was that we were getting a view of 
the list instead of a copy, and callers holding those views could iterate the 
list at any time. 
The issue there was that checkDirs was going out and clearing those lists in a 
separate thread. So when the client iterated through the lists, it would 
periodically see an empty list if it iterated at just the right time. 

After YARN-9833 there is still a race, but it is a smaller and less nefarious 
one. The issue there is that we have 3 lists (localDirs, fullDirs, errorDirs). 
Those can really be thought of as a single list with different attributes for 
each dir, because the sum of all of those lists should give you all disks on 
the node. YARN-9833 added code to return a copy of the list instead of a view. 
So we'll never have a list that is incomplete. But the race becomes the fact 
that you could potentially call getGoodDirs(), then have checkDirs run, then 
call getErrorDirs. If a dir transitioned from good -> error just after 
getGoodDirs was called, it would show up in both lists. 

But like you said, [~pbacsko], I think it makes sense to remove complexity from 
the code if it requires this type of discussion to understand exactly why the 
code works (or doesn't work). It makes the code harder to maintain and even 
harder to modify.

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection was fixed. 
> {{getGoodDirs()}} and related methods were returning an unmodifiable view of 
> the lists. These accesses were protected by read/write locks, but because the 
> lists are CopyOnWriteArrayLists, subsequent changes to the lists, even when 
> done under the write lock, were exposed when a caller started iterating the 
> list view. CopyOnWriteArrayLists cache the current underlying list in the 
> iterator, so it is safe to iterate them even while they are being changed - 
> at least the view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it was guaranteed to modify 
> those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable tradeoff for reducing all the 
> copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262976#comment-17262976
 ] 

Eric Badger commented on YARN-10562:


Discussed this a little bit with [~Jim_Brennan] offline and here's the summary 
of my thoughts. There are a few problems with the current code.
1) There is an inherent race in the code
2) There is unnecessary and overlapping locking between the read/write lock and 
the CopyOnWriteArrayList

The only way we can reliably address 1) is to return a copy of all lists at 
once. Otherwise DirectoryCollection.checkDirs() can come along and change the 
overall status of the 3 dirs lists. If we put checkDirs in a critical section 
(like it is now with the write lock), then we can return all dirs at once while 
in the read lock and assure that all dirs are consistent with each other. If we 
get the dirs in separate calls that grab the lock, we could be inconsistent 
because checkDirs could be called in between the getDirs calls. I suppose the 
other way to fix the locking is to do fine-grained locking within the caller 
code itself, but I think that is pretty bad practice, as it exposes internal 
locking to the caller.
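
To make option 1) concrete, here is a sketch of what "return a copy of all 
lists at once" could look like. The snapshot holder and method name are 
hypothetical, not existing DirectoryCollection API.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DirsSnapshotSketch {
  // Hypothetical immutable holder for one consistent view of all three lists.
  static final class DirsSnapshot {
    final List<String> goodDirs;
    final List<String> errorDirs;
    final List<String> fullDirs;
    DirsSnapshot(List<String> good, List<String> error, List<String> full) {
      this.goodDirs = good;
      this.errorDirs = error;
      this.fullDirs = full;
    }
  }

  private final List<String> localDirs = new ArrayList<>();
  private final List<String> errorDirs = new ArrayList<>();
  private final List<String> fullDirs = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // All three lists are copied under a single read lock, so checkDirs() (which
  // mutates them under the write lock) cannot run between the copies and the
  // snapshot stays internally consistent.
  DirsSnapshot getDirsSnapshot() {
    lock.readLock().lock();
    try {
      return new DirsSnapshot(new ArrayList<>(localDirs),
          new ArrayList<>(errorDirs), new ArrayList<>(fullDirs));
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}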

For 2) we should either change CopyOnWriteArrayList to a regular ArrayList (as 
they had originally planned to do in 
[YARN-5214|https://issues.apache.org/jira/browse/YARN-5214?focusedCommentId=15342587=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15342587])
 or remove the read/write lock. These locks serve more or less the same purpose 
and having both of them is unnecessary.

Since I think that locking is usually difficult, complex, and misunderstood by 
those who change it later, I think we should get rid of the 
CopyOnWriteArrayList and change it to a regular ArrayList and then make the 
changes that [~Jim_Brennan] has made here so that we aren't reconstructing each 
list from scratch each time we run checkDirs. The downside of this change is 
that every container launch will create a new copy of each list, and that is a 
performance regression. But I don't think it will be much of an issue. I would 
be happy to hear other opinions on this.

> Alternate fix for DirectoryCollection.checkDirs() race
> --
>
> Key: YARN-10562
> URL: https://issues.apache.org/jira/browse/YARN-10562
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10562.001.patch, YARN-10562.002.patch, 
> YARN-10562.003.patch
>
>
> In YARN-9833, a race condition in DirectoryCollection was fixed. 
> {{getGoodDirs()}} and related methods were returning an unmodifiable view of 
> the lists. These accesses were protected by read/write locks, but because the 
> lists are CopyOnWriteArrayLists, subsequent changes to the lists, even when 
> done under the write lock, were exposed when a caller started iterating the 
> list view. CopyOnWriteArrayLists cache the current underlying list in the 
> iterator, so it is safe to iterate them even while they are being changed - 
> at least the view will be consistent.
> The problem was that checkDirs() was clearing the lists and rebuilding them 
> from scratch every time, so if a caller called getGoodDirs() just before 
> checkDirs cleared it, and then started iterating right after the clear, they 
> could get an empty list.
> The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
> return a copy of the list, which definitely fixes the race condition. The 
> disadvantage is that now we create a new copy of these lists every time we 
> launch a container. The advantage of using CopyOnWriteArrayList was that the 
> lists should rarely ever change, so we could avoid all the copying. 
> Unfortunately, the way checkDirs() was written, it was guaranteed to modify 
> those lists multiple times on every run.
> So this Jira proposes an alternate solution for YARN-9833, which mainly just 
> rewrites checkDirs() to minimize the changes to the underlying lists. There 
> are still some small windows where a disk will have been added to one list 
> but not yet removed from another if you hit it just right, but I think these 
> should be pretty rare and relatively harmless, and in the vast majority of 
> cases I suspect only one disk will be moving from one list to another at any 
> time. The question is whether this type of inconsistency (which was always 
> there before YARN-9833) is an acceptable tradeoff for reducing all the 
> copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2021-01-04 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258592#comment-17258592
 ] 

Eric Badger commented on YARN-10501:


Thanks for the explanation, [~caozhiqiang]. 

{quote}
Overall I have 2 main issues with the code that I need cleared up because I 
either don't understand the code or think things are broken/unnecessary.
1) We have labels associated with Hosts (which are collections of Nodes) and 
labels associated with just Nodes
2) We add Nodes that have no associated port or have the wildcard port and add 
those same nodes with their associated ports.
{quote}
Ok so out of these points, it looks like 1) exists because we want to have a 
"host default" that all nodes get when they start up. But I still don't 
understand why 2) exists. It sounds like you don't fully understand why 2) 
exists either, but correct me if I'm wrong. Hopefully one of the people you 
tagged can explain why we add nodes that have no associated port.

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Attachments: YARN-10501.002.patch, YARN-10501.003.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the 
> WILDCARD_PORT (0), not all of the label info can be removed from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}}
> 4.yarn rmadmin -replaceLabelsOnNode "server001"
> 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}}
>  {code}
> You can see that after step 4, which removes the nodemanager labels, the 
> label info is still present in the node info.
> {code:java}
>  641 case REPLACE:
>  642 replaceNodeForLabels(nodeId, host.labels, labels);
>  643 replaceLabelsForNode(nodeId, host.labels, labels);
>  644 host.labels.clear();
>  645 host.labels.addAll(labels);
>  646 for (Node node : host.nms.values()) {
>  647 replaceNodeForLabels(node.nodeId, node.labels, labels);
>  649 node.labels = null;
>  650 }
>  651 break;{code}
> The cause is at line 647: when labels are added to a node without a port, 
> both the 0 port and the real NM port are added to the node info. When the 
> labels are removed, the node.labels parameter at line 647 is null, so the old 
> label is not removed.
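
As a stripped-down illustration of the failure mode described above (a 
hypothetical simplification, not the actual node labels manager code), a 
replace helper that receives a null old-label set simply skips the removal 
step, leaving the stale mapping behind:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ReplaceLabelsSketch {
  // label -> set of node ids ("host:port") currently carrying that label
  private final Map<String, Set<String>> labelToNodes = new HashMap<>();

  // Simplified stand-in for replaceNodeForLabels(nodeId, oldLabels, newLabels).
  void replaceNodeForLabels(String nodeId, Set<String> oldLabels,
      Set<String> newLabels) {
    if (oldLabels != null) {
      for (String label : oldLabels) {
        Set<String> nodes = labelToNodes.get(label);
        if (nodes != null) {
          nodes.remove(nodeId);
        }
      }
    }
    // When oldLabels is null (as node.labels is at line 647 in the reproduce
    // case), the existing "server001:45454" -> "cpunode" mapping is never
    // removed, which matches the leftover entry seen in step 5.
    for (String label : newLabels) {
      labelToNodes.computeIfAbsent(label, k -> new HashSet<>()).add(nodeId);
    }
  }
}
{code}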



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port

2020-12-22 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253792#comment-17253792
 ] 

Eric Badger commented on YARN-10501:


I'm looking at this patch and I have some questions about other pieces of code 
that are in the same section. I'll admit, this code is a little bit confusing 
to me because we have Hosts->Labels maps as well as Labels->Nodes maps and then 
on top of that, each Host can have multiple Nodes. Before I can comment on your 
patch, I think I need to clear up some things that are going on in this area of 
code that are confusing to me. Below is what I _think_ is happening. Feel free 
to correct me where I'm wrong.

(Assuming no port and/or wildcard port for this)
1. When we add a node label, we invoke this piece of code. 
{noformat}
case ADD:
  addNodeToLabels(nodeId, labels);
  host.labels.addAll(labels);
  for (Node node : host.nms.values()) {
if (node.labels != null) {
  node.labels.addAll(labels);
}
addNodeToLabels(node.nodeId, labels);
  }
  break;
{noformat}
  1a. This code adds the NodeId (without a port/with a wildcard port) to the 
Labels->Nodes map via addNodeToLabels. 
*Why do we do this? There is no port associated with this node. In 1d we 
add the nodes to the map with their associated port, so I don't understand why 
we're adding the node here when it doesn't have a port.*
  1b. It adds all of the labels to the Host. This part doesn't make sense to 
me. *If we are giving Hosts the granularity to have multiple labels per host 
(due to multiple NMs), then why does the Host itself have labels?*
  1c. We add all the labels to each Node in the host, but _only_ if they 
already have labels. *Why do we only add the labels if they already have 
labels? Don't we want to add the labels regardless? Should it be possible for 
us to be in the ADD method while node.labels == null? Maybe this should throw 
an exception*
  1d. We add the Nodes (with their associated NM port) to the Labels->nodes map 
via addNodeToLabels.

2. When we replace the node label we invoke this piece of code
{noformat}
case REPLACE:
  replaceNodeForLabels(nodeId, host.labels, labels);
  host.labels.clear();
  host.labels.addAll(labels);
  for (Node node : host.nms.values()) {
replaceNodeForLabels(node.nodeId, node.labels, labels);
node.labels = null;
  }
{noformat}
  2a. We remove the Node (without port or with wildcard port) from the specific 
label in the Labels->Nodes map via removeNodeFromLabels(). *Why do we have the 
node without a port in the first place?* This comes from 1a.
  2b. We add the Node (without port or with wildcard port) to the new specific 
label in the Labels->Nodes map via addNodeToLabels(). *Why do we add the node 
without a port?*
  2c. We clear the labels associated with the Host. *Why are there labels 
associated with a Host when each Host is actually a collection of Nodes?*
  2d. We add the new labels to the Host. Same question as 2c.
  2e. We iterate through the list of Nodes associated with each Host and 
perform 2a and 2b, except with Nodes that have their associated ports.
  2f. We set the Labels to Null for each Node associated with the Host. I don't 
understand the purpose of this. I must be missing something here.

Overall I have 2 main issues with the code that I need cleared up because I 
either don't understand the code or think things are broken/unnecessary. 
1) We have labels associated with Hosts (which are collections of Nodes) _and_ 
labels associated with just Nodes
2) We add Nodes that have no associated port or have the wildcard port _and_ 
add those same nodes with their associated ports.

Probably need [~leftnoteasy], [~varunsaxena], or [~sunilg] to comment on this 
since they were involved with YARN-3075 that added much of this code

> Can't remove all node labels after add node label without nodemanager port
> --
>
> Key: YARN-10501
> URL: https://issues.apache.org/jira/browse/YARN-10501
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
> Attachments: YARN-10501.002.patch, YARN-10501.003.patch
>
>
> When adding a label to nodes without a nodemanager port, or using the 
> WILDCARD_PORT (0), not all of the label info can be removed from these nodes.
> Reproduce process:
> {code:java}
> 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)"
> 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode"
> 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings
> 

[jira] [Updated] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes

2020-12-21 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10540:
---
Fix Version/s: 3.2.3
   2.10.2
   3.1.5
   3.3.1
   3.4.0

> Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
> 
>
> Key: YARN-10540
> URL: https://issues.apache.org/jira/browse/YARN-10540
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: webapp
>Affects Versions: 3.2.2
>Reporter: Sunil G
>Assignee: Jim Brennan
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 
> PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, 
> Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png
>
>
> YARN-10450 added changes in NodeInfo class.
> Various exceptions are showing while accessing UI2 and UI1 NODE pages. 
> {code:java}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
> at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
> at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
> at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
> at 
> org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70)
>  {code}
> {code:java}
> 2020-12-19 22:55:54,846 WARN 
> org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-14 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249402#comment-17249402
 ] 

Eric Badger commented on YARN-9833:
---

bq. Isn't this how it has been for years? It was returning an unmodifiableList 
view of the underlying List, so that limits what the caller can do. 
getGoodDirs() and the others just return a read-only List. They don't have to 
know about the internals.
Well, yes and no. It was _supposed_ to be like that. But given this bug, we can 
see that it clearly wasn't. The callee in this case _should_ have been atomic 
and so the unmodifiable view of the list _should_ have been fine. But when you 
get into fine-grained locking like this, mistakes are easy to make because the 
person making the change doesn't necessarily understand the history behind why 
the code is written the way it is. If we can guarantee that the callee will 
always perform atomic operations on the lists, then there isn't an issue. Maybe 
we can guarantee this by adding a comment/warning in the checkDirs() function 
to make sure that anyone touching this code is super careful about locking.

I agree with the idea of fixing this on the callee side and not having the 
caller create a new object every time. I just want to make sure that this bug 
isn't reintroduced by accident down the line because of the added complexity of 
fine-grained locking. I don't know whether this is a performance-sensitive 
enough area of the code that such a tradeoff should clearly favor performance.

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that caused an empty 
> {{localDirs}} being passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-14 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249289#comment-17249289
 ] 

Eric Badger commented on YARN-9833:
---

I agree that the CopyOnWriteArrayList is most likely a performance thing. Since 
{{getGoodDirs()}} is called on every container launch, that's a lot of 
copying. Not sure it's that bad in the grand scheme of things though. 

bq. My suggestion for fixing this would be to fix the checkdirs() 
implementation to operate on local copies of these arrays, and then update them 
with a single assignment only if they have changed.
My worry with this is that code changes in the future will incorrectly use 
{{getGoodDirs}} or the other methods that expose the private lists from within 
DirectoryCollection. So in my mind it's a tradeoff between performance and 
maintainability. I don't know what the performance impact is. We could 
potentially mitigate some (most?) of the maintainability impact via a comment 
on the getGoodDirs() method (as well as the getLocalDirs() method in 
LocalDirsHandlerService). In general, I don't like calling methods having to 
be aware of callee methods and having to deal with their locking. That could 
also be mitigated by fixing the callee method to remove the race condition, but 
that could be reintroduced by accident in the future, since they may not 
understand the full impact of the CopyOnWriteArrayList

bq. 1. We were not thinking about errorDirs because as we were tracking down 
the issue, only localDirs seemed to be problematic, although I agree that it is 
inconsistent this way. Shall we follow-up on this?
Yea, we should definitely follow up to fix errorDirs

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that caused an empty 
> {{localDirs}} being passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch

2020-12-09 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246922#comment-17246922
 ] 

Eric Badger commented on YARN-9833:
---

[~pbacsko], is there a reason that errorDirs wasn't added in this patch? It 
still returns an unmodifiableList instead of a copy.

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> 
>
> Key: YARN-9833
> URL: https://issues.apache.org/jira/browse/YARN-9833
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that caused an empty 
> {{localDirs}} being passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
> this.readLock.lock();
> try {
>   return Collections.unmodifiableList(localDirs);
> } finally {
>   this.readLock.unlock();
> }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10495) make the rpath of container-executor configurable

2020-12-07 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10495:
---
Fix Version/s: 3.4.1

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.1
>
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our jenkins machine has 
> libcrypto.so.1.0.0 in its shared lib environment, but our nodemanager machine 
> does not have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path, 
> /usr/lib/x86_64-linux-gnu, 
> and we build hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the jenkins machine's shared lib path /usr/lib/x86_64-linux-gnu (where 
> libcrypto is):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager's shared lib path /usr/lib/x86_64-linux-gnu (where 
> libcrypto is):
> {code:java}
> -rw-r--r--  1 root root55852 Feb   7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 Sep  28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 Sep  28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 Dec 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 Sep  28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 Feb   7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 Feb   7  2019 libc.so
> {code}
> We build container-executor on the jenkins machine. Since the libcrypto.so 
> versions are not the same, we get an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make the RPATH of container-executor configurable to solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)

2020-12-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243597#comment-17243597
 ] 

Eric Badger commented on YARN-10494:


I'm pretty much in line with [~ccondit]'s opinion here. This squashfs code 
definitely isn't specific to Hadoop and could be a separate project. However, 
that's a lot of work and I'm not sure there's any real benefit there. Also, I 
think a PR is much better for a large commit because it becomes much easier to 
make comments in line with the code instead of having to explain where the 
issues are in a JIRA comment

> CLI tool for docker-to-squashfs conversion (pure Java)
> --
>
> Key: YARN-10494
> URL: https://issues.apache.org/jira/browse/YARN-10494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10494.001.patch, 
> docker-to-squashfs-conversion-tool-design.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on 
> python2, multiple libraries, squashfs-tools and root access in order to 
> convert Docker images to squashfs images for use with the runc container 
> runtime in YARN.
> *YARN-9943* was created to investigate alternatives, as the response to 
> merging YARN-9564 has not been very positive. This proposal outlines the 
> design for a CLI conversion tool in 100% pure Java that will work out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)

2020-11-20 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236375#comment-17236375
 ] 

Eric Badger commented on YARN-10494:


Hey [~ccondit], thanks for the document. I'm really excited about this tool.

bq. This tool will connect to a Docker repository
Is this a requirement of the tool? I think we definitely need to at least have 
support for local image import from the docker daemon. Ideally we would also 
include support for any OCI-compliant registry, but that is probably outside of 
the scope of the initial design. We just want to make sure to leave the door 
open for that support in the future

Another question: Does this tool support reproducible builds as was added to 
squashfs-tools 4.4 
(https://github.com/plougher/squashfs-tools/blob/master/README-4.4)?

And as we discussed in the most recent YARN call, we'll need to figure out how 
to run this tool (e.g. a service in the RM, standalone, Hadoop job, etc.) and 
with what user it needs to be run as. There are certainly challenges around 
permissions and security where we won't want arbitrary users creating 
potentially malicious squashfs images that will be blindly loaded by the 
kernel. This is outside of the scope of this specific JIRA, but wanted to 
mention it here for posterity. 

> CLI tool for docker-to-squashfs conversion (pure Java)
> --
>
> Key: YARN-10494
> URL: https://issues.apache.org/jira/browse/YARN-10494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
> Attachments: docker-to-squashfs-conversion-tool-design.pdf
>
>
> *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on 
> python2, multiple libraries, squashfs-tools and root access in order to 
> convert Docker images to squashfs images for use with the runc container 
> runtime in YARN.
> *YARN-9943* was created to investigate alternatives, as the response to 
> merging YARN-9564 has not been very positive. This proposal outlines the 
> design for a CLI conversion tool in 100% pure Java that will work out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable

2020-11-20 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236353#comment-17236353
 ] 

Eric Badger commented on YARN-10495:


Hadoop QA seems to have failed pretty epically. Looks like OOM issues on the 
box where it ran. I re-uploaded your patch (as 002) to retrigger this build

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our jenkins machine has 
> libcrypto.so.1.0.0 in its shared lib environment, but our nodemanager machine 
> does not have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path, 
> /usr/lib/x86_64-linux-gnu, 
> and we build hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the jenkins machine's shared lib path /usr/lib/x86_64-linux-gnu (where 
> libcrypto is):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager's shared lib path /usr/lib/x86_64-linux-gnu (where 
> libcrypto is):
> {code:java}
> -rw-r--r--  1 root root55852 Feb   7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 Sep  28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 Sep  28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 Dec 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 Sep  28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 Feb   7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 Feb   7  2019 libc.so
> {code}
> We build container-executor on the jenkins machine. Since the libcrypto.so 
> versions are not the same, we get an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make the RPATH of container-executor configurable to solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10495) make the rpath of container-executor configurable

2020-11-20 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10495:
---
Attachment: YARN-10495.002.patch

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: YARN-10495.001.patch, YARN-10495.002.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our jenkins machine has 
> libcrypto.so.1.0.0 in its shared lib environment, but our nodemanager machine 
> does not have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic link library path, 
> /usr/lib/x86_64-linux-gnu, 
> and we build hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the jenkins machine's shared lib path /usr/lib/x86_64-linux-gnu (where 
> libcrypto is):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager's shared library path /usr/lib/x86_64-linux-gnu (where 
> libcrypto lives):
> {code:java}
> -rw-r--r--  1 root root55852 2月   7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 9月  28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 9月  28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 12月 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 9月  28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 2月   7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 2月   7  2019 libc.so
> {code}
>  We build container-executor on the Jenkins machine, so the libcrypto.so 
> versions do not match, and we get an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make RPATH of container-executor configurable to solve this problem 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable

2020-11-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235627#comment-17235627
 ] 

Eric Badger commented on YARN-10495:


Also, I've added you as a contributor in Hadoop Common, HDFS, Map/Reduce, and 
YARN. So you will now be able to assign JIRAs to yourself (as I've already done 
for you on this JIRA).

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: YARN-10495.001.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our Jenkins machine has 
> libcrypto.so.1.0.0 in its shared library path, but our nodemanager machines 
> don't have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic-link library path, 
> /usr/lib/x86_64-linux-gnu, 
> and we build Hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the Jenkins machine's shared library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto lives):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager's shared library path /usr/lib/x86_64-linux-gnu (where 
> libcrypto lives):
> {code:java}
> -rw-r--r--  1 root root55852 2月   7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 9月  28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 9月  28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 12月 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 9月  28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 2月   7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 2月   7  2019 libc.so
> {code}
>  We build container-executor on the Jenkins machine, so the libcrypto.so 
> versions do not match, and we get an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make RPATH of container-executor configurable to solve this problem 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10495) make the rpath of container-executor configurable

2020-11-19 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger reassigned YARN-10495:
--

Assignee: angerszhu

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: YARN-10495.001.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our Jenkins machine has 
> libcrypto.so.1.0.0 in its shared library path, but our nodemanager machines 
> don't have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic-link library path, 
> /usr/lib/x86_64-linux-gnu, 
> and we build Hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the Jenkins machine's shared library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto lives):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager's shared library path /usr/lib/x86_64-linux-gnu (where 
> libcrypto lives):
> {code:java}
> -rw-r--r--  1 root root55852 2月   7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 9月  28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 9月  28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 12月 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 9月  28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 2月   7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 2月   7  2019 libc.so
> {code}
>  We build container-executor on the Jenkins machine, so the libcrypto.so 
> versions do not match, and we get an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make RPATH of container-executor configurable to solve this problem 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable

2020-11-19 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235625#comment-17235625
 ] 

Eric Badger commented on YARN-10495:


[~angerszhuuu], I imagine the {{-Dbundle.openssl}} adds the libcrypto.so 
library to {{../lib/native}} of the build that is created? I don't have 
experience with this flag. Also, have you tested this out in your environment?

> make the rpath of container-executor configurable
> -
>
> Key: YARN-10495
> URL: https://issues.apache.org/jira/browse/YARN-10495
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: angerszhu
>Priority: Major
> Attachments: YARN-10495.001.patch
>
>
> In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on 
> crypto to container-executor. We hit a case where our Jenkins machine has 
> libcrypto.so.1.0.0 in its shared library path, but our nodemanager machines 
> don't have libcrypto.so.1.0.0, only *libcrypto.so.1.1*.
> We use an internal custom dynamic-link library path, 
> /usr/lib/x86_64-linux-gnu, 
> and we build Hadoop with the parameters below:
> {code:java}
>  -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu
> {code}
>  
> Under the Jenkins machine's shared library path /usr/lib/x86_64-linux-gnu 
> (where libcrypto lives):
> {code:java}
> -rw-r--r-- 1 root root   240136 Nov 28  2014 libcroco-0.6.so.3.0.1
> -rw-r--r-- 1 root root54550 Jun 18  2017 libcrypt.a
> -rw-r--r-- 1 root root  4306444 Sep 26  2019 libcrypto.a
> lrwxrwxrwx 1 root root   18 Sep 26  2019 libcrypto.so -> 
> libcrypto.so.1.0.0
> -rw-r--r-- 1 root root  2070976 Sep 26  2019 libcrypto.so.1.0.0
> lrwxrwxrwx 1 root root   35 Jun 18  2017 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r-- 1 root root  298 Jun 18  2017 libc.so
> {code}
>  
> Under the nodemanager's shared library path /usr/lib/x86_64-linux-gnu (where 
> libcrypto lives):
> {code:java}
> -rw-r--r--  1 root root55852 2月   7  2019 libcrypt.a
> -rw-r--r--  1 root root  4864244 9月  28  2019 libcrypto.a
> lrwxrwxrwx  1 root root   16 9月  28  2019 libcrypto.so -> 
> libcrypto.so.1.1
> -rw-r--r--  1 root root  2504576 12月 24  2019 libcrypto.so.1.0.2
> -rw-r--r--  1 root root  2715840 9月  28  2019 libcrypto.so.1.1
> lrwxrwxrwx  1 root root   35 2月   7  2019 libcrypt.so -> 
> /lib/x86_64-linux-gnu/libcrypt.so.1
> -rw-r--r--  1 root root  298 2月   7  2019 libc.so
> {code}
>  We build container-executor on the Jenkins machine, so the libcrypto.so 
> versions do not match, and we get an error when we start the nodemanager:
>  
> {code:java}
> .. 3 more Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: 
> error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared 
> object file: No such file or directory at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306)
>  ... 4 more Caused by: ExitCodeException exitCode=127: 
> /home/hadoop/hadoop/bin/container-executor: error while loading shared 
> libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file 
> or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at 
> org.apache.hadoop.util.Shell.run(Shell.java:901) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154)
>  ... 6 more 
> {code}
>  
> We should make RPATH of container-executor configurable to solve this problem 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2020-11-18 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234867#comment-17234867
 ] 

Eric Badger commented on YARN-9561:
---

I think the short answer is that you need to have openssl-devel installed on 
the machine where you are compiling Hadoop. 

From BUILDING.txt:
{noformat}
160   * Use -Drequire.openssl to fail the build if libcrypto.so is not found.
161 If this option is not specified and the openssl library is missing,
162 we silently build a version of libhadoop.so that cannot make use of
163 openssl. This option is recommended if you plan on making use of openssl
164 and want to get more repeatable builds.
165   * Use -Dopenssl.prefix to specify a nonstandard location for the libcrypto
166 header files and library files. You do not need this option if you have
167 installed openssl using a package manager.
168   * Use -Dopenssl.lib to specify a nonstandard location for the libcrypto library
169 files. Similarly to openssl.prefix, you do not need this option if you have
170 installed openssl using a package manager.
171   * Use -Dbundle.openssl to copy the contents of the openssl.lib directory into
172 the final tar file. This option requires that -Dopenssl.lib is also given,
173 and it ignores the -Dopenssl.prefix option. If -Dopenssl.lib isn't given, the
174 bundling and building will fail.
{noformat}

The crypto library is statically linked to the container-executor during 
compilation. I guess it just quietly moves on if it doesn't find it instead of 
failing. That's sort of troubling and something I'll look into fixing. But the 
answer is to make sure openssl-devel is installed before you compile Hadoop

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch, 
> YARN-9561.012.patch, YARN-9561.013.patch, YARN-9561.014.patch, 
> YARN-9561.015.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214990#comment-17214990
 ] 

Eric Badger commented on YARN-10460:


Really interesting find. Good job tracking this down.

Since we're not testing the IPC layer here, I don't see a big deal with it. 
It's not ideal, but I still think it's fairly clean. Just resetting the 
environment before the next test. So I'd be ok with this change.

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in JUnit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  < This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> 
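For anyone following along, the IllegalThreadStateException at
ThreadGroup.addUnstarted in the trace above is exactly what happens when
something tries to construct a new Thread inside a ThreadGroup that has already
been destroyed. A minimal, Hadoop-free sketch of that failure mode (hypothetical
demo code, assuming JDK 8 semantics; ThreadGroup.destroy() behaves differently
on very recent JDKs):

{code:java}
public class DestroyedThreadGroupDemo {
  public static void main(String[] args) {
    // JUnit 4.13's FailOnTimeout destroys its "FailOnTimeoutGroup" once the
    // time-limited test finishes and no live threads remain in the group.
    ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
    group.destroy();  // succeeds while the group is empty

    // A cached component (e.g. a pooled IPC client thread factory) later tries
    // to create a new thread whose parent is that group.
    try {
      Thread late = new Thread(group, () -> { }, "late-worker");  // throws here
      late.start();
    } catch (IllegalThreadStateException e) {
      // Same exception as in the stack trace: the Thread constructor calls
      // ThreadGroup.addUnstarted(), which rejects destroyed groups.
      System.out.println("cannot add a thread to a destroyed group: " + e);
    }
  }
}
{code}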

[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214891#comment-17214891
 ] 

Eric Badger commented on YARN-10450:


Thanks, [~Jim_Brennan]. I've committed this to trunk and branch-3.3. I'll wait 
for the precommit builds to come back and then will commit to the rest of the 
branches

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10450:
---
Fix Version/s: 3.3.1
   3.4.0

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214833#comment-17214833
 ] 

Eric Badger commented on YARN-10450:


[~Jim_Brennan], looks good now! +1 committing now

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, 
> YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214795#comment-17214795
 ] 

Eric Badger commented on YARN-10244:


Makes sense. Thanks, [~aajisaka]

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2

2020-10-14 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214050#comment-17214050
 ] 

Eric Badger commented on YARN-10244:


I'm pretty confused with all of the JIRAs on this. In the future, I think we 
should revert the JIRA using the JIRA that was committed. Let me summarize what 
I think happened and you all can let me know if I have it right.

YARN-4946 committed to 3.2, so it is in 3.2, 3.3, and trunk
YARN-9848 reverted YARN-4946 from 3.3, so YARN-4946 only remains in 3.2
YARN-10244 reverted YARN-4946 from 3.2, so YARN-4946 has been completely 
reverted

It's really confusing to me because YARN-4946 has the Fix Version set as 3.2. 
And then this JIRA says it is backporting YARN-9848, instead of saying it's 
reverting YARN-4946. Anyway, like I said above, if we're going to revert stuff, 
I think it is better to do it on the JIRA where it was committed so that we 
have a clear linear log of where it was committed to and reverted from. We can 
also then look at the Fix Version for that particular JIRA and know where it is 
actually committed

> backport YARN-9848 to branch-3.2
> 
>
> Key: YARN-10244
> URL: https://issues.apache.org/jira/browse/YARN-10244
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, resourcemanager
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: YARN-10244-branch-3.2.001.patch, 
> YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch
>
>
> Backporting YARN-9848 to branch-3.2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-13 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213215#comment-17213215
 ] 

Eric Badger commented on YARN-10450:


bq. Physical Mem Used % makes sense to me
Yea this works for me too. Much more clear IMO

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-12 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212632#comment-17212632
 ] 

Eric Badger commented on YARN-10450:


The patch itself looks good to me. However, I'm wondering if "Mem Utilization" 
is the correct phrase to convey what we mean. To me this means "Mem Used" / 
"Mem Avail". But in this case it's the actual utilization of the node. And "Mem 
Used" isn't really the actual memory that's being used. It's the memory that is 
allocated to that node via YARN.

[~Jim_Brennan], [~epayne] do you have any thoughts on making this terminology a 
little more clear on the UI?

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.
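
To make the naming discussion above concrete, the table is really mixing two
different ratios: memory allocated to containers by the scheduler versus memory
physically in use on the host (the value the NM reports in its status update).
A small illustrative sketch; the class and method names here are hypothetical,
not from the patch:

{code:java}
/** Illustrates the two percentages discussed above (hypothetical helper). */
public final class MemDisplay {
  private MemDisplay() {}

  /** "Mem Used" today: memory allocated to containers, as a % of the node's YARN capacity. */
  static double allocatedPercent(long allocatedMB, long capacityMB) {
    return capacityMB == 0 ? 0.0 : 100.0 * allocatedMB / capacityMB;
  }

  /** Proposed "Physical Mem Used %": memory actually in use on the host. */
  static double physicalUsedPercent(long physicalUsedMB, long physicalTotalMB) {
    return physicalTotalMB == 0 ? 0.0 : 100.0 * physicalUsedMB / physicalTotalMB;
  }

  public static void main(String[] args) {
    // Example: 96 GB node, 64 GB handed out to containers, 20 GB actually resident.
    System.out.printf("allocated: %.1f%%%n", allocatedPercent(64 * 1024L, 96 * 1024L));
    System.out.printf("physical:  %.1f%%%n", physicalUsedPercent(20 * 1024L, 96 * 1024L));
  }
}
{code}

With overcommit, the first number can sit near 100% while the second stays low
(or vice versa), which is why showing both is useful.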



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-12 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212557#comment-17212557
 ] 

Eric Badger commented on YARN-10450:


I'll review it

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch
>
>
> Add metrics to show actual cpu and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-12 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212524#comment-17212524
 ] 

Eric Badger commented on YARN-9667:
---

Thanks, [~Jim_Brennan]!

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.5, 2.10.2
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, 
> YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see those lines were 
> written twice. After consulting with [~pbacsko], it seems there is a missing 
> flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls cannot be 
> forgotten accidentally. (This can cause problems in every place where it is 
> used.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent

2020-10-09 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10455:
---
Fix Version/s: 2.10.2

Thanks for the branch-2.10 patch, [~ahussein]. I committed this to branch-2.10

> TestNMProxy.testNMProxyRPCRetry is not consistent
> -
>
> Key: YARN-10455
> URL: https://issues.apache.org/jira/browse/YARN-10455
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.1.2, 3.2.2, 3.4.0, 3.3.1, 2.10.2
>
> Attachments: YARN-10455-branch-2.10.001.patch, YARN-10455.001.patch
>
>
> The fix in YARN-8844 may fail depending on the configuration of the machine 
> running the test.
>  In some cases the address gets resolved and the unit test throws a connection 
> timeout exception instead. In such a scenario the JUnit runner times out, and 
> the main reason behind the failure is swallowed by the shutdown of the clients.
>  To make sure that the JUnit behavior is consistent, a suggested fix is to set 
> the host address to {{127.0.0.1:1}}. The latter removes the possibility of 
> collisions on non-privileged ports.
>  Also, it is more correct to catch {{SocketException}} directly rather than 
> catching IOException with a check for not {{SocketException}}.
>  
> The stack trace with such failures:
> {code:bash}
> [INFO] Running 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 24.293 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy
> [ERROR] 
> testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy)
>   Time elapsed: 20.18 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 2 
> milliseconds
>   at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
>   at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
>   at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
>   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>   at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1461)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1414)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119)
>   at com.sun.proxy.$Proxy24.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy25.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> 
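The reason {{127.0.0.1:1}} makes the test deterministic is that port 1 is
privileged and essentially never has a listener, so the connect attempt is
refused immediately instead of resolving and then hanging until a timeout. A
standalone sketch of that behavior and of catching {{SocketException}} directly
(illustrative only, not part of the patch; it assumes the OS refuses rather
than silently drops the connection):

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketException;

public class PortOneDemo {
  public static void main(String[] args) throws IOException {
    try (Socket s = new Socket()) {
      // Nothing should be listening on 127.0.0.1:1, so this fails fast.
      s.connect(new InetSocketAddress("127.0.0.1", 1), 5000);
      System.out.println("unexpectedly connected");
    } catch (SocketException e) {
      // ConnectException is a subclass of SocketException, so catching
      // SocketException directly is more precise than catching IOException
      // and then filtering.
      System.out.println("refused as expected: " + e);
    }
  }
}
{code}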

[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9667:
--
Attachment: YARN-9667-branch-2.10.001.patch

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.5
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, 
> YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see those lines were 
> written twice. After consulting with [~pbacsko], it seems there is a missing 
> flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls cannot be 
> forgotten accidentally. (This can cause problems in every place where it is 
> used.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-08 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210518#comment-17210518
 ] 

Eric Badger commented on YARN-9667:
---

Thanks, [~Jim_Brennan]! I attached another patch that should work for 
branch-2.10

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.5
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, 
> YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see those lines were 
> written twice. After consulting with [~pbacsko], it seems there is a missing 
> flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls cannot be 
> forgotten accidentally. (This can cause problems in every place where it is 
> used.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-07 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9667:
--
Attachment: YARN-9667-branch-3.2.001.patch

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see those lines were 
> written twice. After consulting with [~pbacsko], it seems there is a missing 
> flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls cannot be 
> forgotten accidentally. (This can cause problems in every place where it is 
> used.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-07 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9667:
--
Attachment: YARN-5121-branch-3.2.001.patch

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9667-001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see those lines were 
> written twice. After consulting with [~pbacsko], it seems there is a missing 
> flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls cannot be 
> forgotten accidentally. (This can cause problems in every place where it is 
> used.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-07 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9667:
--
Attachment: (was: YARN-5121-branch-3.2.001.patch)

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9667-001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see those lines were 
> written twice. After consulting with [~pbacsko], it seems there is a missing 
> flush in container-executor.c before the fork, and that causes the 
> duplication.
> I suggest adding a flush there so that the output is not duplicated: it is 
> misleading that the child process appears to write out "Getting exit code 
> file" and "Creating script paths" even though it is clearly not doing that.
> A more appealing solution would be to revisit the fprintf-fflush pairs in the 
> code and change them to a single call, so that the fflush calls cannot be 
> forgotten accidentally. (This can cause problems in every place where it is 
> used.)
> Note: this issue probably affects every occurrence of fork(), not just the 
> one from {{launch_container_as_user}} in {{main.c}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-9667) Container-executor.c duplicates messages to stdout

2020-10-07 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger reopened YARN-9667:
---

Re-opening so that we can pull this back to branch-3.2 and beyond. Hopefully 
all the way back to 2.10

> Container-executor.c duplicates messages to stdout
> --
>
> Key: YARN-9667
> URL: https://issues.apache.org/jira/browse/YARN-9667
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, yarn
>Affects Versions: 3.2.0
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9667-001.patch
>
>
> When a container is killed by its AM, we get an error message similar to this:
> {noformat}
> 2019-06-30 12:09:04,412 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 143. Privileged Execution Operation 
> Stderr:
> Stdout: main : command provided 1
> main : run as user is systest
> main : requested yarn user is systest
> Getting exit code file...
> Creating script paths...
> Writing pid file...
> Writing to tmp file 
> /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp
> Writing to cgroup task files...
> Creating local dirs...
> Launching container...
> Getting exit code file...
> Creating script paths...
> {noformat}
> In container-executor.c the fork point is right after the "Creating script 
> paths..." part, yet in the stdout log we can clearly see that the preceding 
> messages have been written twice. After consulting with [~pbacsko] it seems 
> there is a missing flush in container-executor.c before the fork, and that 
> causes the duplication.
> I suggest adding a flush there so that the output is not duplicated: it is a 
> bit misleading that the child process writes out "Getting exit code file" and 
> "Creating script paths" even though it is clearly not doing that.
> A more appealing solution could be to revisit the fprintf-fflush pairs in the 
> code and change them into a single call, so that the fflush calls cannot be 
> forgotten accidentally. (A forgotten flush can cause problems everywhere 
> fork() is used.)
> Note: this issue probably affects every occurrence of fork(), not just the one 
> from {{launch_container_as_user}} in {{main.c}}.
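> As a hedged sketch of the "single call" idea above, a small variadic wrapper 
> (the name print_and_flush is invented for this example and does not exist in 
> container-executor.c) could pair each fprintf with an immediate fflush so the 
> flush cannot be forgotten before a later fork():
> {noformat}
> #include <stdarg.h>
> #include <stdio.h>
>
> /* Hypothetical helper: format to the given stream, then flush right away. */
> static void print_and_flush(FILE *stream, const char *fmt, ...) {
>   va_list args;
>   va_start(args, fmt);
>   vfprintf(stream, fmt, args);
>   va_end(args);
>   fflush(stream);
> }
>
> int main(void) {
>   print_and_flush(stdout, "Creating script paths...\n");
>   print_and_flush(stdout, "Writing pid file to %s\n", "/tmp/example.pid");
>   return 0;
> }
> {noformat}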



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809-branch-3.2.009.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202314#comment-17202314
 ] 

Eric Badger commented on YARN-9809:
---

So close. Those pesky unit tests. Patch 009 fixes the unit test failure. Thanks 
for the review, [~Jim_Brennan]!

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, 
> YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, 
> YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, 
> YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201814#comment-17201814
 ] 

Eric Badger commented on YARN-9809:
---

I've attached branch-3.2 patch 008 to address your comments, [~Jim_Brennan]. I 
think I got all of the unit tests to pass. But 
TestCombinedSystemMetricsPublisher, TestSystemMetricsPublisherForV2, 
TestFSSchedulerConfigurationStore, and TestZKConfigurationStore failed for me 
locally on straight up branch-3.2

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-24 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809-branch-3.2.008.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, 
> YARN-9809-branch-3.2.008.patch, YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201782#comment-17201782
 ] 

Eric Badger commented on YARN-9809:
---

{noformat}
RMNodeImpl#AddNodeTransition#transition
RMNodeStatusEvent rmNodeStatusEvent =
new RMNodeStatusEvent(nodeId, nodeStatus);

NodeHealthStatus nodeHealthStatus =
updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent);

if (nodeHealthStatus.getIsNodeHealthy()) {
{noformat}
bq. Do we run the risk of nodeHealthStatus being null?

[~epayne], nope, we should be fine here. {{nodeHealthStatus}} comes from the 
return value of {{updateRMNodeFromStatusEvents}}, and that return value comes from 
{{statusEvent.getNodeHealthStatus()}}. But {{statusEvent}} is passed into that 
method as an argument. On the caller side that argument is named 
{{rmNodeStatusEvent}}, and it is created a few lines up via the RMNodeStatusEvent 
constructor. The {{nodeStatus}} is set there via the constructor, and we know it 
won't be null because we are in the "else" of the "if" statement that checked for 
{{nodeStatus}} being null.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-23 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201116#comment-17201116
 ] 

Eric Badger commented on YARN-9809:
---

Thanks for the initial reviews, [~epayne] and [~Jim_Brennan]! I will put up an 
updated patch soon with changes related to your comments. I also noticed some 
other issues that are manifesting as the unit test failures. So I will fix 
those as well.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-21 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199707#comment-17199707
 ] 

Eric Badger commented on YARN-9809:
---

[~epayne], [~Jim_Brennan], sorry for the delay. I have put up a patch for 
branch-3.2. However, I think this needs another round of review because the 
cherry-pick diff was quite large and I had to redo a lot of it by hand. So in a 
lot of ways this is a completely new patch. I think I caught all of the unit tests 
that would have failed, but we'll see what HadoopQA says.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-21 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger reopened YARN-9809:
---

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-09-21 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809-branch-3.2.007.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, 
> YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, 
> YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10391) --module-gpu functionality is broken in container-executor

2020-08-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179152#comment-17179152
 ] 

Eric Badger commented on YARN-10391:


Thanks, [~Jim_Brennan]!

> --module-gpu functionality is broken in container-executor
> --
>
> Key: YARN-10391
> URL: https://issues.apache.org/jira/browse/YARN-10391
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10391.001.patch
>
>
> {{--module-gpu}} doesn't set the {{operation}} variable, and so the 
> {{main()}} function's switch statement on {{operation}} falls through to the 
> default case. This causes it to report a failure, even though it succeeded. 
> {noformat}
>   default:
> fprintf(ERRORFILE, "Unexpected operation code: %d\n", operation);
> exit_code = INVALID_COMMAND_PROVIDED;
> break;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10391) --module-gpu functionality is broken in container-executor

2020-08-14 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10391:
---
Attachment: YARN-10391.001.patch

> --module-gpu functionality is broken in container-executor
> --
>
> Key: YARN-10391
> URL: https://issues.apache.org/jira/browse/YARN-10391
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10391.001.patch
>
>
> {{--module-gpu}} doesn't set the {{operation}} variable, and so the 
> {{main()}} function's switch statement on {{operation}} falls through to the 
> default case. This causes it to report a failure, even though it succeeded. 
> {noformat}
>   default:
> fprintf(ERRORFILE, "Unexpected operation code: %d\n", operation);
> exit_code = INVALID_COMMAND_PROVIDED;
> break;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10391) --module-gpu functionality is broken in container-executor

2020-08-06 Thread Eric Badger (Jira)
Eric Badger created YARN-10391:
--

 Summary: --module-gpu functionality is broken in container-executor
 Key: YARN-10391
 URL: https://issues.apache.org/jira/browse/YARN-10391
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Eric Badger
Assignee: Eric Badger


{{--module-gpu}} doesn't set the {{operation}} variable, and so the {{main()}} 
function's switch statement on {{operation}} falls through to the default case. 
This causes it to report a failure, even though it succeeded. 

{noformat}
  default:
fprintf(ERRORFILE, "Unexpected operation code: %d\n", operation);
exit_code = INVALID_COMMAND_PROVIDED;
break;
{noformat}
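
A minimal standalone illustration of the failure mode and the shape of a fix (the 
enum value, flag parsing, and constants below are made up for this sketch and are 
not the actual container-executor code): if handling {{--module-gpu}} never assigns 
{{operation}}, the switch ends up in the default case and an error is reported even 
though the module work already succeeded.

{noformat}
#include <stdio.h>
#include <string.h>

enum operations { INVALID = 0, RUN_GPU_MODULE };
#define INVALID_COMMAND_PROVIDED 2

int main(int argc, char **argv) {
  int operation = INVALID;
  int exit_code = 0;

  if (argc > 1 && strcmp(argv[1], "--module-gpu") == 0) {
    /* ... do the GPU module work here ... */
    /* The bug pattern: if this assignment is missing, 'operation' stays
       INVALID and the switch below reports a failure despite success. */
    operation = RUN_GPU_MODULE;
  }

  switch (operation) {
  case RUN_GPU_MODULE:
    printf("gpu module handled\n");
    break;
  default:
    fprintf(stderr, "Unexpected operation code: %d\n", operation);
    exit_code = INVALID_COMMAND_PROVIDED;
    break;
  }
  return exit_code;
}
{noformat}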



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2020-08-06 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-7677:
--
Fix Version/s: 2.10.1

Thanks for the patch, [~Jim_Brennan]! +1. I committed this to branch-2.10

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0, 2.10.1
>
> Attachments: YARN-7677-branch-2.10.001.patch, 
> YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2020-08-05 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-7677:
--
Attachment: (was: YARN-7677-branch-2.10.002.patch)

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0
>
> Attachments: YARN-7677-branch-2.10.001.patch, 
> YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2020-08-05 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-7677:
--
Attachment: YARN-7677-branch-2.10.002.patch

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0
>
> Attachments: YARN-7677-branch-2.10.001.patch, 
> YARN-7677-branch-2.10.002.patch, YARN-7677-branch-2.10.002.patch, 
> YARN-7677.001.patch, YARN-7677.002.patch, YARN-7677.003.patch, 
> YARN-7677.004.patch, YARN-7677.005.patch, YARN-7677.006.patch, 
> YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-08-05 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-4575:
--
Fix Version/s: 2.10.1
   3.1.4

Thanks for the updated patch, [~epayne]! +1. I've committed this to branch-3.1 
and branch-2.10.

It's now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and 
branch-2.10

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, 
> YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch, 
> YARN-4575.branch-3.1..005.patch
>
>
> ApplicationResourceUsageReport reserved resource report is only for the default 
> partition; it should cover all partitions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-08-05 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-4575:
--
Fix Version/s: 3.3.1
   3.4.0
   3.2.2

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Fix For: 3.2.2, 3.4.0, 3.3.1
>
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, 
> YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch
>
>
> ApplicationResourceUsageReport reserved resource report is only for the default 
> partition; it should cover all partitions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-08-05 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171701#comment-17171701
 ] 

Eric Badger commented on YARN-4575:
---

+1 lgtm. The test failures are unrelated to this patch. I've committed this to 
trunk (3.4), branch-3.3, and branch-3.2. The cherry-pick to branch-3.1 is 
clean, but compilation fails. [~epayne], if you'd like it to go all the way 
back to 2.10, could you put up an additional patch for branch-3.1 (and 
branch-2.10 if necessary)? Thanks

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, 
> YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch
>
>
> ApplicationResourceUsageReport reserved resource report is only for the default 
> partition; it should cover all partitions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2020-08-04 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170951#comment-17170951
 ] 

Eric Badger commented on YARN-7677:
---

Thanks, [~aajisaka]!


Cancelling and resubmitting the patch to rerun Hadoop QA

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0
>
> Attachments: YARN-7677-branch-2.10.001.patch, 
> YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2020-08-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170464#comment-17170464
 ] 

Eric Badger commented on YARN-7677:
---

[~Jim_Brennan], I'm +1 on this patch. I'll give a day for others to chime in if 
they'd like to.

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0
>
> Attachments: YARN-7677-branch-2.10.001.patch, 
> YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2020-08-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170463#comment-17170463
 ] 

Eric Badger commented on YARN-7677:
---

I think https://issues.apache.org/jira/browse/HADOOP-17091 is related to the 
javadoc failures we're seeing. It was committed yesterday and adds 
{{YETUS_ARGS+=("--mvn-javadoc-goals=process-sources,javadoc:javadoc-no-fork")}}

[~aajisaka], can you take a look and take appropriate action on HADOOP-17091 if 
you think this is related?

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0
>
> Attachments: YARN-7677-branch-2.10.001.patch, 
> YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10363) TestRMAdminCLI.testHelp is failing in branch-2.10

2020-07-31 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10363:
---
Fix Version/s: 2.10.1
   3.1.4

+1. Thanks for the patch, [~BilwaST] and for the review, [~Jim_Brennan]. I've 
committed this to branch-3.1 and branch-2.10. 

> TestRMAdminCLI.testHelp is failing in branch-2.10
> -
>
> Key: YARN-10363
> URL: https://issues.apache.org/jira/browse/YARN-10363
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.10.1
>Reporter: Jim Brennan
>Assignee: Bilwa S T
>Priority: Major
> Fix For: 3.1.4, 2.10.1
>
> Attachments: YARN-10363-branch-2.10.patch
>
>
> TestRMAdminCLI.testHelp is failing in branch-2.10.
> Example failure:
> {noformat}
> ---
> Test set: org.apache.hadoop.yarn.client.cli.TestRMAdminCLI
> ---
> Tests run: 31, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 18.668 s <<< 
> FAILURE! - in org.apache.hadoop.yarn.client.cli.TestRMAdminCLI
> testHelp(org.apache.hadoop.yarn.client.cli.TestRMAdminCLI)  Time elapsed: 
> 0.043 s  <<< FAILURE!
> java.lang.AssertionError: 
> Expected error message: 
> Usage: yarn rmadmin [-failover [--forcefence] [--forceactive]  
> ] is not included in messages: 
> Usage: yarn rmadmin
>-refreshQueues 
>-refreshNodes [-g|graceful [timeout in seconds] -client|server]
>-refreshNodesResources 
>-refreshSuperUserGroupsConfiguration 
>-refreshUserToGroupsMappings 
>-refreshAdminAcls 
>-refreshServiceAcl 
>-getGroups [username]
>-addToClusterNodeLabels 
> <"label1(exclusive=true),label2(exclusive=false),label3">
>-removeFromClusterNodeLabels  (label splitted by ",")
>-replaceLabelsOnNode <"node1[:port]=label1,label2 
> node2[:port]=label1,label2"> [-failOnUnknownNodes] 
>-directlyAccessNodeLabelStore 
>-refreshClusterMaxPriority 
>-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout])
>-help [cmd]
> Generic options supported are:
> -conf specify an application configuration file
> -Ddefine a value for a given property
> -fs  specify default filesystem URL to use, 
> overrides 'fs.defaultFS' property from configurations.
> -jt   specify a ResourceManager
> -files specify a comma-separated list of files to 
> be copied to the map reduce cluster
> -libjarsspecify a comma-separated list of jar files 
> to be included in the classpath
> -archives   specify a comma-separated list of archives 
> to be unarchived on the compute machines
> The general command line syntax is:
> command [genericOptions] [commandOptions]
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testError(TestRMAdminCLI.java:859)
>   at 
> org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:585)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> 

[jira] [Commented] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource

2020-07-31 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169118#comment-17169118
 ] 

Eric Badger commented on YARN-4575:
---

[~epayne], the content looks good, but could you address the 2 checkstyle 
issues?

> ApplicationResourceUsageReport should return ALL  reserved resource
> ---
>
> Key: YARN-4575
> URL: https://issues.apache.org/jira/browse/YARN-4575
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin Chundatt
>Priority: Major
>  Labels: oct16-easy
> Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, 
> YARN-4575.003.patch, YARN-4575.004.patch
>
>
> ApplicationResourceUsageReport reserved resource report is only for the default 
> partition; it should cover all partitions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-24 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-4771:
--
Fix Version/s: 3.3.1
   2.10.1
   3.1.4
   3.2.2

I cherry-picked this back to branch-2.10. 

It has now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, 
and branch-2.10

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2020-07-24 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-4771:
--
Fix Version/s: 3.4.0

Thanks for the updated patch, [~Jim_Brennan]. And thanks to [~jlowe] for the 
original patch. +1. I've committed this to trunk (3.4)

> Some containers can be skipped during log aggregation after NM restart
> --
>
> Key: YARN-4771
> URL: https://issues.apache.org/jira/browse/YARN-4771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.10.0, 3.2.1, 3.1.3
>Reporter: Jason Darrell Lowe
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-4771.001.patch, YARN-4771.002.patch, 
> YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving 
> nodemanager restart if the following events occur:
> # Container completes more than 
> yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the 
> restart
> # At least one other container completes after the above container and before 
> the restart



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-20 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10353:
---
Fix Version/s: 3.4.0

> Log vcores used and cumulative cpu in containers monitor
> 
>
> Key: YARN-10353
> URL: https://issues.apache.org/jira/browse/YARN-10353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10353.001.patch, YARN-10353.002.patch
>
>
> We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
> Containers Monitor log. It would be useful to also log vcores used vs vcores 
> assigned, and total accumulated CPU time.
> For example, currently we have an audit log that looks like this:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625
> {noformat}
> The proposal is to add two more fields to show vCores and Cumulative CPU ms:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625 vCores:2/1 CPU-ms:4180
> {noformat}
> This is a snippet of a log from one of our clusters running branch-2.8 with a 
> similar change.
> {noformat}
> 2020-07-16 21:00:02,240 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for 
> container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 
> 10 CPU vCores used. Cumulative CPU time: 157410
> 2020-07-16 21:00:02,269 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for 
> container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 113830
> 2020-07-16 21:00:02,298 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for 
> container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 
> 10 CPU vCores used. Cumulative CPU time: 128630
> 2020-07-16 21:00:02,339 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for 
> container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 96060
> 2020-07-16 21:00:02,367 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for 
> container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB 
> physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 
> 10 CPU vCores used. Cumulative CPU time: 116820
> 2020-07-16 21:00:02,396 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for 
> container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB 
> physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 
> 10 CPU vCores used. Cumulative CPU time: 45900
> 2020-07-16 21:00:02,424 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for 
> container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB 
> physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 
> 10 CPU vCores used. Cumulative CPU time: 2572390
> 2020-07-16 21:00:02,456 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for 
> container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 101210
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160127#comment-17160127
 ] 

Eric Badger commented on YARN-10353:


bq. Would Vcores: 2 of 10 be better?
Yea, this is way more intuitive to me. Though it would be 2 of 1 in this case, 
right?

bq. I was thinking spaces delimit fields, and colons delimit label vs value.
This is fine too. You could do something like vCores used/vCores available:2/1 
or something similar. That way you know what the values mean and it's still 
easily parsable

Agreed on the CPU naming. When I see a number over 100 I assume the number is 
cpu ms or something else, not a percentage. And I see now that I incorrectly 
thought the logs in your 2.8 snippet were all from the same job, which is why I was 
confused by the numbers jumping all over the place. I think CPU-ms is intuitive in this case 
because you can see that it is monotonically increasing. To be more clear, you 
could use "Total-CPU-ms", "Cumulative-CPU-ms", or "Accumulated-CPU-ms". But I'm 
ok with it either way.



> Log vcores used and cumulative cpu in containers monitor
> 
>
> Key: YARN-10353
> URL: https://issues.apache.org/jira/browse/YARN-10353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10353.001.patch
>
>
> We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
> Containers Monitor log. It would be useful to also log vcores used vs vcores 
> assigned, and total accumulated CPU time.
> For example, currently we have an audit log that looks like this:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625
> {noformat}
> The proposal is to add two more fields to show vCores and Cumulative CPU ms:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625 vCores:2/1 CPU-ms:4180
> {noformat}
> This is a snippet of a log from one of our clusters running branch-2.8 with a 
> similar change.
> {noformat}
> 2020-07-16 21:00:02,240 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for 
> container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 
> 10 CPU vCores used. Cumulative CPU time: 157410
> 2020-07-16 21:00:02,269 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for 
> container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 113830
> 2020-07-16 21:00:02,298 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for 
> container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 
> 10 CPU vCores used. Cumulative CPU time: 128630
> 2020-07-16 21:00:02,339 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for 
> container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 96060
> 2020-07-16 21:00:02,367 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for 
> container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB 
> physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 
> 10 CPU vCores used. Cumulative CPU time: 116820
> 2020-07-16 21:00:02,396 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for 
> container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB 
> physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 
> 10 CPU vCores used. Cumulative CPU time: 45900
> 2020-07-16 21:00:02,424 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for 
> container-id 

[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160111#comment-17160111
 ] 

Eric Badger commented on YARN-10353:


I like the change, but the log isn't very intuitive to me. {{vCores:2/1}} is 
confusing to me. Looking at the code it looks like this is number of vCores 
actually used over number of vCores allocated to the container. Personally, I 
like the way it's logged in the branch-2.8 snippet better. And {{CPU-ms}} is 
the CPU-ms since the monitor last ran, right? I suppose that one probably can't 
be much clearer than it is without a decent amount of text

> Log vcores used and cumulative cpu in containers monitor
> 
>
> Key: YARN-10353
> URL: https://issues.apache.org/jira/browse/YARN-10353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10353.001.patch
>
>
> We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
> Containers Monitor log. It would be useful to also log vcores used vs vcores 
> assigned, and total accumulated CPU time.
> For example, currently we have an audit log that looks like this:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625
> {noformat}
> The proposal is to add two more fields to show vCores and Cumulative CPU ms:
> {noformat}
> 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
> (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
> 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
> physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
> CPU/core:35.772625 vCores:2/1 CPU-ms:4180
> {noformat}
> This is a snippet of a log from one of our clusters running branch-2.8 with a 
> similar change.
> {noformat}
> 2020-07-16 21:00:02,240 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for 
> container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 
> 10 CPU vCores used. Cumulative CPU time: 157410
> 2020-07-16 21:00:02,269 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for 
> container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 113830
> 2020-07-16 21:00:02,298 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for 
> container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB 
> physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 
> 10 CPU vCores used. Cumulative CPU time: 128630
> 2020-07-16 21:00:02,339 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for 
> container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 
> of 10 CPU vCores used. Cumulative CPU time: 96060
> 2020-07-16 21:00:02,367 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for 
> container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB 
> physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 
> 10 CPU vCores used. Cumulative CPU time: 116820
> 2020-07-16 21:00:02,396 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for 
> container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB 
> physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 
> 10 CPU vCores used. Cumulative CPU time: 45900
> 2020-07-16 21:00:02,424 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for 
> container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB 
> physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 
> 10 CPU vCores used. Cumulative CPU time: 2572390
> 2020-07-16 21:00:02,456 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for 
> container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 
> GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU 

[jira] [Updated] (YARN-10348) Allow RM to always cancel tokens after app completes

2020-07-14 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10348:
---
Fix Version/s: 3.1.5
   2.10.1
   3.2.2

Thanks for the new patch, [~Jim_Brennan]! I've committed this through 
branch-2.10. So it's now been committed to: trunk (3.4), branch-3.3, 
branch-3.2, branch-3.1, and branch-2.10

> Allow RM to always cancel tokens after app completes
> 
>
> Key: YARN-10348
> URL: https://issues.apache.org/jira/browse/YARN-10348
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.10.0, 3.1.3
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10348-branch-3.2.001.patch, YARN-10348.001.patch, 
> YARN-10348.002.patch
>
>
> (Note: this change was originally done on our internal branch by [~daryn]).
> The RM currently has an option for a client to specify disabling token 
> cancellation when a job completes. This feature was an initial attempt to 
> address the use case of a job launching sub-jobs (ie. oozie launcher) and the 
> original job finishing prior to the sub-job(s) completion - ex. original job 
> completion triggered premature cancellation of tokens needed by the sub-jobs.
> Many years ago, [~daryn] added a more robust implementation to ref count 
> tokens ([YARN-3055]). This prevented premature cancellation of the token 
> until all apps using the token complete, and invalidated the need for a 
> client to specify cancel=false. Unfortunately the config option was not 
> removed.
> We have seen cases where oozie "java actions" and some users were explicitly 
> disabling token cancellation. This can lead to a buildup of defunct tokens 
> that may overwhelm the ZK buffer used by the KDC's backing store. At which 
> point the KMS fails to connect to ZK and is unable to issue/validate new 
> tokens - rendering the KDC only able to authenticate pre-existing tokens. 
> Production incidents have occurred due to the buffer size issue.
> To avoid these issues, the RM should have the option to ignore/override the 
> client's request to not cancel tokens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes

2020-07-13 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157030#comment-17157030
 ] 

Eric Badger commented on YARN-10348:


+1 (binding), I committed this to trunk (3.4) and branch-3.3. I attempted to 
cherry-pick to branch-3.2 and there were some conflicts. [~Jim_Brennan], would 
you mind putting up a patch for branch-3.2?

> Allow RM to always cancel tokens after app completes
> 
>
> Key: YARN-10348
> URL: https://issues.apache.org/jira/browse/YARN-10348
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.10.0, 3.1.3
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10348.001.patch, YARN-10348.002.patch
>
>
> (Note: this change was originally done on our internal branch by [~daryn]).
> The RM currently has an option for a client to specify disabling token 
> cancellation when a job completes. This feature was an initial attempt to 
> address the use case of a job launching sub-jobs (ie. oozie launcher) and the 
> original job finishing prior to the sub-job(s) completion - ex. original job 
> completion triggered premature cancellation of tokens needed by the sub-jobs.
> Many years ago, [~daryn] added a more robust implementation to ref count 
> tokens ([YARN-3055]). This prevented premature cancellation of the token 
> until all apps using the token complete, and invalidated the need for a 
> client to specify cancel=false. Unfortunately the config option was not 
> removed.
> We have seen cases where oozie "java actions" and some users were explicitly 
> disabling token cancellation. This can lead to a buildup of defunct tokens 
> that may overwhelm the ZK buffer used by the KDC's backing store. At which 
> point the KMS fails to connect to ZK and is unable to issue/validate new 
> tokens - rendering the KDC only able to authenticate pre-existing tokens. 
> Production incidents have occurred due to the buffer size issue.
> To avoid these issues, the RM should have the option to ignore/override the 
> client's request to not cancel tokens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10348) Allow RM to always cancel tokens after app completes

2020-07-13 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10348:
---
Fix Version/s: 3.3.1
   3.4.0

> Allow RM to always cancel tokens after app completes
> 
>
> Key: YARN-10348
> URL: https://issues.apache.org/jira/browse/YARN-10348
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.10.0, 3.1.3
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10348.001.patch, YARN-10348.002.patch
>
>
> (Note: this change was originally done on our internal branch by [~daryn]).
> The RM currently has an option for a client to specify disabling token 
> cancellation when a job completes. This feature was an initial attempt to 
> address the use case of a job launching sub-jobs (ie. oozie launcher) and the 
> original job finishing prior to the sub-job(s) completion - ex. original job 
> completion triggered premature cancellation of tokens needed by the sub-jobs.
> Many years ago, [~daryn] added a more robust implementation to ref count 
> tokens ([YARN-3055]). This prevented premature cancellation of the token 
> until all apps using the token complete, and invalidated the need for a 
> client to specify cancel=false. Unfortunately the config option was not 
> removed.
> We have seen cases where oozie "java actions" and some users were explicitly 
> disabling token cancellation. This can lead to a buildup of defunct tokens 
> that may overwhelm the ZK buffer used by the KDC's backing store. At which 
> point the KMS fails to connect to ZK and is unable to issue/validate new 
> tokens - rendering the KDC only able to authenticate pre-existing tokens. 
> Production incidents have occurred due to the buffer size issue.
> To avoid these issues, the RM should have the option to ignore/override the 
> client's request to not cancel tokens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-30 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148906#comment-17148906
 ] 

Eric Badger commented on YARN-9809:
---

Thanks, [~eyang] for the review and commit and [~Jim_Brennan] for the review!

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-30 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148829#comment-17148829
 ] 

Eric Badger commented on YARN-9809:
---

Thanks for the review, [~eyang]! Are you planning on committing this or would 
you like me to?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-26 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146522#comment-17146522
 ] 

Eric Badger commented on YARN-9809:
---

Thanks, [~Jim_Brennan]! [~eyang], would you take another look?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145893#comment-17145893
 ] 

Eric Badger commented on YARN-9809:
---

Good catch, [~Jim_Brennan]. {{updateMetricsForRejoinedNode()}} is only called 
in one other place, and I don't want to add the node and then remove it again. 
So I removed the increment from {{updateMetricsForRejoinedNode()}} and 
explicitly added it just before the other call site of 
{{updateMetricsForRejoinedNode()}}.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.007.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch, YARN-9809.007.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145717#comment-17145717
 ] 

Eric Badger commented on YARN-9809:
---

The TestFairScheduler and TestFairSchedulerPreemption test failures are 
unrelated to this JIRA as they have also been reported in 
https://issues.apache.org/jira/browse/YARN-10329

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145711#comment-17145711
 ] 

Eric Badger commented on YARN-9809:
---

Patch 006 moves {{ClusterMetrics.getMetrics().incrNumActiveNodes();}} into 
{{reportNodeRunning}} inside of the AddNodeTransition. This fixes the failing 
unit test and prevents a scenario where we add an unhealthy node as RUNNING and 
then quickly switch it to UNHEALTHY. This way we go straight to UNHEALTHY.
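
As a toy illustration of that ordering (a simplified model, not the actual
RMNodeImpl/ClusterMetrics code):

{noformat:title=Toy model of the patch 006 ordering (not the real code)}
import java.util.concurrent.atomic.AtomicInteger;

public class AddNodeTransitionSketch {
  // Stand-ins for the ClusterMetrics counters.
  static final AtomicInteger activeNodes = new AtomicInteger();
  static final AtomicInteger unhealthyNodes = new AtomicInteger();

  // Models AddNodeTransition: the health status supplied at registration
  // decides which counter moves, so an unhealthy node never shows up as
  // RUNNING first.
  static void addNode(boolean healthyAtRegistration) {
    if (healthyAtRegistration) {
      reportNodeRunning();
    } else {
      unhealthyNodes.incrementAndGet();   // straight to UNHEALTHY
    }
  }

  // The increment lives here (the "reportNodeRunning" path) instead of being
  // done unconditionally when the node is added.
  static void reportNodeRunning() {
    activeNodes.incrementAndGet();
  }

  public static void main(String[] args) {
    addNode(false);
    // Prints active=0 unhealthy=1: no bounce through the active count.
    System.out.println("active=" + activeNodes.get()
        + " unhealthy=" + unhealthyNodes.get());
  }
}
{noformat}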

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-25 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.006.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, 
> YARN-9809.006.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-24 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144544#comment-17144544
 ] 

Eric Badger commented on YARN-9809:
---

Thanks for the review, [~Jim_Brennan]! I've uploaded patch 005 to fix the 
things you mentioned in your comments

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-24 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.005.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138899#comment-17138899
 ] 

Eric Badger commented on YARN-9809:
---

I can see pros and cons to both approaches. On the one hand, if the health 
check script fails to execute properly, that's not good and could imply 
something bad. On the other hand, health check scripts are pretty dangerous, 
since a badly written one can take out an entire cluster. If someone updates 
the script and it suddenly starts erroring out, the whole cluster goes 
unhealthy. Or the script could rely on querying a service that times out: the 
node is healthy, but the health check script returns an error. Unless you 
parse for specific error codes, you can no longer differentiate between the 
script failing internally and the script running successfully and reporting 
that the node is unhealthy.

Regardless of this discussion, though, it is outside the scope of this JIRA. 
That's an issue with how the health check script is handled, while this JIRA 
is just about providing a health status at NM startup.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-17 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138892#comment-17138892
 ] 

Eric Badger commented on YARN-9809:
---

{noformat:title=NodeHealthScriptRunner.newInstance()}
if (!shouldRun(scriptName, nodeHealthScript)) {
  return null;
}
{noformat}

{noformat:title=NodeHealthScriptRunner.shouldRun()}
  static boolean shouldRun(String script, String healthScript) {
    if (healthScript == null || healthScript.trim().isEmpty()) {
      LOG.info("Missing location for the node health check script \"{}\".",
          script);
      return false;
    }
{noformat}

If the health check script doesn't exist, then {{shouldRun}} will return false 
and {{newInstance}} will return null. This causes the health reporter to not 
be added as a service. So at the end of the day, your statement is correct: if 
the health check script doesn't exist, the node will report as healthy.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-16 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137934#comment-17137934
 ] 

Eric Badger commented on YARN-9809:
---

{noformat}
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesSchedulerActivities
hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
{noformat}
Neither of these tests fails for me locally, and both are unrelated to the 
changes made in patch 004.

Both the javac and the javadoc errors come from generated protobuf java files. 
I don't know how to get rid of them, but they aren't introducing any warnings 
that don't already exist, so I think they're fine. The real issue is how those 
java files are generated.

[~Jim_Brennan], [~ccondit], [~eyang], could you guys review patch 004?

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-15 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136095#comment-17136095
 ] 

Eric Badger commented on YARN-9809:
---

Patch 004 fixes checkstyle. There is still the javac error about PARSER being 
deprecated, but it comes from a generated proto file, so I'm not quite sure 
what to do about it. PARSER is used in many other places within the same 
generated file.

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-15 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.004.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch, YARN-9809.004.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-12 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.003.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch, 
> YARN-9809.003.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10312) Add support for yarn logs -logFile to retain backward compatibility

2020-06-12 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10312:
---
Fix Version/s: 3.1.5
   2.10.1
   3.2.2

Thanks for the new patch, [~Jim_Brennan]! I committed this all the way to 
branch-2.10.

Overall it has now been committed to trunk, branch-3.3, branch-3.2, branch-3.1, 
and branch-2.10

> Add support for yarn logs -logFile to retain backward compatibility
> ---
>
> Key: YARN-10312
> URL: https://issues.apache.org/jira/browse/YARN-10312
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.10.0, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: compatibility
> Fix For: 3.2.2, 2.10.1, 3.3.1, 3.1.5, 3.4.1
>
> Attachments: YARN-10312-branch-3.2.001.patch, YARN-10312.001.patch
>
>
> The YARN CLI logs command line option {{-logFiles}} was changed to 
> {{-log_files}}  in 2.9 and later releases.   This change was made as part of 
> YARN-5363.
> Verizon Media is in the process of moving from Hadoop-2.8 to Hadoop-2.10, and 
> while testing integration with Spark, we ran into this issue.   We are 
> concerned that we will run into more cases of this as we roll out to 
> production, and rather than break user scripts, we'd prefer to add 
> {{-logFiles}} as an alias of {{-log_files}}.  If both are provided, 
> {{-logFiles}} will be ignored.
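
A minimal sketch of the alias idea with commons-cli (illustrative only; the
option wiring here is a simplification, not the actual LogsCLI code):

{noformat:title=Sketch of the -logFiles alias (illustrative, not the actual LogsCLI)}
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class LogsCliAliasSketch {
  public static void main(String[] args) throws ParseException {
    Options opts = new Options();
    opts.addOption("log_files", true, "log files to fetch");
    opts.addOption("logFiles", true, "deprecated alias of -log_files");

    CommandLine cl = new DefaultParser().parse(opts, args);
    String logFiles;
    if (cl.hasOption("log_files")) {
      // Preferred spelling wins; -logFiles is ignored when both are given.
      logFiles = cl.getOptionValue("log_files");
    } else if (cl.hasOption("logFiles")) {
      // Legacy 2.8-style spelling still accepted.
      logFiles = cl.getOptionValue("logFiles");
    } else {
      logFiles = "ALL";
    }
    System.out.println("fetching: " + logFiles);
  }
}
{noformat}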



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10312) Add support for yarn logs -logFile to retain backward compatibility

2020-06-11 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10312:
---
Fix Version/s: 3.4.1
   3.3.1

+1 I committed this to trunk and branch-3.3. The cherry-pick back to branch-3.2 
is clean, but the build fails to compile due to a missing method. 
[~Jim_Brennan], can you put up an additional patch for branch-3.2?

> Add support for yarn logs -logFile to retain backward compatibility
> ---
>
> Key: YARN-10312
> URL: https://issues.apache.org/jira/browse/YARN-10312
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.10.0, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: compatibility
> Fix For: 3.3.1, 3.4.1
>
> Attachments: YARN-10312.001.patch
>
>
> The YARN CLI logs command line option {{-logFiles}} was changed to 
> {{-log_files}}  in 2.9 and later releases.   This change was made as part of 
> YARN-5363.
> Verizon Media is in the process of moving from Hadoop-2.8 to Hadoop-2.10, and 
> while testing integration with Spark, we ran into this issue.   We are 
> concerned that we will run into more cases of this as we roll out to 
> production, and rather than break user scripts, we'd prefer to add 
> {{-logFiles}} as an alias of {{-log_files}}.  If both are provided, 
> {{-logFiles}} will be ignored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-11 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133540#comment-17133540
 ] 

Eric Badger commented on YARN-10300:


Thanks, [~epayne] for the review/commit and [~Jim_Brennan] for the review!

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch, 
> YARN-10300.003.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-08 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128713#comment-17128713
 ] 

Eric Badger commented on YARN-10300:


[~epayne], thanks for the review! In patch 003 I've modified an additional 
existing test to better test the {{createAppSummary()}} code. The unit test 
fails without the code change and succeeds with it.

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch, 
> YARN-10300.003.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10300:
---
Attachment: YARN-10300.003.patch

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch, 
> YARN-10300.003.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-08 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128686#comment-17128686
 ] 

Eric Badger commented on YARN-9809:
---

Attaching patch 002 to address unit test failures

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-08 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.002.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch, YARN-9809.002.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-05 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127025#comment-17127025
 ] 

Eric Badger commented on YARN-9809:
---

Patch 001 adds the feature but makes it opt-in via the config 
{{yarn.nodemanager.health-checker.run-before-startup}}. I didn't put in the 
retries flag for shutting down the NM if there are a certain number of 
failures. I can do that in a subsequent patch if you'd like. But I tested this 
patch out and it seems to work.
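
For reference, enabling the opt-in would presumably look like the following
sketch (the key is the one quoted above; the default of false is my
assumption):

{noformat:title=Enabling the opt-in (sketch; default value is an assumption)}
import org.apache.hadoop.conf.Configuration;

public class HealthCheckBeforeStartupConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Run the health checker once before registering with the RM, so the
    // registration carries a real health status instead of defaulting to
    // healthy until the first heartbeat.
    conf.setBoolean("yarn.nodemanager.health-checker.run-before-startup", true);
    System.out.println(conf.getBoolean(
        "yarn.nodemanager.health-checker.run-before-startup", false));
  }
}
{noformat}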

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM

2020-06-05 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-9809:
--
Attachment: YARN-9809.001.patch

> NMs should supply a health status when registering with RM
> --
>
> Key: YARN-9809
> URL: https://issues.apache.org/jira/browse/YARN-9809
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9809.001.patch
>
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125139#comment-17125139
 ] 

Eric Badger edited comment on YARN-10300 at 6/3/20, 5:09 PM:
-

Thanks for the review, [~Jim_Brennan]! [~epayne], would you mind looking at 
this patch as well to give a binding review?


was (Author: ebadger):
[~epayne], would you mind looking at this patch as well to give a binding 
review?

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-03 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125139#comment-17125139
 ] 

Eric Badger commented on YARN-10300:


[~epayne], would you mind looking at this patch as well to give a binding 
review?

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124461#comment-17124461
 ] 

Eric Badger commented on YARN-10300:


Added some null checks to patch 002

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10300:
---
Attachment: YARN-10300.002.patch

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124450#comment-17124450
 ] 

Eric Badger commented on YARN-10300:


bq. Eric Badger can you explain under what circumstances 
attempt.getMasterContainer().getNodeId().getHost() will succeed where 
attempt.getHost() fails?

{{attempt.getHost()}} grabs {{host}} within RMAppAttemptImpl.java. This is set 
during {{AMRegisteredTransition}}, so it will be "N/A" until the AM registers 
with the RM on its first heartbeat.

{{attempt.getMasterContainer()}} grabs {{masterContainer}} within 
RMAppAttemptImpl.java. This is set during {{AMContainerAllocatedTransition}}. 

So between the time the container is allocated and the first AM heartbeat, 
{{masterContainer}} will be set, but {{host}} will still be "N/A".

bq. Also, do we need to add some null checks for these? getMasterContainer(), 
getNodeId(), getHost()?
Probably wouldn't hurt. 

bq. Note that the attempt.getHost() defaults to "N/A" before it is set - what 
do we get if NodeID().getHost() isn't valid yet? Is that even a possibility?
I don't know whether it's possible for it to be invalid at the start. The 
{{Container}} is created via {{newInstance}}, which requires a {{NodeId}} as a 
parameter, and that could potentially be passed in as null. But I think it 
will either be the correct nodeId or null, which I can interpret as "N/A". 
Containers are instantiated in so many places in the scheduler that it'd be 
pretty tough to verify that all of the cases have the nodeId set initially.

I can add in the null checks and default the string to "N/A" if any of them 
don't exist. 
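
Something along these lines, roughly (a simplified sketch of the null-checked
fallback, not the actual RMAppManager code; the helper name is mine):

{noformat:title=Sketch of the null-checked fallback (simplified)}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttempt;

public final class AppSummaryHostSketch {
  static String appMasterHostFor(RMAppAttempt attempt) {
    if (attempt == null) {
      return "N/A";
    }
    String host = attempt.getHost();
    if (host != null && !"N/A".equals(host)) {
      return host;                          // AM already registered
    }
    Container master = attempt.getMasterContainer();
    if (master != null && master.getNodeId() != null
        && master.getNodeId().getHost() != null) {
      return master.getNodeId().getHost();  // allocated, not yet registered
    }
    return "N/A";                           // nothing usable yet
  }
}
{noformat}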

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124328#comment-17124328
 ] 

Eric Badger commented on YARN-10300:


The unit tests pass for me locally. [~epayne], [~Jim_Brennan], could you take a 
look at this patch?

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


