[jira] [Updated] (YARN-10562) Follow up changes for YARN-9833
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10562: --- Fix Version/s: 2.10.2 I committed this to branch-2.10 after backporting YARN-9833. this JIRA has now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10 > Follow up changes for YARN-9833 > --- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: resourcemanager > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch, YARN-10562.004.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
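To make the view-versus-copy behaviour described above concrete, here is a small self-contained demo (plain Java, not YARN code; the class name and directory paths are made up). It shows that an iterator taken from a CopyOnWriteArrayList keeps seeing its snapshot even after a clear, while a caller who only starts iterating an unmodifiable view after the clear sees an empty list - the symptom checkDirs() was triggering:
{code:java}
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowViewDemo {
  public static void main(String[] args) {
    List<String> localDirs = new CopyOnWriteArrayList<>();
    localDirs.add("/grid/0/yarn/local");
    localDirs.add("/grid/1/yarn/local");

    // What getGoodDirs() used to return: a read-only *view* over the same list.
    List<String> view = Collections.unmodifiableList(localDirs);
    // An iterator created now snapshots the current backing array.
    Iterator<String> earlyIter = localDirs.iterator();

    // Simulate checkDirs() clearing the list before rebuilding it.
    localDirs.clear();

    // The early iterator still sees both dirs (copy-on-write snapshot semantics)...
    int seenByEarlyIter = 0;
    while (earlyIter.hasNext()) {
      earlyIter.next();
      seenByEarlyIter++;
    }
    System.out.println("early iterator saw " + seenByEarlyIter + " dirs"); // prints 2

    // ...but anyone who starts iterating the view *after* the clear sees nothing,
    // which is the empty-localDirs problem described in this issue.
    System.out.println("view size after clear: " + view.size()); // prints 0
  }
}
{code}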
[jira] [Updated] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch
[ https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9833: -- Fix Version/s: 2.10.2 I backported this to branch-2.10 > Race condition when DirectoryCollection.checkDirs() runs during container > launch > > > Key: YARN-9833 > URL: https://issues.apache.org/jira/browse/YARN-9833 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4, 2.10.2 > > Attachments: YARN-9833-001.patch > > > During endurance testing, we found a race condition that cause an empty > {{localDirs}} being passed to container-executor. > The problem is that {{DirectoryCollection.checkDirs()}} clears three > collections: > {code:java} > this.writeLock.lock(); > try { > localDirs.clear(); > errorDirs.clear(); > fullDirs.clear(); > ... > {code} > This happens in critical section guarded by a write lock. When we start a > container, we retrieve the local dirs by calling > {{dirsHandler.getLocalDirs();}} which in turn invokes > {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is: > {code:java} > List getGoodDirs() { > this.readLock.lock(); > try { > return Collections.unmodifiableList(localDirs); > } finally { > this.readLock.unlock(); > } > } > {code} > So we're also in a critical section guarded by the lock. But > {{Collections.unmodifiableList()}} only returns a _view_ of the collection, > not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be > scheduled to run and immediately clears {{localDirs}}. > This caused a weird behaviour in container-executor, which exited with error > code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES). > Therefore we can't just return a view, we must return a copy with > {{ImmutableList.copyOf()}}. > Credits to [~snemeth] for analyzing and determining the root cause. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
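For reference, a minimal sketch of the copy-based fix described above, using a stand-in class since this is not the actual DirectoryCollection source; the committed patch may use Guava's ImmutableList.copyOf rather than the JDK-only construction shown here:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: mirrors the shape of DirectoryCollection.getGoodDirs(), not the actual patch.
class DirListHolder {
  private final List<String> localDirs = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      // Return an immutable *copy*, so a later checkDirs() clear/rebuild
      // cannot change what the caller is currently iterating.
      return Collections.unmodifiableList(new ArrayList<>(localDirs));
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}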
[jira] [Updated] (YARN-10562) Follow up changes for YARN-9833
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10562: --- Fix Version/s: 3.2.3 3.1.5 3.3.1 3.4.0 I committed this to trunk (3.4), branch-3.3, branch-3.2, and branch-3.1. To put this back into branch-2.10, we'll need to also backport YARN-9833. [~Jim_Brennan], let me know if you'd like me to do this > Follow up changes for YARN-9833 > --- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: resourcemanager > Fix For: 3.4.0, 3.3.1, 3.1.5, 3.2.3 > > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch, YARN-10562.004.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Follow up changes for YARN-9833
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263705#comment-17263705 ] Eric Badger commented on YARN-10562: +1 on the patch. As mentioned above, there is still the race in the code based on the fact that the caller doesn't have to get all Dirs at the same time. But the only issue that this will cause is the dirs being out of date for that iteration. The next time they get a copy, it will be updated. And the list will always be well-constructed. It just has the possibility of being out of sync when compared with the other lists. Will wait for the precommit to come back and commit if there are no errors and no objections > Follow up changes for YARN-9833 > --- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch, YARN-10562.004.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before -YARN-9833- is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263601#comment-17263601 ] Eric Badger commented on YARN-10562: Yea the original problem (before YARN-9833) was that we were getting a view of the list instead of a copy. And those views could iterate the list at any time. The issue there was that checkDirs was going out and clearing those lists in a separate thread. So when the client iterated through the lists, it would periodically see an empty list if it iterated at just the right time. After YARN-9833 there is still a race, but it is a smaller and less nefarious one. The issue there is that we have 3 lists (localDirs, fullDirs, errorDirs). Those can really be thought of as a single list with different attributes for each dir. Because the sum of all of those lists should give you all disks on the node. YARN-9833 added code to return a copy of the list instead of a view. So we'll never have a list that is incomplete. But the race becomes the fact that you could potentially call getGoodDirs(), then have checkDirs run, then call getErrorDirs. If a dir transitioned from good -> error just after getGoodDirs was called, it would show up in both lists. But like you said, [~pbacsko], I think it makes sense to remove complexity of the code if it requires this type of discussion to understand exactly why the code works (or doesn't work). It makes the code harder to maintain and even harder to modify. > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. 
There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before YARN-9833) is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
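The good-to-error window described in this comment can be shown with a tiny single-threaded simulation (hypothetical class, method, and path names; in the real code the transition happens on the checkDirs() timer thread):
{code:java}
import java.util.ArrayList;
import java.util.List;

public class OverlapWindowDemo {
  private final List<String> goodDirs = new ArrayList<>(List.of("/grid/0", "/grid/1"));
  private final List<String> errorDirs = new ArrayList<>();

  List<String> getGoodDirs()  { return new ArrayList<>(goodDirs); }
  List<String> getErrorDirs() { return new ArrayList<>(errorDirs); }

  // Stand-in for the checkDirs() step that moves a failing disk between lists.
  void markBad(String dir) {
    errorDirs.add(dir);   // added to one list...
    goodDirs.remove(dir); // ...removed from the other
  }

  public static void main(String[] args) {
    OverlapWindowDemo dc = new OverlapWindowDemo();
    List<String> good = dc.getGoodDirs();   // caller copies the good dirs
    dc.markBad("/grid/1");                  // checkDirs() runs in between
    List<String> error = dc.getErrorDirs(); // caller copies the error dirs
    // "/grid/1" now appears in both copies: stale in one, current in the other.
    System.out.println("good=" + good + " error=" + error);
  }
}
{code}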
[jira] [Commented] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
[ https://issues.apache.org/jira/browse/YARN-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262976#comment-17262976 ] Eric Badger commented on YARN-10562: Discussed this a little bit with [~Jim_Brennan] offline and here's the summary of my thoughts. There are a few problems with the current code. 1) There is an inherent race in the code 2) There is unnecessary and overlapping locking between the read/write lock and the CopyOnWriteArrayList The only way we can reliably address 1) is to return a copy of all lists at once. Otherwise DirectoryCollection.checkDirs() can come along and change the overall status of the 3 dirs lists. If we put checkDirs in a critical section (like it is now with the write lock), then we can return all dirs at once while in the read lock and assure that all dirs are consistent with each other. If we get the dirs in separate calls that grab the lock, we could be inconsistent because checkDirs could be called in between the getDirs calls. I suppose the other way to fix the locking is to do fine-grained locking within the caller code itself, but I think that is pretty bad practice by exposing internal locking to the caller. For 2) we should either change CopyOnWriteArrayList to a regular ArrayList (as they had original planned to do in [YARN-5214|https://issues.apache.org/jira/browse/YARN-5214?focusedCommentId=15342587=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15342587] or remove the read/write lock. These locks serve more or less the same purpose and having both of them is unncecessary. Since I think that locking is usually difficult, complex, and misunderstood by those who change it later, I think we should get rid of the CopyOnWriteArrayList and change it to a regular ArrayList and then make the changes that [~Jim_Brennan] has made here so that we aren't reconstructing each list from scratch each time we run checkDirs. The downside of this change is that every container launch will create a new copy each list and that is a performance regression. But I don't think it will be much of an issue. Would be happy to hear other opinions on this > Alternate fix for DirectoryCollection.checkDirs() race > -- > > Key: YARN-10562 > URL: https://issues.apache.org/jira/browse/YARN-10562 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-10562.001.patch, YARN-10562.002.patch, > YARN-10562.003.patch > > > In YARN-9833, a race condition in DirectoryCollection. {{getGoodDirs()}} and > related methods were returning an unmodifiable view of the lists. These > accesses were protected by read/write locks, but because the lists are > CopyOnWriteArrayLists, subsequent changes to the list, even when done under > the writelock, were exposed when a caller started iterating the list view. > CopyOnWriteArrayLists cache the current underlying list in the iterator, so > it is safe to iterate them even while they are being changed - at least the > view will be consistent. > The problem was that checkDirs() was clearing the lists and rebuilding them > from scratch every time, so if a caller called getGoodDirs() just before > checkDirs cleared it, and then started iterating right after the clear, they > could get an empty list. > The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to > return a copy of the list, which definitely fixes the race condition. 
The > disadvantage is that now we create a new copy of these lists every time we > launch a container. The advantage of using CopyOnWriteArrayList was that the > lists should rarely ever change, and we can avoid all the copying. > Unfortunately, the way checkDirs() was written, it guaranteed that it would > modify those lists multiple times every time. > So this Jira proposes an alternate solution for YARN-9833, which mainly just > rewrites checkDirs() to minimize the changes to the underlying lists. There > are still some small windows where a disk will have been added to one list, > but not yet removed from another if you hit it just right, but I think these > should be pretty rare and relatively harmless, and in the vast majority of > cases I suspect only one disk will be moving from one list to another at any > time. The question is whether this type of inconsistency (which was always > there before YARN-9833) is worth reducing all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
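A rough sketch of the "return a copy of all lists at once while in the read lock" idea from this comment. DirsSnapshot and getDirsSnapshot() are hypothetical names rather than existing YARN APIs, and the real DirectoryCollection tracks more state than shown; the point is only that one read-lock section keeps the three copies mutually consistent with checkDirs(), which takes the write lock:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: copies of all three lists taken under a single read lock.
class DirsSnapshot {
  final List<String> goodDirs;
  final List<String> errorDirs;
  final List<String> fullDirs;

  DirsSnapshot(List<String> good, List<String> error, List<String> full) {
    this.goodDirs = Collections.unmodifiableList(new ArrayList<>(good));
    this.errorDirs = Collections.unmodifiableList(new ArrayList<>(error));
    this.fullDirs = Collections.unmodifiableList(new ArrayList<>(full));
  }
}

class DirectoryCollectionSketch {
  private final List<String> localDirs = new ArrayList<>();
  private final List<String> errorDirs = new ArrayList<>();
  private final List<String> fullDirs = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  DirsSnapshot getDirsSnapshot() {
    lock.readLock().lock();
    try {
      // All three copies are made in one critical section, so a concurrent
      // checkDirs() (write lock) cannot leave them inconsistent with each other.
      return new DirsSnapshot(localDirs, errorDirs, fullDirs);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}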
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258592#comment-17258592 ] Eric Badger commented on YARN-10501: Thanks for the explanation, [~caozhiqiang]. {quote} Overall I have 2 main issues with the code that I need cleared up because I either don't understand the code or think things are broken/unnecessary. 1) We have labels associated with Hosts (which are collections of Hosts) and labels associated with just Nodes 2) We add Nodes that have no associated port or have the wildcard port and add those same nodes with their associated ports. {quote} Ok so out of these points, it looks like 1) exists because we want to have a "host default" that all nodes get when they start up. But I still don't understand why 2) exists. I think it sounds like you also don't understand why 2) exists, but correct me if I'm wrong. Hopefully one of the people you tagged can give an explanation on why we add nodes that have no associated port > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Attachments: YARN-10501.002.patch, YARN-10501.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":["server001:0","server001:45454"],"partitionInfo":{"resourceAvailable":{"memory":"510","vCores":"1","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"510"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"1"}]}}} > 4.yarn rmadmin -replaceLabelsOnNode "server001" > 5.curl http://RM_IP:8088/ws/v1/cluster/label-mappings > {"labelsToNodes":{"entry":{"key":{"name":"cpunode","exclusivity":"true"},"value":{"nodes":"server001:45454","partitionInfo":{"resourceAvailable":{"memory":"0","vCores":"0","resourceInformations":{"resourceInformation":[{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"memory-mb","resourceType":"COUNTABLE","units":"Mi","value":"0"},{"attributes":null,"maximumAllocation":"9223372036854775807","minimumAllocation":"0","name":"vcores","resourceType":"COUNTABLE","units":"","value":"0"}]}}} > {code} > You can see after the 4 process to remove nodemanager labels, the label info > is still in the node info. 
> {code:java} > 641 case REPLACE: > 642 replaceNodeForLabels(nodeId, host.labels, labels); > 643 replaceLabelsForNode(nodeId, host.labels, labels); > 644 host.labels.clear(); > 645 host.labels.addAll(labels); > 646 for (Node node : host.nms.values()) { > 647 replaceNodeForLabels(node.nodeId, node.labels, labels); > 649 node.labels = null; > 650 } > 651 break;{code} > The cause is at line 647: when labels are added to a node without a port, both the 0 (wildcard) > port and the real NM port are added to the node info, and when labels are removed, > the node.labels parameter at line 647 is null, so the old label is not removed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
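One possible direction for the line-647 problem, shown only as an illustration and not as the attached patch: when a per-NM Node carries no labels of its own (node.labels == null, meaning it inherits from the host), the host's labels are what need to be removed from the Labels->Nodes mapping for that NM's nodeId. A minimal, self-contained sketch of that idea:
{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustrative only -- NOT the attached patch. It just captures the idea that a
// null node.labels means "inherit the host's labels", so those are the labels
// that must be treated as the old labels when replacing them for that nodeId.
class ReplaceLabelsSketch {
  static Set<String> effectiveOldLabels(Set<String> nodeLabels, Set<String> hostLabels) {
    return nodeLabels != null ? nodeLabels : new HashSet<>(hostLabels);
  }
}
{code}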
[jira] [Commented] (YARN-10501) Can't remove all node labels after add node label without nodemanager port
[ https://issues.apache.org/jira/browse/YARN-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253792#comment-17253792 ] Eric Badger commented on YARN-10501: I'm looking at this patch and I have some questions about other pieces of code that are in the same section. I'll admit, this code is a little bit confusing to me because we have Hosts->Labels maps as well as Labels->Nodes maps and then on top of that, each Host can have multiple Nodes. Before I can comment on your patch, I think I need to clear up some things that are going on in this area of code that are confusing to me. Below is what I _think_ is happening. Feel free to correct me where I'm wrong. (Assuming no port and/or wildcard port for this) 1. When we add a node label, we invoke this piece of code. {noformat} case ADD: addNodeToLabels(nodeId, labels); host.labels.addAll(labels); for (Node node : host.nms.values()) { if (node.labels != null) { node.labels.addAll(labels); } addNodeToLabels(node.nodeId, labels); } break; {noformat} 1a. This code adds the NodeId (without a port/with a wildcard port) to the Labels->Nodes map via addNodeToLabels. *Why do we do this? There is no port associated with this node. In 1d we add the nodes to the map with their associated port, so I don't understand why we're adding the node here when it doesn't have a port.* 1b. It adds all of the labels to the Host. This part doesn't make sense to me. *If we are giving Hosts the granularity to have multiple labels per host (due to multiple NMs), then why does the Host itself have labels?* 1c. We add all the labels to each Node in the host, but _only_ if they already have labels. *Why do we only add the labels if they already have labels? Don't we want to add the labels regardless? Should it be possible for us to be in the ADD method while node.labels == null? Maybe this should throw an exception* 1d. We add the Nodes (with their associated NM port) to the Labels->nodes map via addNodeToLabels. 2. When we replace the node label we invoke this piece of code {noformat} case REPLACE: replaceNodeForLabels(nodeId, host.labels, labels); host.labels.clear(); host.labels.addAll(labels); for (Node node : host.nms.values()) { replaceNodeForLabels(node.nodeId, node.labels, labels); node.labels = null; } {noformat} 2a. We remove the Node (without port or with wildcard port) from the specific label in the Labels->Nodes map via removeNoveFromLabels(). *Why do we have the node without a port in the first place?* This comes from 1a. 2b. We add the Node (without port or with wildcard port) to the new specific label in the Labels->Nodes map via addNodeToLabels(). *Why do we add the node without a port?* 2c. We clear the labels associated with the Host. *Why are there labels associated with a Host when each Host is actually a collection of Nodes?* 2d. We add the new labels to the Host. Same question as 2c. 2e. We iterate through the list of Nodes associated with each Host and perform 2a and 2b, except with Nodes that have their associated ports. 2f. We set the Labels to Null for each Node associated with the Host. I don't understand the purpose of this. I must be missing something here. Overall I have 2 main issues with the code that I need cleared up because I either don't understand the code or think things are broken/unnecessary. 
1) We have labels associated with Hosts (which are collections of Hosts) _and_ labels associated with just Nodes 2) We add Nodes that have no associated port or have the wildcard port _and_ add those same nodes with their associated ports. Probably need [~leftnoteasy], [~varunsaxena], or [~sunilg] to comment on this since they were involved with YARN-3075 that added much of this code > Can't remove all node labels after add node label without nodemanager port > -- > > Key: YARN-10501 > URL: https://issues.apache.org/jira/browse/YARN-10501 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Assignee: caozhiqiang >Priority: Critical > Attachments: YARN-10501.002.patch, YARN-10501.003.patch > > > When add a label to nodes without nodemanager port or use WILDCARD_PORT (0) > port, it can't remove all label info in these nodes > Reproduce process: > {code:java} > 1.yarn rmadmin -addToClusterNodeLabels "cpunode(exclusive=true)" > 2.yarn rmadmin -replaceLabelsOnNode "server001=cpunode" > 3.curl http://RM_IP:8088/ws/v1/cluster/label-mappings >
[jira] [Updated] (YARN-10540) Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes
[ https://issues.apache.org/jira/browse/YARN-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10540: --- Fix Version/s: 3.2.3 2.10.2 3.1.5 3.3.1 3.4.0 > Node page is broken in YARN UI1 and UI2 including RMWebService api for nodes > > > Key: YARN-10540 > URL: https://issues.apache.org/jira/browse/YARN-10540 > Project: Hadoop YARN > Issue Type: Task > Components: webapp >Affects Versions: 3.2.2 >Reporter: Sunil G >Assignee: Jim Brennan >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3 > > Attachments: Mac-Yarn-UI.png, Screenshot 2020-12-19 at 11.01.43 > PM.png, Screenshot 2020-12-19 at 11.02.14 PM.png, YARN-10540.001.patch, > Yarn-UI-Ubuntu.png, yarnodes.png, yarnui2onubuntu.png > > > YARN-10450 added changes in NodeInfo class. > Various exceptions are showing while accessing UI2 and UI1 NODE pages. > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.NodesPage$NodesBlock.render(NodesPage.java:164) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at > org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.nodes(RmController.java:70) > {code} > {code:java} > 2020-12-19 22:55:54,846 WARN > org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.NodeInfo.(NodeInfo.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getNodes(RMWebServices.java:450) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch
[ https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249402#comment-17249402 ] Eric Badger commented on YARN-9833: --- bq. Isn't this how it has been for years? It was returning an unmodifiableList view of the underlying List, so that limits what the caller can do. getGoodDirs() and the others just return a read-only List. They don't have to know about the internals. Well, yes an no. It was _supposed_ to be like that. But given this bug, we can see that it clearly wasn't. The callee in this case _should_ have been atomic and so the unmodifiable view of the list _should_ have been fine. But when you get into fine-grained locking like this, mistakes are easy to make because the person making the change doesn't necessarily understand the history behind why the code is written the way it is. If we can guarantee that the callee will always perform atomic operations on the lists, then there isn't an issue. Maybe we can guarantee this by adding a comment/warning in the checkDirs() function to make sure that anyone touching this code is super careful about locking. I agree with the idea of fixing this on the callee side and not having the caller create a new object everytime. I just want to make sure that this bug isn't reintroduced by accident down the line because of the added complexity of fine-grained locking. I don't know if this is a performance-sensitive area of the code where such a tradeoff would clearly be to go for performance. > Race condition when DirectoryCollection.checkDirs() runs during container > launch > > > Key: YARN-9833 > URL: https://issues.apache.org/jira/browse/YARN-9833 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9833-001.patch > > > During endurance testing, we found a race condition that cause an empty > {{localDirs}} being passed to container-executor. > The problem is that {{DirectoryCollection.checkDirs()}} clears three > collections: > {code:java} > this.writeLock.lock(); > try { > localDirs.clear(); > errorDirs.clear(); > fullDirs.clear(); > ... > {code} > This happens in critical section guarded by a write lock. When we start a > container, we retrieve the local dirs by calling > {{dirsHandler.getLocalDirs();}} which in turn invokes > {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is: > {code:java} > List getGoodDirs() { > this.readLock.lock(); > try { > return Collections.unmodifiableList(localDirs); > } finally { > this.readLock.unlock(); > } > } > {code} > So we're also in a critical section guarded by the lock. But > {{Collections.unmodifiableList()}} only returns a _view_ of the collection, > not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be > scheduled to run and immediately clears {{localDirs}}. > This caused a weird behaviour in container-executor, which exited with error > code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES). > Therefore we can't just return a view, we must return a copy with > {{ImmutableList.copyOf()}}. > Credits to [~snemeth] for analyzing and determining the root cause. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch
[ https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249289#comment-17249289 ] Eric Badger commented on YARN-9833: --- I agree that the CopyOnWriteArrayList is most likely a performance thing. Since {{getGoodDirs()}} is called on every container launch, then that's a lot of copying. Not sure it's that bad in the grand scheme of things though. bq. My suggestion for fixing this would be to fix the checkdirs() implementation to operate on local copies of these arrays, and then update them with a single assignment only if they have changed. My worry with this is that code changes in the future will incorrectly use {{getGoodDirs}} or the other methods that expose the private lists from within DirectoryCollection. So in my mind it's a tradeoff between performance and maintainability. I don't know what the performance impact is. We could potentially mitigate some (most?) of the maintainability impact via a comment on the getGoodDirs() method (as well as the getLocalDirs() method in LocalDirsHandlerService). In general, I don't like calling methods to have to be aware of callee methods and having to deal with their locking. That could also be mitigated by fixing the callee method to remove the race condition, but that could be reintroduced by accident in the future, since they may not understand the full impact of the CopyOnWriteArrayList bq. 1. We were not thinking about errorDirs because as we were tracking down the issue, only localDirs seemed to be problematic, although I agree that it is inconsistent this way. Shall we follow-up on this? Yea, we should definitely follow up to fix errorDirs > Race condition when DirectoryCollection.checkDirs() runs during container > launch > > > Key: YARN-9833 > URL: https://issues.apache.org/jira/browse/YARN-9833 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9833-001.patch > > > During endurance testing, we found a race condition that cause an empty > {{localDirs}} being passed to container-executor. > The problem is that {{DirectoryCollection.checkDirs()}} clears three > collections: > {code:java} > this.writeLock.lock(); > try { > localDirs.clear(); > errorDirs.clear(); > fullDirs.clear(); > ... > {code} > This happens in critical section guarded by a write lock. When we start a > container, we retrieve the local dirs by calling > {{dirsHandler.getLocalDirs();}} which in turn invokes > {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is: > {code:java} > List getGoodDirs() { > this.readLock.lock(); > try { > return Collections.unmodifiableList(localDirs); > } finally { > this.readLock.unlock(); > } > } > {code} > So we're also in a critical section guarded by the lock. But > {{Collections.unmodifiableList()}} only returns a _view_ of the collection, > not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be > scheduled to run and immediately clears {{localDirs}}. > This caused a weird behaviour in container-executor, which exited with error > code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES). > Therefore we can't just return a view, we must return a copy with > {{ImmutableList.copyOf()}}. > Credits to [~snemeth] for analyzing and determining the root cause. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
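A hedged sketch of the "operate on local copies of these arrays, and then update them with a single assignment only if they have changed" suggestion quoted in this comment. The names are illustrative, the real checkDirs() does considerably more (disk health checks, full-dir handling, dir change listeners), and this is not the patch that was ultimately committed:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CheckDirsSketch {
  private volatile List<String> localDirs = new ArrayList<>();
  private volatile List<String> errorDirs = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void checkDirs(List<String> allDirs) {
    // Build fresh lists outside the lock; the published lists are never cleared in place.
    List<String> newGood = new ArrayList<>();
    List<String> newError = new ArrayList<>();
    for (String dir : allDirs) {
      if (isHealthy(dir)) {
        newGood.add(dir);
      } else {
        newError.add(dir);
      }
    }
    lock.writeLock().lock();
    try {
      // Single reference swap, and only when something actually changed, so a
      // reader holding the old list keeps a stable, fully-built view.
      if (!newGood.equals(localDirs)) {
        localDirs = newGood;
      }
      if (!newError.equals(errorDirs)) {
        errorDirs = newError;
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  private boolean isHealthy(String dir) {
    return new java.io.File(dir).canWrite(); // placeholder health check
  }
}
{code}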
[jira] [Commented] (YARN-9833) Race condition when DirectoryCollection.checkDirs() runs during container launch
[ https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246922#comment-17246922 ] Eric Badger commented on YARN-9833: --- [~pbacsko], is there a reason that errorDirs wasn't added in this patch? It still returns an unmodifiableList instead of a copy. > Race condition when DirectoryCollection.checkDirs() runs during container > launch > > > Key: YARN-9833 > URL: https://issues.apache.org/jira/browse/YARN-9833 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9833-001.patch > > > During endurance testing, we found a race condition that cause an empty > {{localDirs}} being passed to container-executor. > The problem is that {{DirectoryCollection.checkDirs()}} clears three > collections: > {code:java} > this.writeLock.lock(); > try { > localDirs.clear(); > errorDirs.clear(); > fullDirs.clear(); > ... > {code} > This happens in critical section guarded by a write lock. When we start a > container, we retrieve the local dirs by calling > {{dirsHandler.getLocalDirs();}} which in turn invokes > {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is: > {code:java} > List getGoodDirs() { > this.readLock.lock(); > try { > return Collections.unmodifiableList(localDirs); > } finally { > this.readLock.unlock(); > } > } > {code} > So we're also in a critical section guarded by the lock. But > {{Collections.unmodifiableList()}} only returns a _view_ of the collection, > not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be > scheduled to run and immediately clears {{localDirs}}. > This caused a weird behaviour in container-executor, which exited with error > code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES). > Therefore we can't just return a view, we must return a copy with > {{ImmutableList.copyOf()}}. > Credits to [~snemeth] for analyzing and determining the root cause. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10495: --- Fix Version/s: 3.4.1 > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.1 > > Attachments: YARN-10495.001.patch, YARN-10495.002.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we add dependency on > crypto to container-executor, we meet a case that in our jenkins machine, we > have libcrypto.so.1.0.0 in shared lib env. but in our nodemanager machine we > don't have libcrypto.so.1.0.0 but *libcrypto.so.1.1.* > We use a internal custom dynamic link library environment > /usr/lib/x86_64-linux-gnu > and we build hadoop with parameter as blow > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under jenkins machine shared lib library path /usr/lib/x86_64-linux-gun(where > is libcrypto) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under nodemanager shared lib library path /usr/lib/x86_64-linux-gun(where is > libcrypto) > {code:java} > -rw-r--r-- 1 root root55852 2�� 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 9�� 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 9�� 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 12�� 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 9�� 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 2�� 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 2�� 7 2019 libc.so > {code} > We build container-executor with > The libcrypto.so 's version is not same case error when we start nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 
4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)
[ https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243597#comment-17243597 ] Eric Badger commented on YARN-10494: I'm pretty much in line with [~ccondit]'s opinion here. This squashfs code definitely isn't specific to Hadoop and could be a separate project. However, that's a lot of work and I'm not sure there's any real benefit there. Also, I think a PR is much better for a large commit because it becomes much easier to make comments in line with the code instead of having to explain where the issues are in a JIRA comment > CLI tool for docker-to-squashfs conversion (pure Java) > -- > > Key: YARN-10494 > URL: https://issues.apache.org/jira/browse/YARN-10494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Labels: pull-request-available > Attachments: YARN-10494.001.patch, > docker-to-squashfs-conversion-tool-design.pdf > > Time Spent: 40m > Remaining Estimate: 0h > > *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on > python2, multiple libraries, squashfs-tools and root access in order to > convert Docker images to squashfs images for use with the runc container > runtime in YARN. > *YARN-9943* was created to investigate alternatives, as the response to > merging YARN-9564 has not been very positive. This proposal outlines the > design for a CLI conversion tool in 100% pure Java that will work out of the > box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)
[ https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236375#comment-17236375 ] Eric Badger commented on YARN-10494: Hey [~ccondit], thanks for the document. I'm really excited about this tool. bq. This tool will connect to a Docker repository Is this a requirement of the tool? I think we definitely need to at least have support for local image import from the docker daemon. Ideally we would also include support for any OCI-compliant registry, but that is probably outside of the scope of the initial design. We just want to make sure to leave the door open for that support in the future Another question: Does this tool support reproducible builds as was added to squashfs-tools 4.4 (https://github.com/plougher/squashfs-tools/blob/master/README-4.4)? And as we discussed in the most recent YARN call, we'll need to figure out how to run this tool (e.g. a service in the RM, standalone, Hadoop job, etc.) and with what user it needs to be run as. There are certainly challenges around permissions and security where we won't want arbitrary users creating potentially malicious squashfs images that will be blindly loaded by the kernel. This is outside of the scope of this specific JIRA, but wanted to mention it here for posterity. > CLI tool for docker-to-squashfs conversion (pure Java) > -- > > Key: YARN-10494 > URL: https://issues.apache.org/jira/browse/YARN-10494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Major > Attachments: docker-to-squashfs-conversion-tool-design.pdf > > > *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on > python2, multiple libraries, squashfs-tools and root access in order to > convert Docker images to squashfs images for use with the runc container > runtime in YARN. > *YARN-9943* was created to investigate alternatives, as the response to > merging YARN-9564 has not been very positive. This proposal outlines the > design for a CLI conversion tool in 100% pure Java that will work out of the > box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236353#comment-17236353 ] Eric Badger commented on YARN-10495: Hadoop QA seems to have failed pretty epically. Looks like OOM issues on the box where it ran. I re-uploaded your patch (as 002) to retrigger this build > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: YARN-10495.001.patch, YARN-10495.002.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we add dependency on > crypto to container-executor, we meet a case that in our jenkins machine, we > have libcrypto.so.1.0.0 in shared lib env. but in our nodemanager machine we > don't have libcrypto.so.1.0.0 but *libcrypto.so.1.1.* > We use a internal custom dynamic link library environment > /usr/lib/x86_64-linux-gnu > and we build hadoop with parameter as blow > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under jenkins machine shared lib library path /usr/lib/x86_64-linux-gun(where > is libcrypto) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under nodemanager shared lib library path /usr/lib/x86_64-linux-gun(where is > libcrypto) > {code:java} > -rw-r--r-- 1 root root55852 2�� 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 9�� 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 9�� 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 12�� 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 9�� 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 2�� 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 2�� 7 2019 libc.so > {code} > We build container-executor with > The libcrypto.so 's version is not same case error when we start nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 
4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10495: --- Attachment: YARN-10495.002.patch > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: YARN-10495.001.patch, YARN-10495.002.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we add dependency on > crypto to container-executor, we meet a case that in our jenkins machine, we > have libcrypto.so.1.0.0 in shared lib env. but in our nodemanager machine we > don't have libcrypto.so.1.0.0 but *libcrypto.so.1.1.* > We use a internal custom dynamic link library environment > /usr/lib/x86_64-linux-gnu > and we build hadoop with parameter as blow > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under jenkins machine shared lib library path /usr/lib/x86_64-linux-gun(where > is libcrypto) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under nodemanager shared lib library path /usr/lib/x86_64-linux-gun(where is > libcrypto) > {code:java} > -rw-r--r-- 1 root root55852 2�� 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 9�� 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 9�� 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 12�� 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 9�� 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 2�� 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 2�� 7 2019 libc.so > {code} > We build container-executor with > The libcrypto.so 's version is not same case error when we start nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 
4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235627#comment-17235627 ] Eric Badger commented on YARN-10495: Also, I've added you as a contributor in Hadoop Common, HDFS, Map/Reduce, and YARN. So you will now be able to assign JIRAs to yourself (as I've already done for you on this JIRA). > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: YARN-10495.001.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on > crypto to container-executor. We hit a case where our jenkins machine has > libcrypto.so.1.0.0 in its shared library environment, but our nodemanager > machines don't have libcrypto.so.1.0.0, only *libcrypto.so.1.1*. > We use an internal custom dynamic link library path, > /usr/lib/x86_64-linux-gnu, > and we build hadoop with the parameters below > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under the jenkins machine shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto lives) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root 54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under the nodemanager shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto lives) > {code:java} > -rw-r--r-- 1 root root 55852 Feb 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 Sep 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 Sep 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 Dec 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 Sep 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 Feb 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Feb 7 2019 libc.so > {code} > Because the libcrypto.so version that container-executor was built against is not > the one present on the nodemanager, we get the following error when we start the nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 
4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger reassigned YARN-10495: -- Assignee: angerszhu > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: YARN-10495.001.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on > crypto to container-executor. We hit a case where our jenkins machine has > libcrypto.so.1.0.0 in its shared library environment, but our nodemanager > machines don't have libcrypto.so.1.0.0, only *libcrypto.so.1.1*. > We use an internal custom dynamic link library path, > /usr/lib/x86_64-linux-gnu, > and we build hadoop with the parameters below > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under the jenkins machine shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto lives) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root 54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under the nodemanager shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto lives) > {code:java} > -rw-r--r-- 1 root root 55852 Feb 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 Sep 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 Sep 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 Dec 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 Sep 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 Feb 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Feb 7 2019 libc.so > {code} > Because the libcrypto.so version that container-executor was built against is not > the one present on the nodemanager, we get the following error when we start the nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 
4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10495) make the rpath of container-executor configurable
[ https://issues.apache.org/jira/browse/YARN-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235625#comment-17235625 ] Eric Badger commented on YARN-10495: [~angerszhuuu], I imagine the {{-Dbundle.openssl}} adds the libcrypto.so library to {{../lib/native}} of the build that is created? I don't have experience with this flag. Also, have you tested this out in your environment? > make the rpath of container-executor configurable > - > > Key: YARN-10495 > URL: https://issues.apache.org/jira/browse/YARN-10495 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: angerszhu >Priority: Major > Attachments: YARN-10495.001.patch > > > In https://issues.apache.org/jira/browse/YARN-9561 we added a dependency on > crypto to container-executor. We hit a case where our jenkins machine has > libcrypto.so.1.0.0 in its shared library environment, but our nodemanager > machines don't have libcrypto.so.1.0.0, only *libcrypto.so.1.1*. > We use an internal custom dynamic link library path, > /usr/lib/x86_64-linux-gnu, > and we build hadoop with the parameters below > {code:java} > -Drequire.openssl -Dbundle.openssl -Dopenssl.lib=/usr/lib/x86_64-linux-gnu > {code} > > Under the jenkins machine shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto lives) > {code:java} > -rw-r--r-- 1 root root 240136 Nov 28 2014 libcroco-0.6.so.3.0.1 > -rw-r--r-- 1 root root 54550 Jun 18 2017 libcrypt.a > -rw-r--r-- 1 root root 4306444 Sep 26 2019 libcrypto.a > lrwxrwxrwx 1 root root 18 Sep 26 2019 libcrypto.so -> > libcrypto.so.1.0.0 > -rw-r--r-- 1 root root 2070976 Sep 26 2019 libcrypto.so.1.0.0 > lrwxrwxrwx 1 root root 35 Jun 18 2017 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Jun 18 2017 libc.so > {code} > > Under the nodemanager shared library path /usr/lib/x86_64-linux-gnu (where > libcrypto lives) > {code:java} > -rw-r--r-- 1 root root 55852 Feb 7 2019 libcrypt.a > -rw-r--r-- 1 root root 4864244 Sep 28 2019 libcrypto.a > lrwxrwxrwx 1 root root 16 Sep 28 2019 libcrypto.so -> > libcrypto.so.1.1 > -rw-r--r-- 1 root root 2504576 Dec 24 2019 libcrypto.so.1.0.2 > -rw-r--r-- 1 root root 2715840 Sep 28 2019 libcrypto.so.1.1 > lrwxrwxrwx 1 root root 35 Feb 7 2019 libcrypt.so -> > /lib/x86_64-linux-gnu/libcrypt.so.1 > -rw-r--r-- 1 root root 298 Feb 7 2019 libc.so > {code} > Because the libcrypto.so version that container-executor was built against is not > the one present on the nodemanager, we get the following error when we start the nodemanager > > {code:java} > .. 3 more Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=127: /home/hadoop/hadoop/bin/container-executor: > error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared > object file: No such file or directory at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:182) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:208) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:306) > ... 
4 more Caused by: ExitCodeException exitCode=127: > /home/hadoop/hadoop/bin/container-executor: error while loading shared > libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file > or directory at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008) at > org.apache.hadoop.util.Shell.run(Shell.java:901) at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:154) > ... 6 more > {code} > > We should make RPATH of container-executor configurable to solve this problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234867#comment-17234867 ] Eric Badger commented on YARN-9561: --- I think the short answer is that you need to have openssl-devel installed on the machine where you are compiling Hadoop. >From BUILDING.txt {noformat} 160 * Use -Drequire.openssl to fail the build if libcrypto.so is not found. 161 If this option is not specified and the openssl library is missing, 162 we silently build a version of libhadoop.so that cannot make use of 163 openssl. This option is recommended if you plan on making use of openssl 164 and want to get more repeatable builds. 165 * Use -Dopenssl.prefix to specify a nonstandard location for the libcrypto 166 header files and library files. You do not need this option if you have 167 installed openssl using a package manager. 168 * Use -Dopenssl.lib to specify a nonstandard location for the libcrypto library 169 files. Similarly to openssl.prefix, you do not need this option if you have 170 installed openssl using a package manager. 171 * Use -Dbundle.openssl to copy the contents of the openssl.lib directory into 172 the final tar file. This option requires that -Dopenssl.lib is also given, 173 and it ignores the -Dopenssl.prefix option. If -Dopenssl.lib isn't given, the 174 bundling and building will fail. {noformat} The crypto library is statically linked to the container-executor during compilation. I guess it just quietly moves on if it doesn't find it instead of failing. That's sort of troubling and something I'll look into fixing. But the answer is to make sure openssl-devel is installed before you compile Hadoop > Add C changes for the new RuncContainerRuntime > -- > > Key: YARN-9561 > URL: https://issues.apache.org/jira/browse/YARN-9561 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9561.001.patch, YARN-9561.002.patch, > YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, > YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, > YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch, > YARN-9561.012.patch, YARN-9561.013.patch, YARN-9561.014.patch, > YARN-9561.015.patch > > > This JIRA will be used to add the C changes to the container-executor native > binary that are necessary for the new RuncContainerRuntime. There should be > no changes to existing code paths. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214990#comment-17214990 ] Eric Badger commented on YARN-10460: Really interesting find. Good job tracking this down. Since we're not testing the IPC layer here, I don't see a big deal with it. It's not ideal, but I still think it's fairly clean. Just resetting the environment before the next test. So I'd be ok with this change. > Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail > - > > Key: YARN-10460 > URL: https://issues.apache.org/jira/browse/YARN-10460 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, test >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10460-POC.patch > > > In our downstream build environment, we're using JUnit 4.13. Recently, we > discovered a truly weird test failure in TestNodeStatusUpdater. > The problem is that timeout handling has changed in Junit 4.13. See the > difference between these two snippets: > 4.12 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } > {noformat} > > 4.13 > {noformat} > @Override > public void evaluate() throws Throwable { > CallableStatement callable = new CallableStatement(); > FutureTask task = new FutureTask(callable); > ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup"); > Thread thread = new Thread(threadGroup, task, "Time-limited test"); > try { > thread.setDaemon(true); > thread.start(); > callable.awaitStarted(); > Throwable throwable = getResult(task, thread); > if (throwable != null) { > throw throwable; > } > } finally { > try { > thread.join(1); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > } > try { > threadGroup.destroy(); < This > } catch (IllegalThreadStateException e) { > // If a thread from the group is still alive, the ThreadGroup > cannot be destroyed. > // Swallow the exception to keep the same behavior prior to > this change. > } > } > } > {noformat} > The change comes from [https://github.com/junit-team/junit4/pull/1517]. > Unfortunately, destroying the thread group causes an issue because there are > all sorts of object caching in the IPC layer. 
The exception is: > {noformat} > java.lang.IllegalThreadStateException > at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867) > at java.lang.Thread.init(Thread.java:402) > at java.lang.Thread.init(Thread.java:349) > at java.lang.Thread.(Thread.java:675) > at > java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613) > at > com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163) > at > java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:612) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) > at > java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136) > at org.apache.hadoop.ipc.Client.call(Client.java:1458) > at org.apache.hadoop.ipc.Client.call(Client.java:1405) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy81.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251) > at >
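To make the failure mode above concrete, here is a small, self-contained Java sketch (JDK 8 semantics, made-up class names, not JUnit or Hadoop code) of why destroying the FailOnTimeout thread group breaks later RPC calls: a cached executor created from inside the group keeps a default ThreadFactory bound to that group, and once the group has been destroyed the factory can no longer start worker threads.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadGroupDestroyDemo {
    public static void main(String[] args) throws Exception {
        ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup-demo");
        final ExecutorService[] holder = new ExecutorService[1];

        // The executor is created from inside the group, so its default
        // ThreadFactory (with no SecurityManager installed) remembers this
        // group for every worker thread it will ever create.
        Thread testThread = new Thread(group,
            () -> holder[0] = Executors.newCachedThreadPool(),
            "time-limited-test-demo");
        testThread.start();
        testThread.join();

        // This is the extra step JUnit 4.13 performs after the test body returns.
        group.destroy();

        try {
            // First use of the cached executor: it has to create a worker thread
            // inside the now-destroyed group, which the JVM refuses.
            holder[0].submit(() -> System.out.println("never runs"));
        } catch (IllegalThreadStateException e) {
            System.out.println("Got expected " + e);
        }
    }
}
{code}

Run on JDK 8 this prints the expected IllegalThreadStateException, mirroring the stack trace above where the IPC client's cached thread pool tries to spawn a sender thread; newer JDKs deprecate and eventually disable ThreadGroup.destroy(), which this sketch assumes is still functional.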
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214891#comment-17214891 ] Eric Badger commented on YARN-10450: Thanks, [~Jim_Brennan]. I've committed this to trunk and branch-3.3. I'll wait for the precommit builds to come back and then will commit to the rest of the branches > Add cpu and memory utilization per node and cluster-wide metrics > > > Key: YARN-10450 > URL: https://issues.apache.org/jira/browse/YARN-10450 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, > YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, > YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch > > > Add metrics to show actual cpu and memory utilization for each node and > aggregated for the entire cluster. This is information is already passed > from NM to RM in the node status update. > We have been running with this internally for quite a while and found it > useful to be able to quickly see the actual cpu/memory utilization on the > node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10450: --- Fix Version/s: 3.3.1 3.4.0 > Add cpu and memory utilization per node and cluster-wide metrics > > > Key: YARN-10450 > URL: https://issues.apache.org/jira/browse/YARN-10450 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, > YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, > YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch > > > Add metrics to show actual cpu and memory utilization for each node and > aggregated for the entire cluster. This is information is already passed > from NM to RM in the node status update. > We have been running with this internally for quite a while and found it > useful to be able to quickly see the actual cpu/memory utilization on the > node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214833#comment-17214833 ] Eric Badger commented on YARN-10450: [~Jim_Brennan], looks good now! +1 committing now > Add cpu and memory utilization per node and cluster-wide metrics > > > Key: YARN-10450 > URL: https://issues.apache.org/jira/browse/YARN-10450 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: NodesPage.png, YARN-10450.001.patch, > YARN-10450.002.patch, YARN-10450.003.patch > > > Add metrics to show actual cpu and memory utilization for each node and > aggregated for the entire cluster. This is information is already passed > from NM to RM in the node status update. > We have been running with this internally for quite a while and found it > useful to be able to quickly see the actual cpu/memory utilization on the > node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214795#comment-17214795 ] Eric Badger commented on YARN-10244: Makes sense. Thanks, [~aajisaka] > backport YARN-9848 to branch-3.2 > > > Key: YARN-10244 > URL: https://issues.apache.org/jira/browse/YARN-10244 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Attachments: YARN-10244-branch-3.2.001.patch, > YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch > > > Backporting YARN-9848 to branch-3.2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10244) backport YARN-9848 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214050#comment-17214050 ] Eric Badger commented on YARN-10244: I'm pretty confused with all of the JIRAs on this. In the future, I think we should revert the JIRA using the JIRA that was committed. Let me summarize what I think happened and you all can let me know if I have it right. YARN-4946 committed to 3.2, so it is in 3.2, 3.3, and trunk YARN-9848 reverted YARN-4946 from 3.3, so YARN-4946 only remains in 3.2 YARN-10244 reverted YARN-4946 from 3.2, so YARN-4946 has been completely reverted It's really confusing to me because YARN-4946 has the Fix Version set as 3.2. And then this JIRA says it is backporting YARN-9848, instead of saying it's reverting YARN-4946. Anyway, like I said above, if we're going to revert stuff, I think it is better to do it on the JIRA where it was committed so that we have a clear linear log of where it was committed to and reverted from. We can also then look at the Fix Version for that particular JIRA and know where it is actually committed > backport YARN-9848 to branch-3.2 > > > Key: YARN-10244 > URL: https://issues.apache.org/jira/browse/YARN-10244 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation, resourcemanager >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Attachments: YARN-10244-branch-3.2.001.patch, > YARN-10244-branch-3.2.002.patch, YARN-10244-branch-3.2.003.patch > > > Backporting YARN-9848 to branch-3.2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213215#comment-17213215 ] Eric Badger commented on YARN-10450: bq. Physical Mem Used % makes sense to me Yea this works for me too. Much more clear IMO > Add cpu and memory utilization per node and cluster-wide metrics > > > Key: YARN-10450 > URL: https://issues.apache.org/jira/browse/YARN-10450 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch > > > Add metrics to show actual cpu and memory utilization for each node and > aggregated for the entire cluster. This is information is already passed > from NM to RM in the node status update. > We have been running with this internally for quite a while and found it > useful to be able to quickly see the actual cpu/memory utilization on the > node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212632#comment-17212632 ] Eric Badger commented on YARN-10450: The patch itself looks good to me. However, I'm wondering if "Mem Utilization" is the correct phrase to convey what we mean. To me this means "Mem Used" / "Mem Avail". But in this case it's the actual utilization of the node. And "Mem Used" isn't really the actual memory that's being used. It's the memory that is allocated to that node via YARN. [~Jim_Brennan], [~epayne] do you have any thoughts on making this terminology a little more clear on the UI? > Add cpu and memory utilization per node and cluster-wide metrics > > > Key: YARN-10450 > URL: https://issues.apache.org/jira/browse/YARN-10450 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch > > > Add metrics to show actual cpu and memory utilization for each node and > aggregated for the entire cluster. This is information is already passed > from NM to RM in the node status update. > We have been running with this internally for quite a while and found it > useful to be able to quickly see the actual cpu/memory utilization on the > node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
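To pin down the distinction being made in this thread, here is a tiny illustrative Java sketch (hypothetical names, not the actual RMNode or ClusterMetrics API): "Mem Used" is what the scheduler has allocated to containers, while the proposed column reports the node's actual physical memory use from the NM status update, so a label like "Physical Mem Used %" keeps the two from being read as the same number.

{code:java}
public class MemColumnsSketch {

    /** Hypothetical per-node numbers a nodes page might have on hand. */
    static final class NodeMemory {
        final long allocatedMB;     // handed to containers by the scheduler ("Mem Used")
        final long capacityMB;      // memory the node registered with the RM
        final long physicalUsedMB;  // real usage reported in the NM status update

        NodeMemory(long allocatedMB, long capacityMB, long physicalUsedMB) {
            this.allocatedMB = allocatedMB;
            this.capacityMB = capacityMB;
            this.physicalUsedMB = physicalUsedMB;
        }

        double allocatedPct()    { return capacityMB == 0 ? 0 : 100.0 * allocatedMB / capacityMB; }
        double physicalUsedPct() { return capacityMB == 0 ? 0 : 100.0 * physicalUsedMB / capacityMB; }
    }

    public static void main(String[] args) {
        // A node where YARN has allocated 90% of memory but containers only touch 40%.
        NodeMemory node = new NodeMemory(115_200, 128_000, 51_200);
        System.out.printf("Mem Used %%:          %.1f%n", node.allocatedPct());
        System.out.printf("Physical Mem Used %%: %.1f%n", node.physicalUsedPct());
    }
}
{code}

Overcommit is exactly the case where the two numbers diverge, which is why showing both on the nodes page is useful.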
[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
[ https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212557#comment-17212557 ] Eric Badger commented on YARN-10450: I'll review it > Add cpu and memory utilization per node and cluster-wide metrics > > > Key: YARN-10450 > URL: https://issues.apache.org/jira/browse/YARN-10450 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: NodesPage.png, YARN-10450.001.patch, YARN-10450.002.patch > > > Add metrics to show actual cpu and memory utilization for each node and > aggregated for the entire cluster. This is information is already passed > from NM to RM in the node status update. > We have been running with this internally for quite a while and found it > useful to be able to quickly see the actual cpu/memory utilization on the > node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212524#comment-17212524 ] Eric Badger commented on YARN-9667: --- Thanks, [~Jim_Brennan]! > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.5, 2.10.2 > > Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, > YARN-9667-branch-3.2.001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10455) TestNMProxy.testNMProxyRPCRetry is not consistent
[ https://issues.apache.org/jira/browse/YARN-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10455: --- Fix Version/s: 2.10.2 Thanks for the branch-2.10 patch, [~ahussein]. I committed this to branch-2.10 > TestNMProxy.testNMProxyRPCRetry is not consistent > - > > Key: YARN-10455 > URL: https://issues.apache.org/jira/browse/YARN-10455 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Fix For: 3.1.2, 3.2.2, 3.4.0, 3.3.1, 2.10.2 > > Attachments: YARN-10455-branch-2.10.001.patch, YARN-10455.001.patch > > > The fix in YARN-8844 may fail depending on the configuration of the machine > running the test. > In some cases the address gets resolved and the unit test throws a connection > timeout exception instead. In such a scenario the JUnit test times out, and the > main reason behind the failure is swallowed by the shutdown of the clients. > To make sure that the JUnit behavior is consistent, a suggested fix is to > set the host address to {{127.0.0.1:1}}. The latter avoids the possibility of > collisions on non-privileged ports. > Also, it is more correct to catch {{SocketException}} directly rather than > catching IOException with a check for not {{SocketException}}. > > The stack trace with such failures: > {code:bash} > [INFO] Running > org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy > [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 24.293 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy > [ERROR] > testNMProxyRPCRetry(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy) > Time elapsed: 20.18 s <<< ERROR! > org.junit.runners.model.TestTimedOutException: test timed out after 2 > milliseconds > at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method) > at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198) > at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117) > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) > at > org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821) > at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1645) > at org.apache.hadoop.ipc.Client.call(Client.java:1461) > at org.apache.hadoop.ipc.Client.call(Client.java:1414) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:234) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:119) > at com.sun.proxy.$Proxy24.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:133) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) > at com.sun.proxy.$Proxy25.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.TestNMProxy.testNMProxyRPCRetry(TestNMProxy.java:167) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
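The reasoning behind 127.0.0.1:1 and catching SocketException can be shown with a standalone Java snippet (plain java.net sockets, not the NMProxy or Hadoop IPC code, and it assumes nothing on the host is listening on port 1): the connect attempt is refused almost immediately instead of hanging until a timeout, and the refusal surfaces as a SocketException subtype that can be caught separately from other IOExceptions.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketException;

public class LowPortConnectDemo {
    public static void main(String[] args) {
        try (Socket socket = new Socket()) {
            // Port 1 is privileged and normally has no listener, so this fails
            // fast with "connection refused" instead of waiting out the timeout.
            socket.connect(new InetSocketAddress("127.0.0.1", 1), 5_000);
            System.out.println("Unexpectedly connected; something is listening on port 1");
        } catch (SocketException e) {
            // ConnectException extends SocketException, so the refused
            // connection lands here, which is the case the test wants to hit reliably.
            System.out.println("Got " + e.getClass().getSimpleName() + ": " + e.getMessage());
        } catch (IOException e) {
            // Anything else (for example a timeout) would land here instead.
            System.out.println("Other IOException: " + e);
        }
    }
}
{code}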
[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9667: -- Attachment: YARN-9667-branch-2.10.001.patch > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.5 > > Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, > YARN-9667-branch-3.2.001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210518#comment-17210518 ] Eric Badger commented on YARN-9667: --- Thanks, [~Jim_Brennan]! I attached another patch that should work for branch-2.10 > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.5 > > Attachments: YARN-9667-001.patch, YARN-9667-branch-2.10.001.patch, > YARN-9667-branch-3.2.001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9667: -- Attachment: YARN-9667-branch-3.2.001.patch > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9667-001.patch, YARN-9667-branch-3.2.001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9667: -- Attachment: YARN-5121-branch-3.2.001.patch > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9667-001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9667: -- Attachment: (was: YARN-5121-branch-3.2.001.patch) > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9667-001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-9667) Container-executor.c duplicates messages to stdout
[ https://issues.apache.org/jira/browse/YARN-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger reopened YARN-9667: --- Re-opening so that we can pull this back to branch-3.2 and beyond. Hopefully all the way back to 2.10 > Container-executor.c duplicates messages to stdout > -- > > Key: YARN-9667 > URL: https://issues.apache.org/jira/browse/YARN-9667 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, yarn >Affects Versions: 3.2.0 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9667-001.patch > > > When a container is killed by its AM we get a similar error message like this: > {noformat} > 2019-06-30 12:09:04,412 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 143. Privileged Execution Operation > Stderr: > Stdout: main : command provided 1 > main : run as user is systest > main : requested yarn user is systest > Getting exit code file... > Creating script paths... > Writing pid file... > Writing to tmp file > /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_19/container_e84_1561921629886_0001_01_19.pid.tmp > Writing to cgroup task files... > Creating local dirs... > Launching container... > Getting exit code file... > Creating script paths... > {noformat} > In container-executor.c the fork point is right after the "Creating script > paths..." part, though in the Stdout log we can clearly see it has been > written there twice. After consulting with [~pbacsko] it seems like there's a > missing flush in container-executor.c before the fork and that causes the > duplication. > I suggest to add a flush there so that it won't be duplicated: it's a bit > misleading that the child process writes out "Getting exit code file" and > "Creating script paths" even though it is clearly not doing that. > A more appealing solution could be to revisit the fprintf-fflush pairs in the > code and change them to a single call, so that the fflush calls would not be > forgotten accidentally. (It can cause problems in every place where it's > used). > Note: this issue probably affects every occasion of fork(), not just the one > from {{launch_container_as_user}} in {{main.c}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809-branch-3.2.009.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202314#comment-17202314 ] Eric Badger commented on YARN-9809: --- So close. Those pesky unit tests. Patch 009 fixes the unit test failure. Thanks for the review, [~Jim_Brennan]! > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809-branch-3.2.009.patch, > YARN-9809.001.patch, YARN-9809.002.patch, YARN-9809.003.patch, > YARN-9809.004.patch, YARN-9809.005.patch, YARN-9809.006.patch, > YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201814#comment-17201814 ] Eric Badger commented on YARN-9809: --- I've attached branch-3.2 patch 008 to address your comments, [~Jim_Brennan]. I think I got all of the unit tests to pass. But TestCombinedSystemMetricsPublisher, TestSystemMetricsPublisherForV2, TestFSSchedulerConfigurationStore, and TestZKConfigurationStore failed for me locally on straight up branch-3.2 > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809-branch-3.2.008.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, > YARN-9809-branch-3.2.008.patch, YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201782#comment-17201782 ] Eric Badger commented on YARN-9809: --- {noformat} RMNodeImpl#AddNodeTransition#transition RMNodeStatusEvent rmNodeStatusEvent = new RMNodeStatusEvent(nodeId, nodeStatus); NodeHealthStatus nodeHealthStatus = updateRMNodeFromStatusEvents(rmNode, rmNodeStatusEvent); if (nodeHealthStatus.getIsNodeHealthy()) { {noformat} bq. Do we run the risk of nodeHealthStatus being null? [~epayne], nope we should be fine here. {{nodeHealthStatus}} comes from the return value of {{updateRMNodeFromStatusEvents}}. The return value of that method comes from {{statusEvent.getNodeHealthStatus()}}. But {{statusEvent}} is passed into this method via an argument. On the caller side that argument is named {{rmNodeStatusEvent}} and it is created a few lines up via the RMNodeStatusEvent constructor. The {{nodeStatus}} is set there via the constructor and we know it won't be null because we are in the "else" of the "if" statement that checked for {{nodeStatus}} being null. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
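For readers following the null-safety argument in the comment above, here is a condensed, hypothetical sketch of that control flow. The names mirror the discussion (RMNodeStatusEvent, updateRMNodeFromStatusEvents) but the code below is not the real RMNodeImpl.AddNodeTransition implementation; it only shows the shape of the reasoning: the event is constructed exclusively in the branch where nodeStatus is known to be non-null, so the health status read back from it cannot be null.

{code:java}
// Condensed sketch of the control flow described above, not the actual
// RMNodeImpl.AddNodeTransition code; stand-in classes only.
public class AddNodeTransitionSketch {

  static final class NodeStatus {
    final boolean healthy;
    NodeStatus(boolean healthy) { this.healthy = healthy; }
    boolean getIsNodeHealthy() { return healthy; }
  }

  /** Stand-in for RMNodeStatusEvent: it simply carries the status it was built with. */
  static final class StatusEvent {
    private final NodeStatus status;
    StatusEvent(NodeStatus status) { this.status = status; }
    NodeStatus getNodeHealthStatus() { return status; }
  }

  /** Stand-in for updateRMNodeFromStatusEvents: returns the event's health status. */
  static NodeStatus updateFromStatusEvent(StatusEvent event) {
    return event.getNodeHealthStatus();
  }

  static void transition(NodeStatus nodeStatus) {
    if (nodeStatus == null) {
      // Registration carried no status; the event below is never built from null.
      System.out.println("no status supplied, defaulting to healthy");
      return;
    }
    // We are in the non-null branch, so the event's status, and hence the
    // value returned by updateFromStatusEvent(), cannot be null here.
    StatusEvent event = new StatusEvent(nodeStatus);
    NodeStatus health = updateFromStatusEvent(event);
    System.out.println("healthy=" + health.getIsNodeHealthy());
  }

  public static void main(String[] args) {
    transition(new NodeStatus(true));
    transition(null);
  }
}
{code}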
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201116#comment-17201116 ] Eric Badger commented on YARN-9809: --- Thanks for the initial reviews, [~epayne] and [~Jim_Brennan]! I will put up an updated patch soon with changes related to your comments. I also noticed some other issues that are manifesting as the unit test failures. So I will fix those as well. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199707#comment-17199707 ] Eric Badger commented on YARN-9809: --- [~epayne], [~Jim_Brennan], sorry for the delay. I have put up a patch for branch-3.2. However I think this needs another round of review because the diff was quite massive on the cherry-pick and I had to redo a lot of stuff by hand. So in a lot of ways, this is a completely new patch. I think I got all of the unit tests that would've failed, but we'll see what HadoopQA says. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger reopened YARN-9809: --- > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809-branch-3.2.007.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809-branch-3.2.007.patch, YARN-9809.001.patch, > YARN-9809.002.patch, YARN-9809.003.patch, YARN-9809.004.patch, > YARN-9809.005.patch, YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10391) --module-gpu functionality is broken in container-executor
[ https://issues.apache.org/jira/browse/YARN-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179152#comment-17179152 ] Eric Badger commented on YARN-10391: Thanks, [~Jim_Brennan]! > --module-gpu functionality is broken in container-executor > -- > > Key: YARN-10391 > URL: https://issues.apache.org/jira/browse/YARN-10391 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10391.001.patch > > > {{--module-gpu}} doesn't set the {{operation}} variable, and so the > {{main()}} function's switch statement on {{operation}} falls through to the > default case. This causes it to report a failure, even though it succeeded. > {noformat} > default: > fprintf(ERRORFILE, "Unexpected operation code: %d\n", operation); > exit_code = INVALID_COMMAND_PROVIDED; > break; > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10391) --module-gpu functionality is broken in container-executor
[ https://issues.apache.org/jira/browse/YARN-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10391: --- Attachment: YARN-10391.001.patch > --module-gpu functionality is broken in container-executor > -- > > Key: YARN-10391 > URL: https://issues.apache.org/jira/browse/YARN-10391 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10391.001.patch > > > {{--module-gpu}} doesn't set the {{operation}} variable, and so the > {{main()}} function's switch statement on {{operation}} falls through to the > default case. This causes it to report a failure, even though it succeeded. > {noformat} > default: > fprintf(ERRORFILE, "Unexpected operation code: %d\n", operation); > exit_code = INVALID_COMMAND_PROVIDED; > break; > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10391) --module-gpu functionality is broken in container-executor
Eric Badger created YARN-10391: -- Summary: --module-gpu functionality is broken in container-executor Key: YARN-10391 URL: https://issues.apache.org/jira/browse/YARN-10391 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.3.0 Reporter: Eric Badger Assignee: Eric Badger {{--module-gpu}} doesn't set the {{operation}} variable, and so the {{main()}} function's switch statement on {{operation}} falls through to the default case. This causes it to report a failure, even though it succeeded. {noformat} default: fprintf(ERRORFILE, "Unexpected operation code: %d\n", operation); exit_code = INVALID_COMMAND_PROVIDED; break; {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-7677: -- Fix Version/s: 2.10.1 Thanks for the patch, [~Jim_Brennan]! +1. I committed this to branch-2.10 > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Fix For: 3.1.0, 2.10.1 > > Attachments: YARN-7677-branch-2.10.001.patch, > YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
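The whitelist behaviour discussed in YARN-7677 above can be illustrated with a minimal sketch. This is not the actual ContainerLaunch or Docker runtime code from the patch; buildEnv and the surrounding names are assumptions used only to show the intended rule: the NM-side value of a whitelisted variable such as HADOOP_CONF_DIR is a fallback, and it must not override a value the container image or the user already provides.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of whitelist-style environment handling; not the actual
// YARN-7677 change, just the rule it is meant to enforce.
public class WhitelistEnvSketch {

  /**
   * Only inject a whitelisted NM-side variable when the container (for example a
   * Docker image with its own Hadoop install) has not already defined it.
   */
  static Map<String, String> buildEnv(Map<String, String> containerEnv,
                                      Map<String, String> nmEnv,
                                      Set<String> whitelist) {
    Map<String, String> result = new HashMap<>(containerEnv);
    for (String var : whitelist) {
      if (!result.containsKey(var) && nmEnv.containsKey(var)) {
        result.put(var, nmEnv.get(var));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> imageEnv = Map.of("HADOOP_CONF_DIR", "/opt/hadoop/conf-in-image");
    Map<String, String> nmEnv = Map.of("HADOOP_CONF_DIR", "/etc/hadoop/conf");
    // The image's own value wins; the NM value is used only when nothing is set.
    System.out.println(buildEnv(imageEnv, nmEnv, Set.of("HADOOP_CONF_DIR")));
    System.out.println(buildEnv(Map.of(), nmEnv, Set.of("HADOOP_CONF_DIR")));
  }
}
{code}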
[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-7677: -- Attachment: (was: YARN-7677-branch-2.10.002.patch) > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Fix For: 3.1.0 > > Attachments: YARN-7677-branch-2.10.001.patch, > YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-7677: -- Attachment: YARN-7677-branch-2.10.002.patch > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Fix For: 3.1.0 > > Attachments: YARN-7677-branch-2.10.001.patch, > YARN-7677-branch-2.10.002.patch, YARN-7677-branch-2.10.002.patch, > YARN-7677.001.patch, YARN-7677.002.patch, YARN-7677.003.patch, > YARN-7677.004.patch, YARN-7677.005.patch, YARN-7677.006.patch, > YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4575: -- Fix Version/s: 2.10.1 3.1.4 Thanks for the updated patch, [~epayne]! +1. I've committed this to branch-3.1 and branch-2.10. It's now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10 > ApplicationResourceUsageReport should return ALL reserved resource > --- > > Key: YARN-4575 > URL: https://issues.apache.org/jira/browse/YARN-4575 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin Chundatt >Priority: Major > Labels: oct16-easy > Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1 > > Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, > YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch, > YARN-4575.branch-3.1..005.patch > > > ApplicationResourceUsageReport reserved resource report is only of the default > partition; it should be of all partitions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
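A minimal sketch of what "ALL reserved resource" means in the YARN-4575 description above: sum the reservations over every node partition instead of reporting only the default partition. The tiny Res class stands in for YARN's Resource record; this is illustrative and not the scheduler code changed by the patch.

{code:java}
import java.util.Map;

// Illustrative sketch of the fix's intent for ApplicationResourceUsageReport;
// Res is a stand-in for YARN's Resource, not the real class.
public class ReservedAcrossPartitionsSketch {

  static final class Res {
    final long memoryMB;
    final int vcores;
    Res(long memoryMB, int vcores) { this.memoryMB = memoryMB; this.vcores = vcores; }
    Res plus(Res other) { return new Res(memoryMB + other.memoryMB, vcores + other.vcores); }
    @Override
    public String toString() { return "<memory:" + memoryMB + ", vCores:" + vcores + ">"; }
  }

  /** Old behaviour: only the default ("") partition's reservation is reported. */
  static Res reservedDefaultOnly(Map<String, Res> reservedByPartition) {
    return reservedByPartition.getOrDefault("", new Res(0, 0));
  }

  /** Proposed behaviour: sum the reservations of every partition. */
  static Res reservedAllPartitions(Map<String, Res> reservedByPartition) {
    Res total = new Res(0, 0);
    for (Res r : reservedByPartition.values()) {
      total = total.plus(r);
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Res> reserved = Map.of(
        "", new Res(4096, 4),        // default partition
        "gpu", new Res(8192, 8));    // labelled partition
    System.out.println("default only: " + reservedDefaultOnly(reserved));
    System.out.println("all partitions: " + reservedAllPartitions(reserved));
  }
}
{code}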
[jira] [Updated] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4575: -- Fix Version/s: 3.3.1 3.4.0 3.2.2 > ApplicationResourceUsageReport should return ALL reserved resource > --- > > Key: YARN-4575 > URL: https://issues.apache.org/jira/browse/YARN-4575 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin Chundatt >Priority: Major > Labels: oct16-easy > Fix For: 3.2.2, 3.4.0, 3.3.1 > > Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, > YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch > > > ApplicationResourceUsageReport reserved resource report is only of the default > partition; it should be of all partitions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171701#comment-17171701 ] Eric Badger commented on YARN-4575: --- +1 lgtm. The test failures are unrelated to this patch. I've committed this to trunk (3.4), branch-3.3, and branch-3.2. The cherry-pick to branch-3.1 is clean, but compilation fails. [~epayne], if you'd like it to go all the way back to 2.10, could you put up an additional patch for branch-3.1 (and branch-2.10 if necessary)? Thanks > ApplicationResourceUsageReport should return ALL reserved resource > --- > > Key: YARN-4575 > URL: https://issues.apache.org/jira/browse/YARN-4575 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin Chundatt >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, > YARN-4575.003.patch, YARN-4575.004.patch, YARN-4575.005.patch > > > ApplicationResourceUsageReport reserved resource report is only of the default > partition; it should be of all partitions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170951#comment-17170951 ] Eric Badger commented on YARN-7677: --- Thanks, [~aajisaka]! Cancelling and resubmitting the patch to rerun Hadoop QA > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Fix For: 3.1.0 > > Attachments: YARN-7677-branch-2.10.001.patch, > YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170464#comment-17170464 ] Eric Badger commented on YARN-7677: --- [~Jim_Brennan], I'm +1 on this patch. I'll give a day for others to chime in if they'd like to. > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Fix For: 3.1.0 > > Attachments: YARN-7677-branch-2.10.001.patch, > YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170463#comment-17170463 ] Eric Badger commented on YARN-7677: --- I think https://issues.apache.org/jira/browse/HADOOP-17091 is related to the javadoc failures we're seeing. It was committed yesterday and adds {{YETUS_ARGS+=("--mvn-javadoc-goals=process-sources,javadoc:javadoc-no-fork")}} [~aajisaka], can you take a look and take appropriate action on HADOOP-17091 if you think this is related? > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Fix For: 3.1.0 > > Attachments: YARN-7677-branch-2.10.001.patch, > YARN-7677-branch-2.10.002.patch, YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10363) TestRMAdminCLI.testHelp is failing in branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10363: --- Fix Version/s: 2.10.1 3.1.4 +1. Thanks for the patch, [~BilwaST] and for the review, [~Jim_Brennan]. I've committed this to branch-3.1 and branch-2.10. > TestRMAdminCLI.testHelp is failing in branch-2.10 > - > > Key: YARN-10363 > URL: https://issues.apache.org/jira/browse/YARN-10363 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.1 >Reporter: Jim Brennan >Assignee: Bilwa S T >Priority: Major > Fix For: 3.1.4, 2.10.1 > > Attachments: YARN-10363-branch-2.10.patch > > > TestRMAdminCLI.testHelp is failing in branch-2.10. > Example failure: > {noformat} > --- > Test set: org.apache.hadoop.yarn.client.cli.TestRMAdminCLI > --- > Tests run: 31, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 18.668 s <<< > FAILURE! - in org.apache.hadoop.yarn.client.cli.TestRMAdminCLI > testHelp(org.apache.hadoop.yarn.client.cli.TestRMAdminCLI) Time elapsed: > 0.043 s <<< FAILURE! > java.lang.AssertionError: > Expected error message: > Usage: yarn rmadmin [-failover [--forcefence] [--forceactive] > ] is not included in messages: > Usage: yarn rmadmin >-refreshQueues >-refreshNodes [-g|graceful [timeout in seconds] -client|server] >-refreshNodesResources >-refreshSuperUserGroupsConfiguration >-refreshUserToGroupsMappings >-refreshAdminAcls >-refreshServiceAcl >-getGroups [username] >-addToClusterNodeLabels > <"label1(exclusive=true),label2(exclusive=false),label3"> >-removeFromClusterNodeLabels (label splitted by ",") >-replaceLabelsOnNode <"node1[:port]=label1,label2 > node2[:port]=label1,label2"> [-failOnUnknownNodes] >-directlyAccessNodeLabelStore >-refreshClusterMaxPriority >-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) >-help [cmd] > Generic options supported are: > -conf specify an application configuration file > -Ddefine a value for a given property > -fs specify default filesystem URL to use, > overrides 'fs.defaultFS' property from configurations. 
> -jt specify a ResourceManager > -files specify a comma-separated list of files to > be copied to the map reduce cluster > -libjarsspecify a comma-separated list of jar files > to be included in the classpath > -archives specify a comma-separated list of archives > to be unarchived on the compute machines > The general command line syntax is: > command [genericOptions] [commandOptions] > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testError(TestRMAdminCLI.java:859) > at > org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:585) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at >
[jira] [Commented] (YARN-4575) ApplicationResourceUsageReport should return ALL reserved resource
[ https://issues.apache.org/jira/browse/YARN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169118#comment-17169118 ] Eric Badger commented on YARN-4575: --- [~epayne], the content looks good, but could you address the 2 checkstyle issues? > ApplicationResourceUsageReport should return ALL reserved resource > --- > > Key: YARN-4575 > URL: https://issues.apache.org/jira/browse/YARN-4575 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin Chundatt >Priority: Major > Labels: oct16-easy > Attachments: 0001-YARN-4575.patch, 0002-YARN-4575.patch, > YARN-4575.003.patch, YARN-4575.004.patch > > > ApplicationResourceUsageReport reserved resource report is only of the default > partition; it should be of all partitions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4771: -- Fix Version/s: 3.3.1 2.10.1 3.1.4 3.2.2 I cherry-picked this back to branch-2.10. It has now been committed to trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10 > Some containers can be skipped during log aggregation after NM restart > -- > > Key: YARN-4771 > URL: https://issues.apache.org/jira/browse/YARN-4771 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.0, 3.2.1, 3.1.3 >Reporter: Jason Darrell Lowe >Assignee: Jim Brennan >Priority: Major > Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1 > > Attachments: YARN-4771.001.patch, YARN-4771.002.patch, > YARN-4771.003.patch > > > A container can be skipped during log aggregation after a work-preserving > nodemanager restart if the following events occur: > # Container completes more than > yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the > restart > # At least one other container completes after the above container and before > the restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
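For context on the yarn.nodemanager.duration-to-track-stopped-containers window mentioned in the YARN-4771 description above, here is a generic, hypothetical sketch of such a tracking window. It is not the NM recovery or log-aggregation code and not the fix itself; it only shows how a container that stopped more than the configured duration before a restart can age out of the tracked set while a more recently stopped container is still remembered.

{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Generic sketch of a "duration-to-track-stopped-containers" style window;
// illustrative only, not the actual NodeManager implementation.
public class StoppedContainerTrackingSketch {

  private final long trackDurationMs;
  // containerId -> time (ms) after which the entry may be forgotten
  private final Map<String, Long> recentlyStopped = new HashMap<>();

  StoppedContainerTrackingSketch(long trackDurationMs) {
    this.trackDurationMs = trackDurationMs;
  }

  void containerStopped(String containerId, long nowMs) {
    recentlyStopped.put(containerId, nowMs + trackDurationMs);
  }

  /** Prune entries whose tracking window has already passed. */
  void prune(long nowMs) {
    for (Iterator<Map.Entry<String, Long>> it = recentlyStopped.entrySet().iterator(); it.hasNext();) {
      if (it.next().getValue() < nowMs) {
        it.remove();
      }
    }
  }

  boolean isStillTracked(String containerId) {
    return recentlyStopped.containsKey(containerId);
  }

  public static void main(String[] args) {
    StoppedContainerTrackingSketch t = new StoppedContainerTrackingSketch(600_000); // 10 minutes
    t.containerStopped("container_1_0001_01_000002", 0);        // stopped long before the restart
    t.containerStopped("container_1_0001_01_000003", 650_000);  // stopped just before the restart
    t.prune(700_000);                                           // e.g. around NM restart time
    System.out.println(t.isStillTracked("container_1_0001_01_000002")); // false: aged out
    System.out.println(t.isStillTracked("container_1_0001_01_000003")); // true
  }
}
{code}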
[jira] [Updated] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-4771: -- Fix Version/s: 3.4.0 Thanks for the updated patch, [~Jim_Brennan]. And thanks to [~jlowe] for the original patch. +1. I've committed this to trunk (3.4) > Some containers can be skipped during log aggregation after NM restart > -- > > Key: YARN-4771 > URL: https://issues.apache.org/jira/browse/YARN-4771 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.10.0, 3.2.1, 3.1.3 >Reporter: Jason Darrell Lowe >Assignee: Jim Brennan >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-4771.001.patch, YARN-4771.002.patch, > YARN-4771.003.patch > > > A container can be skipped during log aggregation after a work-preserving > nodemanager restart if the following events occur: > # Container completes more than > yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the > restart > # At least one other container completes after the above container and before > the restart -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10353) Log vcores used and cumulative cpu in containers monitor
[ https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10353: --- Fix Version/s: 3.4.0 > Log vcores used and cumulative cpu in containers monitor > > > Key: YARN-10353 > URL: https://issues.apache.org/jira/browse/YARN-10353 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10353.001.patch, YARN-10353.002.patch > > > We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the > Containers Monitor log. It would be useful to also log vcores used vs vcores > assigned, and total accumulated CPU time. > For example, currently we have an audit log that looks like this: > {noformat} > 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit > (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree > 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB > physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 > CPU/core:35.772625 > {noformat} > The proposal is to add two more fields to show vCores and Cumulative CPU ms: > {noformat} > 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit > (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree > 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB > physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 > CPU/core:35.772625 vCores:2/1 CPU-ms:4180 > {noformat} > This is a snippet of a log from one of our clusters running branch-2.8 with a > similar change. > {noformat} > 2020-07-16 21:00:02,240 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for > container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB > physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of > 10 CPU vCores used. Cumulative CPU time: 157410 > 2020-07-16 21:00:02,269 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for > container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. Cumulative CPU time: 113830 > 2020-07-16 21:00:02,298 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for > container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB > physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of > 10 CPU vCores used. Cumulative CPU time: 128630 > 2020-07-16 21:00:02,339 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for > container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. Cumulative CPU time: 96060 > 2020-07-16 21:00:02,367 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for > container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB > physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of > 10 CPU vCores used. 
Cumulative CPU time: 116820 > 2020-07-16 21:00:02,396 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for > container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB > physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of > 10 CPU vCores used. Cumulative CPU time: 45900 > 2020-07-16 21:00:02,424 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for > container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB > physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of > 10 CPU vCores used. Cumulative CPU time: 2572390 > 2020-07-16 21:00:02,456 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for > container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. Cumulative CPU time: 101210 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
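A small formatting sketch of the audit line proposed in YARN-10353 above, with the two new fields (vCores used vs. assigned, cumulative CPU ms) appended. The method and field names are illustrative assumptions, not the actual ContainersMonitorImpl change.

{code:java}
// Illustrative formatting of the proposed audit line; not the real
// ContainersMonitorImpl.recordUsage() code.
public class AuditLineSketch {

  static String auditLine(String containerId, float pmemGB, float pmemLimitGB,
                          float vmemGB, float vmemLimitGB, float cpuPercent,
                          float cpuPercentPerCore, int vcoresUsed, int vcoresAssigned,
                          long cumulativeCpuMs) {
    return String.format(
        "Resource usage of ProcessTree for container-id %s: %.1f GB of %.0f GB physical memory used; "
            + "%.1f GB of %.1f GB virtual memory used CPU:%.4f CPU/core:%.6f vCores:%d/%d CPU-ms:%d",
        containerId, pmemGB, pmemLimitGB, vmemGB, vmemLimitGB,
        cpuPercent, cpuPercentPerCore, vcoresUsed, vcoresAssigned, cumulativeCpuMs);
  }

  public static void main(String[] args) {
    // Numbers roughly follow the example in the description above.
    System.out.println(auditLine("container_1594931466123_0002_01_000007",
        0.3f, 2f, 2.8f, 4.2f, 143.0905f, 35.772625f, 2, 1, 4180));
  }
}
{code}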
[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor
[ https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160127#comment-17160127 ] Eric Badger commented on YARN-10353: bq. Would Vcores: 2 of 10 be better? Yea, this is way more intuitive to me. Though it would be 2 of 1 in this case, right? bq. I was thinking spaces delimit fields, and colons delimit label vs value. This is fine too. You could do something like vCores used/vCores available:2/1 or something similar. That way you know what the values mean and it's still easily parsable Agreed on the CPU naming. When I see a number over 100 I assume the number is cpu ms or something else, not a percentage. And I see now, I incorrectly thought the logs in your 2.8 snippet your all the same job and was confused at the number jumping all over the place. I think CPU-ms is intuitive in this case because you can see that it is monotonically increasing. To be more clear, you could use "Total-CPU-ms", "Cumulative-CPU-ms", or "Accumulated-CPU-ms". But I'm ok with it either way. > Log vcores used and cumulative cpu in containers monitor > > > Key: YARN-10353 > URL: https://issues.apache.org/jira/browse/YARN-10353 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: YARN-10353.001.patch > > > We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the > Containers Monitor log. It would be useful to also log vcores used vs vcores > assigned, and total accumulated CPU time. > For example, currently we have an audit log that looks like this: > {noformat} > 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit > (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree > 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB > physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 > CPU/core:35.772625 > {noformat} > The proposal is to add two more fields to show vCores and Cumulative CPU ms: > {noformat} > 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit > (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree > 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB > physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 > CPU/core:35.772625 vCores:2/1 CPU-ms:4180 > {noformat} > This is a snippet of a log from one of our clusters running branch-2.8 with a > similar change. > {noformat} > 2020-07-16 21:00:02,240 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for > container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB > physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of > 10 CPU vCores used. Cumulative CPU time: 157410 > 2020-07-16 21:00:02,269 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for > container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. Cumulative CPU time: 113830 > 2020-07-16 21:00:02,298 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for > container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB > physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of > 10 CPU vCores used. 
Cumulative CPU time: 128630 > 2020-07-16 21:00:02,339 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for > container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. Cumulative CPU time: 96060 > 2020-07-16 21:00:02,367 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for > container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB > physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of > 10 CPU vCores used. Cumulative CPU time: 116820 > 2020-07-16 21:00:02,396 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for > container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB > physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of > 10 CPU vCores used. Cumulative CPU time: 45900 > 2020-07-16 21:00:02,424 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for > container-id
[jira] [Commented] (YARN-10353) Log vcores used and cumulative cpu in containers monitor
[ https://issues.apache.org/jira/browse/YARN-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160111#comment-17160111 ] Eric Badger commented on YARN-10353: I like the change, but the log isn't very intuitive to me. {{vCores:2/1}} is confusing to me. Looking at the code it looks like this is number of vCores actually used over number of vCores allocated to the container. Personally, I like the way it's logged in the branch-2.8 snippet better. And {{CPU-ms}} is the CPU-ms since the monitor last ran, right? I suppose that one probably can't be much clearer than it is without a decent amount of text > Log vcores used and cumulative cpu in containers monitor > > > Key: YARN-10353 > URL: https://issues.apache.org/jira/browse/YARN-10353 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Minor > Attachments: YARN-10353.001.patch > > > We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the > Containers Monitor log. It would be useful to also log vcores used vs vcores > assigned, and total accumulated CPU time. > For example, currently we have an audit log that looks like this: > {noformat} > 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit > (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree > 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB > physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 > CPU/core:35.772625 > {noformat} > The proposal is to add two more fields to show vCores and Cumulative CPU ms: > {noformat} > 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit > (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree > 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB > physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 > CPU/core:35.772625 vCores:2/1 CPU-ms:4180 > {noformat} > This is a snippet of a log from one of our clusters running branch-2.8 with a > similar change. > {noformat} > 2020-07-16 21:00:02,240 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for > container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB > physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of > 10 CPU vCores used. Cumulative CPU time: 157410 > 2020-07-16 21:00:02,269 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for > container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. Cumulative CPU time: 113830 > 2020-07-16 21:00:02,298 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for > container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB > physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of > 10 CPU vCores used. Cumulative CPU time: 128630 > 2020-07-16 21:00:02,339 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for > container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 > of 10 CPU vCores used. 
Cumulative CPU time: 96060 > 2020-07-16 21:00:02,367 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for > container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB > physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of > 10 CPU vCores used. Cumulative CPU time: 116820 > 2020-07-16 21:00:02,396 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for > container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB > physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of > 10 CPU vCores used. Cumulative CPU time: 45900 > 2020-07-16 21:00:02,424 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for > container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB > physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of > 10 CPU vCores used. Cumulative CPU time: 2572390 > 2020-07-16 21:00:02,456 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for > container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 > GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU
[jira] [Updated] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10348: --- Fix Version/s: 3.1.5 2.10.1 3.2.2 Thanks for the new patch, [~Jim_Brennan]! I've committed this through branch-2.10. So it's now been committed to: trunk (3.4), branch-3.3, branch-3.2, branch-3.1, and branch-2.10 > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 3.1.3 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5 > > Attachments: YARN-10348-branch-3.2.001.patch, YARN-10348.001.patch, > YARN-10348.002.patch > > > (Note: this change was originally done on our internal branch by [~daryn]). > The RM currently has an option for a client to specify disabling token > cancellation when a job completes. This feature was an initial attempt to > address the use case of a job launching sub-jobs (ie. oozie launcher) and the > original job finishing prior to the sub-job(s) completion - ex. original job > completion triggered premature cancellation of tokens needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation to ref count > tokens ([YARN-3055]). This prevented premature cancellation of the token > until all apps using the token complete, and invalidated the need for a > client to specify cancel=false. Unfortunately the config option was not > removed. > We have seen cases where oozie "java actions" and some users were explicitly > disabling token cancellation. This can lead to a buildup of defunct tokens > that may overwhelm the ZK buffer used by the KDC's backing store. At which > point the KMS fails to connect to ZK and is unable to issue/validate new > tokens - rendering the KDC only able to authenticate pre-existing tokens. > Production incidents have occurred due to the buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the > client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
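The override described in YARN-10348 above boils down to a single policy decision on the RM side, sketched below. The configuration key used here is a placeholder assumption, not necessarily the property name introduced by the patch; the point is only that a cluster-wide "always cancel" setting takes precedence over the client's per-application request.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Minimal sketch of the RM-side override; the property name is an assumed
// placeholder and this is not the actual patch.
public class TokenCancelPolicySketch {

  static final String ALWAYS_CANCEL_KEY =
      "yarn.resourcemanager.delegation-token.always-cancel"; // assumed/example key

  /** Cancel when the app asked for it, or when the cluster policy forces it. */
  static boolean shouldCancelTokens(Configuration conf, boolean appRequestedCancel) {
    boolean alwaysCancel = conf.getBoolean(ALWAYS_CANCEL_KEY, false);
    return alwaysCancel || appRequestedCancel;
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setBoolean(ALWAYS_CANCEL_KEY, true);
    // Even if the app (for example an Oozie launcher) asked to keep its tokens,
    // the RM policy wins and the tokens are cancelled when the app completes.
    System.out.println(shouldCancelTokens(conf, false)); // true
  }
}
{code}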
[jira] [Commented] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157030#comment-17157030 ] Eric Badger commented on YARN-10348: +1 (binding), I committed this to trunk (3.4) and branch-3.3. I attempted to cherry-pick to branch-3.2 and there were some conflicts. [~Jim_Brennan], would you mind putting up a patch for branch-3.2? > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 3.1.3 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10348.001.patch, YARN-10348.002.patch > > > (Note: this change was originally done on our internal branch by [~daryn]). > The RM currently has an option for a client to specify disabling token > cancellation when a job completes. This feature was an initial attempt to > address the use case of a job launching sub-jobs (ie. oozie launcher) and the > original job finishing prior to the sub-job(s) completion - ex. original job > completion triggered premature cancellation of tokens needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation to ref count > tokens ([YARN-3055]). This prevented premature cancellation of the token > until all apps using the token complete, and invalidated the need for a > client to specify cancel=false. Unfortunately the config option was not > removed. > We have seen cases where oozie "java actions" and some users were explicitly > disabling token cancellation. This can lead to a buildup of defunct tokens > that may overwhelm the ZK buffer used by the KDC's backing store. At which > point the KMS fails to connect to ZK and is unable to issue/validate new > tokens - rendering the KDC only able to authenticate pre-existing tokens. > Production incidents have occurred due to the buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the > client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10348) Allow RM to always cancel tokens after app completes
[ https://issues.apache.org/jira/browse/YARN-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10348: --- Fix Version/s: 3.3.1 3.4.0 > Allow RM to always cancel tokens after app completes > > > Key: YARN-10348 > URL: https://issues.apache.org/jira/browse/YARN-10348 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0, 3.1.3 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10348.001.patch, YARN-10348.002.patch > > > (Note: this change was originally done on our internal branch by [~daryn]). > The RM currently has an option for a client to specify disabling token > cancellation when a job completes. This feature was an initial attempt to > address the use case of a job launching sub-jobs (ie. oozie launcher) and the > original job finishing prior to the sub-job(s) completion - ex. original job > completion triggered premature cancellation of tokens needed by the sub-jobs. > Many years ago, [~daryn] added a more robust implementation to ref count > tokens ([YARN-3055]). This prevented premature cancellation of the token > until all apps using the token complete, and invalidated the need for a > client to specify cancel=false. Unfortunately the config option was not > removed. > We have seen cases where oozie "java actions" and some users were explicitly > disabling token cancellation. This can lead to a buildup of defunct tokens > that may overwhelm the ZK buffer used by the KDC's backing store. At which > point the KMS fails to connect to ZK and is unable to issue/validate new > tokens - rendering the KDC only able to authenticate pre-existing tokens. > Production incidents have occurred due to the buffer size issue. > To avoid these issues, the RM should have the option to ignore/override the > client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148906#comment-17148906 ] Eric Badger commented on YARN-9809: --- Thanks, [~eyang] for the review and commit and [~Jim_Brennan] for the review! > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148829#comment-17148829 ] Eric Badger commented on YARN-9809: --- Thanks for the review, [~eyang]! Are you planning on committing this or would you like me to? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146522#comment-17146522 ] Eric Badger commented on YARN-9809: --- Thanks, [~Jim_Brennan]! [~eyang], would you take another look? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145893#comment-17145893 ] Eric Badger commented on YARN-9809: --- Good catch, [~Jim_Brennan]. {{updateMetricsForRejoinedNode()}} is only called in one other place and I don't want to add the node and then remove it again. So I removed the increment from {{updateMetricsForRejoinedNode()}} and explicitly added it to just before the other place where {{updateMetricsForRejoinedNode()}} is called. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
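A tiny, hypothetical sketch of the relocation described in this comment; the names are stand-ins and this is not the real RMNodeImpl code.
{code:java}
class RejoinedNodeMetricsSketch {
  private final ClusterMetricsStub metrics = new ClusterMetricsStub();

  // The one remaining call site that still needs the node counted as active
  // does the increment explicitly, right before calling the helper.
  void onHealthyNodeRejoining() {
    metrics.incrNumActiveNodes();
    updateMetricsForRejoinedNode();
  }

  // New path: an unhealthy node rejoining is no longer added as active and
  // then immediately removed.
  void onUnhealthyNodeRejoining() {
    updateMetricsForRejoinedNode();
  }

  // The helper no longer increments the active-node count itself.
  private void updateMetricsForRejoinedNode() {
    // decrement the metric for the node's previous state, etc.
  }
}

class ClusterMetricsStub {
  void incrNumActiveNodes() { /* ... */ }
}
{code}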
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.007.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch, YARN-9809.007.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145717#comment-17145717 ] Eric Badger commented on YARN-9809: --- The TestFairScheduler and TestFairSchedulerPreemption test failures are unrelated to this JIRA as they have also been reported in https://issues.apache.org/jira/browse/YARN-10329 > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17145711#comment-17145711 ] Eric Badger commented on YARN-9809: --- Patch 006 moves {{ClusterMetrics.getMetrics().incrNumActiveNodes();}} into {{reportNodeRunning}} inside the AddNodeTransition. This fixes the failing unit test and prevents a scenario where we add an unhealthy node as RUNNING and then quickly switch it to UNHEALTHY. This way we go straight to UNHEALTHY. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
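A simplified, hypothetical sketch of the registration flow this comment describes, assuming the increment now lives inside {{reportNodeRunning}}; this is not the actual RMNodeImpl/AddNodeTransition code.
{code:java}
class AddNodeTransitionSketch {
  // Only count the node as active when it actually starts out RUNNING, so an
  // unhealthy registration goes straight to UNHEALTHY without an
  // add-then-remove flip of the metric.
  void transition(boolean registeredHealthy, ClusterMetricsSketch metrics) {
    if (registeredHealthy) {
      reportNodeRunning(metrics);      // increment happens here now
    } else {
      metrics.incrNumUnhealthyNMs();   // skip the RUNNING detour entirely
    }
  }

  private void reportNodeRunning(ClusterMetricsSketch metrics) {
    metrics.incrNumActiveNodes();      // moved in from the caller
    // ...followed by the usual node-started bookkeeping
  }
}

class ClusterMetricsSketch {
  void incrNumActiveNodes() { /* ... */ }
  void incrNumUnhealthyNMs() { /* ... */ }
}
{code}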
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.006.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch, > YARN-9809.006.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144544#comment-17144544 ] Eric Badger commented on YARN-9809: --- Thanks for the review, [~Jim_Brennan]! I've uploaded patch 005 to fix the things you mentioned in your comments > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.005.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch, YARN-9809.005.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138899#comment-17138899 ] Eric Badger commented on YARN-9809: --- I can see pros and cons to both approaches. On the one hand, if the health check script fails to execute properly, that's not good and could indicate a real problem with the node. On the other hand, health check scripts are pretty dangerous since they can take out an entire cluster if they're written improperly. If someone updates the script and it suddenly errors out, the whole cluster goes unhealthy. Or the health check script could rely on querying a service that times out: the node is healthy, but the script returned an error. Unless you parse for specific error codes, you can no longer differentiate between the script failing internally and the script successfully reporting that the node is unhealthy. Regardless, this discussion is outside the scope of this JIRA. That's an issue with how the health check script is handled, while this JIRA is just about providing a health status at NM startup. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
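For illustration, here is a hypothetical sketch of the distinction being discussed, i.e. separating "the script itself failed to run" from "the script ran and reported the node unhealthy". The ERROR-line convention and the handling of non-zero exit codes shown here are assumptions for the example, not a statement of what NodeHealthScriptRunner actually does.
{code:java}
class HealthScriptResultSketch {
  enum Outcome { HEALTHY, REPORTED_UNHEALTHY, SCRIPT_FAILED }

  static Outcome classify(int exitCode, String output) {
    if (exitCode != 0) {
      // The script blew up (bad edit, dependency down, etc.). That is not
      // necessarily proof the node is bad, so don't automatically mark it
      // unhealthy under this (assumed) policy.
      return Outcome.SCRIPT_FAILED;
    }
    for (String line : output.split("\n")) {
      if (line.startsWith("ERROR")) {
        return Outcome.REPORTED_UNHEALTHY; // script ran and says the node is bad
      }
    }
    return Outcome.HEALTHY;
  }
}
{code}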
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138892#comment-17138892 ] Eric Badger commented on YARN-9809: --- {noformat:title=NodeHealthScriptRunner.newInstance()} if (!shouldRun(scriptName, nodeHealthScript)) { return null; } {noformat} {noformat:title=NodeHealthScriptRunner.shouldRun()} static boolean shouldRun(String script, String healthScript) { if (healthScript == null || healthScript.trim().isEmpty()) { LOG.info("Missing location for the node health check script \"{}\".", script); return false; } {noformat} If the health check script doesn't exist, then {{shouldRun}} will return false and {{newInstance}} will return null. This will cause the health reporter to not be added as a service. So at the end of the day, your statement is correct. If the health check script doesn't exist, the node will report as healthy. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137934#comment-17137934 ] Eric Badger commented on YARN-9809: --- {noformat} hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesSchedulerActivities hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer {noformat} Neither of these tests fails for me locally, and both are unrelated to the changes made in patch 004. Both the javac and the javadoc errors are coming from generated protobuf Java files. I don't know how to get rid of these errors, but they aren't introducing any warnings that don't already exist. I think they're fine. The generation of the Java files is the issue here. [~Jim_Brennan], [~ccondit], [~eyang], could you guys review patch 004? > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136095#comment-17136095 ] Eric Badger commented on YARN-9809: --- Patch 004 fixes checkstyle. There is still the javac error with PARSER being deprecated, but I don't know how to get rid of that. It is coming from a generated proto file. So I'm not quite sure what to do about that. The PARSER is used in many other places within the same generated file > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.004.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch, YARN-9809.004.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.003.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch, > YARN-9809.003.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10312) Add support for yarn logs -logFile to retain backward compatibility
[ https://issues.apache.org/jira/browse/YARN-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10312: --- Fix Version/s: 3.1.5 2.10.1 3.2.2 Thanks for the new patch, [~Jim_Brennan]! I committed this all the way to branch-2.10. Overall it has now been committed to trunk, branch-3.3, branch-3.2, branch-3.1, and branch-2.10 > Add support for yarn logs -logFile to retain backward compatibility > --- > > Key: YARN-10312 > URL: https://issues.apache.org/jira/browse/YARN-10312 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.10.0, 3.4.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: compatibility > Fix For: 3.2.2, 2.10.1, 3.3.1, 3.1.5, 3.4.1 > > Attachments: YARN-10312-branch-3.2.001.patch, YARN-10312.001.patch > > > The YARN CLI logs command line option {{-logFiles}} was changed to > {{-log_files}} in 2.9 and later releases. This change was made as part of > YARN-5363. > Verizon Media is in the process of moving from Hadoop-2.8 to Hadoop-2.10, and > while testing integration with Spark, we ran into this issue. We are > concerned that we will run into more cases of this as we roll out to > production, and rather than break user scripts, we'd prefer to add > {{-logFiles}} as an alias of {{-log_files}}. If both are provided, > {{-logFiles}} will be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
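For reference, a usage sketch of the two flags; the application id is a placeholder.
{noformat}
# Old-style flag from 2.8, kept as an alias after this change:
yarn logs -applicationId application_1234567890123_0001 -logFiles syslog

# New-style flag introduced by YARN-5363; if both are given, -logFiles is ignored:
yarn logs -applicationId application_1234567890123_0001 -log_files syslog
{noformat}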
[jira] [Updated] (YARN-10312) Add support for yarn logs -logFile to retain backward compatibility
[ https://issues.apache.org/jira/browse/YARN-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10312: --- Fix Version/s: 3.4.1 3.3.1 +1 I committed this to trunk and branch-3.3. The cherry-pick back to branch-3.2 is clean, but the build fails to compile due to a missing method. [~Jim_Brennan], can you put up an additional patch for branch-3.2? > Add support for yarn logs -logFile to retain backward compatibility > --- > > Key: YARN-10312 > URL: https://issues.apache.org/jira/browse/YARN-10312 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.10.0, 3.4.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: compatibility > Fix For: 3.3.1, 3.4.1 > > Attachments: YARN-10312.001.patch > > > The YARN CLI logs command line option {{-logFiles}} was changed to > {{-log_files}} in 2.9 and later releases. This change was made as part of > YARN-5363. > Verizon Media is in the process of moving from Hadoop-2.8 to Hadoop-2.10, and > while testing integration with Spark, we ran into this issue. We are > concerned that we will run into more cases of this as we roll out to > production, and rather than break user scripts, we'd prefer to add > {{-logFiles}} as an alias of {{-log_files}}. If both are provided, > {{-logFiles}} will be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133540#comment-17133540 ] Eric Badger commented on YARN-10300: Thanks, [~epayne] for the review/commit and [~Jim_Brennan] for the review! > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Fix For: 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5 > > Attachments: YARN-10300.001.patch, YARN-10300.002.patch, > YARN-10300.003.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128713#comment-17128713 ] Eric Badger commented on YARN-10300: [~epayne], thanks for the review! In patch 003 I've modified an additional existing test to better test the {{createAppSummary()}} code. The unit test fails without the code change and succeeds with it. > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch, > YARN-10300.003.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10300: --- Attachment: YARN-10300.003.patch > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch, > YARN-10300.003.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128686#comment-17128686 ] Eric Badger commented on YARN-9809: --- Attaching patch 002 to address unit test failures > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.002.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch, YARN-9809.002.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127025#comment-17127025 ] Eric Badger commented on YARN-9809: --- Patch 001 adds the feature but makes it opt-in via the config {{yarn.nodemanager.health-checker.run-before-startup}}. I didn't put in the retries flag for shutting down the NM if there are a certain number of failures. I can do that in a subsequent patch if you'd like. But I tested this patch out and it seems to work. > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
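A minimal sketch of opting in to this behavior. The property name is taken from the comment above; the default value and the way it is read here are assumptions for illustration.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class RunHealthCheckBeforeStartupSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Opt in to running the health checks before the NM registers with the RM.
    conf.setBoolean("yarn.nodemanager.health-checker.run-before-startup", true);

    // Assumed default of false, preserving the current register-then-check behavior.
    boolean runFirst =
        conf.getBoolean("yarn.nodemanager.health-checker.run-before-startup", false);
    // When true, the RM would see an accurate health status in the
    // registration request instead of waiting for the first heartbeat.
    System.out.println("run health check before registering: " + runFirst);
  }
}
{code}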
[jira] [Updated] (YARN-9809) NMs should supply a health status when registering with RM
[ https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9809: -- Attachment: YARN-9809.001.patch > NMs should supply a health status when registering with RM > -- > > Key: YARN-9809 > URL: https://issues.apache.org/jira/browse/YARN-9809 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9809.001.patch > > > Currently if the NM registers with the RM and it is unhealthy, it can be > scheduled many containers before the first heartbeat. After the first > heartbeat, the RM will mark the NM as unhealthy and kill all of the > containers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125139#comment-17125139 ] Eric Badger edited comment on YARN-10300 at 6/3/20, 5:09 PM: - Thanks for the review, [~Jim_Brennan]! [~epayne], would you mind looking at this patch as well to give a binding review? was (Author: ebadger): [~epayne], would you mind looking at this patch as well to give a binding review? > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125139#comment-17125139 ] Eric Badger commented on YARN-10300: [~epayne], would you mind looking at this patch as well to give a binding review? > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124461#comment-17124461 ] Eric Badger commented on YARN-10300: Added some null checks to patch 002 > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-10300: --- Attachment: YARN-10300.002.patch > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch, YARN-10300.002.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124450#comment-17124450 ] Eric Badger commented on YARN-10300: bq. Eric Badger can you explain under what circumstances attempt.getMasterContainer().getNodeId().getHost() will succeed where attempt.getHost() fails? {{attempt.getHost()}} grabs {{host}} within RMAppAttemptImpl.java. This is set during {{AMRegisteredTransition}}, so it will be "N/A" until the AM registers with the RM during the first heartbeat. {{attempt.getMasterContainer()}} grabs {{masterContainer}} within RMAppAttemptImpl.java. This is set during {{AMContainerAllocatedTransition}}. So from the time the container is allocated until the first AM heartbeat, {{masterContainer}} will be set, but {{host}} will be "N/A". bq. Also, do we need to add some null checks for these? getMasterContainer(), getNodeId(), getHost()? Probably wouldn't hurt. bq. Note that the attempt.getHost() defaults to "N/A" before it is set - what do we get if NodeID().getHost() isn't valid yet? Is that even a possibility? I don't know whether it's possible for it to be invalid at the start. The {{Container}} is going to be created via {{newInstance}}, which requires a {{NodeId}} as a parameter. But that could potentially be sent in as null. I think it will either be the correct nodeId or will be null, which I can interpret as "N/A". There are so many places where containers are instantiated in the scheduler that it'd be pretty tough to verify that all of them have the nodeId set initially. I can add in the null checks and default the string to "N/A" if any of them don't exist. > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
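A hypothetical sketch of the null-check chain being discussed, falling back to "N/A" whenever any link is missing; the interfaces are stand-ins that mirror the method names mentioned above, not the actual RM classes or the patch itself.
{code:java}
final class AppSummaryHostSketch {
  interface NodeIdLike { String getHost(); }
  interface ContainerLike { NodeIdLike getNodeId(); }
  interface AttemptLike { ContainerLike getMasterContainer(); }

  // Resolve the AM host for the app summary, defaulting to "N/A" if the
  // master container, its node id, or the host has not been set yet.
  static String resolveAmHost(AttemptLike attempt) {
    if (attempt == null) {
      return "N/A";
    }
    ContainerLike master = attempt.getMasterContainer();
    if (master == null || master.getNodeId() == null
        || master.getNodeId().getHost() == null) {
      return "N/A";
    }
    return master.getNodeId().getHost();
  }
}
{code}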
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124328#comment-17124328 ] Eric Badger commented on YARN-10300: The unit tests pass for me locally. [~epayne], [~Jim_Brennan], could you take a look at this patch? > appMasterHost not set in RM ApplicationSummary when AM fails before first > heartbeat > --- > > Key: YARN-10300 > URL: https://issues.apache.org/jira/browse/YARN-10300 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-10300.001.patch > > > {noformat} > 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: > appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https > > ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE > {noformat} > {{appMasterHost=N/A}} should have the AM hostname instead of N/A -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org