[jira] [Commented] (YARN-4845) Upgrade fields of o.a.h.y.api.records.ResourceUtilization from int32 to int64
[ https://issues.apache.org/jira/browse/YARN-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203793#comment-15203793 ] Wangda Tan commented on YARN-4845: -- From the Java doc: {code} ResourceUtilization models the utilization of a set of computer resources in the cluster. {code} It could be used to track a group of nodes; if so, we may need to mark this as a blocker for the 2.8.0 release. > Upgrade fields of o.a.h.y.api.records.ResourceUtilization from int32 to int64 > - > > Key: YARN-4845 > URL: https://issues.apache.org/jira/browse/YARN-4845 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api >Reporter: Wangda Tan > > Similar to YARN-4844, if the ResourceUtilization could track all nodes' > resources instead of a single node, we need to make sure the fields are > long instead of int. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4845) Upgrade fields of o.a.h.y.api.records.ResourceUtilization from int32 to int64
Wangda Tan created YARN-4845: Summary: Upgrade fields of o.a.h.y.api.records.ResourceUtilization from int32 to int64 Key: YARN-4845 URL: https://issues.apache.org/jira/browse/YARN-4845 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Similar to YARN-4844, if the ResourceUtilization could track all nodes' resources instead of a single node, we need to make sure the fields are long instead of int. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
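To make the int32 concern concrete: below is a minimal sketch (not actual RM code; it assumes only the public {{ResourceUtilization}} getters) of aggregating per-node utilization into a cluster-wide total. With int getters, the accumulator must already be a long - 10k nodes at 210G (215,040 MB) each sum to 2,150,400,000 MB, past Integer.MAX_VALUE.
{code}
import org.apache.hadoop.yarn.api.records.ResourceUtilization;

// Minimal sketch: summing per-node utilization into a cluster-wide total.
// ResourceUtilization#getPhysicalMemory() returns an int (MB), so a long
// accumulator is needed once the cluster-wide sum passes Integer.MAX_VALUE.
public class ClusterUtilizationSketch {
  public static long totalPhysicalMemoryMb(Iterable<ResourceUtilization> nodes) {
    long totalMb = 0L;
    for (ResourceUtilization node : nodes) {
      totalMb += node.getPhysicalMemory();
    }
    return totalMb;
  }
}
{code}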
[jira] [Created] (YARN-4844) Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64
Wangda Tan created YARN-4844: Summary: Upgrade fields of o.a.h.y.api.records.Resource from int32 to int64 Key: YARN-4844 URL: https://issues.apache.org/jira/browse/YARN-4844 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Wangda Tan Priority: Critical We use int32 for memory now; if a cluster has 10k nodes with 210G memory each, we will get a negative total cluster memory. Another case that overflows int32 even more easily: we add all pending resources of running apps to the cluster's total pending resources. If a problematic app requests too many resources (let's say 1M+ containers, each requesting 3G of memory), int32 will not be enough. Even if we can cap each app's pending request, we cannot handle the case where there are many running apps, each with capped but still significant pending resources. So we may need to upgrade the int32 memory field (and possibly v-cores as well) to int64 to avoid integer overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
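The arithmetic in the description checks out; a self-contained demo of the overflow (plain Java, no YARN dependencies):
{code}
public class Int32MemoryOverflowDemo {
  public static void main(String[] args) {
    int nodes = 10000;
    int memoryPerNodeMb = 210 * 1024;          // 210G per node, tracked in MB
    int totalInt32 = nodes * memoryPerNodeMb;  // 2,150,400,000 wraps past Integer.MAX_VALUE
    long totalInt64 = (long) nodes * memoryPerNodeMb;
    System.out.println("int32 cluster memory: " + totalInt32); // -2144567296
    System.out.println("int64 cluster memory: " + totalInt64); // 2150400000
  }
}
{code}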
[jira] [Created] (YARN-4843) [Umbrella] Revisit YARN ProtocolBuffer int32 usages that need to upgrade to int64
Wangda Tan created YARN-4843: Summary: [Umbrella] Revisit YARN ProtocolBuffer int32 usages that need to upgrade to int64 Key: YARN-4843 URL: https://issues.apache.org/jira/browse/YARN-4843 Project: Hadoop YARN Issue Type: Bug Components: api Reporter: Wangda Tan This JIRA is to track all int32 usages in YARN's ProtocolBuffer APIs that we may need to update to int64. One example is the resource API: we use int32 for memory now; if a cluster has 10k nodes with 210G memory each, we will get a negative total cluster memory. There may be other fields that need to upgrade from int32 to int64. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203759#comment-15203759 ] Yi Zhou commented on YARN-796: -- Hi [~Naganarasimha], once you have filed the jira for the 2.6 doc, please kindly post the ID number for me to track and reference. Thanks a lot! > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, > Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.10.patch, > YARN-796.node-label.consolidate.11.patch, > YARN-796.node-label.consolidate.12.patch, > YARN-796.node-label.consolidate.13.patch, > YARN-796.node-label.consolidate.14.patch, > YARN-796.node-label.consolidate.2.patch, > YARN-796.node-label.consolidate.3.patch, > YARN-796.node-label.consolidate.4.patch, > YARN-796.node-label.consolidate.5.patch, > YARN-796.node-label.consolidate.6.patch, > YARN-796.node-label.consolidate.7.patch, > YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203757#comment-15203757 ] Yi Zhou commented on YARN-796: -- Appreciate [~Naganarasimha] [~wangda] for your great help! > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, > Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.10.patch, > YARN-796.node-label.consolidate.11.patch, > YARN-796.node-label.consolidate.12.patch, > YARN-796.node-label.consolidate.13.patch, > YARN-796.node-label.consolidate.14.patch, > YARN-796.node-label.consolidate.2.patch, > YARN-796.node-label.consolidate.3.patch, > YARN-796.node-label.consolidate.4.patch, > YARN-796.node-label.consolidate.5.patch, > YARN-796.node-label.consolidate.6.patch, > YARN-796.node-label.consolidate.7.patch, > YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4842) yarn logs command should not require the appOwner argument
[ https://issues.apache.org/jira/browse/YARN-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Venkatesh updated YARN-4842: Attachment: YARN-4842.1.patch > yarn logs command should not require the appOwner argument > -- > > Key: YARN-4842 > URL: https://issues.apache.org/jira/browse/YARN-4842 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ram Venkatesh >Assignee: Ram Venkatesh > Attachments: YARN-4842.1.patch > > > The yarn logs command is among the most common ways to troubleshoot yarn app > failures, especially by an admin. > Currently if you run the command as a user different from the job owner, the > command will fail with a subtle message that it could not find the app under > the running user's name. This can be confusing especially to new admins. > We can figure out the job owner from the app report returned by the RM or the > AHS, or, by looking for the app directory using a glob pattern, so in most > cases this error can be avoided. > Question - are there scenarios where users will still need to specify the > -appOwner option? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-4842) yarn logs command should not require the appOwner argument
[ https://issues.apache.org/jira/browse/YARN-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Venkatesh reassigned YARN-4842: --- Assignee: Ram Venkatesh > yarn logs command should not require the appOwner argument > -- > > Key: YARN-4842 > URL: https://issues.apache.org/jira/browse/YARN-4842 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ram Venkatesh >Assignee: Ram Venkatesh > > The yarn logs command is among the most common ways to troubleshoot yarn app > failures, especially by an admin. > Currently if you run the command as a user different from the job owner, the > command will fail with a subtle message that it could not find the app under > the running user's name. This can be confusing especially to new admins. > We can figure out the job owner from the app report returned by the RM or the > AHS, or, by looking for the app directory using a glob pattern, so in most > cases this error can be avoided. > Question - are there scenarios where users will still need to specify the > -appOwner option? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4842) yarn logs command should not require the appOwner argument
Ram Venkatesh created YARN-4842: --- Summary: yarn logs command should not require the appOwner argument Key: YARN-4842 URL: https://issues.apache.org/jira/browse/YARN-4842 Project: Hadoop YARN Issue Type: Bug Reporter: Ram Venkatesh The yarn logs command is among the most common ways to troubleshoot yarn app failures, especially by an admin. Currently if you run the command as a user different from the job owner, the command will fail with a subtle message that it could not find the app under the running user's name. This can be confusing especially to new admins. We can figure out the job owner from the app report returned by the RM or the AHS, or, by looking for the app directory using a glob pattern, so in most cases this error can be avoided. Question - are there scenarios where users will still need to specify the -appOwner option? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
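For reference, one way the owner could be resolved automatically (a sketch of the idea in the description, not the attached patch) is from the application report returned by the RM:
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: resolve the app owner from the RM so -appOwner can be defaulted.
public class AppOwnerResolver {
  public static String resolveOwner(ApplicationId appId) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      ApplicationReport report = client.getApplicationReport(appId);
      return report.getUser(); // the job owner as recorded by the RM
    } finally {
      client.stop();
    }
  }
}
{code}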
[jira] [Commented] (YARN-3933) Race condition when calling AbstractYarnScheduler.completedContainer.
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203746#comment-15203746 ] Sunil G commented on YARN-3933: --- As per the existing patch, the new liveContainers check is done before the code below from {{FS#completedContainerInternal}}. Please correct me if I am wrong w.r.t FS: {{containerCompleted}} needs to be processed for those containers which are RESERVED too, so with the current patch this scenario may not be hit.
{code}
if (rmContainer.getState() == RMContainerState.RESERVED) {
  application.unreserve(rmContainer.getReservedPriority(), node);
} else {
{code}
> Race condition when calling AbstractYarnScheduler.completedContainer. > - > > Key: YARN-3933 > URL: https://issues.apache.org/jira/browse/YARN-3933 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0, 2.7.0, 2.5.2, 2.7.1 >Reporter: Lavkesh Lahngir >Assignee: Shiwei Guo > Attachments: YARN-3933.001.patch, YARN-3933.002.patch, > YARN-3933.003.patch > > > In our cluster we are seeing available memory and cores being negative. > Initial inspection: > Scenario no. 1: > In capacity scheduler the method allocateContainersToNode() checks if > there are excess reservation of containers for an application, and they are > no longer needed then it calls queue.completedContainer() which causes > resources being negative. And they were never assigned in the first place. > I am still looking through the code. Can somebody suggest how to simulate > excess containers assignments ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203740#comment-15203740 ] Naganarasimha G R commented on YARN-796: Ok, actually I meant the same ... create a document for 2.6.x so that we can ask people to refer to it (also, many times even I forget while testing the RC cuts). I will raise a jira for the same. > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, > Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.10.patch, > YARN-796.node-label.consolidate.11.patch, > YARN-796.node-label.consolidate.12.patch, > YARN-796.node-label.consolidate.13.patch, > YARN-796.node-label.consolidate.14.patch, > YARN-796.node-label.consolidate.2.patch, > YARN-796.node-label.consolidate.3.patch, > YARN-796.node-label.consolidate.4.patch, > YARN-796.node-label.consolidate.5.patch, > YARN-796.node-label.consolidate.6.patch, > YARN-796.node-label.consolidate.7.patch, > YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203738#comment-15203738 ] Wangda Tan commented on YARN-796: - [~Naganarasimha], since we could possibly update node label features in the future, instead of indicating what is available in each release, I think we should add a node label doc for the 2.6.x release (we only have docs for the 2.7+ releases) which includes only the supported features. Thoughts? > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, > Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.10.patch, > YARN-796.node-label.consolidate.11.patch, > YARN-796.node-label.consolidate.12.patch, > YARN-796.node-label.consolidate.13.patch, > YARN-796.node-label.consolidate.14.patch, > YARN-796.node-label.consolidate.2.patch, > YARN-796.node-label.consolidate.3.patch, > YARN-796.node-label.consolidate.4.patch, > YARN-796.node-label.consolidate.5.patch, > YARN-796.node-label.consolidate.6.patch, > YARN-796.node-label.consolidate.7.patch, > YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203721#comment-15203721 ] Naganarasimha G R commented on YARN-796: Yes [~jameszhouyi], in 2.6.0 this command is not yet supported; the documentation which is available is for 2.7.2, and a lot more fixes and features are expected to come in 2.8.0. If you are planning to experiment with this feature then 2.7.2 is fine, but to use it in production I would suggest waiting for 2.8.0. [~wangda], is it required to document what is available as part of 2.6.x? > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, > Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.10.patch, > YARN-796.node-label.consolidate.11.patch, > YARN-796.node-label.consolidate.12.patch, > YARN-796.node-label.consolidate.13.patch, > YARN-796.node-label.consolidate.14.patch, > YARN-796.node-label.consolidate.2.patch, > YARN-796.node-label.consolidate.3.patch, > YARN-796.node-label.consolidate.4.patch, > YARN-796.node-label.consolidate.5.patch, > YARN-796.node-label.consolidate.6.patch, > YARN-796.node-label.consolidate.7.patch, > YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3933) Race condition when calling AbstractYarnScheduler.completedContainer.
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203723#comment-15203723 ] Wangda Tan commented on YARN-3933: -- Looked at this issue, it seems only FairScheduler has this issue. CS already checks this inside FiCaSchedulerApp. Instead of adding this separately in CS/FS, I think we can create a common {{completedContainer}} method to SchedulerApplicationAttempt. And checks liveContainers map inside the common method. Thoughts? [~kasha]/[~guoshiwei] > Race condition when calling AbstractYarnScheduler.completedContainer. > - > > Key: YARN-3933 > URL: https://issues.apache.org/jira/browse/YARN-3933 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0, 2.7.0, 2.5.2, 2.7.1 >Reporter: Lavkesh Lahngir >Assignee: Shiwei Guo > Attachments: YARN-3933.001.patch, YARN-3933.002.patch, > YARN-3933.003.patch > > > In our cluster we are seeing available memory and cores being negative. > Initial inspection: > Scenario no. 1: > In capacity scheduler the method allocateContainersToNode() checks if > there are excess reservation of containers for an application, and they are > no longer needed then it calls queue.completedContainer() which causes > resources being negative. And they were never assigned in the first place. > I am still looking through the code. Can somebody suggest how to simulate > excess containers assignments ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
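A rough shape of the suggested common guard (class, method name, and signature here are illustrative assumptions, not a committed design - the real liveContainers map lives in SchedulerApplicationAttempt):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;

// Illustrative sketch of a shared completion guard: remove-and-test on the
// liveContainers map makes completion idempotent under racing callers, so
// neither CS nor FS can account for the same container twice.
public class CompletedContainerGuard {
  private final Map<ContainerId, RMContainer> liveContainers =
      new ConcurrentHashMap<ContainerId, RMContainer>();

  /** @return true only for the single caller that should do the accounting. */
  public boolean markCompleted(RMContainer rmContainer) {
    return liveContainers.remove(rmContainer.getContainerId()) != null;
  }
}
{code}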
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203684#comment-15203684 ] Yi Zhou commented on YARN-796: -- Hi [~Naganarasimha], it seems the below command is still not supported in 2.6.0? sudo -u yarn yarn cluster --list-node-labels Error: Could not find or load main class cluster > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, > Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.10.patch, > YARN-796.node-label.consolidate.11.patch, > YARN-796.node-label.consolidate.12.patch, > YARN-796.node-label.consolidate.13.patch, > YARN-796.node-label.consolidate.14.patch, > YARN-796.node-label.consolidate.2.patch, > YARN-796.node-label.consolidate.3.patch, > YARN-796.node-label.consolidate.4.patch, > YARN-796.node-label.consolidate.5.patch, > YARN-796.node-label.consolidate.6.patch, > YARN-796.node-label.consolidate.7.patch, > YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4841) UT/FT tests for RM web UI pages
Rohith Sharma K S created YARN-4841: --- Summary: UT/FT tests for RM web UI pages Key: YARN-4841 URL: https://issues.apache.org/jira/browse/YARN-4841 Project: Hadoop YARN Issue Type: Improvement Components: webapp Reporter: Rohith Sharma K S The RM webapp UI does not have UT/FT test cases, not even basic validation. Everything depends on actual cluster deployment results. There should be UT/FT tests validating the RM webapp pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
[ https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-4002: -- Attachment: YARN-4002-rwlock-v2.patch Uploaded YARN-4002-rwlock-v2.patch for an improvement: make the read-side critical section smaller.
{code}
this.hostsReadLock.lock();
try {
  hostsList = hostsReader.getHosts();
  excludeList = hostsReader.getExcludedHosts();
} finally {
  this.hostsReadLock.unlock();
}
{code}
As explained by [~rohithsharma], this prevents mixing up the old value of hostsReader.getHosts() with the new value of hostsReader.getExcludedHosts(). This is the only reason someone may prefer the rwlock solution over the lockless one. If the mixing is not considered a problem (by myself, for example), the lockless solution is good enough. > make ResourceTrackerService.nodeHeartbeat more concurrent > - > > Key: YARN-4002 > URL: https://issues.apache.org/jira/browse/YARN-4002 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Critical > Attachments: 0001-YARN-4002.patch, YARN-4002-lockless-read.patch, > YARN-4002-rwlock-v2.patch, YARN-4002-rwlock.patch, YARN-4002-v0.patch > > > We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By > design the method ResourceTrackerService.nodeHeartbeat should be concurrent > enough to scale for large clusters. > But we have a "BIG" lock in NodesListManager.isValidNode which I think is > unnecessary. > First, the fields "includes" and "excludes" of HostsFileReader are only > updated on "refresh nodes". All RPC threads handling node heartbeats are > only readers. So a RWLock could be used to allow concurrent access by RPC > threads. > Second, since the fields "includes" and "excludes" of HostsFileReader are > always updated by "reference assignment", which is atomic in Java, the reader > side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
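For comparison, the lockless alternative from the description can also avoid the mixing problem if both sets are published behind a single volatile reference; a sketch (class and field names are assumptions, and the empty-include-list-means-allow-all semantics mirrors the usual HostsFileReader behavior):
{code}
import java.util.Collections;
import java.util.Set;

// Sketch of a lockless reader: one volatile load of an immutable snapshot
// yields a consistent (hosts, excludes) pair, so old and new values cannot
// be mixed even without a read lock.
class HostsSnapshotHolder {
  private static final class Snapshot {
    final Set<String> hosts;
    final Set<String> excludes;
    Snapshot(Set<String> hosts, Set<String> excludes) {
      this.hosts = hosts;
      this.excludes = excludes;
    }
  }

  private volatile Snapshot snapshot = new Snapshot(
      Collections.<String>emptySet(), Collections.<String>emptySet());

  // Writer (refreshNodes): swap the whole snapshot in one atomic store.
  void refresh(Set<String> newHosts, Set<String> newExcludes) {
    snapshot = new Snapshot(newHosts, newExcludes);
  }

  // Reader (heartbeat path): one volatile read, no lock taken.
  boolean isValidNode(String host) {
    Snapshot s = snapshot;
    return (s.hosts.isEmpty() || s.hosts.contains(host))
        && !s.excludes.contains(host);
  }
}
{code}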
[jira] [Commented] (YARN-4607) AppAttempt page TotalOutstandingResource Requests table support pagination
[ https://issues.apache.org/jira/browse/YARN-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203635#comment-15203635 ] Rohith Sharma K S commented on YARN-4607: - Overall the patch looks good; can you add screenshots of the container request table before and after? > AppAttempt page TotalOutstandingResource Requests table support pagination > -- > > Key: YARN-4607 > URL: https://issues.apache.org/jira/browse/YARN-4607 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-4607.patch > > > Simulate a cluster with 10 racks of 100 nodes using SLS; if we check the > table, the Total Outstanding Resource Requests will consume the complete page. > It would be good to support pagination for the table -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4808) SchedulerNode can use a few more cosmetic changes
[ https://issues.apache.org/jira/browse/YARN-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203616#comment-15203616 ] Hadoop QA commented on YARN-4808: -
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 57s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 267 unchanged - 1 fixed = 268 total (was 268) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 39s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 162m 45s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanag
[jira] [Commented] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
[ https://issues.apache.org/jira/browse/YARN-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203611#comment-15203611 ] Rohith Sharma K S commented on YARN-4002: - Thanks [~leftnoteasy] for looking at the patch. I thought about adding the read lock in these 2 places, but after looking into the callers of these 2 methods I felt it is not really required. # Method {{setDecomissionedNMsMetrics}} is only called during service initialization. # Method {{printConfiguredHosts}} is called during service init and refreshNodes. ## Once again, for service init, I do not think we really need to acquire the read lock. ## For refreshNodes, {{printConfiguredHosts}} is within the write lock, so it is safe enough to go without the read lock. As of now, not acquiring the read lock would not cause any problem. In the future, any new method calling these methods needs to consider acquiring the read lock. > make ResourceTrackerService.nodeHeartbeat more concurrent > - > > Key: YARN-4002 > URL: https://issues.apache.org/jira/browse/YARN-4002 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Hong Zhiguo >Assignee: Hong Zhiguo >Priority: Critical > Attachments: 0001-YARN-4002.patch, YARN-4002-lockless-read.patch, > YARN-4002-rwlock.patch, YARN-4002-v0.patch > > > We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By > design the method ResourceTrackerService.nodeHeartbeat should be concurrent > enough to scale for large clusters. > But we have a "BIG" lock in NodesListManager.isValidNode which I think is > unnecessary. > First, the fields "includes" and "excludes" of HostsFileReader are only > updated on "refresh nodes". All RPC threads handling node heartbeats are > only readers. So a RWLock could be used to allow concurrent access by RPC > threads. > Second, since the fields "includes" and "excludes" of HostsFileReader are > always updated by "reference assignment", which is atomic in Java, the reader > side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4699) Scheduler UI and REST o/p is not in sync when -replaceLabelsOnNode is used to change label of a node
[ https://issues.apache.org/jira/browse/YARN-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203557#comment-15203557 ] Hadoop QA commented on YARN-4699: -
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 41s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 86m 49s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 58s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 177m 21s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.8.0_74 Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.TestRMHA |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:fbe3e86 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12794426/0002-YARN-4699.patch |
| JIRA Issue | YARN-469
[jira] [Updated] (YARN-4808) SchedulerNode can use a few more cosmetic changes
[ https://issues.apache.org/jira/browse/YARN-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4808: --- Attachment: yarn-4808-2.patch Rebased patch and made a few more cosmetic changes. > SchedulerNode can use a few more cosmetic changes > - > > Key: YARN-4808 > URL: https://issues.apache.org/jira/browse/YARN-4808 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.8.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4808-1.patch, yarn-4808-2.patch > > > We have made some cosmetic changes to SchedulerNode recently. While working > on YARN-4511, realized we could improve it a little more: > # Remove volatile variables - don't see the need for them being volatile > # Some methods end up doing very similar things, so consolidating them > # Renaming totalResource to capacity. YARN-4511 plans to add inflatedCapacity > to include the un-utilized resources, and having two totals can be a little > confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing
[ https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203522#comment-15203522 ] Vinod Kumar Vavilapalli commented on YARN-4837: --- bq. "DISKS_FAILED" shouldn't be skipped for the reason I mentioned in YARN-4576. Also, we cannot simply judge system innocent when hitting memory issues. As [~vvasudev] pointed out [here on YARN-4576|https://issues.apache.org/jira/browse/YARN-4576?focusedCommentId=15202664&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15202664], the right solution is to have the RM detect bouncing nodes and then not to allocate new containers to bouncing nodes until they stabilize. bq. Also, hide all AM scheduling info/preference from application doesn't make sense in long time: AM can ask for resources for its running containers in the beginning, but application cannot ask how to place its AM even today which is sad to me. My earlier comment came out a little inaccurate when I said about "hiding AM-container-scheduling inside the RM". What I really meant is that any automatic scheduling decision coming out of system failures/events should be hidden from end-users - just like preemption-handling! We already have ResourceRequest as part of AM-launch-context. No reason why we cannot have more such things. However, this is different from RM automatically ruling out nodes as was done at YARN-2005 and related JIRAs. bq. YARN-4685 is something fixable and much better than the age without blacklist (we do see AM keep launching on bad nodes repeatedly and get stuck in many cases). We just need to go ahead to fix YARN-4685. YARN-4685 happened because of an inappropriate solution to a real problem - we should pause going down this route till we figure out the right solution. > User facing aspects of 'AM blacklisting' feature need fixing > > > Key: YARN-4837 > URL: https://issues.apache.org/jira/browse/YARN-4837 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > Was reviewing the user-facing aspects that we are releasing as part of 2.8.0. > Looking at the 'AM blacklisting feature', I see several things to be fixed > before we release it in 2.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4576) Enhancement for tracking Blacklist in AM Launching
[ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203519#comment-15203519 ] Vinod Kumar Vavilapalli commented on YARN-4576: --- bq. To the more general point of nodes switching back and forth from good to bad and back - the better solution would be to have the RM detect bouncing nodes and then not to allocate new containers to bouncing nodes until they stabilize. [~djp], as [~vvasudev] says, this is a much better solution (dare I say the right solution) to deal with flip-flopping nodes and scheduling of all containers instead of just for AMs as YARN-2005 did. bq. 3) App should have their own choices to setup preferred nodes, hosts etc. [~leftnoteasy] / [~djp], I am not arguing against this at all. In fact, we already have ResourceRequest as part of AM-launch-context. No reason why we cannot have more such things. However, this is different from RM automatically ruling out nodes as was done at YARN-2005 and related JIRAs. bq. To address this case, container exit code is a better candidate, but I agree that it itself is not covering 100% cases or not pointing to exact failure. [~sunilg], I agree with this too - that was Sangjin's point as well. Despite explicit handling of known system problems, there will likely be cases that we will only be slowly handling over time. My only concern was exposing this as part of the API already before we learn how this can be used in practice - that was my proposal too at https://issues.apache.org/jira/browse/YARN-4837?focusedCommentId=15201895&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15201895. > Enhancement for tracking Blacklist in AM Launching > -- > > Key: YARN-4576 > URL: https://issues.apache.org/jira/browse/YARN-4576 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: EnhancementAMLaunchingBlacklist.pdf > > > Before YARN-2005, YARN's blacklist mechanism was to track bad nodes by AM: > if the AM's attempts to launch containers on a specific node failed several > times, the AM would blacklist this node in future resource asking. This mechanism > works fine for normal containers. However, from our observation of the behavior > of several clusters: if launching the AM on this problematic node fails, the RM could > pick up this problematic node to launch the next AM attempts again and again, which > causes application failure in case other functional nodes are busy. In the normal > case, the customized health checker script cannot be so sensitive as to mark a > node unhealthy when one or two container launches fail. > After YARN-2005, we can have a BlacklistManager in each RMApp, so those nodes > where AM attempts for a specific application failed before will get > blacklisted. To get rid of the potential risk of all nodes being blacklisted > by the BlacklistManager, a disable-failure-threshold is involved to stop adding > more nodes into the blacklist once a certain ratio is hit. > There are already some enhancements to this AM blacklist mechanism: > YARN-4284 addresses the wider case of AM container launch > failure, and YARN-4389 tries to make configuration settings available for > change by the app to meet app-specific requirements. However, there are still > several gaps to address more scenarios: > 1. We may need a global blacklist instead of each app maintaining a separate > one.
The reason is: an AM has a higher chance of failing if other AMs failed there > before. A quick example: in a busy cluster, all nodes are busy except two > problematic nodes, node a and node b; app1 was already submitted and failed in > two AM attempts on a and b. app2 and other apps should wait for the other busy > nodes rather than waste attempts on these two problematic nodes. > 2. If AM container failure is recognized as a global event instead of an app's own > issue, we should consider that the blacklist is not a permanent thing but has > a specific time window. > 3. We could have user-defined blacklist policies to address more possible > cases and scenarios, so it is reasonable to make the blacklist policy pluggable. > 4. For some test scenarios, we could have a whitelist mechanism for AM launching. > 5. Some minor issues: it sounds like NM reconnect won't refresh the blacklist so > far. > Will try to address all issues here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1040) De-link container life cycle from an Allocation
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203496#comment-15203496 ] Vinod Kumar Vavilapalli commented on YARN-1040: --- Thanks for the document, [~asuresh]! [~bikassaha] - Our comments crossed on Feb 25, so I didn't see yours. - Looking at the doc, I can see why it gives the impression of a redesign, but it is less of a redesign, and more of adding new functionality that needs new semantics. - Clearly the new naming makes it look like a lot of new changes for the apps, but that is the reality (for apps that want to use this new feature)! - We do make most of our decisions on JIRA. We can continue the discussion here. If need be, sure, we can send out a note on the dev lists. So, with that out of the way, let's step back and look at the semantics first and foremost and keep out the discussions about renames and the expected level of changes for later. h4. APIs There are big differences between the two proposals w.r.t the APIs. Even though it looks like your proposal earlier assumes that this can be made a localized change in the NM side APIs, there are newer semantics that mandate new (and/or modified) APIs on both AM-NM and RM-AM interactions. A couple of them that come to my mind - *Allocation/container release*: We need two separate mechanisms from AM to RM for (a) releasing allocations wholesale (and thereby killing all running containers inside) and (b) killing one or more containers running inside an allocation *directly* at the RM - this is an existing feature - because the app either doesn't want to open N connections to N nodes in the cluster, or simply because the NM is not accessible anymore/in-the-interim. - *Allocation/container exit notifications*: The AMs will further be interested in two separate back-notifications from the RM (a) is the allocation itself released completely by the platform - say due to preemption? (b) or has one of the containers running inside the allocation exited and so I have to act on it? Remember that this is simply a disambiguation of our existing container-exit notification mechanism. h4. Internals Internally inside the RM too, the state-machine of the allocation itself is different from the containers' life-cycle. For example, the containers' life-cycle determines the completion notifications that we send across to the AMs, and only the allocation life-cycle impacts scheduling. h4. Compatibility for existing apps What is proposed in the doc, as well as the way I originally described it, is definitely backwards compatible. Existing applications do not need a single line of change. Only newer versions of applications that desire to use the new feature have to use newer APIs - something that is not different from any other core YARN feature at all. h4. Changes for apps that want to use the new feature Even in your proposal, an app/framework that desires to use the new feature has to make non-trivial changes in the AM to use this feature correctly - generating containerIDs - managing the list of containers running inside an allocation - managing the outstanding unused portion of an allocation, and incrementally launching more and more containers till the allocation is full - Containers running under non-reusable allocations do not need an explicit signal to the RM for clean up - apps can simply stop the container on the NM and everything else gets automatically taken care of.
Apps that start using the new feature, on the other hand, will *have* to also explicitly release allocations outside of the life-cycle of the containers. - We can optionally add auxiliary flags to inform NMs to auto-reap the allocation when the last container dies - only for apps that are okay with this - but either way the apps need changes to do this as they intend it. - Apps will also have to react differently to container-exit notifications and allocation-released/preempted notifications. Given the points above, I don't think we can get away with just an NM-side API change. Depending on how much we have to change the APIs, I am willing to go either way on the degree of renames in the API surface area. Inside the code base though, I think we are better off calling things what they are. > De-link container life cycle from an Allocation > --- > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > Attachments: YARN-1040-rough-design.pdf > > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a pr
[jira] [Commented] (YARN-4609) RM Nodes list page takes too much time to load
[ https://issues.apache.org/jira/browse/YARN-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203477#comment-15203477 ] Hadoop QA commented on YARN-4609: -
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 46s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 72m 5s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 73m 48s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 162m 24s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPreemption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:fbe3e86 |
[jira] [Updated] (YARN-4699) Scheduler UI and REST o/p is not in sync when -replaceLabelsOnNode is used to change label of a node
[ https://issues.apache.org/jira/browse/YARN-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4699: -- Attachment: 0002-YARN-4699.patch Attaching a new patch with a test case. I had to add a sleep because event processing was delayed. I will also see whether I can find a better wait mechanism. > Scheduler UI and REST o/p is not in sync when -replaceLabelsOnNode is used to > change label of a node > > > Key: YARN-4699 > URL: https://issues.apache.org/jira/browse/YARN-4699 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.2 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4699.patch, 0002-YARN-4699.patch, > AfterAppFInish-LabelY-Metrics.png, ForLabelX-AfterSwitch.png, > ForLabelY-AfterSwitch.png > > > The scenario is as follows: > a. 2 nodes are available in the cluster (node1 with label "x", node2 with > label "y") > b. Submit an application to node1 for label "x". > c. Change node1's label to "y" by using the *replaceLabelsOnNode* command. > d. Verify the Scheduler UI for metrics such as "Used Capacity", "Absolute > Capacity" etc. "x" still shows some capacity. > e. Change node1's label back to "x" and verify the UI and REST o/p > Output: > 1. "Used Capacity", "Absolute Capacity" etc. are not decremented once the label > is changed for a node. > 2. The UI tab for the respective label shows a wrong GREEN color in these cases. > 3. The REST o/p is wrong for each label after executing the above scenario. > Attaching screen shots also. This ticket will try to cover the UI and REST o/p > fix when the label is changed at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
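On the wait mechanism: a bounded poll, for example via {{GenericTestUtils.waitFor}}, is usually more robust than a fixed sleep. A sketch only - the polled condition below is a stand-in for whatever the test actually asserts after the label event is processed:
{code}
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;

// Sketch: poll until the label-update event has been processed instead of
// sleeping for a fixed interval.
public class LabelUpdateWait {
  public static void waitForLabelUpdate(final ResourceScheduler scheduler,
      final long expectedAvailableMB) throws Exception {
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return scheduler.getRootQueueMetrics().getAvailableMB()
            == expectedAvailableMB;
      }
    }, 100, 10000); // check every 100 ms, give up after 10 s
  }
}
{code}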
[jira] [Commented] (YARN-4809) De-duplicate container completion across schedulers
[ https://issues.apache.org/jira/browse/YARN-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203416#comment-15203416 ] Sunil G commented on YARN-4809: --- Yes. That's perfectly fine. Thank You. > De-duplicate container completion across schedulers > --- > > Key: YARN-4809 > URL: https://issues.apache.org/jira/browse/YARN-4809 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Karthik Kambatla >Assignee: Sunil G > Attachments: 0001-YARN-4809.patch > > > CapacityScheduler and FairScheduler implement containerCompleted the exact > same way. Duplication across the schedulers can be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4809) De-duplicate container completion across schedulers
[ https://issues.apache.org/jira/browse/YARN-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203407#comment-15203407 ] Karthik Kambatla commented on YARN-4809: [~sunilg] - we might want to wait until YARN-3933 gets in, so we can figure out the commonalities better. > De-duplicate container completion across schedulers > --- > > Key: YARN-4809 > URL: https://issues.apache.org/jira/browse/YARN-4809 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Karthik Kambatla >Assignee: Sunil G > Attachments: 0001-YARN-4809.patch > > > CapacityScheduler and FairScheduler implement containerCompleted the exact > same way. Duplication across the schedulers can be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3933) Race condition when calling AbstractYarnScheduler.completedContainer.
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203404#comment-15203404 ] Karthik Kambatla commented on YARN-3933: The fix here seems benign and should be okay to get in. Should we consider adding this check to SchedulerApplicationAttempt or FSAppAttempt so any other callers don't do any damage in the future? > Race condition when calling AbstractYarnScheduler.completedContainer. > - > > Key: YARN-3933 > URL: https://issues.apache.org/jira/browse/YARN-3933 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.6.0, 2.7.0, 2.5.2, 2.7.1 >Reporter: Lavkesh Lahngir >Assignee: Shiwei Guo > Attachments: YARN-3933.001.patch, YARN-3933.002.patch, > YARN-3933.003.patch > > > In our cluster we are seeing available memory and cores go negative. > Initial inspection: > Scenario no. 1: > In the capacity scheduler, the method allocateContainersToNode() checks whether there > are excess container reservations for an application; if they are no longer needed, it calls > queue.completedContainer(), which causes resources to go negative even though those > containers were never assigned in the first place. > I am still looking through the code. Can somebody suggest how to simulate > excess container assignments? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
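For discussion, the kind of guard being suggested could sit at the top of containerCompleted so every caller passes through it. This is a sketch only, not the attached patch; liveContainers is the existing map in SchedulerApplicationAttempt, and the boolean return is one possible way to tell callers to skip queue accounting.
{code}
// Sketch: make completion idempotent so a racing/duplicate call cannot
// release the same container's resources twice and drive the queue's
// available resources negative.
public synchronized boolean containerCompleted(RMContainer rmContainer,
    ContainerStatus containerStatus, RMContainerEventType event) {
  ContainerId containerId = rmContainer.getContainerId();
  // Guard: the container may already have been removed by a concurrent
  // completion (e.g. the excess-reservation path described above).
  if (!liveContainers.containsKey(containerId)) {
    LOG.info("Container " + containerId + " already completed, skipping");
    return false;  // caller must not update queue accounting again
  }
  liveContainers.remove(containerId);
  // ... existing bookkeeping (metrics, app resource usage) continues ...
  return true;
}
{code}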
[jira] [Commented] (YARN-4067) available resource could be set negative
[ https://issues.apache.org/jira/browse/YARN-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203401#comment-15203401 ] Karthik Kambatla commented on YARN-4067: IMO, available resource being negative is misleading. Even if we overcommit resources, it needs to be transparent to users. This is actually one of the goals of YARN-1011. With regards to YARN-291, I was under the impression that the primary motive of that work was to allow modifying the capacity of nodes dynamically. When the capacity is reduced on a fully allocated node, we should handle it more gracefully. In YARN-1011 parlance, we should demote these containers to being called opportunistic. Sometimes, this might not be possible/allowed and the capacity update should fail. We can discuss this more on YARN-291. > available resource could be set negative > > > Key: YARN-4067 > URL: https://issues.apache.org/jira/browse/YARN-4067 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4067.patch > > > as mentioned in YARN-4045 by [~leftnoteasy], available memory could be > negative due to reservation; the proposal is to use componentwiseMax in > updateQueueStatistics in order to cap negative values at zero -- This message was sent by Atlassian JIRA (v6.3.4#6332)
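The capping proposed in the description maps directly onto the existing Resources helper. A sketch follows; the variable names (clusterResource, usedResource, metrics) are illustrative, not the attached patch.
{code}
// Sketch: clamp each component of the available resource at zero
// instead of publishing a negative value produced by reservations.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

Resource available = Resources.subtract(clusterResource, usedResource);
// Componentwise max against the zero resource caps memory and vcores
// independently at >= 0.
Resource capped = Resources.componentwiseMax(available, Resources.none());
metrics.setAvailableResourcesToQueue(capped);
{code}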
[jira] [Commented] (YARN-4732) *ProcessTree classes have too many whitespace issues
[ https://issues.apache.org/jira/browse/YARN-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203396#comment-15203396 ] Hudson commented on YARN-4732: -- FAILURE: Integrated in Hadoop-trunk-Commit #9479 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9479/]) YARN-4732. *ProcessTree classes have too many whitespace issues (kasha: rev 7fae4c68e6d599d0c01bb2cb2c8d5e52925b3e1e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java > *ProcessTree classes have too many whitespace issues > > > Key: YARN-4732 > URL: https://issues.apache.org/jira/browse/YARN-4732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Karthik Kambatla >Assignee: Gabor Liptak >Priority: Trivial > Labels: newbie > Fix For: 2.9.0 > > Attachments: YARN-4732.1.patch > > > *ProcessTree classes have too many whitespace issues - extra newlines between > methods, spaces in empty lines etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4732) *ProcessTree classes have too many whitespace issues
[ https://issues.apache.org/jira/browse/YARN-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4732: --- Release Note: was:YARN-4732 Cleanup whitespace issues in *ProcessTree classes > *ProcessTree classes have too many whitespace issues > > > Key: YARN-4732 > URL: https://issues.apache.org/jira/browse/YARN-4732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Karthik Kambatla >Assignee: Gabor Liptak >Priority: Trivial > Labels: newbie > Attachments: YARN-4732.1.patch > > > *ProcessTree classes have too many whitespace issues - extra newlines between > methods, spaces in empty lines etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4576) Enhancement for tracking Blacklist in AM Launching
[ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203371#comment-15203371 ] Sunil G commented on YARN-4576: --- bq.When YARN detects possible failures, it should blacklist nodes within the app (from Sangjin Lee). If the AM container of an app fails on a node because of node-specific reasons, other containers of the app could fail for the same reason. But we shouldn't spread it to other apps, because different apps have different settings. We can do this unless we're confident enough that the two apps are very similar in configs. >> There may be one more scenario here: when an application's 2nd or further >> appAttempt's master container is launched. If this new container launch lands on the >> same faulty node (where the first attempt failed), the application >> can fail. I have seen a few situations like this. The main problem statement >> w.r.t. one scenario may look like this: "The AM container launch failed because >> of some environment issues on this node, so it is better to run this AM on >> another node". To address this case, the container exit code is a better >> candidate, but I agree that it by itself does not cover 100% of cases or is not >> pointing to the exact failure. And I hope that if we could at least handle a scenario >> like the above, it would be good. [~wangda tan], thoughts? > Enhancement for tracking Blacklist in AM Launching > -- > > Key: YARN-4576 > URL: https://issues.apache.org/jira/browse/YARN-4576 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Critical > Attachments: EnhancementAMLaunchingBlacklist.pdf > > > Before YARN-2005, YARN's blacklist mechanism was for the AM to track bad nodes: > if the AM's attempts to launch containers on a specific node failed several > times, the AM would blacklist this node in future resource asks. This mechanism > works fine for normal containers. However, from our observation of the behavior > of several clusters: if launching an AM on a problematic node fails, the RM could > pick the same problematic node to launch the next AM attempts again and again, causing > application failure in case other functional nodes are busy. In the normal > case, the customized health-checker script cannot be so sensitive as to mark a node > unhealthy when one or two container launches fail. > After YARN-2005, we can have a BlacklistManager in each RMApp, so nodes > where AM attempt launches failed for a specific application before will get > blacklisted. To get rid of the potential risk of all nodes being blacklisted > by the BlacklistManager, a disable-failure-threshold is involved to stop adding > more nodes into the blacklist once a certain ratio is hit. > There are already some enhancements to this AM blacklist mechanism: > YARN-4284 addresses the wider case of AM container launch failure, and YARN-4389 tries to make configuration settings available for > change by the app to meet app-specific requirements. However, there are still > several gaps to address more scenarios: > 1. We may need a global blacklist instead of each app maintaining a separate > one. The reason is: an AM has a higher chance of failing if other AMs failed > there before. A quick example: in a busy cluster, all nodes are busy except two > problematic nodes, node a and node b; app1 was already submitted and failed in > two AM attempts on a and b. app2 and other apps should wait for other busy > nodes rather than waste attempts on these two problematic nodes. > 2. If AM container failure is recognized as a global event instead of an app's own > issue, we should consider the blacklist not a permanent thing but one with a > specific time window. > 3. We could have user-defined blacklist policies to address more possible > cases and scenarios, so it is reasonable to make the blacklist policy pluggable. > 4. For some test scenarios, we could have a whitelist mechanism for AM launching. > 5. Some minor issues: it sounds like NM reconnect won't refresh the blacklist so > far. > Will try to address all issues here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4839) ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt
Jason Lowe created YARN-4839: Summary: ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt Key: YARN-4839 URL: https://issues.apache.org/jira/browse/YARN-4839 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.8.0 Reporter: Jason Lowe Priority: Blocker Hit a deadlock in the ResourceManager as one thread was holding the SchedulerApplicationAttempt lock and trying to call RMAppAttemptImpl.getMasterContainer while another thread had the RMAppAttemptImpl lock and was trying to call SchedulerApplicationAttempt.getResourceUsageReport. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
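The shape of the reported deadlock, reduced to its essentials (an illustration only; the real methods are synchronized or guarded by the attempt's read/write lock rather than written this way):
{code}
// Thread A: scheduler path - takes the SchedulerApplicationAttempt
// monitor first, then needs the RMAppAttemptImpl lock.
synchronized (schedulerAppAttempt) {                 // lock 1
  rmAppAttempt.getMasterContainer();                 // blocks on lock 2
}

// Thread B: app-attempt path - takes the RMAppAttemptImpl lock first,
// then needs the SchedulerApplicationAttempt monitor.
appAttemptLock.lock();                               // lock 2
try {
  schedulerAppAttempt.getResourceUsageReport();      // blocks on lock 1
} finally {
  appAttemptLock.unlock();
}
// Each thread holds the lock the other needs: classic lock-order
// inversion. Typical fixes are a single global acquisition order, or
// copying the needed state without holding the first lock.
{code}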
[jira] [Commented] (YARN-4636) Make blacklist tracking policy pluggable for more extensions.
[ https://issues.apache.org/jira/browse/YARN-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203367#comment-15203367 ] Sunil G commented on YARN-4636: --- As YARN improves its blacklist/whitelist node functionality, one of the major use cases from our end is to keep the second/further AM container launch attempts away from the same failed node (when the first attempt failed there due to external environment/memory issues). This can really help us. With YARN-2005, we have a mechanism in hand, and there were concerns about its strict behavior. The proposal made in YARN-4837 helps straighten things out for the immediate 2.8 release. I think YARN-4576 was trying to improve on the current YARN-2005 and to generalize it. Going forward, if we are planning for global blacklisting based on various types of container exit codes, then a pluggable policy can be helpful, assuming that we may have different types of apps; a possible shape is sketched below. For this scenario, we do not have use cases from our end. I checked with [~rohithsharma] and [~Naganarasimha Garla] also on this. It will be good if we can discuss/retrospect more on *global blacklisting* and its advantages/limitations based on the information currently available from container exit codes. > Make blacklist tracking policy pluggable for more extensions. > - > > Key: YARN-4636 > URL: https://issues.apache.org/jira/browse/YARN-4636 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Junping Du >Assignee: Sunil G > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
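If the pluggable route is taken, the plug-in surface could be as small as the sketch below. This is purely illustrative; none of these names exist in YARN today, and the whole interface is an assumption for discussion.
{code}
// Hypothetical policy interface for blacklist tracking; every name here
// is an assumption for discussion, not an existing YARN API.
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

public interface AMBlacklistPolicy {
  /** Record an AM container failure on a node, with its exit status. */
  void onAMContainerFailure(NodeId node, int exitStatus, long timestamp);

  /** Nodes the RM should currently avoid when placing AM containers. */
  Set<NodeId> getNodesToAvoid();

  /** Hook for time-windowed policies to expire stale entries. */
  void expireEntries(long now);
}
{code}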
[jira] [Updated] (YARN-4809) De-duplicate container completion across schedulers
[ https://issues.apache.org/jira/browse/YARN-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4809: -- Attachment: 0001-YARN-4809.patch Sharing an initial version of the patch. Kindly help to review it. > De-duplicate container completion across schedulers > --- > > Key: YARN-4809 > URL: https://issues.apache.org/jira/browse/YARN-4809 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Karthik Kambatla >Assignee: Sunil G > Attachments: 0001-YARN-4809.patch > > > CapacityScheduler and FairScheduler implement containerCompleted the exact > same way. Duplication across the schedulers can be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4609) RM Nodes list page takes too much time to load
[ https://issues.apache.org/jira/browse/YARN-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4609: --- Attachment: 0002-YARN-4609.patch > RM Nodes list page takes too much time to load > -- > > Key: YARN-4609 > URL: https://issues.apache.org/jira/browse/YARN-4609 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4609.patch, 0002-YARN-4609.patch, 7k > Nodes.png, sls-jobs.json, sls-nodes.json > > > Configure SLS with 1 NM Nodes > Check the time taken to load Nodes page > For loading 10 k Nodes it takes *30 sec* > /cluster/nodes > Chrome :Version 47.0.2526.106 m -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4812) TestFairScheduler#testContinuousScheduling fails intermittently
[ https://issues.apache.org/jira/browse/YARN-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198530#comment-15198530 ] Robert Kanter commented on YARN-4812: - +1 > TestFairScheduler#testContinuousScheduling fails intermittently > --- > > Key: YARN-4812 > URL: https://issues.apache.org/jira/browse/YARN-4812 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4812-1.patch > > > This test has failed in the past, and there seem to be more issues. > {noformat} > java.lang.AssertionError: expected:<2> but was:<1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3816) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198436#comment-15198436 ] Eric Payne commented on YARN-4108: -- Thanks, [~leftnoteasy]. The patch looks good to me. > CapacityScheduler: Improve preemption to preempt only those containers that > would satisfy the incoming request > -- > > Key: YARN-4108 > URL: https://issues.apache.org/jira/browse/YARN-4108 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4108-design-doc-V3.pdf, > YARN-4108-design-doc-v1.pdf, YARN-4108-design-doc-v2.pdf, YARN-4108.1.patch, > YARN-4108.10.patch, YARN-4108.11.patch, YARN-4108.2.patch, YARN-4108.3.patch, > YARN-4108.4.patch, YARN-4108.5.patch, YARN-4108.6.patch, YARN-4108.7.patch, > YARN-4108.8.patch, YARN-4108.9.patch, YARN-4108.poc.1.patch, > YARN-4108.poc.2-WIP.patch, YARN-4108.poc.3-WIP.patch, > YARN-4108.poc.4-WIP.patch > > > This is sibling JIRA for YARN-2154. We should make sure container preemption > is more effective. > *Requirements:*: > 1) Can handle case of user-limit preemption > 2) Can handle case of resource placement requirements, such as: hard-locality > (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I > don't want to use rack1 and host\[1-3\]) > 3) Can handle preemption within a queue: cross user preemption (YARN-2113), > cross applicaiton preemption (such as priority-based (YARN-1963) / > fairness-based (YARN-3319)). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4062) Add the flush and compaction functionality via coprocessors and scanners for flow run table
[ https://issues.apache.org/jira/browse/YARN-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vrushali C updated YARN-4062: - Attachment: YARN-4062-YARN-2928.09.patch Attaching v9. Thanks [~sjlee0] for the debugging that helped fix the issue. > Add the flush and compaction functionality via coprocessors and scanners for > flow run table > --- > > Key: YARN-4062 > URL: https://issues.apache.org/jira/browse/YARN-4062 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Vrushali C > Labels: yarn-2928-1st-milestone > Attachments: YARN-4062-YARN-2928.04.patch, > YARN-4062-YARN-2928.05.patch, YARN-4062-YARN-2928.06.patch, > YARN-4062-YARN-2928.07.patch, YARN-4062-YARN-2928.08.patch, > YARN-4062-YARN-2928.09.patch, YARN-4062-YARN-2928.1.patch, > YARN-4062-feature-YARN-2928.01.patch, YARN-4062-feature-YARN-2928.02.patch, > YARN-4062-feature-YARN-2928.03.patch > > > As part of YARN-3901, a coprocessor and scanner are being added for storing into > the flow_run table. It also needs flush & compaction processing in the > coprocessor and perhaps a new scanner to deal with the data during flushing > and compaction stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4831) Recovered containers will be killed after NM stateful restart
[ https://issues.apache.org/jira/browse/YARN-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197895#comment-15197895 ] Siqi Li commented on YARN-4831: --- When the NM does a stateful restart, the ContainerManagerImpl will try to recover applications and containers, and then send out an ApplicationFinishEvent to apps that are in appsState.getFinishedApplications(). The ApplicationFinishEvent can cause newly recovered containers to transition from NEW to DONE via KillOnNewTransition. We could add an additional check in KillOnNewTransition to avoid killing completed containers. > Recovered containers will be killed after NM stateful restart > -- > > Key: YARN-4831 > URL: https://issues.apache.org/jira/browse/YARN-4831 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li > > {code} > 2016-03-04 19:43:48,130 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1456335621285_0040_01_66 transitioned from NEW to > DONE > 2016-03-04 19:43:48,130 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=henkins-service >OPERATION=Container Finished - Killed TARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1456335621285_0040 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
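The additional check could look roughly like the sketch below, written in the style of ContainerImpl's transitions. It is not a patch: wasRecoveredAsCompleted() is a hypothetical accessor for state read back from the NM recovery store, and the metrics calls stand in for the existing bookkeeping.
{code}
// Sketch: skip the kill path for a container that the recovery store
// already marked as finished before the restart.
static class KillOnNewTransition implements
    SingleArcTransition<ContainerImpl, ContainerEvent> {
  @Override
  public void transition(ContainerImpl container, ContainerEvent event) {
    if (container.wasRecoveredAsCompleted()) {   // hypothetical check
      // Already finished pre-restart; record a completion, not a kill.
      container.metrics.completedContainer();
      return;
    }
    // ... existing NEW -> DONE kill handling (diagnostics, cleanup) ...
    container.metrics.killedContainer();
  }
}
{code}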
[jira] [Updated] (YARN-4766) NM should not aggregate logs older than the retention policy
[ https://issues.apache.org/jira/browse/YARN-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-4766: - Attachment: yarn4766.003.patch > NM should not aggregate logs older than the retention policy > > > Key: YARN-4766 > URL: https://issues.apache.org/jira/browse/YARN-4766 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation, nodemanager >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: yarn4766.001.patch, yarn4766.002.patch, > yarn4766.003.patch > > > When log aggregation fails on the NM, the information for the attempt is > kept in the recovery DB. Log aggregation can fail for multiple reasons, which > are often related to HDFS space or permissions. > On restart, the recovery DB is read, and if an application attempt needs its > logs aggregated, the files are scheduled for aggregation without any checks. > The log files could be older than the retention limit, in which case we should > not aggregate them but immediately mark them for deletion from the local file > system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
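The check described above could be as simple as comparing each file's modification time against the retention window before scheduling it. A sketch, not the attached patch: scheduleForAggregation() is a hypothetical stand-in for the existing scheduling path, and the surrounding variables are illustrative.
{code}
// Sketch: on recovery, skip aggregation for logs older than the
// retention limit and hand them to the DeletionService instead.
long retainSeconds = conf.getLong(
    YarnConfiguration.LOG_AGGREGATION_RETAIN_SECONDS, -1);
FileStatus status = localFs.getFileStatus(containerLogPath);
boolean tooOld = retainSeconds > 0
    && status.getModificationTime()
        < System.currentTimeMillis() - retainSeconds * 1000L;
if (tooOld) {
  // Retention would delete the aggregated copy anyway; skip the HDFS
  // write and remove the local files directly.
  deletionService.delete(user, containerLogPath);
} else {
  scheduleForAggregation(containerLogPath);  // hypothetical helper
}
{code}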
[jira] [Commented] (YARN-4766) NM should not aggregate logs older than the retention policy
[ https://issues.apache.org/jira/browse/YARN-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200863#comment-15200863 ] Hadoop QA commented on YARN-4766: -
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s {color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 36s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 50s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 57s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 58s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 49s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 1m 49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 2s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 1 new + 123 unchanged - 2 fixed = 124 total (was 125) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 53s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 3s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_74. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 9s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 35s {color} | {color:green} hadoop-yarn-server-nodemanag
[jira] [Updated] (YARN-4831) Recovered containers will be killed after NM stateful restart
[ https://issues.apache.org/jira/browse/YARN-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-4831: -- Attachment: YARN-4831.v1.patch > Recovered containers will be killed after NM stateful restart > -- > > Key: YARN-4831 > URL: https://issues.apache.org/jira/browse/YARN-4831 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li > Attachments: YARN-4831.v1.patch > > > {code} > 2016-03-04 19:43:48,130 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1456335621285_0040_01_66 transitioned from NEW to > DONE > 2016-03-04 19:43:48,130 INFO > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=henkins-service >OPERATION=Container Finished - Killed TARGET=ContainerImpl > RESULT=SUCCESS APPID=application_1456335621285_0040 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4609) RM Nodes list page takes too much time to load
[ https://issues.apache.org/jira/browse/YARN-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203247#comment-15203247 ] Hadoop QA commented on YARN-4609: -
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 56s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 72m 23s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 7s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 163m 30s {color} | {color:black} {color} |
\\ \\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.webapp.TestNodesPage |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.Test
[jira] [Assigned] (YARN-3773) hadoop-yarn-server-nodemanager's use of Linux /sbin/tc is non-portable
[ https://issues.apache.org/jira/browse/YARN-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Burlison reassigned YARN-3773: --- Assignee: Alan Burlison > hadoop-yarn-server-nodemanager's use of Linux /sbin/tc is non-portable > -- > > Key: YARN-3773 > URL: https://issues.apache.org/jira/browse/YARN-3773 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: BSD OSX Solaris Windows Linux >Reporter: Alan Burlison >Assignee: Alan Burlison > > hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c > makes use of the Linux-only executable /sbin/tc > (http://lartc.org/manpages/tc.txt), but there is no corresponding > functionality for non-Linux platforms. The code in question also seems to try > to execute tc even on platforms where it will never exist. > Other platforms provide similar functionality; e.g. Solaris has an extensive > range of network management features > (http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-095-s11-app-traffic-525038.html). > Work is needed to abstract the network management features of YARN so that > the same facilities for network management can be provided on all platforms > that provide the requisite functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing
[ https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201895#comment-15201895 ] Vinod Kumar Vavilapalli commented on YARN-4837: ---
[~sunilg] and [~sjlee0], appreciate your feedback.
- Yes, AMs going to 'bad' nodes again and again and failing is a real problem. There are multiple reasons why this happens.
-- It is true that we cannot enumerate *all* the reasons.
-- It is also true that there are some reasons we *can* already deal with explicitly.
- The primary reason for this JIRA is that I don't believe users need explicit control *today* over how AM containers are scheduled after faults (i.e. [~sunilg]'s agreement above - "agreeing to your point and its early for user to take blacklisting decisions w/o having much needed/useful information").
- Like I also mentioned, it is misnamed too. So let me just call it _AM-container-scheduling_ for the time being.
h4. Modified proposal
So how about we
- Completely keep _AM-container-scheduling_ inside the ResourceManager and don't expose any user APIs to skip nodes
- Explicitly treat known exit-codes:
|DISKS_FAILED|node is already unhealthy, no need for any skipping of nodes|
|PREEMPTED, KILLED_BY_RESOURCEMANAGER, KILLED_AFTER_APP_COMPLETION|not the app's or the system's fault, it's by design, no need for skipping nodes|
|KILLED_EXCEEDED_VMEM, KILLED_EXCEEDED_PMEM|no point in skipping the node as it's not the system's fault|
|KILLED_BY_APPMASTER|cannot happen for the AM container|
|All other non-zero codes|need some action|
- And book-keep all other failure cases and do soft-skipping *only* on the server side. By this I refer to something similar to the node->rack locality progression - avoid this node for a few scheduling opportunities and then come back to it after waiting out enough time. This way no node gets locked out, nor does any app get stuck.
If we just do this, we will take care of our most important problem - apps getting affected due to AMs going repeatedly to the same places. And we also (a) won't force our users to make these decisions without really understanding how, and (b) won't introduce the bad problems of 'blacklisting' that exist today - e.g. YARN-4685.
h4. 2.8.0
Even if we don't yet reach consensus on the above or a similar proposal, I feel strongly that we should remove these user-facing configs / APIs from 2.8.0. Thoughts?
/cc [~vvasudev], [~jianhe], [~wangda], who may not be looking at this.
> User facing aspects of 'AM blacklisting' feature need fixing > > > Key: YARN-4837 > URL: https://issues.apache.org/jira/browse/YARN-4837 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > > Was reviewing the user-facing aspects that we are releasing as part of 2.8.0. > Looking at the 'AM blacklisting feature', I see several things to be fixed > before we release it in 2.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
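For reference, the exit-code table in the proposal above maps directly onto the existing ContainerExitStatus constants. A sketch follows; shouldAvoidNodeForAM() is a hypothetical helper for discussion, not an existing RM method.
{code}
// Sketch of the proposed server-side treatment of AM container exit
// codes. All constants below exist in
// org.apache.hadoop.yarn.api.records.ContainerExitStatus.
static boolean shouldAvoidNodeForAM(int exitStatus) {
  switch (exitStatus) {
    case ContainerExitStatus.SUCCESS:
    case ContainerExitStatus.DISKS_FAILED:            // node already unhealthy
    case ContainerExitStatus.PREEMPTED:               // by design
    case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
    case ContainerExitStatus.KILLED_AFTER_APP_COMPLETION:
    case ContainerExitStatus.KILLED_EXCEEDED_VMEM:    // app's fault
    case ContainerExitStatus.KILLED_EXCEEDED_PMEM:    // app's fault
    case ContainerExitStatus.KILLED_BY_APPMASTER:     // can't happen for an AM
      return false;
    default:
      // All other non-zero codes: soft-skip the node for a few
      // scheduling opportunities, then retry it.
      return true;
  }
}
{code}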
[jira] [Assigned] (YARN-4838) TestLogAggregationService. testLocalFileDeletionOnDiskFull failed
[ https://issues.apache.org/jira/browse/YARN-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen reassigned YARN-4838: Assignee: Haibo Chen > TestLogAggregationService. testLocalFileDeletionOnDiskFull failed > - > > Key: YARN-4838 > URL: https://issues.apache.org/jira/browse/YARN-4838 > Project: Hadoop YARN > Issue Type: Test > Components: log-aggregation >Reporter: Haibo Chen >Assignee: Haibo Chen > > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService > testLocalFileDeletionOnDiskFull failed > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertFalse(Assert.java:64) > at org.junit.Assert.assertFalse(Assert.java:74) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.verifyLocalFileDeletion(TestLogAggregationService.java:232) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService.testLocalFileDeletionOnDiskFull(TestLogAggregationService.java:288) > The failure is caused by a timing issue in the DeletionService, which > runs its own thread pool to delete files. When the verifyLocalFileDeletion() > method checks file existence, it is possible that the FileDeletionTask has not yet > been executed by the thread pool in the DeletionService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
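One way to make such a check race-free is to wait for the DeletionService to actually receive the task instead of asserting on the filesystem at a fixed instant. A sketch only; it assumes the test wires in a Mockito spy of the DeletionService, which is an assumption about the test setup, not the current code.
{code}
// Sketch: verify with a timeout so the assertion waits for the
// asynchronous FileDeletionTask rather than racing against it.
import static org.mockito.Mockito.*;
import org.apache.hadoop.fs.Path;

verify(spyDeletionService, timeout(5000).atLeastOnce())
    .delete(eq(user), any(Path.class), Mockito.<Path>anyVararg());
{code}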
[jira] [Commented] (YARN-4833) For Queue AccessControlException client retries multiple times on both RM
[ https://issues.apache.org/jira/browse/YARN-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203211#comment-15203211 ] Bibin A Chundatt commented on YARN-4833: [~sunilg] Thank you for looking into the issue. Actually, by point 2 I was thinking to throw RPC.getRemoteException, which is a YarnException. Will try the same and upload a patch soon. > For Queue AccessControlException client retries multiple times on both RM > - > > Key: YARN-4833 > URL: https://issues.apache.org/jira/browse/YARN-4833 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > > Submit an application to a queue where ACLs are enabled and the submitting user does not > have access. The client retries till failMaxattempt, 10 times. > {noformat} > 16/03/18 10:01:06 INFO retry.RetryInvocationHandler: Exception while invoking > submitApplication of class ApplicationClientProtocolPBClientImpl over rm1. > Trying to fail over immediately. > org.apache.hadoop.security.AccessControlException: User hdfs does not have > permission to submit application_1458273884145_0001 to queue default > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:380) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:291) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:618) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:252) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:483) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2360) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2356) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2356) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:272) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:257) > at >
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at com.sun.proxy.$Proxy23.submitApplication(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:261) > at > org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:295) > at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:244) > at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341) > at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338) > at org.apache.hadoop.mapreduce.Job.waitForC
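For point 2 above, the server-side change being considered would look roughly like this. A sketch only; checkQueueAccess() is a hypothetical stand-in for the existing ACL check in RMAppManager, and whether the client's retry policy treats the wrapped exception as terminal is the thing to verify.
{code}
// Sketch: surface the ACL failure as a YarnException instead of an
// AccessControlException (an IOException), so the client-side retry
// proxy treats it as a terminal application error rather than a
// fail-over-able fault.
try {
  checkQueueAccess(user, submissionContext);   // hypothetical helper
} catch (AccessControlException ace) {
  throw RPCUtil.getRemoteException(ace);       // wraps as YarnException
}
{code}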
[jira] [Assigned] (YARN-4746) yarn web services should convert parse failures of appId to 400
[ https://issues.apache.org/jira/browse/YARN-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt reassigned YARN-4746: -- Assignee: Bibin A Chundatt > yarn web services should convert parse failures of appId to 400 > --- > > Key: YARN-4746 > URL: https://issues.apache.org/jira/browse/YARN-4746 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-4746.patch, 0002-YARN-4746.patch, > 0003-YARN-4746.patch, 0003-YARN-4746.patch, 0004-YARN-4746.patch > > > I'm seeing, somewhere in my WS API tests, an error with exception > conversion of a bad app ID sent in as an argument to a GET. I know it's in > ATS, but a scan of the core RM web services implies the same problem. > {{WebServices.parseApplicationId()}} uses {{ConverterUtils.toApplicationId}} > to convert an argument; this throws IllegalArgumentException, which is then > handled somewhere by jetty as a 500 error. > In fact, it's a bad argument, which should be handled by returning a 400. > This can be done by catching the raised exception and explicitly converting it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
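A sketch of the fix described above, reusing the webapp's existing BadRequestException (which Jersey maps to a 400); the surrounding method shape is illustrative, not the attached patch.
{code}
// Sketch: convert the IllegalArgumentException from parsing into a 400
// instead of letting it bubble up as a 500.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.webapp.BadRequestException;

static ApplicationId parseApplicationId(String appId) {
  if (appId == null || appId.isEmpty()) {
    throw new BadRequestException("appId, " + appId + ", is empty or null");
  }
  try {
    return ConverterUtils.toApplicationId(appId);
  } catch (IllegalArgumentException e) {
    // Bad argument from the caller -> 400, not a server error.
    throw new BadRequestException("Invalid application id: " + appId);
  }
}
{code}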
[jira] [Updated] (YARN-4609) RM Nodes list page takes too much time to load
[ https://issues.apache.org/jira/browse/YARN-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4609: --- Attachment: 7k Nodes.png [~rohithsharma]/[~devaraj.k] Please check the current implementation with 7K nodes; the time taken is ~1-2 secs. > RM Nodes list page takes too much time to load > -- > > Key: YARN-4609 > URL: https://issues.apache.org/jira/browse/YARN-4609 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4609.patch, 7k Nodes.png, sls-jobs.json, > sls-nodes.json > > > Configure SLS with 1 NM Nodes > Check the time taken to load Nodes page > For loading 10 k Nodes it takes *30 sec* > /cluster/nodes > Chrome :Version 47.0.2526.106 m -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4609) RM Nodes list page takes too much time to load
[ https://issues.apache.org/jira/browse/YARN-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4609: --- Attachment: sls-nodes.json sls-jobs.json > RM Nodes list page takes too much time to load > -- > > Key: YARN-4609 > URL: https://issues.apache.org/jira/browse/YARN-4609 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4609.patch, sls-jobs.json, sls-nodes.json > > > Configure SLS with 1 NM Nodes > Check the time taken to load Nodes page > For loading 10 k Nodes it takes *30 sec* > /cluster/nodes > Chrome :Version 47.0.2526.106 m -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4609) RM Nodes list page takes too much time to load
[ https://issues.apache.org/jira/browse/YARN-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4609: --- Attachment: 0001-YARN-4609.patch Attaching the patch. Please review. > RM Nodes list page takes too much time to load > -- > > Key: YARN-4609 > URL: https://issues.apache.org/jira/browse/YARN-4609 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt > Attachments: 0001-YARN-4609.patch > > > Configure SLS with 1 NM Nodes > Check the time taken to load Nodes page > For loading 10 k Nodes it takes *30 sec* > /cluster/nodes > Chrome :Version 47.0.2526.106 m -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4736) Issues with HBaseTimelineWriterImpl
[ https://issues.apache.org/jira/browse/YARN-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15198175#comment-15198175 ] Sangjin Lee commented on YARN-4736: --- Yes, sorry, I meant HBASE-15436. I think we're more OK with the situation where the entire HBase cluster is down or the master is down. That's a critical situation, and all bets are off at that point. My concern is more if one region server went down or is in a state where it times out writes and your {{BufferedMutatorImpl}} needs to flush to it. If that flush operation times out after 30+ minutes, that would be a significant problem. [~anoop.hbase], would things take 30+ minutes to time out if a region server (rather than the cluster itself) is down or misbehaving? Thoughts? > Issues with HBaseTimelineWriterImpl > --- > > Key: YARN-4736 > URL: https://issues.apache.org/jira/browse/YARN-4736 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Naganarasimha G R >Assignee: Vrushali C >Priority: Critical > Labels: yarn-2928-1st-milestone > Attachments: NM_Hang_hbase1.0.3.tar.gz, hbaseException.log, > threaddump.log > > > Faced some issues while running ATSv2 in a single-node Hadoop cluster, with HBase > launched on the same node with embedded ZooKeeper. > # Due to some NPE issues I could see the NM trying to shut down, but the NM daemon > process did not complete shutdown due to the locks. > # Got some exceptions related to HBase after the application finished execution > successfully. > Will attach logs and the trace for the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
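For reference, the client-side knobs that bound how long such a flush can block are standard HBase client settings. The sketch below is illustrative only: the values and the table name are assumptions, not a recommendation from this thread.
{code}
// Sketch: bound the HBase client so a dead/misbehaving region server
// fails writes quickly instead of blocking the NM for 30+ minutes.
// Config key names are standard HBase client settings; the values and
// the table name are example assumptions.
Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.setInt("hbase.client.retries.number", 3);            // default is much higher
hbaseConf.setInt("hbase.rpc.timeout", 30 * 1000);              // per-RPC, ms
hbaseConf.setInt("hbase.client.operation.timeout", 60 * 1000); // per-operation, ms

Connection conn = ConnectionFactory.createConnection(hbaseConf);
BufferedMutator mutator =
    conn.getBufferedMutator(TableName.valueOf("timelineservice.flowrun"));
{code}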
[jira] [Commented] (YARN-3933) Race condition when calling AbstractYarnScheduler.completedContainer.
[ https://issues.apache.org/jira/browse/YARN-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203136#comment-15203136 ] Hadoop QA commented on YARN-3933: -
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 24s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.8.0_74 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_74. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 2s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 155m 39s {color} | {color:black} {color} |
\\ \\
|| Reason || Tests ||
| JDK v1.8.0_74 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:fbe3e86 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12794386/YARN-3933.003.patch |
| JIRA Issue | YARN-3933 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs